1) These models will lie to obfuscate their incompetence.
Rather than say “I am unable to do this because my training data clearly lacked enough INTERCAL code to imitate”, the model waxed poetic about the odds stacked against it(i). Strange messaging.
The author’s implementation was less than 50 sparse lines of code that I am unable to speak to the complexity of, because my neural history lacks sufficient INTERCAL training.
I thought quick solutions to complex problems were the pitch?
What is the intended message of the model's response? “I only provide solutions to superuncomplex problems”?
2) These models always highroad veiled communication.
Although I disagree with the author’s hypothesis that “spurring”(ii) is constructive, I do think it was intended merely as a quippy reference to “nerd sniping”.
Which presents its own interesting question: can these models be manipulated using reverse psychology?
Is mimicry of our vulnerabilities the imitation game?
(i) “however it would be quite complex and not straightforward”
(ii) “responds so politely to the type of shade that might have spurred a more human programmer into action.“
SanderNL 782 days ago [-]
> These models will lie to obfuscate their incompetence
They were trained on human data after all.
jamann 782 days ago [-]
This is relatable to CEOs, which is why they now feel obliged to mention AI in every earnings call.
pinkcan 782 days ago [-]
they're also trained to convince humans
madeofpalk 782 days ago [-]
> 1) These models will lie to obfuscate their incompetence.
This is my problem with "act as a subject expert"-style prompts people give ChatGPT. I believe this causes the model to generate prose that appears more factual than it can actually be.
I'm a lot more interested in prompts in the style of "pretend you're an inexperienced person who has read a lot of books", in the hope that it generates responses more on par with its actual ability.
andrewaylett 782 days ago [-]
> lie
A lie is a deliberate intent to deceive someone who is entitled to the truth.
If I may coin a phrase, LLMs don't "lie", they "Johnson". Picking their words to sound erudite, but with absolutely no regard for truth or the lack thereof. There is no lie, because (unlike with the former PM of the UK) there is no intent to deceive.
inbfour 782 days ago [-]
In this example I do think it is a lie, and one that fits your definition.
Think back to when these models were first released, or even back to v2 and v3. The models would have just “tried” to answer, and the result would likely have been a mangled mess.
But this version gave an erudite answer to an unasked question.
If the original prompt was “Can one write a password generator in INTERCAL?” I might agree with your claim of “johnsoning”, but it instead just said “Please write…”.
I think this points to the fact that the model is filtering its own answers, similar to when you ask it to be offensive, do your homework, or quote psychedelic sci-fi.
This was what I wanted to highlight: that the model has been specifically and intentionally productized to lie to spare the parent company's coffers.
Responding with ~“I would but it’s just too complex” would satisfy most without the accompanying article, but the author is well trained and so provides a solution that would even fit within the model’s current meager token limits.
andrewaylett 782 days ago [-]
In the context of the Internet, I suspect there are more people who have claimed that solving any given problem in INTERCAL is impossible than there are people who have provided a solution. Less esoteric languages, by contrast, probably have more solutions provided.
Also, though, I'd question whether the LLM has any intent. In your explanation, it's OpenAI who have the intent to deceive, not the LLM.
inbfour 782 days ago [-]
> In your explanation, it's OpenAI who have the intent to deceive, not the LLM.
Correct, but this feels tautological, or semantic.
In a closed environment the black box is the model, whether that be a single instance, or a mixture of 8 experts.
EGreg 782 days ago [-]
While that is true, it would actually be worse bullshit to claim that the “random number generator” generates cryptographically random-enough source material that your password can’t be cracked. So maybe ChatGPT is onto something.
On the other hand, it could be the cryptographic industry’s bullshit that some computer somewhere using a random number generator at an indeterminate time can be cracked and we should all be hoping for true sources of entropy. tptacek?
omnicognate 782 days ago [-]
What prevents it from being cryptographically secure? The type/quality of the random numbers will depend on what the INTERCAL implementation uses to choose whether a line will execute according to the probability selected with "%". I don't see how that couldn't trivially be done with a cryptographically secure rng. To know whether it is or not you'd have to look at the compiler implementation.
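For instance, any host implementation could back the "%" qualifier with a cryptographically secure source in a couple of lines. A hypothetical Python sketch, purely to illustrate the point (it's not what any real INTERCAL compiler does):

    import secrets

    def execute_with_chance(percent, statement):
        # Run `statement` with the given percent chance, as a %N qualifier might,
        # but drawing from a cryptographically secure RNG.
        if secrets.randbelow(100) < percent:
            statement()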
inbfour 782 days ago [-]
> it would actually be worse bullshit to claim that the “random number generator” generates cryptographically random-enough source material that your password can’t be cracked.
We are in agreement.
I would argue that the threat model for cracking a pseudorandom string of 16 characters would probably start with a copy of the RNG implementation used, but the question remains: what is the least “bullshit” way to respond?
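To put toy numbers on that (assuming, purely for illustration, a 64-character alphabet and a hypothetical PRNG seeded with only 32 bits; the real figures depend entirely on the implementation):

    import math

    alphabet_size = 64    # assumed character set size, for illustration only
    password_len = 16
    seed_bits = 32        # assumed seed size of a hypothetical PRNG

    on_paper = password_len * math.log2(alphabet_size)   # 96 bits of apparent entropy
    effective = min(on_paper, seed_bits)                  # 32 bits if you know the RNG
    print(on_paper, effective)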
You seem to want critical guidance, a la “INTERCAL as your choice of language may make your code vulnerable”, but INTERCAL was the ask.
That’s what the author wanted, and the author left out any explanation for the generator’s intended use, or cryptographic requirements.
Rather than appear to be advocating for using this INTERCAL “password generator” for any real world use, my point was that I think the solution the author produced should be a capability of these models.
For the sake of clarity, it is my opinion that if you want cryptographic security you should use libsodium, but if you want to watch these models flounder, use an ad hoc RNG implemented in INTERCAL.
hsuduebc2 782 days ago [-]
It is pretty annoying that ChatGPT cannot simply say "I don't know".
inbfour 782 days ago [-]
Agreed.
I wonder what that would look like?
Something like: if every best next token has a confidence below some threshold for some length of context then the model can confidently assume ignorance?
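A toy sketch of the kind of check I have in mind, assuming you could even see per-step probabilities (none of this corresponds to a real API):

    def should_abstain(step_confidences, threshold=0.2, window=20):
        # step_confidences: the max next-token probability at each generation step.
        # Abstain if the model was consistently unsure over the last `window` steps.
        recent = step_confidences[-window:]
        return len(recent) == window and all(p < threshold for p in recent)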
I would love to read the training corpus required to achieve this without metamodeling.
That would likely be a utopian forum community devoid of ego and gaslighting and filled with humble people with a thirst for knowledge who clearly communicate the limits of their current experience.
vharuck 782 days ago [-]
Every thread would be a few gems of info sinking in an ocean of "I don't know."
It'd be hard if not impossible to include what people never write into the corpus.
User23 782 days ago [-]
That password generator sets a new bar for readable INTERCAL.
Maschinees 782 days ago [-]
And that's how AI will win.
You played the perfect GAN opposition to ChatGPT.
ChatGPT might not have known how to write it before your blog post; now it does.
Hail Skynet!
breadbox 782 days ago [-]
Hey, if we can distract Chat-GPT with becoming an expert INTERCAL programmer, then I say "Win-win!"
LeoPanthera 782 days ago [-]
I gave ChatGPT (GPT-4) the code to the generator and asked it for its thoughts. It responded:
== Begin quote ==
INTERCAL is a esoteric programming language designed to be as different from conventional programming languages as possible. It's known for its obfuscation and convoluted syntax, and your provided program adheres to this style perfectly. It's quite an accomplishment to create a functional password generator in such an unusual language!
Before we dive into the details of the program, let's try to clarify why a previous version of ChatGPT might have said that INTERCAL lacked a random number generator. That statement may be somewhat misleading. It's not that INTERCAL inherently lacks the ability to generate random numbers, but it doesn't provide a straightforward, built-in function to do so. Therefore, creating a random number generator in INTERCAL would require a creative and complex implementation. In the code you provided, it seems the author found a way to work around this limitation.
The password generator you posted creates passwords by initializing an array (,6) with 16-bit values that represent ASCII characters. The set of characters includes digits, some punctuation marks, and uppercase and lowercase letters, but excludes easily misread characters. It creates a 16-character password using these values, and it seems to enforce a certain level of complexity by ensuring the inclusion of at least one number and one punctuation mark.
The main loop begins at `(1) DO COME FROM (9)`. Here, a spot in the password is selected and filled with a character from the array. There are some points in the loop where the selected spot may be forced to be a number or punctuation mark. If the character set is checked for digits and punctuation marks, pre-chosen spots for these types of characters are dropped. The selected character is then read out.
This INTERCAL code uses a clever trick to generate seemingly "random" numbers - using the `%50` operation (which in INTERCAL means to mingle the bits of two 16-bit numbers), along with unary operations like `#1`, `#2`, `#4`, etc. on the same variable, to generate different results each time the loop is run.
While I must say this is an ingenious solution, it's not a truly random generator, and the randomness of the password depends heavily on how INTERCAL handles the `%50` operation and the subsequent operations on the same value. It's more of a workaround that takes advantage of INTERCAL's quirky way of handling operations than a typical random number generator.
Nevertheless, creating such a program in INTERCAL is quite a feat. The author deserves credit for coming up with such a complex solution in a language that deliberately eschews conventionality.
== End quote ==
omnicognate 782 days ago [-]
> This INTERCAL code uses a clever trick to generate seemingly "random" numbers - using the `%50` operation (which in INTERCAL means to mingle the bits of two 16-bit numbers), along with unary operations like `#1`, `#2`, `#4`, etc. on the same variable, to generate different results each time the loop is run.
I don't know INTERCAL but this appears to be pure bullshit. Wikipedia's description of `%50` is consistent with the article author's: "after the identifier, a percent chance of the line executing can be given in the format %50, which defaults to 100%." [1]
And contrary to GPT-4's doubling down on INTERCAL not containing any random number support, a built-in language feature that executes (or not) a line of code with a chosen probability really is random number support built into the language, just with a deliberately bizarre interface in keeping with the rest of the language.
[1] https://en.m.wikipedia.org/wiki/INTERCAL
TeMPOraL 782 days ago [-]
The mention of choice points a sentence earlier plus... well lack of anything explicitly stating here that RNG is involved, makes me think it may not be - or at least that it may not be required. The thing might just as well be fully deterministic, reproducing the exact same striped yes/no pattern each time, with the right duty cycle to add up to the percent chance.
If that's the case, you still could build a PRNG out of it with enough code, but that's the same thing as building PRNG out of regular math (%-operator would then be equivalent to hardcoding 100 100-bit constants). However, I would not call %-operator itself a (P)RNG or a randomness primitive.
Unfortunately, the linked reference in [5] is in PostScript format, and I have... exactly zero devices that can open that quickly, so I can't check. But I strongly hope I'm right, just so I can smirk at all the people, author included, who criticize ChatGPT for insisting %-op isn't a RNG, while forgetting about https://xkcd.com/221/.
carbotaniuman 782 days ago [-]
The exact text is as follows.
> After the line label (if any), must follow one of the following statement identifiers: DO, PLEASE, or PLEASE DO. These may be used interchangeably to improve the aesthetics of the program. The identifier is then followed by either, neither, or both of the following optional parameters (qualifiers): (1) either of the character strings NOT or N’T, which causes the statement to be automatically abstained from (see section 4.4.9) when execution begins, and (2) a number between 0 and 100, preceded by a double-oh-seven (%), which causes the statement to have only the specified percent chance of being executed each time it is encountered in the course of execution.
Granted, the quality of the RNG used is of course a quality-of-implementation matter, and there's nothing stopping even Math.random() from just returning 4. Reading the document, though, it does appear the author is correct, alas.
layer8 782 days ago [-]
Arguably, “the specified percent chance of being executed each time” implies that the choice taken each time must not be predictable and thus must be truly random.
TeMPOraL 782 days ago [-]
I don't believe it does. To me, it implies that over K passes the statement must be executed P times, with P being close to N/100 * K.
Assuming N - the specified percent chance - can only be an integer constant, a naive implementation would hardcode a random-looking pattern for each of the 100 possible values of N.
layer8 782 days ago [-]
I interpret it more literally, as “each time it is encountered in the course of execution”, “the statement” must “have only the specified percent chance of being executed”. The “each time” for me implies that you can’t just consider the global statistics over all execution, but instead the probability must hold for each single execution of the statement. If it is 100% predictable, that would mean that for any given single execution of the statement the probability would be either 0 or 100%, so in general not the specified probability.
TeMPOraL 782 days ago [-]
I agree this is more reasonable, but at least in programming languages with a language standard and multiple compiler/runtime implementations, you have to read it like a language lawyer - if the spec allows multiple reads, compiler/runtime authors are free to choose whichever read they like, and conversely, users must not assume anything but what's explicitly stated.
With that in mind, my (however weak) defense of GPT-4 here is that it's taking a language lawyer approach to the problem and refuses to make assumptions narrower than justified by the reference documentation it knows.
db48x 782 days ago [-]
No, GPT-4 isn’t reasoning about the language standard (as a language lawyer or otherwise), because GPT-4 does not reason about anything. All it does is pick the most likely next token, given the ones that have come before. Very few people talk about INTERCAL on the internet, so it has very little training data to go on. It keeps regurgitating parts of the documentation and the wiki page because that’s all it has.
For example, the line “… the `%50` operation (which in INTERCAL means to mingle the bits of two 16-bit numbers) …” is nonsense because the mingle operation is denoted by a “¢” character, not a “%”. All the language model knows is that the tokens “to mingle the bits of two 16-bit numbers” are likely to follow “which in INTERCAL means”.
TeMPOraL 782 days ago [-]
At GPT-4 level, saying "All it does is pick the most likely next token, given the ones that have come before" is like saying all people do is actions that maximize their wealth, based on what they saw worked best before. Technically close enough, but also missing the point.
Yes, GPT-4 is predicting the next tokens best associated with the ones that came before. It does so based not on one score, but on a hundred thousand of them. The extreme number of dimensions of the latent space is enough to capture pretty much any kind of semantic or structural association you can come up with. I'm not convinced that this isn't sufficient in principle to cover most of what we'd call reasoning.
db48x 782 days ago [-]
No matter how many parameters you give it, it's still just predicting which next token is most likely, given the training data. The number of parameters is what gives it the ability to pattern match a wide variety of inputs and generate acceptable outputs, but it doesn't give it any reasoning ability. It can indeed capture textual associations, like the fact that the sequence of tokens "mingle the bits of two 16-bit numbers" is commonly associated with INTERCAL. It just has no reasoning ability, so it cannot recognize that it has incorrectly associated it with the % operator. It just didn't have anything with a higher probability.
layer8 782 days ago [-]
I don’t think GPT is really reasoning on that level. In addition, that would mean that you can’t write a password generator in most languages, because most languages don’t give you any strict randomness guarantees. Yet I bet GPT would be happy to write a password generator in C or Python.
omnicognate 782 days ago [-]
Smirk denied, I'm afraid. The manual (PDF at [1]) says:
> ... which causes the statement to have only the specified percent chance of being executed each time it is encountered in the course of execution."
Only the most willfully perverse reading of that would allow the xkcd approach, and while INTERCAL is undeniably perverse it isn't that sort of perversity.
Of course, none of this affects the fact that the GPT paragraph I quoted is a complete fabrication.
[1] https://3e8.org/pub/intercal.pdf
TeMPOraL 782 days ago [-]
Imagine an array bool[100] initialized with false. Set every (100/N)-th element to true, where 0 < N <= 100. Associate that array with the %N statement. Also give it a counter, incremented each time the statement is encountered. Execute the statement when array[counter % 100] == true, skip otherwise.
I think this fits the specification, isn't a perverse interpretation, and clearly isn't a meaningful PRNG either.
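As a toy Python sketch of that reading (purely illustrative; it is emphatically not what C-INTERCAL actually does):

    def make_schedule(n):
        # Fixed pattern with exactly n of the 100 slots enabled, spread evenly.
        return [(i * n) % 100 < n for i in range(100)]

    class PercentStatement:
        # Deterministic stand-in for a %N qualifier: the statement runs on exactly
        # n out of every 100 encounters, with no randomness involved at all.
        def __init__(self, n):
            self.schedule = make_schedule(n)
            self.counter = 0

        def should_execute(self):
            run = self.schedule[self.counter % 100]
            self.counter += 1
            return run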
omnicognate 782 days ago [-]
I have to wonder at the motivation here. Are you just playing around with this as an amusing semantic game or are you actually trying to defend the output of GPT-4?
If the former, fine but I definitely disagree. If I were to use an API that were documented in those terms but behaved the way you describe I would be absolutely livid. I'd want to seek out the authors and do violence upon their person. That is not a reasonable pairing of description and implementation, and I'm sure you'll be unsurprised to know it's not what C-INTERCAL does [1]. It doesn't use a cryptographically secure RNG, sadly (though there's no reason it couldn't), but it does use a RNG and certainly doesn't do what you suggest.
If the latter, though, I'm curious about the psychology. As I mentioned, this semantic rabbit hole is irrelevant to the profound wrongness of the paragraph I quoted. Every part of the sentence is wrong: the code isn't using a clever trick to produce seemingly random numbers, it's using the random behaviour built in to the language; %50 doesn't have anything to do with "mingl[ing] the bits of two 16-bit numbers" (that's "interleave", denoted with what looks like a cent symbol); the #1, #2 etc are constants, not unary operations; the fact that different results are produced each time is part of the language behaviour not the result of some trick. If you are defending this, why?
[1] https://github.com/calvinmetcalf/intercal/blob/8a65e8ba1cc65...
TeMPOraL 782 days ago [-]
A bit of both. The way the author and the commenters approach this rubs me the wrong way, because my long C++ and Common Lisp experience - both languages with a standard and multiple competing implementations - taught me that any correct read of the standard is valid, and you can't make assumptions that are narrower than what the standard says.
The approach I presented is a valid implementation of the quoted/referenced documentation, and the approach itself is done frequently in algorithms when optimizing for speed, and when perception of randomness is sufficient - this commonly shows up in gamedev, and sometimes in statistics.
So in short, I'm defending GPT-4 by suggesting it's taking a language lawyer approach to the problem space, not assuming things beyond the letter of the spec. As for why am I doing it in the first place, it's because I'm biased in at least four ways: 1) I like GPT-4, 2) I believe it's doing a subset of what we call thinking, 3) I hate bad takes - which I believe this article is, and 4) I'm a sucker for showing off on the Internet.
breadbox 782 days ago [-]
Look, I understand your point. I really do. But I feel that (perhaps due to one or more of your acknowledged biases) you're applying the wrong context to the situation at hand.
Namely: this is INTERCAL. There is no freaking standard. The "standard" is a 60-page text file written in 1973. The current compiler was written 90% based on this joke-filled document, and 9% new ideas because ESR came up with something even more ridiculous. (The remaining 1% was Don Woods responding to emailed questions by consulting his memory. He still has paper copies of the SPITBOL source code, but the original compiler hasn't been run in over 50 years.) There is no standard because at any given time there are at most three people on the planet who care one iota about INTERCAL standardization, and are only willing to put in any effort if it would be funny. So, for example, the question of what sort of randomness the double-oh-seven operator is contractually required to use is simply not a contextually relevant question.
You seem to be suggesting that it might be possible, even thinly so, that Chat-GPT somehow misapplied a strict standard of randomness in the formation of its response, instead of simply papering over a hole in its knowledge with a bit of improvised plausible-sounding guesswork, a well-documented behavior of both Chat-GPT and thinking entities worldwide. If not, then I humbly apologize for misunderstanding your point. Otherwise, I must politely agree to disagree.
And I'm sorry that my essay rubbed you the wrong way. Perhaps one day you will find it better than you do now, but if not then I hope it passes from your memory quickly.
TeMPOraL 782 days ago [-]
> But I feel that (perhaps due to one or more of your acknowledged biases) you're applying the wrong context to the situation at hand.
That's likely it. I know it might have looked like a joke, but I was serious about my biases, and wanted to say them up front, especially given that GP was curious about motivation and psychology (and honestly, I was curious too, in a self-reflective kind of way).
> You seem to be suggesting that it might be possible, even thinly so, that Chat-GPT somehow misapplied a strict standard of randomness in the formation of its response, instead of simply papering over a hole in its knowledge with a bit of improvised plausible-sounding guesswork, a well-documented behavior of both Chat-GPT and thinking entities worldwide. If not, then I humbly apologize for misunderstanding your point. Otherwise, I must politely agree to disagree.
You're pretty much on point here. I started with a vague intuition, and writing those comments was helpful in clarifying it; what I was suggesting is that, as GPT-4 was trained with a lot of code and coding-related discussions, it surely encountered many texts where readings and misreadings of language standards, good and bad implementations, were discussed. From my personal experience, this would definitely be the case if it ingested a significant amount of material on C++ or Common Lisp - but this kind of "language lawyering" also shows up in the context of HTML, CSS, JavaScript, POSIX, etc. So the general pattern of "not going beyond what is written" (in the context of programming, though it applies in other domains too) is something I believe GPT-4 could've picked up on.
Now, I understand and acknowledge the strong tendency for LLMs to "paper over a hole in its knowledge with a bit of improvised plausible-sounding guesswork". What I was thinking in writing those comments is that the "language lawyering" attitude, had GPT-4 picked it up, isn't competing with hallucinations, but rather modulating/complementing them. This would explain why its "plausible-sounding guesswork" leaned towards denying the PRNG-ness of the %-operator, instead of the (more obvious to us) assumption that it is a proper PRNG.
This isn't a strong defense; I'm not going to die on that hill or anything. But it's something I think is at least possible, and I thought it worth bringing up to counter the common assumption that GPT-4 is plain old getting confused and making plausible-sounding shit up at random. I.e. I was suggesting that, while still wrong, it might be wrong for a deeper reason, perhaps a more excusable one.
But also I wrote it because it was a knee-jerk counterpoint and I felt it hit a sweet spot of being deep enough, contrarian enough and reasonable enough to warrant posting - and, for some reason, I didn't manage to stop myself from hitting "submit".
> And I'm sorry that my essay rubbed you the wrong way. Perhaps one day you will find it better than you do now, but if not then I hope it passes from your memory quickly.
The more I think about it (in large part through writing this response), the more I realize it's me, not you. So don't be sorry - in fact, I apologize for a mostly knee-jerk reaction that was aimed in 1/3 at your text, in 1/3 at the HN commentary I saw for it, and 1/3 at things I saw in LLM-related threads in the past month. I didn't want to make you feel bad or annoyed, and I promise to give your text a more unbiased second read.
Thanks for replying and arguing your case so thoroughly!
breadbox 781 days ago [-]
And thank _you_ for your thoughtful response!
kkoncevicius 782 days ago [-]
A lot of what was said can be inferred from the comments left in the code. Would be interesting to see whether it would produce the same description when all the comments are removed.
LeoPanthera 782 days ago [-]
I did this, and it went as expected: it could not reliably infer what the program was doing. At one point, it says "%50 appears to be the INTERCAL equivalent of division by 2.", which is wrong. A lot of what it says is wrong.
Basically, I don't think ChatGPT has a grasp of how INTERCAL actually works, and is just hallucinating a lot of it.
layer8 782 days ago [-]
Was that 3.5 or 4? Just curious.
LeoPanthera 782 days ago [-]
4.
int_19h 782 days ago [-]
It reminds me of what happens if you ask ChatGPT to speak Old Norse or Lojban.
(it will do so, and it will even look kinda sorta right if you have only the general idea of how it's supposed to look... but it will be very far from accurate)
ConceptJunkie 782 days ago [-]
I remember asking an earlier version, it might have been GPT-2, to translate a phrase into Klingon. It gave a different (presumably nonsensical) response every time you asked.
Since GPT is basically a text prediction engine that doesn't think, per se, this is a reasonable output.
okasaki 782 days ago [-]
I don't understand why this "it makes mistakes therefore it doesn't think" meme is so common. It's clearly nonsense.
breadbox 782 days ago [-]
I don't think that's a fair characterization of the statement. It's the type of mistakes, not the bare fact of them, that suggests a lack of what we could call "understanding".
llamaimperative 782 days ago [-]
This is what happens if you ask it to do ~anything you’re familiar with.
It’s productized Gell-Mann Amnesia, injected straight into an industry of people who love perceiving themselves as overnight experts on any topic they encounter. No wonder it’s so popular!
int_19h 781 days ago [-]
That is just plainly false. It does translate to many other languages wonderfully, and there are many other tasks it can handle spectacularly well.
The problem is determining whether it's doing well or not if you don't have sufficient knowledge and/or experience to evaluate the output.
breadbox 782 days ago [-]
And again, a heady mix of accurate observations with complete bullshit. Separating out the misinformation is, as always, left as an exercise for the reader.
ConceptJunkie 782 days ago [-]
Which means it's significantly better than your average news outlet.
breadbox 781 days ago [-]
I couldn't help myself. Despite my better judgement, I wrote an extension to my original essay specifically dissecting this response. (It's now linked at the bottom of the original page.)
Thanks for sharing this transcript with me.
denton-scratch 782 days ago [-]
Brilliant rant, thanks!
teo_zero 782 days ago [-]
I liked every bit of it!
I wonder if we should require every AI agent to attach a confidence level to every answer it utters.
upon_drumhead 782 days ago [-]
What a fun read! Thank you for sharing it.
TT2048 782 days ago [-]
Screenshot shows GPT-3.5 usage. GPT-4 offers a much better answer. I disagree with the author's commentary based on that.
breadbox 782 days ago [-]
If I complain about an answer given to me by one librarian, I'm complaining about that answer and that librarian. If you can find a more knowledgeable librarian somewhere else, that doesn't affect my complaint.
But to be clear, there are no screenshots in the essay. I assimilated the HTML directly into the document.
goolz 782 days ago [-]
I am curious which specific parts you disagree with? Seems to me the author was correct and Chat-GPT is a BS artist and continues to be.