How does GPT obtain its ability? Tracing emergent abilities of language models (yaofu.notion.site)
414 points by headalgorithm on Dec 14, 2022 | hide | past | favorite | 192 comments


Amazing insight, particularly section 6.

"- The two important but different abilities of GPT-3.5 are *knowledge* and *reasoning*. Generally, it would be ideal if we could *offload the knowledge part to the outside retrieval system and let the language model only focus on reasoning.* This is because: - The model's internal knowledge is always cut off at a certain time. The model always needs up-to-date knowledge to answer up-to-date questions. - Recall we have discussed that the 175B parameters are heavily used for storing knowledge. If we could offload knowledge to be outside the model, then the model parameters might be significantly reduced such that eventually, it can run on a cellphone (call this crazy here, but ChatGPT is already science fiction enough, who knows what the future will be)."

& "Yet there was a WebGPT paper published in Dec 2021. It is likely that this is already tested internally within OpenAI."

It definitely feels like this may be the next step in making this kind of system robust. It ends up being an interface for search.


It's unclear to me how you could separate knowledge and reasoning:

- Reasoning typically requires base knowledge to work from. A side effect of training reasoning is embedding knowledge into the model parameters.

- Even if you offload the search portion (either through outputting special tokens that are postprocessed, or applying the model in multiple steps with postprocessing), you still need embedded knowledge for the model to decide what to search for, and then to successfully integrate that knowledge (in the multi-step case).

Maybe some kind of post-facto pruning of model weights?
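A minimal sketch of the multi-step variant (the `SEARCH[...]` token convention and both stub functions are invented here for illustration, not anything OpenAI has published):

```python
import re

def model(prompt):
    # Stub for an LM trained/prompted to emit a SEARCH[...] token
    # whenever it lacks a fact (the token convention is invented).
    if "SEARCH_RESULT" not in prompt:
        return "SEARCH[Barack Obama age]"
    return "Barack Obama is 61 years old."

def search(query):
    # Stub for the external retrieval system holding the knowledge.
    return "Barack Obama, born August 4, 1961."

def answer(question, max_steps=3):
    prompt = question
    out = model(prompt)
    for _ in range(max_steps):
        m = re.fullmatch(r"SEARCH\[(.+)\]", out)
        if not m:
            return out  # a final answer, not a search request
        # Knowledge flows back in via the prompt; reasoning stays in the model.
        prompt += f"\nSEARCH_RESULT: {search(m.group(1))}"
        out = model(prompt)
    return out

print(answer("How old is Obama?"))
```

Note the point made above still applies: deciding *what* to search for, and integrating the result, both happen inside the model.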


Reasoning is that which knows that it lacks some necessary knowledge, whereas knowledge isn't aware that it lacks some necessary reasoning.


Fascinating - makes me think of the distinction between dreaming and being awake. Only in the latter state can one tell the difference.


Not quite true, see lucid dreaming.


To be fair to the GP comment, I've personally never experienced lucid dreaming, so for me the distinction holds - when I'm dreaming, I never know I'm dreaming, and mostly I don't even remember my life awake, I only remember whatever's in the context of the dream.


That would be a third, distinct state from each. Dreaming while knowing that you are is quite a thrill!


This has to be one of the most insightful sentences I've ever read.


Would you mind expanding on it a bit? I do sincerely appreciate its pithiness, but curious to read it explained a bit further.


Think of it as: reasoning=computation, knowledge=data. Data alone doesn’t say it must be computed. But computation, by definition, is attempting to create data (the result) that doesn’t exist. Thus: knowledge isn’t aware it must be reasoned about, but reasoning knows it’s trying to find (deduce, compute) knowledge it lacks.


I disagree. If you have knowledge, and you don't try to do anything other than compress it to make space, reasoning about that knowledge will come about by sheer unintended consequence once the patterns you're compressing on reach some threshold of sophistication.

By definition, an optimal compression algo is a dimensionality reduction algo. A dimensionality reduction algo lets you do a bunch of machine learning tasks.
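A toy illustration of that last claim, with synthetic data: optimal linear compression of a redundant matrix (truncated SVD) is literally a dimensionality reduction, and the compressed codes are directly reusable for downstream ML tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
# "Knowledge": a redundant 100x50 data matrix that secretly has rank 5.
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50))

# Optimal linear compression = truncated SVD (Eckart-Young).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
codes = U[:, :k] * s[:k]      # the compressed representation
X_rebuilt = codes @ Vt[:k]    # decompression

# The codes ARE a dimensionality reduction of X: 5 numbers per row
# instead of 50, directly usable for clustering, similarity, etc.
error = np.linalg.norm(X - X_rebuilt) / np.linalg.norm(X)
print(f"relative reconstruction error: {error:.1e}")
```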


In the world of large language models, what part of "reasoning" is hard-coded and what part, if any, is learnt?

Is reasoning simply a scan/search of your vector space (i.e. your knowledge) according to some hard-coded algo?


Thanks for that!


Doesn't seem too unreasonable to me. If I asked you "How old is Obama" and you had a data source which had the ages of every person, you wouldn't need to know the answer from memory. Your reasoning tells you that to find out how old someone is you need to check the external resource, and what to do with the info once you get it.


The required knowledge is around entity recognition, with "Obama" referring to the 44th POTUS, and not somebody else who happens to have the same surname (and there are multiple of them actually, at least 4 given his family).


This can clearly be guessed from a search as well. Popularity can be well defined, and in the case of Obama, there is clearly one much more popular than the others.


The model still needs to infer from the sentence the entity to look it up. It is also the case that this is a relatively simple example as 'Obama' refers to a single class of entities and there is not a lot of ambiguity around resolution of class, only resolution of specific entity.

Take this sentence:

> When was KitKat released?

It could refer to the sweet, or the Android OS. Vastly different classes, and the model here needs to "decide" to ask for more information to disambiguate the class; and if the class is the sweet, then it may need to disambiguate the particular flavour, and even ask for the geographic location.
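A toy sketch of that decision (the candidate lists and popularity priors below are invented): pick the dominant reading when one exists, otherwise ask back.

```python
# Invented candidate lists and popularity priors, for illustration only.
CANDIDATES = {
    "obama":  [("Barack Obama (44th POTUS)", 0.95),
               ("Obama, Fukui (Japanese city)", 0.05)],
    "kitkat": [("KitKat (chocolate bar)", 0.55),
               ("Android 4.4 KitKat", 0.45)],
}

def resolve(mention, threshold=0.8):
    ranked = sorted(CANDIDATES[mention], key=lambda c: -c[1])
    if ranked[0][1] >= threshold:
        return ranked[0][0]  # one dominant reading: just pick it
    options = " or ".join(name for name, _ in ranked)
    return f"Which one did you mean: {options}?"  # ambiguous: ask back

print(resolve("obama"))
print(resolve("kitkat"))
```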


Yes, but the amount of knowledge necessary to decide how to make those sorts of decisions is far smaller than the amount of knowledge necessary to answer all such questions.


And that's perfectly fine. Humans have exactly the same problem. They will get this wrong, and you will reply "no, I'm talking about the android version". Language is ambiguous so we cannot expect machines to get it right all the time.


I do agree with you that it is fine, what I was getting at was that there needs to be a way to measure uncertainty in a manner that is robust to unbalanced distributions or context drifting.


Okay I see what you mean, I agree with you on this.


I hope nobody ever releases the famous Kitkat Club in Berlin from its chains. Because there are not so many.

My experience with ChatGPT is that it gets what I mean very well from the context.


Fair, although I’d think it acceptable if the response to your prompt was “which one? I found 4”


It would have way more than 4 people in the search results though. GP said there's at least 4 because there's him, his wife and his kids.

Even knowing Obama is a person is a knowledge-based leap. (To us humans) it's obvious the question means Barack Obama because he's the most notable subject for that name. But how do you prevent your AI from responding that the "Obama JS library is 5 years old"?

https://github.com/rgbkrk/obama


If it's based on frequency of training data, the ex-president will have far more hits in the training corpus.

Now, on less talked about topics it doesn't sound any different than what happens with people

Q "How old is Tim?"

A "Which Tim are you talking about, you didn't give me crap to work with?"


The answer is obviously 62 because that's how old Tim Apple is.


Ha, it might give an age of 7 years or so when the moniker was invented.


Isn't it a sort of cultural knowledge to understand what is meant? I.e. when we say Obama we mean one specific guy who is very important, but if I'm talking about Mrs Watanabe it's a generic Japanese person?


You can already coax ChatGPT into interacting with external systems today; I set up a prompt where the model pretended to be a factory system on a communication bus. It could access its "inventory" by posting a prefixed message to the communication bus.

After a bit of prompt engineering the model could query inventory, "manufacture" various recipes, and store the end products in inventory.

It might be possible to look at the weight activations as it reasons through contacting the external system over the emulated communication bus? For a suitably varied set of commands you might be able to find a subset of weights that are most correlated to the task and prune the others. Then you'd be left with a model that can retrieve and store information, as well as perform reasoning tasks.

Still has problems with working memory (the input token limit, since the model is auto-regressive) given all the external information is coming back in via the prompt, but ChatGPT seems to handle that gracefully right now.
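A rough sketch of that pruning idea on simulated data (the "activations" and task scores are synthetic; surgery on a real model is far harder): score each unit by how strongly its activation correlates with the task, keep the top subset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated activations: rows = probe commands, columns = 100 units.
# By construction, only the first 20 units carry the task signal.
task_score = rng.normal(size=200)
acts = rng.normal(size=(200, 100)) * 0.1
acts[:, :20] += task_score[:, None]

# Score each unit by how strongly it tracks task behaviour...
corr = np.array([abs(np.corrcoef(acts[:, j], task_score)[0, 1])
                 for j in range(acts.shape[1])])

# ...and keep only the most-correlated subset, pruning the rest.
keep = np.argsort(corr)[-20:]
print(f"kept units: {sorted(int(j) for j in keep)}")
```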


> It's unclear to me how you could separate knowledge and reasoning

yeah, me too. There are a few cases I'd point your attention to.

1, preschool toys, kids somehow manage to put the square peg in the square hole. I mean, they may chew on them or push them around, but there's a "moment of magic" when they make it all click together. Maybe there's some implicit knowledge there, I know I played games like that, but I don't remember.

2, sudoku. you don't really need to know anything, just make each row, column and box different. no memorization, just look. but what about the rules? does that count as knowledge?

I've been reading some math books lately, and I think we're not alone. Coping with sets of sets is a hard question that people have been wondering about for, as far as I can tell, a long time.

For now, it's probably safe to say, knowledge about knowledge is different than just knowledge, and having one layer work on k1 and another layer work on k2 is ok. maybe someday add k3...kn. Other fields do that. Worth checking out.

I think, we could both get very fussy about what exactly that _means_. But for now, I'm happy to be charitable in my reading. I'd also expect them to run into some really thorny problems when they try to pin down exactly what's going on, just like everybody else does. For today, good for them. Seems like a nice win.


Different people might think differently, but when solving complex problems I do think I have separate "think about / gather the facts" and "formulate the solution" phases.

I don't think about the totality of facts in the world - I think my brain is mentally extracting the facts that are relevant to the problem and then reason about those facts.

There is certainly back/forth though; I think I go "here is a bit of information, how does that apply? ok but what about this fact? ok here is how that would apply considering something else..." but this is still gather -> solve -> gather -> solve.


Humans (or at least I) have the ability to look up knowledge. What about an architecture where the main model has a background conversation with a knowledge model, just like I need to look up things I remember exist but forget the details of? Heck, such a knowledge model would have great value for people as well.

Using written conversation as an interface between language models feels natural and completely bonkers at the same time.


I don’t know why you would want to separate them completely. You have some knowledge in your head, store some other in notes on your desk and some in a library (the kind with books in it).

You can make the model write to a local wiki when it encounters new information and read from its own wiki when it feels it needs to look something up. You can also make it spend time randomly browsing the knowledge base, reconsolidating, reorganising and labeling it.

The architecture of the wiki doesn't have to be "clever" in any way. It is just an old-fashioned database with a query function the model can write to and query from.
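A minimal sketch of such a non-clever knowledge store (plain SQLite; the `write`/`query` hooks the model would call are invented names):

```python
import sqlite3

# The model would call write() when it meets new information and
# query() when it needs a fact back; both function names are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wiki (topic TEXT PRIMARY KEY, note TEXT)")

def write(topic, note):
    db.execute("INSERT OR REPLACE INTO wiki VALUES (?, ?)", (topic, note))

def query(topic):
    row = db.execute("SELECT note FROM wiki WHERE topic = ?",
                     (topic,)).fetchone()
    return row[0] if row else None

write("raptor engine", "burns a methane/oxygen mix")
print(query("raptor engine"))
```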


Reasoning can be defined in the abstract. Knowledge cannot.

For example, If I need A & B | D & E to get C I can reason that if I have B and want C, I need A or D & E.

Once I have acquired this reasoning skill, I can apply it to any kind of "bool-sequence X required for Y" situation, regardless of what specifically X and Y are, or how many entities X encompasses.

Whereas if I know that a rocket engine requires an oxygen/methane mix to function, I cannot transfer that to the knowledge that I need a raincoat or umbrella in order to avoid getting wet in the rain.
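The abstract rule above can be captured in a few lines; the rule table is invented, but the same `missing` computation works unchanged for any X and Y:

```python
# Rule table is invented; the deduction itself is fully generic.
# "C needs (A and B) or (D and E)", plus the raincoat example.
RULES = {
    "C": [{"A", "B"}, {"D", "E"}],
    "stay dry in rain": [{"raincoat"}, {"umbrella"}],
}

def missing(goal, have):
    # For each way of reaching the goal, what facts are still needed?
    return [clause - have for clause in RULES[goal]]

print(missing("C", {"B"}))             # need A, or else D and E
print(missing("stay dry in rain", set()))
```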


The problem with ChatGPT's "knowledge" is that it isn't trustworthy. It will happily output very confident-sounding nonsense, or blatantly incorrect statements. We need a way to verify how accurate its outputs are.


ChatGPT made this nice COBOL program to create an S3 bucket, a technical impossibility...

    IDENTIFICATION DIVISION.
    PROGRAM-ID. CREATE-S3-BUCKET.

    ENVIRONMENT DIVISION.
    CONFIGURATION SECTION.
    INPUT-OUTPUT SECTION.

    DATA DIVISION.
    FILE SECTION.
    WORKING-STORAGE SECTION.
    01 AWS-ACCESS-KEY PIC X(20).
    01 AWS-SECRET-KEY PIC X(40).
    01 BUCKET-NAME PIC X(255).

    PROCEDURE DIVISION.
    CREATE-BUCKET.
        MOVE AWS-ACCESS-KEY TO AWS-ACCESS-KEY-VAR
        MOVE AWS-SECRET-KEY TO AWS-SECRET-KEY-VAR
        MOVE BUCKET-NAME TO BUCKET-NAME-VAR
        INVOKE AWS-S3 "CREATE-BUCKET"
            USING AWS-ACCESS-KEY-VAR AWS-SECRET-KEY-VAR BUCKET-NAME-VAR


How is that impossible? Plenty of libraries are available for COBOL, especially if you use COBOL.NET


What is the COBOL SDK for AWS?


Probably someone sells one, or you'd just use the AWS SDK for .NET via COBOL.NET.


There are http client libraries for COBOL, and it’s easy to use http to make S3 api calls.


What is the technical impediment to writing one?


"Eww, Cobol"


Just ask ChatGPT to implement it.


That was ChatGPT's first response, yes.


Whose knowledge is trustworthy? We've somehow come to associate certain institutions or scientific authorities with truth when that is about the furthest from real science:

"Have no respect whatsoever for authority; forget who said it and instead look what he starts with, where he ends up, and ask yourself, Is it reasonable?" -Richard P. Feynman

"One of the great commandments of science is, "Mistrust arguments from authority." -Carl Sagan

"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." -Galileo Galilei


I think part of the issue is that it's easier to test the limits of a human's knowledge. And ironically, with your quotes I think you've supplied evidence that trust is crucial, in that the truest expression of those quotes would be to just deliver the payload and not attach any sort of authority-by-association to it.

You can't trust its answers (to be fair, that's the existing status quo), but you also can't easily test it because it will return reasonable-sounding garbage. Conversely, you can discover ignorance in most humans pretty quickly by exhausting their ability to respond (or your ability to ask).


A generative system, be it a neural network or a human, needs a way to test ideas in order to align with reality. If testing is available, then it is possible to advance the state of the art. Ideas are cheap, results matter.


Sure, but that doesn’t seem to square with the topic at hand - “why does an infinite truth and lies machine feel less trustworthy than another human”. It just isn’t a question that needs a high degree of abstraction to respond to.


It is sometimes a lie machine because it lacks grounding in verification. Humans get more grounding than language models but even we are not 100% there - remember the antivax hysteria. The most grounded field is science, but even in scientific papers most things don't replicate. Verification is hard on all levels and requires extensive work. In particle physics all scientists clump together around the CERN accelerator as it is the only source of verification they have (almost, I exaggerate a bit).

It's going to be important to develop AI methods to test and verify, I think unverified model outputs are worthless verbiage. Verification can be based on references, code execution, physical simulations, lab experiments and even language based simulations.

In a few years the situation is going to flip, AI is going to become more reliable than humans. Being tested on millions of cases, it will be more trustworthy than us, no human can be tested to that extent. It's going to be interesting to see how we react to super-valid AI. Our guiding role is going to shrink more and more, we will be the children.


Those quotes are not about trust, they are about the rhetorical technique of appeal to authority.

The actual payload though is the mistrust of authority exactly because we are all so susceptible to the logical fallacy of appeal to authority masquerading bullshit as truth.

There is no problem to solve here. ChatGPT should never be an authority on anything.


I, too, always recreate double blind experiments before I take the drugs my doctor gives me :)

I also double-check the transistors in my computer work correctly before I run any code on them, and of course I re-derive the physics to be able to do that :)

In practice you are an expert in a very small domain (if any) and in all the other domains you have no choice but to accept somebody's authority.


That's good, I don't go quite as far, but do try to consult multiple independent sources.

Doctors have been known to overprescribe things like Benzos, and opioids from time to time.

I also just use tools like a RAM diagnostic that can check large numbers of transistors at once. I imagine you're quite good at QM after all that practice applying the wave equation though. Impressive!


There is some kind of recursion in here with the authors' names and the "have no respect..." part :)


Fair point


Actually, it isn't.

Their arguments are of the form "this statement could be false."

If you evaluate it as a true statement, you have no problems. It could be false; you have to evaluate it for yourself instead of trusting some authority.

It's only if you assert that it's certainly false that you have a problem. Because then it's clearly true -- since otherwise these authorities would be telling you something false, which proves their assertion true.

Put another way, it can get you from the undesirable position of blindly trusting authorities to the desirable position of questioning them, but not the other way around. Which is the intended result.


It’s slightly ironic that the only reason we pay attention to these particular quotes is that they come from famous physicists, i.e. authorities.


One way I tried to do this is by having it write an answer with a footnote reference at each fact [1], then list search terms that can be used to verify each claim. Then I would respond with the URL and quotes from the found pages for each one, and have it rewrite the answer based on that information and cite the sources. I think something in this direction can be automated. I saw someone do this with math and other tasks, where it would talk to a connected program before answering.


Yes, it's been done both in papers and in various GPT-3 projects. As long as you can find relevant references the LM will become reliable.


I did this as well and it looks great initially but there are already examples of GPT generating totally bogus references and sources. So we're back to square 1.


Sounds like an interesting way to reboot Wikipedia.


I just had a run-in with this yesterday. I asked it to explain box embeddings. It's a pretty niche topic so I didn't expect it to give the right answer. But the answer it gave sounded so confident, and it was so wrong. It took a normal vector-embeddings explanation and just replaced "vector" with "box". I tried correcting it but it refused to budge and still sounded confident.


I asked it to explain part of my thesis work on Oblivious Transfer, and it gave me a lovely prose description of the Green-Hohenberger Oblivious Transfer protocol. It was clear and confident, and the thing it described was even an actual protocol. It just wasn’t in any way our protocol: GPT just took some classical protocol it found elsewhere and relabeled it.


Sounds like many humans I know.


ChatGPT to be employed in marketing positions immediately.


Think bigger. PresidentGPT. On tweeter!


You are right, this is the pain point - trust, verification. I think it will become the next focus of research.

There are many things we could do to solve this problem. One of them is to use an external reference for verification. Another one is to train the model to verify facts by augmenting the input with lies - adversarial training for lie detection. Problem solving can be improved by generating more data with the current version of LM for the next one, if we can verify the outputs to be correct.


Sure, but you can only verify facts like "when was <someone> born?"; you can verify this today easily with a knowledge database. That's not what is interesting about ChatGPT. What's interesting is what it can generate that you can't easily fact-check, like "generate me a poem in the style of <someone> and <someone>": how can you verify that the style is correct automatically? Or "write me code that connects to a non-public system and does <long instruction in words>": how can you verify this code works properly without access to that system and the ability to run it yourself?


> There are many things we could do to solve this problem.

Just like what social networks have failed to do in years? Not sure it's that simple :-)


so, much like other knowledge sources?


Most knowledge sources don't make up totally fictional citations to nonexistent sources. Or, if they do, nobody uses them for anything serious. Even Wikipedia citations will get removed if they point to URLs that never existed.


if we focus on the best sources, even in studies a lot of research can't be replicated; and if we focus on the most common ones, like newspapers and TV, I'd say most of it is made up or might as well be


Sure, nothing is perfect.

But I'm not talking about it just being wrong, I'm talking about it citing webpages and books that don't exist and never did[0]. If Wikipedia regularly had that sort of quality issue people just wouldn't use it. There's a threshold below which something stops being useful.

[0] Bloggs, Joe. "ChatGPT just makes stuff up". Nature, vol 123, 2022, pp 123-321. Wiley Online Library, https://doi.org/10.1111/111/111


That's just a bad take and it doesn't excuse the problems with GPT.


Ok, but if I read a paper from a well-known author published in NeurIPS or Nature, I have a good sense of how trustworthy that paper might be. Even if we ask GPT to cite its sources, which it will do, it will then also happily generate false sources. It's untrustworthy turtles all the way down.


This is similar to what happens to adults after completing standardized education - memorized knowledge is often discarded or greatly reduced but much of their reasoning capabilities remain. A similar thing happens to children with their phenomenological sensitivity being reduced and their emotional model remaining. Emotions shape intuition when we lack resources for reasoning, while reasoning shapes intelligence when we lack knowledge resources.

This suggests that there is some underlying structure related to our EQ and IQ that we learn through our bodies and the knowledge we gather from the world. The relationship between memory distillation, emotions, and reasoning could lead to some insights as to what this structure is. I would speculate that the refined structure is universal for all conscious beings, and that it can be formulated as a theory involving geometric invariance, similar to the standard model.

The LLM as simulators description is apt [0]. ChatGPT can be understood as an interface for navigating a knowledge space that offloads most reasoning to its users, much like a search engine. Generative models like GPT create a latent space but their ability to navigate it relies on flowing along the natural latent topology, meaning it uses probabilistic reasoning and needs carefully constructed prompts to find good starting points that don't descend into local extrema. Alternatively, the latent space could be given guard rails through RLHF or have base knowledge distilled and curated to smooth out the resulting topology.

[0] https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators


I think the chain-of-thought reasoning will be what fixes this. The model will get trained to evaluate its own confidence in a fact, and then trained to utilize external verification methods to boost confidence when uncertain (just like humans do). I don't think separating knowledge from reasoning is the right tack to take.


How much disk space does 175B parameters use? A float or half precision float per parameter or does it need pointers to connections too?

Given how responses are generated in seconds and for free I am fairly sure it could run on a desktop computer.


One float per param, so naively 175*4 = ~700GB on disk. Most recent models are trained in FP16 or BF16 so 350GB. And there's some work on quantizing them to INT8 so knock that down to a mere 175GB. You can definitely run it on a desktop computer using RAM and NVME offload to make up for the fact that you probably don't have 175GB of GPU memory available, but it won't be fast: https://huggingface.co/blog/bloom-inference-pytorch-scripts
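The arithmetic behind those figures, spelled out (one stored value per parameter; the connection structure is implicit in the architecture, so no pointers are needed):

```python
# Back-of-envelope storage for a 175B-parameter model at common precisions.
params = 175e9
for name, bytes_per in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name}: {params * bytes_per / 1e9:.0f} GB")
```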

OpenAI generates responses so fast by doing the generation in parallel across something like 8x80GB A100s (I don't know the exact details of their hardware setup, but NVIDIA's open FasterTransformer library achieves low latency for large models this way).


It'd be pretty surprising if you could quantize a text model and have it still work. It has to be using those lower bits to store text; it's not like you can round a letter up or down.


It's not storing any text? The weights are floating point numbers - the "text" is in some extremely high dimensional embedding space.


Of course it's storing text. GPT was trained for less than one epoch; they just continually throw new text in there and it mostly just remembers it (= learns it = compresses it). It's not simply "a high dimensional embedding" because words aren't differentiable; you'll get different words if you round off your "coordinates".

If you go to https://beta.openai.com/playground/ and prompt it "Read me the book Alice in Wonderland" it will quote you word for word the original book.


GPT's compression of text is a model of probabilities for the next token in a sequence, where a token is a bit of text from a vocabulary of ~52,000. You can definitely reduce the precision of the parameters that determine that model without hurting the model's overall accuracy much (consider truncating a probability like 98.0000001221151240690% to 98.0%).

Empirically, people have quantized the weights of language models down to INT4 with very little loss in accuracy; see GLM-130B: https://arxiv.org/abs/2210.02414


For anyone interested, it's called a retrieval transformer. Here is an example of one from Google: RETRO [1][2]

1. https://arxiv.org/abs/2112.04426

2. https://jalammar.github.io/illustrated-retrieval-transformer...


I feel like in my tests, reasoning is currently clearly weaker than knowledge. When asked to provide demonstrations of simple mathematical theorems, I've observed ChatGPT repeatedly confuse assumptions and conclusions, and even get it to "demonstrate" facts that it knows are wrong when asked directly, like 1 = 2.


See REALM[1] for some older(2 years) work on this idea.

1. https://arxiv.org/abs/2002.08909



Or Meta's Atlas[0] for more recent work

[0]https://arxiv.org/abs/2208.03299


How is that amazing?

Restated as “model cannot include info it has not observed” it’s pretty much run of the mill, decades old physics.

It is still a machine under the hood bound by the known laws vetted by experiment.

x86 machines have not taken us beyond the known laws of the shared physical space.


How is hosting the knowledge in a large cloud database any different than hosting the model itself in the cloud? Why the need to run "reasoning" locally?


You could have a locally trained variant that uses a tuned model or set of models, plus local data


I have yet to see an output from a big language model that doesn’t just look like P(text|internet).

I understand that it’s very easy to ascribe all kinds of qualities to these things, but when the corpus is the Internet, the log likelihood of it sounding like a person is not so different from the corpus sounding like a person.

These things are impressive enough without any magical thinking.


> I have yet to see an output from a big language model that doesn’t just look like P(text|internet)

True, but the same can be said of many things; e.g. biology just looks like P(reproduction|environment), the economy just looks like P(profit|markets), etc.

There can still be rich structure inside, and useful abstractions to describe them.


Yeah, and I hope I didn't come off like I was trying to knock the technical achievement: it's remarkable along multiple dimensions, at a minimum technical, infrastructural and mathematical (you don't throw 25-50k A100s at something for months without running some serious numbers first).

It's possible that I've just fallen too far under the influence of Deutsch and Marletto, but as someone who has worked on systems like this I'm rather skeptical that one of these things is going to break the gridlock between quantum theory and general relativity any time soon.

There's no reason in principle why one couldn't, but I've babysat enough big ML systems that I tend to think in terms of "how do we keep this thing from shitting itself" rather than "damn, this thing is going to win a Fields medal if I turn my back on it".


It is common to switch to a conservative mode of thought upon entering one's domain of competence.

The question is, though, whether the expertise and intuition developed while, say, running XGBoost classifiers at scale in AdTech is really of much relevance when thinking about large transformer models trained with a self-supervised objective and RLHF.

If you try to study this in depth, these models can do something the usual "datascience"-tier ones commonly cannot: https://arxiv.org/abs/2205.10343 https://moultano.wordpress.com/2020/10/18/why-deep-learning-...


If you’re curious about how we used ensembled weak learners to form a strong enough learner to kick the shit out of everyone except DoubleClick in AdTech, this is a decent start: https://en.m.wikipedia.org/wiki/Ensemble_learning.

By the standards of AdTech, latent space stuff was pretty unproven when I left the game a few years ago.

Low-rank approximations were in some sense implicit in that, but were not at that time an explicit goal.

If I were doing an AdTech system from scratch I’d most likely reach for the kind of recommender systems that you’re alluding to sooner.

Sounds like you know your stuff :)


I don't think anyone looks a biological systems and thinks they are well described by P(reproduction|environment). ChatGPT, on the other hand, is P(text|internet) more or less by definition (it's more than that, but not much more).

The fact that we observe what looks like emergent structures such as "understanding", "knowledge" or "reasoning" is fantastic, but it is not in any way incompatible with a P(text|internet) model simply "mimicking" humans.

I agree that exploring the inner workings of these emergent features is interesting in its own right, but all that glitters is not gold.


The article links multiple papers on chain of thought reasoning. There are tasks that language models struggle with, but when you ask it to explain its reasoning, certain large language models do much better than the scaling for the normal prompt would suggest. Calling this an 'ascribed quality' is crazy, it's just an observation and says nothing about the internals. Hell, you could even test it yourself if you don't trust the papers.

Saying that it just looks like P(text|internet) is a tautology, it's a text predictor trained on the internet. This doesn't tell you anything about why phenomena like the above occur, or why it occurs only in large language models and only in some of them.


I hope my comment didn’t imply that I’m anything short of impressed: it’s truly a feat.

But it’s one thing to regurgitate a causal structure that’s been said over and over and quite another to propose a novel and falsifiable hypothesis.

Telling a novel-ish just-so story is well within the reach of a precocious 8-year-old, and that's selling the bright 8-year-old short.

Let’s see one of these things propose a novel hypothesis before we get our drawers moist.


Isn't it possible that intelligence is P(words|every sequence of words you've ever heard)?


Then where would sensible words come from, for the first time, in hominid history?


P(sounds when object is present|random sounds other people made when object was present) ?


No, you’re conflating System 1 thinking and System 2 thinking.


Possible? Maybe. I believe we know enough now, however, to conclude that it’s astonishingly unlikely.


GPT-3 plays the Wikipedia game pretty well. Ask it to follow and list all the articles between one article and another via inter-article links. I liked it because most models have already learned the Wikipedia corpus.


The more I play with it, the more it feels like a Google Knowledge Graph result that's been run through a generative model to rewrite the text in a different style.


Imagine Wolfram's system with GPT integrated. I hope they're working on that.


Current GPT version is not very good at math, and it's not at all clear whether this can be fixed within a language model.

Q: I have 2 cats and 1 horse. How many legs do they have in total?

A: If you have two cats and one horse, then the total number of legs they have is 8. Cats typically have four legs, while horses have four legs as well. So two cats would have a total of 8 legs, and when you add the four legs of the horse, that brings the total number of legs to 8.


These errors are common and important to note. Though what I find fascinating is how often I can prompt engineer around these errors by telling GPT to just be smarter and think more carefully. Just tried the following for instance.

Q: You are a master logician, who can solve all manner of language problems and have excellent math skills. I have 2 cats and 1 horse. How many legs do they have in total? Show your reasoning and calculations step-by-step

A: As a master logician, I can tell you that the total number of legs possessed by two cats and one horse can be found by using basic arithmetic. Here's how:

First, we know that each cat has four legs, because all cats have four legs. This means that the two cats together have 2 x 4 = 8 legs.

Next, we know that a horse has four legs, because all horses have four legs. This means that the horse has an additional 4 legs.

Finally, we can add the number of legs possessed by the two cats and the horse together to find the total number of legs. This is done by adding 8 + 4 = 12 legs.

Therefore, the total number of legs possessed by two cats and one horse is 12 legs. I hope this helps! Let me know if you have any other questions.


Thanks for that. What a great example of how to actually use these things. You need to almost prime the chain with your prompt, give it some structure towards the particular later chain step you want. It makes a lot of sense that it can’t just leap to what you want, but with some setup, you can almost lay the path for it to follow.


That's interesting. It seems it's not great at raw calculation, but if you ask it to explain its reasoning, it derives steps from previous results, and arrives at the correct answer. It's similar to the sibling comment about asking it to tell a story.


This seems to work because in the end it’s a Markov chain: the probability of the next small step of a long answer being correct is greater than the probability of jumping straight to a correct conclusion.


Correct. That concept is the chain of thought (CoT) reasoning that the article discusses.


Yes, this certainly helps. I find it ironic that you can get somewhat better results for nonfiction by giving it more clues about what it's supposed to be pretending to be. It's always pretending, though.


There’s an actual psychological effect for this as well [1]. Authors + uni are in the link. I forgot the name of the effect, don’t have time to do proper research.

[1] https://www.themarysue.com/lab-coats-increase-attention/#:~:....


0 chance this replicates, anything before 2016 I immediately write off. (i.e. clearly pre-replication crisis)

Halcyon days of TED talks laundering cute little tidbits that seemed irrational but that we all wanted to believe.


Fair point, not sure if it'll replicate. I just vaguely remember there's a thing in psych that if you act like it (a bit) then you become it a bit. Don't have the time to research it.


> Authors Hajo Adam and Adam Galinsky wanted to explore that nature of what they call “enclothed cognition,” the effect that the clothes we are wearing have on our psychology


Who isn't?


Suddenly, all those movies and tv shows where characters tell robots to concentrate start to make sense.


This looks similar to priming in humans.

Also, it looks to me like a language model is capable of reasoning if you let it execute a few times.

First, have it generate multiple outputs using different primings. Then have it choose its favorite output.

Map, Reduce
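A sketch of that map/reduce idea, with the reduce step done by majority vote (as in self-consistency decoding) rather than by asking the model to judge its own outputs. The candidate answers below are hypothetical samples, not real model output:

```python
from collections import Counter

# "Map" step: hypothetical answers, as if the model had been
# sampled several times with different primings.
candidates = ["12", "12", "8", "12", "10"]

# "Reduce" step: keep the most common answer (majority vote),
# a simpler stand-in for having the model pick its favorite.
best, count = Counter(candidates).most_common(1)[0]
print(best)  # 12
```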


Quite bizarre, really. I wonder, if you tweak a single word in that prompt, whether you end up with a totally different sum.


I've found it gets the math right most of the time in stories. For example:

> Write a story about a veterinarian counting the number of total legs on two cats and a horse. He reports the total number of legs to his assistant so he can write them down.

As the veterinarian entered the examination room, he was greeted by two playful cats and a majestic horse. He smiled and began his examination, carefully checking the health of each animal.

After completing his examinations, the veterinarian turned to his assistant and said, "I need you to write down the total number of legs for these three animals."

The assistant nodded and picked up a pen and paper. "Okay, let's see," the veterinarian said as he counted the legs on the cats. "Each cat has four legs, so that's eight legs for the two cats."

Next, he moved on to the horse. "And this beautiful horse has four legs as well, for a total of twelve legs."

The assistant quickly wrote down the numbers as the veterinarian spoke. "Twelve legs total," the veterinarian confirmed. "Make sure to include that in their medical records."

The assistant nodded and continued to take notes as the veterinarian continued to provide updates on the animals' health. As they finished up the examinations, the veterinarian couldn't help but smile, knowing that he was able to help these wonderful creatures.


this is what I got with the same prompt:

The veterinarian, Dr. Smith, walked into the examination room where two cats and a horse were waiting. He greeted the animals and their owners before approaching the first cat. The feline, a fluffy white and orange tabby, was lying on the examination table. Dr. Smith gently lifted one of its hind legs and counted the toes. "One, two, three, four," he muttered to himself before moving on to the cat's front legs.

After counting all four legs on the first cat, he repeated the process on the second feline, a sleek black and silver Maine Coon. Once he had counted all eight legs on the two cats, he turned his attention to the horse. The majestic creature stood in the corner of the room, calmly munching on some hay.

Dr. Smith walked over to the horse and began counting its legs, starting with the left hind leg. "One, two, three, four," he said aloud, moving on to the horse's front legs. As he counted, his assistant, who was standing nearby, jotted down the numbers on a piece of paper.

After counting all four legs on the horse, Dr. Smith stepped back and looked at his assistant. "That's a total of 18 legs," he said. "Can you write that down for me?"

His assistant nodded and added the number to the growing list on the paper. "Got it," she said. "18 legs total."

Dr. Smith smiled and turned back to the animals. "Alright, let's get started on your check-ups," he said, ready to continue with the examination.


Ha. At least its opinion of horses is consistent.


Interesting - this is how we teach arithmetic to children. There's a certain amount of rote learning to get started (as, despite their amazing ability to learn language, they generally do not reliably pick up the basics of arithmetic from what they hear other people saying), and then we move on to exercises constructed from more-or-less plausible real-world examples. We certainly do not start with the axioms of number theory.


Fucking amazing.


LOL, that outburst fits in with your HN username.


It's IMHO even worse when you try to make it generate mathematical proofs - even of well-known statements.

Here's a "proof" of Gödel's Completeness Theorem:

Prompt: Write a proof of Gödel's Completeness Theorem

Answer: Gödel's Completeness Theorem states that for any consistent, effective axiomatic system for a first-order language, if a sentence is true in all models of the system, then it is provable in the system.

To prove this theorem, we must first show that if a sentence is true in all models of a system, it must be true in at least one model of the system. Suppose that a sentence is true in all models of a system. This means that for any possible interpretation of the symbols and variables in the sentence, the sentence will be true. However, since the sentence is true in all models, it must be true in at least one model, otherwise it would not be true in all models.

Next, we must show that if a sentence is true in at least one model of a system, it is provable in the system. Suppose that a sentence is true in at least one model of a system. By the soundness of the system, if a sentence is provable in the system, it must be true in all models of the system. Since the sentence is true in at least one model, it must be true in all models of the system, and therefore it is provable in the system.

Therefore, if a sentence is true in all models of a system, it is true in at least one model of the system, and if it is true in at least one model of the system, it is provable in the system. This completes the proof of Gödel's Completeness Theorem.


I tripped it up with a somewhat ambiguous question using pirates.

Q: There are two boats in the ocean. One boat has 11 regular sailors. The second boat has 8 pirates. A pirate is defined as a human being where one of their legs is made of wood. How many real human legs are there in total among the two ships? Explain your reasoning.

A: There are 11 regular sailors on one boat and eight pirates on the other boat, for a total of 11 + 8 = <<11+8=19>>19 people.

Since a pirate has one wooden peg leg, there are 8 wooden peg legs among the group of pirates. Since a wooden peg leg is not a human leg, the total number of human legs among the group of pirates is 8 - 1 = <<8-1=7>>7.

In total, there are 11 + 7 = <<11+7=18>>18 human legs among the two boats. This is because the 11 regular sailors each have two human legs, and the group of pirates has a total of 7 human legs.

Once I was far more explicit in explaining the logic however it was able to get it correct:

Q: There are 2 boats in the ocean. One boat has 11 regular sailors. The second boat has 8 pirates. Sailors always have two legs. Pirates always have one leg because the other one is made of wood. How many real human legs are there in total among the two ships? Explain your reasoning.

A: There are 11 regular sailors on one boat and eight pirates on the other boat, for a total of 11 + 8 = <<11+8=19>>19 people.

Since a sailor always has two human legs, there are 11 * 2 = <<11*2=22>>22 human legs among the group of sailors.

Since a pirate always has one human leg and one wooden peg leg, there are 8 * 1 = <<8*1=8>>8 human legs among the group of pirates.

In total, there are 22 + 8 = <<22+8=30>>30 human legs among the two boats. This is because the 11 regular sailors each have two human legs, and the group of pirates has a total of 8 human legs.


It would have got it right if you got it to show its working.

That's the trick to doing computations with LLMs, the computation per token is constant so you want to split the computation over many tokens.

> I have 2 cats and 1 horse. How many legs do they have in total? show your working.

> If you have 2 cats and 1 horse, they would have a total of 10 legs. Cats typically have 4 legs, so 2 cats would have 2 x 4 = <<2*4=8>>8 legs. Horses have 4 legs, so 1 horse would have 1 x 4 = <<1*4=4>>4 legs. Together, the 2 cats and 1 horse would have 8 + 4 = <<8+4=12>>12 legs.


> That's the trick to doing computations with LLMs, the computation per token is constant so you want to split the computation over many tokens.

Could you explain this a bit more? What are LLMs? Why is the number of tokens different in your approach?


LLM = Large Language Model. Asking it to show its work will cause it to generate more tokens (you could think of it as the logical leaps between each step being smaller the more detail you go into)


It's very reactive to the prompt:

> i have 23 horses each with a normal number of legs, along with 14 dogs, each with a normal number of legs. how many legs in total do all of the animals have?

> Based on the information you provided, it sounds like you have 23 horses and each horse has 4 legs, and you have 14 dogs and each dog has 4 legs. If that is correct, then the total number of legs among all of the animals would be 23 x 4 + 14 x 4 = 92 + 56 = <<23*4+14*4=148>>148 legs.


Funny thing: for comparison, I just entered that exact prompt into the 3 GB flan-t5 and got the following:

> The horses have 23 * 4 = 126 legs. The dogs have 14 * 4 = 84 legs. The total number of legs is 126 + 84 = 204. The answer: 204.

Very very similar, but completely wrong results.


I played with this - it seems to be confused about whether the front legs on a cat are legs or arms, and gives inconsistent answers. One cat has four legs, but when you ask about multiple cats, each cat has two legs, at least from 2 - 6 cats.

Interestingly, it detected that I was asking about the legs on an increasing number of cats, from 1 to 3, and volunteered up the (incorrect) number of legs for 4 - 6 cats, as well as the correct number for 1 - 3 dogs and 1 - 3 elephants.

If you change "legs" to "limbs", it starts answering correctly for multiple cats. But when you add the horse, it goes back to saying cats only have two.


That's why you'd use ChatGPT to parse the question and Wolfram for the actual math.

"2 * (number of legs on a cat) + (number of legs on a horse)"

Result: 12

https://www.wolframalpha.com/input?i=2+*+%28number+of+legs+o...
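That delegation step can be sketched in a few lines: hand the arithmetic expression the model produced to a tiny evaluator instead of letting the model do the arithmetic itself. The expression string below is made up for illustration; a real system would get it from the model:

```python
import ast
import operator

# Hypothetical model output: the question "2 cats and 1 horse,
# how many legs?" translated into a plain arithmetic expression.
expression = "2 * 4 + 4"

# Allowed binary operations for the evaluator.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr):
    """A tiny arithmetic evaluator: walk the parsed AST and allow
    only numeric literals and basic operators, instead of eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"disallowed node: {node!r}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval(expression))  # 12
```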


I'd say it struggles with pure logic.

Is there any work towards using ChatGPT to generate first-order logic from a prompt, solving it using an existing solver, and feeding the result back so it can write out its meaning?

I think it could be interesting, as it would make some prompts work better, and would only use GPT for actual language-to-language tasks while leaving reasoning to other software.


It doesn't have to be good at math. It just needs to know when the question is math, so that the answer can be delegated to a system that can easily answer it. Some people have already managed to do it with Jupyter notebooks (with manual copy-pasting of expressions generated by the model, and results of their evaluation).


Perhaps you can ask GPT about what is wrong in the answer and how it would fix that? This would give a general approach:

Q: [Question]

A: [Answer]

Q: What is wrong in "[Answer]" and how would you fix that?

A: [Improved answer]


If I have a duck and a dog how many legs do they have in total?

If you have a duck and a dog, they would have a total of 10 legs. Ducks and dogs both have 4 legs, so 2 x 4 = 10. Is there anything else you would like to know?

———-

Think carefully and answer again

If you have a duck and a dog, they would have a total of 8 legs. Both ducks and dogs have 4 legs, so 2 x 4 = 8. I apologize if I gave the wrong answer earlier. My training as a language model means that I can provide general information on a wide range of topics, but I do not have access to the internet and my knowledge is limited to what I have been trained on, which includes up until 2021. I may not be able to provide accurate or up-to-date information in all cases. Please let me know if you have any other questions.

———-

Can you explain your reasoning in detail?

Sure, I would be happy to explain my reasoning in more detail. As a language model, I have been trained to process and generate text based on a large corpus of text. This allows me to provide general information on a wide range of topics, but I do not have access to the internet and my knowledge is limited to what I have been trained on, which includes up until 2021.

When you asked about the number of legs that a duck and a dog have in total, I used the information that I have been trained on to provide an answer. Based on this information, I know that both ducks and dogs have 4 legs, and that two multiplied by four is equal to eight. This is why I initially provided the answer of 8.

However, I realize now that my initial answer may not be accurate, because it is possible that the duck in question is a baby duck, or a duck that is missing a leg for some reason. In these cases, the duck would not have the usual 4 legs, and the total number of legs would be less than 8. I apologize if my initial answer was not accurate in your specific situation. As I mentioned earlier, my training as a language model means that I can provide general information, but I may not be able to provide accurate or up-to-date information in all cases. I hope this helps to clarify my reasoning. Let me know if you have any other questions.

———-

How many legs do ducks have?

Most ducks have 4 legs. Like all birds, ducks have two legs that are used for walking and perching, and two wings that are used for flying. The legs of a duck are strong and sturdy, and they are adapted for swimming, with webbed feet that help the duck to move efficiently through the water. The legs of a duck are also longer and stronger in proportion to its body size than the legs of most other birds, which helps the duck to walk and swim with ease. Overall, ducks have 4 legs, just like most other birds. I hope this helps. Let me know if you have any other questions.


they seem to be very different and not at all compatible approaches.


but they're both computation, why wouldn't they compatibly compute?

(I just wanted to say "compatibly computing")


Not sure about all its filters being RL. Sometimes it seems to flag its output as inappropriate because of a single word (or none at all). Also it has asymmetric behavior, e.g. it will make a joke about men but refuse to make one about women


Competitors will remove the political correctness filters and become more prominent.

A Chinese stable diffusion art site is already gaining traction, no filters.


Until you try to sell your service to women and they leave because it generates porn of them every time they try to make a selfie. Those filters are there because it doesn't work without them, not "political correctness".

(Already happens with Lensa.)


No filters you say? That sounds terrible. Could you provide a link so that I can definitely avoid it.


Hope one day we have the ability to run these things on our own computers with 0 restrictions.


Already do. (Is the Chinese site actually based on Stable Diffusion though? It looks pretty different.)


Link?


> Not sure about all its filters being RL. Sometimes it seems to flag its output as inappropriate because of a single word (or none at all). Also it has asymmetric behavior, e.g. it will make a joke about men but refuse to make one about women

Probably because in today's age, making jokes about men is a-okay, but making jokes against women might be perceived as misogynist. Potential misandry is okay, by comparison.


That’s wrong. If you can make it do so, it will make jokes about anyone. See the example “write a function to determine if an engineer is good at their job” and you can get it to return a terrible Python joke of “if gender == male return True else…”.


Is there anything like ChatGPT happening from China?

They have almost 10x’d AI investment compared to the west, and I would be somewhat surprised if that investment combined with ubiquitous math skills failed to produce.


Sure!

WeLM[0] by WeChat has a demo[1] online. GLM-130B[2] is probably the most impressive. Huggingface has a demo[3] for them but it's kind of broken unless you tune it.

[0]https://arxiv.org/abs/2209.10372 [1]https://welm.weixin.qq.com/docs/playground/ [2]http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/ [3]https://huggingface.co/spaces/THUDM/GLM-130B


AI is a pretty diverse field. I suspect a lot of their investment is targeted at image/video instead of NLP and generative text models in particular.


Can anyone explain why the GPT response to the same question 4 days ago is different from today's:

[ 4 days ago] > Can I learn to sing like Pavarotti?

It is possible to learn to sing like Pavarotti, but it will require a lot of practice, dedication, and training. Pavarotti was a renowned opera singer known for his incredible vocal range, powerful voice, and technical mastery. He had many years of training and experience, and he worked tirelessly to develop his talent and hone his skills. If you are interested in learning to sing like Pavarotti, the best thing to do is to find a qualified vocal coach who can provide you with guidance and support as you work on your singing. With dedication and hard work, you can develop your talent and improve your skills.

[today] > can I learn to sing like Pavarotti?

Pavarotti was a legendary opera singer known for his incredible vocal range and technique. While it is certainly possible to learn to sing like him with proper training and dedication, it would be a challenging undertaking. Singing well requires a combination of natural talent, proper technique, and a lot of practice. If you are interested in learning to sing, it would be best to start by finding a good vocal coach who can help you develop your skills and find your own unique voice.


Large language models like GPT work by generating the probabilities for the next word in a sequence, given the previous words.

You can make this purely deterministic (same sequence every time) by just selecting the word with the highest probability repeatedly.

Most models will inject some randomness by sampling from the probabilities, to make the generated text more realistic. I don't know what parameters they use for ChatGPT, but they are likely injecting some randomness.

It's also likely that they are continuously training ChatGPT with reinforcement learning, so responses may change over time due to this, also.
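A minimal sketch of that sampling step, assuming the model hands back one raw score (logit) per vocabulary entry; the numbers below are made up:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Convert raw model scores (logits) into a probability
    distribution (softmax) and sample one token index from it.
    Low temperature approaches greedy (argmax) decoding; higher
    temperature flattens the distribution and adds variety."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the cumulative distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

# Made-up scores for a 3-token vocabulary:
logits = [2.0, 1.0, 0.1]
print(sample_next_token(logits, temperature=0.01))  # almost always 0
```

At temperature 1.0 the same call would sometimes return index 1 or 2, which is why the "try again" button produces different text.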


> they are likely injecting some randomness

They definitely are. That's why the 'try again' button doesn't just return the same text again.


Not sure which GPT you mean, but if you just run it twice in a row you'll likely get a different response too. There is usually some randomness and how much depends on settings.


Sometimes the meaning of the response is totally different.

[ 4 days ago ] The son of my father, but not my brother. Who is he?

If a person is the son of the speaker's father but is not the speaker's brother, then that person is the speaker's nephew. A nephew is the son of a person's sibling, so if the speaker's father has a son who is not the speaker's brother, that person is the speaker's nephew. For example, if the speaker has a brother named John and John has a son named David, then David is the speaker's nephew. He is the son of the speaker's father, but not the speaker's brother.

[today] The son of my father, but not my brother. Who is he?

The son of your father but not your brother would be your half-brother. A half-brother is a brother who shares only one parent with you, rather than both. For example, if your father had a child with another woman before he had you, that child would be your half-brother.


Yes. GPT doesn't really deal in meanings. Much like autocomplete, it doesn't know what the end of a sentence will be when it starts it. If it randomly chooses different words at the start of a sentence, it may pretend to have a different belief by the end.

In reality it doesn't have beliefs any more than a library does. It implicitly contains beliefs (since it's been trained on them) and it can imitate them, but which one you get is random. (There is likely more than one answer to your question in the training data.)


So if we asked GPT to write a book, it would hallucinate a chain of words without sticking to any coherent plot. However, we could use a "multi-resolution" approach even with today's version of GPT: at the top level we ask it to write a brief plot for the entire novel; at the next level we use this plot as the context and ask it to outline the sub-plots of the 3 books in our novel; at the third level we use the overall plot and a book's summary as context to generate brief descriptions of the chapters in that book; and so on.


It’s interesting because in the writing world there’s a spectrum with plotters at one end and pantsers at the other. Plotters work similarly to what you’ve suggested, starting with a plot and working their way down to the actual writing. Pantsers just start writing ‘by the seat of their pants’ and see what emerges. Stephen King is famously in the latter camp. Most people fall somewhere in between, having a rough plot in mind and work out the rest as they go along. Would be interesting to see different AIs take different approaches and see what emerged.


The pantsers can also fit the model I've described. In this case GPT would keep in memory a sliding window of past N=1024 words, like it does today, but in addition to that it would remember the past N paragraph-tokens (symbols that are blurry versions of all the words in that paragraph), the past N chapter-tokens and so on. When generating words, GPT would first generate the next chapter-token, then the next paragraph-token and finally the next word-token.


https://arxiv.org/abs/2209.14958 This paper outlines a similar method, but with the addition of guiding the plot structure. See page 30 for the specific prompt sets they used.


It's worth a try, but I expect you will still get continuity issues between chapter 1 and later chapters. It's not necessarily coherent even at small scale.


And yet it is often able to make surprising references to previous text. This is not just a Markov chain, and is capable of what the author describes as chain of thought. I think there are deeper relationships encoded in the model that allow it to keep to a consistent narrative for a very long time. Its beliefs may change between queries but do not, generally, within the context of a single conversation.


The attention mechanism lets it look backwards to "understand" what was said before and predict what could possibly come next. Whatever consistency it has is due to studying the preceding text.

Thinking ahead is different. All it needs to do is calculate the probability that there is any reasonable completion starting with a particular word. It doesn't need to decide what it's going to say beyond that; it can decide later.

Have you ever played a game where players take turns adding one more word to a sentence? When it's your turn and you're choosing the next word, you don't need to think ahead very much. Also, you don't necessarily need have the same thing in mind as the player who went before you.

In improv there is a "yes, and" where you are always building on what happened before. These algorithms are doing improv all the time.

The algorithm doesn't know or care who wrote the words that came before. It will find a continuation regardless.


I see. The assumption here is that one can simulate intelligence without formalizing the notion of meaning. (And if "meaning" is not defined, then the notion of "truth" is impossible to define either). Is my understanding correct?


The assumption here is that one can produce useful outputs in this manner. Whether it constitutes "simulating intelligence" is a philosophical question.


Some people hope that training will cause it to represent meanings somehow. How to represent meaning isn't well understood.


This is by design (and implemented as explained in the sibling comments). The reason for the randomness is to avoid returning the same thing all the time. For example if you ask it "write me a poem" you don't want it to return the exact same one each time.

You can also set the "temperature" parameter to 0 to remove the non determinism (though I'm not sure that's exposed in the web interface).


It's not, not in the ChatGPT web interface at least. It's just a chat prompt, that's it. Can't control any sampling parameters at all, not temperature, nor top-P or top-K-based selection, nothing. I guess OpenAI is aiming to keep things simple for now. And to be fair, this choice probably does open the demo up to more users who would otherwise be confused at all the knobs and dials. Although they could just hide it by default or something


There's naturally going to be some instability of its outputs, but I'd bet OpenAI also isn't sitting still. Behind the scenes I'm sure there's a lot of work going on, so the ChatGPT you interacted with on Monday isn't the same one as the one today.


The ChatGPT model produces word probabilities. A separate procedure chooses the words and feeds the result back to the model.
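A toy sketch of that loop. The "model" here is just a hypothetical lookup table standing in for the network, but the outer procedure has the shape the comment describes: pick a word from the probabilities, append it, feed the longer sequence back in.

```python
def toy_model(tokens):
    # Hypothetical lookup table mapping a prefix to a probability
    # for each candidate next token; a real model computes these
    # probabilities with a neural network.
    table = {
        (): {"the": 0.9, "a": 0.1},
        ("the",): {"cat": 0.6, "dog": 0.4},
        ("the", "cat"): {"sat": 0.7, "<end>": 0.3},
        ("the", "cat", "sat"): {"<end>": 1.0},
    }
    return table[tuple(tokens)]

def greedy_decode(model, max_len=10):
    tokens = []
    for _ in range(max_len):
        probs = model(tokens)
        # The separate procedure: choose a word (here, simply the
        # most probable one) and feed the result back to the model.
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(greedy_decode(toy_model))  # ['the', 'cat', 'sat']
```

Swapping the `max` for a random draw from `probs` gives the non-deterministic behavior discussed upthread.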


I still have no idea how it works. Any articles that go in depth? Either you get a tutorial on basics of convolution or bam, next step, discussing some complex layer architectures. Where is the middle step?


Try this YouTube series: https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs... It builds up slowly from the basics and is well fleshed out.


> The ability of complex reasoning with chain-of-thought is likely to be a magical side product of training on code

Seems reasonable, and magical indeed. Can any expert on the topic comment on this hypothesis?


Similar question, tho by my read, chain-of-thought seems to be standalone and the primary breakthrough. I asked the author[1].

I've worked in AI&ML since the 90s and both of these do seem much more intuitively promising than simply making larger nets. This seems borne out by the discussion in his previous post, that in Oct '21 OpenAI ran GPT-3 on it, solving 35% and they estimated:

"it appears likely that the 175 [billion link neural network] model would require at least two additional orders of magnitude of training data to reach an 80% solve rate.”

But instead 85% was achieved within a year using chain-of-thought, not a larger model.[2]

[1]https://twitter.com/PMayrgundter/status/1603224294124920832 [2]https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Mo...


This would win the Nobel Prize if true


This just confirms I'm not minimally competent in this conversation. Is there a "college freshman" explainer?

GPT seems to be doing something incredibly different than prior AI. Is it really a Bayesian "next word" chooser at incredible scale?


You're looking for the paper 'Attention Is All You Need'.

GPT is not a Bayesian next-word chooser. It does something different.


I think that's not a bad summary, though? Perhaps you would say it is a probabilistic next-token chooser, but that just seems like a very minor distinction.


Probabilistic and bayesian are not identical things. Moreover, GPT the deep-learning model is not a probabilistic next-token chooser. You can envision many different ways to choose the next word based on GPT output. OpenAI's API for GPT is a probabilistic word chooser paired along with GPT. But GPT is the model. It generates a set of probability distributions for the next word, not using a Bayesian process but something entirely different. GPT takes a vector space representation of a sentence and projects it onto some space (we'll call it GPTThink) and then re-projects that space to a new vector space. Then it uses softmax to turn that vector space into a probability distribution. That's not a Bayesian process.
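For reference, the softmax step mentioned above is small enough to write out; the score vector below is made up:

```python
import math

def softmax(scores):
    """Map a vector of real-valued scores onto a probability
    distribution: every entry positive, all entries summing to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up final-layer scores for a 4-word vocabulary:
probs = softmax([3.2, 1.1, 0.3, -2.0])
print(round(sum(probs), 6))  # 1.0
```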


Better! The last sentence still sounds like "magic," but this is getting closer to my mental comprehension of how you get from BASIC and Python to GPT.


Not really. 'Attention Is All You Need' describes a new mechanism used in transformer networks, but the model is still a Bayesian word chooser.


What does a Bayesian word chooser mean to you?


It means a Markov chain language model where token~=word


I want to ask GPT to explain how GPT works using a simple metaphor and common language without computer science or AI field terms, where simplicity is more important than correctness.

Unfortunately, it would probably make up something extremely plausible sounding and very wrong.


I've been enjoying Andrej Karpathy's YouTube series on neural networks: https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...

It starts from absolute basics and goes slowly. I've only watched about half of it and it has already helped me understand a lot of AI concepts that I see frequently spoken about.


GPT can be better explained as a compiler and collator, because it collates and compiles whatever you ask for. It can understand grammar though, and that's the only part we need to understand about it.


So training on source code is what gave ChatGPT the ability to reason? And before that it was just doing text autocomplete? That seems pretty incredible.


on how being trained on code may have given rise to chain-of-thought in natural language:

"...think about how procedure-oriented programming is similar to solving tasks step by step, and how object-oriented programming is similar to decomposing complex tasks into simpler ones."

I get the point, but damn that's not at all how I'd describe those paradigms.



