Some of ChatGPT's Defining Characteristics
It is not easy to elicit straight answers from ChatGPT about its own architecture and training parameters, but others have provided them for us. Alex Hughes, writing for BBC Science Focus, tells us that GPT-3, the parent AI engine from which ChatGPT arose, has 175 billion parameters and was trained on 570GB of digital text gleaned from "books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system".
That sounds like a lot, but how much is it really?
How Big is 300 Billion Words?
Given that most of us have very little grasp of what "300 billion" looks like, it's worth considering this number in relation to the capacity of a human brain, which some have estimated at about 1,024 terabytes of information. In comparison, 300 billion words probably amounts to something like 2 terabytes: 300 billion bytes is 0.3 terabytes, and if we take the mean number of letters per word - each letter stored as one byte, 8 bits - to be around 5-6, the total comes to roughly 1.5-1.8 terabytes, so 2 terabytes is if anything a slight overestimate. (In fact, the mean number of letters per English word is roughly 4.7, but we are being generous because many technical articles have more long words.)
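The back-of-envelope arithmetic above can be checked in a few lines of Python; the word count and letters-per-word figures are simply the assumptions made in the text:

```python
# Back-of-envelope estimate of the storage needed for 300 billion words,
# assuming plain text with one byte per letter, as in the text above.
WORDS = 300e9            # 300 billion words fed into GPT-3
BYTES_PER_LETTER = 1     # one byte (8 bits) per character

for letters_per_word in (4.7, 5, 6):   # 4.7 = mean English word length
    terabytes = WORDS * letters_per_word * BYTES_PER_LETTER / 1e12
    print(f"{letters_per_word} letters/word -> {terabytes:.2f} TB")
```

Running this gives roughly 1.4 to 1.8 terabytes across the range of assumptions, confirming that 2 terabytes is a generous round figure.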
According to Wikipedia as of September 2022, the total size of Wikipedia's compressed articles is just over 23GB, so GPT-3's training data at 570GB is about 25 times greater than the whole of Wikipedia. That gives us some idea of the quantity, but it still may not mean much. For example, there are over 30,000 articles on Wikipedia just about mathematics, which form a tiny part of the 23GB.
Size Isn't Everything
The amount of training data nevertheless tells only part of the story, because the remarkable thing about ChatGPT is the sophistication and flexibility of its responses to questions framed in ordinary - and often complex - everyday language. So the quantity of training data matters less than the model's ability to use it to respond to a bewildering range of questions on - as far as one can tell - almost any topic under the sun.
One of the most interesting features of ChatGPT's self-description is its emphasis that it does not store information raw. For example, there is nowhere in its neural net where we can find the name of the forty-fourth president of the USA, but ChatGPT can identify that person by a kind of inferential process that extracts it from the rest of the data stored in its trained neural net.
Nobody is ever likely to be able to say precisely how ChatGPT performs its inferences, because the information stored in its neural net is completely unintelligible from the outside. To all intents and purposes it operates like a "black box": we feed in a question, and it responds with an answer. What happens in between is unlikely to be very informative, even to its creators at OpenAI.
Knowing How and Knowing That
The OpenAI team obviously know how to create a language model, how to train it, refine it, curate the data that trains it, and monitor its responses. But "knowing that" it can answer questions does not entail "knowing how" beyond architectural generalities. Yes, they and we know that its training - including the reinforcement learning - entails backpropagation: weights and biases are updated by some form of gradient descent so as to minimise the output errors, measured by some criterion, against the supposedly correct answers in the training and test data. But exactly what is going on is no more intelligible than the way a human brain uses neurons to respond to a question with "Barack Obama".
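To make the update rule concrete, here is a deliberately minimal toy - nothing like OpenAI's actual code - showing one-parameter gradient descent on a squared-error criterion, the general mechanism the paragraph describes:

```python
# A toy illustration (not OpenAI's code) of the gradient-descent update
# described above: repeatedly nudge a single weight to shrink the error
# between the "network's" output and the supposedly correct answer.
def sgd_step(weight, x, target, lr=0.1):
    prediction = weight * x          # a one-parameter "network"
    error = prediction - target      # deviation from the correct answer
    gradient = 2 * error * x         # d(error**2)/d(weight)
    return weight - lr * gradient    # move against the gradient

w = 0.0
for _ in range(50):                  # repeated updates shrink the error
    w = sgd_step(w, x=1.0, target=3.0)
print(round(w, 3))                   # -> 3.0 (converges to the target)
```

A real model does this across billions of weights at once, which is precisely why the resulting configuration is unintelligible even though each individual step is simple.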
We know "how" ChatGPT operates only in the sense that we know it is a trained neural net; exactly how that net produces its answers remains unknown and unknowable. Even ChatGPT doesn't know, and that isn't a piece of spiritualistic mystery-mongering, just a statement of the brute fact that tracing the billions of interacting nodes in the neural net as they produce an output is completely intractable: we know that it happens, and at a technical hardware level we know how, but we are quite incapable of tracing or understanding the particular processes that produce a particular answer.
We may reasonably suppose that a lot of ChatGPT's training data involved the Obamas, but let's suppose that nowhere was it told explicitly the name of the 44th president. It might know something about Michelle, the Obama children, the eight years in the White House, and all kinds of other things. How might it "infer" - using what it rather amusingly calls an "educated guess" - who the 44th president was? It might reason "there was a 43rd president and a 45th president, so there must have been a 44th", but it doesn't even "say this to itself" explicitly; instead, it conjures up a kind of integrative composite answer that is most compatible with, and least incompatible with, everything else it "knows" - a kind of "minimax" process. It can still be wrong, but with so much data it is unlikely to be wrong about something as basic as this.
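The "most compatible, least incompatible" idea can be caricatured in a few lines. The sketch below is in no way ChatGPT's actual mechanism - the names and association sets are invented for illustration - but it shows how an answer can be selected by overlap with everything else that is "known" rather than looked up directly:

```python
# A deliberately crude toy (nothing like ChatGPT's real mechanism):
# score each candidate answer by how many stored associations it
# shares with the fragments known about the question, and pick the best.
associations = {
    "Barack Obama": {"Michelle", "44th", "White House", "president"},
    "George W. Bush": {"43rd", "White House", "president"},
    "Donald Trump": {"45th", "White House", "president"},
}
context = {"Michelle", "44th", "president"}  # fragments "known" about the query

def best_candidate(context, associations):
    # the name whose stored associations overlap the context the most
    return max(associations, key=lambda name: len(associations[name] & context))

print(best_candidate(context, associations))  # -> Barack Obama
```

The point of the caricature is only that nowhere in the data structure is "the 44th president is Barack Obama" stored as a fact; the answer emerges from compatibility with everything else.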
ChatGPT doesn't connect to the Internet for information: whatever resources it has are fixed and static, a consequence of its training on the 570GB of digital text. Sometimes it will get things wrong, but it is reluctant to say it was wrong; it prefers to say "sorry for the confusion", which is perhaps a reflection of the ambiguity of "wrong" when everything you know is stored amorphously in a distributed neural net unintelligible to anyone and everyone, just like the human brain.
The human brain, of course, also stores a great deal of sensory data, especially visual images, which require far more capacity than digital text. That partly explains the difference between the size of ChatGPT's training data and a human being's storage capacity; but of course ChatGPT is far better at remembering and recalling than we are, and probably better at summarising, too.
Nonetheless, it doesn't "peek" or "cheat": either it has the resources to answer a question, or it doesn't. That its "educated guesses" are sometimes wide of the mark should not surprise us: what's remarkable is that it can make them at all.