Matthias Kainer explains how the T in GPT works. This is possibly the simplest explanation of transformers that I have come across.
All the “knowledge” a Transformer has is encoded in its weight matrices – those millions (or billions) of numbers that were set during training. When you ask it a question, it does not go look something up. It reconstructs an answer from compressed statistical patterns. Think of it like this: if you memorized every cookbook in the world but then someone asked you “what temperature for roasting a chicken?” – you would reconstruct an answer from all those overlapping memories. Most of the time you would be right. But sometimes your memories would blend together and you would confidently say “180C for 3 hours” when the actual answer depends on the size of the chicken. You would have no way to check, because the cookbooks are gone – only your compressed memory remains.
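Here is a minimal sketch of what that "reconstruction" looks like mechanically. The vocabulary, weights, and dimensions are made up for illustration, not taken from any real model: the point is that the answer is just a probability distribution computed from fixed weight matrices, with no lookup step anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": everything it knows is baked into these fixed weight matrices.
vocab = ["roast", "chicken", "at", "180C", "200C", "for", "1", "3", "hours"]
d_model = 8
W_embed = rng.normal(size=(len(vocab), d_model))   # token -> vector
W_out = rng.normal(size=(d_model, len(vocab)))     # vector -> score for every token

def next_token_distribution(token_id: int) -> np.ndarray:
    """Predict the next token purely from the weights - nothing is looked up."""
    hidden = W_embed[token_id]         # the compressed "memory" of training data
    logits = hidden @ W_out            # a score for each word in the vocabulary
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()             # softmax: probabilities, not facts

probs = next_token_distribution(vocab.index("at"))
for word, p in sorted(zip(vocab, probs), key=lambda x: -x[1])[:3]:
    # "180C" and "200C" both get some probability; neither is checked against a source.
    print(f"{word}: {p:.2f}")
```

Whatever comes out is whichever continuation scores highest under those weights, which is exactly why a blended "memory" can surface as a confident but wrong answer.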
That is why the search + embeddings approach from earlier matters so much. RAG (Retrieval-Augmented Generation) is essentially saying: “Do not trust your memory alone. Before answering, go find the actual document, read it, and base your answer on that”. It does not completely solve hallucinations, but it dramatically reduces them for factual questions.
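A rough sketch of that retrieve-then-prompt loop, under stated assumptions: the `embed()` function here is a toy stand-in (a vector derived from the string's hash) rather than a real embedding model, and the document texts are invented for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: a vector derived from the string's hash."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical source documents, embedded ahead of time.
documents = [
    "Roast a 1.5kg chicken at 200C for about 75 minutes.",
    "Slow-cook pulled pork at 120C for 6 hours.",
    "Bake bread at 230C for 35 minutes.",
]
doc_vectors = [embed(d) for d in documents]

question = "What temperature for roasting a chicken?"
q_vec = embed(question)

# Retrieval: find the most relevant document instead of trusting the model's memory.
best_doc = max(zip(documents, doc_vectors), key=lambda dv: cosine(q_vec, dv[1]))[0]

# Augmentation: paste the retrieved text into the prompt, so the generation step
# is grounded in the document rather than in the weights alone.
prompt = f"Answer using only this source:\n{best_doc}\n\nQuestion: {question}"
print(prompt)
```

The model still predicts text the same way; the difference is that the relevant facts are now sitting in the prompt, so the highest-probability continuation tends to follow the source instead of a blurred memory of it.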
The bottom line: Transformers are prediction machines, not truth machines. They predict what text should come next based on patterns. When the pattern aligns with truth, they are brilliant. When it does not, they are confidently wrong. This is not something that can be “fixed” without fundamentally changing the architecture – it is a feature of how next-word prediction works. Always verify important claims. The AI does not know what it does not know.