How LLMs Actually Work
Why you need a mental model
You don't need to understand neurons and linear algebra. But having a basic idea of how LLMs work will help you write significantly better prompts. It's like driving — you don't need to be a mechanic, but it helps to know that fuel goes in the tank and not in the windshield washer reservoir.
Most prompting mistakes stem from misconceptions about what AI is and how it works. People expect AI to 'understand' their intent, 'remember' previous conversations, or 'know' current information. Once you understand what's actually happening, you stop making these mistakes and your prompts improve dramatically.
LLM = the world's most advanced autocomplete
Large Language Models work on one fundamental principle: they predict the next word. More precisely, the next token. All the 'magic' is sophisticated prediction: based on what you wrote, the model estimates what should come next. It doesn't plan ahead, it has no hidden agenda — it generates text token by token.
This doesn't mean the outputs are simple. Modern models have hundreds of billions of parameters and were trained on trillions of tokens of text. The result is a system that produces surprisingly coherent, useful, and often brilliant responses — but still based on prediction, not understanding.
Practical implication: AI doesn't generate the entire response at once and then display it. It literally builds the sentence token by token. That's why it sometimes 'gets tangled' — earlier tokens steer it in the wrong direction and it can't go back. That's also why chain-of-thought (lesson 3) helps — it forces the model to generate intermediate steps that steer prediction in the right direction.
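The decode loop above can be sketched in a few lines. This is a toy illustration, not a real LLM: the probability table below is entirely made up, standing in for the billions of learned parameters a real model uses. What's accurate is the shape of the loop — look at what came before, sample the next token, append, repeat.

```python
import random

# Hypothetical "bigram model": for each token, a made-up distribution
# over possible next tokens. A real LLM conditions on the ENTIRE
# preceding context, not just the last token.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"cat": 0.5, "dog": 0.5},
    "A": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Generate text one token at a time, like an LLM's decode loop:
    each choice depends on the tokens already emitted, and once a
    token is appended there is no going back."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        choices, weights = zip(*dist.items())
        nxt = rng.choices(choices, weights=weights)[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate())
```

Notice that a bad early pick propagates: if the loop samples "dog", everything after is conditioned on "dog". That is exactly the "can't go back" behavior described above.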
Tokens: how AI reads your text
A token isn't a word. It's a chunk of text — it might be a whole word, part of a word, or even just a character. English 'hello' is one token. The word 'unbelievable' might be two or three tokens. Numbers get tokenized digit by digit. Why does this matter? Because AI limits are measured in tokens, not words — and some languages are 'more expensive' than others in terms of tokens per word.
Modern models have a context window — the total number of tokens they can work with at once. This includes your prompt AND the AI's response. GPT-4o has 128K tokens, Claude Sonnet has 200K tokens, with the latest models handling a million or more. When you exceed the limit, the model 'forgets' the beginning of the conversation — information at the start gets lost.
Practical estimate: 1 token is roughly 0.75 English words. So 200K tokens is about 150K English words. For perspective: the entire Harry Potter and the Philosopher's Stone book has about 77K English words — it fits in the context window.
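The 0.75-words-per-token rule of thumb above is easy to turn into a back-of-the-envelope check. This is only a heuristic sketch — real tokenizers give exact counts, and the 200K window and 4K reply budget below are illustrative assumptions, not fixed limits of any particular model.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~0.75 English words per token
    rule of thumb. A real tokenizer gives exact counts; this is only
    a back-of-the-envelope approximation."""
    words = len(text.split())
    return round(words / 0.75)  # ~1.33 tokens per word

def fits_in_context(text: str, context_window: int = 200_000,
                    reply_budget: int = 4_000) -> bool:
    """Check whether a prompt plus a reserved reply budget fits.
    Remember: the window covers prompt AND response combined, so
    always leave room for the answer."""
    return estimate_tokens(text) + reply_budget <= context_window

sample = "the quick brown fox " * 10  # 40 words
print(estimate_tokens(sample))
```

Reserving a reply budget is the important design choice here: a prompt that exactly fills the window leaves the model no room to answer.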
Temperature and top-p: why AI answers differently each time
Ever noticed that the same prompt gives a slightly different answer each time? That's not a bug — it's by design. The 'temperature' parameter controls randomness. Low temperature (0-0.3) = more consistent, predictable responses — ideal for factual tasks, code, analysis. High temperature (0.7-1.0) = more creative, diverse responses — ideal for brainstorming, creative writing, idea generation.
Most chatbots don't let you change temperature directly. But you can influence it through your prompt. 'Answer precisely and factually, stick only to verifiable information' pushes the model toward lower temperature behavior. 'Be creative, surprise me, try unconventional approaches' does the opposite.
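Mechanically, temperature is a divisor applied to the model's raw scores (logits) before they are turned into probabilities. The sketch below shows the standard temperature-scaled softmax sampling step; the logits for the candidate tokens are made-up toy values.

```python
import math
import random

def sample_with_temperature(logits, temperature, seed=None):
    """Divide logits by the temperature, softmax into probabilities,
    then sample. Low temperature sharpens the distribution (the top
    token almost always wins); high temperature flattens it (weaker
    candidates get a real chance)."""
    scaled = [score / temperature for score in logits.values()]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(list(logits.keys()), weights=probs)[0]

# Hypothetical logits for the token after "The capital of France is":
logits = {"Paris": 5.0, "Lyon": 2.0, "banana": 0.1}

cold = [sample_with_temperature(logits, 0.2, seed=i) for i in range(20)]
hot = [sample_with_temperature(logits, 2.0, seed=i) for i in range(20)]
print(cold.count("Paris"), hot.count("Paris"))
```

At temperature 0.2 the gap between "Paris" and the rest is amplified, so the output is essentially deterministic; at 2.0 the gap shrinks and you start seeing "Lyon" (and occasionally "banana"). That is the whole creativity-vs-consistency trade-off in one parameter.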
The attention mechanism: not all tokens are created equal
LLMs use a mechanism called 'attention.' When generating each token, the model 'weighs' all previous tokens — but not equally. It pays more attention to some than others. Practically, this means: instructions at the beginning and end of your prompt typically have more influence than those in the middle. This is known as the 'lost in the middle' effect.
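The 'weighing' of previous tokens can be made concrete with a minimal sketch of scaled dot-product attention, the core operation inside transformer LLMs. The 2-dimensional query and key vectors below are hypothetical toy values — real models use vectors with thousands of dimensions and many attention heads.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention over toy vectors: score each
    previous token's key against the current query, then softmax.
    A higher weight means the model 'pays more attention' to that
    token when generating the next one."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical setup: the current query matches token 0's key far
# better than tokens 1 and 2, so token 0 dominates the weights.
q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
weights = attention_weights(q, keys)
print([round(w, 3) for w in weights])
```

The weights always sum to 1, so attention is a fixed budget spread across the context — which is one intuition for why material buried in the middle of a long prompt can end up with less of that budget.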
Common myth: 'AI understands what I want.' It doesn't. AI predicts the statistically most likely continuation of your text. The more precisely you formulate, the better prediction you get — not because AI 'understood' you, but because you narrowed the space of possible responses.
What this means for prompting
Knowing how the model works directly affects your prompting strategy. When you know the model predicts the next token, you understand why clear structure helps — it steers prediction in the right direction. When you know the model has no memory between conversations, you stop expecting it to 'remember' last week's chat. When you know about the 'lost in the middle' effect, you put key instructions at the beginning and end of your prompt.
Three rules that follow from this
First: be explicit, not implicit. The model won't guess what's in your head — it works only with what you write. Second: order matters. The model processes text left to right, token by token. Put important instructions first. Third: when the response isn't good, the problem is almost always in the prompt, not the model. Reformulate instead of getting frustrated.
Give AI two versions of the same request: Version A: 'Write me something about productivity.' Version B: 'I'm a manager of a small remote team (5 people). Write me 5 specific tips to improve team productivity. Format each tip as: tip name in bold, 2 sentences of explanation, 1 concrete implementation example.' Compare the results. Notice how the structure of your prompt directly influences the structure of the response — because the model generates tokens to match the pattern you gave it.
Hint
Version B should produce a significantly more structured and usable response. The key is that you gave the model an output format — this works as a 'template' that the model follows token by token.
Give AI the same task but with a key instruction in different positions: Version A: 'Write 5 energy-saving tips for homes. Respond as an energy consultant. Each tip max 2 sentences. Format: numbered list.' Version B: 'Each tip max 2 sentences. Write 5 energy-saving tips for homes. Format: numbered list. Respond as an energy consultant.' Compare: did AI respect the 2-sentence limit in both versions? Which version better follows all instructions?
Hint
Instructions at the beginning and end of a prompt typically have more influence. If response length is your top priority, put that constraint first.
Start a conversation with AI and gradually extend it. In the first message, tell AI a specific fact: 'My favorite number is 42 and my name is Albatross.' Then have a conversation on any topic (10+ messages). At the end, ask: 'What is my favorite number? What is my name?' Can AI remember? Try the same with a longer conversation. At what point does AI start forgetting?
Hint
This exercise shows how the context window works in practice. For short conversations, AI remembers everything. For long ones (hundreds of messages), it may start forgetting information from the beginning — especially with smaller models.
- LLMs predict the next token — they don't plan or understand, they predict
- Tokens aren't words — some languages cost roughly twice as many tokens per word as English
- The context window is limited — it includes both prompt and response combined
- Temperature affects creativity vs. consistency — you can influence it through prompt wording
- Order matters: put key instructions at the beginning and end of your prompt
- Explicit, structured prompt = better prediction = better response