Show HN: How LLMs Work – Interactive visual guide based on Karpathy's lecture

A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.

Representative figures from frontier models circa 2024 — exact numbers shift with every release. The scale is the point, not the precision.

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.

The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly 10 consumer hard drives' worth of text — representing ~15 trillion tokens.
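Those two figures are consistent with each other — dividing dataset bytes by token count gives roughly the 3–4 bytes per token you'd expect from BPE-tokenized English text:

```python
# Back-of-envelope check on the figures above (both approximate)
dataset_bytes = 44e12          # ~44 TB after filtering
num_tokens = 15e12             # ~15 trillion tokens
bytes_per_token = dataset_bytes / num_tokens
print(round(bytes_per_token, 1))   # a few bytes of raw text per token
```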

Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.

GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.
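The merge loop can be sketched in a few lines of Python. This is a toy character-level version — real BPE operates on bytes and runs hundreds of thousands of merges — but the mechanism (count adjacent pairs, merge the most frequent, repeat) is the same:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "low lower lowest"
tokens = list(text)                      # start from individual characters
for _ in range(3):                       # three merge rounds
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, pair[0] + pair[1])
print(tokens)
```

After three merges, "low" has become a single token — the sequence gets shorter as the vocabulary grows, which is exactly the compression trade-off BPE makes.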

The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.

Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.

The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.


Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.

This process is stochastic — the same prompt generates different outputs every time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.
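The loop is short enough to write out. Here a hand-made lookup table stands in for the network (a real model produces a fresh 100K-way distribution at every step), but the sample → append → repeat structure is the real thing:

```python
import random

# Toy next-token distributions — a stand-in for the network's output.
NEXT = {
    "the": [("cat", 0.5), ("dog", 0.3), ("<end>", 0.2)],
    "cat": [("sat", 0.6), ("ran", 0.2), ("<end>", 0.2)],
    "dog": [("ran", 0.7), ("<end>", 0.3)],
    "sat": [("<end>", 1.0)],
    "ran": [("<end>", 1.0)],
}

def generate(token, rng):
    """Autoregressive loop: sample a next token, append, repeat until <end>."""
    out = [token]
    while token != "<end>":
        candidates, weights = zip(*NEXT[token])
        token = rng.choices(candidates, weights=weights)[0]  # the biased coin flip
        out.append(token)
    return out

rng = random.Random(42)
print(generate("the", rng))
print(generate("the", rng))  # same prompt, possibly a different continuation
```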

Temperature controls randomness by scaling the logits before the softmax. Low temperature (0.1) sharpens the distribution until the model almost always picks the top token. High temperature (2.0) flattens it toward uniform noise. 0.7–1.0 is the sweet spot for coherent-but-creative text.
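Dividing the logits by the temperature before the softmax is all there is to it:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [x / T for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.1, 0.7, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T=0.1 nearly all the mass lands on the top token; at T=2.0 the three candidates are close to even odds.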

Watch the model choose the next word. Each bar shows the probability of a candidate token.

After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.

Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.

The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.

The base model is a token simulator. To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns to hold conversations.

Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.

Modern SFT datasets (like UltraChat) have millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.

Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:
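As an illustration, here is a ChatML-style flattening — the exact special tokens differ from model to model, so treat the delimiters below as placeholders rather than any particular model's format:

```python
# ChatML-style delimiters (illustrative; real models each define their own)
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def encode_conversation(turns):
    """Flatten (role, text) turns into one string the tokenizer then splits."""
    parts = []
    for role, text in turns:
        parts.append(f"{IM_START}{role}\n{text}{IM_END}\n")
    parts.append(f"{IM_START}assistant\n")   # cue the model to speak next
    return "".join(parts)

print(encode_conversation([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 = 4."),
    ("user", "And 2+3?"),
]))
```

Leaving the sequence open after the final `assistant` marker is what turns "continue this token sequence" into "answer the user": the model's autocomplete instinct now completes the assistant's turn.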

Then RLHF refines the assistant's behavior further:

Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.
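The middle step — teaching the reward model to agree with human rankings — is commonly trained with a pairwise (Bradley–Terry-style) objective; a minimal sketch of that loss, assuming scalar reward scores:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model scores the human-preferred response higher,
    large when it gets the ranking backwards."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Reward model agrees with the human ranking: small loss.
print(round(preference_loss(2.0, -1.0), 3))
# Reward model disagrees: large loss.
print(round(preference_loss(-1.0, 2.0), 3))
```

Minimizing this loss over many ranked pairs gives a model that scores responses the way human raters would — the signal the reinforcement learning stage then optimizes against.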

Understanding why LLMs behave the way they do requires thinking about their psychology — the emergent properties of being trained to statistically imitate human text.

LLMs have a knowledge cutoff and a finite context window. RAG solves this by embedding your documents into a vector store, retrieving the most semantically relevant chunks at query time, and injecting them into the context — shifting the model's prediction distribution toward grounded, up-to-date facts rather than memorized training data.

Every document is converted to a dense vector (~1,536 numbers) by an embedding model. Semantically similar texts land near each other in this high-dimensional space — no keyword matching needed.

The user's question is embedded the same way. Cosine similarity finds the nearest document vectors — the chunks most semantically related to the query — typically the top 2–5.
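The ranking step is plain vector math. Using tiny 3-d vectors as stand-ins for real ~1,536-d embeddings (the document names and query here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=2):
    """Rank document chunks by cosine similarity to the query embedding."""
    scored = [(cosine(query_vec, v), name) for name, v in doc_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

docs = {
    "refund_policy":  [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "press_release":  [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]   # embedding of "how do I get my money back?"
print(top_k(query, docs))
```

The winning chunks are what get prepended to the prompt in the next step.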

Retrieved chunks are prepended to the prompt before the LLM sees the question. The model generates from injected facts rather than relying on memorized training data — dramatically reducing hallucination on knowledge-intensive tasks.

The complete journey from raw web crawl to the ChatGPT you interact with — across two major stages, months of compute, and billions of parameters.

Built from Andrej Karpathy's "Intro to Large Language Models" lecture — all facts, figures, and framings traced back to that source. Interactive visualizations built with AI assistance. The most important takeaway: every word generated is a probabilistic sample — a biased coin flip, at 100K-way scale, billions of times.

Full lecture transcript · HN update note