AI Glossary for Normal People

Everyone is talking about models, weights, RAG, agents, and MCP. Most explanations are either too technical or too simplified to be useful. These are plain-English definitions for people who are curious but not trying to get a PhD.

01

The Foundation

Before anything else makes sense, you need to know what a model actually is and how it comes to exist.

Model

A model is the finished product of training an AI system. Under the hood it's a very large mathematical function, but you interact with it like software. When someone says "I'm using Claude" or "we're running Llama," they mean a specific trained model. The word gets used loosely to mean the whole system you talk to.

Analogy

A model is like a recipe distilled from watching a great chef cook thousands of meals. The chef is gone. What you have is a set of instructions that captures most of what they knew, ready to apply to any new dish.

Parameters / Weights

The actual numbers inside a model. There are billions of them, each a small decimal value. Together they encode everything the model "knows." When someone says a "70 billion parameter model," they mean 70 billion of these numbers. More parameters generally means more capability, but also more memory and computing power required to run it.

Analogy

Think of a model as a very complex machine with billions of dials. Training turns those dials until the machine produces good outputs. After training, the dials are locked in. The weight values are the dial positions.

Training

The process of building a model. You feed the system an enormous amount of text, it makes predictions, those predictions get compared to correct answers, and the parameters get nudged slightly in the right direction. Repeat this billions of times. Training a large model takes weeks, costs millions of dollars, and consumes a significant amount of electricity. You do it once. Everyone else uses the result.

Analogy

Training is like learning a language by reading every book ever written and being corrected every time you guess the next word wrong. Painful for a person. Feasible for a data center.

Inference

Running a trained model to get an output. When you type a message to an AI and it responds, that's inference. Training is the expensive part that happens once. Inference is cheaper and happens every time someone uses the model. Cloud AI providers (OpenAI, Anthropic, Google) handle inference on their servers so you don't have to.

Analogy

Training is writing the recipe book. Inference is cooking from it.

LLM (Large Language Model)

A specific type of AI model trained primarily on text. "Large" refers to the parameter count. LLMs like GPT-4, Claude, and Gemini can write, summarize, translate, answer questions, and generate code. They don't understand in a human sense. They're very good at predicting what a useful response looks like based on patterns in the training data.

Analogy

An LLM is like a very well-read person who has absorbed an enormous amount of text and can respond in kind. They haven't lived any of it, but they can discuss all of it with apparent fluency.

02

How It Talks

Models process text in specific ways and have real limits. These terms describe what's happening under the surface when you talk to one.

Token

The unit of text an LLM processes. Tokens aren't quite words and aren't quite characters. "cat" is one token. "running" is one token. "uncategorized" might be three. A rough rule: 100 tokens is about 75 words. Models have limits on how many tokens they can handle at once, and API pricing is almost always per token.

Analogy

Tokens are like puzzle pieces. The model works with pieces, not whole words or whole sentences. Two models might break the same sentence into different pieces.
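For the curious, here is a toy sketch of how a tokenizer breaks text into pieces. Real tokenizers (usually byte-pair encoding) learn their vocabulary from data; the hard-coded vocabulary below is purely hypothetical, chosen to reproduce the examples above.

```python
# Toy tokenizer sketch. The vocabulary is made up for illustration;
# real tokenizers learn tens of thousands of pieces from data.
VOCAB = ["un", "categor", "ized", "cat", "running", " "]

def toy_tokenize(text):
    """Greedily match the longest known piece at each position."""
    tokens = []
    while text:
        piece = next((v for v in sorted(VOCAB, key=len, reverse=True)
                      if text.startswith(v)),
                     text[0])  # unknown? fall back to a single character
        tokens.append(piece)
        text = text[len(piece):]
    return tokens

print(toy_tokenize("cat"))            # ['cat']
print(toy_tokenize("uncategorized"))  # ['un', 'categor', 'ized']
```

Note how "cat" survives as one piece while "uncategorized" splits into three. That is why token counts and word counts never quite line up.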

Context Window

The maximum amount of text a model can "see" at once. If the context window is 128,000 tokens, the model can process roughly 96,000 words at a time. Once a conversation exceeds that limit, the oldest content has to be dropped, and the model can no longer see anything that falls outside the window. Early models had very small context windows (4,096 tokens). Modern ones are dramatically larger, enabling much more complex conversations and document analysis.

Analogy

A context window is like a whiteboard. You can write a lot on it, but once it's full, you have to erase something before you can write anything new. The model can't see what was erased.
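A sketch of how a chat app might keep a conversation inside the window. The 4-characters-per-token estimate and the trimming strategy are illustrative assumptions, not any provider's actual behavior:

```python
# Sketch of keeping a conversation inside the context window.
# The token estimate is a rough rule of thumb, not a real tokenizer.
def estimate_tokens(text):
    return max(1, len(text) // 4)  # ~4 characters per token

def trim_to_window(messages, max_tokens):
    """Drop the oldest messages until the rest fit in the window."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # erase the oldest part of the whiteboard
    return kept

history = ["intro " * 100, "middle " * 100, "latest question"]
print(trim_to_window(history, max_tokens=200))
```

This is why a long chat can "forget" its beginning: nothing mystical happened, the earliest messages simply stopped being sent to the model.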

Temperature

A setting that controls how predictable or creative the model's outputs are. At temperature 0, the model always picks the most statistically likely next token, producing consistent, largely deterministic responses. Higher temperature values introduce randomness, making outputs more varied and sometimes more creative, but also less reliable.

Analogy

Temperature is a dial for imagination. Turn it down when you want factual, consistent answers. Turn it up when you want the model to surprise you.
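Under the hood, temperature rescales the model's scores before they become probabilities. Here is a minimal sketch; the three scores are made up (a real model produces one per vocabulary entry):

```python
import math

# Sketch of temperature applied to next-token scores ("logits").
# The scores are invented for illustration.
def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # three candidate tokens
low = softmax_with_temperature(logits, 0.1)   # sharp: top token dominates
high = softmax_with_temperature(logits, 2.0)  # flat: more randomness
print(low[0] > high[0])  # True: low temperature concentrates probability
```

Dividing by zero is undefined, so "temperature 0" is implemented as a special case: skip the randomness entirely and always take the top-scoring token.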

Prompt / Prompt Engineering

A prompt is the text you send to the model. Prompt engineering is the practice of crafting prompts to get better results. The framing, context, and examples you include all influence what the model produces. It's part craft, part intuition, and it genuinely matters. A vague prompt gets vague results.

Analogy

A prompt is like a job brief. A vague brief gets mediocre work. A clear, specific brief with examples and context gets better work. The model is the contractor; you're the client.

System Prompt

Hidden background instructions given to the model before the conversation starts. When a company deploys a chatbot, the system prompt defines its persona, constraints, and context. Users usually can't see it, though it shapes every response. "Always be polite. Never discuss competitors. Your name is Aria." That's a system prompt.

Analogy

A system prompt is like the briefing notes a manager leaves for a temp worker before they arrive. The worker reads them, then the customer walks in and none of that backstory is visible.
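In practice, the system prompt usually travels with every request as the first entry in a list of messages. The field names below mirror common chat APIs, but the exact schema varies by provider:

```python
# How a system prompt typically rides along with a request: as the
# first entry in the message list. Schema details vary by provider.
messages = [
    {"role": "system",
     "content": "Always be polite. Never discuss competitors. "
                "Your name is Aria."},
    {"role": "user", "content": "Hi, what's your name?"},
]

# The model sees both entries; the end user only ever typed the second.
print([m["role"] for m in messages])  # ['system', 'user']
```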

Hallucination

When a model produces a confident, plausible-sounding response that is factually wrong. The model isn't lying in any intentional sense. It's predicting what text should come next, and sometimes the most statistically plausible next text is incorrect. This is a fundamental limitation, not a bug that will be fully patched. You should verify anything important that an LLM tells you.

Analogy

A hallucination is like asking someone to finish a crossword puzzle. They can fill in squares that look right from the intersecting letters. If they don't actually know the answer, the result is confident-looking nonsense.

03

Making It Smarter

A general-purpose model doesn't know your business, your documents, or your terminology. These are the techniques for closing that gap.

Fine-tuning

Taking a pre-trained general model and continuing to train it on a smaller, specific dataset. This shifts the model's behavior toward a particular domain, style, or task without starting from scratch. Fine-tuning is useful when you need a model that consistently speaks in a specific tone or handles a niche subject. It costs much less than training from scratch but still requires real compute and good training data.

Analogy

Fine-tuning is like hiring an experienced contractor and then giving them two weeks of company-specific training. You don't teach them construction from scratch. You teach them your systems, your clients, your way of doing things.

RAG (Retrieval Augmented Generation)

Instead of baking all knowledge into the model's weights through training, RAG lets the model search for relevant information at query time. You ask a question, the system retrieves the most relevant documents from a knowledge base, includes them in the prompt, and the model answers based on those documents. The advantage over fine-tuning: your knowledge base can be updated without retraining the model.

Analogy

RAG is the difference between a doctor who memorized every medical textbook and a doctor who keeps a library and looks things up before answering. The second doctor's information can be more current, and the library can grow independently of the doctor.
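The whole RAG pipeline fits in a few lines once you squint. In the sketch below, simple word overlap stands in for the embedding-based scoring real systems use, and the documents are invented:

```python
# Minimal RAG sketch: retrieve relevant documents, then stuff them
# into the prompt. Real systems score documents with embeddings;
# word overlap stands in here so the example stays self-contained.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes one to two weeks.",
]

def retrieve(question, docs, k=1):
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?", DOCS))
```

The model never sees the whole knowledge base, only the handful of passages that scored highest for this particular question.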

Embeddings

A way of representing text as a list of numbers (a vector) that captures meaning. Similar concepts produce similar vectors. This is what makes semantic search work: searching for "dog" also surfaces results about "canine" and "puppy" because their vectors are close together in mathematical space. Embeddings are the glue that lets RAG find relevant documents efficiently.

Analogy

Imagine a map where every word is a city. Related words are close to each other. Embeddings are the GPS coordinates. Finding similar content means finding nearby cities on that map.
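"Close together in mathematical space" is usually measured with cosine similarity. The 2-D coordinates below are made up (real embeddings have hundreds or thousands of dimensions and come from a trained model), but the comparison works the same way:

```python
import math

# Toy 2-D embeddings. The coordinates are invented; real embeddings
# are produced by a trained model and are much higher-dimensional.
EMBEDDINGS = {
    "dog":     [0.90, 0.10],
    "puppy":   [0.85, 0.20],
    "invoice": [0.10, 0.95],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["puppy"]))   # high
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["invoice"])) # low
```

"dog" and "puppy" point in nearly the same direction, so their similarity is close to 1; "dog" and "invoice" point different ways, so it is much lower.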

Vector Database

A database optimized for storing and searching embeddings. A regular database searches for exact matches. A vector database searches for nearby matches in embedding space. When a RAG system needs to find the most relevant documents for a query, it converts the query to an embedding and finds the closest matches in the vector database. Speed matters here because this lookup happens at query time, not in advance.

Analogy

A regular database asks "does this row match exactly?" A vector database asks "what rows are closest to this?" Like the difference between looking up an exact address and asking for the nearest coffee shop.
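At its core, the lookup is "find the stored vectors closest to the query vector." Real vector databases use approximate nearest-neighbor indexes to stay fast at millions of rows; the brute-force scan below (with invented 3-D vectors) just shows the idea:

```python
import math

# Brute-force nearest-neighbor sketch. Real vector databases use
# approximate indexes for speed; the stored vectors here are made up.
STORE = {
    "refund policy doc": [0.9, 0.1, 0.2],
    "holiday schedule":  [0.1, 0.8, 0.3],
    "shipping guide":    [0.2, 0.1, 0.9],
}

def nearest(query_vec, store, k=1):
    def distance(name):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(query_vec, store[name])))
    return sorted(store, key=distance)[:k]

print(nearest([0.85, 0.15, 0.25], STORE))  # ['refund policy doc']
```

Notice the query never has to match any stored text exactly. It only has to land near something on the map.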

Distillation

A technique for training a smaller model to mimic the behavior of a larger one. The large model (the "teacher") generates outputs on a huge set of examples. Those outputs are used as training data for the smaller model (the "student"). The result is a compact model that performs surprisingly close to the teacher at a fraction of the size and cost. Many of the fast, cheap models available today were built this way.

Analogy

Distillation is like a senior expert spending a year mentoring a junior hire. The junior can't match every nuance, but they absorb enough to handle most situations competently and work much faster. The expertise transfers without the junior needing to repeat the senior's entire career path.

04

Agents and Automation

This is where things get genuinely new. Agents shift AI from "answer machine" to "capable participant." The terminology here is evolving fast.

Agent

An AI system that can take actions, not just generate text. An agent is typically an LLM connected to tools (web search, code execution, file access, APIs) and given a goal. It plans steps, executes them, evaluates results, and decides what to do next. The key shift from a chatbot: the model decides what to do, not just what to say. An agent can make mistakes and retry, just like a person.

Analogy

A chatbot is a very good answering machine. An agent is an employee. One tells you what to do. The other does it.

Tool Use / Function Calling

The mechanism that lets an LLM invoke external functions. Instead of only generating text, the model can signal "I need to call this function with these arguments" and the runtime executes it and returns results. This is how agents interact with the real world: they call search engines, run code, query databases, and hit APIs. The model decides when and how to call each tool.

Analogy

Function calling is like a manager who, instead of just drafting a reply, can say "pull the latest sales report" and have it actually appear on their desk. The model is the manager. The tools are the staff it can call on.
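The runtime side of this loop is surprisingly small. In the sketch below, the model's "decision" is faked with a hard-coded dictionary; in a real system, the model emits a structured request naming a tool and its arguments, and code like this executes it and hands the result back:

```python
# Sketch of the runtime half of function calling. The model's output
# is faked here; real models emit a structured tool request.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
    "run_search":  lambda query: f"3 results for '{query}'",
}

def handle_tool_call(call):
    """Execute the tool the model asked for and return its result."""
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

# Pretend the model just said: "call get_weather with city=Lisbon"
model_output = {"name": "get_weather", "arguments": {"city": "Lisbon"}}
print(handle_tool_call(model_output))  # 22C and sunny in Lisbon
```

The result then goes back into the model's context as a new message, and the model continues from there, either answering the user or calling another tool.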

MCP (Model Context Protocol)

A standard protocol, published by Anthropic and now widely adopted, for how AI agents communicate with external tools and data sources. Before MCP, every model-tool combination needed its own custom integration. MCP standardizes that interface so tools built once can work with any compatible model. It covers how tools are described, how the model calls them, and how results get returned.

Analogy

MCP is like USB. Before USB, every peripheral needed its own connector and driver. USB created one standard so anything could plug into anything. MCP is trying to do the same for AI agents and the tools they use.

Skills

In AI agent systems, a skill is a reusable, pre-packaged capability: a set of instructions, prompts, and tool access bundled together for a specific task. "Summarize a research paper" or "draft a customer reply" might each be a skill. Skills make agents more reliable by giving them a practiced pattern for familiar tasks rather than reasoning from scratch each time. The definition varies by platform, but the concept is consistent.

Analogy

A skill is like a trained reflex. A surgeon doesn't think through every step of a routine incision from first principles. They have a practiced pattern. An agent with a skill works the same way for familiar tasks.

Multi-agent

A system where multiple AI agents work together, each handling a portion of a larger task. One agent might plan, another execute, another review the output. This mirrors how human teams work and lets you handle complexity that no single agent could manage reliably alone. You can also specialize agents (a research agent, a writing agent, a fact-checking agent) and orchestrate them.

Analogy

A multi-agent system is like a small team. The project manager is an agent. The researcher is an agent. The writer is an agent. Each does what they're best at, and someone coordinates them.

Quick Reference

Terms you'll hear without much explanation attached.

Multimodal

A model that handles more than text. Multimodal models can process images, audio, and video alongside text. GPT-4o and Claude 3 are multimodal. Text-only models cannot see images.

Text-only is a phone call. Multimodal is a video call where you can share your screen.

Open Source vs. Closed

Open source models (Llama, Mistral, Gemma) release their weights so you can download and run them yourself. Closed models (GPT-4, Claude, Gemini) are only available through an API. "Open weights" is the more accurate term, since the training data and code usually aren't released and licensing varies.

Open source gives you control and privacy. Closed models tend to be more capable at the frontier.

On-device / Local Inference

Running a model on your own hardware instead of sending data to a cloud API. You get privacy, no per-query cost, and no network dependency. The tradeoff: your hardware can only run smaller models, and smaller models are less capable.

On-device is a calculator in your pocket. Cloud inference is calling a math professor. The professor is smarter; the calculator is always available.

Transformer

The neural network architecture that powers most modern LLMs. The "T" in GPT. It comes from a 2017 paper called "Attention Is All You Need." You don't need to know how it works internally. Just know it's the reason all these systems got dramatically better so quickly.

The key innovation was an "attention mechanism" that lets the model weigh the importance of every word relative to every other word in the context.

A note on pace

The AI field moves fast enough that some of this will be outdated within months. New terms get coined constantly, and some of them stick. This page covers the ones that have earned staying power as of early 2026. When you see a new term, ask two questions: what does it replace, and who benefits from you calling it by this name? That usually clarifies things quickly.