AI inference is the step where a trained model takes your input and produces an output. In other words, it is the moment the model actually does its job. Training is the expensive, up-front process of building the model from data; inference is everything that happens afterward, every time you send a prompt and get an answer. Each request runs the input through the model to generate a result, and because that consumes computing power every single time, inference is where the ongoing cost of running AI lives. This explainer covers how inference differs from training, why it costs money, and what determines its speed.
Training versus inference
These two phases are easy to confuse but do very different things.
| Aspect | Training | Inference |
| --- | --- | --- |
| When | Once, up front | Every request |
| Purpose | Build the model | Use the model |
| Cost pattern | Huge, one-time | Smaller, but constant |
| Model changes | Learns and updates | Stays fixed |
| Who pays | Model maker | Whoever runs queries |
The key point: during inference the model does not learn. Its internal values (the weights) are frozen. Your prompt does not update the model; it only flows through to produce an answer. Anything that looks like "memory" in a chat is the app re-sending earlier text with each new message, not the model changing.
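To make the "memory" point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `frozen_model` is a toy stand-in for a real model, and `ChatApp` stands in for a chat application. The only thing it is meant to show is that the model is a fixed function of its input, while the app keeps the transcript and re-sends it on every turn.

```python
from typing import List

# Stand-in for the model's weights. Nothing below ever modifies this value.
FROZEN_WEIGHTS = (3, 7)


def frozen_model(prompt: str) -> str:
    """Toy 'model': a pure function of the prompt and the fixed weights."""
    a, b = FROZEN_WEIGHTS
    return f"reply-{(a * len(prompt) + b) % 100}"


class ChatApp:
    """Hypothetical chat app: it, not the model, holds the conversation."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def send(self, user_text: str) -> str:
        self.history.append(f"User: {user_text}")
        # The whole transcript so far is sent as the input for this turn.
        prompt = "\n".join(self.history)
        answer = frozen_model(prompt)
        self.history.append(f"Assistant: {answer}")
        return answer


app = ChatApp()
first = app.send("hello")
second = app.send("hello")  # same message, but a longer prompt is sent
```

The second reply differs from the first only because the prompt grew to include the earlier turn. Calling `frozen_model` twice with the same prompt always gives the same result in this sketch, because nothing about the model changed.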
How inference works
- Receive input. Your text is tokenized into the chunks the model reads. See what a token is.
- Run the forward pass. The input flows through the network once to compute the next token.
- Generate token by token. Each output token is produced, appended, and fed back in.
- Return the result. The completed output is sent back to you.
Because text is generated one token at a time, longer outputs take roughly proportionally longer and cost more. This is also why replies stream in gradually rather than appearing all at once.
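The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular model is implemented: `toy_next_token` is a hypothetical lookup table standing in for a real forward pass, and `EOS` is an assumed end-of-sequence marker.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"  # assumed end-of-sequence marker for this sketch


def toy_next_token(tokens: List[str]) -> str:
    """Stand-in for one forward pass: picks the next token from the last one."""
    if not tokens:
        return EOS
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": EOS}
    return table.get(tokens[-1], EOS)


def generate(
    prompt_tokens: List[str],
    next_token: Callable[[List[str]], str] = toy_next_token,
    max_new_tokens: int = 16,
) -> Tuple[List[str], int]:
    """Generate token by token; return the new tokens and the pass count."""
    tokens = list(prompt_tokens)  # copy so the caller's list is untouched
    output: List[str] = []
    passes = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one forward pass per generated token
        passes += 1
        if tok == EOS:
            break
        tokens.append(tok)  # feed the new token back in as input
        output.append(tok)
    return output, passes


new_tokens, forward_passes = generate(["the"])
```

In this sketch the number of forward passes grows with the number of tokens produced, which is the reason longer answers take longer and cost more. A streaming interface would simply send each token to the user as soon as it is appended.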
What drives cost and speed
- Model size: larger models need more compute per token, so they are slower and pricier.
- Output length: every generated token costs time and money, so long answers add up.
- Input length: big prompts and pasted documents must be processed too, which adds to both cost and response time.
- Hardware: faster accelerators cut latency but raise the cost of the infrastructure.
If you are budgeting an AI feature, inference, not training, is usually the recurring line item, and it scales directly with usage. For how that maps to billing, see what a token is.
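As a rough budgeting aid, the arithmetic can be sketched like this. The function name, the separate input and output prices, and all the numbers in the example are assumptions for illustration; check your provider's actual price list before relying on any figure.

```python
def estimate_monthly_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Estimate recurring inference spend under per-token pricing.

    Token counts are averages per request. Prices are per one million
    tokens, in whatever currency the provider bills in.
    """
    total_requests = requests_per_day * days
    # Price of one request, scaled up by one million to keep the math simple.
    scaled_request_price = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total_requests * scaled_request_price / 1_000_000


# Hypothetical feature: 10,000 requests a day, 1,000 tokens in, 500 out,
# priced at 1.00 per million input tokens and 4.00 per million output tokens.
monthly = estimate_monthly_cost(10_000, 1_000, 500, 1.00, 4.00)
print(monthly)  # 900.0
```

Doubling `requests_per_day` doubles the estimate, which is what "scales directly with usage" means in practice. In this made-up example the output tokens account for two thirds of the bill even though there are half as many of them, because they are priced higher.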
What to skip
- Do not assume the model learns from your prompts. It does not change during inference.
- Do not ignore output length. It is often the biggest, most controllable cost lever.
- Do not pay for a giant model when a smaller one infers faster and cheaper for your task.
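The last two points can be compared side by side with a small sketch. The two model options, their prices, and their per-token speeds are invented for illustration, and the latency estimate deliberately ignores prompt processing and network overhead.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelOption:
    """Hypothetical model tier; all numbers are made up for this example."""

    name: str
    price_per_million_output: float
    seconds_per_output_token: float


def cost_and_latency(option: ModelOption, output_tokens: int) -> Tuple[float, float]:
    """Return (cost, seconds) for generating `output_tokens` with `option`."""
    cost = output_tokens * option.price_per_million_output / 1_000_000
    seconds = output_tokens * option.seconds_per_output_token
    return cost, seconds


small = ModelOption("small", price_per_million_output=0.5, seconds_per_output_token=0.01)
large = ModelOption("large", price_per_million_output=5.0, seconds_per_output_token=0.03)

small_cost, small_seconds = cost_and_latency(small, 400)
large_cost, large_seconds = cost_and_latency(large, 400)
```

With these assumed numbers the smaller option is both cheaper and faster for the same answer length, and trimming the answer length lowers cost and wait time for either option. Whether the smaller model is good enough for your task is a separate question that only testing on your own inputs can answer.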
FAQ
What is the difference between training and inference?
Training builds the model from data once and is very expensive. Inference uses the finished model to answer each request and is the recurring cost of running AI.
Does the model learn during inference?
No. The model weights are fixed at inference time. It produces answers but does not update itself from your prompts.
Why does inference cost money every time?
Each request runs compute to generate output, token by token. That hardware time is what providers bill for, usually per token.
What makes inference slow?
Mainly model size and how many tokens it must generate, plus the hardware it runs on. Bigger models and longer answers take longer.
Where to go next
Learn what a token is and how billing works, see how a large language model is built, and explore running AI locally.