LLM Token Counter
Count tokens and estimate API costs for GPT-4o, Claude, Llama, Gemini — your text never leaves your browser
Cost Estimation — GPT-4o
Prices are approximate and may not reflect current provider rates. Always check the provider's pricing page before budgeting.
Approximation note: This tool uses a word-based estimate (tokens ≈ words × 1.3) which is reasonably accurate for typical English text. Actual token counts vary by model — GPT-4 uses BPE tokenization, Claude uses a similar scheme, and Llama uses SentencePiece (through Llama 2; Llama 3 moved to BPE). Code, non-English text, and special characters can tokenize very differently. For production use, call the model's tokenizer directly (e.g., OpenAI's tiktoken library).
You’re building an LLM-powered application and the API call cost is higher than expected. Or you’re crafting a complex system prompt and need to know whether it fits inside GPT-4o’s 128K context window without hitting limits. Tokens are the unit of measure for everything in LLM APIs — what you pay for, what you’re limited by, and what determines how much context a model can “see” at once.
Why Token Counting Matters
Every LLM API charges by the token. Every model has a maximum context window measured in tokens. Understanding your token usage isn’t optional — it’s the difference between a product that scales economically and one that burns your API budget on day two.
Cost control: A GPT-4o API call with a 10,000-token prompt costs $0.025 in input tokens. If that prompt is bloated with redundant instructions, you’re paying for noise. Trim it to 3,000 focused tokens and you cut input costs by 70% with the same or better results.
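The arithmetic above is simple enough to sketch in a few lines (the $2.50-per-million-token input rate is an assumption taken from the example, not a live price — always check the provider's pricing page):

```python
# Rough input-cost estimate. The rate below is illustrative, matching the
# $0.025-per-10K-token GPT-4o figure used in the example above.
PRICE_PER_MILLION_INPUT = 2.50  # USD per 1M input tokens (assumed)

def input_cost(prompt_tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Estimated input cost in USD for a prompt of the given token count."""
    return prompt_tokens / 1_000_000 * price_per_million

bloated = input_cost(10_000)     # 0.025 USD, as in the example above
trimmed = input_cost(3_000)
savings = 1 - trimmed / bloated  # ~0.70, the 70% cut
```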
Context window management: Models can only “remember” what fits in their context window. The original GPT-4 caps at 8,192 tokens; GPT-4o extends to 128K; Gemini 1.5 Pro reaches 2 million tokens. If your conversation history plus system prompt plus user message exceeds the limit, the API returns an error — or silently truncates earlier context. Knowing your token budget prevents these silent failures.
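A pre-flight check like the following can catch overruns before the API does (window sizes are the ones quoted above; in practice the token counts would come from a real tokenizer):

```python
# Context-window sizes quoted in the text above.
CONTEXT_WINDOWS = {"gpt-4": 8_192, "gpt-4o": 128_000, "gemini-1.5-pro": 2_000_000}

def fits_context(model: str, prompt_tokens: int, max_response_tokens: int) -> bool:
    """True if the prompt plus reserved response room fits the model's window."""
    return prompt_tokens + max_response_tokens <= CONTEXT_WINDOWS[model]

fits_context("gpt-4", 7_000, 2_000)   # False: 9,000 > 8,192
fits_context("gpt-4o", 7_000, 2_000)  # True
```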
Prompt engineering efficiency: Good system prompts are dense with information, not length. Counting tokens forces you to identify and cut redundancy. A well-engineered 500-token system prompt routinely outperforms a sloppy 2,000-token one.
How Tokenization Works
LLMs don’t process text character by character or word by word — they process tokens, which are chunks of text that range from a single character up to a full word, with subword pieces in between.
Byte Pair Encoding (BPE)
OpenAI’s models (GPT-3.5, GPT-4, GPT-4o) use Byte Pair Encoding, implemented in the tiktoken library. BPE starts with individual bytes and iteratively merges the most frequently co-occurring pairs into a single token. The result is a vocabulary of ~100,000 tokens where common English words are single tokens, rare words are split into subwords, and arbitrary byte sequences can always be encoded.
Examples under BPE:
"hello"→ 1 token"tokenization"→ 3 tokens (token,ization, possibly split differently)"supercalifragilistic"→ many tokens (uncommon word, split into subword pieces)- Code like
"for i in range(10):"→ ~7 tokens (each symbol may be a separate token)
SentencePiece
Meta’s Llama 1 and Llama 2 models use SentencePiece, a language-agnostic tokenizer that works directly on the raw text stream. It’s conceptually similar to BPE but can produce different tokenizations for the same text; Llama 3 switched to a BPE tokenizer with a 128K-token vocabulary. Either way, vocabularies differ across model families — which is why the same prompt might use 2,100 tokens on GPT-4 and 1,900 tokens on Llama 3.1.
Anthropic’s Tokenizer
Claude models use a custom tokenizer similar in design to BPE. Anthropic doesn’t publish the exact vocabulary, but empirically Claude’s token counts are close to GPT-4’s for English text.
Token Counts Across Models
The same input text produces different token counts across models because each model uses a different vocabulary:
| Text | GPT-4 | Llama 3.1 | Approximate ratio |
|---|---|---|---|
| English prose | baseline | ~5% lower | ~0.95× |
| Python code | baseline | ~10% higher | ~1.10× |
| JSON data | baseline | ~15% higher | ~1.15× |
| Chinese text | baseline | ~30% lower | ~0.70× |
This tool uses a word-based approximation (tokens ≈ words × 1.3) which is accurate to within ~15% for typical English text. For production budgeting, use the exact tokenizer: tiktoken for OpenAI models, transformers tokenizer for Llama, or Anthropic’s token counting API.
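The heuristic itself is only a few lines (this mirrors the tokens ≈ words × 1.3 rule stated above; the function name is ours):

```python
import math

def estimate_tokens(text: str) -> int:
    """Word-based estimate: tokens ≈ words × 1.3.

    Within ~15% for typical English prose; can be off by 20–40% for
    code, JSON, or non-English text, per the note above.
    """
    return math.ceil(len(text.split()) * 1.3)

estimate_tokens("The quick brown fox jumps over the lazy dog")  # 9 words → 12
```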
Cost Optimization Strategies
Remove redundant instructions: Go through your system prompt and ask “does removing this sentence change the model’s behavior in production?” If no, cut it.
Use structured formats carefully: JSON output instructions are token-expensive if verbose. Instead of "Always respond with a JSON object containing keys 'answer' and 'confidence' and 'reasoning'", use a concise schema example.
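To make the comparison concrete, here are two hypothetical phrasings of the same output contract — the compact schema says the same thing in fewer words, and therefore fewer tokens under any tokenizer:

```python
# Verbose: every key described in prose (token-expensive).
verbose = ("Always respond with a JSON object containing keys "
           "'answer' and 'confidence' and 'reasoning'.")

# Concise: show the schema once instead of describing it.
concise = 'Reply as JSON: {"answer": str, "confidence": float, "reasoning": str}'

print(len(verbose.split()), "words vs", len(concise.split()), "words")
```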
Choose the right model for the task: GPT-4o-mini handles classification, extraction, and summarization tasks almost as well as GPT-4o at 6% of the cost. Reserve expensive models for complex reasoning.
Cache system prompts: Anthropic offers prompt caching — if your system prompt is large and repeated across calls, cached tokens cost 90% less than uncached.
Compress conversation history: For multi-turn conversations, summarize older turns rather than passing the full history. A 100-token summary of 10 turns is much cheaper than 10 turns of 80 tokens each.
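A minimal sketch of the compression step, with the summarization call stubbed out (in practice summarize would be a cheap-model API call; the stub here is purely illustrative):

```python
def summarize(turns: list[str]) -> str:
    """Stub for a cheap-model summarization call (hypothetical)."""
    return "Earlier conversation, summarized: " + " / ".join(t[:20] for t in turns)

def compress_history(turns: list[str], keep_recent: int = 2) -> list[str]:
    """Replace all but the most recent turns with one summary turn."""
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}: ..." for i in range(10)]
compressed = compress_history(history)  # 3 entries: 1 summary + 2 recent turns
```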
Batch similar requests: When processing documents in bulk, batch API calls rather than making one call per document. This reduces per-call overhead and enables better scheduling.
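Batching is just chunking the work into fewer calls; a generic sketch (the batch size is a placeholder, and real batch endpoints have their own request formats):

```python
from typing import Iterator

def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size chunks of items, one chunk per API call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = [f"doc-{n}" for n in range(10)]
calls = list(batches(docs, size=4))  # 3 calls instead of 10
```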
Frequently Asked Questions
What is a token exactly? A token is the fundamental unit of text that an LLM processes. Tokens can be words, parts of words, punctuation, or even single characters. In English, a token is roughly 4 characters or 0.75 words on average. The rule of thumb “1 token ≈ 1 word” is convenient but imprecise — code and non-English text can tokenize very differently.
Why does the same text have different token counts in different models? Each model uses its own tokenizer vocabulary. GPT-4 uses BPE with ~100K vocabulary tokens; Llama 2 uses SentencePiece, and Llama 3 uses its own BPE vocabulary of ~128K tokens. A rare word that’s a single token in GPT-4’s vocabulary might be split into 3 subword pieces in another model’s tokenizer. English text usually differs by less than 15%, but code and non-Latin scripts can differ by 30% or more.
How do context windows work and what happens when I exceed them? The context window is the maximum number of tokens an LLM can process in a single call — it includes the system prompt, all conversation history, the current user message, and the model’s response. If you exceed it, the API returns an error (or, in some implementations, silently truncates older context). Choose a model with a larger context window for tasks involving long documents or extended conversations.
Is this token counter accurate enough for production cost budgeting?
For rough estimates — yes. For precise cost tracking in a production system — no. The word-based approximation (words × 1.3) works well for typical English prose but will be off by 20–40% for code, JSON, mixed-language text, or heavily punctuated content. For production cost estimation, integrate the actual tokenizer: tiktoken for OpenAI (pip install tiktoken), or call Anthropic’s messages.count_tokens API endpoint directly.