PureDevTools

LLM API Cost Calculator

Compare AI model pricing across 17+ models — enter your usage to see real cost differences

All processing happens in your browser. No data is sent to any server.

Your Usage Pattern

Input tokens per request (~750 words ≈ 1,000 tokens)

Output tokens per request (short reply ≈ 100–300 tokens)

Requests per month (3,000 req/month)

Cheapest option: Gemini 1.5 Flash, $0.675/mo

Most expensive: Claude Opus 4.6, $157.50/mo

Cost range: $0.675–$157.50/mo

| Model | Provider | Input $/1M | Output $/1M | Cost/Request | Daily | Monthly |
|---|---|---|---|---|---|---|
| Gemini 1.5 Flash (cheapest) | Google | $0.075 | $0.300 | $0.000225 | $0.022 | $0.675 |
| Llama 3.1 8B | Together AI | $0.180 | $0.180 | $0.000270 | $0.027 | $0.810 |
| Gemini 2.0 Flash | Google | $0.100 | $0.400 | $0.000300 | $0.030 | $0.900 |
| GPT-4o mini | OpenAI | $0.150 | $0.600 | $0.000450 | $0.045 | $1.35 |
| DeepSeek V3 | DeepSeek | $0.270 | $1.10 | $0.000820 | $0.082 | $2.46 |
| Llama 3.1 70B | Together AI | $0.880 | $0.880 | $0.0013 | $0.132 | $3.96 |
| Qwen 2.5 72B | Alibaba | $0.900 | $0.900 | $0.0014 | $0.135 | $4.05 |
| Claude Haiku 4.5 | Anthropic | $0.800 | $4.00 | $0.0028 | $0.280 | $8.40 |
| o3-mini | OpenAI | $1.10 | $4.40 | $0.0033 | $0.330 | $9.90 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | $0.0037 | $0.375 | $11.25 |
| Mistral Large | Mistral | $2.00 | $6.00 | $0.0050 | $0.500 | $15.00 |
| GPT-4o | OpenAI | $2.50 | $10.00 | $0.0075 | $0.750 | $22.50 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | $0.010 | $1.05 | $31.50 |
| Llama 3.1 405B | Together AI | $5.00 | $15.00 | $0.013 | $1.25 | $37.50 |
| GPT-4 Turbo | OpenAI | $10.00 | $30.00 | $0.025 | $2.50 | $75.00 |
| o1 | OpenAI | $15.00 | $60.00 | $0.045 | $4.50 | $135.00 |
| Claude Opus 4.6 | Anthropic | $15.00 | $75.00 | $0.052 | $5.25 | $157.50 |

Prices are approximate and may vary. Verify current pricing on each provider's pricing page before making decisions. Does not include context caching discounts, batch API discounts, or volume tiers.

You’re building an AI-powered product and the model bill is coming in higher than expected. Or you’re evaluating which LLM to use for a new feature and want to know the real cost difference at scale before you commit. This calculator lets you plug in your actual usage pattern — tokens per request, requests per day — and instantly see every major model ranked from cheapest to most expensive. All computation runs in your browser; nothing is sent to any server.
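The arithmetic behind the table is simple enough to verify by hand. A minimal sketch of the per-request formula, using GPT-4o mini's table prices and the default usage pattern (1,000 input tokens, 500 output tokens, 3,000 requests/month) as the worked example:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a single request in dollars, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# GPT-4o mini: $0.150/M input, $0.600/M output (from the table above)
per_request = request_cost(1_000, 500, 0.150, 0.600)
monthly = per_request * 3_000  # 3,000 requests/month
print(f"${per_request:.6f}/request, ${monthly:.2f}/month")
# -> $0.000450/request, $1.35/month
```

The same formula reproduces every row of the table: plug in any model's two prices and your own token counts.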

The LLM Pricing Landscape in 2026

AI model pricing has compressed dramatically over the past two years. In late 2023, GPT-4 cost $60/million output tokens. Today you can get comparable reasoning quality for under $5/million tokens, and capable smaller models for under $1/million. The market has stratified into three clear tiers:

Frontier models (GPT-4o, Claude Sonnet 4.6, Gemini 1.5 Pro) — $1–15/M input, $5–75/M output. These deliver the highest quality on complex reasoning, nuanced writing, and tasks requiring deep context. Use these when quality is the constraint, not cost.

Balanced models (GPT-4o mini, Claude Haiku 4.5, Gemini Flash, o3-mini) — $0.10–1/M input, $0.40–5/M output. These are the workhorses for production applications. 80–90% of the quality at 10–20% of the cost. The right choice for most real-world use cases.

Budget/open models (Llama 3.1 8B via Together AI, Gemini 1.5 Flash, DeepSeek V3) — $0.07–0.27/M input. Suitable for classification, extraction, structured output tasks, and high-volume pipelines where you can accept lower quality on edge cases.

How to Estimate Token Usage for Your Use Case

A token is roughly 3/4 of a word in English. A useful rule of thumb: 1,000 tokens ≈ 750 words ≈ 3/4 of a page. Here are common use cases and their typical token profiles:

| Use Case | Typical Input Tokens | Typical Output Tokens |
|---|---|---|
| Customer support chatbot (single turn) | 200–500 | 100–300 |
| Document summarization (1–2 pages) | 1,500–3,000 | 200–500 |
| Code generation (function-level) | 500–1,500 | 300–800 |
| RAG application (3–5 retrieved chunks) | 2,000–5,000 | 200–600 |
| Long-context analysis (10+ pages) | 8,000–20,000 | 500–2,000 |
| Batch classification / labeling | 100–500 | 10–50 |
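The 750-words ≈ 1,000-tokens rule of thumb can be turned into a quick estimator. This is a rough heuristic only; actual counts depend on the model's tokenizer, and code or non-English text tokenizes less efficiently:

```python
def estimate_tokens(text: str) -> int:
    """Rough English token estimate: one token is about 0.75 words,
    so tokens ~= words / 0.75."""
    words = len(text.split())
    return round(words / 0.75)

prompt = "Summarize the quarterly report in three bullet points."
print(estimate_tokens(prompt))  # 8 words -> 11 tokens (estimated)
```

For real measurements, use the tokenizer the provider publishes for the model rather than a word-count heuristic.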

The output token estimate matters more than you might think: output tokens typically cost 3–5× more than input tokens at most providers. A request with 1,000 input tokens and 500 output tokens is often priced the same as a request with 2,500–3,500 input tokens and no output at all.
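To see where that range comes from, convert output tokens into "input-equivalent" tokens at an assumed price ratio (4:1 here, picked from the middle of the 3–5× range):

```python
input_tokens, output_tokens = 1_000, 500
ratio = 4  # output tokens priced at 4x input tokens (illustrative assumption)

# Express the whole request in input-token terms
equivalent_input = input_tokens + ratio * output_tokens
print(equivalent_input)  # -> 3000 input-equivalent tokens
```

At a 3:1 ratio the same request is 2,500 input-equivalent tokens; at 5:1 it is 3,500, matching the range above.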

Tip: Add logging to your application to measure real token usage. Estimates from documentation are often 20–50% off from production reality, especially if your prompts include dynamic content like user data or retrieved context.

Cost Optimization Strategies

1. Prompt caching (biggest lever) OpenAI, Anthropic, and Google all offer prompt caching — if you send the same prefix repeatedly (a long system prompt, a document), subsequent calls can cost 50–90% less for the cached portion. For RAG applications with a shared context window, this alone can cut costs by 40–60%.
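A back-of-envelope sketch of the caching effect, with assumed numbers: a 4,000-token system prompt shared across requests, 1,000 tokens of unique input per request, and cached tokens billed at 10% of the normal input rate (discount rates vary by provider; check their pricing pages):

```python
system_prompt = 4_000   # tokens, identical across requests (cacheable)
dynamic_input = 1_000   # tokens, unique per request (never cached)
cache_rate = 0.10       # cached tokens billed at 10% of normal input price

without_cache = system_prompt + dynamic_input               # 5,000 billable
with_cache = system_prompt * cache_rate + dynamic_input     # 1,400 billable
savings = 1 - with_cache / without_cache
print(f"Input-token savings: {savings:.0%}")  # -> Input-token savings: 72%
```

Output tokens are never cached, which is why the all-in bill typically drops less than the input-token savings alone suggests.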

2. Model routing Not every request needs your best model. Route simple classification or extraction tasks to a cheaper model, and reserve the frontier model for complex reasoning. A hybrid approach (Haiku/Flash for 80% of requests, Sonnet/GPT-4o for 20%) can cut costs by 60–70% with minimal quality degradation if routing logic is tuned correctly.
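A minimal routing sketch. The task categories and model names here are placeholders; production routers usually use a tuned heuristic or a small classifier. The blended-cost arithmetic uses the per-request figures from the table above (Claude Haiku 4.5 at $0.0028, Claude Sonnet 4.6 at $0.010):

```python
def route_model(task_type: str) -> str:
    """Send simple, high-volume tasks to a cheap model;
    reserve the frontier model for everything else."""
    cheap_tasks = {"classification", "extraction", "labeling"}
    return "cheap-model" if task_type in cheap_tasks else "frontier-model"

print(route_model("classification"))  # -> cheap-model
print(route_model("code-review"))     # -> frontier-model

# Blended cost at an 80/20 split, per-request costs from the table:
blended = 0.8 * 0.0028 + 0.2 * 0.010
print(f"${blended:.5f}/request vs $0.010 all-frontier")
```

Here the blend comes out to $0.00424/request, roughly 58% below running everything on the frontier model, consistent with the 60–70% figure when the cheap share is a bit higher.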

3. Prompt compression Long system prompts accumulate cost across every request. Audit your prompts: remove redundant instructions, examples that aren’t needed, and verbose formatting that could be inferred. A 30% reduction in your system prompt size translates directly to a 30% cost reduction on input tokens.

4. Output length control Instruct the model to be concise when brevity is acceptable. Setting max_tokens as a hard limit prevents runaway outputs. Structured output formats (JSON with a defined schema) also tend to be more token-efficient than natural language responses.

5. Batch API OpenAI and Anthropic offer async batch processing at a 50% discount. If you have non-real-time workloads (document processing, overnight jobs, dataset labeling), batching is an easy way to halve costs, with little engineering work beyond submitting requests through the batch endpoint instead of the real-time one.
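The batch saving is easy to quantify. A hypothetical example: labeling 100,000 documents with GPT-4o mini at the table's $0.00045/request:

```python
requests = 100_000
cost_per_request = 0.00045  # GPT-4o mini, 1,000 in / 500 out (from the table)

realtime = requests * cost_per_request
batched = realtime * 0.5    # 50% batch API discount
print(f"Realtime: ${realtime:.2f}, batched: ${batched:.2f}")
# -> Realtime: $45.00, batched: $22.50
```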

When to Use Expensive vs Cheap Models

Use expensive frontier models when quality is the binding constraint: complex reasoning, nuanced writing, and tasks requiring deep context, per the tier breakdown above.

Use cheap/small models for classification, extraction, structured output, and high-volume pipelines where you can accept lower quality on edge cases.

The benchmark trap: model benchmarks are measured on specific evaluation sets. Your task may perform very differently. Always A/B test on a sample of your actual production inputs before committing to a cheaper model at scale.

Frequently Asked Questions

Why is there such a large price difference between input and output tokens? Generating output tokens requires the model to run an autoregressive forward pass for each token — compute scales linearly with output length. Input tokens are processed in parallel in a single forward pass, which is far more compute-efficient. The input/output price ratio typically ranges from 3:1 to 5:1 for most providers.

Do these prices include context window length? The prices shown are per-token regardless of where in the context window the tokens appear, for most providers. However, some models have tiered pricing based on total context length — for example, Gemini 1.5 Pro has a lower price for prompts under 128K tokens and a higher price above that threshold. Check the provider’s current pricing page for details on context-length tiers.

How accurate are these cost estimates? The estimates assume your usage matches the input values exactly, with no caching, no batch discounts, and no volume pricing. Real bills will differ based on caching hit rates, request variance, and any negotiated rates. Use this calculator for order-of-magnitude planning and directional comparisons, not for precise budget forecasting.

What about self-hosted open-source models? Running Llama 3.1 70B or Qwen 2.5 72B on your own GPU infrastructure can be cheaper at high volume, but the break-even point depends heavily on your utilization rate, GPU rental costs, and engineering overhead. At typical SaaS workloads, managed APIs are often cheaper until you hit several million requests per day. The “via Together AI” prices shown here represent a managed hosting cost reference point.
