When developers ask "how much does AI cost?", they usually mean the model API pricing — $3 per million input tokens, $12 per million output. That's only part of the bill. A production AI application has three or four distinct cost centers, and the model API is often not the largest one. Here's a realistic breakdown.
The Components of a Production AI App
A typical AI-powered SaaS product has these cost layers:
1. The LLM API (model call cost): The obvious one. Charged per token of input and output. 2. Embeddings API: If your app uses retrieval-augmented generation (RAG), every document chunk and every user query needs to be converted to a vector. Embedding models charge separately. 3. Vector database: Storing and querying embedding vectors has its own infrastructure cost — either hosted (Pinecone, Weaviate, Qdrant Cloud) or self-hosted. 4. Compute (serverless or container): Your API layer, orchestration logic, and preprocessing pipelines run on servers that cost money regardless of AI. 5. Storage and caching: Prompt caching, response logging, user session data, and assets.
Example: A Customer Support AI App
Let's model a mid-sized B2B SaaS company running an AI-powered customer support tool. Assumptions:
- Monthly active users: 5,000 end users
- Average queries per user: 10/month = 50,000 queries/month
- Average system context (product knowledge base): 8,000 tokens, cached
- Average user input: 150 tokens
- Average model response: 400 tokens
- Model choice: Claude Sonnet 4.5 with Anthropic prompt caching
#### Model API Cost (Claude Sonnet 4.5)
- Uncached input: 150 tokens × 50,000 queries = 7,500,000 tokens = $22.50 at $3/M
- Cached context input: 8,000 tokens × 50,000 queries × $0.30/M (90% cache discount) = $12.00
- Output: 400 tokens × 50,000 queries = 20,000,000 tokens = $300 at $15/M
Model API subtotal: ~$335/month
Without caching on the context (8,000 × 50,000 = 400M tokens × $3/M), the same app would spend $1,200/month on context alone. Caching reduces that line from $1,200 to $12. This is why prompt caching is the most important cost optimization available to AI app developers.
#### Embeddings Cost
- Knowledge base: 500 document chunks, embedded once = negligible (one-time)
- Query embedding: 50,000 queries × 150 tokens = 7,500,000 tokens at OpenAI text-embedding-3-small pricing ($0.02/M) = $0.15
Embeddings are extremely cheap. This line item is not worth optimizing until you're embedding millions of documents.
#### Vector Database (Hosted)
- Pinecone Starter: $0 (free tier, up to 100k vectors)
- Pinecone Standard: ~$70/month for 1M vectors + query volume
For a support bot with a 500-document knowledge base, the free tier is sufficient. For larger apps, budget $50–150/month for a managed vector database.
#### Compute (Serverless API Layer)
- AWS Lambda or Vercel serverless: 50,000 calls/month, average ~2 seconds
- Typical serverless cost at this volume: $15–$40/month
#### Storage and Logging
- Conversation history, error logs, usage metrics: $10–$25/month on S3 or equivalent
Total Cost for this Example App
| Component | Monthly Cost |
|---|---|
| Model API (Claude Sonnet 4.5 with caching) | $335 |
| Embeddings API | $0.15 |
| Vector database | $0–$70 |
| Compute (serverless) | $25 |
| Storage and logging | $15 |
| Total | ~$375–$445/month |
At 5,000 active users, that's $0.075–$0.089 per user per month — well under $0.10/user. That cost drops further as cached context ratios increase and as the volume triggers potential volume discounts from the model provider.
Where Costs Scale Unexpectedly
The two line items that surprise developers most as they scale:
Output token cost. Input tokens are cheap. Output tokens are typically 4–5× more expensive per token than input. An application that generates long, verbose responses costs significantly more than one that returns concise answers. Response length is a direct cost lever that's worth tuning in system prompts.
Context window utilization. Every token in your context window on each call costs money. Sending a 50,000-token product manual to every query, even when the user's question is about a single feature, is an expensive architecture choice. RAG retrieval — fetching only the relevant 2,000 tokens of context — dramatically reduces per-call context cost.
Cost Per User Benchmarks
For reference, here are approximate cost-per-user-per-month benchmarks across different application types at 5,000 MAU:
| App Type | Est. Cost/User/Month |
|---|---|
| Customer support bot (short Q&A) | $0.05–$0.15 |
| Document summarization tool | $0.20–$0.50 |
| Code review assistant | $0.30–$0.80 |
| Long-form writing assistant | $0.50–$2.00 |
| Research synthesis tool (long context) | $1.00–$5.00 |
These figures assume efficient caching and prompt design. Without caching, multiply the model API line by 3–5×.