Meter before you manage: How to cut LLM costs by up to 85%
If you're being hit with high bills for LLM-powered features, instrumenting your system so you can see where the money actually goes is the most important place to start.
Feb 17, 2026 • 11 Minute Read
You've shipped an LLM-powered feature. Usage is growing. The product team is thrilled. Then finance sends a message: "Can you explain this $47,000 API bill?"
You open your cloud dashboard. There's a single line item: "OpenAI API - $47,832." No breakdown by feature. No usage by team. No way to know if the support chatbot or the document analyzer is driving the cost. You're flying blind.
This scenario is increasingly common. Model API spending doubled from $3.5 billion to $8.4 billion between late 2024 and mid-2025, and 72% of companies plan to increase their LLM spending this year. The enterprise LLM market is projected to grow from $6.7 billion to $71.1 billion by 2034. Yet most teams discover their costs are out of control only when the bill arrives.
Without observability, LLM costs are a black box. With instrumentation, you can attribute every dollar to specific features and teams.
Running LLMs without observability is like running a restaurant kitchen where you can't see which chef is cooking which dish, can't track ingredient usage by recipe, and only discover you're over budget when the supplier bill arrives. Imagine trying to cut costs in that kitchen:
- Should you cut costly dishes? Maybe your expensive wagyu beef dishes are actually your most profitable items, and your cheapest dish is actually losing money because staff spend too long preparing it.
- Should you use cheaper ingredients or reduce portions? Maybe you reduce or eliminate the wrong ingredient and tank quality, or lose customers.
- Should you fire a chef? Maybe they weren't a large expense at all, and by doing so, you've driven down the kitchen's output without substantially cutting costs.
Without visibility into where the money goes, every decision is a guess. No one would run a kitchen that way! But that's exactly how most teams run their LLM applications: making optimization decisions based on gut feel rather than data.
The problem with cost optimization advice
Most "reduce LLM costs" articles jump straight to tactics: use smaller models, write shorter prompts, cache responses. But these are shots in the dark without observability. How do you know which prompts to shorten? Which features can use smaller models? Where caching would actually help?
Consider a concrete example: your support chatbot handles 500,000 requests monthly at an average of 1,500 tokens per request. At GPT-4 pricing, that's roughly $18,000 per month for a single feature. But without instrumentation, you can't distinguish between complex tickets requiring premium models and simple FAQ questions that could run on a model costing 100x less. You're paying GPT-4 prices for "What are your business hours?"
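The back-of-the-envelope math behind that figure, assuming a blended GPT-4-class price of roughly $24 per million tokens (the pricing quoted later in this article):

# Rough monthly cost estimate for the support chatbot example
# (assumes a blended GPT-4-class price of ~$24 per million tokens)
requests_per_month = 500_000
tokens_per_request = 1_500
price_per_million_tokens = 24.0

total_tokens = requests_per_month * tokens_per_request            # 750M tokens
monthly_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"${monthly_cost:,.0f} per month")                          # ~$18,000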
The problem isn't a lack of optimization techniques. The problem is that you can't optimize what you can't see.
Observability isn't just one of many cost optimization tips; it's the prerequisite that makes all other optimizations possible. This is the "Meter Before You Manage" framework: you must first instrument your system to see costs, then attribute them to understand where they come from, and only then can you optimize effectively.
The "Meter Before You Manage" framework: you can't optimize costs without first instrumenting to see them and attributing to understand them.
Let's walk through each layer with concrete tools and code.
Layer 1: Instrument with LiteLLM
The foundation of LLM cost management is a unified gateway that captures every request. Without this central point of observation, your LLM calls are scattered across services, teams, and codebases: each one invisible until the bill arrives.
LiteLLM is an open-source proxy that provides a single interface to 100+ LLM providers while automatically tracking tokens, latency, and costs. Think of it as a toll booth for your LLM traffic: every request passes through, gets counted, and gets priced, regardless of whether it's going to OpenAI, Anthropic, or a self-hosted Ollama instance running on your own infrastructure.
Instead of scattering API calls throughout your codebase, you route everything through the proxy:
# config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: mixtral
    litellm_params:
      model: together_ai/mixtral-8x7b
      api_key: os.environ/TOGETHER_API_KEY

general_settings:
  master_key: sk-your-master-key
  database_url: postgresql://user:pass@localhost/litellm
Start the proxy and your applications use it as an OpenAI-compatible endpoint:
import openai

client = openai.OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}],
    extra_body={
        "metadata": {
            "team": "support",
            "feature": "ticket-summarizer"
        }
    }
)
The gateway captures every request with its token count, latency, and calculated cost. But instrumentation alone isn't enough; you need to enforce limits before costs spiral. LiteLLM's virtual keys let you set budgets per team or project:
curl 'http://localhost:4000/key/generate' \
  -H 'Authorization: Bearer sk-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "models": ["gpt-4o", "claude-sonnet"],
    "max_budget": 5000,
    "budget_duration": "monthly",
    "metadata": {"team": "support"}
  }'
When a team hits their budget, requests fail with a clear error rather than silently accumulating charges. This is the difference between discovering a $47,000 bill at month-end and getting an alert at $4,000 with time to investigate.
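From the application's side, an exhausted key surfaces as an HTTP error from the proxy, which the OpenAI client raises as an exception. The exact status code and message depend on your LiteLLM version, so this sketch catches the general APIStatusError class; how you log and recover is up to you.

import openai

client = openai.OpenAI(
    api_key="sk-team-support-key",   # a virtual key with a monthly budget attached
    base_url="http://localhost:4000"
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize this ticket..."}],
    )
except openai.APIStatusError as e:
    # LiteLLM rejects requests once the key's budget is exhausted;
    # fail loudly rather than silently retrying against a blocked key.
    print(f"LLM request rejected (status {e.status_code}): {e}")
    raise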
A unified LLM gateway centralizes cost tracking, caching, and provider access while streaming observability data to Langfuse for analysis.
Layer 2: Attribute with Langfuse
Instrumentation tells you how much you're spending. Attribution tells you why. This distinction matters enormously. Knowing you spent $47,000 last month is useful. Knowing you spent $18,000 on support tickets, $12,000 on document analysis, and $8,000 on code generation (and that 60% of support costs came from simple FAQ questions) is actionable.
Langfuse is an open-source LLM observability platform that integrates directly with LiteLLM to provide tracing, cost analytics, and quality evaluation.
The integration requires just a few environment variables:
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"
Then enable the callback in your LiteLLM configuration:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
Every request now flows to Langfuse with full context: which model, how many tokens, what it cost, and the metadata you attached (team, feature, user). You can slice costs by any dimension: which feature drives the most spend? Which prompts are inefficient? Which team is over budget?
For more complex applications with multi-step chains or RAG pipelines, you can use the Langfuse SDK directly:
from langfuse import observe
from langfuse.openai import openai

@observe()
def analyze_support_ticket(ticket_text):
    # First call: classify the ticket with a cheaper model
    classification = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify this ticket: {ticket_text}"}]
    )
    category = classification.choices[0].message.content

    # Second call: generate a response based on the classification
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Draft a response for this {category} ticket..."}]
    )
    return response.choices[0].message.content
The @observe() decorator captures the entire trace, showing you exactly where time and tokens are spent within each operation. This granularity is crucial: a single "analyze ticket" function might make three LLM calls internally, and you need to know which one is eating your budget.
Setting up basic instrumentation is straightforward, but production observability involves decisions about sampling strategies, metric aggregation, and alert thresholds. If you want to dive deeper into deploying and scaling LLMs in production---including the trade-offs between API-based and self-hosted approaches---we cover this extensively in the Scale and Deploy LLMs in Production Environments course on Pluralsight.
Layer 3: Optimize with Routing and Caching
Once you can see and attribute costs, you can finally optimize with confidence. Two techniques deliver the largest impact: intelligent model routing and semantic caching.
Intelligent Routing with RouteLLM
Not every query needs GPT-4. A customer asking "What are your business hours?" doesn't require the same computational power as someone asking you to debug a complex distributed systems issue. Yet most applications treat every query identically, routing everything to the most capable (and expensive) model available.
Research from UC Berkeley shows that routing simple queries to smaller models while reserving expensive models for complex reasoning can reduce costs by up to 85% while maintaining 95% of the quality. The insight is that model capability is often wasted on straightforward tasks: you're paying for reasoning power you don't need.
The RouteLLM framework uses a trained classifier to assess query complexity before deciding which model should handle it:
RouteLLM routes simple queries to cost-effective models while preserving premium models for complex reasoning, achieving up to 85% cost reduction without quality loss.
The price difference is dramatic: GPT-4 costs roughly $24.70 per million tokens while Mixtral 8x7B costs $0.24---a 100x difference. If 60% of your queries are simple enough for the cheaper model, you've cut costs by more than half without touching your complex workflows.
from routellm.controller import Controller

client = Controller(
    routers=["mf"],
    strong_model="gpt-4o",
    weak_model="mixtral-8x7b"
)

user_query = "What are your business hours?"

# The router automatically selects the appropriate model for each query
response = client.chat.completions.create(
    model="router-mf-0.116",  # The threshold in the model name controls routing sensitivity
    messages=[{"role": "user", "content": user_query}]
)
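A quick sanity check on the savings claim, using the per-million-token prices quoted above and assuming a 60/40 split between simple and complex queries:

# Blended cost per million tokens when 60% of traffic goes to the weak model
strong_price = 24.70   # GPT-4, $ per million tokens (figure quoted above)
weak_price = 0.24      # Mixtral 8x7B, $ per million tokens
simple_share = 0.60    # assumed share of queries the weak model can handle

blended = simple_share * weak_price + (1 - simple_share) * strong_price
savings = 1 - blended / strong_price
print(f"Blended: ${blended:.2f}/M tokens, savings: {savings:.0%}")   # ~$10/M, ~59% savings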
Semantic Caching
Routing optimizes which model handles each query. Caching eliminates redundant queries entirely. Research shows that 31% of enterprise LLM queries are semantically similar to previous requests. If a user asks "How do I reset my password?" and another asks "What's the process to change my password?", they should get the same cached response rather than two separate API calls. Traditional exact-match caching misses these opportunities because the strings are different. Semantic caching uses vector embeddings to identify similar meaning regardless of phrasing.
Semantic caching identifies similar queries through vector similarity, serving cached responses for 31% of enterprise queries and reducing costs by 40-70%.
Semantic caching delivers 40-70% cost reduction and improves latency from ~850ms to ~120ms---a 7x speedup. Tools like GPTCache and Redis with vector search make implementation straightforward, and LiteLLM supports semantic caching natively with configurable similarity thresholds.
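To make the mechanism concrete, here's a minimal in-memory sketch of a semantic cache built on OpenAI embeddings and cosine similarity. It's illustrative only: the 0.9 threshold, the embedding model choice, and the linear scan are simplifications you'd replace with a proper vector store (or LiteLLM's built-in semantic cache) in production.

import numpy as np
import openai

client = openai.OpenAI()
cache = []  # list of (embedding, response) pairs

def embed(text):
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cached_completion(prompt, threshold=0.9):
    query_vec = embed(prompt)
    # Linear scan for the most similar previously seen prompt
    for vec, cached_response in cache:
        similarity = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if similarity >= threshold:
            return cached_response  # cache hit: no LLM call, no cost

    # Cache miss: call the model and store the result for future similar queries
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    cache.append((query_vec, response))
    return response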
Don't overlook provider-level caching either. These work at a different layer than semantic caching and can be combined for compound savings. Anthropic's prompt caching reduces costs by 90% for repeated long prompts, which is particularly valuable when you have large system prompts or knowledge bases that stay constant across requests. OpenAI's automatic caching provides 50% savings on identical requests with no code changes required. For RAG pipelines with static context, combining provider prompt caching with semantic response caching can reduce costs dramatically.
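As one example of provider-level caching, Anthropic's prompt caching is enabled by marking a content block with cache_control. The sketch below follows Anthropic's documented pattern; the knowledge_base placeholder is hypothetical, and you should check their docs for current model names and minimum cacheable prompt sizes.

import anthropic

client = anthropic.Anthropic()

# Placeholder for a large, rarely-changing context block, e.g. a product
# manual or support knowledge base that stays constant across requests
knowledge_base = "..."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": knowledge_base,
            "cache_control": {"type": "ephemeral"},  # cache this block across requests
        }
    ],
    messages=[{"role": "user", "content": "Answer using the knowledge base above..."}],
)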
The Result: From Black Box to Control Plane
Let's return to that $47,000 bill. With the observability stack in place, here's what changes:
You open Langfuse and immediately see the breakdown: the support chatbot accounts for $18,000, document analysis for $12,000, code generation for $8,000, and the RAG pipeline for $10,000. You notice the support chatbot is using GPT-4 for every query, including simple FAQ responses that could run on a model 100x cheaper.
You implement RouteLLM to route simple support queries to GPT-4o-mini. You add semantic caching for the FAQ-style questions that make up 40% of support volume. You set budget alerts at 80% of the monthly target so you'll never be surprised by a bill again.
The next month's bill: $28,000. A 42% reduction, with quality metrics unchanged. More importantly, you now understand exactly where that $28,000 goes, which means you can make informed decisions about where to optimize next.
This pattern---instrument, attribute, optimize---is how organizations transform unpredictable LLM costs into managed infrastructure. The key insight is that optimization techniques like routing and caching are only effective when you have the visibility to apply them strategically. Blindly implementing caching might save money on the wrong queries while leaving your actual cost drivers untouched. Switching to cheaper models without understanding query complexity might tank quality on the requests that matter most. You need to meter before you can manage.
One critical reminder: always monitor quality alongside cost. The goal isn't the cheapest possible LLM deployment: it's the most cost-effective deployment that still solves user problems. A 50% cost reduction means nothing if customer satisfaction drops 30%. Langfuse's evaluation features let you track accuracy, relevance, and user feedback so you can catch any quality degradation before it impacts users. Cost without quality is a false economy: you'll just spend the savings on customer support and churn recovery.
Now you're ready to implement your own observability stack. Start with LiteLLM to centralize your LLM traffic, add Langfuse for attribution and analysis, then layer in routing and caching once you understand where the opportunities are.
The 40-85% savings are real, but only if you can see where the money goes first.