The biggest pricing knobs are input tokens, output tokens, and whether you use long-context pricing tiers (when applicable). Input tokens include everything you send: system instructions, user text, tool definitions, and retrieved context. Output tokens are everything Claude generates. In most real systems, the cost driver is either (a) large prompts from bloated context, or (b) large outputs from unconstrained generation. So the highest-leverage knobs are: keep prompts compact, cap output, and avoid sending redundant text repeatedly.
Anthropic also provides cost-saving mechanisms that can change your unit economics significantly: prompt caching (which can reduce costs for repeated context) and batch processing (which can reduce costs for offline workloads). The practical interpretation is: if you have a lot of repeated boilerplate (policy text, tool schemas, static docs headers), caching can cut costs drastically. Batch processing is useful for non-interactive jobs like nightly doc summarization, embedding generation, or offline eval runs, where you can trade latency for lower cost. A solid cost-control checklist looks like this:
Cap output tokens by endpoint (FAQ answers vs deep analysis).
Trim prompts: remove duplicated instructions and long chat history.
Use retrieval instead of pasting whole docs.
Cache stable prefixes (system prompts, tool schemas, shared policies).
Separate “fast” vs “deep” routes (don’t pay Opus-depth for simple tasks).
If you want to estimate cost reliably, log {input_tokens, output_tokens, model, route} per request and calculate cost per endpoint (FAQ, code review, agent run).
RAG design has direct cost impact. When you index content into Milvus or Zilliz Cloud and retrieve top-k chunks, you can dramatically cut input tokens versus “paste everything.” The retrieval step itself is cheap compared to model tokens. In production, this often becomes the dominant cost strategy: spend engineering effort on better retrieval and chunking so you can send fewer tokens while keeping quality high. That’s the cleanest way to control Opus 4.6 cost without degrading user experience.