Two Env Vars and Done: Claude Code for Every Developer, Zero API Keys
Open source, so every developer gets every Anthropic model, funded by the credits you already hold across clouds
The Problem
Claude Code is a genuine productivity multiplier. Rolling it out across an engineering org, though, creates a set of headaches that compound fast.
Developers end up signing up with individual Anthropic accounts, so spend is invisible at the org level with no way to set budgets or attribute costs to teams. Every developer holds their own API key, and one leaked key on GitHub means a billing surprise and a security incident. When something goes wrong (or right), there is no audit trail of which team, project, or developer drove the usage. And most orgs have committed GCP and AWS spend sitting there: without deliberate routing, Anthropic API calls never draw from those pools.
We solved all four with a single self-hosted proxy. Developer onboarding is two environment variables and a shell reload.
The Architecture
The whole stack runs on a single VM with three components. Nginx handles TLS termination, HTTP-to-HTTPS redirect, and session ID extraction via a small Lua script. LiteLLM is the Anthropic-compatible API proxy that routes requests across providers, tracks costs, and enforces budgets. PostgreSQL stores every request's token count, cost, model, provider, and timestamp.
What Developers Actually Do
Add two lines to your shell profile:
# ~/.bashrc or ~/.zshrc
export ANTHROPIC_BASE_URL=https://your-proxy.example.com/v1
export ANTHROPIC_AUTH_TOKEN=sk-...   # Virtual key issued by the proxy

Reload your shell, type claude, and it works exactly as before. Same models, same streaming, same tool use. Claude Code uses the Anthropic SDK under the hood, and the SDK respects ANTHROPIC_BASE_URL natively, so the proxy is completely invisible. No plugins, no config files, no wrapper scripts.
New hire onboarding: generate a virtual key, send it, done. Under five minutes.
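On the admin side, generating that key is a single API call to the proxy. A minimal sketch against LiteLLM's /key/generate endpoint, assuming a master key is configured on the proxy; the alias, budget, and metadata values here are illustrative, not prescriptive:

curl -s https://your-proxy.example.com/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "alice-backend-team",
        "max_budget": 200,
        "budget_duration": "30d",
        "metadata": {"team": "backend"}
      }'
# The response includes the virtual key ("key": "sk-...") to send to the developer.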
Intelligent Routing: Credits Drive Weights
The proxy doesn’t forward blindly. It distributes traffic across three providers using weighted load balancing, and the weights are set to directly mirror your cloud credit ratio.
If you have roughly 10x more GCP credits than AWS or Anthropic spend, the router sends ~83% of traffic to Vertex AI and treats the others as fallbacks. When credit allocations shift next quarter, you update the weights. The config follows your financial reality rather than the other way around.
The YAML that makes this work is simpler than it looks. Each model is declared three times, once per provider, all sharing the same model_name:
- model_name: claude-sonnet-4-6
  litellm_params:
    model: vertex_ai/claude-sonnet-4-6
    weight: 10
- model_name: claude-sonnet-4-6
  litellm_params:
    model: bedrock/us.anthropic.claude-sonnet-4-6
    weight: 1
- model_name: claude-sonnet-4-6
  litellm_params:
    model: anthropic/claude-sonnet-4-6
    weight: 1

LiteLLM handles weighted shuffle, retries, and failover automatically. If Vertex AI is down or rate-limited, requests fall through to Bedrock, then Anthropic Direct. The developer sees nothing.
Session Affinity for Prompt Caching
Claude’s API supports prompt caching: if consecutive requests share a long prefix (system prompt, file contents), the provider caches it and serves subsequent requests faster and cheaper. The catch is that this only works if those requests hit the same provider deployment. With three providers behind a weighted router, that’s not guaranteed.
We solved this at the infrastructure level with zero developer configuration.
Claude Code already includes a session_id in every request body. A small Lua script in Nginx reads it and passes it upstream as a header:
-- extract_session.lua -- runs on every request, inside Nginx
ngx.req.read_body()
local body = ngx.req.get_body_data()
if body then
  local sid = body:match('"session_id"%s*:%s*"([^"]+)"')
  if sid then
    ngx.var.session_id = sid
  end
end

LiteLLM receives the x-litellm-session-id header and pins that session to a single provider deployment for four hours.
The session_id was already in the request body. We just needed Nginx to read it and forward it. No shell wrappers, no custom headers, no developer configuration. Prompt caching can cut costs by up to 90% on the cached prefix, and every developer gets it for free.
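The forwarding half is a few lines of Nginx config around that script. A minimal sketch, assuming OpenResty (or Nginx built with the Lua module) and the default LiteLLM port; paths and values are illustrative:

location / {
    set $session_id "";                                   # populated by the Lua script above
    access_by_lua_file /etc/nginx/lua/extract_session.lua;
    proxy_set_header x-litellm-session-id $session_id;    # LiteLLM pins the session on this header
    proxy_pass http://127.0.0.1:4000;                     # LiteLLM on the host network
}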
What the Org Gets
Complete cost visibility. Every request lands in PostgreSQL with full metadata: input tokens, output tokens, cached tokens, exact cost, model, provider, timestamp, and the virtual key identifying which developer or team made the request. Full prompt content is optional via store_prompts_in_spend_logs: true. Spend logs are retained for 30 days with a daily cleanup job, which is enough history for monthly cost analysis without unbounded storage growth. Since the data is in PostgreSQL, cost reporting is just SQL.
Per-developer tracing. LiteLLM’s virtual key system issues a unique key per developer or team. Every request is tagged with it, so you can answer questions like “which team is burning Opus on tasks that should use Sonnet?” or “what’s the per-engineer cost variance this sprint?” No more guessing, no more asking developers to self-report.
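For example, a per-key, per-model spend rollup for the last 30 days can look like the query below. This is a sketch against LiteLLM's default spend-log table; table and column names vary by LiteLLM version, so check your schema before relying on it.

-- Spend by virtual key and model, last 30 days (assumes LiteLLM's default Postgres schema)
SELECT api_key, model, SUM(spend) AS usd, SUM(total_tokens) AS tokens
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > now() - interval '30 days'
GROUP BY api_key, model
ORDER BY usd DESC;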
Budget controls. Virtual keys can carry spend limits per developer, per team, or globally. When a key hits its budget, requests are rejected with a clear error rather than silently running up a bill.
Zero trust posture. Developers never see the real provider credentials. Keys are issued through the proxy admin and can be revoked instantly without touching AWS, GCP, or Anthropic. All LLM traffic flows through a single auditable gateway, which is a significant improvement over prompts scattered across individual accounts.
Cloud credit utilization. Most enterprise GCP and AWS agreements have committed spend you lose if you don’t use it. Routing Claude Code traffic through Vertex AI and Bedrock converts that into developer productivity. Update the routing weights when your credit allocations change.
What Developers Get
Two environment variables, no Anthropic account signup, no billing setup, no credit card. A new hire can be using Claude Code within minutes of getting their virtual key.
When a new Claude model ships, the proxy admin adds it once and every developer gets access immediately. No waitlists, no individual account upgrades.
Single-provider setups have single points of failure. With three providers, if Vertex AI has an outage, requests automatically fail over to Bedrock or Anthropic Direct. The developer sees fewer errors and less downtime.
Developers also experiment more freely when they’re not spending personal money or worrying about running up a bill. Budget limits prevent runaway costs, and that combination removes a real psychological barrier to using AI tooling aggressively.
The Full Stack
The full stack deploys with a single deploy.sh that installs Docker, provisions a Let’s Encrypt certificate, sets up automatic renewal, and starts the LiteLLM and PostgreSQL containers. After that, the admin creates virtual keys through the LiteLLM dashboard and distributes them. Adding a developer is: generate key, send key, done.
Key Design Decisions
Weights mirror credit ratio. The 10:1:1 split isn’t a preference. It reflects how your cloud credits are actually allocated. Traffic distribution is a financial decision expressed as a routing config.
drop_params: true on Bedrock. Claude Code sends Anthropic-specific beta parameters that Bedrock doesn't understand. This config strips them silently instead of failing. Without it, Bedrock requests error on parameters the provider never asked for.
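To make that concrete, here is the Bedrock deployment entry from the earlier config with the flag added (a sketch; only the drop_params line is the point):

- model_name: claude-sonnet-4-6
  litellm_params:
    model: bedrock/us.anthropic.claude-sonnet-4-6
    weight: 1
    drop_params: true   # silently drop Anthropic-only parameters that Bedrock rejects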
Session affinity with a 4h TTL. Long enough to cover a working session and benefit from prompt caching. Short enough that a provider outage doesn’t lock a session to a degraded backend for half the day.
PostgreSQL for spend logs. Queryable, reliable, and already in the stack. Cost reporting is SQL, with no separate analytics infrastructure needed.
Host networking for LiteLLM. Nginx runs on the host, LiteLLM in a container. Host networking removes the subnet-routing complexity between them.
Before and After
Before: no cost visibility, zero GCP/AWS credits applied to LLM usage, 15-30 minutes to onboard a developer, and prompts scattered across individual accounts with no audit trail.
After: every token tracked in PostgreSQL, traffic distributed across providers proportional to credit ratio, developer onboarding under five minutes, full audit trail with prompt logging and key-level tracing.
What Current LLM Gateways Still Can’t Do
The proxy as described is static: you set the weights once and they stay until you change them. That works, but it leaves two problems on the table.
The first is credit-aware dynamic routing. Right now, routing weights are a config value you update manually when your credit situation changes. The more useful version polls each provider’s credit balance on a schedule, compares current consumption rates, and adjusts weights in real time to keep drawdown proportional across providers. You could also define reserves: “never let GCP credits drop below $500” or “keep one month of Bedrock runway available.” The router would then treat those reserves as a floor and route around them automatically, shifting traffic to other providers when a credit pool gets too close to the threshold. This turns what is currently a quarterly config update into something that manages itself.
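Nothing below exists in the proxy today; it is a sketch of what that control loop could look like. fetch_credit_balance and update_router_weights are hypothetical stubs standing in for provider billing APIs and the proxy's admin API:

# Sketch of a credit-aware rebalancer (hypothetical, not part of the current stack).
RESERVES = {"vertex_ai": 500.0, "bedrock": 1000.0, "anthropic": 0.0}  # USD floors, illustrative

def fetch_credit_balance(provider: str) -> float:
    # Hypothetical: would call the provider's billing / commitment API.
    return {"vertex_ai": 9200.0, "bedrock": 1100.0, "anthropic": 800.0}[provider]

def update_router_weights(weights: dict[str, int]) -> None:
    # Hypothetical: would push new weights into the proxy's routing config.
    print("new weights:", weights)

def rebalance(providers: list[str]) -> None:
    # Spendable credit = balance minus its reserve floor; weights track that ratio.
    spendable = {p: max(fetch_credit_balance(p) - RESERVES[p], 0.0) for p in providers}
    total = sum(spendable.values())
    if total == 0:
        return  # every pool is at or below its reserve; leave weights untouched
    update_router_weights({p: round(100 * s / total) for p, s in spendable.items()})

rebalance(["vertex_ai", "bedrock", "anthropic"])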
The second is sensitive data detection at the request boundary. Coding agents create a genuinely new class of data security problem. A developer working on a production codebase can, without meaning to, feed credentials, internal hostnames, customer records, or proprietary logic directly into a prompt. The agent then happily sends all of it upstream. In a traditional setup there is no interception point. With a proxy, there is: every request passes through before it reaches any provider. That makes it the right place to inspect the request body, run pattern matching or a lightweight classifier against known sensitive data signatures (API keys, connection strings, PII patterns, internal IP ranges), and either strip the offending content, block the request and return an error, or alert the security team while letting it through. For enterprise teams where the compliance surface of coding agents is still not well understood, having that enforcement layer at the infrastructure level rather than relying on developer discipline is significant.
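A sketch of what that inspection step could look like at the proxy boundary; the patterns are illustrative examples, not a complete secret-detection ruleset:

import re

# Illustrative signatures only; a real deployment would use a maintained ruleset.
SENSITIVE_PATTERNS = {
    "anthropic_key": re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_ip": re.compile(r"\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    "connection_string": re.compile(r"postgres(?:ql)?://\S+:\S+@\S+"),
}

def scan_request_body(body: str) -> list[str]:
    """Return the names of any sensitive-data signatures found in a prompt body."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(body)]

hits = scan_request_body('{"messages": [{"content": "db at postgres://app:hunter2@10.0.3.7/prod"}]}')
if hits:
    # Policy decision: strip the content, block with an error, or alert and pass through.
    print("sensitive data detected:", hits)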
Neither of these is something you can wire up easily with LiteLLM or any other gateway today. They require a different layer of intelligence sitting above the routing logic.
Try It Yourself
The repo is at github.com/exospherehost/claude-code-proxy. You need a VM, API credentials for at least one Claude provider, and a domain pointed at the VM. Start with Anthropic Direct and add cloud providers as your credit situation warrants. The developer experience is the same either way: two environment variables, and Claude Code just works.