an inference API that understands volumes

A single inference API that adapts to your volume, SLA, and cost constraints.

Why Large Scale Inference?

Designed for reliable volume inference

Inference from 1 -> 1B tokens

Single API to run inference on dynamic batch sizes. No manual batching or scheduling needed.

Just call the API with an SLA and Exosphere automatically batches and schedules your requests, optimising GPU usage and rate limits for lower costs and faster inference.

Send in a list of prompt-input pairs or a JSONL file for bulk inference and get a task_id to track your requests while Exosphere manages inference in the background.

import requests

# Call inference API
response = requests.post(
    "https://models.exosphere.host/v0/infer/",
    headers={"Authorization": "Bearer API_KEY"},
    json=[{
        "sla": 60,
        "model": "deepseek:r1-32b",
        "input": "Hello model..."
    }]
)
task_id = response.json()["task_id"]
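
For bulk workloads, the same endpoint takes a whole list at once. A minimal sketch of submitting a JSONL file, assuming each line already matches the request schema above; the prompts.jsonl file name is illustrative:

import json
import requests

# Load a JSONL file where each line is one {sla, model, input} request
with open("prompts.jsonl") as f:
    payload = [json.loads(line) for line in f]

# Submit the whole batch in a single call
response = requests.post(
    "https://models.exosphere.host/v0/infer/",
    headers={"Authorization": "Bearer API_KEY"},
    json=payload,
)

# One task_id tracks the batch while Exosphere runs inference in the background
task_id = response.json()["task_id"]
print(f"Submitted {len(payload)} records under task {task_id}")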

Built-in evaluation at scale

Evals run on each output and node after generation, with independent analysis of every record. Bring your favourite evals with custom metrics and scoring, and get record- and node-level observability across the batch.

Eval Logs
2025-12-08 12:00:00 | Task Id: 3224 | Accuracy: 94.7% | Precision: 92.3%
2025-12-08 12:00:01 | Task Id: 3225 | Accuracy: 94.7% | Precision: 92.3%
2025-12-08 12:00:02 | Task Id: 3226 | Accuracy: 95.3% | Precision: 92.8%
2025-12-08 12:00:03 | Task Id: 3227 | Accuracy: 95.8% | Precision: 93.3%
2025-12-08 12:00:04 | Task Id: 3228 | Accuracy: 40.3% | Precision: 30.8%
2025-12-08 12:00:05 | Task Id: 3229 | Accuracy: 50.3% | Precision: 40.8%
2025-12-08 12:00:06 | Task Id: 3230 | Accuracy: 97.3% | Precision: 94.8%
2025-12-08 12:00:07 | Task Id: 3231 | Accuracy: 97.8% | Precision: 95.3%
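
To show what a custom metric can look like, here is a minimal record-level eval sketch; the function name and the output/expected record fields are hypothetical, not part of the Exosphere API:

# Hypothetical custom eval: exact-match accuracy, scored per record.
# The "output" and "expected" field names are illustrative assumptions.
def exact_match_accuracy(records):
    correct = sum(
        1 for r in records
        if r["output"].strip() == r["expected"].strip()
    )
    return correct / len(records) if records else 0.0

batch = [
    {"output": "Paris", "expected": "Paris"},
    {"output": "Berlin", "expected": "Munich"},
]
print(f"Accuracy: {exact_match_accuracy(batch):.1%}")  # Accuracy: 50.0%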

Failure recovery only for failed records

Failures should not restart entire jobs.

Set up failure policies to automatically retry failed records, with node-level policies and fallback options.

You get full job completion without manual intervention.

Status
Completed: 9,947
Retrying: 52
Failed: 1

# Automatic retry
Batch 7ae4f partially succeeded with 53 failures and 9947 successful records
Retrying in 2.3s...
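
As an illustration of record-level recovery (not Exosphere's internal code), a retry loop that touches only failed records might look like this; all names and fields below are assumptions:

import time

MAX_RETRIES = 3        # per-record retry budget (illustrative)
BACKOFF_SECONDS = 2.3  # delay between retry waves (illustrative)

def retry_failed(records, run_inference):
    """Resubmit only failed records, leaving completed ones untouched."""
    failed = [r for r in records if r["status"] == "failed"]
    for _ in range(MAX_RETRIES):
        if not failed:
            break
        time.sleep(BACKOFF_SECONDS)
        results = run_inference(failed)  # resubmit just the failures
        failed = [r for r in results if r["status"] == "failed"]
    return failed  # records still failing after all retries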

bulk, without the bulkiness

Introducing Flexible SLA Inference

Choose your SLA window and data volume to optimise performance.

[Interactive pricing calculator: pick an SLA window from 1 to 24 hours and a volume from 500K to 12B tokens to see the estimated cost as a percentage of the base price.]

What is Exosphere doing behind the scenes?

A combination of techniques to optimise your inference

Smart batching and scheduling
Exosphere optimises batching and scheduling dynamically across your pending inference requests.
Prefix Caching for Common Workloads
We aggressively reuse computation for shared prompt prefixes across requests.
Prefix-Aware Routing
We route batched requests to the most suitable GPU replica. This enables cache reuse even in distributed, multi-GPU deployments.

Batch 1: 1 hour SLA
Batch 2: 10 hour SLA
Batch 3: 24 hour SLA

Request Prioritization with SLA-aware Scheduling
We prioritise requests based on their SLA requirements. This ensures that high-priority requests are completed within the defined time window (see the sketch after this list).
Dynamic Auto-scaling
We automatically scale the number of GPU replicas based on workload. This ensures high throughput for your inference jobs.
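
To make the prioritisation concrete, here is a minimal sketch of SLA-aware scheduling as earliest-deadline-first ordering; this illustrates the general technique, not Exosphere's internals, and the class name is made up:

import heapq
import time

class SlaScheduler:
    """Earliest-deadline-first queue: tighter SLAs are served first."""

    def __init__(self):
        self._queue = []  # min-heap ordered by absolute deadline

    def submit(self, request_id, sla_minutes):
        deadline = time.time() + sla_minutes * 60
        heapq.heappush(self._queue, (deadline, request_id))

    def next_batch(self, size):
        # Pop the `size` requests closest to missing their SLA
        batch = []
        while self._queue and len(batch) < size:
            _, request_id = heapq.heappop(self._queue)
            batch.append(request_id)
        return batch

scheduler = SlaScheduler()
scheduler.submit("batch-1", sla_minutes=60)    # 1 hour SLA
scheduler.submit("batch-3", sla_minutes=1440)  # 24 hour SLA
scheduler.submit("batch-2", sla_minutes=600)   # 10 hour SLA
print(scheduler.next_batch(2))  # ['batch-1', 'batch-2']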

All your favourite models

Supported model categories

OpenAI
Anthropic
DeepSeek
Meta LLaMA
Mistral
Gemini
Qwen
Custom OSS

Single API surface

Inference API signature

POST /v0/infer/

Request body

[
  {
    "sla": 60,
    "model": "deepseek:r1-32b",
    "input": "Hello model, how are you?"
  },
  {
    "sla": 1440,
    "model": "openai:gpt-4o",
    "input": "Hello OpenAI, how are you?"
  }
]

Response

{
  "status": "submitted",
  "task_id": "2f92fc35-07d6-4737-aefa-8ddffd32f3fc",
  "total_items": 2,
  "objects": [
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.3
      },
      "object_id": "63bb0b28-edfe-4f5b-9e05-9232f63d76ec"
    },
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.5
      },
      "object_id": "88d68bc3-643c-4251-a003-6c2c14f76649"
    }
  ]
}
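
Every field needed to track per-record pricing is already in this response; here it is read with the response object from the Python example above:

# Read the documented response fields: task_id, total_items, objects
data = response.json()
print(f"Task {data['task_id']}: {data['total_items']} items submitted")
for obj in data["objects"]:
    factor = obj["usage"]["price_factor"]
    print(f"  {obj['object_id']}: price factor {factor}")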

optimise your large scale inference

Single API call, any number of tokens. Reliably, within your defined SLA.