an inference API that understands volumes

A single inference API that adapts to your volume, SLA, and cost constraints.

Why Large Scale Inference?

Designed for reliable volume inference

Inference from 1 -> 1B tokens

Single API to run inference on dynamic batch sizes. No manual batching or scheduling needed.

Just call the API with an SLA and Exosphere automatically batches and schedules your requests, optimising GPU usage and rate limits for lower costs and faster inference.

Send in a list of prompt-input pairs or a JSONL file for bulk inference and get a task_id to track your requests while Exosphere manages inference in the background.

import requests

# Call inference API
response = requests.post(
    "https://models.exosphere.host/v0/infer/",
    headers={"Authorization": "Bearer API_KEY"},
    json=[{
        "sla": 60,
        "model": "deepseek:r1-32b",
        "input": "Hello model..."
    }]
)
task_id = response.json()["task_id"]
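
For bulk workloads, the same endpoint takes a whole list at once. A minimal sketch of submitting a JSONL file, assuming each line already matches the request schema above; the prompts.jsonl file name is illustrative:

import json
import requests

# Load a JSONL file where each line is one {sla, model, input} request
with open("prompts.jsonl") as f:
    payload = [json.loads(line) for line in f]

# Submit the whole batch in a single call
response = requests.post(
    "https://models.exosphere.host/v0/infer/",
    headers={"Authorization": "Bearer API_KEY"},
    json=payload,
)

# One task_id tracks the batch while Exosphere runs inference in the background
task_id = response.json()["task_id"]
print(f"Submitted {len(payload)} records under task {task_id}")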

Built-in evaluation at scale

Evals run on each output and node after generation, with independent analysis of every record. Bring your favourite evals with custom metrics and scoring, and get record- and node-level observability across the batch.

Eval Logs
2025-12-08 12:00:00 | Task Id: 3224 | Accuracy: 94.7% | Precision: 92.3%
2025-12-08 12:00:01 | Task Id: 3225 | Accuracy: 94.7% | Precision: 92.3%
2025-12-08 12:00:02 | Task Id: 3226 | Accuracy: 95.3% | Precision: 92.8%
2025-12-08 12:00:03 | Task Id: 3227 | Accuracy: 95.8% | Precision: 93.3%
2025-12-08 12:00:04 | Task Id: 3228 | Accuracy: 40.3% | Precision: 30.8%
2025-12-08 12:00:05 | Task Id: 3229 | Accuracy: 50.3% | Precision: 40.8%
2025-12-08 12:00:06 | Task Id: 3230 | Accuracy: 97.3% | Precision: 94.8%
2025-12-08 12:00:07 | Task Id: 3231 | Accuracy: 97.8% | Precision: 95.3%
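
To show what a custom metric can look like, here is a minimal record-level eval sketch; the function name and the output/expected record fields are hypothetical, not part of the Exosphere API:

# Hypothetical custom eval: exact-match accuracy, scored per record.
# The "output" and "expected" field names are illustrative assumptions.
def exact_match_accuracy(records):
    correct = sum(
        1 for r in records
        if r["output"].strip() == r["expected"].strip()
    )
    return correct / len(records) if records else 0.0

batch = [
    {"output": "Paris", "expected": "Paris"},
    {"output": "Berlin", "expected": "Munich"},
]
print(f"Accuracy: {exact_match_accuracy(batch):.1%}")  # Accuracy: 50.0%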

Failure recovery only for failed records

Failures should not restart entire jobs.

Set up failure policies to automatically retry failed records, with node-level policies and fallback options.

You get full job completion without manual intervention.

Status
Completed: 9,947
Retrying: 52
Failed: 1

# Automatic retry
Batch 7ae4f partially succeeded with 53 failures and 9947 successful records
Retrying in 2.3s...
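
As an illustration of record-level recovery (not Exosphere's internal code), a retry loop that touches only failed records might look like this; all names and fields below are assumptions:

import time

MAX_RETRIES = 3        # per-record retry budget (illustrative)
BACKOFF_SECONDS = 2.3  # delay between retry waves (illustrative)

def retry_failed(records, run_inference):
    """Resubmit only failed records, leaving completed ones untouched."""
    failed = [r for r in records if r["status"] == "failed"]
    for _ in range(MAX_RETRIES):
        if not failed:
            break
        time.sleep(BACKOFF_SECONDS)
        results = run_inference(failed)  # resubmit just the failures
        failed = [r for r in results if r["status"] == "failed"]
    return failed  # records still failing after all retries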

bulk, without the bulkiness

Introducing Flexible SLA Inference

Choose your SLA window and data volume to optimise performance.

[Interactive pricing calculator: pick an SLA window from 1 to 24 hours and a volume from 500K to 12B tokens to see the estimated cost as a percentage of the base price.]

What is Exosphere doing behind the scenes?

A combination of techniques to optimise your inference

Smart batching and scheduling
Exosphere optimises batching and scheduling dynamically across your pending inference requests.
Prefix Caching for Common Workloads
We aggressively reuse computation for shared prompt prefixes across requests.
Prefix-Aware Routing
We route batched requests to the most suitable GPU replica. This enables cache reuse even in distributed, multi-GPU deployments.

Batch 1: 1 hour SLA
Batch 2: 10 hour SLA
Batch 3: 24 hour SLA

Request Prioritization with SLA-aware Scheduling
We prioritise requests based on their SLA requirements. This ensures that high-priority requests are completed within the defined time window (see the sketch after this list).
Dynamic Auto-scaling
We automatically scale the number of GPU replicas based on workload. This ensures high throughput for your inference jobs.
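
To make the prioritisation concrete, here is a minimal sketch of SLA-aware scheduling as earliest-deadline-first ordering; this illustrates the general technique, not Exosphere's internals, and the class name is made up:

import heapq
import time

class SlaScheduler:
    """Earliest-deadline-first queue: tighter SLAs are served first."""

    def __init__(self):
        self._queue = []  # min-heap ordered by absolute deadline

    def submit(self, request_id, sla_minutes):
        deadline = time.time() + sla_minutes * 60
        heapq.heappush(self._queue, (deadline, request_id))

    def next_batch(self, size):
        # Pop the `size` requests closest to missing their SLA
        batch = []
        while self._queue and len(batch) < size:
            _, request_id = heapq.heappop(self._queue)
            batch.append(request_id)
        return batch

scheduler = SlaScheduler()
scheduler.submit("batch-1", sla_minutes=60)    # 1 hour SLA
scheduler.submit("batch-3", sla_minutes=1440)  # 24 hour SLA
scheduler.submit("batch-2", sla_minutes=600)   # 10 hour SLA
print(scheduler.next_batch(2))  # ['batch-1', 'batch-2']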

All your favourite models

Supported model categories

OpenAI
Anthropic
DeepSeek
Meta LLaMA
Mistral
Gemini
Qwen
Custom OSS

Single API surface

Inference API signature

POST /v0/infer/

Request body

[
  {
    "sla": 60,
    "model": "deepseek:r1-32b",
    "input": "Hello model, how are you?"
  },
  {
    "sla": 1440,
    "model": "openai:gpt-4o",
    "input": "Hello OpenAI, how are you?"
  }
]

Response

{
  "status": "submitted",
  "task_id": "2f92fc35-07d6-4737-aefa-8ddffd32f3fc",
  "total_items": 2,
  "objects": [
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.3
      },
      "object_id": "63bb0b28-edfe-4f5b-9e05-9232f63d76ec"
    },
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.5
      },
      "object_id": "88d68bc3-643c-4251-a003-6c2c14f76649"
    }
  ]
}
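
Every field needed to track per-record pricing is already in this response; here it is read with the response object from the Python example above:

# Read the documented response fields: task_id, total_items, objects
data = response.json()
print(f"Task {data['task_id']}: {data['total_items']} items submitted")
for obj in data["objects"]:
    factor = obj["usage"]["price_factor"]
    print(f"  {obj['object_id']}: price factor {factor}")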

optimise your large scale inference

Single API call, any number of tokens. Reliably, within your defined SLA.