A single inference API that adapts to your volume, SLA, and cost constraints.
Why Large Scale Inference?
A single API to run inference at dynamic batch sizes. No manual batching or scheduling needed.
Just call the API with an SLA, and Exosphere automatically batches and schedules your requests, optimising GPU usage and rate limits for lower costs and faster inference.
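As an illustration, a client call can be as simple as building a JSON body with an SLA window per request. The field names (`sla`, `model`, `input`) mirror the example request shown on this page; `build_infer_request` is a hypothetical helper, not part of any SDK:

```python
import json


def build_infer_request(items):
    """Build the JSON body for /v0/infer.

    Each item is (sla_minutes, model_id, prompt_text); the field names
    match the example request on this page.
    """
    body = [
        {"sla": sla_minutes, "model": model, "input": text}
        for sla_minutes, model, text in items
    ]
    return json.dumps(body)


payload = build_infer_request([
    (60, "deepseek:r1-32b", "Hello model, how are you?"),
    (1440, "openai:gpt-4o", "Hello OpenAI, how are you?"),
])
```

Everything else, from batching to scheduling against the SLA, happens server-side.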
Send a list of prompt/input pairs or a JSONL file with bulk inference and get a task_id to track your requests while Exosphere manages inference in the background.
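For bulk inference, prompt/input pairs can be serialised as JSONL, one JSON object per line. A minimal sketch (the `prompt`/`input` field names here are assumptions for illustration, not a documented schema):

```python
import json


def to_jsonl(pairs):
    """Serialise (prompt, input) pairs as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "input": text}) for prompt, text in pairs
    )


jsonl = to_jsonl([
    ("Summarise:", "The quick brown fox jumps over the lazy dog."),
    ("Translate to French:", "Hello."),
])
```

Submitting that file returns a task_id you can poll while the batch runs.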
Evals run on each output and node after generation, analysing every record independently. Bring your favourite evals with custom metrics and scoring, with record- and node-level observability across the batch.
Failures should not restart entire jobs.
Set up failure policies to automatically retry failed records, with node-level policies and fallback options.
You get full job completion without manual intervention.
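The semantics of such a policy can be sketched locally: retry a failed record a fixed number of times, then fall back to an alternative model, without restarting the rest of the job. The policy shape below is an assumption for illustration, not Exosphere's actual configuration format:

```python
def run_with_policy(record, run, policy):
    """Retry the primary model up to max_retries times, then try fallbacks.

    Only the failing record is retried; other records in the job are
    unaffected. `policy` is an illustrative dict, not a documented schema.
    """
    models = [policy["model"]] + policy.get("fallbacks", [])
    for model in models:
        for _ in range(policy.get("max_retries", 1)):
            try:
                return run(model, record)
            except RuntimeError:
                continue  # retry this record only; the job keeps going
    raise RuntimeError(f"record failed on all models: {record!r}")


# Toy run function: the primary model always fails, the backup succeeds.
calls = {"n": 0}

def flaky(model, record):
    calls["n"] += 1
    if model == "primary":
        raise RuntimeError("transient failure")
    return f"{model}:{record}"

policy = {"model": "primary", "fallbacks": ["backup"], "max_retries": 2}
result = run_with_policy("rec-1", flaky, policy)
```

Here the record fails twice on the primary model, then completes on the fallback, so the job finishes without manual intervention.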
Introducing Flexible SLA Inference
Choose your SLA window and data volume to optimize performance.
Estimated Cost
As a percentage of the base price
What is Exosphere doing behind the scenes?
Batch 1: 1-hour SLA
Batch 2: 10-hour SLA
Batch 3: 24-hour SLA
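Conceptually, incoming requests are bucketed by SLA window so tighter deadlines are scheduled first. A toy sketch of that grouping (the bucket boundaries are assumptions for illustration, not Exosphere's actual scheduler):

```python
def bucket_by_sla(requests):
    """Group requests into 1-hour, 10-hour, and 24-hour SLA batches.

    `sla` is a deadline in minutes, as in the API example; boundaries
    here are illustrative only.
    """
    buckets = {"1h": [], "10h": [], "24h": []}
    for req in requests:
        if req["sla"] <= 60:
            buckets["1h"].append(req)
        elif req["sla"] <= 600:
            buckets["10h"].append(req)
        else:
            buckets["24h"].append(req)
    return buckets


batches = bucket_by_sla([{"sla": 60}, {"sla": 1440}, {"sla": 300}])
```

Each batch can then be packed onto GPUs and scheduled within its deadline.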
/v0/infer

Request:

[
  {
    "sla": 60,
    "model": "deepseek:r1-32b",
    "input": "Hello model, how are you?"
  },
  {
    "sla": 1440,
    "model": "openai:gpt-4o",
    "input": "Hello OpenAI, how are you?"
  }
]

Response:

{
  "status": "submitted",
  "task_id": "2f92fc35-07d6-4737-aefa-8ddffd32f3fc",
  "total_items": 2,
  "objects": [
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.3
      },
      "object_id": "63bb0b28-edfe-4f5b-9e05-9232f63d76ec"
    },
    {
      "status": "submitted",
      "usage": {
        "input_tokens": 10,
        "price_factor": 0.5
      },
      "object_id": "88d68bc3-643c-4251-a003-6c2c14f76649"
    }
  ]
}

A single API call, any number of tokens. Reliably, within your defined SLA.