LLM provider reference

For customers running OpenClaw, NemoClaw, and other agent stacks: what to use, where to get API access, and how to think about cost. Last updated April 2026. Pricing changes frequently; verify on provider sites before committing production spend.

1. Model comparison at a glance

USD per 1M tokens unless noted. Self-host rows have no standard per-token public rate. Agentic score is a 1–10 shorthand for tool-use reliability, multi-step reasoning, and ecosystem maturity for autonomous workflows (indicative only).

| Model | Input $ | Output $ | Context | Agentic | Best for |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | 9.8 | Flagship: long-horizon agents, hard coding, enterprise reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | 9.3 | Best price/perf for production agents |
| GPT-4o | $2.50 | $10.00 | 128K | 9.0 | General agents, fast tool calls |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 8.5 | Huge-context document/video work |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | 8.5 | Cost-sensitive reasoning |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 8.0 | Fast structured agents |
| Grok 4.20 | $2.00 | $6.00 | 2M | 8.0 | Large context, real-time data |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | 8.0 | Budget production agents |
| Claude 3 Haiku | $0.25 | $1.25 | 200K | 7.5 | High-throughput chat / extraction |
| GPT-4o-mini | $0.15 | $0.60 | 128K | 7.0 | Testing, simple agents, volume |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 7.0 | Cheap chat, summarisation |
| Llama 3.1 405B | self-host | self-host | 128K | 7.0 | Private/local, no per-token cost |

2. Recommendations by use case

| Use case | Pick | Why |
|---|---|---|
| Mission-critical / long-horizon agent | Claude Opus 4.6 | Sustained reasoning across multi-hour sessions; 1M context at standard pricing. |
| Production agent (best value at frontier) | Claude Sonnet 4.6 | Roughly 60% the cost of Opus with 1M context; the sensible default for most agents. |
| Coding agent / dev tooling | Claude Opus 4.6 or GPT-4o | Opus for hard refactors and multi-file changes; GPT-4o when latency matters more than depth. |
| Long document / video analysis | Gemini 1.5 Pro or Sonnet 4.6 | Gemini for 2M context + native video; Sonnet 4.6 for stronger recall on text-heavy 1M prompts. |
| High-volume cheap automation | GPT-4o-mini or Gemini Flash | Sub-cent responses, fine for classify/extract. |
| Budget production agent | Grok 4.1 Fast or DeepSeek V3.2 | Near-frontier capability at a fraction of premium pricing. |
| Private / on-prem deployment | Llama 3.1 via NVIDIA NIM | No data leaves the box; GPU required. |
| Real-time / social data | Grok 4.20 | Native X integration, low latency. |

3. Getting an API key

Providers use bearer-token style credentials. AgenticHosting does not collect LLM API keys in the dashboard—create keys in each provider console, then enter them when your stack prompts you on the VPS (OpenClaw / NemoClaw installer, scripts, or config on the server). Follow LLM setup for the step-by-step flow.
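As a sketch of that flow, a quick pre-flight check on the VPS can confirm which keys are configured before the stack starts. The environment-variable names below are the conventional ones for each provider SDK (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`); your installer may use different names.

```python
import os

# Conventional env var names; adjust to whatever your stack's installer expects.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def check_keys(env=None):
    """Report which provider keys are present. Never log the key values themselves."""
    env = os.environ if env is None else env
    return {name: bool(env.get(var)) for name, var in PROVIDER_KEYS.items()}

if __name__ == "__main__":
    for provider, present in check_keys().items():
        print(f"{provider}: {'configured' if present else 'MISSING'}")
```

Run it once after entering keys; anything reported MISSING will cause the agent stack to fail at first API call.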

4. Cost optimisation

  • Prompt caching — Anthropic, OpenAI, and DeepSeek cache repeated prefixes; large savings on system prompts and tool definitions. Cache your agent system prompt where supported.
  • Batch APIs — OpenAI and Google offer discounted batch endpoints for non-realtime work.
  • Model routing — Route cheap turns to Haiku / Flash / mini; escalate only on low confidence.
  • Context pruning — Summarise old tool output instead of replaying full logs every step.
  • Local preprocessing — Small local models can handle extraction before a paid frontier call (GPU helps).
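The model-routing bullet above can be sketched as a small dispatcher. The model identifiers and the 0.8 confidence threshold here are illustrative assumptions, not fixed API values:

```python
# Illustrative model tiers; substitute whatever identifiers your stack uses.
CHEAP_MODEL = "claude-haiku-4-5"
FRONTIER_MODEL = "claude-sonnet-4-6"

def route(turn_type: str, confidence: float, threshold: float = 0.8) -> str:
    """Send simple, high-confidence turns to the cheap tier;
    escalate everything else to the frontier model."""
    if turn_type in {"classify", "extract", "route"} and confidence >= threshold:
        return CHEAP_MODEL
    return FRONTIER_MODEL
```

The escalation path matters more than the threshold value: a wrong cheap answer should trigger a frontier retry, not a silent failure.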

4b. Tuning Claude Opus 4.6 (effort + Fast mode)

Opus 4.6 exposes effort (adaptive thinking) and an optional Fast mode. Thinking tokens are billed at output rates, so effort drives cost on hard prompts. Fast mode trades money for latency: the request bills at a multiple of standard rates (e.g. $30/1M in and $150/1M out for Opus), applied to the whole request, including cached reads.

Worked example: Claude Sonnet 4.6 at effort high (the default: autonomous extended thinking) has an effective rate of $3.00 / $15.00 per 1M tokens in/out, so a sample request of 50k input and 5k output tokens costs about $0.2250 (indicative USD; verify on Anthropic's pricing page).

Example API parameters (JSON):

  {
    "model": "claude-sonnet-4-6",
    "effort": "high"
  }
| effort | Behaviour | When to use |
|---|---|---|
| low | Fast, minimal reasoning. | Routing, classification, short tool calls. |
| medium | Balanced; light reasoning when needed. | Standard chat and most agent tool steps. |
| high | Default. Autonomous extended thinking. | Multi-step reasoning, code review, planning. |
| max | Forces deep scrutiny; slowest, most costly. | Critical decisions, hard debugging, audits. |

Fast mode: use when a human is waiting (interactive coding, IDE assistants) or latency would otherwise break the workflow. Avoid for background agents and high-volume tool loops—the multiplier compounds quickly.

| Workload | Suggested config | Why |
|---|---|---|
| Interactive dev assistant | Opus 4.6, effort high, Fast on | User is watching the cursor. |
| Background autonomous agent | Sonnet 4.6, effort high, Fast off | Cheaper base; latency less important. |
| High-volume routing / classification | Sonnet 4.6 or Haiku 4.5, effort low | Avoid wasted reasoning on simple turns. |
| Hard one-shot reasoning | Opus 4.6, effort max, Fast off | Pay for depth, not speed. |

Apply these presets via the effort parameter and Fast mode flag, and estimate cost from the per-token rates in section 1.
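As a back-of-envelope check, a cost estimator following the billing rules above (thinking tokens billed at output rates; Fast mode repricing the whole request) might look like this. The rate figures are the indicative numbers from this page, not live pricing:

```python
# Indicative per-1M-token rates from the comparison table; verify before relying on them.
RATES = {
    "claude-opus-4-6": {"in": 5.00, "out": 25.00},
    "claude-sonnet-4-6": {"in": 3.00, "out": 15.00},
}
FAST_RATES = {
    # Assumed Fast-mode pricing for Opus, per the example earlier in this section.
    "claude-opus-4-6": {"in": 30.00, "out": 150.00},
}

def estimate_cost(model, input_tokens, output_tokens, thinking_tokens=0, fast=False):
    """Thinking tokens bill at the output rate; Fast mode swaps in premium
    rates for the whole request, including input."""
    rates = FAST_RATES[model] if fast else RATES[model]
    cost_in = input_tokens / 1_000_000 * rates["in"]
    cost_out = (output_tokens + thinking_tokens) / 1_000_000 * rates["out"]
    return round(cost_in + cost_out, 4)
```

For the worked example above, `estimate_cost("claude-sonnet-4-6", 50_000, 5_000)` reproduces the $0.2250 figure; adding thinking tokens or enabling `fast=True` shows why effort and Fast mode dominate cost on hard prompts.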

5. Practical notes for agentic workloads

  • Verify current Anthropic billing and access terms in their console—agent frameworks typically use API keys with per-token billing.
  • Watch for optional Claude modifiers (e.g. Fast mode, regional inference multipliers) documented on Anthropic's pricing page.
  • Google Cloud: enable billing controls and spending caps before pointing autonomous agents at Vertex.
  • Long context (Grok, Gemini) still scales cost with input size—big windows are not free context.
  • NVIDIA NIM is a common path for local inference on your own hardware; validate rate limits against your tier.
  • Implement retry with backoff for 429s—new accounts often hit rate limits early.
  • Check each provider's supported regions and terms; availability varies by country.
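The retry-with-backoff point above can be sketched generically; the exponential base, cap, and jitter values here are common defaults, not provider requirements, and the `.status_code` attribute is an assumption about how your client library surfaces HTTP 429s:

```python
import random
import time

def with_backoff(call, max_attempts=5, base=1.0, cap=30.0):
    """Retry `call` on rate-limit errors with exponential backoff and jitter.
    Assumes `call` raises an exception exposing .status_code == 429 on rate limits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise non-rate-limit errors and the final failed attempt.
            if getattr(exc, "status_code", None) != 429 or attempt == max_attempts - 1:
                raise
            # Exponential delay, capped, with jitter to avoid thundering herds.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Wrap every outbound LLM call in something like this before pointing an autonomous agent at a fresh account.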