LLM provider reference

For customers running OpenClaw, NemoClaw, and other agent stacks: what to use, where to get API access, and how to think about cost. Last updated April 2026. Pricing changes frequently; verify on provider sites before committing production spend.

1. Model comparison at a glance

USD per 1M tokens unless noted. Self-host rows have no standard per-token public rate. Agentic score is a 1–10 shorthand for tool-use reliability, multi-step reasoning, and ecosystem maturity for autonomous workflows (indicative only).

| Model | Input $ | Output $ | Context | Agentic | Best for |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | 9.8 | Flagship: long-horizon agents, hard coding, enterprise reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | 9.3 | Best price/perf for production agents |
| GPT-4o | $2.50 | $10.00 | 128K | 9.0 | General agents, fast tool calls |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 8.5 | Huge-context document/video work |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | 8.5 | Cost-sensitive reasoning |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 8.0 | Fast structured agents |
| Grok 4.20 | $2.00 | $6.00 | 2M | 8.0 | Large context, real-time data |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | 8.0 | Budget production agents |
| Claude 3 Haiku | $0.25 | $1.25 | 200K | 7.5 | High-throughput chat / extraction |
| GPT-4o-mini | $0.15 | $0.60 | 128K | 7.0 | Testing, simple agents, volume |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 7.0 | Cheap chat, summarisation |
| Llama 3.1 405B | self-host | self-host | 128K | 7.0 | Private/local, no per-token cost |

2. Recommendations by use case

| Use case | Pick | Why |
|---|---|---|
| Mission-critical / long-horizon agent | Claude Opus 4.6 | Sustained reasoning across multi-hour sessions; 1M context at standard pricing. |
| Production agent (best value at frontier) | Claude Sonnet 4.6 | Roughly 60% the cost of Opus with 1M context; the sensible default for most agents. |
| Coding agent / dev tooling | Claude Opus 4.6 or GPT-4o | Opus for hard refactors and multi-file changes; GPT-4o when latency matters more than depth. |
| Long document / video analysis | Gemini 1.5 Pro or Sonnet 4.6 | Gemini for 2M context + native video; Sonnet 4.6 for stronger recall on text-heavy 1M prompts. |
| High-volume cheap automation | GPT-4o-mini or Gemini Flash | Sub-cent responses, fine for classify/extract. |
| Budget production agent | Grok 4.1 Fast or DeepSeek V3.2 | Near-frontier capability at a fraction of premium pricing. |
| Private / on-prem deployment | Llama 3.1 via NVIDIA NIM | No data leaves the box; GPU required. |
| Real-time / social data | Grok 4.20 | Native X integration, low latency. |

3. Getting an API key

Providers use bearer-token style credentials. AgenticHosting does not collect LLM API keys in the dashboard—create keys in each provider console, then enter them when your stack prompts you on the VPS (OpenClaw / NemoClaw installer, scripts, or config on the server). Follow LLM setup for the step-by-step flow.
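As a sketch of that flow, a quick pre-flight check on the VPS can confirm which keys are configured before the stack starts. The environment-variable names below are the conventional ones for each provider SDK (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`); your installer may use different names.

```python
import os

# Conventional env var names; adjust to whatever your stack's installer expects.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def check_keys(env=None):
    """Report which provider keys are present. Never log the key values themselves."""
    env = os.environ if env is None else env
    return {name: bool(env.get(var)) for name, var in PROVIDER_KEYS.items()}

if __name__ == "__main__":
    for provider, present in check_keys().items():
        print(f"{provider}: {'configured' if present else 'MISSING'}")
```

Run it once after entering keys; anything reported MISSING will cause the agent stack to fail at first API call.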

4. Cost optimisation

  • Prompt caching — Anthropic, OpenAI, and DeepSeek cache repeated prefixes; large savings on system prompts and tool definitions. Cache your agent system prompt where supported.
  • Batch APIs — OpenAI and Google offer discounted batch endpoints for non-realtime work.
  • Model routing — Route cheap turns to Haiku / Flash / mini; escalate only on low confidence.
  • Context pruning — Summarise old tool output instead of replaying full logs every step.
  • Local preprocessing — Small local models can handle extraction before a paid frontier call (GPU helps).
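The model-routing bullet above can be sketched as a small dispatcher. The model identifiers and the 0.8 confidence threshold here are illustrative assumptions, not fixed API values:

```python
# Illustrative model tiers; substitute whatever identifiers your stack uses.
CHEAP_MODEL = "claude-haiku-4-5"
FRONTIER_MODEL = "claude-sonnet-4-6"

def route(turn_type: str, confidence: float, threshold: float = 0.8) -> str:
    """Send simple, high-confidence turns to the cheap tier;
    escalate everything else to the frontier model."""
    if turn_type in {"classify", "extract", "route"} and confidence >= threshold:
        return CHEAP_MODEL
    return FRONTIER_MODEL
```

The escalation path matters more than the threshold value: a wrong cheap answer should trigger a frontier retry, not a silent failure.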

4b. Tuning Claude Opus 4.6 (effort + Fast mode)

Opus 4.6 exposes effort (adaptive thinking) and an optional Fast mode. Thinking tokens are billed at output rates, so effort drives cost on hard prompts. Fast mode trades money for latency: the request bills at a multiple of standard rates (e.g. $30/1M in and $150/1M out for Opus), applied to the whole request, including cached reads.

Worked example: Claude Sonnet 4.6 at effort high (the default: autonomous extended thinking) has an effective rate of $3.00 / $15.00 per 1M tokens in/out, so a sample request of 50k input and 5k output tokens costs about $0.2250 (indicative USD; verify on Anthropic's pricing page).

Example API parameters (JSON):

  {
    "model": "claude-sonnet-4-6",
    "effort": "high"
  }
| effort | Behaviour | When to use |
|---|---|---|
| low | Fast, minimal reasoning. | Routing, classification, short tool calls. |
| medium | Balanced; light reasoning when needed. | Standard chat and most agent tool steps. |
| high | Default. Autonomous extended thinking. | Multi-step reasoning, code review, planning. |
| max | Forces deep scrutiny; slowest, most costly. | Critical decisions, hard debugging, audits. |

Fast mode: use when a human is waiting (interactive coding, IDE assistants) or latency would otherwise break the workflow. Avoid for background agents and high-volume tool loops—the multiplier compounds quickly.

| Workload | Suggested config | Why |
|---|---|---|
| Interactive dev assistant | Opus 4.6, effort high, Fast on | User is watching the cursor. |
| Background autonomous agent | Sonnet 4.6, effort high, Fast off | Cheaper base; latency less important. |
| High-volume routing / classification | Sonnet 4.6 or Haiku 4.5, effort low | Avoid wasted reasoning on simple turns. |
| Hard one-shot reasoning | Opus 4.6, effort max, Fast off | Pay for depth, not speed. |

Apply these presets via the effort parameter and Fast mode flag, and estimate cost from the per-token rates in section 1.
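As a back-of-envelope check, a cost estimator following the billing rules above (thinking tokens billed at output rates; Fast mode repricing the whole request) might look like this. The rate figures are the indicative numbers from this page, not live pricing:

```python
# Indicative per-1M-token rates from the comparison table; verify before relying on them.
RATES = {
    "claude-opus-4-6": {"in": 5.00, "out": 25.00},
    "claude-sonnet-4-6": {"in": 3.00, "out": 15.00},
}
FAST_RATES = {
    # Assumed Fast-mode pricing for Opus, per the example earlier in this section.
    "claude-opus-4-6": {"in": 30.00, "out": 150.00},
}

def estimate_cost(model, input_tokens, output_tokens, thinking_tokens=0, fast=False):
    """Thinking tokens bill at the output rate; Fast mode swaps in premium
    rates for the whole request, including input."""
    rates = FAST_RATES[model] if fast else RATES[model]
    cost_in = input_tokens / 1_000_000 * rates["in"]
    cost_out = (output_tokens + thinking_tokens) / 1_000_000 * rates["out"]
    return round(cost_in + cost_out, 4)
```

For the worked example above, `estimate_cost("claude-sonnet-4-6", 50_000, 5_000)` reproduces the $0.2250 figure; adding thinking tokens or enabling `fast=True` shows why effort and Fast mode dominate cost on hard prompts.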

5. Practical notes for agentic workloads

  • Verify current Anthropic billing and access terms in their console—agent frameworks typically use API keys with per-token billing.
  • Watch for optional Claude modifiers (e.g. Fast mode, regional inference multipliers) documented on Anthropic's pricing page.
  • Google Cloud: enable billing controls and spending caps before pointing autonomous agents at Vertex.
  • Long context (Grok, Gemini) still scales cost with input size—big windows are not free context.
  • NVIDIA NIM is a common path for local inference on your own hardware; validate rate limits against your tier.
  • Implement retry with backoff for 429s—new accounts often hit rate limits early.
  • Check each provider's supported regions and terms; availability varies by country.
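The retry-with-backoff point above can be sketched generically; the exponential base, cap, and jitter values here are common defaults, not provider requirements, and the `.status_code` attribute is an assumption about how your client library surfaces HTTP 429s:

```python
import random
import time

def with_backoff(call, max_attempts=5, base=1.0, cap=30.0):
    """Retry `call` on rate-limit errors with exponential backoff and jitter.
    Assumes `call` raises an exception exposing .status_code == 429 on rate limits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise non-rate-limit errors and the final failed attempt.
            if getattr(exc, "status_code", None) != 429 or attempt == max_attempts - 1:
                raise
            # Exponential delay, capped, with jitter to avoid thundering herds.
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Wrap every outbound LLM call in something like this before pointing an autonomous agent at a fresh account.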