The Infrastructure Stack for Production AI Agents
April 17, 2026
What you actually need to run AI agents reliably in production — beyond the model itself

Everyone talks about the model. The prompt. The fine-tuning. But if you've ever actually tried to run AI agents in production, you know the model is maybe 20% of the problem. The rest is infrastructure.
This post maps out what a real production AI agent stack looks like, what components you need, and why most teams underestimate the complexity until they're already deep in troubleshooting.
What "Production" Actually Means
Development is easy. You open your laptop, run your prompt, see the output, iterate. Production is different.
Production means:
- Availability — Your agents need to work when users need them, not when you've got time to restart a container
- Throughput — One request at a time is fine for development; production needs concurrency
- Reliability — Single points of failure are not acceptable in production; you need redundancy
- Observability — You need to see what's happening, why something failed, what your latency distribution looks like
- Security — Your agents are processing real data, interacting with real systems; the attack surface is real
- Cost control — You need to know what each request costs, not just in aggregate but per operation
This is infrastructure territory. And it's where most "just use the API" approaches fall apart.
The Layer Cake
Let's break down what a complete production AI agent infrastructure looks like, from bottom to top:
Layer 1: Compute
At the base, you need GPU compute. Not just any compute — memory bandwidth matters more than raw core count for most transformer models. Here's what matters:
- GPU memory — Your model needs to fit in VRAM plus room for context windows, KV cache, and any batching. A 70B model needs roughly 140GB just for the weights in FP16. That's multiple GPUs.
- Inter-GPU connectivity — If you're running multi-GPU setups, you need fast interconnects (NVLink or at minimum 100Gbps) for model parallelism. PCIe bandwidth becomes a bottleneck quickly.
- CPU fallback — Not everything needs GPU. Some operations (tokenisation, simple routing, document parsing) run fine on CPU and save your GPU for what actually needs it.
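To make the memory arithmetic above concrete, here is a rough sizing sketch. The weight figure follows directly from parameter count times bytes per parameter. The KV cache formula and the model shape used in the example (80 layers, 8 KV heads, head dimension 128, 8K context, batch of 16) are illustrative assumptions, so substitute the numbers for your own model.

```python
# Rough VRAM sizing sketch. All model-shape numbers below are illustrative
# assumptions, not measurements of any particular deployment.

GB = 1e9  # decimal gigabytes, matching the "140GB" figure above


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor of 2) stored per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    weights = weight_bytes(70e9, 2)  # 70B parameters in FP16 (2 bytes each)
    cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                           seq_len=8192, batch_size=16)
    print(f"weights:  {weights / GB:.1f} GB")
    print(f"KV cache: {cache / GB:.1f} GB")
    print(f"total:    {(weights + cache) / GB:.1f} GB before activations and overhead")
```

Treat the result as a lower bound: activations, framework overhead, and memory fragmentation all add to it.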
Layer 2: Model Serving
Once you've got compute, you need to serve models. This isn't just "run the model" — it's a sophisticated orchestration layer:
- Inference servers — vLLM, TensorRT-LLM, or Triton for high-throughput inference. These handle batch scheduling, KV cache management, continuous batching. You don't write this from scratch.
- Model loading — Getting models from storage into GPU memory efficiently. Quantisation methods and formats (GPTQ, AWQ, GGUF) help fit larger models on smaller hardware.
- Hot swapping — Updating models without downtime. This matters because model updates will happen — you need to do them without killing active requests.
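The hot-swapping idea can be sketched in a few lines. This is a minimal illustration, not how any particular inference server does it: `ModelSlot` and `handle_request` are hypothetical names, and the "model" is any Python callable. The point is that a request takes a reference to the model once, so a swap never pulls the model out from under a request that is already running.

```python
import threading


class ModelSlot:
    """Holds the active model; in-flight requests keep the reference they took.

    Hypothetical sketch: `model` is any callable. Real inference servers also
    have to manage GPU memory and drain old replicas.
    """

    def __init__(self, model, version: str):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def current(self):
        # A request calls this once and uses the result for its whole lifetime.
        with self._lock:
            return self._model, self._version

    def swap(self, new_model, new_version: str) -> str:
        # The new model must be fully loaded before this call. The swap itself
        # is only a reference change, so no request sees a half-loaded model.
        with self._lock:
            old_version = self._version
            self._model, self._version = new_model, new_version
        return old_version


def handle_request(slot: ModelSlot, prompt: str) -> str:
    model, version = slot.current()
    return f"[{version}] {model(prompt)}"
```

In practice the old model can only be unloaded once every request holding a reference to it has finished, which is where most of the real complexity lives.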
Layer 3: Agent Orchestration
Now you're running models. Next, you need to orchestrate multi-step agent workflows:
- State management — Agents have memory, context, conversation history. You need to track what state lives where (session, database, in-memory) and how it gets passed between steps.
- Tool calling — Your agent needs to actually do things: query databases, call APIs, execute code. You need a tool abstraction layer that lets agents invoke external systems securely.
- Retry logic — Things fail. Network timeouts, rate limits, model errors. Your agent needs to handle this gracefully without getting stuck in retry loops.
- Circuit breakers — When a downstream service goes down, you need to fail fast rather than queue up thousands of requests waiting for a service that won't recover.
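Retry logic and circuit breakers are easier to reason about with code in front of you. The sketch below assumes a simple consecutive-failure breaker and capped exponential backoff with jitter. The class names, thresholds, and delays are placeholders to tune for your own downstream services.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == max_attempts:
                raise  # bounded attempts: no endless retry loop
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A typical composition would be `retry(lambda: breaker.call(call_tool, args))`, where `call_tool` stands in for whatever external call your agent makes. Retries absorb transient failures, and the breaker stops the retries once the service is clearly down.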
Layer 4: API and Interface
With orchestration in place, you need to expose your agents to the world:
- REST or gRPC APIs — Your agents need interfaces that your applications can call. This means request validation, authentication, rate limiting, logging.
- WebSocket support — For agents that take a while, you need streaming responses. Not optional.
- Authentication — Who's calling your agent? What are they allowed to do? This gets complex fast.
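Rate limiting is one of the API-layer pieces that is simple to sketch. The example below assumes a token-bucket limiter keyed by API key. The `TokenBucket` class, the `check_rate_limit` helper, and the default rate and capacity are all hypothetical, and a real deployment would usually keep this state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per caller, so one noisy client cannot starve the others.
buckets = {}


def check_rate_limit(api_key: str, rate: float = 5.0, capacity: float = 10.0) -> bool:
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate, capacity)
    return buckets[api_key].allow()
```

For agents, the `cost` argument is worth using: charging a request by its expected token count tends to track GPU load better than counting requests.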
Layer 5: Observability
Infrastructure without observability is a black box. For production AI agents, you need:
- Request tracing — Track a request through every step. Langfuse, Weave, or custom solutions. This is essential for debugging.
- Metrics — Latency percentiles (p50, p95, p99), throughput, error rates, GPU utilisation, token throughput. Everything.
- Logging — Structured logs with correlation IDs so you can reconstruct what happened.
- Alerting — When something breaks, you need to know. Not after a customer emails you.
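Two of the items above, latency percentiles and structured logs with correlation IDs, need very little code to get started. The sketch below uses only the standard library. The nearest-rank percentile method, the field names, and the sample latencies are assumptions for illustration, and a metrics backend would normally compute percentiles for you.

```python
import json
import math
import uuid


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def log_event(correlation_id, step, **fields):
    """Emit one JSON object per line so logs can be filtered by correlation ID."""
    record = {"correlation_id": correlation_id, "step": step, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


if __name__ == "__main__":
    request_id = str(uuid.uuid4())  # generated once, passed through every step
    log_event(request_id, "retrieve", latency_ms=42)
    log_event(request_id, "generate", latency_ms=910, output_tokens=256)

    latencies_ms = [120, 135, 150, 180, 210, 260, 340, 520, 900, 2400]  # made-up samples
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

In the made-up sample, the median is 210 ms while p95 and p99 both land on the 2400 ms outlier, which is why averages alone hide the requests your users actually complain about.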
Layer 6: Security
Production systems need security at every layer:
- Network isolation — Your GPU servers shouldn't be directly exposed to the internet
- Encryption — Data at rest (model weights, stored data) and in transit (all API traffic)
- Access control — Who can deploy models, change configuration, access logs
- Audit trails — Who did what, when, why
The Integration Problem
Here's what makes this hard: none of these layers exist in isolation. They're interdependent, and they need to work together.
- Your inference server needs to know about your GPU topology
- Your orchestration layer needs to handle backpressure from your inference layer
- Your API layer needs to expose the right metrics from the layers below
- Your security needs to be baked into every layer, not bolted on
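Backpressure is a good example of two layers having to cooperate. One simple approach, sketched below under the assumption of a single in-process queue, is to bound the number of pending inference requests and reject new work once the bound is hit, so the orchestration or API layer can back off instead of piling up requests. `BoundedInferenceQueue` and `Overloaded` are hypothetical names.

```python
import queue


class Overloaded(Exception):
    """Signals the caller to back off, for example by returning HTTP 429 or 503."""


class BoundedInferenceQueue:
    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            # Never block the caller: either accept the work now or say no.
            self._pending.put_nowait(request)
        except queue.Full:
            raise Overloaded(f"{self._pending.qsize()} requests already pending") from None

    def next_request(self):
        # Called by the worker that feeds the inference server.
        return self._pending.get_nowait()
```

Rejecting early looks harsh, but it keeps latency predictable for the requests you do accept and gives the layers above a clear signal to act on.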
This is why teams underestimate the effort. They see "run a model" and think it's one thing. It's actually six layers that all need to work together.
What Most Teams Do Wrong
Common mistakes we see:
Starting too complex — Trying to build the entire stack before they have a working agent. Start simple, add layers as you need them.
Skipping observability — "We'll add monitoring later." No you won't. You won't have time, and when things break you'll wish you had it. Add it from day one.
Underestimating GPU supply — Getting GPUs in 2026 is still hard. Plan lead times, have backup sources, don't assume you can just buy what you need when you need it.
Treating AI infrastructure like web infrastructure — It's different. Your scaling properties are different, your failure modes are different, your cost model is different. Don't just apply the same patterns.
Ignoring cost until it hurts — GPU time is expensive. You need to understand your cost per request, your cost per token, your cost per user. Without this, you can't make intelligent tradeoffs.
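A back-of-the-envelope cost model is enough to start making those tradeoffs. The sketch below assumes self-hosted GPUs billed by the hour. The prices, throughput, and utilisation in the example are made-up inputs, not benchmarks, so plug in your own measurements.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, n_gpus: int,
                            tokens_per_second: float,
                            utilisation: float = 1.0) -> float:
    """USD per million tokens for a deployment billed per GPU-hour.

    `utilisation` is the fraction of each hour spent serving traffic;
    idle GPU time still costs money, so lower utilisation raises the cost.
    """
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000


def cost_per_request(usd_per_million_tokens: float, tokens_per_request: float) -> float:
    return usd_per_million_tokens * tokens_per_request / 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: 4 GPUs at $2.50/hour, 2,000 tokens/s, 50% utilised.
    per_million = cost_per_million_tokens(2.50, 4, 2000, utilisation=0.5)
    print(f"${per_million:.2f} per million tokens")
    print(f"${cost_per_request(per_million, 1500):.4f} per 1,500-token request")
```

Cost per user then follows from requests per user. The utilisation term is the one teams most often forget when comparing self-hosting against per-token API pricing.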
A Practical Path
Here's what we recommend for teams building production agents:
- Start with API — Use cloud APIs for initial development and prototyping. It's the right choice for learning what your agent actually needs to do.
- Profile your workload — Understand your actual token consumption, latency requirements, and concurrency needs before you invest in infrastructure.
- Start self-hosted for the right reasons — Not because it's "cool" but because you've identified a specific need: cost at scale, data control, compliance, or reliability requirements.
- Build incrementally — Add infrastructure layers one at a time. Get each layer solid before moving up.
- Use managed components where possible — You don't need to build everything. Use managed inference (like us), managed observability, managed authentication. Focus your effort on where you add unique value.
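For the profiling step in the list above, Little's law gives a quick first estimate of concurrency: the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The sketch below applies it. The traffic numbers and the per-replica concurrency limit are hypothetical and should come from your own measurements.

```python
import math


def requests_in_flight(arrivals_per_second: float, mean_latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x average time in system."""
    return arrivals_per_second * mean_latency_seconds


def replicas_needed(in_flight: float, max_concurrency_per_replica: int,
                    headroom: float = 1.0) -> int:
    """Replicas required to hold the in-flight load, with optional headroom for spikes."""
    return math.ceil(in_flight * headroom / max_concurrency_per_replica)


if __name__ == "__main__":
    # Hypothetical workload: 20 requests/s, each taking 3 s end to end.
    in_flight = requests_in_flight(20, 3.0)
    print(f"average requests in flight: {in_flight:.0f}")
    print(f"replicas at 16 concurrent each, 1.5x headroom: {replicas_needed(in_flight, 16, 1.5)}")
```

This is an average, so it says nothing about bursts. Treat it as a starting point and check it against your measured peak traffic.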
The Bottom Line
Building production AI agents is a software engineering discipline. It requires the same rigor you'd apply to any critical system: layered architecture, observability, security, testing.
The model gets all the attention, but the infrastructure is what determines whether you can actually run that model reliably at the scale your business needs.
If you're serious about building AI into your product, take infrastructure seriously from the start. It's the difference between an impressive demo and a reliable production system.