
HTTP MCP: The Server Architecture That Scales

The choice

When building orchestration platforms for AI agents, you face a critical architectural decision: how do your agents communicate with tools?

Option 1: Subprocess invocation
# Every tool call spawns a new process
agent → fork() → tool-server → tool execution → exit()

Option 2: HTTP server + connection pooling
# Persistent server, connection reuse
agent → HTTP POST → [persistent tool-server] → tool execution
               ↑_____________________________↑
                 Connection reused for next call
The differences are subtle in design documents, but massive in production.

Performance head-to-head

We integrated Traceo (a 26-tool MCP server) into PEBBLE using HTTP transport. These are measurements taken on April 15, 2026:

Initialization cost

Transport  | Time  | Components
HTTP       | 195ms | Network (120ms) + server setup (50ms) + JSON parsing (25ms)
Subprocess | ~80ms | Process fork (60ms) + system startup (20ms)
HTTP looks slower here, but this cost is paid once per connection, not on every call.

Per-tool execution

Transport  | Time | Explanation
HTTP       | 20ms | Network roundtrip (15ms) + tool execution (5ms)
Subprocess | 15ms | No network, but: fork process (5ms) + load libraries (8ms) + execute (2ms)
HTTP is 5ms slower per call in this table, but here’s where it starts to win: the TCP connection is reused. You pay for connection setup once, not per call.
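
Connection reuse is easy to observe with the Python standard library alone. The sketch below is not MCP and not PEBBLE's code: `ToolHandler`, the `/call` path, and the tool names are hypothetical stand-ins. It starts a local HTTP/1.1 server, sends three requests over one `http.client.HTTPConnection`, and records the client's source port on the server side. If the port never changes, all three calls travelled over the same TCP connection.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

client_ports = []  # source port seen by the server for each request


class ToolHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 keeps the connection open between requests (keep-alive).
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        client_ports.append(self.client_address[1])
        body = json.dumps({"tool": request["tool"], "ok": True}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port for the stand-in tool server.
server = ThreadingHTTPServer(("127.0.0.1", 0), ToolHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One connection object, three tool calls.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
results = []
for name in ["tool_a", "tool_b", "tool_c"]:
    payload = json.dumps({"tool": name}).encode()
    conn.request("POST", "/call", body=payload,
                 headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    results.append(json.loads(response.read()))

conn.close()
server.shutdown()
server.server_close()

print("distinct client ports:", len(set(client_ports)))
```

A pooled client such as httpx or aiohttp does the same thing for you across many connections; the standard library is used here only to keep the example dependency-free.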

Total cost: Single tool call

Transport  | Time
HTTP       | 195ms init + 20ms call = 215ms
Subprocess | 80ms init + 15ms call = 95ms
For a single call, HTTP is slower. Now keep the server running and make more calls:

Total cost: 10-tool workflow

Transport  | Time
HTTP       | 195ms + (20ms × 10) = 395ms
Subprocess | 80ms + (15ms × 10) = 230ms
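
Both the single-call and the 10-tool totals follow one linear formula: initialization plus per-call cost times the number of calls. A minimal sketch using the figures from the tables (the function and dictionary names are ours, for illustration):

```python
def total_ms(init_ms: float, per_call_ms: float, calls: int) -> float:
    """Latency for ONE agent making `calls` sequential tool calls."""
    return init_ms + per_call_ms * calls


# Figures taken from the tables above.
HTTP = {"init_ms": 195, "per_call_ms": 20}
SUBPROCESS = {"init_ms": 80, "per_call_ms": 15}

for calls in (1, 10):
    http = total_ms(calls=calls, **HTTP)
    sub = total_ms(calls=calls, **SUBPROCESS)
    print(f"{calls:>2} calls: HTTP {http}ms, subprocess {sub}ms")
```

This model covers only one agent running tools back to back. It says nothing about process counts or memory, which is where the comparison changes.
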
HTTP still trails. But now measure:

Total cost: 100 simultaneous agents, each running 10 tools

Transport  | Total Time                  | Per-Agent Parallelism                | Resource Overhead
HTTP       | 395ms (amortized)           | ✓ Fully parallel (1 server instance) | 1 running process
Subprocess | 2300ms+ (serialize or fork) | ✗ Limited (process table explosion)  | 1000 processes spawned
The switchover point: at around 5-10 concurrent tool calls, HTTP becomes cheaper, not just in latency but in total resource consumption.

Why subprocess fails at scale

Process table explosion

Each tool call = new process:
100 agents × 10 tools = 1000 processes
1000 processes = Context switching overhead (50-100ms lost per quantum)
1000 processes = Memory fragmentation (64 MB → 1GB+)
1000 processes = File descriptor limits (ulimit -n)
The OS scheduler becomes the bottleneck, not your code.
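
To check how close a given workload comes to the descriptor limit on your own machine, a rough sketch like the following can help. It assumes a Unix host (the `resource` module is not available on Windows), assumes the worst case where every child is alive at the same moment, and assumes, purely for illustration, that the parent holds three pipe descriptors per child (stdin, stdout, stderr).

```python
import resource  # Unix only

FDS_PER_CHILD = 3  # assumption: stdin, stdout and stderr pipes held by the parent


def processes_spawned(agents: int, tools_per_agent: int) -> int:
    """One process per tool call, as in the subprocess model above."""
    return agents * tools_per_agent


def fits_under_limit(needed: int, soft_limit: int) -> bool:
    """True if `needed` descriptors fit under the soft limit."""
    return soft_limit == resource.RLIM_INFINITY or needed <= soft_limit


# The soft limit is what `ulimit -n` reports for this process.
soft_nofile, _hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)

children = processes_spawned(agents=100, tools_per_agent=10)
needed_fds = children * FDS_PER_CHILD
print(f"{children} children need about {needed_fds} descriptors; "
      f"soft limit is {soft_nofile}; fits: {fits_under_limit(needed_fds, soft_nofile)}")
```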

Fork() isn’t free

Process forking is expensive:
Operation           | Cost
fork()              | 50-80ms (copy-on-write, still expensive)
exec()              | 20-40ms (replace process image)
Python startup      | 20-60ms (import libraries)
Tool initialization | Variable
With subprocess transport, each tool call pays all four costs. With HTTP, the server pays them once at startup, and individual tool calls pay none of them.
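
These costs vary a lot between machines, so it is worth measuring them yourself. A minimal sketch: time how long it takes to spawn a Python interpreter that does nothing. This captures process creation, exec, and bare interpreter startup together, but not library imports or tool initialization, so treat the result as a lower bound for a real tool server.

```python
import statistics
import subprocess
import sys
import time


def spawn_ms(runs: int = 5) -> list[float]:
    """Wall-clock milliseconds to start and exit an empty Python process."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        samples.append((time.perf_counter() - start) * 1000)
    return samples


samples = spawn_ms()
print(f"median spawn + interpreter startup: {statistics.median(samples):.1f} ms")
```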

Memory bloat

Subprocess approach:
  1 parent process: 100 MB
  1000 child processes: 100 MB × 1000 = 100 GB

HTTP approach:
  1 server: 200 MB
  1 client library: 50 MB
  Total: 250 MB

Difference: 100 GB vs 250 MB = 400× overhead
Exaggerated example, but the pattern is real. Subprocess costs grow linearly with concurrency.
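
The same arithmetic as a small sketch, using the deliberately exaggerated figures above (100 MB per child, a 200 MB server plus a 50 MB client, 1 GB counted as 1000 MB). The function names are ours. It shows the subprocess footprint growing with the number of live children while the HTTP footprint stays flat.

```python
def subprocess_children_mb(children: int, mb_per_child: int = 100) -> int:
    """Memory held by child processes if all are alive at once."""
    return children * mb_per_child


def http_mb(server_mb: int = 200, client_mb: int = 50) -> int:
    """One server plus one client library, independent of call count."""
    return server_mb + client_mb


for children in (10, 100, 1000):
    sub = subprocess_children_mb(children)
    ratio = sub / http_mb()
    print(f"{children:>4} children: {sub} MB vs {http_mb()} MB ({ratio:.0f}x)")
```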

When subprocess wins

Subprocess is the right choice when:
  1. Tool isolation required — Each tool runs in its own security boundary
  2. Extreme crash tolerance — Tool crashes don’t affect others (fork isolates failures)
  3. Incompatible dependencies — Tool A needs Python 2, Tool B needs Python 3 (separate processes required)
  4. Single-tool workflows — You’re only running one tool, so initialization cost doesn’t matter
These are real constraints. Don’t ignore them.

When HTTP wins

HTTP is better when:
  1. Multi-tool workflows: More than 5 tools per agent session
  2. Concurrent agents: Multiple agents or users requesting tools simultaneously
  3. Scaling beyond a single machine: When you need more capacity, HTTP scales horizontally behind load balancers
  4. Tool reuse: Same tools called repeatedly (connection pooling amortizes network cost)
  5. Infrastructure simplicity: One running process vs. process explosion

Architecture patterns

Pattern 1: Subprocess (Traditional)

Agent → OS process table → [fork] → Tool Server #1
                        → [fork] → Tool Server #2
                        → [fork] → Tool Server #3
Good for: Single-tool, high isolation
Bad for: Scale, resource efficiency

Pattern 2: HTTP per tool (PEBBLE’s approach)

Agent → HTTP LB → [tool-server-1 instance] → Tool A, B, C, D
                → [tool-server-2 instance] → Tool E, F, G
                → [tool-server-3 instance] → Tool H, ...
Good for: Scale, resource efficiency, multi-service
Bad for: Isolation (network breach = all tools compromised)

Pattern 3: HTTP + isolated containers (Enterprise)

Agent → HTTP LB → [container: tool A] → isolated tool
                → [container: tool B] → isolated tool
                → [container: tool C] → isolated tool
Good for: Scale + isolation
Cost: Kubernetes + orchestration complexity

Implementation checklist

If you’re moving from subprocess to HTTP MCP:
  • Choose FastMCP or a similar HTTP framework
  • Implement session management (mcp-session-id header)
  • Add connection pooling (requests library, aiohttp, httpx)
  • Deploy behind a load balancer (Nginx, HAProxy, cloud LB)
  • Monitor latency with Prometheus (init + per-tool p95)
  • Set up horizontal scaling rules (e.g., add instance when CPU > 70%)
  • Test failover (one server down, traffic reroutes)
  • Document connection limits (TCP connections, memory per server)
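
For the latency item, it helps to be clear about what p95 means before wiring up dashboards. The sketch below computes it with the standard library as a stand-in for what a Prometheus histogram query would report. The sample values are made up for illustration and are not measurements from this article.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th percentile of the samples.

    quantiles(n=20) returns 19 cut points; the last one sits at 95%.
    method="inclusive" keeps the result within the observed range.
    """
    return statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]


# Made-up per-tool latencies in milliseconds, with one slow outlier.
per_tool_ms = [18, 19, 20, 20, 21, 22, 20, 19, 45, 20, 21, 20]

print(f"per-tool median: {statistics.median(per_tool_ms):.1f} ms")
print(f"per-tool p95:    {p95(per_tool_ms):.1f} ms")
```

Tracking p95 alongside the median makes a slow tail visible that an average would hide.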

The verdict

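For production AI agent platforms, HTTP MCP is the better default. Subprocess has its place (isolation requirements, crash tolerance, single-use tools). But if you’re building a platform for multiple agents running multiple tools, HTTP’s cost advantage compounds with scale. The initial 195ms initialization penalty is paid once, and connection reuse holds every later call at about 20ms with no new process to fork, load, and tear down. Per call, subprocess measured 5ms faster in our tests; HTTP’s savings come from running one server instead of one process per call, and they multiply across thousands of calls. Past the switchover point of roughly 5-10 concurrent tool calls, HTTP is cheaper in total resource consumption and simpler to operate.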

What we built

PEBBLE uses HTTP MCP exclusively:
  • Traceo (requirements): HTTP at mcp.traceo.cat
  • Future providers: HTTP or gRPC (not subprocess)
  • Scaling: Add instances horizontally, route via LB
This decision emerged not from theory, but from measurement and deployment. 195ms initialization + 20ms per tool has proven reliable at scale.

Quick reference: Decision tree

Do you need tool isolation? → Yes → Use subprocess or containers
                            → No → Continue

Are you running 5+ tools per session? → Yes → Use HTTP
                                      → No → Subprocess OK

Do you have 10+ concurrent agents? → Yes → Use HTTP (mandatory)
                                   → No → Your choice, HTTP recommended

Will you scale to multiple machines? → Yes → Use HTTP
                                     → No → Subprocess OK, but HTTP doesn't hurt
If you answered “use HTTP” twice or more, deploy HTTP.
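
The tree can also be written as a small function. This is one possible reading of it, not a definitive rule: isolation short-circuits everything, 10+ concurrent agents forces HTTP because the tree marks that branch as mandatory, and otherwise HTTP is chosen when both remaining questions point to it. The function name, parameter names, and return strings are ours.

```python
def recommend_transport(needs_isolation: bool, tool_count: int,
                        concurrent_agents: int, multi_machine: bool) -> str:
    """Encode the decision tree above under the assumptions stated in the text."""
    if needs_isolation:
        return "subprocess or containers"
    if concurrent_agents >= 10:
        return "http"  # marked "mandatory" in the tree
    # Count the remaining questions that answer "use HTTP".
    http_votes = sum([tool_count >= 5, multi_machine])
    if http_votes >= 2:
        return "http"
    return "either (http recommended)"


print(recommend_transport(False, tool_count=10, concurrent_agents=100, multi_machine=True))
print(recommend_transport(False, tool_count=1, concurrent_agents=1, multi_machine=False))
```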