OpenClaw + Ollama not working? Fixing streaming errors, tool-calling failures, and model hangs
Fixing OpenClaw with Ollama: streaming protocol errors, tool-calling corruption, model discovery timeouts, and correct provider configuration.
The Promise and the Pain of Local LLMs
Running OpenClaw with Ollama gives you a fully local, private AI agent -- no API keys, no usage charges, no data leaving your server. It's the dream setup for self-hosters who care about privacy and cost control.
The reality is messier: Ollama's OpenAI-compatible API has subtle incompatibilities that break OpenClaw's tool calling; streaming responses corrupt mid-generation; models hang indefinitely during complex reasoning; and the wrong configuration format fails silently, with no error messages.
Here's how to fix every common issue when running OpenClaw with Ollama.
The #1 Mistake: Using the OpenAI-Compatible Endpoint
Most guides tell you to configure Ollama as an "OpenAI-compatible" provider in OpenClaw. This technically works for simple chat, but breaks tool calling -- which is the core of what makes OpenClaw useful.
The Problem
When you configure Ollama with OpenClaw's OpenAI provider and set the base URL to http://localhost:11434/v1, you're using Ollama's OpenAI compatibility layer. This layer has a known streaming bug: when stream: true is set, the tool_calls field in the response gets corrupted or dropped entirely.
The result: OpenClaw sends a request that requires a tool call, the model generates the correct tool call, but the streaming response format mangles it. OpenClaw receives an empty or malformed tool call and either does nothing or throws an error.
Fix: Use the Native Ollama API
Configure OpenClaw to use Ollama's native API instead of the OpenAI compatibility layer:
```json
{
  "providers": {
    "ollama": {
      "api": "ollama",
      "baseUrl": "http://localhost:11434",
      "models": ["qwen3:8b"]
    }
  }
}
```
Notice: no /v1 at the end of the URL. The native API endpoint is just http://localhost:11434. Adding /v1 routes to the OpenAI compatibility layer, which is exactly what we're avoiding.
If your Ollama config has "api": "openai" or a baseUrl ending in /v1, you're using the compatibility layer. Switch to "api": "ollama" with the base URL http://localhost:11434 to fix tool calling issues.
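If you manage several provider entries, a small sanity check catches this misconfiguration before you hit a confusing runtime failure. A sketch, assuming the field names from the JSON snippet above (`api`, `baseUrl`); adjust to your actual OpenClaw config schema if it differs:

```python
def uses_compat_layer(provider_cfg: dict) -> bool:
    """Return True if this Ollama provider config routes through the
    OpenAI compatibility layer -- the setup that breaks tool calling."""
    api = provider_cfg.get("api", "")
    base_url = provider_cfg.get("baseUrl", "").rstrip("/")
    return api == "openai" or base_url.endswith("/v1")

# The broken setup most guides recommend vs. the native-API fix
bad = {"api": "openai", "baseUrl": "http://localhost:11434/v1"}
good = {"api": "ollama", "baseUrl": "http://localhost:11434"}

print(uses_compat_layer(bad))   # True  -> needs fixing
print(uses_compat_layer(good))  # False -> native API
```

Run it over each entry in your `providers` map before starting OpenClaw.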
Tool Calling Argument Corruption
Even with the native API, tool calling with Ollama models can produce corrupted arguments. This is an upstream Ollama issue (tracked as #57103) where the model generates valid JSON for tool arguments, but Ollama's response parsing occasionally truncates or misformats the JSON before passing it back.
Symptoms
- Tool calls execute with missing parameters
- JSON parse errors in OpenClaw logs: `Unexpected end of JSON input`
- Tools receive partial arguments (e.g., a file path cut off mid-string)
- Intermittent failures -- works sometimes, fails on longer argument strings
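Because the corruption surfaces as invalid JSON in the tool arguments, you can guard against it before a tool ever runs with partial input. A minimal sketch of that defensive parse (`parse_tool_args` is a hypothetical helper, not part of OpenClaw):

```python
import json

def parse_tool_args(raw: str):
    """Parse tool-call argument JSON, returning (args, error).

    Truncated streaming output fails json.loads with an
    'Unexpected end of JSON input' style error; surface that
    instead of dispatching a tool with partial parameters."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as e:
        return None, f"corrupted tool arguments: {e.msg} at pos {e.pos}"

# A file path cut off mid-string, as in the symptom above
args, err = parse_tool_args('{"path": "/etc/ngin')
print(args, err is not None)  # None True
```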
Fix: Disable Streaming for Tool Calls
In your OpenClaw agent configuration, disable streaming when tool calling is active:
```json
{
  "agents": {
    "defaults": {
      "streaming": false
    }
  }
}
```
This forces Ollama to return the complete response in a single JSON payload instead of streaming tokens. The tradeoff: you won't see tokens appear in real-time, but tool calls will be complete and valid.
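If you would rather keep streaming for plain chat and only give it up when it actually corrupts output, one pragmatic middle ground is to retry a failed turn with streaming off. A sketch with an injected transport function (this hook is hypothetical -- OpenClaw does not expose it; the pattern applies if you call Ollama from your own code):

```python
import json

def chat_with_fallback(send, payload: dict):
    """Try a streaming request first; on corrupted JSON, retry once
    with stream disabled. `send` is any callable that performs the
    HTTP request and returns the raw response body as a string."""
    try:
        body = send({**payload, "stream": True})
        json.loads(body)  # corruption surfaces as invalid JSON
        return body
    except json.JSONDecodeError:
        return send({**payload, "stream": False})

# Simulated transport: the streaming path returns a truncated body.
def fake_send(p):
    return '{"tool_calls": [{"na' if p["stream"] else '{"tool_calls": []}'

print(chat_with_fallback(fake_send, {"model": "qwen3:8b"}))  # {"tool_calls": []}
```

The cost is one wasted generation on failures, which is usually cheaper than a tool call running with broken arguments.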
Fix: Use Models Known to Handle Tool Calls Well
Not all Ollama models support tool calling reliably. The models with the best tool calling support as of April 2026:
| Model | Size | Tool Calling | Notes |
|---|---|---|---|
| qwen3:8b | 4.9 GB | Excellent | Best balance of speed and capability |
| qwen3:14b | 9.0 GB | Excellent | More reliable on complex multi-tool chains |
| qwen3:32b | 19 GB | Excellent | Best local model for agentic workflows |
| llama3.1:8b | 4.7 GB | Good | Solid but less reliable than Qwen3 for tools |
| mistral-nemo:12b | 7.1 GB | Fair | Works for simple single-tool calls |
| deepseek-r1:8b | 4.9 GB | Poor | Thinking model, drops tool_calls frequently |
The Qwen3 series is currently the best choice for OpenClaw + Ollama. It has native tool calling support that doesn't rely on prompt hacking, and Ollama's implementation handles it cleanly.
Model Hangs: No Response After Prompt
You send a message, OpenClaw shows "thinking..." and nothing ever comes back. No error, no timeout, just infinite waiting.
Cause 1: Model Not Downloaded
Ollama doesn't auto-download models. If your OpenClaw config references qwen3:8b but you haven't pulled it, Ollama returns an error that OpenClaw may not surface clearly.
```bash
# Check what models are available locally
ollama list

# Pull the model you need
ollama pull qwen3:8b
```
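You can also verify this programmatically from the `/api/tags` response before starting an agent. A sketch; the sample payload mirrors the shape Ollama returns, trimmed to the fields used here:

```python
def model_available(tags_response: dict, wanted: str) -> bool:
    """Check whether `wanted` appears in Ollama's /api/tags payload.
    Tags come back as e.g. 'qwen3:8b'; a bare name matches any tag."""
    names = [m.get("name", "") for m in tags_response.get("models", [])]
    return any(n == wanted or n.split(":")[0] == wanted for n in names)

# Shape mirrors `curl http://localhost:11434/api/tags` output
tags = {"models": [{"name": "qwen3:8b"}, {"name": "llama3.1:8b"}]}
print(model_available(tags, "qwen3:8b"))   # True
print(model_available(tags, "qwen3:14b"))  # False
```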
Cause 2: Insufficient VRAM/RAM
The model fits in memory on paper, but actual inference needs more than the model file size. An 8B-parameter model at Q4 quantization needs ~4.9 GB for weights plus 1-2 GB for context, KV cache, and inference overhead.
Check if Ollama is actually loading the model:
```bash
# Watch Ollama logs for loading/memory errors
journalctl -u ollama -f

# Or if running in Docker
docker logs ollama --tail 50 -f
```
Look for messages like `out of memory`, `insufficient resources`, or `model load failed`.
Cause 3: Context Length Exceeds Model Limit
OpenClaw sends increasingly large context as conversations grow. If the total context exceeds the model's configured limit, Ollama may hang rather than returning an error.
```bash
# Check the model's default context length
ollama show qwen3:8b --modelfile | grep num_ctx
```
Default is usually 2048 or 4096 tokens. For OpenClaw agent workflows, you likely need more:
```bash
# Create a custom model with larger context
cat << 'EOF' > Modelfile
FROM qwen3:8b
PARAMETER num_ctx 8192
EOF
ollama create qwen3-8k -f Modelfile
```
Then update your OpenClaw config to use qwen3-8k.
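To decide how large `num_ctx` needs to be, a rough token estimate over the conversation history is usually enough. A sketch using the common ~4 characters-per-token heuristic -- an approximation, not the model's real tokenizer, so leave headroom:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for
    English text; real tokenizers vary, so leave headroom)."""
    return len(text) // 4

def fits_context(messages, num_ctx: int, reply_budget: int = 1024) -> bool:
    """True if the conversation plus a reply budget fits in num_ctx."""
    used = sum(estimate_tokens(m) for m in messages)
    return used + reply_budget <= num_ctx

history = ["x" * 20000, "y" * 12000]  # ~8k estimated tokens
print(fits_context(history, 4096))    # False -> raise num_ctx
print(fits_context(history, 16384))   # True
```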
Cause 4: Model Discovery Timeout
When OpenClaw starts, it queries Ollama to discover available models. If Ollama is still loading or the API is slow to respond, OpenClaw may time out during discovery and fail to register the provider.
```bash
# Verify Ollama API is responding
curl http://localhost:11434/api/tags
```
If this is slow or times out, Ollama isn't ready yet. Ensure Ollama starts before OpenClaw in your Docker Compose:
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 10s
      timeout: 5s
      retries: 10

  openclaw:
    image: openclaw/openclaw:latest
    depends_on:
      ollama:
        condition: service_healthy
```
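If you run the services outside Compose (bare metal, systemd), a small wait loop achieves the same ordering. A sketch with an injectable probe so it stays self-contained; in practice the probe would be an HTTP GET against `http://localhost:11434/api/tags`:

```python
import time

def wait_for_ollama(probe, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until `probe()` returns True (e.g. a successful GET on
    /api/tags), mirroring a healthcheck: 10 attempts, fixed delay."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Simulated probe: Ollama becomes "ready" on the third attempt.
state = {"calls": 0}
def fake_probe():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_for_ollama(fake_probe, attempts=5, delay=0))  # True
```

Only start OpenClaw once this returns True; starting it earlier reproduces the discovery timeout.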
Thinking Models Dropping Tool Calls
If you're using a "thinking" or "reasoning" model (like DeepSeek-R1 or QwQ), you'll notice it frequently generates a reasoning chain but then fails to emit the tool call. The model "thinks" about what tool to use but never actually calls it.
Why It Happens
Thinking models use <think>...</think> blocks in their output. When OpenClaw parses the response, it may interpret the thinking block as the complete response and miss the tool call that follows. Additionally, some thinking models at smaller sizes (7B-14B) simply don't have enough capacity to maintain both a reasoning chain and structured tool call output.
Fix: Disable Thinking for Tool-Heavy Agents
```json
{
  "agents": {
    "defaults": {
      "thinkingDefault": false
    }
  }
}
```
Or switch to a non-thinking model for agents that rely heavily on tool calling. Use thinking models only for analysis and reasoning tasks where tool calling isn't needed.
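If you must keep a thinking model in the loop, stripping the reasoning block before parsing can recover a tool call that follows it. A sketch of that pre-parse step (hypothetical -- this is not how OpenClaw parses responses internally, but the pattern applies to any wrapper you write around Ollama):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(output: str) -> str:
    """Remove <think>...</think> blocks so a tool call that follows
    the reasoning chain isn't mistaken for part of it."""
    return THINK_RE.sub("", output).strip()

raw = ("<think>I should read the file first.</think>"
       '{"tool": "read_file", "args": {"path": "notes.md"}}')
print(strip_thinking(raw))  # {"tool": "read_file", "args": {"path": "notes.md"}}
```

This only helps when the model emits the tool call at all; it does nothing for the capacity problem in smaller thinking models.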
Performance: Speed vs Quality Tradeoffs
Local LLMs are inherently slower than cloud APIs. Here's how to optimize response time:
GPU vs CPU Inference
| Setup | Tokens/sec (8B model) | Typical Response Time |
|---|---|---|
| NVIDIA RTX 4090 (24 GB VRAM) | 80-120 tok/s | 1-3 seconds |
| NVIDIA RTX 3060 (12 GB VRAM) | 40-60 tok/s | 3-6 seconds |
| CPU only (8 cores, 32 GB RAM) | 5-15 tok/s | 10-30 seconds |
| CPU only (4 cores, 16 GB RAM) | 3-8 tok/s | 20-60 seconds |
For a VPS without a GPU, CPU inference with an 8B model is the practical limit. Anything larger will be painfully slow.
Fix: Choose the Right Model Size for Your Hardware
| VPS RAM | Max Model | Quantization | Practical Use |
|---|---|---|---|
| 4 GB | 3B | Q4_K_M | Simple Q&A, basic tasks |
| 8 GB | 8B | Q4_K_M | General agent, tool calling |
| 16 GB | 14B | Q4_K_M | Better reasoning, code generation |
| 32 GB | 32B | Q4_K_M | Complex agents, multi-step workflows |
| 64 GB | 70B | Q4_K_M | Near-cloud quality |
The model file size shown by ollama list is the disk size, not the RAM requirement. Actual RAM usage during inference is 20-40% higher due to context window, KV cache, and processing overhead.
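That 20-40% rule is simple enough to turn into arithmetic when sizing a VPS. A sketch; the overhead factor is the estimate from the paragraph above, not a measured constant:

```python
def inference_ram_gb(file_size_gb: float, overhead: float = 0.3) -> float:
    """Estimate peak RAM for inference from the `ollama list` file
    size, adding 20-40% (default 30%) for context and KV cache."""
    return round(file_size_gb * (1 + overhead), 1)

print(inference_ram_gb(4.9))       # qwen3:8b -> 6.4
print(inference_ram_gb(9.0, 0.4))  # qwen3:14b, worst case -> 12.6
```

So a 16 GB VPS comfortably covers a 14B model even at the pessimistic end, with room left for OpenClaw and the OS.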
Contabo
Contabo VPS 2: 16 GB RAM, 6 vCPU for $8.49/mo. Run Ollama with 14B models on CPU for a fully private OpenClaw agent.
* Affiliate link — we may earn a commission at no extra cost to you.
Complete Docker Compose: OpenClaw + Ollama
Here's a production-ready setup for running both services:
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    restart: always
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=1
    deploy:
      resources:
        limits:
          memory: 12g
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 10s
      timeout: 5s
      retries: 10

  openclaw:
    image: openclaw/openclaw:latest
    restart: always
    ports:
      - "127.0.0.1:18789:18789"
    environment:
      - OPENCLAW_GATEWAY_TOKEN=your-secure-token
    volumes:
      - openclaw_data:/root/.openclaw
    depends_on:
      ollama:
        condition: service_healthy

volumes:
  ollama_data:
  openclaw_data:
```
Key settings:
- `OLLAMA_NUM_PARALLEL=2` -- limits concurrent inference to 2 requests. Higher values use more RAM.
- `OLLAMA_MAX_LOADED_MODELS=1` -- keeps only one model in memory at a time. Essential for limited RAM.
- Memory limit of 12 GB -- leaves room for OpenClaw and the OS on a 16 GB VPS.
After starting, pull your model:
```bash
docker exec ollama ollama pull qwen3:8b
```
Troubleshooting Checklist
When OpenClaw + Ollama isn't working:
- Is Ollama running? -- `curl http://localhost:11434/api/tags` should list models
- Is the model downloaded? -- `ollama list` should show your model
- Is the API format correct? -- use `"api": "ollama"` with no `/v1` in the URL
- Are tool calls working? -- test with a simple tool-calling prompt; if it fails, disable streaming
- Is the model hanging? -- check Ollama logs for memory errors; reduce model size or increase RAM
- Is context too large? -- enable context compaction in OpenClaw, or increase `num_ctx` in the model
- Is inference too slow? -- use a smaller model, enable GPU passthrough, or consider a cloud API for time-sensitive tasks
VPS Sizing for OpenClaw + Ollama
| Use Case | RAM | vCPU | Model | Cost |
|---|---|---|---|---|
| Experimentation | 8 GB | 4 | 3B-8B | $4.50-8/mo |
| Personal agent | 16 GB | 6 | 8B-14B | $8-15/mo |
| Production agent | 32 GB | 8 | 14B-32B | $15-30/mo |
| Team deployment | 64 GB | 16 | 32B-70B | $30-60/mo |
Provider recommendations for Ollama workloads:
- Contabo VPS 1 ($4.50/mo): 8 GB RAM -- enough for 8B models on CPU
- Contabo VPS 2 ($8.49/mo): 16 GB RAM -- runs 14B models comfortably
- Hostinger KVM 4 ($15.99/mo): 16 GB RAM with NVMe for fast model loading
- Vultr GPU instances (from $90/mo): NVIDIA A100 for real-time inference speeds
Hostinger
Hostinger KVM 4: 16 GB RAM and NVMe storage for $15.99/mo. Fast model loading and enough memory for 14B parameter models.
* Affiliate link — we may earn a commission at no extra cost to you.
Conclusion
OpenClaw + Ollama is a powerful combination for private, cost-free AI agents -- but the integration has sharp edges. The biggest wins come from using the native Ollama API (not the OpenAI compatibility layer), choosing models with good tool calling support (Qwen3 series), and sizing your VPS correctly for the model you want to run.
Fix the streaming bug by using "api": "ollama", fix tool calling by disabling streaming or choosing a better model, and fix hangs by ensuring your model fits in memory with room for context. Once these are dialed in, you get a fully local AI agent that costs nothing per message.
Ready to automate? Get a VPS today.
Get started with Hostinger VPS hosting today. Special pricing available.
* Affiliate link -- we may earn a commission at no extra cost to you.