Troubleshooting
Upstream stream-init failures (500 empty_stream)
Symptom: claw-code exits with 500 empty_stream: upstream stream closed before first payload or similar upstream stream-init error.
Root cause: Upstream provider (Anthropic, OpenAI, other) closed the HTTP connection before sending the first response payload. Common causes:
- Transient network issue between claw-code and provider
- Provider overload / temporary service degradation
- Authentication token expired or invalid
- Rate limit exceeded (even if not visible in response headers)
Mitigation:
- Check credentials: Verify claw whoami shows the expected provider and account. Re-authenticate if expired.
- Wait and retry: Provider transient issues usually resolve within 30-60 seconds. Wait a minute, then retry the same command.
- Check provider status: Visit the provider's status page (e.g., status.anthropic.com, status.openai.com).
- Reduce request size: If the prompt is large, try a smaller request first to isolate stream-init from context-window failures.
- Check network: Ensure your network connection is stable. If behind a proxy, verify proxy allows streaming responses.
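The wait-and-retry step above can be scripted. This is a minimal Python sketch, not part of claw-code: the name retry_command, its parameters, and the 60-second default are illustrative assumptions. It re-runs whatever command line you pass in and pauses between failed attempts.

```python
import subprocess
import time
from typing import Callable, Sequence


def retry_command(
    argv: Sequence[str],
    attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run argv up to `attempts` times, pausing between failed attempts.

    Returns 0 on the first success, otherwise the exit code of the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(1, attempts + 1):
        code = subprocess.run(list(argv)).returncode
        if code == 0:
            return 0
        if attempt < attempts:
            # Give a transient provider issue time to clear before retrying.
            sleep(wait_seconds)
    return code
```

Pass the exact argv of the command that failed. Keeping attempts small matters here: if the failure is caused by an expired token or a rate limit, more retries will not help and the other checks in this list apply instead.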
When to escalate:
- If stream-init failures persist >10 minutes across multiple requests
- If claw whoami fails to authenticate
- If no provider status page shows degradation
Related pinpoint: #290 (typed stream-init failure envelope — future improvement for better diagnostics)
Context-window-blocked errors
Symptom: claw-code exits with context_window_blocked or similar provider error when resuming a long session, or when sending a request with a very large prompt + accumulated history.
Root cause: Session size exceeded provider context window before claw-code's auto-compaction could reduce it. Auto-compaction is currently REACTIVE-AFTER-SUCCESS — it only fires after a successful provider response. If the request itself is oversized, compaction never runs.
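To make the reactive-after-success behavior concrete, here is a toy Python model. It is a sketch under stated assumptions, not claw-code's implementation: ToySession, its fields, the token numbers, and the "shrink to a quarter" compaction are all invented for illustration. The only property it mirrors from the description above is that compaction runs in the success path, so a rejected request never reaches it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySession:
    context_window: int      # provider limit in tokens (illustrative value)
    compact_threshold: int   # auto-compaction trigger in tokens (illustrative value)
    history_tokens: int = 0
    events: List[str] = field(default_factory=list)

    def send(self, prompt_tokens: int) -> bool:
        request_tokens = self.history_tokens + prompt_tokens
        if request_tokens > self.context_window:
            # The provider rejects the request, so the post-success
            # compaction step below is never reached.
            self.events.append("blocked")
            return False
        self.history_tokens = request_tokens
        self.events.append("ok")
        if self.history_tokens > self.compact_threshold:
            # Reactive compaction: runs only after a successful response.
            # Dividing by 4 is an arbitrary stand-in for summarization.
            self.history_tokens //= 4
            self.events.append("compacted")
        return True
```

With context_window=1000 and compact_threshold=600, a 500-token request succeeds without compacting, and a following 700-token request is blocked at 1200 tokens with the history left unchanged. With compact_threshold=400 in the same model, the first response already triggers compaction and the second request fits.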
Mitigation:
- Resume with manual compact: claw resume <session> --compact-before (if available); else manually compact via /compact slash command before retrying
- Start a fresh session: Sometimes the cleanest path; existing session-state preserved in ~/.claw/sessions/<id>/
- Reduce prompt size: If interactive, send shorter prompts; truncate file contents before pasting
- Adjust threshold: Lower CLAW_AUTO_COMPACT_INPUT_TOKENS_THRESHOLD env var (default varies by provider)
Related pinpoints: #287 (auto-compaction reactive-not-preflight, CRITICAL), #283 (threshold env-only no settings.json key), #288 (failure envelope omits diagnostics)
Manual /compact reports "session below compaction threshold"
Symptom: You run /compact to manually compact a session, but it reports session below compaction threshold even though the session feels large.
Root cause: The "below threshold" message is currently a catch-all for multiple skip reasons:
- Too few compactable messages
- Already compacted (only summary remains)
- Compactable tokens below threshold
- Tool-use/tool-result boundary preserved
- Live vs resume threshold divergence
Mitigation:
- Check session state: claw session info <id> to inspect message count, total tokens
- Force compaction: Currently no --force flag exists; track #289 for typed skip-reason discriminants
- Workaround: Continue session and let auto-compact fire after next provider response (when reactive-after-success path is available)
Related pinpoint: #289 (manual /compact skip-reason flattened, lacks typed discriminants)
Parallel agent stuck in "running" state
Symptom: A parallel agent lane shows status: running indefinitely, never transitioning to completed or error. Downstream coordination treats it as still-working.
Root cause: Agent::execute_agent writes a running manifest BEFORE spawning a detached std::thread::spawn. The JoinHandle is dropped. If the process crashes during agent execution, the manifest stays as running forever (zombie state). No heartbeat or stale-reaper exists.
Mitigation:
- Manual cleanup: Inspect ~/.claw/agents/<lane>/ and remove stale manifest.json files where last-modified > N minutes ago
- Restart agent lane: claw agent restart <lane>
- Kill orphaned processes: pgrep claw to find lingering processes
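The manual cleanup step can be turned into a dry-run listing. The Python sketch below is an assumption-laden helper, not a claw-code tool: find_stale_manifests and its parameters are hypothetical names, and it assumes one manifest.json per lane directory as the path above suggests. It looks only at file modification time and does not parse the manifest, because the manifest schema is not described here.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_manifests(
    agents_dir: Path,
    max_age_minutes: float,
    now: Optional[float] = None,
) -> List[Path]:
    """Return <lane>/manifest.json files last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    stale = []
    for manifest in sorted(agents_dir.glob("*/manifest.json")):
        # Modification time is only a proxy for "no longer running";
        # a long-running healthy agent may also have an old manifest.
        if manifest.stat().st_mtime < cutoff:
            stale.append(manifest)
    return stale
```

For example, calling find_stale_manifests(Path.home() / ".claw" / "agents", max_age_minutes=30) lists candidates, where 30 is an arbitrary choice for N. The function deliberately deletes nothing: confirm with pgrep claw that no process still owns a lane before removing its manifest.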
Related pinpoint: #286 (Parallel Agent detached-thread no-heartbeat no-reaper)
Sustained upstream provider failures (500 empty_stream repeating)
Symptom: Same upstream provider error (e.g., 500 empty_stream: upstream stream closed before first payload) repeats 5+ times in <60 minutes. Retries hit the same dead upstream blindly.
Root cause: claw-code does NOT detect repeat-failure patterns. No circuit-breaker. No automatic provider-fallback when configured. Each retry attempts the same provider+endpoint regardless of recent failure history.
Mitigation:
- Manual circuit-breaker: Wait 5-10 minutes after repeated failures before retrying
- Switch provider: If you have multiple providers configured (ANTHROPIC_API_KEY + OPENAI_API_KEY), restart with different model prefix (e.g., gpt-4 instead of claude-)
- Check provider status pages: status.anthropic.com, status.openai.com
- Verify upstream endpoint: If using a proxy (CCAPI, custom OpenAI-compatible endpoint), check proxy logs
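Until repeat-failure detection exists, the manual circuit-breaker can live in a wrapper script. The Python sketch below is one possible shape, not claw-code behavior: FailureBreaker and its method names are hypothetical, and the defaults (5 failures within 60 minutes, 10-minute cooldown) are taken from the symptom and mitigation figures above.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class FailureBreaker:
    """Block requests for a cooldown after too many failures in a time window."""

    def __init__(
        self,
        max_failures: int = 5,
        window_seconds: float = 3600.0,
        cooldown_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._failures: Deque[float] = deque()
        self._open_until: Optional[float] = None

    def record_failure(self) -> None:
        now = self.clock()
        self._failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            # Trip the breaker and start counting afresh after the cooldown.
            self._open_until = now + self.cooldown_seconds
            self._failures.clear()

    def record_success(self) -> None:
        self._failures.clear()
        self._open_until = None

    def allow_request(self) -> bool:
        return self._open_until is None or self.clock() >= self._open_until
```

A wrapper would call allow_request() before each attempt, then record_failure() or record_success() based on the exit status. When the breaker is open, that is the point to check the status pages or switch provider rather than retry.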
Related pinpoints: #291 (no repeat-failure detection / circuit-breaker), #285 (declarative providers config for fallback), #290 (stream-init failure envelope)
Other common failures
[placeholder for future sections: tool-use failures, session corruption]