From 4fc2265d38fcffc9117f90cbcbd784c4b98aedb3 Mon Sep 17 00:00:00 2001
From: YeonGyu-Kim
Date: Mon, 27 Apr 2026 00:02:36 +0900
Subject: [PATCH] docs: expand TROUBLESHOOTING.md with context-window, /compact, parallel-agent, repeat-upstream sections

---
 TROUBLESHOOTING.md | 69 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/TROUBLESHOOTING.md b/TROUBLESHOOTING.md
index 4403cff..66ab460 100644
--- a/TROUBLESHOOTING.md
+++ b/TROUBLESHOOTING.md
@@ -26,6 +26,73 @@
 
 ---
 
+## Context-window-blocked errors
+
+**Symptom:** claw-code exits with `context_window_blocked` or a similar provider error when resuming a long session, or when sending a request with a very large prompt + accumulated history.
+
+**Root cause:** Session size exceeded the provider context window before claw-code's auto-compaction could reduce it. Auto-compaction is currently REACTIVE-AFTER-SUCCESS: it only fires after a successful provider response. If the request itself is oversized, compaction never runs.
+
+**Mitigation:**
+1. **Resume with manual compact:** `claw resume --compact-before` (if available); else manually compact via the `/compact` slash command before retrying
+2. **Start a fresh session:** Sometimes the cleanest path; existing session state is preserved in `~/.claw/sessions//`
+3. **Reduce prompt size:** If interactive, send shorter prompts; truncate file contents before pasting
+4. **Adjust threshold:** Lower the `CLAW_AUTO_COMPACT_INPUT_TOKENS_THRESHOLD` env var (default varies by provider)
+
+**Related pinpoints:** #287 (auto-compaction reactive-not-preflight, CRITICAL), #283 (threshold env-only no settings.json key), #288 (failure envelope omits diagnostics)
+
+---
+
+## Manual `/compact` reports "session below compaction threshold"
+
+**Symptom:** You run `/compact` to manually compact a session, but it reports `session below compaction threshold` even though the session feels large.
+
+**Root cause:** The "below threshold" message is currently a catch-all for multiple skip reasons:
+- Too few compactable messages
+- Already compacted (only summary remains)
+- Compactable tokens below threshold
+- Tool-use/tool-result boundary preserved
+- Live vs resume threshold divergence
+
+**Mitigation:**
+1. **Check session state:** `claw session info ` to inspect message count, total tokens
+2. **Force compaction:** Currently no `--force` flag exists; track #289 for typed skip-reason discriminants
+3. **Workaround:** Continue the session and let auto-compact fire after the next provider response (when the reactive-after-success path is available)
+
+**Related pinpoint:** #289 (manual `/compact` skip-reason flattened, lacks typed discriminants)
+
+---
+
+## Parallel agent stuck in "running" state
+
+**Symptom:** A parallel agent lane shows `status: running` indefinitely, never transitioning to `completed` or `error`. Downstream coordination treats it as still-working.
+
+**Root cause:** `Agent::execute_agent` writes a `running` manifest BEFORE spawning a detached `std::thread::spawn`. The `JoinHandle` is dropped. If the process crashes during agent execution, the manifest stays as `running` forever (zombie state). No heartbeat or stale-reaper exists.
+
+**Mitigation:**
+1. **Manual cleanup:** Inspect `~/.claw/agents//` and remove stale `manifest.json` files where last-modified > N minutes ago
+2. **Restart agent lane:** `claw agent restart `
+3. **Kill orphaned processes:** `pgrep claw` to find lingering processes
+
+**Related pinpoint:** #286 (Parallel `Agent` detached-thread no-heartbeat no-reaper)
+
+---
+
+## Sustained upstream provider failures (`500 empty_stream` repeating)
+
+**Symptom:** Same upstream provider error (e.g., `500 empty_stream: upstream stream closed before first payload`) repeats 5+ times in <60 minutes. Retries hit the same dead upstream blindly.
+
+**Root cause:** claw-code does NOT detect repeat-failure patterns. No circuit-breaker. No automatic provider fallback, even when more than one provider is configured. Each retry attempts the same provider+endpoint regardless of recent failure history.
+
+**Mitigation:**
+1. **Manual circuit-breaker:** Wait 5-10 minutes after repeated failures before retrying
+2. **Switch provider:** If you have multiple providers configured (`ANTHROPIC_API_KEY` + `OPENAI_API_KEY`), restart with a different model prefix (e.g., `gpt-4` instead of `claude-`)
+3. **Check provider status pages:** status.anthropic.com, status.openai.com
+4. **Verify upstream endpoint:** If using a proxy (CCAPI, custom OpenAI-compatible endpoint), check proxy logs
+
+**Related pinpoints:** #291 (no repeat-failure detection / circuit-breaker), #285 (declarative providers config for fallback), #290 (stream-init failure envelope)
+
+---
+
 ## Other common failures
 
-*[placeholder for future sections: context-window errors, tool-use failures, session corruption, auto-compaction loops]*
+*[placeholder for future sections: tool-use failures, session corruption]*
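
Note on the "Manual circuit-breaker" mitigation in the "Sustained upstream provider failures" section: until #291 lands, the wait-before-retry rule could be automated in a wrapper script. The following is a minimal sketch, not claw-code's implementation. The class name `RepeatFailureBreaker` and all of its parameters are hypothetical. The defaults only mirror the numbers in the section (5 failures within 60 minutes, then a 10-minute pause).

```python
import time
from collections import deque


class RepeatFailureBreaker:
    """Hypothetical helper (not part of claw-code): pause retries after repeated failures."""

    def __init__(self, max_failures=5, window_s=3600.0, cooldown_s=600.0,
                 clock=time.monotonic):
        self.max_failures = max_failures  # failures that trip the breaker
        self.window_s = window_s          # only failures this recent are counted
        self.cooldown_s = cooldown_s      # how long to refuse requests once tripped
        self.clock = clock                # injectable so the logic can be tested
        self.failures = deque()           # timestamps of recent failures
        self.open_until = None            # None means the breaker is closed

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.open_until = now + self.cooldown_s

    def record_success(self):
        # A successful response resets the failure history.
        self.failures.clear()
        self.open_until = None

    def allow_request(self):
        if self.open_until is None:
            return True
        if self.clock() >= self.open_until:
            # Cooldown elapsed: close the breaker and allow a fresh attempt.
            self.open_until = None
            self.failures.clear()
            return True
        return False
```

A wrapper would call `allow_request()` before each retry, `record_failure()` when the provider returns an error such as `500 empty_stream`, and `record_success()` on a normal response. While `allow_request()` returns `False`, switching provider or checking the status pages is the better use of the wait.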