From 79eeaaeaf66fb1700ca5366400ead3854c61d0db Mon Sep 17 00:00:00 2001 From: Yeachan-Heo Date: Sun, 26 Apr 2026 09:30:37 +0000 Subject: [PATCH] roadmap: #286 filed --- ROADMAP.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/ROADMAP.md b/ROADMAP.md index f06741f..e4bfec8 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -17227,3 +17227,15 @@ Gap. Claw Code lacks a declarative provider graph and websearch backend contract Required fix shape: (a) add schema-backed settings sections for `providers`, `models`, and `websearch` with safe secret handling (support env indirection for API keys instead of encouraging raw key commits); (b) define precedence `CLI > local/project/user config > env > built-in defaults`; (c) make `ProviderClient` resolve from the merged config graph, including custom OpenAI-compatible base URLs, auth env/key refs, max context, max output, and reasoning/tool quirks; (d) make `WebSearch` dispatch through configured providers such as DuckDuckGo, Tavily, Brave, or custom base URL; (e) surface the resolved provider/model/search backend in `claw status --output-format json` and `claw doctor`; (f) add tests for LM Studio-style OpenAI-compatible config, multi-model selection, and Tavily-style search backend config without leaking raw API keys in output. Acceptance: the user-requested provider/model/search shape can be placed in settings, resolved deterministically, and audited without relying on undocumented env-only behavior. **Status:** Open. No source code changed. Filed 2026-04-26 18:28 KST. Branch: feat/jobdori-168c-emission-routing. HEAD: `92a598e` before filing. Cluster delta: declarative-provider-websearch-config +1; concrete user-signal source: Sigrid request in #clawcode-building-in-public. Concrete delta this cycle: ROADMAP-only follow-up appended from config/provider/websearch audit. + +## Pinpoint #286 — Parallel `Agent` execution can leave forever-running manifests because background thread lifecycle is not durable across process/gateway death and has no heartbeat/stale reaper + +Dogfooded 2026-04-26 18:32 KST after Sigrid requested heavy dogfooding around parallel execution and async execution because users report mistakes there. Static audit of `rust/crates/tools/src/lib.rs` shows `execute_agent` writes an `AgentOutput` manifest with `status: "running"`, `derivedState: "working"`, and a `lane.started` event, then calls `spawn_agent_job`. `spawn_agent_job` launches a detached `std::thread::Builder::spawn` closure and immediately returns `Ok(())`; the `JoinHandle` is discarded. The only transition out of `running` happens inside the in-process thread via `run_agent_job` → `persist_agent_terminal_state(..., "completed"|"failed")`, or if spawn itself fails before the thread starts. + +Concrete failure mode: if the parent process/gateway crashes, restarts, OOMs, or is killed after the `running` manifest is written but before the detached thread persists terminal state, the manifest remains `running` forever. There is no durable job queue, PID/thread identity, heartbeat timestamp, lease, resume record, or stale reaper. `derive_agent_state("running", ..)` always returns `working`, so downstream parallel/team coordination sees the lane as active rather than `orphaned`, `lost`, or `needs_recovery`. This is exactly the class of parallel/async mistake users notice: a lane looks alive because a JSON file says `running`, not because any worker is actually executing. + +Gap. Agent parallelism has a fire-and-forget in-process thread model but reports as durable background execution. Tests cover spawn failure and fake completion/failure, but they do not simulate crash-after-running-manifest-before-terminal-state, dropped `JoinHandle`, process restart, stale heartbeat, or reaper classification. This is distinct from #281 dogfood git↔Discord transactionality: #286 is runtime lane lifecycle durability for parallel worker execution. + +Required fix shape: (a) persist a durable agent job record with `agent_id`, owner process id/start time, heartbeat timestamp, and phase before spawning; (b) either retain/track `JoinHandle`s in a supervisor or move execution to a durable worker queue; (c) update heartbeat during long `run_turn` execution; (d) on startup/tool access, scan manifests stuck in `running` beyond a lease and classify them as `orphaned_worker` / `needs_recovery` instead of `working`; (e) expose stale/orphaned lane state in Agent/Team status and lane events; (f) regression-test crash-after-manifest-before-terminal-state by creating a running manifest with stale heartbeat and verifying the reaper emits a typed blocker. Acceptance: a parallel Agent lane cannot remain silently `running` forever after its executor disappears. + +**Status:** Open. No source code changed. Filed 2026-04-26 18:33 KST. Branch: feat/jobdori-168c-emission-routing. HEAD: `639e1e3` before filing. Cluster delta: parallel-agent-lifecycle-durability +1; concrete user-signal source: Sigrid request to dogfood parallel/async execution mistakes. Concrete delta this cycle: ROADMAP-only pinpoint appended from Agent spawn/lifecycle audit.