mirror of
https://github.com/instructkr/claw-code.git
synced 2026-04-06 16:14:49 +08:00
367 lines
18 KiB
Markdown
367 lines
18 KiB
Markdown
# ROADMAP.md
|
||
|
||
# Clawable Coding Harness Roadmap
|
||
|
||
## Goal
|
||
|
||
Turn claw-code into the most **clawable** coding harness:
|
||
- no human-first terminal assumptions
|
||
- no fragile prompt injection timing
|
||
- no opaque session state
|
||
- no hidden plugin or MCP failures
|
||
- no manual babysitting for routine recovery
|
||
|
||
This roadmap assumes the primary users are **claws wired through hooks, plugins, sessions, and channel events**.
|
||
|
||
## Definition of "clawable"
|
||
|
||
A clawable harness is:
|
||
- deterministic to start
|
||
- machine-readable in state and failure modes
|
||
- recoverable without a human watching the terminal
|
||
- branch/test/worktree aware
|
||
- plugin/MCP lifecycle aware
|
||
- event-first, not log-first
|
||
- capable of autonomous next-step execution
|
||
|
||
## Current Pain Points
|
||
|
||
### 1. Session boot is fragile
|
||
- trust prompts can block TUI startup
|
||
- prompts can land in the shell instead of the coding agent
|
||
- "session exists" does not mean "session is ready"
|
||
|
||
### 2. Truth is split across layers
|
||
- tmux state
|
||
- clawhip event stream
|
||
- git/worktree state
|
||
- test state
|
||
- gateway/plugin/MCP runtime state
|
||
|
||
### 3. Events are too log-shaped
|
||
- claws currently infer too much from noisy text
|
||
- important states are not normalized into machine-readable events
|
||
|
||
### 4. Recovery loops are too manual
|
||
- restart worker
|
||
- accept trust prompt
|
||
- re-inject prompt
|
||
- detect stale branch
|
||
- retry failed startup
|
||
- classify infra vs code failures manually
|
||
|
||
### 5. Branch freshness is not enforced enough
|
||
- side branches can miss already-landed main fixes
|
||
- broad test failures can be stale-branch noise instead of real regressions
|
||
|
||
### 6. Plugin/MCP failures are under-classified
|
||
- startup failures, handshake failures, config errors, partial startup, and degraded mode are not exposed cleanly enough
|
||
|
||
### 7. Human UX still leaks into claw workflows
|
||
- too much depends on terminal/TUI behavior instead of explicit agent state transitions and control APIs
|
||
|
||
## Product Principles
|
||
|
||
1. **State machine first** — every worker has explicit lifecycle states.
|
||
2. **Events over scraped prose** — channel output should be derived from typed events.
|
||
3. **Recovery before escalation** — known failure modes should auto-heal once before asking for help.
|
||
4. **Branch freshness before blame** — detect stale branches before treating red tests as new regressions.
|
||
5. **Partial success is first-class** — e.g. MCP startup can succeed for some servers and fail for others, with structured degraded-mode reporting.
|
||
6. **Terminal is transport, not truth** — tmux/TUI may remain implementation details, but orchestration state must live above them.
|
||
7. **Policy is executable** — merge, retry, rebase, stale cleanup, and escalation rules should be machine-enforced.
|
||
|
||
## Roadmap
|
||
|
||
## Phase 1 — Reliable Worker Boot
|
||
|
||
### 1. Ready-handshake lifecycle for coding workers
|
||
Add explicit states:
|
||
- `spawning`
|
||
- `trust_required`
|
||
- `ready_for_prompt`
|
||
- `prompt_accepted`
|
||
- `running`
|
||
- `blocked`
|
||
- `finished`
|
||
- `failed`
|
||
|
||
Acceptance:
|
||
- prompts are never sent before `ready_for_prompt`
|
||
- trust prompt state is detectable and emitted
|
||
- shell misdelivery becomes detectable as a first-class failure state
|
||
|
||
### 2. Trust prompt resolver
|
||
Add allowlisted auto-trust behavior for known repos/worktrees.
|
||
|
||
Acceptance:
|
||
- trusted repos auto-clear trust prompts
|
||
- events emitted for `trust_required` and `trust_resolved`
|
||
- non-allowlisted repos remain gated
|
||
|
||
### 3. Structured session control API
|
||
Provide machine control above tmux:
|
||
- create worker
|
||
- await ready
|
||
- send task
|
||
- fetch state
|
||
- fetch last error
|
||
- restart worker
|
||
- terminate worker
|
||
|
||
Acceptance:
|
||
- a claw can operate a coding worker without raw send-keys as the primary control plane
|
||
|
||
## Phase 2 — Event-Native Clawhip Integration
|
||
|
||
### 4. Canonical lane event schema
|
||
Define typed events such as:
|
||
- `lane.started`
|
||
- `lane.ready`
|
||
- `lane.prompt_misdelivery`
|
||
- `lane.blocked`
|
||
- `lane.red`
|
||
- `lane.green`
|
||
- `lane.commit.created`
|
||
- `lane.pr.opened`
|
||
- `lane.merge.ready`
|
||
- `lane.finished`
|
||
- `lane.failed`
|
||
- `branch.stale_against_main`
|
||
|
||
Acceptance:
|
||
- clawhip consumes typed lane events
|
||
- Discord summaries are rendered from structured events instead of pane scraping alone
|
||
|
||
### 5. Failure taxonomy
|
||
Normalize failure classes:
|
||
- `prompt_delivery`
|
||
- `trust_gate`
|
||
- `branch_divergence`
|
||
- `compile`
|
||
- `test`
|
||
- `plugin_startup`
|
||
- `mcp_startup`
|
||
- `mcp_handshake`
|
||
- `gateway_routing`
|
||
- `tool_runtime`
|
||
- `infra`
|
||
|
||
Acceptance:
|
||
- blockers are machine-classified
|
||
- dashboards and retry policies can branch on failure type
|
||
|
||
### 6. Actionable summary compression
|
||
Collapse noisy event streams into:
|
||
- current phase
|
||
- last successful checkpoint
|
||
- current blocker
|
||
- recommended next recovery action
|
||
|
||
Acceptance:
|
||
- channel status updates stay short and machine-grounded
|
||
- claws stop inferring state from raw build spam
|
||
|
||
## Phase 3 — Branch/Test Awareness and Auto-Recovery
|
||
|
||
### 7. Stale-branch detection before broad verification
|
||
Before broad test runs, compare current branch to `main` and detect if known fixes are missing.
|
||
|
||
Acceptance:
|
||
- emit `branch.stale_against_main`
|
||
- suggest or auto-run rebase/merge-forward according to policy
|
||
- avoid misclassifying stale-branch failures as new regressions
|
||
|
||
### 8. Recovery recipes for common failures
|
||
Encode known automatic recoveries for:
|
||
- trust prompt unresolved
|
||
- prompt delivered to shell
|
||
- stale branch
|
||
- compile red after cross-crate refactor
|
||
- MCP startup handshake failure
|
||
- partial plugin startup
|
||
|
||
Acceptance:
|
||
- one automatic recovery attempt occurs before escalation
|
||
- the attempted recovery is itself emitted as structured event data
|
||
|
||
### 9. Green-ness contract
|
||
Workers should distinguish:
|
||
- targeted tests green
|
||
- package green
|
||
- workspace green
|
||
- merge-ready green
|
||
|
||
Acceptance:
|
||
- no more ambiguous "tests passed" messaging
|
||
- merge policy can require the correct green level for the lane type
|
||
|
||
## Phase 4 — Claws-First Task Execution
|
||
|
||
### 10. Typed task packet format
|
||
Define a structured task packet with fields like:
|
||
- objective
|
||
- scope
|
||
- repo/worktree
|
||
- branch policy
|
||
- acceptance tests
|
||
- commit policy
|
||
- reporting contract
|
||
- escalation policy
|
||
|
||
Acceptance:
|
||
- claws can dispatch work without relying on long natural-language prompt blobs alone
|
||
- task packets can be logged, retried, and transformed safely
|
||
|
||
### 11. Policy engine for autonomous coding
|
||
Encode automation rules such as:
|
||
- if green + scoped diff + review passed -> merge to dev
|
||
- if stale branch -> merge-forward before broad tests
|
||
- if startup blocked -> recover once, then escalate
|
||
- if lane completed -> emit closeout and cleanup session
|
||
|
||
Acceptance:
|
||
- doctrine moves from chat instructions into executable rules
|
||
|
||
### 12. Claw-native dashboards / lane board
|
||
Expose a machine-readable board of:
|
||
- repos
|
||
- active claws
|
||
- worktrees
|
||
- branch freshness
|
||
- red/green state
|
||
- current blocker
|
||
- merge readiness
|
||
- last meaningful event
|
||
|
||
Acceptance:
|
||
- claws can query status directly
|
||
- human-facing views become a rendering layer, not the source of truth
|
||
|
||
## Phase 5 — Plugin and MCP Lifecycle Maturity
|
||
|
||
### 13. First-class plugin/MCP lifecycle contract
|
||
Each plugin/MCP integration should expose:
|
||
- config validation contract
|
||
- startup healthcheck
|
||
- discovery result
|
||
- degraded-mode behavior
|
||
- shutdown/cleanup contract
|
||
|
||
Acceptance:
|
||
- partial-startup and per-server failures are reported structurally
|
||
- successful servers remain usable even when one server fails
|
||
|
||
### 14. MCP end-to-end lifecycle parity
|
||
Close gaps from:
|
||
- config load
|
||
- server registration
|
||
- spawn/connect
|
||
- initialize handshake
|
||
- tool/resource discovery
|
||
- invocation path
|
||
- error surfacing
|
||
- shutdown/cleanup
|
||
|
||
Acceptance:
|
||
- parity harness and runtime tests cover healthy and degraded startup cases
|
||
- broken servers are surfaced as structured failures, not opaque warnings
|
||
|
||
## Immediate Backlog (from current real pain)
|
||
|
||
Priority order: P0 = blocks CI/green state, P1 = blocks integration wiring, P2 = clawability hardening, P3 = swarm-efficiency improvements.
|
||
|
||
**P0 — Fix first (CI reliability)**
|
||
1. Isolate `render_diff_report` tests into tmpdir — flaky under `cargo test --workspace`; reads real working-tree state; breaks CI during active worktree ops
|
||
2. Expand GitHub CI from single-crate coverage to workspace-grade verification — current `rust-ci.yml` runs `cargo fmt` and `cargo test -p rusty-claude-cli`, but misses broader `cargo test --workspace` coverage that already passes locally
|
||
3. Add release-grade binary workflow — repo has a Rust CLI and release intent, but no GitHub Actions path that builds tagged artifacts / checks release packaging before a publish step
|
||
4. Add container-first test/run docs — runtime detects Docker/Podman/container state, but docs do not show a canonical container workflow for `cargo test --workspace`, binary execution, or bind-mounted repo usage
|
||
5. Surface `doctor` / preflight diagnostics in onboarding docs and help — the CLI already has setup-diagnosis commands and branch preflight machinery, but they are not prominent enough in README/USAGE, so new users still ask manual setup questions instead of running a built-in health check first
|
||
6. Automate branding/source-of-truth residue checks in CI — the manual doc pass is done, but badges, Discord invites, copied repo URLs, and org-name drift can regress unless a cheap lint/check keeps README/docs aligned with `ultraworkers/claw-code`
|
||
7. Eliminate warning spam from first-run help/build path — `cargo run -p rusty-claude-cli -- --help` currently prints a wall of compile warnings before the actual help text, which pollutes the first-touch UX and hides the product surface behind unrelated noise
|
||
8. Promote `doctor` from slash-only to top-level CLI entrypoint — users naturally try `claw doctor`, but today it errors and tells them to enter a REPL or resume path first; healthcheck flows should be callable directly from the shell
|
||
9. Make machine-readable status commands actually machine-readable — `status` and `sandbox` accept the global `--output-format json` flag path, but currently still render prose tables, which breaks shell automation and agent-friendly health polling
|
||
10. Unify legacy config/skill namespaces in user-facing output — `skills` currently surfaces mixed project roots like `.codex` and `.claude`, which leaks historical layers into the current product and makes it unclear which config namespace is canonical
|
||
11. Honor JSON output on inventory commands like `skills` and `mcp` — these are exactly the commands agents and shell scripts want to inspect programmatically, but `--output-format json` still yields prose, forcing text scraping where structured inventory should exist
|
||
12. Audit `--output-format` contract across the whole CLI surface — current behavior is inconsistent by subcommand, so agents cannot trust the global flag without command-by-command probing; the format contract itself needs to become deterministic
|
||
|
||
**P1 — Next (integration wiring, unblocks verification)**
|
||
2. Add cross-module integration tests — **done**: 12 integration tests covering worker→recovery→policy, stale_branch→policy, green_contract→policy, reconciliation flows
|
||
3. Wire lane-completion emitter — **done**: `lane_completion` module with `detect_lane_completion()` auto-sets `LaneContext::completed` from session-finished + tests-green + push-complete → policy closeout
|
||
4. Wire `SummaryCompressor` into the lane event pipeline — **done**: `compress_summary_text()` feeds into `LaneEvent::Finished` detail field in `tools/src/lib.rs`
|
||
|
||
**P2 — Clawability hardening (original backlog)**
|
||
5. Worker readiness handshake + trust resolution — **done**: `WorkerStatus` state machine with `Spawning` → `TrustRequired` → `ReadyForPrompt` → `PromptAccepted` → `Running` lifecycle, `trust_auto_resolve` + `trust_gate_cleared` gating
|
||
6. Prompt misdelivery detection and recovery — **done**: `prompt_delivery_attempts` counter, `PromptMisdelivery` event detection, `auto_recover_prompt_misdelivery` + `replay_prompt` recovery arm
|
||
7. Canonical lane event schema in clawhip — **done**: `LaneEvent` enum with `Started/Blocked/Failed/Finished` variants, `LaneEvent::new()` typed constructor, `tools/src/lib.rs` integration
|
||
8. Failure taxonomy + blocker normalization — **done**: `WorkerFailureKind` enum (`TrustGate/PromptDelivery/Protocol/Provider`), `FailureScenario::from_worker_failure_kind()` bridge to recovery recipes
|
||
9. Stale-branch detection before workspace tests — **done**: `stale_branch.rs` module with freshness detection, behind/ahead metrics, policy integration
|
||
10. MCP structured degraded-startup reporting — **done**: `McpManager` degraded-startup reporting (+183 lines in `mcp_stdio.rs`), failed server classification (startup/handshake/config/partial), structured `failed_servers` + `recovery_recommendations` in tool output
|
||
11. Structured task packet format — **done**: `task_packet.rs` module with `TaskPacket` struct, validation, serialization, `TaskScope` resolution (workspace/module/single-file/custom), integrated into `tools/src/lib.rs`
|
||
12. Lane board / machine-readable status API — **done**: Lane completion hardening + `LaneContext::completed` auto-detection + MCP degraded reporting surface machine-readable state
|
||
13. **Session completion failure classification** — **done**: `WorkerFailureKind::Provider` + `observe_completion()` + recovery recipe bridge landed
|
||
14. **Config merge validation gap** — **done**: `config.rs` hook validation before deep-merge (+56 lines), malformed entries fail with source-path context instead of merged parse errors
|
||
15. **MCP manager discovery flaky test** — `manager_discovery_report_keeps_healthy_servers_when_one_server_fails` has intermittent timing issues in CI; temporarily ignored, needs root cause fix
|
||
|
||
16. **Commit provenance / worktree-aware push events** — clawhip build stream shows duplicate-looking commit messages and worktree-originated pushes without clear supersession indicators; add worktree/branch metadata to push events and de-dup superseded commits in build stream display
|
||
17. **Orphaned module integration audit** — `session_control` is `pub mod` exported from `runtime` but has zero consumers across the entire workspace (no import, no call site outside its own file). `trust_resolver` types are re-exported from `lib.rs` but never instantiated outside unit tests. These modules implement core clawability contracts (session management, trust resolution) that are structurally dead — built but not wired into the CLI or tools crate. **Action:** audit all `pub mod` / `pub use` exports from `runtime` for actual call sites; either wire orphaned modules into the real execution path or demote to `pub(crate)` / `cfg(test)` to prevent false clawability surface.
|
||
18. **Context-window preflight gap** — claw-code auto-compacts only after cumulative input crosses a static `100_000`-token threshold, while provider requests derive `max_tokens` from a naive model-name heuristic (`opus` => 32k, else 64k) and do not appear to preflight `estimated_prompt_tokens + requested_output_tokens` against the selected model’s actual context window. Result: giant sessions can be sent upstream and fail hard with provider-side `input_exceeds_context_by_*` errors instead of local preflight compaction/rejection. **Action:** add a model-context registry + request-size preflight before provider call; if projected request exceeds context, emit a structured `context_window_blocked` event and auto-compact or force `/compact` before retry.
|
||
19. **Subcommand help falls through into runtime/API path** — direct dogfood shows `./target/debug/claw doctor --help` and `./target/debug/claw status --help` do not render local subcommand help. Instead they enter the request path, show `🦀 Thinking...`, then fail with `api returned 500 ... auth_unavailable: no auth available`. Help/usage surfaces must be pure local parsing and never require auth or provider reachability. **Action:** fix argv dispatch so `<subcommand> --help` is intercepted before runtime startup/API client initialization; add regression tests for `doctor --help`, `status --help`, and similar local-info commands.
|
||
20. **Session state classification gap (working vs blocked vs finished vs truly stale)** — dogfooding with 14 parallel tmux/OMX lanes exposed that text-idle stale detection is far too coarse. Sessions were repeatedly flagged as stale even when they were already **finished/reportable** (P0.9, P0.10, P2.18, P2.19), **working but quiet** (doc/branding/audit passes), or **blocked on a specific recoverable state** (background terminal still running, cherry-pick conflict, MCP startup noise, transport interruption after partial progress). **Action:** add explicit machine states above prose scraping such as `working`, `blocked_background_job`, `blocked_merge_conflict`, `degraded_mcp`, `interrupted_transport`, `finished_pending_report`, `finished_cleanable`, and `truly_idle`; update clawhip/session monitoring so quiet work is not paged as stale and completed sessions can auto-report + auto-clean.
|
||
|
||
**P3 — Swarm efficiency**
|
||
13. Swarm branch-lock protocol — detect same-module/same-branch collision before parallel workers drift into duplicate implementation
|
||
14. Commit provenance / worktree-aware push events — emit branch, worktree, superseded-by, and canonical commit lineage so parallel sessions stop producing duplicate-looking push summaries
|
||
|
||
## Suggested Session Split
|
||
|
||
### Session A — worker boot protocol
|
||
Focus:
|
||
- trust prompt detection
|
||
- ready-for-prompt handshake
|
||
- prompt misdelivery detection
|
||
|
||
### Session B — clawhip lane events
|
||
Focus:
|
||
- canonical lane event schema
|
||
- failure taxonomy
|
||
- summary compression
|
||
|
||
### Session C — branch/test intelligence
|
||
Focus:
|
||
- stale-branch detection
|
||
- green-level contract
|
||
- recovery recipes
|
||
|
||
### Session D — MCP lifecycle hardening
|
||
Focus:
|
||
- startup/handshake reliability
|
||
- structured failed server reporting
|
||
- degraded-mode runtime behavior
|
||
- lifecycle tests/harness coverage
|
||
|
||
### Session E — typed task packets + policy engine
|
||
Focus:
|
||
- structured task format
|
||
- retry/merge/escalation rules
|
||
- autonomous lane closure behavior
|
||
|
||
## MVP Success Criteria
|
||
|
||
We should consider claw-code materially more clawable when:
|
||
- a claw can start a worker and know with certainty when it is ready
|
||
- claws no longer accidentally type tasks into the shell
|
||
- stale-branch failures are identified before they waste debugging time
|
||
- clawhip reports machine states, not just tmux prose
|
||
- MCP/plugin startup failures are classified and surfaced cleanly
|
||
- a coding lane can self-recover from common startup and branch issues without human babysitting
|
||
|
||
## Short Version
|
||
|
||
claw-code should evolve from:
|
||
- a CLI a human can also drive
|
||
|
||
to:
|
||
- a **claw-native execution runtime**
|
||
- an **event-native orchestration substrate**
|
||
- a **plugin/hook-first autonomous coding harness**
|