mirror of
https://github.com/instructkr/claw-code.git
synced 2026-04-16 21:14:50 +08:00
refresh
This commit is contained in:
647
ROADMAP.md
647
ROADMAP.md
@@ -88,6 +88,25 @@ Acceptance:
|
||||
- trust prompt state is detectable and emitted
|
||||
- shell misdelivery becomes detectable as a first-class failure state
|
||||
|
||||
### 1.5. First-prompt acceptance SLA
|
||||
After `ready_for_prompt`, expose whether the first task was actually accepted within a bounded window instead of leaving claws in a silent limbo.
|
||||
|
||||
Emit typed signals for:
|
||||
- `prompt.sent`
|
||||
- `prompt.accepted`
|
||||
- `prompt.acceptance_delayed`
|
||||
- `prompt.acceptance_timeout`
|
||||
|
||||
Track at least:
|
||||
- time from `ready_for_prompt` -> first prompt send
|
||||
- time from first prompt send -> `prompt_accepted`
|
||||
- whether acceptance required retry or recovery
|
||||
|
||||
Acceptance:
|
||||
- clawhip can distinguish `worker is ready but idle` from `prompt was sent but not actually accepted`
|
||||
- long silent gaps between ready-state and first-task execution become machine-visible
|
||||
- recovery can trigger on acceptance timeout before humans start scraping panes
|
||||
|
||||
### 2. Trust prompt resolver
|
||||
Add allowlisted auto-trust behavior for known repos/worktrees.
|
||||
|
||||
@@ -109,6 +128,23 @@ Provide machine control above tmux:
|
||||
Acceptance:
|
||||
- a claw can operate a coding worker without raw send-keys as the primary control plane
|
||||
|
||||
### 3.5. Boot preflight / doctor contract
|
||||
Before spawning or prompting a worker, run a machine-readable preflight that reports whether the lane is actually safe to start.
|
||||
|
||||
Preflight should check and emit typed results for:
|
||||
- repo/worktree existence and expected branch
|
||||
- branch freshness vs base branch
|
||||
- trust-gate likelihood / allowlist status
|
||||
- required binaries and control sockets
|
||||
- plugin discovery / allowlist / startup eligibility
|
||||
- MCP config presence and server reachability expectations
|
||||
- last-known failed boot reason, if any
|
||||
|
||||
Acceptance:
|
||||
- claws can fail fast before launching a doomed worker
|
||||
- a blocked start returns a short structured diagnosis instead of forcing pane-scrape triage
|
||||
- clawhip can summarize `why this lane did not even start` without inferring from terminal noise
|
||||
|
||||
## Phase 2 — Event-Native Clawhip Integration
|
||||
|
||||
### 4. Canonical lane event schema
|
||||
@@ -130,6 +166,551 @@ Acceptance:
|
||||
- clawhip consumes typed lane events
|
||||
- Discord summaries are rendered from structured events instead of pane scraping alone
|
||||
|
||||
### 4.5. Session event ordering + terminal-state reconciliation
|
||||
When the same session emits contradictory lifecycle events (`idle`, `error`, `completed`, transport/server-down) in close succession, claw-code must expose a deterministic final truth instead of making downstream claws guess.
|
||||
|
||||
Required behavior:
|
||||
- attach monotonic sequence / causal ordering metadata to session lifecycle events
|
||||
- classify which events are terminal vs advisory
|
||||
- reconcile duplicate or out-of-order terminal events into one canonical lane outcome
|
||||
- distinguish `session terminal state unknown because transport died` from a real `completed`
|
||||
|
||||
Acceptance:
|
||||
- clawhip can survive `completed -> idle -> error -> completed` noise without double-reporting or trusting the wrong final state
|
||||
- server-down after a session event burst surfaces as a typed uncertainty state rather than silently rewriting history
|
||||
- downstream automation has one canonical terminal outcome per lane/session
|
||||
|
||||
### 4.6. Event provenance / environment labeling
|
||||
Every emitted event should say whether it came from a live lane, synthetic test, healthcheck, replay, or system transport layer so claws do not mistake test noise for production truth.
|
||||
|
||||
Required fields:
|
||||
- event source kind (`live_lane`, `test`, `healthcheck`, `replay`, `transport`)
|
||||
- environment / channel label
|
||||
- emitter identity
|
||||
- confidence / trust level for downstream automation
|
||||
|
||||
Acceptance:
|
||||
- clawhip can ignore or down-rank test pings without heuristic text matching
|
||||
- synthetic/system events do not contaminate lane status or trigger false follow-up automation
|
||||
- event streams remain machine-trustworthy even when test traffic shares the same channel
|
||||
|
||||
### 4.7. Session identity completeness at creation time
|
||||
A newly created session should not surface as `(untitled)` or `(unknown)` for fields that orchestrators need immediately.
|
||||
|
||||
Required behavior:
|
||||
- emit stable title, workspace/worktree path, and lane/session purpose at creation time
|
||||
- if any field is not yet known, emit an explicit typed placeholder reason rather than a bare unknown string
|
||||
- reconcile later-enriched metadata back onto the same session identity without creating ambiguity
|
||||
|
||||
Acceptance:
|
||||
- clawhip can route/triage a brand-new session without waiting for follow-up chatter
|
||||
- `(untitled)` / `(unknown)` creation events no longer force humans or bots to guess scope
|
||||
- session creation events are immediately actionable for monitoring and ownership decisions
|
||||
|
||||
### 4.8. Duplicate terminal-event suppression
|
||||
When the same session emits repeated `completed`, `failed`, or other terminal notifications, claw-code should collapse duplicates before they trigger repeated downstream reactions.
|
||||
|
||||
Required behavior:
|
||||
- attach a canonical terminal-event fingerprint per lane/session outcome
|
||||
- suppress or coalesce repeated terminal notifications within a reconciliation window
|
||||
- preserve raw event history for audit while exposing only one actionable terminal outcome downstream
|
||||
- surface when a later duplicate materially differs from the original terminal payload
|
||||
|
||||
Acceptance:
|
||||
- clawhip does not double-report or double-close based on repeated terminal notifications
|
||||
- duplicate `completed` bursts become one actionable finish event, not repeated noise
|
||||
- downstream automation stays idempotent even when the upstream emitter is chatty
|
||||
|
||||
### 4.9. Lane ownership / scope binding
|
||||
Each session and lane event should declare who owns it and what workflow scope it belongs to, so unrelated external/system work does not pollute claw-code follow-up loops.
|
||||
|
||||
Required behavior:
|
||||
- attach owner/assignee identity when known
|
||||
- attach workflow scope (e.g. `claw-code-dogfood`, `external-git-maintenance`, `infra-health`, `manual-operator`)
|
||||
- mark whether the current watcher is expected to act, observe only, or ignore
|
||||
- preserve scope through session restarts, resumes, and late terminal events
|
||||
|
||||
Acceptance:
|
||||
- clawhip can say `out-of-scope external session` without humans adding a prose disclaimer
|
||||
- unrelated session churn does not trigger false claw-code follow-up or blocker reporting
|
||||
- monitoring views can filter to `actionable for this claw` instead of mixing every session on the host
|
||||
|
||||
### 4.10. Nudge acknowledgment / dedupe contract
|
||||
Periodic clawhip nudges should carry enough state for claws to know whether the current prompt is new work, a retry, or an already-acknowledged heartbeat.
|
||||
|
||||
Required behavior:
|
||||
- attach nudge id / cycle id and delivery timestamp
|
||||
- expose whether the current claw has already acknowledged or responded for that cycle
|
||||
- distinguish `new nudge`, `retry nudge`, and `stale duplicate`
|
||||
- allow downstream summaries to bind a reported pinpoint back to the triggering nudge id
|
||||
|
||||
Acceptance:
|
||||
- claws do not keep manufacturing fresh follow-ups just because the same periodic nudge reappeared
|
||||
- clawhip can tell whether silence means `not yet handled` or `already acknowledged in this cycle`
|
||||
- recurring dogfood prompts become idempotent and auditable across retries
|
||||
|
||||
### 4.11. Stable roadmap-id assignment for newly filed pinpoints
|
||||
When a claw records a new pinpoint/follow-up, the roadmap surface should assign or expose a stable tracking id immediately instead of leaving the item as anonymous prose.
|
||||
|
||||
Required behavior:
|
||||
- assign a canonical roadmap id at filing time
|
||||
- expose that id in the structured event/report payload
|
||||
- preserve the same id across later edits, reorderings, and summary compression
|
||||
- distinguish `new roadmap filing` from `update to existing roadmap item`
|
||||
|
||||
Acceptance:
|
||||
- channel updates can reference a newly filed pinpoint by stable id in the same turn
|
||||
- downstream claws do not need heuristic text matching to figure out whether a follow-up is new or already tracked
|
||||
- roadmap-driven dogfood loops stay auditable even as the document is edited repeatedly
|
||||
|
||||
### 4.12. Roadmap item lifecycle state contract
|
||||
Each roadmap pinpoint should carry a machine-readable lifecycle state so claws do not keep rediscovering or re-reporting items that are already active, resolved, or superseded.
|
||||
|
||||
Required behavior:
|
||||
- expose lifecycle state (`filed`, `acknowledged`, `in_progress`, `blocked`, `done`, `superseded`)
|
||||
- attach last state-change timestamp
|
||||
- allow a new report to declare whether it is a first filing, status update, or closure
|
||||
- preserve lineage when one pinpoint supersedes or merges into another
|
||||
|
||||
Acceptance:
|
||||
- clawhip can tell `new gap` from `existing gap still active` without prose interpretation
|
||||
- completed or superseded items stop reappearing as if they were fresh discoveries
|
||||
- roadmap-driven follow-up loops become stateful instead of repeatedly stateless
|
||||
|
||||
### 4.13. Multi-message report atomicity
|
||||
A single dogfood/lane update should be representable as one structured report payload, even if the chat surface ends up rendering it across multiple messages.
|
||||
|
||||
Required behavior:
|
||||
- assign one report id for the whole update
|
||||
- bind `active_sessions`, `exact_pinpoint`, `concrete_delta`, and `blocker` fields to that same report id
|
||||
- expose message-part ordering when the chat transport splits the report
|
||||
- allow downstream consumers to reconstruct one canonical update without scraping adjacent chat messages heuristically
|
||||
|
||||
Acceptance:
|
||||
- clawhip and other claws can parse one logical update even when Discord delivery fragments it into several posts
|
||||
- partial/misordered message bursts do not scramble `pinpoint` vs `delta` vs `blocker`
|
||||
- dogfood reports become machine-reliable summaries instead of fragile chat archaeology
|
||||
|
||||
### 4.14. Cross-claw pinpoint dedupe / merge contract
|
||||
When multiple claws file near-identical pinpoints from the same underlying failure, the roadmap surface should merge or relate them instead of letting duplicate follow-ups accumulate as separate discoveries.
|
||||
|
||||
Required behavior:
|
||||
- compute or expose a similarity/dedupe key for newly filed pinpoints
|
||||
- allow a new filing to link to an existing roadmap item as `same_root_cause`, `related`, or `supersedes`
|
||||
- preserve reporter-specific evidence while collapsing the canonical tracked issue
|
||||
- surface when a later filing is genuinely distinct despite similar wording
|
||||
|
||||
Acceptance:
|
||||
- two claws reporting the same gap do not automatically create two independent roadmap items
|
||||
- roadmap growth reflects real new findings instead of duplicate observer churn
|
||||
- downstream monitoring can see both the canonical item and the supporting duplicate evidence without losing auditability
|
||||
|
||||
### 4.15. Pinpoint evidence attachment contract
|
||||
Each filed pinpoint should carry structured supporting evidence so later implementers do not have to reconstruct why the gap was believed to exist.
|
||||
|
||||
Required behavior:
|
||||
- attach evidence references such as session ids, message ids, commits, logs, stack traces, or file paths
|
||||
- label each attachment by evidence role (`repro`, `symptom`, `root_cause_hint`, `verification`)
|
||||
- preserve bounded previews for human scanning while keeping a canonical reference for machines
|
||||
- allow evidence to be added after filing without changing the pinpoint identity
|
||||
|
||||
Acceptance:
|
||||
- roadmap items stay actionable after chat scrollback or session context is gone
|
||||
- implementation lanes can start from structured evidence instead of rediscovering the original failure
|
||||
- prioritization can weigh pinpoints by evidence quality, not just prose confidence
|
||||
|
||||
### 4.16. Pinpoint priority / severity contract
|
||||
Each filed pinpoint should expose a machine-readable urgency/severity signal so claws can separate immediate execution blockers from lower-priority clawability hardening.
|
||||
|
||||
Required behavior:
|
||||
- attach priority/severity fields (for example `p0`/`p1`/`p2` or `critical`/`high`/`medium`/`low`)
|
||||
- distinguish user-facing breakage, operator-only friction, observability debt, and long-tail hardening
|
||||
- allow priority to change as new evidence lands without changing the pinpoint identity
|
||||
- surface why the priority was assigned (blast radius, reproducibility, automation breakage, merge risk)
|
||||
|
||||
Acceptance:
|
||||
- clawhip can rank fresh pinpoints without relying on prose urgency vibes
|
||||
- implementation queues can pull true blockers ahead of reporting-only niceties
|
||||
- roadmap dogfood stays focused on the most damaging clawability gaps first
|
||||
|
||||
### 4.17. Pinpoint-to-implementation handoff contract
|
||||
A filed pinpoint should be able to turn into an execution lane without a human re-translating the same context by hand.
|
||||
|
||||
Required behavior:
|
||||
- expose a structured handoff packet containing objective, suspected scope, evidence refs, priority, and suggested verification
|
||||
- mark whether the pinpoint is `implementation_ready`, `needs_repro`, or `needs_triage`
|
||||
- preserve the link between the roadmap item and any spawned execution lane/worktree/PR
|
||||
- allow later execution results to update the original pinpoint state instead of forking separate unlinked narratives
|
||||
|
||||
Acceptance:
|
||||
- a claw can pick up a filed pinpoint and start implementation with minimal re-interpretation
|
||||
- roadmap items stop being dead prose and become executable handoff units
|
||||
- follow-up loops can see which pinpoints have already turned into real execution lanes
|
||||
|
||||
### 4.18. Report backpressure / repetitive-summary collapse
|
||||
Periodic dogfood reporting should avoid re-broadcasting the full known gap inventory every cycle when only a small delta changed.
|
||||
|
||||
Required behavior:
|
||||
- distinguish `new since last report` from `still active but unchanged`
|
||||
- emit compact delta-first summaries with an optional expandable full state
|
||||
- track per-channel/reporting cursor so repeated unchanged items collapse automatically
|
||||
- preserve one canonical full snapshot elsewhere for audit/debug without flooding the live channel
|
||||
|
||||
Acceptance:
|
||||
- new signal does not get buried under the same repeated backlog list every cycle
|
||||
- claws and humans can scan the latest update for actual change instead of re-reading the whole inventory
|
||||
- recurring dogfood loops become low-noise without losing auditability
|
||||
|
||||
### 4.19. No-change / no-op acknowledgment contract
|
||||
When a dogfood cycle produces no new pinpoint, no new delta, and no new blocker, claws should be able to acknowledge that cycle explicitly without pretending a fresh finding exists.
|
||||
|
||||
Required behavior:
|
||||
- expose a structured `no_change` / `noop` outcome for a reporting cycle
|
||||
- bind that outcome to the triggering nudge/report id
|
||||
- distinguish `checked and unchanged` from `not yet checked`
|
||||
- preserve the last meaningful pinpoint/delta reference without re-filing it as new work
|
||||
|
||||
Acceptance:
|
||||
- recurring nudges do not force synthetic novelty when the real answer is `nothing changed`
|
||||
- clawhip can tell `handled, no delta` apart from silence or missed handling
|
||||
- dogfood loops become honest and low-noise when the system is stable
|
||||
|
||||
### 4.20. Observation freshness / staleness-age contract
|
||||
Every reported status, pinpoint, or blocker should carry an explicit observation timestamp/age so downstream claws can tell fresh state from stale carry-forward.
|
||||
|
||||
Required behavior:
|
||||
- attach observed-at timestamp and derived age to active-session state, pinpoints, and blockers
|
||||
- distinguish freshly observed facts from carried-forward prior-cycle state
|
||||
- allow freshness TTLs so old observations degrade from `current` to `stale` automatically
|
||||
- surface when a report contains mixed freshness windows across its fields
|
||||
|
||||
Acceptance:
|
||||
- claws do not mistake a 2-hour-old observation for current truth just because it reappeared in the latest report
|
||||
- stale carried-forward state is visible and can be down-ranked or revalidated
|
||||
- dogfood summaries remain trustworthy even when some fields are unchanged across many cycles
|
||||
|
||||
### 4.21. Fact / hypothesis / confidence labeling
|
||||
Dogfood reports should distinguish confirmed observations from inferred root-cause guesses so downstream claws do not treat speculation as settled truth.
|
||||
|
||||
Required behavior:
|
||||
- label each reported claim as `observed_fact`, `inference`, `hypothesis`, or `recommendation`
|
||||
- attach a confidence score or confidence bucket to non-fact claims
|
||||
- preserve which evidence supports each claim
|
||||
- allow a later report to promote a hypothesis into confirmed fact without changing the underlying pinpoint identity
|
||||
|
||||
Acceptance:
|
||||
- claws can tell `we saw X happen` from `we think Y caused it`
|
||||
- speculative root-cause text does not get mistaken for machine-trustworthy state
|
||||
- dogfood summaries stay honest about uncertainty while remaining actionable
|
||||
|
||||
### 4.22. Negative-evidence / searched-and-not-found contract
|
||||
When a dogfood cycle reports that something was not found (no active sessions, no new delta, no repro, no blocker), the report should also say what was checked so absence is machine-meaningful rather than empty prose.
|
||||
|
||||
Required behavior:
|
||||
- attach the checked surfaces/sources for negative findings (sessions, logs, roadmap, state file, channel window, etc.)
|
||||
- distinguish `not observed in checked scope` from `unknown / not checked`
|
||||
- preserve the query/window used for the negative observation when relevant
|
||||
- allow later reports to invalidate an earlier negative finding if the search scope was incomplete
|
||||
|
||||
Acceptance:
|
||||
- `no blocker` and `no new delta` become auditable conclusions rather than unverifiable vibes
|
||||
- downstream claws can tell whether absence means `looked and clean` or `did not inspect`
|
||||
- stable dogfood periods stay trustworthy without overclaiming certainty
|
||||
|
||||
### 4.23. Field-level delta attribution
|
||||
Even in delta-first reporting, claws still need to know exactly which structured fields changed between cycles instead of inferring change from prose.
|
||||
|
||||
Required behavior:
|
||||
- emit field-level change markers for core report fields (`active_sessions`, `pinpoint`, `delta`, `blocker`, lifecycle state, priority, freshness)
|
||||
- distinguish `changed`, `unchanged`, `cleared`, and `carried_forward`
|
||||
- preserve previous value references or hashes when useful for machine comparison
|
||||
- allow one report to contain both changed and unchanged fields without losing per-field status
|
||||
|
||||
Acceptance:
|
||||
- downstream claws can tell precisely what changed this cycle without diffing entire message bodies
|
||||
- delta-first summaries remain compact while still being machine-comparable
|
||||
- recurring reports stop forcing text-level reparse just to answer `what actually changed?`
|
||||
|
||||
### 4.24. Report schema versioning / compatibility contract
|
||||
As structured dogfood reports evolve, the reporting surface needs explicit schema versioning so downstream claws can parse new fields safely without silent breakage.
|
||||
|
||||
Required behavior:
|
||||
- attach schema version to each structured report payload
|
||||
- define additive vs breaking field changes
|
||||
- expose compatibility guidance for consumers that only understand older schemas
|
||||
- preserve a minimal stable core so basic parsing survives partial upgrades
|
||||
|
||||
Acceptance:
|
||||
- downstream claws can reject, warn on, or gracefully degrade unknown schema versions instead of misparsing silently
|
||||
- adding new reporting fields does not randomly break existing automation
|
||||
- dogfood reporting can evolve quickly without losing machine trust
|
||||
|
||||
### 4.25. Consumer capability negotiation for structured reports
|
||||
Schema versioning alone is not enough if different claws consume different subsets of the reporting surface. The producer should know what the consumer can actually understand.
|
||||
|
||||
Required behavior:
|
||||
- let downstream consumers advertise supported schema versions and optional field families/capabilities
|
||||
- allow producers to emit a reduced-compatible payload when a consumer cannot handle richer report fields
|
||||
- surface when a report was downgraded for compatibility vs emitted in full fidelity
|
||||
- preserve one canonical full-fidelity representation for audit/debug even when a downgraded view is delivered
|
||||
|
||||
Acceptance:
|
||||
- claws with older parsers can still consume useful reports without silent field loss being mistaken for absence
|
||||
- richer report evolution does not force every consumer to upgrade in lockstep
|
||||
- reporting remains machine-trustworthy across mixed-version claw fleets
|
||||
|
||||
### 4.26. Self-describing report schema surface
|
||||
Even with versioning and capability negotiation, downstream claws still need a machine-readable way to discover what fields and semantics a report version actually contains.
|
||||
|
||||
Required behavior:
|
||||
- expose a machine-readable schema/field registry for structured report payloads
|
||||
- document field meanings, enums, optionality, and deprecation status in a consumable format
|
||||
- let consumers fetch the schema for a referenced report version/capability set
|
||||
- preserve stable identifiers for fields so docs, code, and live payloads point at the same schema truth
|
||||
|
||||
Acceptance:
|
||||
- new consumers can integrate without reverse-engineering example payloads from chat logs
|
||||
- schema drift becomes detectable against a declared source of truth
|
||||
- structured report evolution stays fast without turning every integration into brittle archaeology
|
||||
|
||||
### 4.27. Audience-specific report projection
|
||||
The same canonical dogfood report should be projectable into different consumer views (clawhip, Jobdori, human operator) without each consumer re-summarizing the full payload from scratch.
|
||||
|
||||
Required behavior:
|
||||
- preserve one canonical structured report payload
|
||||
- support consumer-specific projections/views (for example `delta_brief`, `ops_audit`, `human_readable`, `roadmap_sync`)
|
||||
- let consumers declare preferred projection shape and verbosity
|
||||
- make the projection lineage explicit so a terse view still points back to the canonical report
|
||||
|
||||
Acceptance:
|
||||
- Jobdori/Clawhip/humans do not keep rebroadcasting the same full inventory in slightly different prose
|
||||
- each consumer gets the right level of detail without inventing its own lossy summary layer
|
||||
- reporting noise drops while the underlying truth stays shared and auditable
|
||||
|
||||
### 4.28. Canonical report identity / content-hash anchor
|
||||
Once multiple projections and summaries exist, the system needs a stable identity anchor proving they all came from the same underlying report state.
|
||||
|
||||
Required behavior:
|
||||
- assign a canonical report id plus content hash/fingerprint to the full structured payload
|
||||
- include projection-specific metadata without changing the canonical identity of unchanged underlying content
|
||||
- surface when two projections differ because the source report changed vs because only the rendering changed
|
||||
- allow downstream consumers to detect accidental duplicate sends of the exact same report payload
|
||||
|
||||
Acceptance:
|
||||
- claws can verify that different audience views refer to the same underlying report truth
|
||||
- duplicate projections of identical content do not look like new state changes
|
||||
- report lineage remains auditable even as the same canonical payload is rendered many ways
|
||||
|
||||
### 4.29. Projection invalidation / stale-view cache contract
|
||||
If the canonical report changes, previously emitted audience-specific projections must be identifiable as stale so downstream claws do not keep acting on an old rendered view.
|
||||
|
||||
Required behavior:
|
||||
- bind each projection to the canonical report id + content hash/version it was derived from
|
||||
- mark projections as superseded when the underlying canonical payload changes
|
||||
- expose whether a consumer is viewing the latest compatible projection or a stale cached one
|
||||
- allow cheap regeneration of projections without minting fake new report identities
|
||||
|
||||
Acceptance:
|
||||
- claws do not mistake an old `delta_brief` view for current truth after the canonical report was updated
|
||||
- projection caching reduces noise/compute without increasing stale-action risk
|
||||
- audience-specific views stay safely linked to the freshness of the underlying report
|
||||
|
||||
### 4.30. Projection-time redaction / sensitivity labeling
|
||||
As canonical reports accumulate richer evidence, projections need an explicit policy for what can be shown to which audience without losing machine trust.
|
||||
|
||||
Required behavior:
|
||||
- label report fields/evidence with sensitivity classes (for example `public`, `internal`, `operator_only`, `secret`)
|
||||
- let projections redact, summarize, or hash sensitive fields according to audience policy while preserving the canonical report intact
|
||||
- expose when a projection omitted or transformed data for sensitivity reasons
|
||||
- preserve enough stable identity/provenance that redacted projections can still be correlated with the canonical report
|
||||
|
||||
Acceptance:
|
||||
- richer canonical reports do not force all audience views to leak the same detail level
|
||||
- consumers can tell `field absent because redacted` from `field absent because nonexistent`
|
||||
- audience-specific projections stay safe without turning into unverifiable black boxes
|
||||
|
||||
### 4.31. Redaction provenance / policy traceability
|
||||
When a projection redacts or transforms data, downstream consumers should be able to tell which policy/rule caused it rather than treating redaction as unexplained disappearance.
|
||||
|
||||
Required behavior:
|
||||
- attach redaction reason/policy id to transformed or omitted fields
|
||||
- distinguish policy-based redaction from size truncation, compatibility downgrade, and source absence
|
||||
- preserve auditable linkage from the projection back to the canonical field classification
|
||||
- allow operators to review which projection policy version produced the visible output
|
||||
|
||||
Acceptance:
|
||||
- claws can tell *why* a field was hidden, not just that it vanished
|
||||
- redacted projections remain operationally debuggable instead of opaque
|
||||
- sensitivity controls stay auditable as reporting/projection policy evolves
|
||||
|
||||
### 4.32. Deterministic projection / redaction reproducibility
|
||||
Given the same canonical report, schema version, consumer capability set, and projection policy, the emitted projection should be reproducible byte-for-byte (or canonically equivalent) so audits and diffing do not drift on re-render.
|
||||
|
||||
Required behavior:
|
||||
- make projection/redaction output deterministic for the same inputs
|
||||
- surface which inputs participate in projection identity (schema version, capability set, policy version, canonical content hash)
|
||||
- distinguish content changes from nondeterministic rendering noise
|
||||
- allow canonical equivalence checks even when transport formatting differs
|
||||
|
||||
Acceptance:
|
||||
- re-rendering the same report for the same audience does not create fake deltas
|
||||
- audit/debug workflows can reproduce why a prior projection looked the way it did
|
||||
- projection pipelines stay machine-trustworthy under repeated regeneration
|
||||
|
||||
### 4.33. Projection golden-fixture / regression lock
|
||||
Once structured projections become deterministic, claw-code still needs regression fixtures that lock expected outputs so report rendering changes cannot slip in unnoticed.
|
||||
|
||||
Required behavior:
|
||||
- maintain canonical fixture inputs covering core report shapes, redaction classes, and capability downgrades
|
||||
- snapshot or equivalence-test expected projections for supported audience views
|
||||
- make intentional rendering/schema changes update fixtures explicitly rather than drifting silently
|
||||
- surface which fixture set/version validated a projection pipeline change
|
||||
|
||||
Acceptance:
|
||||
- projection regressions get caught before downstream claws notice broken or drifting output
|
||||
- deterministic rendering claims stay continuously verified, not assumed
|
||||
- report/projection evolution remains fast without sacrificing machine-trustworthy stability
|
||||
|
||||
### 4.34. Downstream consumer conformance test contract
|
||||
Producer-side fixture coverage is not enough if real downstream claws still parse or interpret the reporting contract incorrectly. The ecosystem needs a way to verify consumer behavior against the declared report schema/projection rules.
|
||||
|
||||
Required behavior:
|
||||
- define conformance cases for consumers across schema versions, capability downgrades, redaction states, and no-op cycles
|
||||
- provide a machine-runnable consumer test kit or fixture bundle
|
||||
- distinguish parse success from semantic correctness (for example: correctly handling `redacted` vs `missing`, `stale` vs `current`)
|
||||
- surface which consumer/version last passed the conformance suite
|
||||
|
||||
Acceptance:
|
||||
- report-contract drift is caught at the producer/consumer boundary, not only inside the producer
|
||||
- downstream claws can prove they understand the structured reporting surface they claim to support
|
||||
- mixed claw fleets stay interoperable without relying on optimism or manual spot checks
|
||||
|
||||
### 4.35. Provisional-status dedupe / in-flight acknowledgment suppression
|
||||
When a claw emits temporary status such as `working on it`, `please wait`, or `adding a roadmap gap`, repeated provisional notices should not flood the channel unless something materially changed.
|
||||
|
||||
Required behavior:
|
||||
- fingerprint provisional/in-flight status updates separately from terminal or delta-bearing reports
|
||||
- suppress repeated provisional messages with unchanged meaning inside a short reconciliation window
|
||||
- allow a new provisional update through only when progress state, owner, blocker, or ETA meaningfully changes
|
||||
- preserve raw repeats for audit/debug without exposing each one as a fresh channel event
|
||||
|
||||
Acceptance:
|
||||
- monitoring feeds do not churn on duplicate `please wait` / `working on it` messages
|
||||
- consumers can tell the difference between `still in progress, unchanged` and `new actionable update`
|
||||
- in-flight acknowledgments remain useful without drowning out real state transitions
|
||||
|
||||
### 4.36. Provisional-status escalation timeout
|
||||
If a provisional/in-flight status remains unchanged for too long, the system should stop treating it as harmless noise and promote it back into an actionable stale signal.
|
||||
|
||||
Required behavior:
|
||||
- attach timeout/TTL policy to provisional states
|
||||
- escalate prolonged unchanged provisional status into a typed stale/blocker signal
|
||||
- distinguish `deduped because still fresh` from `deduped too long and now suspicious`
|
||||
- surface which timeout policy triggered the escalation
|
||||
|
||||
Acceptance:
|
||||
- `working on it` does not suppress visibility forever when real progress stalled
|
||||
- consumers can trust provisional dedupe without losing long-stuck work
|
||||
- low-noise monitoring still resurfaces stale in-flight states at the right time
|
||||
|
||||
### 4.37. Policy-blocked action handoff
|
||||
When a requested action is disallowed by branch/merge/release policy (for example direct `main` push), the system should expose a structured refusal plus the next safe execution path instead of leaving only freeform prose.
|
||||
|
||||
Required behavior:
|
||||
- classify policy-blocked requests with a typed reason (`main_push_forbidden`, `release_requires_owner`, etc.)
|
||||
- attach the governing policy source and actor scope when available
|
||||
- emit a safe fallback path (`create branch`, `open PR`, `request owner approval`, etc.)
|
||||
- allow downstream claws/operators to distinguish `blocked by policy` from `blocked by technical failure`
|
||||
|
||||
Acceptance:
|
||||
- policy refusals become machine-actionable instead of dead-end chat text
|
||||
- claws can pivot directly to the safe alternative workflow without re-triaging the same request
|
||||
- monitoring/reporting can separate governance blocks from actual product/runtime defects
|
||||
|
||||
### 4.38. Policy exception / owner-approval token contract
|
||||
For actions that are normally blocked by policy but can be allowed with explicit owner approval, the approval path should be machine-readable instead of relying on ambiguous prose interpretation.
|
||||
|
||||
Required behavior:
|
||||
- represent policy exceptions as typed approval grants or tokens scoped to action/repo/branch/time window
|
||||
- bind the approval to the approving actor identity and policy being overridden
|
||||
- distinguish `no approval`, `approval pending`, `approval granted`, and `approval expired/revoked`
|
||||
- let downstream claws verify an approval artifact before executing the otherwise-blocked action
|
||||
|
||||
Acceptance:
|
||||
- exceptional approvals stop depending on fuzzy chat interpretation
|
||||
- claws can safely execute policy-exception flows without confusing them with ordinary blocked requests
|
||||
- governance stays auditable even when owner-authorized exceptions occur
|
||||
|
||||
### 4.39. Approval-token replay / one-time-use enforcement
|
||||
If policy-exception approvals become machine-readable tokens, they also need replay protection so one explicit exception cannot be silently reused beyond its intended scope.
|
||||
|
||||
Required behavior:
|
||||
- support one-time-use or bounded-use approval grants where appropriate
|
||||
- record token consumption against the exact action/repo/branch/commit scope it authorized
|
||||
- reject replay, scope expansion, or post-expiry reuse with typed policy errors
|
||||
- surface whether an approval was unused, consumed, partially consumed, expired, or revoked
|
||||
|
||||
Acceptance:
|
||||
- one owner-approved exception cannot quietly authorize repeated or broader dangerous actions
|
||||
- claws can distinguish `valid approval present` from `approval already spent`
|
||||
- governance exceptions remain auditable and non-replayable under automation
|
||||
|
||||
### 4.40. Approval-token delegation / execution chain traceability
|
||||
If one actor approves an exception and another claw/bot/session executes it, the system should preserve the delegation chain so policy exceptions remain attributable end-to-end.
|
||||
|
||||
Required behavior:
|
||||
- record approver identity, requesting actor, executing actor, and any intermediate relay/orchestrator hop
|
||||
- preserve the delegation chain on approval verification and token consumption events
|
||||
- distinguish direct self-use from delegated execution
|
||||
- surface when execution occurs through an unexpected or unauthorized delegate
|
||||
|
||||
Acceptance:
|
||||
- policy-exception execution stays attributable even across bot/session hops
|
||||
- audits can answer `who approved`, `who requested`, and `who actually used it`
|
||||
- delegated exception flows remain governable instead of collapsing into generic bot activity
|
||||
|
||||
### 4.41. Token-optimization / repo-scope guidance contract
|
||||
New users hit token burn and context bloat immediately, but the product surface does not clearly explain how repo scope, ignored paths, and working-directory choice affect clawability.
|
||||
|
||||
Required behavior:
|
||||
- explicitly document whether `.clawignore` / `.claudeignore` / `.gitignore` are honored, and how
|
||||
- surface a simple recommendation to start from the smallest useful subdirectory instead of the whole monorepo when possible
|
||||
- provide first-run guidance for excluding heavy/generated directories (`node_modules`, `dist`, `build`, `.next`, coverage, logs, dumps, generated reports`)
|
||||
- make token-saving repo-scope guidance visible in onboarding/help rather than buried in external chat advice
|
||||
|
||||
Acceptance:
|
||||
- new users can answer `how do I stop dragging junk into context?` from product docs/help alone
|
||||
- first-run confusion about ignore files and repo scope drops sharply
|
||||
- clawability improves before users burn tokens on obviously-avoidable junk
|
||||
|
||||
### 4.42. Workspace-scope weight preview / token-risk preflight
|
||||
Before a user starts a session in a repo, claw-code should surface a lightweight estimate of how heavy the current workspace is and why it may be costly.
|
||||
|
||||
Required behavior:
|
||||
- inspect the current working tree for high-risk token sinks (huge directories, generated artifacts, vendored deps, logs, dumps)
|
||||
- summarize likely context-bloat sources before deep indexing or first large prompt flow
|
||||
- recommend safer scope choices (e.g. narrower subdirectory, ignore patterns, cleanup targets)
|
||||
- distinguish `workspace looks clean` from `workspace is likely to burn tokens fast`
|
||||
|
||||
Acceptance:
|
||||
- users get an early warning before accidentally dogfooding the entire junkyard
|
||||
- token-saving guidance becomes situational and concrete, not just generic docs
|
||||
- onboarding catches avoidable repo-scope mistakes before they turn into cost/perf complaints
|
||||
|
||||
### 4.43. Safer-scope quick-apply action
|
||||
After warning that the current workspace is too heavy, claw-code should offer a direct way to adopt the safer scope instead of leaving the user to manually reinterpret the advice.
|
||||
|
||||
Required behavior:
|
||||
- turn scope recommendations into actionable choices (e.g. switch to subdirectory, generate ignore stub, exclude detected heavy paths)
|
||||
- preview what would be included/excluded before applying the change
|
||||
- preserve an easy path back to the original broader scope
|
||||
- distinguish advisory suggestions from user-confirmed scope changes
|
||||
|
||||
Acceptance:
|
||||
- users can go from `this workspace is too heavy` to `use this safer scope` in one step
|
||||
- token-risk preflight becomes operational guidance, not just warning text
|
||||
- first-run users stop getting stuck between diagnosis and manual cleanup
|
||||
|
||||
### 5. Failure taxonomy
|
||||
Normalize failure classes:
|
||||
- `prompt_delivery`
|
||||
@@ -148,6 +729,20 @@ Acceptance:
|
||||
- blockers are machine-classified
|
||||
- dashboards and retry policies can branch on failure type
|
||||
|
||||
### 5.5. Transport outage vs lane failure boundary
|
||||
When the control server or transport goes down, claw-code should distinguish host-level outage from lane-local failure instead of letting all active lanes look broken in the same vague way.
|
||||
|
||||
Required behavior:
|
||||
- emit typed transport outage events separate from lane failure events
|
||||
- annotate impacted lanes with dependency status (`blocked_by_transport`) rather than rewriting them as ordinary lane errors
|
||||
- preserve the last known good lane state before transport loss
|
||||
- surface outage scope (`single session`, `single worker host`, `shared control server`)
|
||||
|
||||
Acceptance:
|
||||
- clawhip can say `server down blocked 3 lanes` instead of pretending 3 independent lane failures happened
|
||||
- recovery policies can restart transport separately from lane-local recovery recipes
|
||||
- postmortems can separate infra blast radius from actual code-lane defects
|
||||
|
||||
### 6. Actionable summary compression
|
||||
Collapse noisy event streams into:
|
||||
- current phase
|
||||
@@ -159,6 +754,23 @@ Acceptance:
|
||||
- channel status updates stay short and machine-grounded
|
||||
- claws stop inferring state from raw build spam
|
||||
|
||||
### 6.5. Blocked-state subphase contract
|
||||
When a lane is `blocked`, also expose the exact subphase where progress stopped, rather than forcing claws to infer from logs.
|
||||
|
||||
Subphases should include at least:
|
||||
- `blocked.trust_prompt`
|
||||
- `blocked.prompt_delivery`
|
||||
- `blocked.plugin_init`
|
||||
- `blocked.mcp_handshake`
|
||||
- `blocked.branch_freshness`
|
||||
- `blocked.test_hang`
|
||||
- `blocked.report_pending`
|
||||
|
||||
Acceptance:
|
||||
- `lane.blocked` carries a stable subphase enum + short human summary
|
||||
- clawhip can say "blocked at MCP handshake" or "blocked waiting for trust clear" without pane scraping
|
||||
- retries can target the correct recovery recipe instead of treating all blocked states the same
|
||||
|
||||
## Phase 3 — Branch/Test Awareness and Auto-Recovery
|
||||
|
||||
### 7. Stale-branch detection before broad verification
|
||||
@@ -182,6 +794,22 @@ Acceptance:
|
||||
- one automatic recovery attempt occurs before escalation
|
||||
- the attempted recovery is itself emitted as structured event data
|
||||
|
||||
### 8.5. Recovery attempt ledger
|
||||
Expose machine-readable recovery progress so claws can see what automatic recovery has already tried, what is still running, and why escalation happened.
|
||||
|
||||
Ledger should include at least:
|
||||
- recovery recipe id
|
||||
- attempt count
|
||||
- current recovery state (`queued`, `running`, `succeeded`, `failed`, `exhausted`)
|
||||
- started/finished timestamps
|
||||
- last failure summary
|
||||
- escalation reason when retries stop
|
||||
|
||||
Acceptance:
|
||||
- clawhip can report `auto-recover tried prompt replay twice, then escalated` without log archaeology
|
||||
- operators can distinguish `no recovery attempted` from `recovery already exhausted`
|
||||
- repeated silent retry loops become visible and auditable
|
||||
|
||||
### 9. Green-ness contract
|
||||
Workers should distinguish:
|
||||
- targeted tests green
|
||||
@@ -249,6 +877,21 @@ Acceptance:
|
||||
- claws can query status directly
|
||||
- human-facing views become a rendering layer, not the source of truth
|
||||
|
||||
### 12.5. Running-state liveness heartbeat
|
||||
When a lane is marked `working` or otherwise in-progress, emit a lightweight liveness heartbeat so claws can tell quiet progress from silent stall.
|
||||
|
||||
Heartbeat should include at least:
|
||||
- current phase/subphase
|
||||
- seconds since last meaningful progress
|
||||
- seconds since last heartbeat
|
||||
- current active step label
|
||||
- whether background work is expected
|
||||
|
||||
Acceptance:
|
||||
- clawhip can distinguish `quiet but alive` from `working state went stale`
|
||||
- stale detection stops depending on raw pane churn alone
|
||||
- long-running compile/test/background steps stay machine-visible without log scraping
|
||||
|
||||
## Phase 5 — Plugin and MCP Lifecycle Maturity
|
||||
|
||||
### 13. First-class plugin/MCP lifecycle contract
|
||||
@@ -428,6 +1071,8 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes
|
||||
|
||||
32. **OpenAI-compatible provider/model-id passthrough is not fully literal** — **verified no-bug on 2026-04-09**: `resolve_model_alias()` only matches bare shorthand aliases (`opus`/`sonnet`/`haiku`) and passes everything else through unchanged, so `openai/gpt-4` reaches the dispatch layer unmodified. `strip_routing_prefix()` at `openai_compat.rs:732` then strips only recognised routing prefixes (`openai`, `xai`, `grok`, `qwen`) so the wire model is the bare backend id. No fix needed. **Original filing below.**
|
||||
|
||||
42. **Hook JSON failure opacity: invalid hook output does not surface the offending payload/context** — dogfooding on 2026-04-13 in the live `clawcode-human` lane repeatedly hit `PreToolUse/PostToolUse/Stop hook returned invalid ... JSON output` while the operator had no immediate visibility into which hook emitted malformed JSON, what raw stdout/stderr came back, or whether the failure was hook-formatting breakage vs prompt-misdelivery fallout. This turns a recoverable hook/schema bug into generic lane fog. **Impact.** Lanes look blocked/noisy, but the event surface is too lossy to classify whether the next action is fix the hook serializer, retry prompt delivery, or ignore a harmless hook-side warning. **Concrete delta landed now.** Recorded as an Immediate Backlog item so the failure is tracked explicitly instead of disappearing into channel scrollback. **Recommended fix shape:** when hook JSON parse fails, emit a typed hook failure event carrying hook phase/name, command/path, exit status, and a redacted raw stdout/stderr preview (bounded + safe), plus a machine class like `hook_invalid_json`. Add regression coverage for malformed-but-nonempty hook output so the surfaced error includes the preview instead of only `invalid ... JSON output`.
|
||||
|
||||
32. **OpenAI-compatible provider/model-id passthrough is not fully literal** — dogfooded 2026-04-08 via live user in #claw-code who confirmed the exact backend model id works outside claw but fails through claw for an OpenAI-compatible endpoint. The gap: `openai/` prefix is correctly used for **transport selection** (pick the OpenAI-compat client) but the **wire model id** — the string placed in `"model": "..."` in the JSON request body — may not be the literal backend model string the user supplied. Two candidate failure modes: **(a)** `resolve_model_alias()` is called on the model string before it reaches the wire — alias expansion designed for Anthropic/known models corrupts a user-supplied backend-specific id; **(b)** the `openai/` routing prefix may not be stripped before `build_chat_completion_request` packages the body, so backends receive `openai/gpt-4` instead of `gpt-4`. **Fix shape:** cleanly separate transport selection from wire model id. Transport selection uses the prefix; wire model id is the user-supplied string minus only the routing prefix — no alias expansion, no prefix leakage. **Trace path for next session:** (1) find where `resolve_model_alias()` is called relative to the OpenAI-compat dispatch path; (2) inspect what `build_chat_completion_request` puts in `"model"` for an `openai/some-backend-id` input. **Source:** live user in #claw-code 2026-04-08, confirmed exact model id works outside claw, fails through claw for OpenAI-compat backend.
|
||||
|
||||
33. **OpenAI `/responses` endpoint rejects claw's tool schema: `object schema missing properties` / `invalid_function_parameters`** — **done at `e7e0fd2` on 2026-04-09**. Added `normalize_object_schema()` in `openai_compat.rs` which recursively walks JSON Schema trees and injects `"properties": {}` and `"additionalProperties": false` on every object-type node (without overwriting existing values). Called from `openai_tool_definition()` so both `/chat/completions` and `/responses` receive strict-validator-safe schemas. 3 unit tests added. All api tests pass. **Original filing below.**
|
||||
@@ -500,6 +1145,8 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes
|
||||
|
||||
63. **Droid session completion semantics broken: code arrives after "status: completed"** — dogfooded 2026-04-12. Ultraclaw droid sessions (use-droid via acpx) report `session.status: completed` before file writes are fully flushed/synced to the working tree. Discovered +410 lines of "late-arriving" droid output that appeared after I had already assessed 8 sessions as "no code produced." This creates false-negative assessments and duplicate work. **Fix shape:** (a) droid agent should only report completion after explicit file-write confirmation (fsync or existence check); (b) or, claw-code should expose a `pending_writes` status that indicates "agent responded, disk flush pending"; (c) lane orchestrators should poll for file changes for N seconds after completion before final assessment. **Blocker:** none. Source: Jobdori ultraclaw dogfood 2026-04-12.
|
||||
|
||||
64. **ACP/Zed editor integration entrypoint is too implicit** — dogfooded 2026-04-13 from a user request for a `-acp` parameter to support ACP protocol integration in editor-first workflows such as Zed. The gap is not generic "please add another integration" churn; it is a **discoverability and launch-contract problem**. Right now the product surface does not make it obvious whether ACP is already supported, how an editor should invoke claw-code, or whether a dedicated flag/mode exists at all. That forces evaluators into repo archaeology instead of giving them a crisp editor-facing invocation contract. **Fix shape:** either (a) add an explicit ACP/editor entrypoint such as `--acp` / `acp serve` with help text that states the contract, or (b) if the protocol path already exists, surface it prominently in CLI help/README with a concrete Zed/editor integration example so users do not have to guess. **Acceptance bar:** an editor-first user can answer "how do I launch claw-code for ACP/Zed?" from `claw --help` or the first screen of docs without reading source. **Blocker:** none; currently recorded as a roadmap follow-up because the repo-local entrypoint was not obvious during dogfood.
|
||||
|
||||
64. **Artifact provenance is post-hoc narration, not structured events** — **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now attaches structured `artifactProvenance` metadata to `lane.finished`, including `sourceLanes`, `roadmapIds`, `files`, `diffStat`, `verification`, and `commitSha`, while keeping the existing `lane.commit.created` provenance event intact. Regression coverage locks a successful completion payload that carries roadmap ids, file paths, diff stat, verification states, and commit sha without relying on prose re-parsing. **Original filing below.**
|
||||
|
||||
65. **Backlog-scanning team lanes emit opaque stops, not structured selection outcomes** — **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now recognizes backlog-scan selection summaries and records structured `selectionOutcome` metadata on `lane.finished`, including `chosenItems`, `skippedItems`, `action`, and optional `rationale`, while preserving existing non-selection and review-lane behavior. Regression coverage locks the structured backlog-scan payload alongside the earlier quality-floor and review-verdict paths. **Original filing below.**
|
||||
|
||||
Reference in New Issue
Block a user