fix: #164 Stage A — cooperative cancellation via cancel_event in submit_message

Closes the #161 follow-up gap identified in review: wall-clock timeout bounded caller-facing wait but did not cancel the underlying provider thread, which could silently mutate mutable_messages / transcript_store / permission_denials / total_usage after the caller had already observed stop_reason='timeout'. A ghost turn committed post-deadline would poison any session that got persisted afterwards. Stage A scope (this commit): runtime + engine layer cooperative cancel. Engine layer (src/query_engine.py): - submit_message now accepts cancel_event: threading.Event | None = None - Two safe checkpoints: 1. Entry (before max_turns / budget projection) — earliest possible return 2. Post-budget (after output synthesis, before mutation) — catches cancel that arrives while output was being computed - Both checkpoints return stop_reason='cancelled' with state UNCHANGED (mutable_messages, transcript_store, permission_denials, total_usage all preserved exactly as on entry) - cancel_event=None preserves legacy behaviour with zero overhead (no checkpoint checks at all) Runtime layer (src/runtime.py): - run_turn_loop creates one cancel_event per invocation when a deadline is in play (and None otherwise, preserving legacy fast path) - Passes the same event to every submit_message call across turns, so a late cancel on turn N-1 affects turn N - On timeout (either pre-call or mid-call), runtime explicitly calls cancel_event.set() before future.cancel() + synthesizing the timeout TurnResult. This upgrades #161's best-effort future.cancel() (which only cancels not-yet-started futures) to cooperative mid-flight cancel. Stop reason taxonomy after Stage A: 'completed' — turn committed, state mutated exactly once 'max_budget_reached' — overflow, state unchanged (#162) 'max_turns_reached' — capacity exceeded, state unchanged 'cancelled' — cancel_event observed, state unchanged (#164 Stage A) 'timeout' — synthesised by runtime, not engine (#161) The 'cancelled' vs 'timeout' split matters: - 'timeout' is the runtime's best-effort signal to the caller: deadline hit - 'cancelled' is the engine's confirmation: cancel was observed + honoured If the provider call wedges entirely (never reaches a checkpoint), the caller still sees 'timeout' and the thread is leaked — but any NEXT submit_message call on the same engine observes the event at entry and returns 'cancelled' immediately, preventing ghost-turn accumulation. This is the honest cooperative limit in Python threading land; true preemption requires async-native provider IO (future work, not Stage A). Tests (29 new tests, tests/test_submit_message_cancellation.py + tests/ test_run_turn_loop_cancellation.py): Engine-layer (12 tests): - TestCancellationBeforeCall (5): pre-set event returns 'cancelled' immediately; mutable_messages, transcript_store, usage, permission_denials all preserved - TestCancellationAfterBudgetCheck (1): cancel set mid-call (after projection, before commit) still honoured; output synthesised but state untouched - TestCancellationAfterCommit (2): post-commit cancel not observable (honest limit) BUT next call on same engine observes it + returns 'cancelled' - TestLegacyCallersUnchanged (3): cancel_event=None preserves #162 atomicity + max_turns contract with zero behaviour change - TestCancellationVsOtherStopReasons (2): cancel precedes max_turns check; cancel does not retroactively override a completed turn Runtime-layer (5 tests): - TestTimeoutPropagatesCancelEvent (3): submit_message receives a real Event object when deadline is set; None in legacy mode; timeout actually calls event.set() so in-flight threads observe at their next checkpoint - TestCancelEventSharedAcrossTurns (1): same event object passed to every turn (object identity check) — late cancel on turn N-1 must affect turn N Regression: 3 existing timeout test mocks updated to accept cancel_event kwarg (mocks that previously had signature (prompt, commands, tools, denials) now have (prompt, commands, tools, denials, cancel_event=None) since runtime passes cancel_event positionally on the timeout path). Full suite: 97 → 114 passing, zero regression. Closes ROADMAP #164 Stage A. What's explicitly NOT in Stage A: - Preemptive cancellation of wedged provider IO (requires asyncio-native provider path; larger refactor) - Timeout on the legacy unbounded run_turn_loop path (by design: legacy callers opt out of cancellation entirely) - CLI exposure of 'cancelled' as a distinct exit code (currently 'cancelled' maps to the same stop_reason != 'completed' break condition as others; CLI surface for cancel is a separate pinpoint if warranted)
2026-04-27 00:44:57 +08:00 · 2026-04-22 18:14:14 +09:00
parent 455bdec06c
commit 524edb2b2e
5 changed files with 475 additions and 7 deletions
--- a/src/query_engine.py
+++ b/src/query_engine.py
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import json
+import threading
 from dataclasses import dataclass, field
 from uuid import uuid4

@@ -64,7 +65,59 @@ class QueryEnginePort:
        matched_commands: tuple[str, ...] = (),
        matched_tools: tuple[str, ...] = (),
        denied_tools: tuple[PermissionDenial, ...] = (),
+        cancel_event: threading.Event | None = None,
    ) -> TurnResult:
+        """Submit a prompt and return a TurnResult.
+
+        #164 Stage A: cooperative cancellation via cancel_event.
+
+        The cancel_event argument (added for #164) lets a caller request early
+        termination at a safe point. When set before the pre-mutation commit
+        stage, submit_message returns early with ``stop_reason='cancelled'``
+        and the engine's state (mutable_messages, transcript_store,
+        permission_denials, total_usage) is left **exactly as it was on
+        entry**. This closes the #161 follow-up gap: before this change, a
+        wedged provider thread could finish executing and silently mutate
+        state after the caller had already observed ``stop_reason='timeout'``,
+        giving the session a ghost turn the caller never acknowledged.
+
+        Contract:
+          - cancel_event is None (default) — legacy behaviour, no checks.
+          - cancel_event set **before** budget check — returns 'cancelled'
+            immediately; no output synthesis, no projection, no mutation.
+          - cancel_event set **between** budget check and commit — returns
+            'cancelled' with state intact.
+          - cancel_event set **after** commit — not observable; the turn is
+            already committed and the caller sees 'completed'. Cancellation
+            is a *safe point* mechanism, not preemption. This is the honest
+            limit of cooperative cancellation in Python threading land.
+
+        Stop reason taxonomy after #164 Stage A:
+          - 'completed'            — turn committed, state mutated exactly once
+          - 'max_budget_reached'   — overflow, state unchanged (#162)
+          - 'max_turns_reached'    — capacity exceeded, state unchanged
+          - 'cancelled'            — cancel_event observed, state unchanged
+          - 'timeout'              — synthesised by runtime, not engine (#161)
+
+        Callers that care about deadline-driven cancellation (run_turn_loop)
+        can now request cleanup by setting the event on timeout — the next
+        submit_message on the same engine will observe it at the start and
+        return 'cancelled' without touching state, even if the previous call
+        is still wedged in provider IO.
+        """
+        # #164 Stage A: earliest safe cancellation point. No output synthesis,
+        # no budget projection, no mutation — just an immediate clean return.
+        if cancel_event is not None and cancel_event.is_set():
+            return TurnResult(
+                prompt=prompt,
+                output='',
+                matched_commands=matched_commands,
+                matched_tools=matched_tools,
+                permission_denials=denied_tools,
+                usage=self.total_usage,  # unchanged
+                stop_reason='cancelled',
+            )
+
        if len(self.mutable_messages) >= self.config.max_turns:
            output = f'Max turns reached before processing prompt: {prompt}'
            return TurnResult(
@@ -104,6 +157,21 @@ class QueryEnginePort:
                stop_reason='max_budget_reached',
            )

+        # #164 Stage A: second safe cancellation point. Projection is done
+        # but nothing has been committed yet. If the caller cancelled while
+        # we were building output / computing budget, honour it here — still
+        # no mutation.
+        if cancel_event is not None and cancel_event.is_set():
+            return TurnResult(
+                prompt=prompt,
+                output=output,
+                matched_commands=matched_commands,
+                matched_tools=matched_tools,
+                permission_denials=denied_tools,
+                usage=self.total_usage,  # unchanged
+                stop_reason='cancelled',
+            )
+
        self.mutable_messages.append(prompt)
        self.transcript_store.append(prompt)
        self.permission_denials.extend(denied_tools)
--- a/src/runtime.py
+++ b/src/runtime.py
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import threading
 import time
 from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError
 from dataclasses import dataclass
@@ -209,6 +210,14 @@ class PortRuntime:
        denied_tools = tuple(self._infer_permission_denials(matches))
        results: list[TurnResult] = []
        deadline = time.monotonic() + timeout_seconds if timeout_seconds is not None else None
+        # #164 Stage A: shared cancel_event signals cooperative cancellation
+        # across turns. On timeout we set() it so any still-running
+        # submit_message call (or the next one on the same engine) observes
+        # the cancel at a safe checkpoint and returns stop_reason='cancelled'
+        # without mutating state. This closes the window where a wedged
+        # provider thread could commit a ghost turn after the caller saw
+        # 'timeout'.
+        cancel_event = threading.Event() if deadline is not None else None

        # ThreadPoolExecutor is reused across turns so we cancel cleanly on exit.
        executor = ThreadPoolExecutor(max_workers=1) if deadline is not None else None
@@ -229,22 +238,35 @@ class PortRuntime:
                if deadline is None:
                    # Legacy path: unbounded call, preserves existing behaviour exactly.
                    # #159: pass inferred denied_tools (no longer hardcoded empty tuple)
+                    # #164: cancel_event is None on this path; submit_message skips
+                    # cancellation checks entirely (legacy zero-overhead behaviour).
                    result = engine.submit_message(turn_prompt, command_names, tool_names, denied_tools)
                else:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
+                        # #164: signal cancel for any in-flight/future submit_message
+                        # calls that share this engine. Safe because nothing has been
+                        # submitted yet this turn.
+                        assert cancel_event is not None
+                        cancel_event.set()
                        results.append(self._build_timeout_result(turn_prompt, command_names, tool_names))
                        break
                    assert executor is not None
                    future = executor.submit(
-                        engine.submit_message, turn_prompt, command_names, tool_names, denied_tools
+                        engine.submit_message, turn_prompt, command_names, tool_names,
+                        denied_tools, cancel_event,
                    )
                    try:
                        result = future.result(timeout=remaining)
                    except FuturesTimeoutError:
-                        # Best-effort cancel; submit_message may still finish in background
-                        # but we never read its output. The engine's own state mutation
-                        # is owned by the engine and not our concern here.
+                        # #164 Stage A: explicitly signal cancel to the still-running
+                        # submit_message thread. The next time it hits a checkpoint
+                        # (entry or post-budget), it returns 'cancelled' without
+                        # mutating state instead of committing a ghost turn. This
+                        # upgrades #161's best-effort future.cancel() (which only
+                        # cancels pre-start futures) to cooperative mid-flight cancel.
+                        assert cancel_event is not None
+                        cancel_event.set()
                        future.cancel()
                        results.append(self._build_timeout_result(turn_prompt, command_names, tool_names))
                        break