Mirror of https://github.com/instructkr/claw-code.git — synced 2026-04-26 19:14:59 +08:00
roadmap: #220 filed — Image/vision input is structurally impossible across the entire data model.

Findings:
- Data model: InputContentBlock (types.rs:80-94) has only Text/ToolUse/ToolResult — three of three exhaustive variants. There is no Image variant, no Document variant, no MediaType, no ImageSource, no base64/file_id slot, and no media_type field anywhere in rust/crates/api/src/.
- Slash commands: /image <path> and /screenshot have no parse arm, despite their advertised summaries ("Add an image file to the conversation" at commands/lib.rs:585; "Take a screenshot and add to conversation" at commands/lib.rs:578) having been in the canonical SlashCommandSpec table since project inception. Both are gated under STUB_COMMANDS at main.rs:8381-8382 — a UX patch over the missing feature, not a fix for it.
- Attachment metadata: ResolvedAttachment (tools/lib.rs:2660-2666) carries a path/size/is_image triple but no bytes, no base64, no media_type, no upload affordance, and no transport-ready payload — even though is_image_path (line 5276) correctly classifies png/jpg/jpeg/gif/webp/bmp/svg extensions, and the SendUserMessage/Brief tool surfaces isImage: true in its JSON envelope (asserted at line 8969).
- Request builders: build_chat_completion_request (openai_compat.rs:845) and translate_message (openai_compat.rs:946) have three-arm exhaustive matches over Text/ToolUse/ToolResult with no Image arm — neither the Anthropic-canonical wire shape {type: "image", source: {type: "base64", media_type, data}} nor the OpenAI-compat wire shape {type: "image_url", image_url: {url: "data:image/...;base64,..."}}.
- Asymmetric capability: the markdown renderer (render.rs:379-426) handles Tag::Image and TagEnd::Image for *output* rendering — when the model emits image markdown it is rendered as a colored [image:url] link, but when the user attaches an image it vanishes into a silent black hole at the API boundary.
- The runtime's own worker_boot test fixture (worker_boot.rs:1324+:1349) literally hard-codes "Explain this KakaoTalk screenshot for a friend" as the canonical task-classification example for worker prompt-mismatch recovery — claw-code uses screenshot analysis as a runtime-classifier signal while having zero capability to actually send a screenshot to the model.
- Backlog framing: TUI-ENHANCEMENT-PLAN.md:57 records the gap as "No image/attachment preview", but it is far worse than a missing preview — there is no transport, no codec, no envelope, nothing at all from the byte stream to the wire.

Tracking (Jobdori cycle #372; extends the #168c emission-routing audit):
- Sibling-shape cluster grows to nineteen: #201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220.
- Wire-format-parity cluster grows to ten: #211+#212+#213+#214+#215+#216+#217+#218+#219+#220.
- Capability-parity cluster (strict superset, including user-facing surfacing): #218+#220.
- The five-layer structural-absence shape (data-model variant + slash-command parse arm + attachment-metadata threading + request-builder translation + OS-integration helper) is the largest single feature absence yet catalogued, exceeding #218's four layers.
- The advertised-but-unbuilt shape is novel — a UX-layer cousin of #219's false-positive-opt-in shape — and applicable to other STUB_COMMAND entries with capability-claim summaries.

Ecosystem context: claw-code is the sole client/agent/CLI in the surveyed coding-agent ecosystem with zero image-input capability, despite Anthropic Vision GA on 2024-03-04 (25 months before filing; default-on for all Claude 3.5+ models, with limits of 5MB per image, 32MB per request, and 100 images per request), OpenAI Vision GA on 2024-05-13 (23 months), and Google Gemini multimodal GA on 2024-02-15 (26 months). This makes #220 a regression against the upstream claude-code CLI that claw-code is porting from.

External validation:
- Anthropic Vision API reference (platform.claude.com/docs/en/build-with-claude/vision) documenting the canonical {type, source: {type, media_type, data}} content block; the Anthropic Messages API reference; the Anthropic Files API beta with file_id references for repeated-image-use efficiency.
- AWS Bedrock prompt-caching docs, with image-block coverage, a stricter 20-images-per-request limit, and the same cachePoint:{} pattern from #219.
- OpenAI Vision API reference documenting the {type: image_url, image_url: {url}} data-URL shape used by GPT-4o/4o-mini/5-vision/o1-vision/o3-vision/DeepSeek-VL2/Qwen-VL/QwQ-VL/MiniMax-VL/Moonshot kimi-VL.
- Google Gemini multimodal API documenting the {inline_data: {mime_type, data}} shape.
- anomalyco/opencode#16184 (look_at tool image-file-from-disk handling bug), anomalyco/opencode#15728 (Read tool image-handling bug), anomalyco/opencode#8875 (custom-provider attachment-allowlist gap), anomalyco/opencode#17205 (text-only-model token burn on image attachment) — all four are integration-quality gaps in opencode, whereas claw-code is missing the capability entirely (~85% vs 0% parity asymmetry, the largest in the cluster).
- charmbracelet/crush vision input via terminal paste; simonw/llm --attachment flag; Vercel AI SDK experimental_attachments plus image content blocks; LangChain HumanMessage content blocks; LangGraph image-message routing; first-class image-typed messages in the OpenAI Python and Anthropic Python SDKs; anthropic-quickstarts vision examples; the official claude-code CLI's paste-image and screenshot shortcuts (the upstream this regresses against).
- OpenTelemetry GenAI semconv multimodal observability attributes gen_ai.input.attachments and gen_ai.input.images.count; the IANA MIME-type registry (RFC 4288/4289).
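The missing pieces the filing names can be sketched concretely. The following is a minimal Rust sketch — all names here are hypothetical illustrations, not the actual claw-code API — showing how an Image variant could slot into the content-block enum, an extension-to-MIME classifier of the kind is_image_path is described as already performing, and the two provider wire shapes built as JSON strings:

```rust
/// Hypothetical shape of the content-block enum with the variant the
/// filing reports missing. Field names are assumptions for illustration.
#[allow(dead_code)]
enum InputContentBlockSketch {
    Text { text: String },
    // ToolUse and ToolResult elided; the filing says these three arms
    // are currently exhaustive.
    Image { media_type: String, data: String }, // the missing variant
}

/// Map a file extension to a MIME type, covering the same
/// png/jpg/jpeg/gif/webp/bmp/svg set the filing says is_image_path
/// already classifies. Returns None for non-image extensions.
fn media_type_for(path: &str) -> Option<&'static str> {
    let ext = path.rsplit('.').next()?.to_ascii_lowercase();
    Some(match ext.as_str() {
        "png" => "image/png",
        "jpg" | "jpeg" => "image/jpeg",
        "gif" => "image/gif",
        "webp" => "image/webp",
        "bmp" => "image/bmp",
        "svg" => "image/svg+xml",
        _ => return None,
    })
}

/// Anthropic-canonical content block:
/// {"type":"image","source":{"type":"base64","media_type":...,"data":...}}
fn anthropic_image_block(media_type: &str, b64: &str) -> String {
    format!(
        r#"{{"type":"image","source":{{"type":"base64","media_type":"{media_type}","data":"{b64}"}}}}"#
    )
}

/// OpenAI-compat content part:
/// {"type":"image_url","image_url":{"url":"data:<mime>;base64,<data>"}}
fn openai_image_part(media_type: &str, b64: &str) -> String {
    format!(
        r#"{{"type":"image_url","image_url":{{"url":"data:{media_type};base64,{b64}"}}}}"#
    )
}
```

Under this sketch, a /image handler would read the file bytes, base64-encode them (e.g. via the base64 crate), pick the media type with media_type_for, and hand the result to whichever of the two builders matches the active provider — exactly the data-model-to-request-builder threading the filing says is absent at every layer.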
This commit is contained in: ROADMAP.md (250)