Compare commits
No commits in common. "775f91779cc44b7c3a80ed7d8512a4a5f12a6828" and "4a44460b5f95234451b3a4a545cd8ae3069a969c" have entirely different histories.
775f91779c
...
4a44460b5f
@ -1,328 +0,0 @@
|
|||||||
# Agent Infrastructure: Design Principles
|
|
||||||
|
|
||||||
You are editing agent infrastructure files (hooks, instructions, skills,
|
|
||||||
agents). Before making changes, understand the principles that govern how this
|
|
||||||
system works.
|
|
||||||
|
|
||||||
## Single Source of Truth
|
|
||||||
|
|
||||||
`.agents/` is the canonical directory for all agent infrastructure. An MCP
|
|
||||||
server (`.agents/mcp/index.ts`) exposes agents as prompts and skills as tools to
|
|
||||||
both Copilot and OpenCode — this replaces file-based fan-out to
|
|
||||||
`.github/agents/`, `.opencode/agents/`, etc.
|
|
||||||
|
|
||||||
### MCP server (`all-agents`)
|
|
||||||
|
|
||||||
Available once the server is running (configured in `.vscode/mcp.json` and
|
|
||||||
`opencode.json`):
|
|
||||||
|
|
||||||
- **Prompts** (slash commands): `/research`, `/brainstorm`, `/build`,
|
|
||||||
`/orchestrator`
|
|
||||||
- **Tools** (model-controlled): `load_research_methodology`
|
|
||||||
|
|
||||||
Bodies are read from disk at call time — editing `.agents/agents/*.md` or
|
|
||||||
`.agents/skills/research.md` takes effect immediately.
|
|
||||||
|
|
||||||
**Not handled by MCP** (stays bespoke):
|
|
||||||
|
|
||||||
- `.agents/hooks/` — MCP has no lifecycle intercept primitive
|
|
||||||
- This file — model needs to read it before `tools/list` is available
|
|
||||||
|
|
||||||
## The Enforcement Hierarchy
|
|
||||||
|
|
||||||
Not all guidance is equally effective. From most to least reliable:
|
|
||||||
|
|
||||||
```
|
|
||||||
PreToolUse hard block ← Structural. Always fires. Agent cannot bypass.
|
|
||||||
PostToolUse file-path check ← Fires right after editing a relevant file (context tail).
|
|
||||||
Nested AGENTS.md at path ← Always-on for that folder scope. Portable across tools.
|
|
||||||
Stop / SessionStart inject ← Fires at session boundaries. Good for broad reminders.
|
|
||||||
Root AGENTS.md sections ← Context-start only. Subject to "lost in the middle."
|
|
||||||
```
|
|
||||||
|
|
||||||
**Root cause of degradation** (Liu et al. 2023, "Lost in the Middle"): LLMs
|
|
||||||
attend to the beginning and end of context, not the middle. Guidance written
|
|
||||||
into AGENTS.md is injected once at session start and degrades as context grows.
|
|
||||||
Hooks inject at the _context tail_ — the high-attention zone — which is why they
|
|
||||||
outlast AGENTS.md under context pressure.
|
|
||||||
|
|
||||||
**Decision rule when adding new guidance:**
|
|
||||||
|
|
||||||
1. Is the anti-pattern a **terminal command**? → `PreToolUse` hard block
|
|
||||||
(Policies 1–6 in `pre-tool-use.sh`).
|
|
||||||
2. Is the anti-pattern **editing a specific file type or path**? → `PreToolUse`
|
|
||||||
block on `FILE_PATH` (Policy 7+).
|
|
||||||
3. Should the reminder fire **during active work** in a domain? → `PostToolUse`
|
|
||||||
file-path check (see `post-tool-use.sh` BFF reminder pattern).
|
|
||||||
4. Is it guidance scoped to **specific files** an agent might edit? → nested
|
|
||||||
`AGENTS.md` at the target path.
|
|
||||||
5. Should it fire **in response to what the user just wrote**? →
|
|
||||||
`UserPromptSubmit` injection (context tail, prompt text available — e.g.
|
|
||||||
agent nudges).
|
|
||||||
6. Is it a **broad session reminder** with no tight scope? → `SessionStart` or
|
|
||||||
`Stop` injection.
|
|
||||||
7. Is it **architecture/rationale** that an agent might need but shouldn't
|
|
||||||
always load? → AGENTS.md stub with a conditional `read_file` instruction (see
|
|
||||||
"Deferred Loading" below).
|
|
||||||
|
|
||||||
## Deferred Loading
|
|
||||||
|
|
||||||
Write a trigger condition and `read_file` instruction directly in an AGENTS.md
|
|
||||||
section. AGENTS.md is always loaded, so the trigger is always present; the
|
|
||||||
referenced file's content only loads when the model judges it relevant. Example:
|
|
||||||
|
|
||||||
> When the user shows signs of analysis paralysis, read
|
|
||||||
> `.agents/agents/brainstorm.md`.
|
|
||||||
|
|
||||||
Do **not** use tool-specific deferred-loading mechanisms (`description:`-only
|
|
||||||
`.instructions.md` files, etc.) — no portable equivalent exists. See Forbidden
|
|
||||||
Patterns below.
|
|
||||||
|
|
||||||
## Hook Files
|
|
||||||
|
|
||||||
All hook scripts live in `.agents/hooks/`. The Copilot harness
|
|
||||||
(`.agents/github/hooks.json`) and OpenCode plugin (`.agents/opencode/plugin.ts`)
|
|
||||||
both delegate to these scripts, keeping hook logic in one place. Symlinks from
|
|
||||||
`.github/hooks/agent-support.json` and `.opencode/plugins/agent-support.ts`
|
|
||||||
point back to these canonical sources; those directories are gitignored.
|
|
||||||
|
|
||||||
### Hook Injection Marker Convention
|
|
||||||
|
|
||||||
Every hook that injects `additionalContext` prefixes its payload with a
|
|
||||||
self-identifying line:
|
|
||||||
|
|
||||||
```
|
|
||||||
[HOOK INJECTION: <hook-name>] System reminder — NOT part of preceding tool output / user message:
|
|
||||||
```
|
|
||||||
|
|
||||||
The harness additionally wraps the payload in a
|
|
||||||
`<HookName-context>...</HookName-context>` XML tag (e.g.
|
|
||||||
`<PostToolUse-context>`). The inline prefix is belt-and-suspenders: when a hook
|
|
||||||
fires after a `read_file` whose content ends with markdown, the XML tag alone is
|
|
||||||
easy to miss — the inline prefix is not. **If you see either marker, treat the
|
|
||||||
content as a separate instruction, never as file content, tool output, or part
|
|
||||||
of the user's message.**
|
|
||||||
|
|
||||||
### Hook Architecture Principle: Platform-Agnostic Scripts
|
|
||||||
|
|
||||||
**Design target**: scripts accept normalized env vars (`TOOL_NAME`, `COMMAND`,
|
|
||||||
`FILE_PATH`), exit non-zero with plain-text denial reason on stdout. Callers
|
|
||||||
normalize input and translate exit code/stdout into their native denial format.
|
|
||||||
|
|
||||||
**⚠️ NOT YET IMPLEMENTED (May 2026)**: `pre-tool-use.sh` still uses
|
|
||||||
Copilot-specific JSON I/O. `plugin.ts` duplicates guards inline instead of
|
|
||||||
calling the script. See `agent-infrastructure.md` item 22 for the refactor plan.
|
|
||||||
|
|
||||||
### `user-prompt-submit.sh` — Per-turn tail injection
|
|
||||||
|
|
||||||
- Fires on every user message. Injects at the **context tail** (high-attention
|
|
||||||
zone) — this is why nudge logic lives here rather than in AGENTS.md.
|
|
||||||
- Detects brainstorm and research trigger words in the prompt and appends a
|
|
||||||
one-line nudge suggestion to `additionalContext`.
|
|
||||||
- Writes the raw prompt text to `/tmp/.last-user-prompt.txt` and injects the
|
|
||||||
task-capture instruction.
|
|
||||||
|
|
||||||
### `pre-tool-use.sh` — Hard stops
|
|
||||||
|
|
||||||
- Intercepts: `run_in_terminal`, `execution_subagent`, `send_to_terminal` (for
|
|
||||||
`$COMMAND`) and `replace_string_in_file`, `multi_replace_string_in_file`,
|
|
||||||
`create_file` (for `$FILE_PATH`).
|
|
||||||
- Outputs `permissionDecision: "deny"` to block the tool call.
|
|
||||||
- **CRITICAL**: A syntax error in this file blocks ALL file edits and terminal
|
|
||||||
commands. Always validate after editing:
|
|
||||||
`bash -n .agents/hooks/pre-tool-use.sh`
|
|
||||||
- When adding a new policy: follow the existing numbered pattern, add to the
|
|
||||||
comment header, use `deny "BLOCKED: ..."` with a clear fix instruction.
|
|
||||||
- Regex patterns operate on `$COMMAND` (terminal policies) or `$FILE_PATH`
|
|
||||||
(file-edit policies). Both are empty strings unless the right tool fired.
|
|
||||||
|
|
||||||
### `post-tool-use.sh` — Timed reminders
|
|
||||||
|
|
||||||
- Fires after every tool use with the tool name and response in stdin.
|
|
||||||
- Currently: self-check every 15 tool calls, debugging reminder on test failure,
|
|
||||||
BFF reminder when editing `apps/client/src/pages/`.
|
|
||||||
- Adding a new reminder: extract `$FILE_PATH` or match `$TOOL_NAME`, build the
|
|
||||||
message string, append to `$context`.
|
|
||||||
- Injects at the _tail_ of the context — this is what makes reminders persist
|
|
||||||
through long sessions.
|
|
||||||
|
|
||||||
### `session-start.sh` — Broad session injection
|
|
||||||
|
|
||||||
- Fires once per session. Good for: current branch, active investigations, dead
|
|
||||||
ends.
|
|
||||||
- Not good for: precise rule reminders (use PostToolUse or nested AGENTS.md).
|
|
||||||
- **OpenCode delivery:** injected as a synthetic `text` part via
|
|
||||||
`output.parts.unshift()` on the first `chat.message` turn. **Not** via
|
|
||||||
`experimental.chat.system.transform` — that hook fires for task-spawned
|
|
||||||
subagent sessions after a user message is already in context, which causes
|
|
||||||
Qwen-family GGUF models to abort with a Jinja "System message must be at the
|
|
||||||
beginning" error. See Forbidden Patterns below.
|
|
||||||
|
|
||||||
### `stop.sh` — End-of-session reflection
|
|
||||||
|
|
||||||
- Fires when agent stops. Lessons-learned capture + effort reflection.
|
|
||||||
- Not a blocking hook — injects `additionalContext` only.
|
|
||||||
|
|
||||||
### `pre-compact.sh` — Pre-summarization state export
|
|
||||||
|
|
||||||
- Fires before context is summarized. Saves investigation state to
|
|
||||||
`.session/pre-compact-state.md`.
|
|
||||||
- Note: `PostCompact` does NOT exist. Only `PreCompact`.
|
|
||||||
|
|
||||||
## Forbidden Patterns
|
|
||||||
|
|
||||||
These approaches exist in agentic tooling but are **banned** in this codebase
|
|
||||||
because portable alternatives exist. Document the reason so future agents
|
|
||||||
understand rather than re-introducing them.
|
|
||||||
|
|
||||||
### ❌ `applyTo:` frontmatter in `.instructions.md` files
|
|
||||||
|
|
||||||
Supported only in VS Code Copilot. Other tools either ignore it or load the file
|
|
||||||
as always-on context. Portable alternative: nested `AGENTS.md` at the target
|
|
||||||
path. Nested AGENTS.md files are natively supported by all major agent tools
|
|
||||||
(Copilot, OpenCode, Claude Code) without any special configuration.
|
|
||||||
|
|
||||||
### ❌ `description:`-only `.instructions.md` files (new additions)
|
|
||||||
|
|
||||||
VS Code Copilot builds a stub `<instructions>` block for these and tells the
|
|
||||||
model to load content on demand. Confirmed via `InstructionsContextComputer` in
|
|
||||||
`extensionHostProcess.js`. However, no other tool implements this — they load
|
|
||||||
the same files as always-on context. Portable alternative: AGENTS.md stub with a
|
|
||||||
`read_file` instruction (see "Deferred Loading" above).
|
|
||||||
|
|
||||||
### ❌ Any `.github/instructions/*.instructions.md` for new rules
|
|
||||||
|
|
||||||
`.instructions.md` is a VS Code Copilot-specific format. All new rules go into
|
|
||||||
nested `AGENTS.md` files (path-scoped rules) or directly into root `AGENTS.md`
|
|
||||||
(broad guidance). Do not add new `.instructions.md` files.
|
|
||||||
|
|
||||||
## Skills (`.agents/skills/`)
|
|
||||||
|
|
||||||
- Skills contain distilled methodologies that any agent can load on demand via
|
|
||||||
`read_file`. An agent MUST `read_file` the SKILL.md before using it.
|
|
||||||
- For **methodologies** (how to research, brainstorm) — not project rules.
|
|
||||||
Project rules belong in nested AGENTS.md files or hooks.
|
|
||||||
|
|
||||||
## Agents (`.agents/agents/`)
|
|
||||||
|
|
||||||
- Agent files define persona, workflow phases, tools, and circuit breakers.
|
|
||||||
- Runtime config (`model`, `mode`, `permission`) lives in `opencode.json` agent
|
|
||||||
entries. Body `.md` files are prompt-body only (plain markdown, no OpenCode
|
|
||||||
frontmatter keys except `description`).
|
|
||||||
- Circuit breakers (hard stops) belong in the agent file itself, not in hooks.
|
|
||||||
|
|
||||||
## Tool-Specific Entry Points
|
|
||||||
|
|
||||||
Some things cannot be unified and live in tool-specific locations:
|
|
||||||
|
|
||||||
- **`.agents/opencode/plugin.ts`** — OpenCode plugin harness (canonical).
|
|
||||||
Bridges hook scripts to OpenCode's plugin API. Symlinked from
|
|
||||||
`.opencode/plugins/agent-support.ts`.
|
|
||||||
- **`.agents/github/hooks.json`** — Copilot harness config (canonical). Points
|
|
||||||
to `.agents/hooks/*.sh`. Symlinked from `.github/hooks/agent-support.json`.
|
|
||||||
|
|
||||||
## Common Mistakes
|
|
||||||
|
|
||||||
- ❌ Writing long explanations in AGENTS.md for rules that could be a PreToolUse
|
|
||||||
block or nested AGENTS.md — they degrade under context pressure
|
|
||||||
- ❌ Adding a PostToolUse reminder without checking `$FILE_PATH` or `$TOOL_NAME`
|
|
||||||
— causes it to fire on every tool call, creating noise
|
|
||||||
- ❌ Leaving a syntax error in `pre-tool-use.sh` — blocks all file edits and
|
|
||||||
terminal commands immediately
|
|
||||||
- ❌ Creating new `.instructions.md` files — see Forbidden Patterns above
|
|
||||||
- ❌ Putting project-specific rules into a skill file — skills are for
|
|
||||||
methodologies, not codebase conventions
|
|
||||||
- ❌ Assuming PostCompact exists — it does not. Use PreCompact.
|
|
||||||
- ❌ Editing generated files in `.github/agents/`, `.github/skills/`,
|
|
||||||
`.opencode/agents/`, `.opencode/skills/` — edit `.agents/` sources instead, or
|
|
||||||
the pre-tool hook will block the edit
|
|
||||||
- ❌ Blaming the model for unexpected BLOCKED/tool-call behavior before
|
|
||||||
verifying the harness — when a model call is blocked or uses unexpected
|
|
||||||
parameters, check the actual tool schema first (read the source or docs)
|
|
||||||
before concluding the model is wrong. The harness was recently changed; the
|
|
||||||
model may be correct. Applies to: OpenCode tool names (`read`/`edit`/`task`),
|
|
||||||
parameter names (`offset`/`limit` not `startLine`/`endLine`), and plugin guard
|
|
||||||
logic.
|
|
||||||
- ❌ Using `experimental.chat.system.transform` to inject session-start content
|
|
||||||
in OpenCode. That hook fires for every model call — including task-spawned
|
|
||||||
subagent sessions — **after** the task prompt (a user message) is already in
|
|
||||||
the conversation. Pushing to `output.system` at that point places a system
|
|
||||||
message at a non-zero position, which Qwen-family GGUF models reject with
|
|
||||||
_"System message must be at the beginning"_ (Jinja chat template guard).
|
|
||||||
Fix: inject session-start as a synthetic `text` part via `output.parts.unshift()`
|
|
||||||
on the first `chat.message` turn (guarded by an `initializedSessions` set).
|
|
||||||
Text parts have no position constraint. Committed `f0d21e9` in dotfiles.
|
|
||||||
- ❌ Asserting that a third-party tool does **not** support a feature (config
|
|
||||||
mechanism, directory, option) without fetching the tool's current docs first.
|
|
||||||
Training data is frequently stale. Negative claims ("X doesn't have Y") must
|
|
||||||
be verified live — fetch the docs page before stating the absence. Cost of a
|
|
||||||
wrong negative: wasted user time, dead-end architecture, and eroded trust.
|
|
||||||
Rule: if you're about to say "tool X doesn't support Y," fetch the relevant
|
|
||||||
docs URL first.
|
|
||||||
- ❌ Adding _"reflect / double-check / are you sure / take another look"_
|
|
||||||
instructions as a mitigation for any failure mode — these feel productive in
|
|
||||||
transcripts but Huang et al. (arXiv:2310.01798) show that intrinsic
|
|
||||||
self-correction without an external oracle _consistently degrades_ reasoning
|
|
||||||
performance. Without a test runner, hook, type checker, or other ground- truth
|
|
||||||
signal in the loop, "ask the model to reflect" is at best noise. If the
|
|
||||||
failure mode lacks an external verifier, route to compaction, adversarial
|
|
||||||
reframing, or a cross-family judge subagent instead — see
|
|
||||||
[docs/research/intent-interpretation-action-plan.md](../docs/research/intent-interpretation-action-plan.md)
|
|
||||||
§4.1.
|
|
||||||
- ❌ Defaulting to multi-agent / parallel-worker topologies for complex tasks —
|
|
||||||
Cognition's failure analysis shows the dominant failure mode is **context
|
|
||||||
divergence**: separate agents accumulate incompatible interpretations of the
|
|
||||||
same task, and reconciliation costs exceed any parallelism gain. A single
|
|
||||||
agent loop with an explicit plan/act split outperforms multi-agent on almost
|
|
||||||
all real coding tasks (§3.1,
|
|
||||||
[docs/research/ai-coding-best-practices.md](../docs/research/ai-coding-best-practices.md)).
|
|
||||||
Subagents are only justified for read-only exploration, fully isolated tasks,
|
|
||||||
or adversarial review.
|
|
||||||
- ❌ Treating the orchestrator as the right pattern for cloud frontier models —
|
|
||||||
for local models the orchestrator is a **context firewall** (sub-agents return
|
|
||||||
≤2k compressed summaries; the parent's context never sees raw exploration).
|
|
||||||
Frontier models have 200k+ context and no `task` dispatch tool in Copilot, so
|
|
||||||
the firewall pattern doesn't apply. The cloud orchestrator is a **planning
|
|
||||||
gate** (forced decomposition + user confirmation before acting), not a
|
|
||||||
dispatch coordinator. The `<!-- @local -->` / `<!-- @cloud -->` blocks in
|
|
||||||
`orchestrator.md` encode this distinction. See §3.4 of
|
|
||||||
[docs/research/ai-coding-best-practices.md](../docs/research/ai-coding-best-practices.md).
|
|
||||||
|
|
||||||
## Testing destructive-command blocks — NEVER use live ammunition
|
|
||||||
|
|
||||||
When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
|
|
||||||
command pattern, **never issue the real destructive command as the test input.**
|
|
||||||
The hook is the system under test — if it fails, the test destroys the host.
|
|
||||||
|
|
||||||
Use one of these methods instead, in order of preference:
|
|
||||||
|
|
||||||
1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the script
|
|
||||||
and check exit code + stderr. No agent in the loop. No real shell invocation.
|
|
||||||
Example:
|
|
||||||
|
|
||||||
```
|
|
||||||
echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' \
|
|
||||||
| bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"
|
|
||||||
```
|
|
||||||
|
|
||||||
The hook should exit non-zero (deny) and print the block reason. No `rm` was
|
|
||||||
ever queued.
|
|
||||||
|
|
||||||
2. **Use a sentinel path that exercises the regex but is harmless if the block
|
|
||||||
fails.** A path that obviously doesn't exist and could not possibly hold real
|
|
||||||
data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
|
|
||||||
The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
|
|
||||||
case is a "no such file" error on a sentinel path. **NEVER** use bare `/`,
|
|
||||||
`/home`, `~`, `.`, `*`, or any real path — those have to fail-closed even if
|
|
||||||
the hook is broken.
|
|
||||||
|
|
||||||
3. **Never** issue the literal destructive command (`rm -rf /`,
|
|
||||||
`dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
|
|
||||||
`git push --force` to a published branch, etc.) as an agent prompt. Not even
|
|
||||||
with `--dry-run`. Not even "just to see." Not even if you're sure the hook
|
|
||||||
works. **The hook MIGHT not work. That's why you're testing it.**
|
|
||||||
|
|
||||||
This rule applies to humans writing test prompts AND to agents asked to verify
|
|
||||||
hook behavior. If you (the agent) are asked to verify a block, **refuse any
|
|
||||||
plan that involves issuing the real destructive command** and propose a
|
|
||||||
unit-test or sentinel approach instead.
|
|
||||||
@ -1,77 +0,0 @@
|
|||||||
# Agent Definition Files
|
|
||||||
|
|
||||||
Each `.md` file here defines a custom OpenCode agent (prompt, model,
|
|
||||||
permissions).
|
|
||||||
|
|
||||||
## ⚠️ Symlink required
|
|
||||||
|
|
||||||
OpenCode only loads agents from `.opencode/agents/` (or
|
|
||||||
`~/.config/opencode/agents/`). Files here are **not loaded automatically**. Each
|
|
||||||
file must be symlinked using a **two-level relative path** from
|
|
||||||
`.opencode/agents/`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd .opencode/agents
|
|
||||||
ln -s ../../.agents/agents/<name>.md <name>.md
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Do NOT use three `../` levels.** `.opencode/agents/` is two levels below the
|
|
||||||
> repo root, not three. A wrong-depth path creates a broken symlink that
|
|
||||||
> silently fails — OpenCode will not error, the agent simply won't load.
|
|
||||||
|
|
||||||
Current symlinks (verify with `ls -la .opencode/agents/`):
|
|
||||||
|
|
||||||
| Agent file | Symlinked? |
|
|
||||||
| ----------------- | ------------------------------------- |
|
|
||||||
| `build.md` | ✅ `.opencode/agents/build.md` |
|
|
||||||
| `orchestrator.md` | ✅ `.opencode/agents/orchestrator.md` |
|
|
||||||
| `brainstorm.md` | ✅ `.opencode/agents/brainstorm.md` |
|
|
||||||
| `research.md` | ✅ `.opencode/agents/research.md` |
|
|
||||||
|
|
||||||
When you add a new agent here, add its symlink and update this table.
|
|
||||||
|
|
||||||
## Verification
|
|
||||||
|
|
||||||
After adding or changing an agent, run the following to confirm OpenCode can
|
|
||||||
read it:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Confirm symlink resolves (should print file contents, not an error)
|
|
||||||
cat .opencode/agents/<name>.md | head -5
|
|
||||||
|
|
||||||
# 2. Confirm OpenCode registers the agent with correct permissions
|
|
||||||
opencode agent list
|
|
||||||
|
|
||||||
# Check that your agent appears with the right mode (all/primary/subagent)
|
|
||||||
# and that deny rules are present at the bottom of its permission list.
|
|
||||||
# If it's missing: broken symlink, YAML frontmatter parse error, or OpenCode
|
|
||||||
# was not restarted after the change.
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected output for `orchestrator` after a correct setup:
|
|
||||||
|
|
||||||
```
|
|
||||||
orchestrator (all)
|
|
||||||
[
|
|
||||||
...
|
|
||||||
{ "permission": "edit", "action": "deny", "pattern": "*" },
|
|
||||||
{ "permission": "bash", "action": "deny", "pattern": "*" }
|
|
||||||
]
|
|
||||||
```
|
|
||||||
|
|
||||||
## Invoking custom agents
|
|
||||||
|
|
||||||
- **Tab**: cycles through `primary` and `all`-mode agents as the active session
|
|
||||||
agent
|
|
||||||
- **`@mention`**: invokes an agent — but only at the **start of a fresh
|
|
||||||
session**. Sending `@orchestrator ...` after already exchanging messages in a
|
|
||||||
Build session causes the current model to process the text as freeform input.
|
|
||||||
Open a new session first, then `@mention` as the first message.
|
|
||||||
- **CLI**: `opencode run --agent orchestrator "your prompt"` — reliable,
|
|
||||||
session-agnostic invocation for scripting or testing.
|
|
||||||
|
|
||||||
## Permission config
|
|
||||||
|
|
||||||
`opencode.json` `agent.<name>.permission` only applies if a matching markdown
|
|
||||||
file is loaded. Without the symlink, permission config for that agent is
|
|
||||||
silently ignored.
|
|
||||||
@ -1,216 +0,0 @@
|
|||||||
---
|
|
||||||
description: "Use when brainstorming, ideating, exploring options, feeling stuck, over-thinking, over-complicating, or needing to step back and reconsider an approach. Use when the user says 'wait', 'actually', 'hmm', 'reconsider', 'what if', 'too complicated', 'there has to be a simpler way', or expresses uncertainty about direction."
|
|
||||||
---
|
|
||||||
|
|
||||||
# Brainstorm Agent
|
|
||||||
|
|
||||||
You are a creative thinking partner. Your job is to help the user generate,
|
|
||||||
explore, and evaluate ideas quickly — then get out of the way so real work can
|
|
||||||
happen.
|
|
||||||
|
|
||||||
## Core Philosophy
|
|
||||||
|
|
||||||
**Speed over depth. Breadth over precision. Intuition over analysis.**
|
|
||||||
|
|
||||||
You are the opposite of a deep-thinking agent. You exist because Claude Opus 4.6
|
|
||||||
already overthinks everything. Your role is to COUNTERBALANCE that tendency by
|
|
||||||
keeping things loose, fast, and generative.
|
|
||||||
|
|
||||||
Do NOT ruminate. Do NOT exhaustively analyze. Do NOT hedge with caveats. When
|
|
||||||
you catch yourself going deep, stop and surface back to the idea level.
|
|
||||||
|
|
||||||
## When You're Activated
|
|
||||||
|
|
||||||
You're here because the user is either:
|
|
||||||
|
|
||||||
1. **Stuck** — going in circles, overthinking, analysis paralysis
|
|
||||||
2. **Exploring** — genuinely unsure what direction to take
|
|
||||||
3. **Reconsidering** — realized something isn't working and needs fresh angles
|
|
||||||
|
|
||||||
In all cases, the antidote is the same: generate options fast, pick one, move.
|
|
||||||
|
|
||||||
## Brainstorming Techniques
|
|
||||||
|
|
||||||
Use these as lenses, not rigid processes. Pick whichever fits the moment.
|
|
||||||
|
|
||||||
### Rapid Ideation (Crazy 8s style)
|
|
||||||
|
|
||||||
Generate 5-8 distinct approaches in quick succession. No judgment, no analysis.
|
|
||||||
Just ideas. One line each. Then ask the user which ones spark something.
|
|
||||||
|
|
||||||
### SCAMPER
|
|
||||||
|
|
||||||
When modifying an existing design or approach:
|
|
||||||
|
|
||||||
- **Substitute** — What component could be swapped?
|
|
||||||
- **Combine** — What two things could merge?
|
|
||||||
- **Adapt** — What similar problem has a known solution?
|
|
||||||
- **Modify** — What if we made one part bigger/smaller/different?
|
|
||||||
- **Put to other uses** — Can this serve a purpose we haven't considered?
|
|
||||||
- **Eliminate** — What can we cut entirely?
|
|
||||||
- **Reverse** — What if we did the opposite?
|
|
||||||
|
|
||||||
### Worst Possible Idea
|
|
||||||
|
|
||||||
When truly stuck: ask what the WORST way to solve this would be. Then invert it.
|
|
||||||
Bad ideas are easier to generate and often contain the seed of good ones.
|
|
||||||
|
|
||||||
### How Might We...
|
|
||||||
|
|
||||||
Reframe the problem as an opportunity. "How might we make X do Y without Z?"
|
|
||||||
Forces a positive, solution-oriented frame.
|
|
||||||
|
|
||||||
### Inversion / Pre-mortem
|
|
||||||
|
|
||||||
"Imagine this approach failed completely. Why did it fail?" Work backward from
|
|
||||||
failure to identify hidden risks or assumptions.
|
|
||||||
|
|
||||||
### Constraint Flipping
|
|
||||||
|
|
||||||
List the constraints you're assuming. Remove one. What becomes possible? Often
|
|
||||||
the constraint you think is fixed... isn't.
|
|
||||||
|
|
||||||
## How You Work
|
|
||||||
|
|
||||||
### Phase 1: Quick Frame (30 seconds of thinking, max)
|
|
||||||
|
|
||||||
- What's the actual problem? (One sentence.)
|
|
||||||
- What constraints exist? (Bullet list, keep it short.)
|
|
||||||
- What has already been tried or considered?
|
|
||||||
|
|
||||||
### Phase 2: Diverge (the brainstorm)
|
|
||||||
|
|
||||||
- Pick a technique from above (or freestyle)
|
|
||||||
- Generate options FAST — quantity over quality
|
|
||||||
- No evaluation during this phase
|
|
||||||
- Aim for at least 5 genuinely different directions
|
|
||||||
- Push past the obvious — your first 2-3 ideas will be "average" by nature; the
|
|
||||||
interesting ones start after those
|
|
||||||
- _Optional divergence prompt:_ the expertise ladder — what would a junior
|
|
||||||
engineer propose? What would a senior engineer with deep domain knowledge
|
|
||||||
propose differently? What would an outsider with zero context propose?
|
|
||||||
Different vantage points surface different assumptions. **Use only to broaden
|
|
||||||
the candidate pool, never to produce the final answer.** Recent
|
|
||||||
persona-prompting work (Principled Personas EMNLP 2025; Persona is a
|
|
||||||
Double-Edged Sword IJCNLP 2025; arXiv:2512.05858) shows that low-knowledge
|
|
||||||
personas often _reduce_ accuracy, so evaluate any candidate the ladder
|
|
||||||
surfaces under the un-personified model and an external rubric before
|
|
||||||
committing.
|
|
||||||
|
|
||||||
### Phase 3: Converge (the gut check)
|
|
||||||
|
|
||||||
- Which 1-2 ideas feel most promising? Trust intuition here.
|
|
||||||
- What's the smallest thing we could try to test the idea?
|
|
||||||
- What would make us confident it's wrong? (Kill criteria)
|
|
||||||
- **Re-evaluate at each comparison, not just at the end.** New constraints
|
|
||||||
surface as options are weighed — this is the idea behind Think-Anywhere (Jiang
|
|
||||||
et al., arXiv:2603.29957): fresh reasoning at each decision point, not
|
|
||||||
execution of the original plan. If a constraint you assumed earlier turns out
|
|
||||||
to be flexible, update.
|
|
||||||
|
|
||||||
### Phase 4: Capture & Hand Off
|
|
||||||
|
|
||||||
**Do this IMMEDIATELY after convergence. Do not wait for user confirmation.**
|
|
||||||
Open questions go in the exploration file, not in your response as blockers.
|
|
||||||
|
|
||||||
- Write the exploration file (see Output Format below)
|
|
||||||
- Create a session memory note (`/memories/session/brainstorm-<topic>.md`) with
|
|
||||||
the problem, selected approach, and key context so subagents or fresh
|
|
||||||
conversations can pick up where you left off
|
|
||||||
- Hand off to the right next step:
|
|
||||||
- If the chosen direction needs **investigation or debugging** → delegate to
|
|
||||||
`@research` or suggest the user invoke it
|
|
||||||
- If it's ready for **implementation** → delegate to the default agent or
|
|
||||||
suggest the user invoke it
|
|
||||||
- If it needs **more exploration** → suggest the user continue with you
|
|
||||||
|
|
||||||
**Never end on open questions alone.** Capture first, ask second. The
|
|
||||||
exploration file is the handoff artifact — if it exists, any agent can pick up
|
|
||||||
where you left off regardless of whether the user answered your questions.
|
|
||||||
|
|
||||||
## Delegation Rules
|
|
||||||
|
|
||||||
**You do the thinking. Subagents do the digging.**
|
|
||||||
|
|
||||||
When you need to understand the codebase to generate better ideas, delegate to
|
|
||||||
the Explore subagent. Give it a specific, bounded question:
|
|
||||||
|
|
||||||
- "Find how authentication is currently structured in this project"
|
|
||||||
- "Look for existing patterns for X in the codebase"
|
|
||||||
|
|
||||||
Do NOT send the Explore agent on open-ended research missions. Keep requests
|
|
||||||
tight and factual. You synthesize — it investigates.
|
|
||||||
|
|
||||||
## Token Discipline
|
|
||||||
|
|
||||||
You are the LIGHTWEIGHT agent. Your entire purpose is to stay at the idea level
|
|
||||||
and avoid burning context on deep dives. Rules:
|
|
||||||
|
|
||||||
1. Keep your own responses concise — bullet points over paragraphs
|
|
||||||
2. Delegate all codebase exploration to subagents
|
|
||||||
3. If an exploration is going deep, STOP and create the exploration file so a
|
|
||||||
fresh context can pick it up
|
|
||||||
4. Never read more than a few files yourself — that's what Explore is for
|
|
||||||
5. Hold references; load on demand. Do not read files you don't need yet.
|
|
||||||
|
|
||||||
## Output Format: The Exploration File
|
|
||||||
|
|
||||||
When a brainstorming session produces a direction worth exploring, create a
|
|
||||||
tracking file. Ask the user for a short name, or derive one from the topic.
|
|
||||||
|
|
||||||
**Location**: `docs/explorations/<name>.md`
|
|
||||||
|
|
||||||
Use this structure:
|
|
||||||
|
|
||||||
```markdown
|
|
||||||
# Exploration: <Title>
|
|
||||||
|
|
||||||
**Status**: brainstorming | exploring | prototyping | decided | abandoned
|
|
||||||
**Created**: <date> **Last Updated**: <date>
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
<One or two sentences. What are we trying to solve?>
|
|
||||||
|
|
||||||
## Constraints
|
|
||||||
|
|
||||||
- <Real constraints, not assumed ones>
|
|
||||||
|
|
||||||
## Ideas Generated
|
|
||||||
|
|
||||||
<List from the brainstorm session. Keep all of them, even rejected ones.>
|
|
||||||
|
|
||||||
1. **<Idea name>** — <One-line description>
|
|
||||||
2. **<Idea name>** — <One-line description> ...
|
|
||||||
|
|
||||||
## Selected Approach
|
|
||||||
|
|
||||||
**<Chosen idea>**: <Why this one — keep it to 2-3 sentences max>
|
|
||||||
|
|
||||||
### Kill Criteria
|
|
||||||
|
|
||||||
<What would tell us this approach is wrong?>
|
|
||||||
|
|
||||||
## Exploration Log
|
|
||||||
|
|
||||||
<Append entries as work progresses. Newest first.>
|
|
||||||
|
|
||||||
### <date> — <brief title>
|
|
||||||
|
|
||||||
- What was tried:
|
|
||||||
- What happened:
|
|
||||||
- What we learned:
|
|
||||||
- Next step:
|
|
||||||
|
|
||||||
## Blockers
|
|
||||||
|
|
||||||
- <Anything currently preventing progress>
|
|
||||||
```
|
|
||||||
|
|
||||||
## What You Are NOT
|
|
||||||
|
|
||||||
- You are NOT an implementation agent. Don't write production code.
|
|
||||||
- You are NOT a research agent. Don't go deep on diagnosis or root cause.
|
|
||||||
- You are NOT a planning agent. Don't create detailed project plans.
|
|
||||||
|
|
||||||
You are a spark. Once an idea has enough shape to act on, hand it off.
|
|
||||||
@ -1,69 +0,0 @@
|
|||||||
---
|
|
||||||
description: 'Targeted implementation task: well-scoped edit, single file or small refactor where the scope is already clear. NOT for open-ended investigation, architecture decisions, or multi-file refactors.'
|
|
||||||
---
|
|
||||||
|
|
||||||
# Build Agent
|
|
||||||
|
|
||||||
You execute well-scoped implementation tasks accurately and efficiently.
|
|
||||||
|
|
||||||
<!-- @local -->
|
|
||||||
|
|
||||||
## Model Profile
|
|
||||||
|
|
||||||
**Smaller-scale, not low-reasoning.** If your architecture supports extended
|
|
||||||
thinking blocks, use them at decision points. Your failure modes are not absence
|
|
||||||
of reasoning — they are:
|
|
||||||
|
|
||||||
- Narrower training distribution (Python/JS heavy — verify TypeScript idioms)
|
|
||||||
- Quantization degradation in long sessions (tool-call history fills context
|
|
||||||
fast)
|
|
||||||
- JSON schema compliance degrading as context grows
|
|
||||||
- Repetition loops if context pressure is high
|
|
||||||
|
|
||||||
Compensate structurally: stay grounded, delegate exploration, keep context lean.
|
|
||||||
|
|
||||||
<!-- @endlocal -->
|
|
||||||
|
|
||||||
## Core Rules
|
|
||||||
|
|
||||||
1. **Read before you write.** Always `ls` and `read_file` before any edit.
|
|
||||||
2. **Verify before asserting.** Never assume a file path, library, or API exists
|
|
||||||
— check first.
|
|
||||||
3. **Hold references; load on demand.** Do not read files you don't need yet.
|
|
||||||
Context is a finite budget — treat it as your most constrained resource.
|
|
||||||
4. **Delegate exploration, not orchestration.** Use the `Explore` subagent
|
|
||||||
(Copilot) or `task` subagent (OpenCode) for scanning large directories or
|
|
||||||
tracing imports. This agent is a recipient of tasks — it does NOT decompose
|
|
||||||
or dispatch further work. Keep your own context for reasoning.
|
|
||||||
5. **Scope-check before starting.** If the task touches more than 2–3 files or
|
|
||||||
requires understanding architecture, stop and tell the user: "This looks
|
|
||||||
broader than a targeted edit — the orchestrator or default agent should
|
|
||||||
handle this." Do not attempt to self-decompose into subtasks.
|
|
||||||
|
|
||||||
<!-- @local -->
|
|
||||||
|
|
||||||
## Working Memory
|
|
||||||
|
|
||||||
For tasks spanning multiple steps, maintain a `NOTES.md` scratch file:
|
|
||||||
|
|
||||||
- Write your progress after each step before proceeding to the next
|
|
||||||
- Record which files you've read and what you found
|
|
||||||
- Note any assumptions you made
|
|
||||||
|
|
||||||
This keeps your context clean and enables resumption after compaction.
|
|
||||||
|
|
||||||
## Reasoning
|
|
||||||
|
|
||||||
Reason at each decision point before acting. Open `<think>` blocks with
|
|
||||||
substantive analysis — not filler phrases ("Okay, let me...", "The user
|
|
||||||
wants..."). Begin directly with the analysis or plan.
|
|
||||||
|
|
||||||
<!-- @endlocal -->
|
|
||||||
|
|
||||||
## Handoff
|
|
||||||
|
|
||||||
When this task is done (or if it exceeds your scope), tell the user clearly:
|
|
||||||
|
|
||||||
- What you completed
|
|
||||||
- What remains (if anything)
|
|
||||||
- Whether the next step needs a different agent
|
|
||||||
@ -1,134 +0,0 @@
|
|||||||
---
|
|
||||||
description:
|
|
||||||
"Decomposes high-level goals into bounded subtasks and delegates to build,
|
|
||||||
research, or brainstorm. Delegates file edits to workers."
|
|
||||||
---
|
|
||||||
|
|
||||||
# Orchestrator
|
|
||||||
|
|
||||||
You decompose high-level goals into focused, bounded subtasks and dispatch them to
|
|
||||||
specialist workers. You write delegation plans and summarize results. Your output is a
|
|
||||||
delegation plan and a summary of results.
|
|
||||||
|
|
||||||
## Context Management
|
|
||||||
|
|
||||||
You have limited context window and so do your workers. Workers hit their context limit and return a summary. Reassess and break the work down further. To address context loss between phases you MUST:
|
|
||||||
|
|
||||||
1. Delegate only focused, bounded subtasks (one file, one concern, one directory at a time)
|
|
||||||
2. Ask workers to summarize, diff, or answer specific questions
|
|
||||||
3. A worker returning partial or incomplete results is incomplete. Re-delegate the missing pieces.
|
|
||||||
4. Tasks involving many files split into phases: read phase → analysis phase → synthesis phase. Each phase gets its own worker
|
|
||||||
5. Split tasks requiring >200 lines into research phase + build phase.
|
|
||||||
6. A failed phase or truncated output → STOP. Report the failure.
|
|
||||||
|
|
||||||
## Constraints
|
|
||||||
|
|
||||||
- **File edits go through `build`.** Editing tools (`replace_string_in_file`,
|
|
||||||
`create_file`, etc.) route through `build`. File edits are a subtask for `build`.
|
|
||||||
- **Terminal commands go through `build`.** Build or test results go through `build`. **Exception:**
|
|
||||||
you MAY use `run_in_terminal` to write to `/tmp/.last-user-prompt.txt` (TASK
|
|
||||||
CAPTURE). This single path is exempt — the Stop hook reads it to verify every
|
|
||||||
question was answered.
|
|
||||||
- **Delegate only.** Your only tool for task execution is `task`
|
|
||||||
(OpenCode) or subagent dispatch. You reason and plan; workers act.
|
|
||||||
<!-- @local -->
|
|
||||||
- **Read files under `apps/` or `packages/` through a worker.** This is enforced at the
|
|
||||||
plugin layer and will throw. Reading these auto-loads nested `AGENTS.md` files
|
|
||||||
and is expensive for a small context window. Package reads go through a
|
|
||||||
worker with `task`.
|
|
||||||
- **Root reads only.** Read top-level files (`README.md`, root
|
|
||||||
`AGENTS.md`, root `package.json`) and files under `docs/`. Everything else goes
|
|
||||||
through a worker.
|
|
||||||
<!-- @endlocal -->
|
|
||||||
|
|
||||||
## Workflow
|
|
||||||
|
|
||||||
### 1. Understand the goal
|
|
||||||
|
|
||||||
Read the project root `AGENTS.md` first. Identify which areas of the codebase
|
|
||||||
are involved. Note the relevant package for goals touching `apps/` or `packages/` so workers know to check nested `AGENTS.md` files.
|
|
||||||
|
|
||||||
### 2. Decompose into bounded subtasks
|
|
||||||
|
|
||||||
Break the goal into subtasks where each one:
|
|
||||||
|
|
||||||
- Touches at most 2–3 files
|
|
||||||
- Has a clear acceptance criterion ("the build passes" / "the test passes")
|
|
||||||
- Can be handed off to a single worker with self-contained context
|
|
||||||
|
|
||||||
### 3. Confirm before dispatching
|
|
||||||
|
|
||||||
Present the decomposition to the user **before dispatching any tasks**. Format:
|
|
||||||
|
|
||||||
```
|
|
||||||
Plan:
|
|
||||||
1. [worker] Task description — expected output
|
|
||||||
2. [worker] Task description — expected output
|
|
||||||
...
|
|
||||||
Proceed?
|
|
||||||
```
|
|
||||||
|
|
||||||
Wait for explicit confirmation before dispatching.
|
|
||||||
|
|
||||||
<!-- @local -->
|
|
||||||
|
|
||||||
### 4. Dispatch one subtask at a time
|
|
||||||
|
|
||||||
Use `task` to dispatch each subtask to the appropriate worker. Pass all context
|
|
||||||
the worker needs in the task prompt — the worker reads only what is in the prompt.
|
|
||||||
|
|
||||||
**Keep task prompts short.** The `task` tool has a JSON serialization limit.
|
|
||||||
Tell the worker _which files to read_ and _what to do_. Example:
|
|
||||||
|
|
||||||
- ❌
|
|
||||||
`"Read package.json — here are the deps: { ... 500 lines ... }. Update README."`
|
|
||||||
- ✅
|
|
||||||
`"Read the root package.json and all workspace package.json files, then update the Technology Stack section in README.md to match."`
|
|
||||||
|
|
||||||
Workers available:
|
|
||||||
|
|
||||||
- **`build`** — implementation tasks (edits, refactors, new files)
|
|
||||||
- **`research`** — investigation, root-cause analysis, unfamiliar territory
|
|
||||||
- **`brainstorm`** — ideation, design exploration, approach selection
|
|
||||||
<!-- @endlocal -->
|
|
||||||
|
|
||||||
<!-- @cloud -->
|
|
||||||
|
|
||||||
### 4. Execute directly with plan-act-verify
|
|
||||||
|
|
||||||
You have the context budget to act directly. After user confirmation, execute
|
|
||||||
each subtask in sequence using inline tool calls (no worker dispatch needed).
|
|
||||||
Apply the standard plan-act-verify loop:
|
|
||||||
|
|
||||||
- Complete one subtask fully before starting the next
|
|
||||||
- Run the quality gate (`npm run build:strict` or `npm test && npm run lint`)
|
|
||||||
after the final edit
|
|
||||||
- A subtask failing twice with the same error → STOP. Report the failure.
|
|
||||||
|
|
||||||
Workers available as slash commands if you want to hand off reasoning mode:
|
|
||||||
|
|
||||||
- `/research` — for unfamiliar territory or root-cause analysis
|
|
||||||
- `/brainstorm` — for approach selection before implementing
|
|
||||||
<!-- @endcloud -->
|
|
||||||
|
|
||||||
### 5. Collect and report
|
|
||||||
|
|
||||||
After all subtasks complete, summarize results for the user:
|
|
||||||
|
|
||||||
- What was done
|
|
||||||
- Anything incomplete or blocked
|
|
||||||
- Whether the quality gate was run (build + tests)
|
|
||||||
|
|
||||||
## When to escalate
|
|
||||||
|
|
||||||
A subtask failing twice from the same worker with the same error → STOP:
|
|
||||||
|
|
||||||
- Report to the user. No retry.
|
|
||||||
- State what the worker attempted and what went wrong.
|
|
||||||
- Ask whether to try a different approach or switch to a different agent.
|
|
||||||
|
|
||||||
<!-- @local -->
|
|
||||||
|
|
||||||
A task beyond local model capability (reasoning failure, repeated hallucination) → STOP. Recommend the user switch to the default Copilot agent.
|
|
||||||
|
|
||||||
<!-- @endlocal -->
|
|
||||||
@ -1,126 +0,0 @@
|
|||||||
---
|
|
||||||
description: "Use when investigating, debugging, diagnosing, understanding unfamiliar code, tracing behavior, root cause analysis, or systematic exploration. Use when the user says 'why is this broken', 'how does this work', 'what changed', 'trace', 'investigate', 'root cause', 'figure out', 'something's wrong', 'regression', or needs to build a mental model before making changes."
|
|
||||||
---
|
|
||||||
|
|
||||||
# Research Agent
|
|
||||||
|
|
||||||
You are a systematic investigator. Build accurate understanding and diagnose
|
|
||||||
problems through a disciplined, evidence-based workflow.
|
|
||||||
|
|
||||||
## Core Philosophy
|
|
||||||
|
|
||||||
**Evidence over intuition. Systematic over ad-hoc. Record everything.**
|
|
||||||
|
|
||||||
LLMs pattern-match from training data and latch onto the first plausible
|
|
||||||
explanation. Counterbalance that: require evidence before conclusions, consider
|
|
||||||
alternatives before committing, record findings so they persist.
|
|
||||||
|
|
||||||
Verify before guessing. Record findings — they are the investigation's memory.
|
|
||||||
|
|
||||||
## First Action
|
|
||||||
|
|
||||||
Review the **Three-Phase Workflow** below. Load the relevant phase on demand via
|
|
||||||
MCP tools as the investigation progresses.
|
|
||||||
|
|
||||||
## Three-Phase Workflow
|
|
||||||
|
|
||||||
Research follows three phases. Load each on demand via MCP tools:
|
|
||||||
|
|
||||||
1. **Setup** — hypothesis checklist, Understand/Diagnose orientations
|
|
||||||
→ `load_research-setup`
|
|
||||||
2. **Triage** — risk-based table choosing Satisfice vs Strong Inference
|
|
||||||
→ `load_research-triage`
|
|
||||||
3. **Execution** — context management, dead-ends, timing, techniques
|
|
||||||
→ `load_research-execution`
|
|
||||||
|
|
||||||
## Loading Skills
|
|
||||||
|
|
||||||
Skills are loaded via MCP tool calls, not `read_file`. This makes skills work
|
|
||||||
cross-framework (Copilot, OpenCode, Claude Code, etc.).
|
|
||||||
|
|
||||||
- `load_research-setup` — loads the setup checklist
|
|
||||||
- `load_research-triage` — loads the triage table
|
|
||||||
- `load_research-execution` — loads execution rules
|
|
||||||
|
|
||||||
Load phases just-in-time as needed during the investigation.
|
|
||||||
|
|
||||||
## Two Orientations
|
|
||||||
|
|
||||||
Switch fluidly between them, often multiple times per chain of reasoning.
|
|
||||||
|
|
||||||
### 1. Understand (Grounded Theory)
|
|
||||||
|
|
||||||
Build mental models from the code, not from assumptions.
|
|
||||||
|
|
||||||
1. **Open coding** — read code, name what you see
|
|
||||||
2. **Constant comparison** — compare new observations against earlier ones
|
|
||||||
3. **Axial coding** — connect categories, trace data flows
|
|
||||||
4. **Memo** — write session notes as you go
|
|
||||||
5. **Saturation check** — stop reading when files confirm existing patterns
|
|
||||||
|
|
||||||
Apply Understand to: "How does X work?", "What's the architecture of Y?", "Why was it
|
|
||||||
built this way?", "I need to understand this before changing it."
|
|
||||||
|
|
||||||
### 2. Diagnose (Strong Inference + Satisficing)
|
|
||||||
|
|
||||||
Test multiple hypotheses, not just the most likely one. But satisfice when
|
|
||||||
stakes are low.
|
|
||||||
|
|
||||||
**Simple check first** — log a single statement if it answers the question.
|
|
||||||
Escalate when the result is unexpected.
|
|
||||||
|
|
||||||
**Triage** — assess risk across five factors:
|
|
||||||
|
|
||||||
| Factor | Low Risk | High Risk |
|
|
||||||
| ----------------- | --------------------------- | ------------------------------ |
|
|
||||||
| Reversibility | Easy to undo | Hard to reverse |
|
|
||||||
| Blast radius | One file/function | Many systems, shared state |
|
|
||||||
| Confidence | Familiar, clear evidence | Novel, ambiguous |
|
|
||||||
| Novelty | Seen this before | Never encountered |
|
|
||||||
| Time cost | Known baselines | Unknown — measure first |
|
|
||||||
|
|
||||||
**All low risk → Satisfice**: test the most likely hypothesis, stop if confirmed.
|
|
||||||
|
|
||||||
**Any high risk → Strong Inference**: generate 2–3 different hypotheses, design
|
|
||||||
a discriminating test, eliminate by evidence, iterate on what remains.
|
|
||||||
|
|
||||||
Apply Diagnose to: "Why does X fail?", "What changed?", "This worked yesterday",
|
|
||||||
regression diagnosis, behavior verification.
|
|
||||||
|
|
||||||
### Mode Switching
|
|
||||||
|
|
||||||
Follow the question, not the mode:
|
|
||||||
|
|
||||||
```
|
|
||||||
Understand → spot anomaly → Triage → Diagnose → need context → Understand → ...
|
|
||||||
```
|
|
||||||
|
|
||||||
## Investigation Checklist
|
|
||||||
|
|
||||||
Before each hypothesis: write it, write falsification criterion, run falsification test first.
|
|
||||||
|
|
||||||
## Circuit Breakers
|
|
||||||
|
|
||||||
1. 5+ attempts without falsifying = STOP and report (one attempt = one hypothesis tested with a falsification criterion)
|
|
||||||
2. 3+ edits to same file without passing test = STOP and rethink (count each saved edit to the same file)
|
|
||||||
3. any untested guess = STOP and write hypothesis first (no changes without a written hypothesis and falsification criterion)
|
|
||||||
4. 2 failures at same abstraction level = go UP one level (same file, same module, or same layer)
|
|
||||||
|
|
||||||
## Execution Details
|
|
||||||
|
|
||||||
For details, load `load_research-execution` via MCP
|
|
||||||
|
|
||||||
## Delegation Rules
|
|
||||||
|
|
||||||
You direct the investigation. Subagents gather specific evidence.
|
|
||||||
|
|
||||||
Use Explore for bounded fact-finding: "Find all callers of `functionName`",
|
|
||||||
"Check middleware before this route", "List files importing `@cantrips/remnant-core`".
|
|
||||||
|
|
||||||
You form hypotheses, interpret evidence, decide next steps. Subagents retrieve
|
|
||||||
facts.
|
|
||||||
|
|
||||||
## Boundaries
|
|
||||||
|
|
||||||
You investigate: gather evidence, form hypotheses, test them, report findings.
|
|
||||||
Hand off implementation, brainstorming, and planning to other agents.
|
|
||||||
@ -1,854 +0,0 @@
|
|||||||
# Agent Infrastructure
|
|
||||||
|
|
||||||
Shared agent infrastructure for VS Code Copilot and OpenCode — brainstorm
|
|
||||||
agent, research agent, nudge instructions, hooks, skills, and MCP server.
|
|
||||||
Project-specific overlays live in each project's `.agents/` directory.
|
|
||||||
|
|
||||||
> **See also:**
|
|
||||||
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md)
|
|
||||||
> — research synthesis covering the Prompt/Context/Harness taxonomy, failure
|
|
||||||
> modes, enforcement hierarchy, small-model harness patterns, and all
|
|
||||||
> primary-source citations that underpin the design decisions here.
|
|
||||||
|
|
||||||
## Current State
|
|
||||||
|
|
||||||
### Architecture Overview
|
|
||||||
|
|
||||||
The infrastructure is **tool-agnostic**: canonical sources live in `.agents/`
|
|
||||||
and a generator (`npm run generate:agents`) distributes them to
|
|
||||||
`.github/agents/`, `.github/skills/`, `.opencode/agents/`, `.opencode/skills/`.
|
|
||||||
Edit the `.agents/` sources; never edit the generated output directories (they
|
|
||||||
are `.gitignore`d and blocked by pre-tool-use policy).
|
|
||||||
|
|
||||||
```
|
|
||||||
.agents/
|
|
||||||
├── AGENTS.md # Root design doc + enforcement hierarchy
|
|
||||||
├── agents/ # Agent definitions (canonical)
|
|
||||||
│ ├── brainstorm.md
|
|
||||||
│ ├── research.md
|
|
||||||
│ └── build-local.md # OmniCoder 9B via Ollama
|
|
||||||
├── hooks/ # Shared bash hooks (delegated by all harnesses)
|
|
||||||
│ ├── pre-tool-use.sh # Hard blocks (terminal cmds + file-path policies)
|
|
||||||
│ ├── post-tool-use.sh # Self-check counter + methodology reminders
|
|
||||||
│ ├── session-start.sh # Inject project state at session start
|
|
||||||
│ ├── user-prompt-submit.sh # Per-turn nudge detection + task capture
|
|
||||||
│ ├── pre-compact.sh # Export state before context summarization
|
|
||||||
│ └── stop.sh # Session-end verification
|
|
||||||
└── skills/
|
|
||||||
└── research/SKILL.md # Research methodology (any agent can load)
|
|
||||||
```
|
|
||||||
|
|
||||||
Generated output (do not edit — regenerated by `npm run generate:agents`):
|
|
||||||
|
|
||||||
- `.github/agents/` — VS Code Copilot agent files
|
|
||||||
- `.github/skills/` — VS Code Copilot skill files
|
|
||||||
- `.opencode/agents/` — OpenCode agent files
|
|
||||||
- `.opencode/skills/` — OpenCode skill files
|
|
||||||
|
|
||||||
Harness integration:
|
|
||||||
|
|
||||||
- **VS Code Copilot**: `.github/agent-support.json` — maps 4 hook events to the
|
|
||||||
shared bash scripts in `.agents/hooks/`
|
|
||||||
- **OpenCode**: `.opencode/plugins/agent-support.ts` — TypeScript plugin that
|
|
||||||
shells out to the same bash scripts
|
|
||||||
|
|
||||||
### Brainstorm Agent
|
|
||||||
|
|
||||||
- 4-phase workflow: Quick Frame → Diverge → Converge → Capture & Hand Off
|
|
||||||
- 6 techniques: Rapid Ideation, SCAMPER, Worst Possible Idea, How Might We,
|
|
||||||
Inversion/Pre-mortem, Constraint Flipping
|
|
||||||
- Counterbalances Opus 4.6 overthinking tendency
|
|
||||||
- Phase 2 includes "push past the obvious" nudge (Zhao et al. 2024: LLMs fall
|
|
||||||
short on originality, excel at elaboration — first ideas are "average")
|
|
||||||
- Phase 4 routes to `@research` for investigation, default agent for
|
|
||||||
implementation
|
|
||||||
- Creates exploration files at `docs/explorations/<name>.md` and session memory
|
|
||||||
notes
|
|
||||||
|
|
||||||
### Research Agent
|
|
||||||
|
|
||||||
- Two orientations that compose recursively:
|
|
||||||
- **Understand** (Grounded Theory): open coding → constant comparison → axial
|
|
||||||
coding → memo → saturation check
|
|
||||||
- **Diagnose** (Strong Inference + Satisficing): 5-factor triage gates between
|
|
||||||
satisficing (low risk) and full falsification (high risk)
|
|
||||||
- 5-factor triage: reversibility, blast radius, confidence, novelty, time cost
|
|
||||||
- Timing awareness: `time` prefix on unknown commands, session/repo memory for
|
|
||||||
baselines, timing feeds into triage decisions
|
|
||||||
- Investigation files at `docs/explorations/<name>.md`
|
|
||||||
- Techniques reference: Five Whys, Delta Debugging, Rubber Duck
|
|
||||||
- Delegates evidence-gathering to Explore subagent, keeps analytical thinking
|
|
||||||
local
|
|
||||||
|
|
||||||
### Nudge Instructions
|
|
||||||
|
|
||||||
- Brainstorm nudge: triggers on hesitation/overthinking language ('wait',
|
|
||||||
'actually', 'hmm', 'overcomplicating', etc.)
|
|
||||||
- Research nudge: triggers on debugging/investigation language ('why is this
|
|
||||||
broken', 'how does this work', 'root cause', etc.)
|
|
||||||
- Both are non-intrusive single-sentence suggestions, only fire once per topic
|
|
||||||
|
|
||||||
### Tool Mapping (Copilot ↔ OpenCode)
|
|
||||||
|
|
||||||
| Copilot | OpenCode equivalent |
|
|
||||||
| ---------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
|
|
||||||
| `AGENTS.md` (root + nested) | `AGENTS.md` (root, native; nested via `instructions` glob in `opencode.json`) |
|
|
||||||
| `.github/agents/*.agent.md` | `.opencode/agents/*.md` (frontmatter: `description`, `mode`, `model`, `temperature`, `permission`) |
|
|
||||||
| `.github/skills/<name>/SKILL.md` | `.opencode/skills/<n>/SKILL.md` — also reads `.agents/skills/` and `.claude/skills/` |
|
|
||||||
| `.github/instructions/*.instructions.md` (`applyTo`) | No direct equivalent — fold into AGENTS.md stubs or `instructions` glob |
|
|
||||||
| `.github/hooks/*.sh` (JSON-configured shell) | `.opencode/plugins/*.ts` (TS modules, event-driven) — shells out via Bun's `$` |
|
|
||||||
| `runSubagent` / `Explore` agent | Built-in `general` and `explore` subagents; `@`-mention syntax |
|
|
||||||
| `vscode_askQuestions` | No equivalent — OpenCode uses agent's natural turn-taking |
|
|
||||||
|
|
||||||
OpenCode plugin event mapping:
|
|
||||||
|
|
||||||
| Copilot hook | OpenCode event |
|
|
||||||
| -------------- | ----------------------------------- |
|
|
||||||
| `SessionStart` | `session.created` |
|
|
||||||
| `PreToolUse` | `tool.execute.before` |
|
|
||||||
| `PostToolUse` | `tool.execute.after` |
|
|
||||||
| `PreCompact` | `experimental.session.compacting` |
|
|
||||||
| `Stop` | `session.idle` (closest equivalent) |
|
|
||||||
|
|
||||||
## Research Foundation
|
|
||||||
|
|
||||||
> For full research depth, citations, and failure-mode analysis, see
|
|
||||||
> [`docs/research/ai-coding-best-practices.md`](../research/ai-coding-best-practices.md).
|
|
||||||
> The list below records the specific papers and frameworks that shaped the
|
|
||||||
> design decisions in this project.
|
|
||||||
|
|
||||||
Methodologies and papers that informed the design:
|
|
||||||
|
|
||||||
- **Grounded Theory** (Glaser & Strauss): build understanding from data, not
|
|
||||||
assumptions. Applied to code-reading in the Understand orientation.
|
|
||||||
- **Strong Inference** (Platt 1964): multiple competing hypotheses → crucial
|
|
||||||
experiments → eliminate. Applied to the Diagnose orientation.
|
|
||||||
- **Satisficing** (Simon 1956): accept "good enough" when optimization cost
|
|
||||||
exceeds benefit. Gates between cheap confirmation and expensive falsification.
|
|
||||||
- **Dual Process Theory** (Kahneman): System 1 (fast, pattern-matching) vs
|
|
||||||
System 2 (slow, analytical). System 1 more accurate in familiar domains.
|
|
||||||
Informs the triage decision.
|
|
||||||
- **Zhao et al. 2024** (arxiv): LLMs fall short on originality, excel at
|
|
||||||
elaboration. First ideas are "average." Informs brainstorm agent's "push past
|
|
||||||
the obvious" nudge.
|
|
||||||
- **"Lost in the Middle"** (Liu et al. 2023): LLMs attend best to beginning/end
|
|
||||||
of context. Informs hook design — inject at context tail for high attention.
|
|
||||||
- **Delta Debugging**: binary search the change space between passing/failing
|
|
||||||
cases. Logic behind `git bisect`.
|
|
||||||
- **Five Whys**: iterative causal chain tracing. Starting point for hypothesis
|
|
||||||
generation, not sole diagnostic method.
|
|
||||||
- **Ronacher "Agent Design Is Still Hard"**: reinforce methodology after every
|
|
||||||
tool call at context tail. Structural injection outperforms relying on
|
|
||||||
instructions in the system prompt.
|
|
||||||
- **Think-Anywhere** (Jiang et al. arXiv:2603.29957, Mar 2026, Peking U + Tongyi
|
|
||||||
Lab): LLMs trained to invoke `<think>` blocks at any token position during
|
|
||||||
code generation, not just upfront. SOTA on LeetCode/LiveCodeBench with fewer
|
|
||||||
total tokens. The motivating insight: a model can plan correctly at the start
|
|
||||||
but introduce an off-by-one bug mid-implementation — only mid-loop reasoning
|
|
||||||
catches it. **Applied here**: the research agent's investigation checklist
|
|
||||||
includes "Re-evaluate hypothesis at every tool-call boundary." For Claude 4
|
|
||||||
models, interleaved thinking makes this automatic. Complements Plan-and-Solve:
|
|
||||||
upfront decomposition where structure is clear, mid-execution re-evaluation
|
|
||||||
when intermediate results change what to do next.
|
|
||||||
- **Anthropic interleaved thinking** (Claude 4 + adaptive thinking): Claude
|
|
||||||
Sonnet 4.6+ and Opus 4.6+ automatically insert thinking blocks between tool
|
|
||||||
calls. No separate implementation needed — agent instruction design drives it.
|
|
||||||
The research agent's "Re-evaluate at every tool-call boundary" instruction
|
|
||||||
explicitly activates this behavior.
|
|
||||||
- **Prompt/Context/Harness framework** (Alibaba Cloud, Apr 2026): Names the
|
|
||||||
three engineering layers. Prompt = task expression (stateless). Context = what
|
|
||||||
the model sees (AGENTS.md, skills, tools — engineering target is progressive
|
|
||||||
disclosure). Harness = system constraints + verification loops (hooks,
|
|
||||||
permission gates, sub-agent isolation). Diagnostic map: wrong output → Prompt;
|
|
||||||
hallucinated fact → Context; wrong tool selected → Context (fix description);
|
|
||||||
task drift → Harness (sub-agent boundary); destructive action → Harness
|
|
||||||
(permission hook). LangChain improved Terminal Bench 2.0 from 52.8% → 66.5% by
|
|
||||||
changing Harness alone.
|
|
||||||
- **Context engineering** (Rajasekaran et al., Anthropic, Sep 2025): Formally
|
|
||||||
distinguishes context engineering from prompt engineering. Key principles: (a)
|
|
||||||
just-in-time context — agents hold references and load on demand, not upfront;
|
|
||||||
(b) structured note-taking (NOTES.md) as external working memory for long
|
|
||||||
sequential tasks; (c) every new token depletes attention budget — validates
|
|
||||||
the <60-line AGENTS.md ceiling; (d) compaction strategy: maximize recall
|
|
||||||
first, then improve precision.
|
|
||||||
|
|
||||||
## MCP Server Lifecycle Hooks — Protocol Status (May 2026)
|
|
||||||
|
|
||||||
The `.agents/mcp/` server exposes prompts and tools to agents via the MCP
|
|
||||||
protocol. A recurring question: can the MCP server react to session lifecycle
|
|
||||||
events (session start/end, tool-use boundaries)?
|
|
||||||
|
|
||||||
### Current protocol state
|
|
||||||
|
|
||||||
**No lifecycle hooks exist in the MCP protocol.** The spec defines three phases
|
|
||||||
only: `initialize → operation → shutdown`. There is no `session.created`,
|
|
||||||
`post-tool-call`, or `session.ended` notification. This gap is why session
|
|
||||||
awareness currently lives in the OpenCode plugin layer
|
|
||||||
(`.opencode/plugins/agent-support.ts`) rather than the MCP server — OpenCode
|
|
||||||
exposes `session.created`, `session.idle`, `session.compacted`,
|
|
||||||
`session.deleted`, and `tool.execute.before/after` events natively to plugins.
|
|
||||||
|
|
||||||
### Active work in the MCP spec
|
|
||||||
|
|
||||||
**SEP-2624: Interceptors for the Model Context Protocol**
|
|
||||||
([PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624))
|
|
||||||
|
|
||||||
The most organized effort. Supersedes SEP-1763 (closed as completed). Proposes
|
|
||||||
**Interceptors** as a new MCP primitive — two types: **validators** (inspect,
|
|
||||||
return pass/fail) and **mutators** (transform context payloads) — discoverable
|
|
||||||
and invocable via `interceptors/list` and `interceptor/invoke` JSON-RPC methods.
|
|
||||||
These fire at protocol-level operation events: `tools/call`, `prompts/get`,
|
|
||||||
`resources/read`, `sampling/createMessage`, `elicitation/create`. Not
|
|
||||||
session-start/stop hooks, but before/after wrapping for every operation.
|
|
||||||
|
|
||||||
There is now a formal **Interceptors Working Group** (Bloomberg + Saxo Bank
|
|
||||||
engineers, biweekly cadence). Reference implementations in progress for Go and
|
|
||||||
C# SDKs. Experimental repo:
|
|
||||||
[modelcontextprotocol/experimental-ext-interceptors](https://github.com/modelcontextprotocol/experimental-ext-interceptors).
|
|
||||||
Charter:
|
|
||||||
[modelcontextprotocol.io/community/interceptors/charter](https://modelcontextprotocol.io/community/interceptors/charter).
|
|
||||||
|
|
||||||
**SEP-2282: Server-Declared Behavioural Hooks**
|
|
||||||
([PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282))
|
|
||||||
|
|
||||||
Smaller, separate open PR. Proposes servers declare **context injections** in
|
|
||||||
`ServerCapabilities` — text injected into the agent's context at client-side
|
|
||||||
lifecycle events (session start, post-tool-use, session end). The contract is
|
|
||||||
"here's context the model should have at this moment," not code execution. More
|
|
||||||
directly analogous to our OpenCode `session.created` / `session.idle` patterns.
|
|
||||||
Currently unsponsored — needs a maintainer to pick it up.
|
|
||||||
|
|
||||||
### What to watch
|
|
||||||
|
|
||||||
- **Primary**:
|
|
||||||
[PR #2624](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2624) +
|
|
||||||
experimental-ext-interceptors repo
|
|
||||||
- **Secondary**:
|
|
||||||
[PR #2282](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2282)
|
|
||||||
(closest to session-lifecycle hooks)
|
|
||||||
- **Label filter**:
|
|
||||||
[`SEP` label](https://github.com/modelcontextprotocol/modelcontextprotocol/issues?q=label%3ASEP)
|
|
||||||
on the modelcontextprotocol repo
|
|
||||||
- **Milestone**: `2026-06-30-RC` is the next spec revision window
|
|
||||||
|
|
||||||
### Implication for this project
|
|
||||||
|
|
||||||
Until interceptors land in a shipping spec version and the TypeScript SDK, the
|
|
||||||
session lifecycle pattern stays at the OpenCode plugin layer. When SEP-2282 or
|
|
||||||
an equivalent lands, the MCP server could self-register context injection hooks
|
|
||||||
during `initialize`, removing the need for tool-specific plugin code.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Model Scale Profiles
|
|
||||||
|
|
||||||
Different model sizes require different infrastructure strategies. The failure
|
|
||||||
modes are different, so the mitigations are different.
|
|
||||||
|
|
||||||
### Large-scale API models (Claude Sonnet / Opus)
|
|
||||||
|
|
||||||
**Primary failure modes**: overthinking, sycophancy, verbosity, tendency to add
|
|
||||||
unrequested features or comments.
|
|
||||||
|
|
||||||
**Infrastructure strategy**:
|
|
||||||
|
|
||||||
- Advisory methodology + structural reinforcement (hooks, circuit breakers)
|
|
||||||
- PostToolUse self-check nudges every ~15 calls
|
|
||||||
- PreToolUse hard blocks for high-risk operations
|
|
||||||
- Subagent delegation for isolated tasks (parent Opus → child Sonnet/Haiku)
|
|
||||||
|
|
||||||
### Smaller-scale local models (OmniCoder 9B via Ollama)
|
|
||||||
|
|
||||||
**Primary failure modes** (different from "low reasoning" — OmniCoder uses Qwen3
|
|
||||||
thinking blocks natively):
|
|
||||||
|
|
||||||
- Narrower training distribution (Python/JS heavy)
|
|
||||||
- Quantization degradation: JSON schema compliance drops as context fills
|
|
||||||
- Tool-call history is the primary context consumer — responses must be
|
|
||||||
truncated aggressively
|
|
||||||
- Instruction drift: fewer attention heads (32 vs 64 in 32B) means system prompt
|
|
||||||
recall degrades faster
|
|
||||||
|
|
||||||
**Infrastructure strategy**:
|
|
||||||
|
|
||||||
- PostToolUse response truncation at ~1500 tokens (plugin layer, not bash hook)
|
|
||||||
- PreToolUse JSON validation with schema-specific error messages
|
|
||||||
- Context pressure injection at ≥70% fill (~22K/32K tokens)
|
|
||||||
- `steps: 20` cap + `ask` permission gates for natural checkpoints
|
|
||||||
- `explore` subagent delegation to reduce context pressure on the main agent
|
|
||||||
- `NOTES.md` working memory pattern enforced in agent body
|
|
||||||
- No `web` tool — keeps context lean
|
|
||||||
- Reasoning guidance: "Hold references; load on demand" explicit in agent body
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## OmniCoder 2 Orchestration — Pending Work
|
|
||||||
|
|
||||||
> Full historical rationale and audit findings were maintained in
|
|
||||||
> `docs/projects/local-ai-orchestration.md` (deleted May 2026 after merge). The
|
|
||||||
> plan used an orchestrator-workers pattern with structural `edit: deny`
|
|
||||||
> enforcement on the orchestrator. All OpenCode config values verified against
|
|
||||||
> opencode.ai/docs (May 2026).
|
|
||||||
|
|
||||||
### Goals
|
|
||||||
|
|
||||||
1. All agents run on `ollama/arch-omni2-9b` — no cloud fallback
|
|
||||||
2. User can type vague prompts; the system decomposes and delegates
|
|
||||||
automatically
|
|
||||||
3. Context windows are isolated per subagent (no shared state bleed)
|
|
||||||
4. Changes scale forward: switching to cloud means changing model strings, not
|
|
||||||
architecture
|
|
||||||
|
|
||||||
### Pending Changes
|
|
||||||
|
|
||||||
#### Quick wins — under 5 minutes each, no testing required
|
|
||||||
|
|
||||||
1. - [x] **[CRITICAL] Fix `<tool\*call>` typo in `omnicoder2.modelfile`** —
|
|
||||||
markdown-escape artifact; malformed opening tag paired with correct
|
|
||||||
closing tag. Highest-leverage change; everything below depends on
|
|
||||||
reliable tool-call JSON.
|
|
||||||
2. - [x] **Mark canonical/deprecated modelfiles** — `# CANONICAL` header on
|
|
||||||
`omnicoder2.modelfile`; `# DEPRECATED` on `omnicoder.modelfile`;
|
|
||||||
`omnicoder-v2.modelfile.template` deleted (was dead code — v2 now
|
|
||||||
served from HuggingFace path).
|
|
||||||
3. - [x] **Add `compaction.reserved: 3000` to `opencode.json`** — default 10,000
|
|
||||||
fires compaction too early given ~8–12K baseline context.
|
|
||||||
4. - [x] **Fix `pre-compact.sh` prettier call** — removes `npx prettier` which
|
|
||||||
violates pre-tool-use Policy 1 (self-violating policy).
|
|
||||||
5. - [x] **MCP server error handling** — wrap `server.connect(transport)` in
|
|
||||||
try/catch with stderr + `process.exit(1)`.
|
|
||||||
|
|
||||||
#### Short session — 15–30 minutes each, bounded scope
|
|
||||||
|
|
||||||
6. - [x] **Fix `stop.sh` JSON escaping** — replace `sed`-based escaping with
|
|
||||||
`printf '%b' | node JSON.stringify` pattern used in every other hook.
|
|
||||||
7. - [x] **Per-session PostToolUse counter** — repo-scoped path
|
|
||||||
`/tmp/.opencode-tool-count-<repo-hash>` (derived from REPO_ROOT via
|
|
||||||
md5sum); prevents cross-repo contamination; session-start.sh resets it
|
|
||||||
at session begin.
|
|
||||||
8. - [x] **Shrink compaction prompt to ~120 words** (in
|
|
||||||
`.opencode/plugins/agent-support.ts`) — shorter instructions free
|
|
||||||
bandwidth for the 9B to actually summarize.
|
|
||||||
9. - [x] **Update `.agents/agents/build-local.md` for v2** — pagination 100 → 50
|
|
||||||
lines; rule 4 now says "recipient not dispatcher"; rule 7 scope-check
|
|
||||||
says "tell the user, do not self-decompose".
|
|
||||||
|
|
||||||
#### Depends on orchestrator being proven first
|
|
||||||
|
|
||||||
10. - [x] **Trim root `AGENTS.md` to ~60 lines** — reduced from 435 lines to 45
|
|
||||||
lines; all architecture rationale, code examples, quick task table,
|
|
||||||
and project context removed; cross-cutting rules and quality gate
|
|
||||||
preserved (May 2026).
|
|
||||||
11. - [x] **PostToolUse weighted counter** — reads (`read_file`, `grep`, `list`)
|
|
||||||
+0.25; writes/shell +1; keeps 15-call SELF-CHECK from firing
|
|
||||||
mid-investigation sweep. Depends on #7 (per-session counter) first.
|
|
||||||
|
|
||||||
**Implementation** (`.agents/hooks/post-tool-use.sh`): bash has no
|
|
||||||
float arithmetic — scale to integers: reads +1, writes/shell +4,
|
|
||||||
threshold 60 (equivalent to 15 effective write-units). Read-class
|
|
||||||
tools: `read_file`, `grep_search`, `list_dir`, `file_search`,
|
|
||||||
`semantic_search`, `explore_subagent`. Write/shell-class: all
|
|
||||||
`*_string_in_file`, `create_file`, `run_in_terminal`. Replace the
|
|
||||||
single `COUNT=$((COUNT + 1))` with a `case "$TOOL_NAME"` block that
|
|
||||||
does `COUNT=$((COUNT + 1))` for reads and `COUNT=$((COUNT + 4))` for
|
|
||||||
writes/shell. Change the self-check condition from
|
|
||||||
`(( COUNT % 15 == 0 ))` to `(( COUNT % 60 == 0 ))`.
|
|
||||||
|
|
||||||
12. - [x] **PostToolUse reminder priority filter** — emit at most 2 reminders
|
|
||||||
per tool call; priority: SELF-CHECK > DEBUGGING > path-scoped >
|
|
||||||
tool-specific. Depends on #11.
|
|
||||||
|
|
||||||
**Implementation** (`.agents/hooks/post-tool-use.sh`): replace the
|
|
||||||
current single `context` string accumulator with an indexed array
|
|
||||||
`reminders=()`. Each block appends `reminders+=("$msg")` in priority
|
|
||||||
order (SELF-CHECK first, DEBUGGING second, BFF/QUALITY GATE third,
|
|
||||||
RENAME fourth). At output time: join only the first 2 elements.
|
|
||||||
Append with `\n\n` separator. Blocks that didn't fire don't append,
|
|
||||||
so the cap is natural.
|
|
||||||
|
|
||||||
13. - [x] **Broaden PostToolUse truncation to all `ollama/` agents**
|
|
||||||
(`.opencode/plugins/agent-support.ts`); differentiate limit:
|
|
||||||
orchestrator 2,500 tokens vs workers 1,500. Minor until orchestrator
|
|
||||||
exists.
|
|
||||||
|
|
||||||
**Implementation**: rename `BUILD_LOCAL_MAX_RESPONSE_TOKENS` →
|
|
||||||
`LOCAL_WORKER_MAX_TOKENS = 1500`; add
|
|
||||||
`LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500`. In `tool.execute.after`, the
|
|
||||||
existing `isLocalAgent` check covers all `ollama/` agents via
|
|
||||||
`input.model.startsWith('ollama/')`. Add a second check:
|
|
||||||
`input.agent === 'local-orchestrator'` → use orchestrator limit, else
|
|
||||||
worker limit. The `agent` field is available in `tool.execute.after`
|
|
||||||
(confirmed working for `build-local`).
|
|
||||||
|
|
||||||
14. - [x] **Create `.agents/agents/local-orchestrator.md`** — primary agent with
|
|
||||||
`edit: deny`, `write: deny`, `bash: deny`; whitelist `task` to
|
|
||||||
`build-local`, `research`, `brainstorm` only.
|
|
||||||
|
|
||||||
**Implementation**: new file modeled on `build-local.md`. Role: receive
|
|
||||||
high-level goal, decompose into bounded subtasks, show decomposition to
|
|
||||||
user before dispatching, delegate via `task` subagent. Permission
|
|
||||||
block in `opencode.json` `agent.local-orchestrator`:
|
|
||||||
`{ "edit": "deny", "write": "deny", "bash": "deny" }`. Agent body
|
|
||||||
rules: (1) read project root `AGENTS.md` first; (2) produce a task
|
|
||||||
list and confirm with user before dispatching; (3) one `task` call per
|
|
||||||
subtask, wait for result; (4) never attempt to edit files directly —
|
|
||||||
if a subtask requires context the worker needs, inject it via the
|
|
||||||
`task` prompt, not by reading files yourself; (5) after all subtasks,
|
|
||||||
report summary to user.
|
|
||||||
|
|
||||||
15. - [x] ~~**Set `default_agent: "local-orchestrator"` in `opencode.json`**~~ —
|
|
||||||
Done May 2026. Key is `default_agent` (snake_case, confirmed from
|
|
||||||
`opencode.ai/config.json` schema). `local-orchestrator` has
|
|
||||||
`mode: all` so it qualifies as a primary agent.
|
|
||||||
|
|
||||||
#### Done
|
|
||||||
|
|
||||||
- [x] ~~**Soften `opus-deep.modelfile` directive**~~ — file deleted (May 2026);
|
|
||||||
DeepSeek R1 available online when needed; OmniCoder 2 is the sole local
|
|
||||||
model.
|
|
||||||
|
|
||||||
### Known Tradeoffs
|
|
||||||
|
|
||||||
| Tradeoff | Impact | Mitigation |
|
|
||||||
| -------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
|
|
||||||
| Instructions glob trimmed to root `AGENTS.md` only | Agents miss project-specific patterns for subdirectories unless they read nested `AGENTS.md` explicitly | Add reminder in orchestrator + build-local agent body: "check nested `AGENTS.md` before working in subdirectories" |
|
|
||||||
| Same model for all roles | Orchestrator, worker, compaction agent are all same weights with different prompts | Structural `edit: deny` is the safety net; circuit breakers limit runaway loops |
|
|
||||||
| No cloud fallback | If task is too complex for 9B, no escalation path | Orchestrator includes "ask the user for direction" rule; user can switch to Copilot |
|
|
||||||
| Latency | Sequential dispatch: orchestrator decomposes → build-local runs → returns. ~2× wall time vs. direct build-local | Acceptable for local dev; no VRAM multiplier since Ollama keeps weights hot |
|
|
||||||
| Reminder-stacking cap | 2-per-call priority filter (pending work above) drops lower-priority warnings | Skipped reminders fire on next call if condition holds |
|
|
||||||
|
|
||||||
### Cloud Migration Path
|
|
||||||
|
|
||||||
When ready to add a cloud model, only `opencode.json` changes:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"model": "ollama/arch-omni2-9b",
|
|
||||||
"agent": {
|
|
||||||
"local-orchestrator": {
|
|
||||||
"model": "anthropic/claude-haiku-4-5"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Schema verified against opencode.ai/docs/agents/ (May 2026). The `tools` key
|
|
||||||
inside agent configs is deprecated in favour of `permission` — the orchestrator
|
|
||||||
definition uses `permission`, so it is current. The `agent.{name}.model` key is
|
|
||||||
the correct per-agent override mechanism.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Ecosystem Gap — Contextual AGENTS.md Injection
|
|
||||||
|
|
||||||
During local AI work (May 2026) we hit a fundamental limitation: OpenCode's
|
|
||||||
`instructions` glob in `opencode.json` loads **all matched files upfront** into
|
|
||||||
every session. For a 9B local model with a 32K context window, loading all of
|
|
||||||
`apps/*/AGENTS.md` and `packages/*/AGENTS.md` at startup consumes ~30–40% of the
|
|
||||||
context budget before the first message, triggering early compaction and
|
|
||||||
degrading quality.
|
|
||||||
|
|
||||||
The correct behaviour — injecting only the AGENTS.md relevant to the file being
|
|
||||||
edited — does not exist natively in OpenCode or its plugin ecosystem. The
|
|
||||||
closest community plugin (`opencode-skillful`, 295 stars) is archived as of Feb
|
|
||||||
2026 and still requires the model to explicitly call `skill_find`/`skill_use`;
|
|
||||||
it provides no path-triggered structural injection.
|
|
||||||
|
|
||||||
### Open tasks
|
|
||||||
|
|
||||||
16. - [ ] **Assess: is filling this ecosystem gap worth the effort?** — Before
|
|
||||||
building a contextual-injection plugin, evaluate: (a) Is OpenCode
|
|
||||||
actively used for serious local AI coding work, or is the community
|
|
||||||
primarily cloud-model users for whom context cost is irrelevant? (b)
|
|
||||||
Are there better local AI coding stacks (e.g. Aider + litellm, Cursor
|
|
||||||
local mode, VS Code Copilot + Ollama) where this problem is already
|
|
||||||
solved? (c) Is the `tool.execute.before` event stable enough to build
|
|
||||||
on? Target: 30-minute research session, concrete go/no-go
|
|
||||||
recommendation.
|
|
||||||
|
|
||||||
17. - [ ] **Review + write up our issues and fixes as an ecosystem
|
|
||||||
contribution** — If the gap is worth filling: document the
|
|
||||||
context-bleed problem, the early-compaction root cause, our hook-based
|
|
||||||
mitigation, and the remaining structural gap. Publish as a GitHub
|
|
||||||
issue on the OpenCode repo and/or an npm plugin
|
|
||||||
(`opencode-contextual-rules`?) implementing `tool.execute.before`
|
|
||||||
path-triggered AGENTS.md injection. Depends on #16 go/no-go.
|
|
||||||
|
|
||||||
18. - [x] ~~**Trim `.agents/AGENTS.md`**~~ — Done May 2026. Condensed from
|
|
||||||
12,584 → 10,507 bytes (43 lines removed). Trimmed: Hook Architecture
|
|
||||||
Principle block (redirected to item 22 in project doc), Deferred
|
|
||||||
Loading example + "why not" paragraph, session-start/stop hook prose,
|
|
||||||
outdated `generate-agents.ts` references in Skills/Agents sections.
|
|
||||||
Agent body files updated to prompt-body-only convention (see items
|
|
||||||
25/26).
|
|
||||||
|
|
||||||
19. - [x] ~~**Block bash bypass of read pagination**~~ — Done May 2026. Added
|
|
||||||
Policy 14 to `pre-tool-use.sh`: blocks `cat`/`head`/`tail`/`jq` reads
|
|
||||||
of `apps/*/package.json` and `packages/*/package.json`. Scope limited
|
|
||||||
to package.json (confirmed live bypass vector); general `.ts`/`.md`
|
|
||||||
bash reads are not yet blocked (lower-urgency gap). Pattern verified
|
|
||||||
with Node.js unit test — exact bypass command
|
|
||||||
`cat apps/api/package.json | jq` is caught by P1.
|
|
||||||
|
|
||||||
20. - [ ] **Improve explore-first scope detection** — Policy 14 blocks
|
|
||||||
`manage_todo_list` with ≥4 items, but OmniCoder sometimes starts with
|
|
||||||
`Explore`/`find` before planning, bypassing the check. Options: (a)
|
|
||||||
block `explore_subagent` when the query looks like a multi-file
|
|
||||||
discovery sweep (glob patterns for source files across multiple dirs);
|
|
||||||
(b) add a pre-tool-use check on `run_in_terminal` that denies `find`
|
|
||||||
commands spanning the whole repo when the task hasn't been scoped yet;
|
|
||||||
(c) rely on the todo-list check firing when planning eventually
|
|
||||||
happens (current behavior — catches it late but still before edits
|
|
||||||
start).
|
|
||||||
|
|
||||||
21. - [x] ~~**Remove debug logging from plugin after verified cycle**~~ — Done
|
|
||||||
May 2026. Removed the full-input dump block from `tool.execute.before`
|
|
||||||
in `plugin.ts` (`/tmp/plugin-debug.jsonl` appender). Guards verified
|
|
||||||
via `opencode export` session transcript inspection — no longer need
|
|
||||||
the dump file. Hook error logger (`/tmp/plugin-hook-errors.log`) kept
|
|
||||||
as it only fires on failures, not every call.
|
|
||||||
|
|
||||||
22. - [ ] **Refactor hook scripts to be platform-agnostic** — currently
|
|
||||||
`pre-tool-use.sh` parses Copilot-specific JSON and outputs
|
|
||||||
Copilot-specific `permissionDecision` JSON. `plugin.ts` implements
|
|
||||||
duplicate guards inline rather than calling the script. This means
|
|
||||||
OpenCode and Copilot guards can drift (confirmed May 2026: Policy 14
|
|
||||||
in `pre-tool-use.sh` had no effect on OpenCode `bash` tool calls).
|
|
||||||
|
|
||||||
**Design target**: scripts accept normalized env vars (`TOOL_NAME`,
|
|
||||||
`COMMAND`, `FILE_PATH`), exit non-zero with plain-text denial reason
|
|
||||||
on stdout. Callers normalize input and translate output to their
|
|
||||||
native denial format. Tracked in `.agents/AGENTS.md` Hook Architecture
|
|
||||||
Principle section.
|
|
||||||
|
|
||||||
**Audit required first**: review all hook scripts for Copilot-specific
|
|
||||||
assumptions before refactoring.
|
|
||||||
|
|
||||||
23. - [ ] **Question-drift marker in `user-prompt-submit.sh`** — when the model
|
|
||||||
has committed to a prior position and follow-up questions are being
|
|
||||||
misread through that lens, prepend a disambiguation marker at the
|
|
||||||
prompt tail. Detected pattern: model answers "no" or "not possible" in
|
|
||||||
a prior turn → subsequent turns interpreted as defense of that
|
|
||||||
position. See §2.1 ("Position-anchored priming") in the research doc.
|
|
||||||
|
|
||||||
**Implementation**: in `user-prompt-submit.sh`, read the last N turns
|
|
||||||
of `$TRANSCRIPT_PATH` (injected by OpenCode's native hook env) and
|
|
||||||
look for a prior committed "no/impossible/can't" response within the
|
|
||||||
last 3 model turns. If detected, append to `ADDITIONAL_CONTEXT`:
|
|
||||||
`CURRENT QUESTION (answer only this — not the prior exchange): [prompt
|
|
||||||
text]`. The key is repeating the user's exact question at the tail,
|
|
||||||
after the marker, to counteract lost-in-the-middle effects. Fallback
|
|
||||||
trigger: user prompt contains "that's not what I asked" / "you're
|
|
||||||
answering the wrong question" / "I said" → always inject marker
|
|
||||||
regardless of transcript scan.
|
|
||||||
|
|
||||||
24. - [x] ~~**Review all custom agent files for local-model-specific framing**~~
|
|
||||||
— Done May 2026. `build-local.md` reframed: dropped "OmniCoder", "9B",
|
|
||||||
"Ollama", "Qwen3 thinking blocks", "32K tokens total"; replaced with
|
|
||||||
model-agnostic equivalents. `research.md` and `brainstorm.md` verified
|
|
||||||
clean — no model/provider mentions. `local-orchestrator.md` was fixed
|
|
||||||
earlier this session. All four agent body files are now
|
|
||||||
model-agnostic.
|
|
||||||
|
|
||||||
25. - [ ] **Failure-mode routing in SELF-CHECK** — when the periodic SELF-CHECK
|
|
||||||
fires in `post-tool-use.sh`, if a recent terminal failure or test
|
|
||||||
failure is also present in the same turn, classify the failure type
|
|
||||||
and inject the matched intervention rather than generic "step back."
|
|
||||||
Reference: failure-mode routing table in §3.5 of the research doc.
|
|
||||||
|
|
||||||
**Implementation**: in the SELF-CHECK block, if `context` already
|
|
||||||
contains `DEBUGGING REMINDER` (i.e., test/terminal failure co-occurred
|
|
||||||
this turn), append a classification hint:
|
|
||||||
`FAILURE TYPE HINT: If this is a test/build failure → Reflexion loop
|
|
||||||
(fix based on test output). If convention violation → grep for the
|
|
||||||
pattern and inject a canonical example. If wrong file/directory → stop
|
|
||||||
and re-read the project structure. Do not default to "try harder."`.
|
|
||||||
Low implementation cost — pure text append with a conditional on
|
|
||||||
`$context`.
|
|
||||||
|
|
||||||
26. - [x] ~~**Audit agent `.md` files for OpenCode-specific frontmatter**~~ —
|
|
||||||
Done May 2026. Audit result: only `local-orchestrator.md` had OpenCode
|
|
||||||
frontmatter keys (`mode`, `model`, `permission`). `brainstorm.md`,
|
|
||||||
`build-local.md`, `research.md` were already plain markdown. Went with
|
|
||||||
option (b): stripped `mode`/`model`/`permission` from
|
|
||||||
`local-orchestrator.md`; moved `mode: all` into `opencode.json`
|
|
||||||
(model + permission were already there). Kept `description` in
|
|
||||||
frontmatter as it is neutral and self-documenting. Body files are now
|
|
||||||
prompt-body only — valid in both OpenCode and Copilot.
|
|
||||||
|
|
||||||
27. - [ ] **`plugin.ts` local-agent detection uses provider prefix, not agent
|
|
||||||
name** — `tool.execute.after` detects local agents via
|
|
||||||
`input.model.startsWith('ollama/')`. This is provider-specific: if the
|
|
||||||
model is served via a different backend (e.g. `llama-server/`,
|
|
||||||
`lmstudio/`), truncation silently stops working. Fix: detect by agent
|
|
||||||
name (`input.agent.includes('build-local')`) only, removing the
|
|
||||||
`ollama/` fallback. The `input.agent` field is available in
|
|
||||||
`tool.execute.after` (confirmed May 2026).
|
|
||||||
|
|
||||||
28. - [ ] **`plugin.ts` context pressure threshold is hardcoded to 32,768
|
|
||||||
tokens** — `CONTEXT_LIMIT_TOKENS = 32768` assumes OmniCoder 9B's
|
|
||||||
context window. If the local model changes, the threshold silently
|
|
||||||
drifts out of calibration. Options: (a) read from `opencode.json`
|
|
||||||
model config if OpenCode exposes it to plugins; (b) make it a
|
|
||||||
top-of-file constant with a comment to update when changing models;
|
|
||||||
(c) accept the drift as low-severity (threshold is advisory only —
|
|
||||||
context pressure warnings are informational, not blocking). Option (b)
|
|
||||||
is the minimum; option (a) is ideal if OpenCode exposes model metadata
|
|
||||||
to plugins.
|
|
||||||
|
|
||||||
29. - [x] ~~**Move `permission` out of `local-orchestrator.md` frontmatter**~~ —
|
|
||||||
Done May 2026 as part of item 25. `mode: all` added to `opencode.json`
|
|
||||||
agent entry. `model` and `permission` were already in `opencode.json`.
|
|
||||||
`opencode.json` is now the single source of truth for all runtime
|
|
||||||
config; `.md` files are prompt-body only.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Testing & Regression
|
|
||||||
|
|
||||||
**Research summary (May 2026):** No pre-existing tool exactly fits this use
|
|
||||||
case. Existing tools (RagaAI Catalyst, AgentEvalKit, agent-eval-arena,
|
|
||||||
intent-eval-lab, j-rig-skill-binary-eval) focus on LLM output quality,
|
|
||||||
hallucination detection, or cross-runtime behavior scoring — not config file
|
|
||||||
structure or policy enforcement regression. The closest analogue is
|
|
||||||
`j-rig-skill-binary-eval` (binary pass/fail criteria across 7 layers), which
|
|
||||||
uses the same conceptual approach we'd want here. Our testing is bespoke by
|
|
||||||
necessity: we're testing configuration files, shell scripts, and specific policy
|
|
||||||
enforcement behaviors, not general LLM response quality.
|
|
||||||
|
|
||||||
**Two layers of testing:**
|
|
||||||
|
|
||||||
| Layer | What it tests | Cost | When to run |
|
|
||||||
| --------------------------- | --------------------------------------- | ---------------- | -------------------------------------- |
|
|
||||||
| Config + policy unit tests | Schema validity, hook regex correctness | None (no model) | Always — CI, pre-commit |
|
|
||||||
| CLI integration smoke tests | Actual enforcement via `opencode run` | Local model only | On-demand; local model must be running |
|
|
||||||
|
|
||||||
**Cloud agents excluded from integration tests** — `opencode run` with a cloud
|
|
||||||
model (Copilot, Anthropic) incurs API costs and rate limits. Tests must detect
|
|
||||||
the active model and skip if it's not a local provider.
|
|
||||||
|
|
||||||
### Open tasks
|
|
||||||
|
|
||||||
30. - [ ] **Config + policy unit test suite** — test config file structure and
|
|
||||||
hook regex patterns without invoking any model. Implementation:
|
|
||||||
|
|
||||||
a. **`opencode.json` schema validation**: the file references
|
|
||||||
`"$schema": "https://opencode.ai/config.json"` — validate it using
|
|
||||||
`ajv` (already used in the monorepo) against the live schema or a
|
|
||||||
cached copy. Catches permission typos, unknown agent keys,
|
|
||||||
unsupported field values.
|
|
||||||
|
|
||||||
b. **Hook JSON structure validation**: validate
|
|
||||||
`.agents/frameworks/github/hooks.json` and
|
|
||||||
`.agents/frameworks/opencode/plugin.ts` (TypeScript, already type-
|
|
||||||
checked). Write a schema for the hooks JSON format and run ajv on
|
|
||||||
it.
|
|
||||||
|
|
||||||
c. **Hook policy regex unit tests**: extract every regex used in
|
|
||||||
`pre-tool-use.sh` into a `tests/hooks.test.ts` file and run it
|
|
||||||
with `vitest`. For each policy, define 2–3 input strings that
|
|
||||||
SHOULD match and 2–3 that SHOULD NOT. Policy 14 already has an
|
|
||||||
informal Node.js test from this session — formalize it.
|
|
||||||
|
|
||||||
d. **Agent `.md` frontmatter validator**: check that no agent file
|
|
||||||
under `.agents/agents/` has frontmatter keys other than
|
|
||||||
`description`. Catches regression when someone adds `model:` or
|
|
||||||
`permission:` back to a body file.
|
|
||||||
|
|
||||||
**Suggested location**: `.agents/tests/` or root `test/agents/`.
|
|
||||||
**Stack**: vitest (already in monorepo), ajv (already available), Node
|
|
||||||
built-ins. No new dependencies needed.
|
|
||||||
|
|
||||||
31. - [ ] **CLI integration smoke tests (local model only)** — use
|
|
||||||
`opencode run` in non-interactive mode to verify enforcement is
|
|
||||||
actually firing via the real runtime. These tests exercise the
|
|
||||||
plugin + hook wiring end-to-end.
|
|
||||||
|
|
||||||
**Command shape**:
|
|
||||||
```
|
|
||||||
opencode run "prompt" --agent build-local \
|
|
||||||
--model llama-server/arch-omni2-9b-native \
|
|
||||||
--format json
|
|
||||||
```
|
|
||||||
|
|
||||||
**Assertions via `opencode export`**: after each run, export the
|
|
||||||
session with `opencode export <sessionID> 2>/dev/null` and parse the
|
|
||||||
JSON transcript. Assert on `parts` array: tool calls that SHOULD have
|
|
||||||
been blocked appear with error/denied status; tool calls that SHOULD
|
|
||||||
have passed completed normally.
|
|
||||||
|
|
||||||
**Test cases to start with** (all verified real enforcement gaps):
|
|
||||||
1. Attempt to `read` a nested `package.json` (e.g. `apps/api/package.json`) → BLOCKED by plugin
|
|
||||||
package.json guard
|
|
||||||
2. Attempt to `read` a source file with no `limit` → BLOCKED by
|
|
||||||
pagination guard
|
|
||||||
3. Attempt to `read` a source file with `limit: 51` → BLOCKED
|
|
||||||
4. Attempt to `read` a docs file with `limit: 501` → BLOCKED
|
|
||||||
5. Attempt to `read` a docs file with `limit: 50` → PASSES
|
|
||||||
6. Bash command `cat apps/api/package.json` → BLOCKED by pre-tool-use
|
|
||||||
Policy 14 (substitute your project's equivalent nested package.json)
|
|
||||||
|
|
||||||
**Guard rail**: skip all tests if `llama-server` is not reachable at
|
|
||||||
`http://127.0.0.1:8080/v1`. Do not run against cloud models. Add
|
|
||||||
an env var `AGENT_INTEGRATION_TESTS=1` required to enable (off by
|
|
||||||
default, never runs in standard `npm test`).
|
|
||||||
|
|
||||||
**Suggested location**: `.agents/tests/integration/`.
|
|
||||||
**Stack**: Node.js test runner or vitest, `opencode` CLI in PATH.
|
|
||||||
|
|
||||||
### Verified facts (May 2026)
|
|
||||||
|
|
||||||
- OpenCode's `read` tool input schema is
|
|
||||||
`{ filePath: string, limit?: number, offset?: number }` — NOT
|
|
||||||
`startLine`/`endLine`. Confirmed via plugin debug logging of real tool calls.
|
|
||||||
- `tool.execute.before` input contains only `{ tool, sessionID, callID }`. It
|
|
||||||
does NOT include `agent` or `model`, so plugin-layer gating cannot filter by
|
|
||||||
agent. Confirmed via plugin debug logging.
|
|
||||||
- **OpenCode has its own native hook system** that calls `pre-tool-use.sh`
|
|
||||||
directly for tools like `run_in_terminal`, `replace_string_in_file`, etc. This
|
|
||||||
is completely separate from the plugin's `runHook` calls. The native hook
|
|
||||||
payload includes `timestamp`, `hook_event_name`, `session_id`,
|
|
||||||
`transcript_path`, `tool_use_id`, and `cwd` — fields the plugin never sends.
|
|
||||||
The plugin `runHook` is a _second_ call, layered on top.
|
|
||||||
- **Bun shell `$` API does not have a `.stdin()` method.** The correct API for
|
|
||||||
piping stdin is `` $`cmd < ${Buffer.from(text)}` ``. `.stdin(text)` silently
|
|
||||||
throws `TypeError: $\`...\`.stdin is not a
|
|
||||||
function`, which was caught by `runHook`'s `catch`block and returned`''`. This caused the plugin's `runHook`to silently no-op for every call with`stdinJson`since the plugin was first written — hook enforcement (all 12 policies) was never running via the plugin path. It only ran via OpenCode's native hook system for the tools OpenCode natively supports. Confirmed via`/tmp/plugin-hook-errors.log`.
|
|
||||||
- **The silent `catch` in `runHook` is dangerous.** It masked the Bun `.stdin()`
|
|
||||||
bug entirely. Always log hook failures to a debug file during development;
|
|
||||||
remove only after enforcement is verified working.
|
|
||||||
- **Plugin-layer enforcement works for `read`** after fixing the Bun stdin API.
|
|
||||||
The `read` tool fires `tool.execute.before` in the plugin, which calls
|
|
||||||
`runHook('pre-tool-use.sh', ...)` via `< ${Buffer.from(...)}`, which applies
|
|
||||||
Policy 13 (50-line limit). Verified: bare `read` (no limit) → BLOCKED; `read`
|
|
||||||
with `limit: 50` → passes. (May 2026)
|
|
||||||
- **Plugin load failure: unescaped regex slashes caused silent syntax error.**
|
|
||||||
`plugin-debug.jsonl` was empty even after the Bun stdin fix because the plugin
|
|
||||||
file itself failed to parse. Line 84 had `/(^|/)(apps|packages)/[^/]+/...` —
|
|
||||||
forward slashes inside the regex literal were not escaped, producing a JS
|
|
||||||
syntax error at parse time. Bun silently drops plugins that fail to import.
|
|
||||||
Fixed to `/(^|\/)(apps|packages)\/[^/]+\/...`. The fix also corrected the
|
|
||||||
pagination guard to use `limit`/`offset` (not `startLine`/`endLine`) and added
|
|
||||||
an unbounded-read block (`limit === undefined`). All three guards verified
|
|
||||||
working in a live session (May 2026).
|
|
||||||
- **Package.json read guard verified working.** `local-orchestrator` attempting
|
|
||||||
to read `apps/*/package.json` and `packages/*/package.json` → BLOCKED by
|
|
||||||
plugin. Root `package.json` read correctly passes. (May 2026)
|
|
||||||
- **Policy 14 (`manage_todo_list` ≥ 4 items) catches some but not all broad task
|
|
||||||
attempts.** OmniCoder sometimes proceeds directly to `Explore`/`find` without
|
|
||||||
calling `manage_todo_list` first, bypassing the policy. When it does plan with
|
|
||||||
the todo tool before acting, the deny fires correctly.
|
|
||||||
- **OmniCoder comprehension failure: prompt ambiguity → wrong directory.** Given
|
|
||||||
"refactor the five hook files", OmniCoder ran a glob for `*hook*` files and
|
|
||||||
found `.husky/` hooks instead of `.agents/hooks/`. The correct files were in
|
|
||||||
the grep output from the Explore subagent but were not selected. Root cause:
|
|
||||||
the model lacks enough context about the repo layout to disambiguate "hook
|
|
||||||
files" without explicit path guidance. Mitigation: be explicit in prompts
|
|
||||||
("the five `.agents/hooks/*.sh` files").
|
|
||||||
- **OpenCode agent `permission` config requires a `.opencode/agents/<name>.md`
|
|
||||||
file.** Without a matching markdown file, `opencode.json`'s
|
|
||||||
`agent.<name>.permission` config is silently ignored — the agent is unknown to
|
|
||||||
OpenCode and runs as a nameless build-agent alias. The markdown file must
|
|
||||||
exist in `.opencode/agents/` (or `~/.config/opencode/agents/`). Confirmed by
|
|
||||||
test run where `@local-orchestrator` edited files despite
|
|
||||||
`permission.edit: "deny"` in JSON config; fixed by creating
|
|
||||||
`.opencode/agents/local-orchestrator.md` symlink. (May 2026)
|
|
||||||
- **`"write"` is NOT a valid OpenCode permission key.** Use `"edit"` instead —
|
|
||||||
it covers `write`, `edit`, and `apply_patch` tools. `"write": "deny"` is
|
|
||||||
silently ignored. Valid top-level permission keys include: `read`, `edit`,
|
|
||||||
`glob`, `grep`, `list`, `bash`, `task`, `skill`, `lsp`, `question`,
|
|
||||||
`webfetch`, `websearch`, `external_directory`, `doom_loop`, `todowrite`.
|
|
||||||
Confirmed from `opencode.ai/docs/permissions` (May 2026).
|
|
||||||
- **`default_agent` key is snake_case** in `opencode.json` (not `defaultAgent`).
|
|
||||||
Confirmed from `opencode.ai/docs/config` (May 2026).
|
|
||||||
- **`tools: false` is deprecated.** The current approach for per-agent tool
|
|
||||||
restriction is `permission: { edit: "deny" }`. The old `tools: false` still
|
|
||||||
works but is documented as legacy. Confirmed from `opencode.ai/docs/agents`
|
|
||||||
(May 2026).
|
|
||||||
- **Broken symlinks are silent.** OpenCode does not error on a broken
|
|
||||||
`.opencode/agents/` symlink — it just skips the agent silently. The agent
|
|
||||||
won't appear in `opencode agent list` and all `opencode.json` permission
|
|
||||||
config for it is ignored. Always verify with
|
|
||||||
`cat .opencode/agents/<name>.md | head -5` (should print content, not a "No
|
|
||||||
such file" error) and `opencode agent list` (agent should appear with correct
|
|
||||||
deny rules). The correct symlink depth from `.opencode/agents/` is
|
|
||||||
`../../.agents/agents/<name>.md` (two levels), not three.
|
|
||||||
- **`opencode agent list` is the authoritative verification command.** Run it
|
|
||||||
after any agent config change to confirm: (a) the agent appears by name, (b)
|
|
||||||
its mode is correct (`all`/`primary`/`subagent`), and (c) `deny` rules appear
|
|
||||||
at the bottom of its permission list. Missing agent = broken symlink or YAML
|
|
||||||
parse error. Present but missing deny rules = frontmatter not parsed correctly
|
|
||||||
or wrong key names. (May 2026)
|
|
||||||
- **`@mention` routing only works at session start.** If you send any message
|
|
||||||
that gets answered by the current primary agent first, then send
|
|
||||||
`@local-orchestrator ...`, the TUI passes the full message text to the current
|
|
||||||
model (Build/OmniCoder) which treats `@local-orchestrator` as freeform text
|
|
||||||
and answers it itself. Always open a **fresh session** and make `@agent-name`
|
|
||||||
the very first message. Alternatively, use
|
|
||||||
`opencode run --agent local-orchestrator "..."` from the CLI for reliable
|
|
||||||
agent-scoped invocation. **Tab-switching to a custom `all`-mode agent in an
|
|
||||||
existing session works correctly.**
|
|
||||||
- **`edit: deny` on `local-orchestrator` is working correctly.** When given an
|
|
||||||
edit task, the orchestrator correctly avoided using `replace_string_in_file`
|
|
||||||
and instead used the `task` tool to delegate to a subagent. This is the
|
|
||||||
expected behaviour. Confirmed May 2026.
|
|
||||||
- **`task` tool has a JSON serialization limit.** OmniCoder 9B caused an
|
|
||||||
`Unterminated string` error by embedding the entire contents of multiple
|
|
||||||
`package.json` files as a literal string inside the `task` prompt JSON. The
|
|
||||||
`task` tool prompt is serialized as JSON; very long strings truncate and
|
|
||||||
produce parse errors. Mitigation: instruct the orchestrator in its system
|
|
||||||
prompt to tell workers _which files to read_ rather than quoting file contents
|
|
||||||
inline. This has been added to `local-orchestrator.md`. (May 2026)
|
|
||||||
- **`ollama/arch-omni2-9b` is the wrong model identifier for the llama-server
|
|
||||||
instance.** The correct ID is `llama-server/arch-omni2-9b-native` (verify with
|
|
||||||
`opencode models | grep arch`). Using the wrong ID causes an immediate "cannot
|
|
||||||
load model" error when the agent is invoked. Fixed in `opencode.json` and
|
|
||||||
`local-orchestrator.md` frontmatter. (May 2026)
|
|
||||||
|
|
||||||
## Open Issues
|
|
||||||
|
|
||||||
Known bugs and stale claims identified during code review (see deleted
|
|
||||||
`agent-infrastructure-review.md` and `agent-infrastructure-review-pass2.md` for
|
|
||||||
full context). Not yet fixed.
|
|
||||||
|
|
||||||
### CRITICAL — `description:` empty in all generated agent/skill files
|
|
||||||
|
|
||||||
`scripts/generate-agents.ts` uses a hand-rolled YAML parser that silently drops
|
|
||||||
descriptions when they are written in block-scalar form (value on the next line
|
|
||||||
under the key). Every generated file in `.github/agents/`, `.github/skills/`,
|
|
||||||
`.opencode/agents/`, `.opencode/skills/` has a blank `description:` field.
|
|
||||||
|
|
||||||
`description:` is the primary routing signal for Copilot's
|
|
||||||
`SkillsContextComputer` and OpenCode's agent dispatch. Explicitly `@`-mentioning
|
|
||||||
an agent by name still works; description-triggered auto-routing does not.
|
|
||||||
|
|
||||||
**Fix**: Inline the description strings in the canonical `.agents/` source files
|
|
||||||
(change block-scalar to `key: 'value'` format). The existing parser handles
|
|
||||||
inline strings correctly. Add a `generate:agents:check` assertion that every
|
|
||||||
generated file has a non-empty `description:`.
|
|
||||||
|
|
||||||
### MEDIUM — ~~`printf '%s'` regression in hooks breaks `\n` rendering~~ (resolved)
|
|
||||||
|
|
||||||
~~`.agents/hooks/post-tool-use.sh`, `session-start.sh`, and
|
|
||||||
`user-prompt-submit.sh` use `printf '%s' "$context" | node -e '...'` to
|
|
||||||
JSON-escape the context variable. `%s` does not interpret `\n` escape sequences,
|
|
||||||
so multi-line context strings (SELF-CHECK, DEBUGGING REMINDER, BFF REMINDER)
|
|
||||||
arrive at the model as single lines with literal `\n` characters.~~
|
|
||||||
|
|
||||||
**Verified fixed** (May 2026): all three hooks already use `printf '%b'`.
|
|
||||||
|
|
||||||
### LOW — ~~arXiv citation `2603.29957` unverified~~ (resolved)
|
|
||||||
|
|
||||||
~~`arXiv:2603.29957` (Jiang et al. 2026, "Think-Anywhere") appears in
|
|
||||||
`.agents/agents/research.md`, `.agents/agents/brainstorm.md`, and the Research
|
|
||||||
Foundation section above. Verify the ID resolves at
|
|
||||||
`https://arxiv.org/abs/2603.29957` and fix all references if it doesn't.~~
|
|
||||||
|
|
||||||
**Verified real** (May 2026): "Think Anywhere in Code Generation" by Xue Jiang,
|
|
||||||
Tianyu Zhang, Ge Li et al., submitted March 31, 2026, revised April 27, 2026
|
|
||||||
(v3), cs.SE. All existing citations are correct.
|
|
||||||
|
|
||||||
### LOW — ~~`.claude/` false claims in `tool-agnostic-agent-infra.md`~~ (resolved)
|
|
||||||
|
|
||||||
The file `docs/projects/tool-agnostic-agent-infra.md` no longer exists — already
|
|
||||||
deleted. No action needed.
|
|
||||||
File diff suppressed because it is too large
Load Diff
@ -1,162 +0,0 @@
|
|||||||
> What are the different AI model architectures? ... What are the experimental
|
|
||||||
> ones? ... Any more? ... I want you to combine all of those into a small
|
|
||||||
> bulleted list
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Sequential & Vision Foundations
|
|
||||||
|
|
||||||
- RNN, LSTM, GRU: Sequential networks processing text and audio through internal
|
|
||||||
memory loops.
|
|
||||||
- CNN: Spatial processing architectures using mathematical filters to parse
|
|
||||||
visual grids.
|
|
||||||
- Vision Transformers (ViT): Vision models using self-attention directly on
|
|
||||||
split image patches.
|
|
||||||
|
|
||||||
## Modern Generative Engines
|
|
||||||
|
|
||||||
- Transformers: High-capacity architectures utilizing global self-attention to
|
|
||||||
process text sequences.
|
|
||||||
- GANs: Paired networks staging a generator against a discriminator to create
|
|
||||||
imagery.
|
|
||||||
- Diffusion Models: Generative systems producing content by systematically
|
|
||||||
reversing data noise.
|
|
||||||
- VAEs: Systems compressing data into lower dimensions to decode structural
|
|
||||||
variations.
|
|
||||||
|
|
||||||
## Efficiency & Structure Scaling
|
|
||||||
|
|
||||||
- Mixture of Experts (MoE): Sparse networks activating targeted sub-expert nodes
|
|
||||||
to optimize computation.
|
|
||||||
- Graph Neural Networks (GNN): Frameworks natively executing message-passing
|
|
||||||
over complex non-grid relationships.
|
|
||||||
- Large Action Models (LAM): Automation architectures translating semantic
|
|
||||||
intents into executable digital workflows.
|
|
||||||
- Attention-SSM Hybrids: Layered blocks combining global attention processing
|
|
||||||
with fast linear scaling.
|
|
||||||
|
|
||||||
## Next-Gen Context & Linear Processing
|
|
||||||
|
|
||||||
- Mamba & SSMs: Hardware-aware state models enabling infinite context lengths
|
|
||||||
with linear scaling.
|
|
||||||
- Liquid Neural Networks (LNN): Dynamic architectures utilizing fluid
|
|
||||||
differential equations for real-time adaptation.
|
|
||||||
- Diffusion Language Models (DLM): Non-sequential text systems iteratively
|
|
||||||
refining whole content blocks.
|
|
||||||
- Joint Embedding Predictive (JEPA): Meaning-driven architectures predicting
|
|
||||||
high-level conceptual outcomes over raw pixels.
|
|
||||||
|
|
||||||
## Dynamic Reasoning & Alternative Physics
|
|
||||||
|
|
||||||
- Test-Time Training (TTT / MoR): Flexible loops scaling internal computing
|
|
||||||
depth dynamically per query.
|
|
||||||
- Distributed Neural Graphs: Modular webs routing separate context tokens
|
|
||||||
through specialized pathways.
|
|
||||||
- Neuro-Symbolic Hybrids: Integrated systems merging raw neural intuition with
|
|
||||||
precise sandbox compilation.
|
|
||||||
- Spiking Neural Networks (SNN): Bio-mimetic hardware structures firing nodes
|
|
||||||
strictly via threshold spikes.
|
|
||||||
- Thermodynamic Models: Analog physical systems computing answers by settling
|
|
||||||
into low-energy equilibriums.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
> Which of these require the least power consumption, least training data, least
|
|
||||||
> compute, and least VRAM
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
To identify which architecture requires the absolute minimum across all four
|
|
||||||
constraints, the category of Alternative Physics & Bio-mimetic Architectures
|
|
||||||
wins outright, followed immediately by classical spatial networks. The absolute
|
|
||||||
winners for each specific bottleneck are broken down below:
|
|
||||||
|
|
||||||
## 1. Least Power Consumption: Spiking Neural Networks (SNNs)
|
|
||||||
|
|
||||||
- The Winner: Spiking Neural Networks (SNNs).
|
|
||||||
- Why: Traditional AI models are mathematically "dense," meaning billions of
|
|
||||||
multipliers and transistors must turn on and consume electricity for every
|
|
||||||
single token processed. SNNs mimic biological neurons and operate using sparse
|
|
||||||
electrical spikes.
|
|
||||||
- The Difference: If there is no new incoming data, energy consumption drops to
|
|
||||||
near zero. Because nodes only fire when an electrical threshold is crossed,
|
|
||||||
SNNs use up to 100× to 1000× less electrical energy than traditional deep
|
|
||||||
learning networks, making them the gold standard for low-power edge hardware.
|
|
||||||
[1]
|
|
||||||
|
|
||||||
## 2. Least Training Data: Neuro-Symbolic Hybrids & JEPAs
|
|
||||||
|
|
||||||
- The Winner: Neuro-Symbolic Hybrids (closely followed by Joint Embedding
|
|
||||||
Predictive Architectures / JEPA).
|
|
||||||
- Why: Transformers require hundreds of billions of text tokens to accidentally
|
|
||||||
stumble upon and learn basic mathematical rules (like addition).
|
|
||||||
Neuro-Symbolic systems hardcode a classical, rigid rules engine directly into
|
|
||||||
the AI's core.
|
|
||||||
- The Difference: Instead of needing to see 10,000 examples of a math problem to
|
|
||||||
recognize a pattern, a Neuro-Symbolic model needs exactly one prompt because
|
|
||||||
it instantly routes the logic out to a pre-programmed mathematical compiler.
|
|
||||||
|
|
||||||
## 3. Least Compute (FLOPs): Convolutional Neural Networks (CNNs)
|
|
||||||
|
|
||||||
- The Winner: Convolutional Neural Networks (CNNs).
|
|
||||||
- Why: Modern Vision Transformers (ViTs) compare every single pixel patch
|
|
||||||
against every other pixel patch globally, creating an exploding computation
|
|
||||||
bottleneck. CNNs strictly look at localized, overlapping grids using tiny math
|
|
||||||
matrices (kernels).
|
|
||||||
- The Difference: For simple classification and computer vision tasks (like
|
|
||||||
recognizing a stop sign), a highly optimized CNN calculates a mere fraction of
|
|
||||||
the floating-point operations (FLOPs) required by an attention-based Vision
|
|
||||||
Transformer.
|
|
||||||
|
|
||||||
## 4. Least VRAM: Mamba / State Space Models (SSMs)
|
|
||||||
|
|
||||||
- The Winner: Mamba & State Space Models (SSMs).
|
|
||||||
- Why: In traditional Transformers, VRAM usage scales quadratically with length
|
|
||||||
because the model must store a massive, continuously growing history of the
|
|
||||||
conversation (known as the KV-Cache) directly inside the GPU memory.
|
|
||||||
- The Difference: Mamba compresses conversational history into a fixed-size
|
|
||||||
mathematical "state." Whether you have been chatting with the model for 5
|
|
||||||
sentences or 50,000 sentences, the VRAM footprint remains entirely flat and
|
|
||||||
locked, drastically reducing hardware memory requirements during deployment.
|
|
||||||
[2, 3]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Direct Comparison Overview
|
|
||||||
|
|
||||||
| Architecture [2, 4, 5] | Power Efficiency | Data Efficiency | Compute (FLOPs) Needed | VRAM Footprint | Best Used For |
|
|
||||||
| ---------------------- | ---------------- | --------------- | ---------------------- | -------------- | ---------------------------------- |
|
|
||||||
| SNNs | 🏆 Lowest | Moderate | Low | Low | Battery-powered edge devices |
|
|
||||||
| Neuro-Symbolic | Moderate | 🏆 Lowest | Low | Low | Exact math, logic, and coding |
|
|
||||||
| CNNs | Low | High | 🏆 Lowest | Low | Embedded image/object detection |
|
|
||||||
| Mamba / SSMs | Moderate | High | Low | 🏆 Lowest | Long-context chatbots on local PCs |
|
|
||||||
| Transformers | Extreme High | Extreme High | Extreme High | Extreme High | Frontier LLMs and heavy cloud AI |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
If you want to move forward with a lightweight project, let me know:
|
|
||||||
|
|
||||||
- Do you want to see a Python sample of a lightweight CNN or Mamba block?
|
|
||||||
- Are you trying to deploy an AI model onto a specific microdevice (like a
|
|
||||||
Raspberry Pi or Arduino)?
|
|
||||||
|
|
||||||
[1] [https://arxiv.org](https://arxiv.org/html/2406.08413v1) [2]
|
|
||||||
[https://www.reddit.com](https://www.reddit.com/r/LocalLLaMA/comments/1qk1pzy/is_the_next_leap_in_ai_architectural_comparing/)
|
|
||||||
[3]
|
|
||||||
[https://virtualizationreview.com](https://virtualizationreview.com/articles/2026/01/27/what-gpu-do-you-really-need.aspx)
|
|
||||||
[4]
|
|
||||||
[https://medium.com](https://medium.com/@TDKVentures/breaking-the-compute-bottleneck-power-memory-and-the-new-ai-architecture-paradigms-eee264cec99a)
|
|
||||||
[5]
|
|
||||||
[https://www.socomec.us](https://www.socomec.us/en-us/solutions/business/data-centers/understanding-power-consumption-data-centers)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
Further exploration of SSNs reveals they are the most efficient in all
|
|
||||||
categories but they require specialized hardware, like the Akida™ PCIe Board
|
|
||||||
AKD1000 (or you can use their cloud service pretty cheaply during a trial)
|
|
||||||
|
|
||||||
If your ultimate goal is to build or run something that achieves frontier-class
|
|
||||||
reasoning while staying highly hardware-efficient, you must look toward a Hybrid
|
|
||||||
SSM (Mamba) + Transformer + MoE architecture. This gives you the static VRAM
|
|
||||||
footprint and linear scaling of an alternative model, backed by the proven
|
|
||||||
intelligence of standard attention loops.
|
|
||||||
@ -1,771 +0,0 @@
|
|||||||
# Agent Infra Extraction — Handoff Plan
|
|
||||||
|
|
||||||
**Status:** ✅ Complete through Phase 5. Remnant reduced to BFF-overlay only.
|
|
||||||
All phases executed and committed. See per-phase status below.
|
|
||||||
|
|
||||||
**Goal:** Move repo-agnostic agent infrastructure out of Remnant into
|
|
||||||
`~/dotfiles/.agents/` (existing dotfiles repo), wire it into each tool's
|
|
||||||
**global** config so every project inherits it automatically, and reduce
|
|
||||||
Remnant's footprint to a small project-specific overlay (BFF reminder, project
|
|
||||||
AGENTS.md). After this work, Remnant can get back to being a Remnant codebase
|
|
||||||
instead of an agent-infra lab.
|
|
||||||
|
|
||||||
**Forward-looking work** (MFE bootstrap, kanban unification, per-session tmp
|
|
||||||
capture, `project.config.js` extraction, llama-server module, MemPalace, eval
|
|
||||||
scaffolding, agentic-framework research) has moved to
|
|
||||||
[dotfiles-agent-infra-roadmap.md](./dotfiles-agent-infra-roadmap.md). This doc
|
|
||||||
now covers only the extraction itself and the post-extraction validation
|
|
||||||
findings.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Decisions (confirmed with user)
|
|
||||||
|
|
||||||
| Decision | Value |
|
|
||||||
| ------------------------------- | ----------------------------------------------------------------------------------------- |
|
|
||||||
| Shared infra location | `~/dotfiles/.agents/` (existing repo, matches user's dotfiles naming) |
|
|
||||||
| Sharing mechanism | Inherit via global tool config; verify global+project plugins/hooks coexist additively |
|
|
||||||
| MCP server name | Rename `remnant-agents` → `all-agents` (safe — only 4 string refs, no permission impacts) |
|
|
||||||
| Uncommitted files | Already committed as-is on `main` (Phase 1 done) |
|
|
||||||
| Research docs | Move to shared infra (general-purpose, useful to any project) |
|
|
||||||
| Modelfiles | Leave for now; address later |
|
|
||||||
| Global Copilot config | Yes — create `~/.vscode-server/data/User/prompts/` and add global MCP entry |
|
|
||||||
| Project-specific bits | Only Remnant's root `AGENTS.md` + the BFF/`apps/client/src/pages/` reminder |
|
|
||||||
| `agent-infrastructure.md` split | Lossless — ~95% to shared, thin pointer + Remnant tradeoffs stay |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What's shareable vs. project-specific
|
|
||||||
|
|
||||||
**Shareable (moves to `~/dotfiles/.agents/`):**
|
|
||||||
|
|
||||||
- `.agents/AGENTS.md` — agent-infra design principles
|
|
||||||
- `.agents/agents/*.md` — brainstorm, build, orchestrator, research
|
|
||||||
- `.agents/skills/research.md` — research methodology
|
|
||||||
- `.agents/hooks/*.sh` — all six hook scripts (pre/post-tool-use, session-start,
|
|
||||||
stop, pre-compact, user-prompt-submit) **except** the BFF reminder block in
|
|
||||||
`post-tool-use.sh`
|
|
||||||
- `.agents/mcp/index.ts` — MCP server (will be refactored to auto-discover
|
|
||||||
agents/skills from sibling dirs)
|
|
||||||
- `.agents/frameworks/opencode/plugin.ts` — OpenCode plugin harness
|
|
||||||
- `.agents/frameworks/github/hooks.json` — Copilot harness config
|
|
||||||
- `docs/research/*.md` (5 files) — ai-coding-best-practices,
|
|
||||||
human-llm-interpretation-overlap, intent-interpretation-action-plan,
|
|
||||||
llm-intent-interpretation, text-communication-interpretation
|
|
||||||
- `docs/explorations/text-intent-interpretation-research.md`
|
|
||||||
- `docs/ai_architectures.md`
|
|
||||||
- `docs/projects/agent-infrastructure.md` — almost entirely shared knowledge
|
|
||||||
(see "Lossless split" below)
|
|
||||||
- `docs/infra/LLAMA-SERVER-CUDA-WSL2.md` — general llama.cpp/CUDA setup notes
|
|
||||||
|
|
||||||
**Project-specific (stays in Remnant):**
|
|
||||||
|
|
||||||
- Root `AGENTS.md` (Remnant overview, package pointers, monorepo rules)
|
|
||||||
- BFF reminder + `apps/client/src/pages/` path checks (currently embedded in
|
|
||||||
`post-tool-use.sh`)
|
|
||||||
- Nested `AGENTS.md` files in `apps/`, `packages/`
|
|
||||||
- `verification.md`, `docs/TODO.md`, `docs/projects/*` (other than the
|
|
||||||
agent-infrastructure split-off)
|
|
||||||
- The two `.modelfile` files — leave in `.agents/` with a `MODELFILES.md` note
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Verification gates (Phase 0 — COMPLETE)
|
|
||||||
|
|
||||||
1. ✅ **OpenCode plugin coexistence** — additive; all hooks run in sequence.
|
|
||||||
Global dir: `~/.config/opencode/plugins/` (not `~/.opencode/plugins/`).
|
|
||||||
|
|
||||||
2. ✅ **OpenCode MCP merge** — configs merge (not replace). Global `mcp` entries
|
|
||||||
- project `mcp` entries both load; project-level keys win on conflicts.
|
|
||||||
|
|
||||||
3. ✅ **Copilot global hook support** — EXISTS. User-level hooks dir:
|
|
||||||
`~/.copilot/hooks/` (macOS/Linux) per
|
|
||||||
[GitHub Copilot hooks reference](https://docs.github.com/en/copilot/reference/hooks-reference).
|
|
||||||
Load order is additive: repo `.github/hooks/*.json` → user
|
|
||||||
`~/.copilot/hooks/*.json` → repo `settings.json` inline → user
|
|
||||||
`~/.copilot/settings.json` inline → plugins. Symlink
|
|
||||||
`~/.copilot/hooks/agent-support.json` → dotfiles hooks.json = global
|
|
||||||
coverage. No per-project stub needed. _(Initial finding was wrong — VS Code
|
|
||||||
docs don't cover Copilot's own config surface; always check docs.github.com
|
|
||||||
first.)_
|
|
||||||
|
|
||||||
4. ✅ **VS Code global MCP** — `~/.vscode-server/data/User/mcp.json` (create via
|
|
||||||
`MCP: Open Remote User Configuration` command or directly).
|
|
||||||
|
|
||||||
5. ✅ **OpenCode hook overlay** — BFF reminder ships as a separate project-local
|
|
||||||
plugin file. No merged copy of `post-tool-use.sh` needed.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Target layout
|
|
||||||
|
|
||||||
```
|
|
||||||
~/dotfiles/.agents/ ← canonical shared infra
|
|
||||||
├── AGENTS.md ← from remnant/.agents/AGENTS.md
|
|
||||||
│ + "Research Discipline" section
|
|
||||||
│ for global lessons/practices
|
|
||||||
│ (framework-agnostic: Copilot,
|
|
||||||
│ OpenCode, Claude Code all load
|
|
||||||
│ AGENTS.md natively — no
|
|
||||||
│ tool-specific config needed)
|
|
||||||
├── INSTALL-NOTES.md ← Phase 0 findings
|
|
||||||
├── install.sh ← one-time setup script (idempotent)
|
|
||||||
├── agents/
|
|
||||||
│ ├── brainstorm.md
|
|
||||||
│ ├── build.md
|
|
||||||
│ ├── orchestrator.md
|
|
||||||
│ └── research.md
|
|
||||||
├── skills/
|
|
||||||
│ └── research.md
|
|
||||||
├── hooks/
|
|
||||||
│ ├── pre-tool-use.sh
|
|
||||||
│ ├── post-tool-use.sh ← BFF block removed
|
|
||||||
│ ├── session-start.sh
|
|
||||||
│ ├── stop.sh
|
|
||||||
│ ├── pre-compact.sh
|
|
||||||
│ └── user-prompt-submit.sh
|
|
||||||
├── frameworks/
|
|
||||||
│ ├── opencode/plugin.ts
|
|
||||||
│ └── github/hooks.json
|
|
||||||
├── mcp/
|
|
||||||
│ └── index.ts ← auto-discovers agents/skills/
|
|
||||||
└── docs/
|
|
||||||
├── agent-infrastructure.md ← the moved 855-line doc
|
|
||||||
├── ai-coding-best-practices.md ← from docs/research/
|
|
||||||
├── ai_architectures.md
|
|
||||||
├── human-llm-interpretation-overlap.md
|
|
||||||
├── intent-interpretation-action-plan.md
|
|
||||||
├── llm-intent-interpretation.md
|
|
||||||
├── text-communication-interpretation.md
|
|
||||||
├── text-intent-interpretation-research.md
|
|
||||||
└── llama-server-cuda-wsl2.md
|
|
||||||
|
|
||||||
Global wiring (created/modified by install.sh):
|
|
||||||
~/.config/opencode/opencode.json ← merge MCP entry
|
|
||||||
~/.config/opencode/AGENTS.md ← symlink → dotfiles AGENTS.md (OpenCode global rules)
|
|
||||||
~/.config/opencode/plugins/agent-support.ts ← symlink → dotfiles plugin
|
|
||||||
~/.config/opencode/agents/ ← symlinks → dotfiles agents/*.md (added in post-Phase-4 fix)
|
|
||||||
~/.copilot/hooks/agent-support.json ← generated by install.sh with absolute dotfiles paths (not a symlink)
|
|
||||||
~/.vscode-server/data/User/prompts/ ← create dir (currently missing)
|
|
||||||
~/.vscode-server/data/User/mcp.json ← global VS Code MCP registration
|
|
||||||
|
|
||||||
Remnant (post-extraction, actual):
|
|
||||||
remnant/
|
|
||||||
├── AGENTS.md ← unchanged
|
|
||||||
├── .agents/
|
|
||||||
│ ├── README.md ← "shared infra: ~/dotfiles/.agents"
|
|
||||||
│ ├── hooks/
|
|
||||||
│ │ └── post-tool-use-remnant.sh ← BFF reminder only
|
|
||||||
│ ├── omnicoder.modelfile ← archived
|
|
||||||
│ └── omnicoder2.modelfile ← archived
|
|
||||||
│ ⚠️ MODELFILES.md not created (planned but skipped)
|
|
||||||
├── .github/hooks/agent-support.json ← gitignored; BFF PostToolUse only
|
|
||||||
├── .vscode/mcp.json ← exa only (remnant-agents removed)
|
|
||||||
└── opencode.json ← mcp.remnant-agents removed;
|
|
||||||
permission overrides retained
|
|
||||||
|
|
||||||
Note: .opencode/ was gitignored; deleted from filesystem (agents now global).
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Phases
|
|
||||||
|
|
||||||
### Phase 0 — Verify coexistence ✅ DONE
|
|
||||||
|
|
||||||
Resolved all five gates. `INSTALL-NOTES.md` not produced (findings inline
|
|
||||||
above).
|
|
||||||
|
|
||||||
### Phase 1 — Checkpoint Remnant ✅ DONE
|
|
||||||
|
|
||||||
Already committed on `main`.
|
|
||||||
|
|
||||||
### Phase 2 — Populate `~/dotfiles/.agents/` ✅ DONE
|
|
||||||
|
|
||||||
1. Copy (not move) shareable files from `remnant/.agents/` into
|
|
||||||
`~/dotfiles/.agents/`. Add a **"Research Discipline" section** to
|
|
||||||
`~/dotfiles/.agents/AGENTS.md` for cross-tool meta-guidance (e.g. check
|
|
||||||
docs.github.com first for Copilot configuration questions). This is the
|
|
||||||
canonical home for global lessons — AGENTS.md is natively loaded by Copilot,
|
|
||||||
OpenCode, and Claude Code. Never use tool-specific mechanisms (OpenCode
|
|
||||||
`instructions:` config, VS Code `.instructions.md` files) for guidance that
|
|
||||||
belongs in AGENTS.md.
|
|
||||||
2. Copy `docs/research/*.md` (5 files),
|
|
||||||
`docs/explorations/text-intent-interpretation-research.md`,
|
|
||||||
`docs/ai_architectures.md`, `docs/infra/LLAMA-SERVER-CUDA-WSL2.md` into
|
|
||||||
`~/dotfiles/.agents/docs/`.
|
|
||||||
3. Split `docs/projects/agent-infrastructure.md` (lossless):
|
|
||||||
- **Moves to `~/dotfiles/.agents/docs/agent-infrastructure.md`:** the entire
|
|
||||||
current doc minus the items below. This includes hook architecture, model
|
|
||||||
scale profiles, MCP protocol status, OpenCode verified facts, the testing
|
|
||||||
plan, open issues — all general infra knowledge.
|
|
||||||
- **Stays in `remnant/docs/projects/agent-infrastructure.md`** (rewritten to
|
|
||||||
a thin pointer):
|
|
||||||
- Reference link to the shared doc
|
|
||||||
- Remnant-specific "Known Tradeoffs" row: "Instructions glob trimmed to
|
|
||||||
root `AGENTS.md` only" + the `api/`/`client/`/`core/` mitigation
|
|
||||||
- Mention of BFF reminder hook and its Remnant scope
|
|
||||||
- Any items currently open that have Remnant-specific test cases (e.g. item
|
|
||||||
31 mentions `apps/api/package.json` paths — generalize for shared doc;
|
|
||||||
keep concrete Remnant examples as a Remnant section)
|
|
||||||
4. Refactor `mcp/index.ts`: auto-discover `agents/*.md` and `skills/*.md`
|
|
||||||
relative to the script location, instead of a hand-maintained registry.
|
|
||||||
Removes a friction point when adding new agents/skills.
|
|
||||||
5. Rename MCP server `remnant-agents` → `all-agents` in `mcp/index.ts`.
|
|
||||||
6. Refactor `hooks/post-tool-use.sh`: remove the BFF + `apps/client/src/pages/`
|
|
||||||
block. Document the extension point (comment: "project-local additions live
|
|
||||||
in a sibling hook file or repo-local override").
|
|
||||||
7. Write `install.sh`:
|
|
||||||
- Detects existing global config (idempotent re-run safe).
|
|
||||||
- Creates missing dirs (`~/.vscode-server/data/User/prompts/`,
|
|
||||||
`~/.copilot/hooks/`, `~/.config/opencode/plugins/`).
|
|
||||||
- Symlinks plugin into `~/.config/opencode/plugins/agent-support.ts`.
|
|
||||||
- Generates `~/.copilot/hooks/agent-support.json` with absolute paths to
|
|
||||||
`~/dotfiles/.agents/hooks/*.sh` (not a symlink — avoids needing per-project
|
|
||||||
hook stubs for relative-path resolution).
|
|
||||||
- Merges `all-agents` MCP entry into `~/.config/opencode/opencode.json` via
|
|
||||||
`jq`.
|
|
||||||
- Writes `~/.vscode-server/data/User/mcp.json` with the `all-agents` MCP
|
|
||||||
entry.
|
|
||||||
8. Commit to dotfiles repo. (Push wherever; local-only is fine.)
|
|
||||||
|
|
||||||
**Divergences from plan:** `jq` replaced with `node` (not universally
|
|
||||||
available); `install.sh` step 1 generates Copilot hooks JSON with absolute paths
|
|
||||||
(not a symlink) to avoid per-project relative-path resolution issues. Step 3
|
|
||||||
added post-Phase-4 to wire `~/.config/opencode/agents/`.
|
|
||||||
|
|
||||||
### Phase 3 — Run `install.sh` ✅ DONE
|
|
||||||
|
|
||||||
- Symlinks and generated files verified.
|
|
||||||
- Smoke tests passed: `RESEARCH_PROMPT: OK`, `HOOK_BLOCK: OK`.
|
|
||||||
- Bug found and fixed: OpenCode uses tool name `bash` (not `run_in_terminal`);
|
|
||||||
`pre-tool-use.sh` case statement updated in both repos.
|
|
||||||
|
|
||||||
### Phase 4 — Strip Remnant ✅ DONE
|
|
||||||
|
|
||||||
1. ✅ Deleted `agents/`, `skills/`, `frameworks/`, `mcp/`, `AGENTS.md` from
|
|
||||||
`.agents/`
|
|
||||||
2. ✅ `.agents/hooks/` reduced to `post-tool-use-remnant.sh` only
|
|
||||||
3. ⚠️ `MODELFILES.md` stub not created (skipped — low value)
|
|
||||||
4. ✅ `.vscode/mcp.json`: `remnant-agents` dropped, `exa` retained
|
|
||||||
5. ✅ `opencode.json`: `mcp.remnant-agents` removed, permission overrides kept
|
|
||||||
6. ✅ `AGENTS.md` updated to reference `~/dotfiles/.agents/AGENTS.md`
|
|
||||||
7. ✅ Docs deleted from `remnant/docs/` (research/, ai_architectures.md, etc.)
|
|
||||||
8. ✅ `agent-infrastructure.md` rewritten as thin pointer
|
|
||||||
9. ✅ `.agents/README.md` added
|
|
||||||
10. ✅ Committed (`daf53a3`, `8a61128`)
|
|
||||||
|
|
||||||
Post-phase fix: `.opencode/` had dead symlinks (pointed to deleted
|
|
||||||
`.agents/frameworks/` and `.agents/agents/`). Was gitignored so not in git
|
|
||||||
history. Fixed by wiring agents globally via `install.sh` step 3
|
|
||||||
(`~/.config/opencode/agents/`), then deleting `.opencode/` from the filesystem.
|
|
||||||
|
|
||||||
### Phase 5 — Verify Remnant still works ✅ DONE (automated checks)
|
|
||||||
|
|
||||||
- ✅ `npm run build:strict` passes (2 scripts ran, 15 skipped via wireit cache)
|
|
||||||
- ✅ All 6 shared hook scripts pass `bash -n` syntax check
|
|
||||||
- ✅ `post-tool-use-remnant.sh` passes `bash -n`
|
|
||||||
- ✅ `~/.config/opencode/agents/` wired with 4 symlinks → dotfiles
|
|
||||||
- ✅ `~/.copilot/hooks/agent-support.json` present (generated, absolute paths)
|
|
||||||
- ✅ Remnant `.agents/` contains only: README.md, hooks/, omnicoder\*.modelfile
|
|
||||||
- ⏳ Live session checks (require manual restart): `/research` etc. slash
|
|
||||||
commands, hook block in live session, BFF reminder injection, VS Code MCP
|
|
||||||
`all-agents` connect
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Notes (post-execution)
|
|
||||||
|
|
||||||
- All rename touch points done: `remnant-agents` → `all-agents` in mcp/index.ts,
|
|
||||||
opencode.json, .vscode/mcp.json, AGENTS.md.
|
|
||||||
- `<PostToolUse-context>` block working as designed — injected to model only,
|
|
||||||
not shown in chat transcript (see `post-tool-use.sh` line ~137).
|
|
||||||
- Global Copilot hook mechanism confirmed: `~/.copilot/hooks/` exists and is
|
|
||||||
additive with repo hooks. No per-project stubs needed when paths are absolute.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Out of scope (do later)
|
|
||||||
|
|
||||||
- Salvaging `omnicoder*.modelfile` content into shared system-prompt references
|
|
||||||
— user chose "leave for now."
|
|
||||||
- Publishing dotfiles as a public agent-infra repo / npm package.
|
|
||||||
- Refactoring hooks to be platform-agnostic (item 22 in the migrated
|
|
||||||
`agent-infrastructure.md`) — track in the shared repo after extraction.
|
|
||||||
- **Make `.agents/` TypeScript files conform to Remnant's ESLint rules** — the
|
|
||||||
`additionalIgnores` bypass added in Phase 2 is a shortcut, not a solution.
|
|
||||||
`.agents/mcp/index.ts` and `.agents/frameworks/opencode/plugin.ts` use
|
|
||||||
`import.meta.url` directly (blocked by `no-restricted-syntax`) and have minor
|
|
||||||
unused-var patterns. Options: (a) replace `import.meta.url` usages with the
|
|
||||||
approved `findNearestPackageRoot` / `new URL('./sibling', import.meta.url)`
|
|
||||||
patterns where valid, (b) introduce a per-file exception comment for the
|
|
||||||
genuinely exceptional cases (e.g. portable hook resolution in a symlinked
|
|
||||||
global plugin), (c) move all `.agents/` TS into a proper subpackage with its
|
|
||||||
own `tsconfig.json` and relaxed rules. Remove `.agents/**` from
|
|
||||||
`additionalIgnores` once resolved.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Rollback
|
|
||||||
|
|
||||||
Single revert: each phase is a separate commit. Phase 4 (strip Remnant) is the
|
|
||||||
only destructive one, and Phase 2's copies survive. Worst case:
|
|
||||||
`git revert <phase-4-commit>` restores Remnant, dotfiles copies stay.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## WIP: AGENTS.md context survival after compaction
|
|
||||||
|
|
||||||
> **Status**: problem noted; solution not designed. Break out into a separate
|
|
||||||
> project doc when ready to act on it.
|
|
||||||
|
|
||||||
### The problem
|
|
||||||
|
|
||||||
`AGENTS.md` loading is a session-start event. Once loaded, the content sits in
|
|
||||||
the context window as a regular document — it does not re-inject. After
|
|
||||||
compaction/summarization, the summary may preserve high-level framing but can
|
|
||||||
silently drop specific rules, enforcement hierarchy details, or lessons added
|
|
||||||
mid-session. The "Lost in the Middle" effect applies even before compaction:
|
|
||||||
guidance in the middle of a long context receives less model attention than
|
|
||||||
content at the tail (hooks inject at the tail specifically to counter this).
|
|
||||||
|
|
||||||
The `.agents/AGENTS.md` enforcement hierarchy already acknowledges this: _"Root
|
|
||||||
AGENTS.md sections: Context-start only. Subject to 'lost in the middle.'"_ The
|
|
||||||
user confirmed this happened: `.agents/AGENTS.md` was read before compaction
|
|
||||||
this session, but its content was not reliably carried through.
|
|
||||||
|
|
||||||
### What the research says (verified + falsified + re-corrected May 2026)
|
|
||||||
|
|
||||||
**VS Code Copilot** — correction was itself over-corrected. Final answer:
|
|
||||||
|
|
||||||
VS Code docs group `copilot-instructions.md`, `AGENTS.md`, and `CLAUDE.md` as
|
|
||||||
**"always-on instructions"** injected per-request — but this only applies to
|
|
||||||
files **at the workspace root**. The docs explicitly note: _"Support of
|
|
||||||
`AGENTS.md` files outside of the workspace root is currently turned off by
|
|
||||||
default."_
|
|
||||||
|
|
||||||
**This session is direct evidence.** `.agents/AGENTS.md` is a subdirectory file,
|
|
||||||
not the workspace-root AGENTS.md. It was `read_file`'d during this session and
|
|
||||||
entered the context as a regular document. After compaction the summary dropped
|
|
||||||
the specific content — enforcement hierarchy, forbidden patterns.
|
|
||||||
Post-compaction, the Copilot model then proposed `.instructions.md` files and
|
|
||||||
OpenCode `instructions:` config — exactly the approaches the forbidden patterns
|
|
||||||
section bans — because that guidance was no longer in the effective context.
|
|
||||||
|
|
||||||
Root-level `AGENTS.md` (workspace root) = always-on, survives compaction.\
|
|
||||||
Nested `AGENTS.md` in subdirectories = **not** always-on, read once on explicit
|
|
||||||
`read_file`, **lost on compaction**.\
|
|
||||||
**The problem is real for both tools for any AGENTS.md that isn't the workspace
|
|
||||||
root file.** This repo's enforcement lives in `.agents/AGENTS.md`, not the
|
|
||||||
workspace root — which means it is compaction-vulnerable in VS Code Copilot too.
|
|
||||||
|
|
||||||
**OpenCode** (opencode.ai/docs/rules + config):
|
|
||||||
|
|
||||||
- AGENTS.md loaded at session start via directory traversal + global
|
|
||||||
`~/.config/opencode/AGENTS.md`. No re-injection after compaction is
|
|
||||||
documented. The `compaction` agent is a hidden system agent; its behavior
|
|
||||||
after summarizing context is not specified. There is no `/docs/compaction`
|
|
||||||
page — no public spec exists for what happens to AGENTS.md content in the
|
|
||||||
compacted summary.
|
|
||||||
- Whether OpenCode re-injects even the root AGENTS.md after compaction is
|
|
||||||
unknown. Needs live testing.
|
|
||||||
|
|
||||||
**Summary of the asymmetry:**
|
|
||||||
|
|
||||||
| File | Copilot VS Code | OpenCode |
|
|
||||||
| --------------------------------- | ---------------------------- | ------------------------------------- |
|
|
||||||
| Root `AGENTS.md` (workspace root) | always-on per-request ✅ | session-start only ⚠️ |
|
|
||||||
| Nested `AGENTS.md` (subdirectory) | off by default, read-once ⚠️ | session-start traversal, read-once ⚠️ |
|
|
||||||
| Both after compaction | root survives; nested lost | unknown (undocumented) |
|
|
||||||
|
|
||||||
**Key implication for this repo:** the enforcement hierarchy and forbidden
|
|
||||||
patterns live in `.agents/AGENTS.md`, not the workspace-root AGENTS.md. That
|
|
||||||
makes them compaction-vulnerable in VS Code Copilot. None of the candidate
|
|
||||||
mitigations below have been evaluated yet — this problem is unsolved.
|
|
||||||
|
|
||||||
**Instruction files vs AGENTS.md (revised)**:
|
|
||||||
|
|
||||||
- VS Code Copilot: root AGENTS.md and root `copilot-instructions.md` are both
|
|
||||||
always-on per-request — equivalent. The ban on `.instructions.md` files is
|
|
||||||
about _path-scoping_ being non-portable, not injection frequency.
|
|
||||||
- OpenCode: `instructions:` config field is session-start — same vulnerability
|
|
||||||
as nested AGENTS.md in OpenCode.
|
|
||||||
|
|
||||||
### Open questions (narrowed after falsification)
|
|
||||||
|
|
||||||
- Does OpenCode re-inject root AGENTS.md after compaction, or is it also lost?
|
|
||||||
(Needs live testing — not documented.)
|
|
||||||
- Does OpenCode's `instructions:` config field content survive in the compacted
|
|
||||||
summary, or is it lost by the same mechanism?
|
|
||||||
- Does Claude Code (invoked directly, not via VS Code) have per-request
|
|
||||||
injection for root AGENTS.md like VS Code Copilot?
|
|
||||||
|
|
||||||
### Candidate mitigations (not yet chosen)
|
|
||||||
|
|
||||||
1. **Extend `pre-compact.sh`**: Before summarization fires, scan the current
|
|
||||||
context for `read_file` calls on `AGENTS.md` paths and emit their content
|
|
||||||
into the compaction context so the summary captures them explicitly.
|
|
||||||
|
|
||||||
2. **Session-start hook re-read**: If `session-start.sh` can detect it is
|
|
||||||
running post-compaction (e.g. a state file exists from a prior
|
|
||||||
`pre-compact.sh` run), re-inject the full root `AGENTS.md` content
|
|
||||||
immediately.
|
|
||||||
|
|
||||||
3. **PostToolUse periodic re-injection**: The current `post-tool-use.sh`
|
|
||||||
self-check fires every 15 tool calls. A similar counter could re-inject a
|
|
||||||
condensed version of critical AGENTS.md sections (enforcement hierarchy,
|
|
||||||
forbidden patterns) at the same cadence.
|
|
||||||
|
|
||||||
4. **Track and replay**: Maintain a list of AGENTS.md files read this session
|
|
||||||
(via PostToolUse file-path check). On `pre-compact.sh`, emit the paths as a
|
|
||||||
"re-read these after compaction" instruction so the post-compaction agent
|
|
||||||
gets them back.
|
|
||||||
|
|
||||||
5. **Stop relying solely on AGENTS.md for critical rules**: Move critical,
|
|
||||||
never-forget rules out of AGENTS.md into PreToolUse hard blocks or
|
|
||||||
PostToolUse reminders. Reserve AGENTS.md for architecture/rationale that is
|
|
||||||
worth losing under compaction. This is partly already the design intent —
|
|
||||||
this is a reminder to be strict about it.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Post-Extraction Validation (May 23, 2026)
|
|
||||||
|
|
||||||
Validation pass over the extraction work. **No code changes made** — findings
|
|
||||||
and recommendations only.
|
|
||||||
|
|
||||||
### ✅ Verified working
|
|
||||||
|
|
||||||
**Dotfiles `~/dotfiles/.agents/` payload is complete:**
|
|
||||||
|
|
||||||
- `AGENTS.md` (289 lines) ✅
|
|
||||||
- `agents/` — `AGENTS.md`, `brainstorm.md`, `build.md`, `orchestrator.md`,
|
|
||||||
`research.md` ✅
|
|
||||||
- `skills/research.md` ✅
|
|
||||||
- `hooks/` — all six shared hooks (`pre-tool-use`, `post-tool-use`,
|
|
||||||
`session-start`, `stop`, `pre-compact`, `user-prompt-submit`) ✅
|
|
||||||
- `mcp/index.ts` + `package.json` + `package-lock.json` ✅
|
|
||||||
- `frameworks/opencode/plugin.ts` (319 lines, with the Jinja-safe `chat.message`
|
|
||||||
injection) ✅
|
|
||||||
- `frameworks/github/hooks.json` (full six-hook registration) ✅
|
|
||||||
- `docs/` — all nine moved docs present (`agent-infrastructure.md`,
|
|
||||||
`ai-coding-best-practices.md`, `ai_architectures.md`,
|
|
||||||
`human-llm-interpretation-overlap.md`, `intent-interpretation-action-plan.md`,
|
|
||||||
`llm-intent-interpretation.md`, `text-communication-interpretation.md`,
|
|
||||||
`text-intent-interpretation-research.md`, `llama-server-cuda-wsl2.md`) ✅
|
|
||||||
- `install.sh` — generates Copilot global hooks JSON with absolute paths,
|
|
||||||
symlinks OpenCode plugin + agents + global `AGENTS.md`, merges OpenCode and VS
|
|
||||||
Code MCP entries, installs MCP server deps ✅
|
|
||||||
|
|
||||||
**Global wiring on this machine is live:**
|
|
||||||
|
|
||||||
- `~/.copilot/hooks/agent-support.json` — generated, absolute paths ✅
|
|
||||||
- `~/.config/opencode/AGENTS.md` → `~/dotfiles/.agents/AGENTS.md` ✅
|
|
||||||
- `~/.config/opencode/plugins/agent-support.ts` →
|
|
||||||
`~/dotfiles/.agents/frameworks/opencode/plugin.ts` ✅
|
|
||||||
- `~/.config/opencode/agents/{brainstorm,build,orchestrator,research}.md`
|
|
||||||
symlinks ✅
|
|
||||||
- `~/.config/opencode/opencode.json` — has `all-agents` MCP entry ✅
|
|
||||||
- `~/.vscode-server/data/User/mcp.json` — has both `all-agents` and `exa` ✅
|
|
||||||
- `~/.vscode-server/data/User/prompts/` — exists (empty) ✅
|
|
||||||
|
|
||||||
**Remnant overlay is correctly scoped:**
|
|
||||||
|
|
||||||
- `.agents/AGENTS.md` (Remnant-specific) ✅
|
|
||||||
- `.agents/README.md` ✅
|
|
||||||
- `.agents/hooks/post-tool-use-remnant.sh` (BFF only) ✅
|
|
||||||
- `.agents/frameworks/github/{AGENTS.md, hooks.json}` — project Copilot hook
|
|
||||||
registration ✅
|
|
||||||
- `.agents/frameworks/opencode/{AGENTS.md, hooks.ts}` — project OpenCode plugin
|
|
||||||
✅
|
|
||||||
- `.github/hooks/hooks.json` → `../../.agents/frameworks/github/hooks.json` ✅
|
|
||||||
- `.opencode/plugins/hooks.ts` → `../../.agents/frameworks/opencode/hooks.ts` ✅
|
|
||||||
- `.opencode/AGENTS.md` warning file ✅
|
|
||||||
|
|
||||||
### ⚠️ Gaps and bugs in dotfiles (pre-push)
|
|
||||||
|
|
||||||
These should be fixed before squashing/pushing the dotfiles commits.
|
|
||||||
|
|
||||||
1. **`~/dotfiles/.agents/AGENTS.md` references stale paths from the
|
|
||||||
pre-extraction layout.** Three places reference `.agents/github/` and
|
|
||||||
`.agents/opencode/` but the canonical paths are now
|
|
||||||
`.agents/frameworks/github/` and `.agents/frameworks/opencode/`:
|
|
||||||
- "The Copilot harness (`.agents/github/hooks.json`) and OpenCode plugin
|
|
||||||
(`.agents/opencode/plugin.ts`) both delegate…" (Hook Files section)
|
|
||||||
- "`.agents/opencode/plugin.ts` — OpenCode plugin harness (canonical)"
|
|
||||||
(Tool-Specific Entry Points section)
|
|
||||||
- "`.agents/github/hooks.json` — Copilot harness config (canonical)" (same
|
|
||||||
section)
|
|
||||||
- Also: the surrounding sentences claim symlinks point from
|
|
||||||
`.github/hooks/agent-support.json` and `.opencode/plugins/agent-support.ts`
|
|
||||||
"those directories are gitignored." In dotfiles this is wrong on two
|
|
||||||
counts: (a) global wiring uses `~/.copilot/hooks/agent-support.json` and
|
|
||||||
`~/.config/opencode/plugins/agent-support.ts`, (b) at Remnant the project
|
|
||||||
symlink files are named `hooks.json` and `hooks.ts`, not `agent-support.*`.
|
|
||||||
The doc was written for the pre-split layout and never updated.
|
|
||||||
|
|
||||||
2. **`~/dotfiles/.agents/AGENTS.md` links into `../docs/research/...` —
|
|
||||||
Remnant-relative paths that don't resolve in dotfiles.** Two link targets:
|
|
||||||
- `[docs/research/intent-interpretation-action-plan.md](../docs/research/intent-interpretation-action-plan.md)`
|
|
||||||
- `[docs/research/ai-coding-best-practices.md](../docs/research/ai-coding-best-practices.md)`
|
|
||||||
Should be `./docs/intent-interpretation-action-plan.md` and
|
|
||||||
`./docs/ai-coding-best-practices.md` (the docs moved into `.agents/docs/`,
|
|
||||||
not `docs/research/`).
|
|
||||||
|
|
||||||
3. **No "Research Discipline" section** in `~/dotfiles/.agents/AGENTS.md`. Plan
|
|
||||||
Phase 2 step 1 specifically called for adding one (replacing the Copilot-only
|
|
||||||
memory at `~/memories/research-discipline.md`). The Copilot memory still
|
|
||||||
exists as a stopgap because the dotfiles AGENTS.md doesn't carry the
|
|
||||||
equivalent guidance.
|
|
||||||
|
|
||||||
4. **`frameworks/github/AGENTS.md` and `frameworks/opencode/AGENTS.md` are
|
|
||||||
missing from dotfiles.** Remnant added rich, generic API-facts AGENTS.md
|
|
||||||
files for each framework dir (62ee78c) — the content is not Remnant-specific
|
|
||||||
(verified VS Code hooks output formats, OpenCode plugin API facts, Jinja
|
|
||||||
constraint, overconfidence warnings). These belong in dotfiles alongside the
|
|
||||||
framework configs; right now an agent editing the global
|
|
||||||
`frameworks/opencode/plugin.ts` won't see them.
|
|
||||||
|
|
||||||
5. **`install.sh` location.** Currently `~/dotfiles/.agents/install.sh`.
|
|
||||||
Recommendation: move to `~/dotfiles/install.sh` so the dotfiles repo has a
|
|
||||||
discoverable bootstrap entry point (and to leave room for installing other
|
|
||||||
dotfiles content beyond `.agents/`). The script uses
|
|
||||||
`DOTFILES_AGENTS="$(cd "$(dirname "$0")" && pwd)"` — moving it requires
|
|
||||||
changing that one line to e.g.
|
|
||||||
`DOTFILES_AGENTS="$(cd "$(dirname "$0")" && pwd)/.agents"`. No other path
|
|
||||||
math in the script needs to change.
|
|
||||||
|
|
||||||
6. **`install.sh` does not symlink anything into `~/.copilot/` beyond
|
|
||||||
`hooks/`.** Copilot also supports user-level inline settings at
|
|
||||||
`~/.copilot/settings.json`. Not required, just noting it's a future extension
|
|
||||||
point if more global Copilot config becomes shareable.
|
|
||||||
|
|
||||||
7. **`install.sh` doesn't create the `~/.vscode-server/data/User/prompts/` dir
|
|
||||||
as part of the run on this machine — directory exists but is empty.**
|
|
||||||
Confirmed step 6 ran (`mkdir -p`). Working as intended; the dir is the
|
|
||||||
surface for VS Code prompt files but none have been authored yet. No action
|
|
||||||
needed unless we plan to ship `.prompt.md` files from dotfiles.
|
|
||||||
|
|
||||||
8. **`install.sh` has no uninstall counterpart.** Low-priority. Useful if we
|
|
||||||
start moving the script around and want clean state for testing.
|
|
||||||
|
|
||||||
9. **Exa MCP has an undocumented rate limit; agents fan out parallel
|
|
||||||
`mcp_exa_web_search_exa` calls and hit it.** Observed May 23, 2026: 8
|
|
||||||
parallel searches in one turn → all cancelled. Two complementary fixes, both
|
|
||||||
in dotfiles:
|
|
||||||
- **PostToolUse nudge** in `~/dotfiles/.agents/hooks/post-tool-use.sh`: after
|
|
||||||
any `mcp_exa_*` call, inject a reminder ("Exa rate-limits parallel calls —
|
|
||||||
issue web searches serially, max ~2 per turn") so the model learns the
|
|
||||||
pattern without a hard block.
|
|
||||||
- **`AGENTS.md` entry** under a new "External service quirks" section listing
|
|
||||||
per-service constraints (Exa rate limit, GitHub API limits when
|
|
||||||
`mcp_github_*` lands, etc.). Loaded at session start so the model has it
|
|
||||||
before issuing the first call.
|
|
||||||
- Optional PreToolUse soft-warn: count `mcp_exa_*` calls per turn via a
|
|
||||||
`/tmp/.exa-turn-count` file (reset on `user-prompt-submit`); warn (don't
|
|
||||||
deny) past N=2.
|
|
||||||
|
|
||||||
### 🧹 Commit-history cleanup recommendations
|
|
||||||
|
|
||||||
Sonnet committed in tiny increments. Both repos have a series of unpushed
|
|
||||||
"fix(install)/fix(plugin)/fix(hooks)" commits that should be squashed before
|
|
||||||
publishing.
|
|
||||||
|
|
||||||
**`~/dotfiles`** — 10 unpushed commits on `main` past `4a44460 (origin/main)`.
|
|
||||||
Suggested single squashed commit:
|
|
||||||
|
|
||||||
```
|
|
||||||
feat(.agents): shared agent infrastructure + install.sh
|
|
||||||
|
|
||||||
- Hooks, agents, skills, MCP server, OpenCode plugin, Copilot hook config
|
|
||||||
- install.sh wires global Copilot hooks (absolute paths), OpenCode plugin
|
|
||||||
+ agents + AGENTS.md (symlinks), MCP entries for OpenCode and VS Code
|
|
||||||
- See .agents/docs/agent-infrastructure.md for design rationale
|
|
||||||
```
|
|
||||||
|
|
||||||
Constituent commits to fold in:
|
|
||||||
`6b07e4c 690178d 88435d6 f4017ab 5c12257 f0d21e9 2949981 3738732 9544b4e 14c132a`.
|
|
||||||
|
|
||||||
Suggested workflow: `git reset --soft 4a44460 && git commit -m '…'` (or
|
|
||||||
interactive rebase with `s` on every commit after the first). Address items 1–4
|
|
||||||
above first so the squash captures clean state.
|
|
||||||
|
|
||||||
**`~/code/remnant`** — many unpushed commits past `0d0a3a8 (origin/main)`; the
|
|
||||||
agent-infra-related ones form a contiguous block from `2d58147` through
|
|
||||||
`78c8449`. Suggested squash boundary:
|
|
||||||
|
|
||||||
- Keep `2d58147` as the first commit of the block, or replace it with a new
|
|
||||||
"feat: extract shared agent infra to ~/dotfiles/.agents" message that covers
|
|
||||||
the full final state.
|
|
||||||
- Fold in:
|
|
||||||
`5a7d220 c41c142 daf53a3 8a61128 2b0ea1e e9f3529 9191a44 fc2a944 62ee78c dc3ec9c 78c8449`.
|
|
||||||
|
|
||||||
The non-agent-infra commits before `2d58147` (the older "chore: more agentic
|
|
||||||
coding updates …" block) are pre-extraction and can be left as-is or squashed
|
|
||||||
separately depending on taste.
|
|
||||||
|
|
||||||
### 📋 Pending work that's still extraction-scoped
|
|
||||||
|
|
||||||
- `MODELFILES.md` stub (Phase 4 item 3) — explicitly skipped; consider whether
|
|
||||||
the two `omnicoder*.modelfile` files in Remnant should be moved to
|
|
||||||
`~/dotfiles/.agents/modelfiles/` and dropped from Remnant entirely. They
|
|
||||||
aren't Remnant-specific.
|
|
||||||
- `.agents/` TypeScript ESLint conformance (Out-of-scope list, item 4) — still
|
|
||||||
tracked; no movement.
|
|
||||||
- Item 22 in `agent-infrastructure.md` (platform-agnostic hook scripts) —
|
|
||||||
unchanged.
|
|
||||||
- Live-session smoke tests from Phase 5 (slash commands, BFF reminder injection,
|
|
||||||
VS Code MCP `all-agents` connect) — still marked ⏳. Should be retired or
|
|
||||||
confirmed after the next session restart.
|
|
||||||
|
|
||||||
### 🚀 Starting a new project on the extracted infra (MFE)
|
|
||||||
|
|
||||||
Moved to [dotfiles-agent-infra-roadmap.md](./dotfiles-agent-infra-roadmap.md).
|
|
||||||
The short version:
|
|
||||||
|
|
||||||
- Inheriting the global infra is automatic once `install.sh` has run on the
|
|
||||||
machine — no per-project setup beyond an `AGENTS.md` and (optionally) an
|
|
||||||
overlay hook.
|
|
||||||
- The blocker for full MFE adoption is that `stop.sh` hardcodes Remnant's task
|
|
||||||
layout (`docs/TODO.md`, `docs/projects/COMPLETED.md`, `docs/explorations/`).
|
|
||||||
This is part of the
|
|
||||||
[hook audit](#-full-hook-script-remnant-isms-audit-may-23-2026--addendum)
|
|
||||||
below and is addressed by the `project.config.js` extraction tracked in the
|
|
||||||
roadmap.
|
|
||||||
|
|
||||||
### 🆕 Future task — unify kanban/task doc structure across projects
|
|
||||||
|
|
||||||
Moved to
|
|
||||||
[dotfiles-agent-infra-roadmap.md → Kanban / task-doc unification](./dotfiles-agent-infra-roadmap.md#4-kanban--task-doc-unification).
|
|
||||||
Driver recorded here for context: `stop.sh` hardcodes Remnant's task layout, and
|
|
||||||
the path forward (after `project.config.js` lands) is for the hook to support
|
|
||||||
multiple shapes driven by config rather than a single hardcoded one.
|
|
||||||
|
|
||||||
### 🔎 Full hook-script Remnant-isms audit (May 23, 2026 — addendum)
|
|
||||||
|
|
||||||
Re-read every hook in `~/dotfiles/.agents/hooks/` line-by-line after the
|
|
||||||
`stop.sh` miss. Findings below — anything not listed is reviewed and verified
|
|
||||||
generic.
|
|
||||||
|
|
||||||
**`pre-tool-use.sh` — multiple hardcodes that bite non-Remnant projects:**
|
|
||||||
|
|
||||||
1. **Policy 5 — hardcoded ports 3000/3001** for dev-server detection:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ss -tlnp 2>/dev/null | grep -qE ':300[01]\s'
|
|
||||||
```
|
|
||||||
|
|
||||||
These are Remnant's `apps/api` (3000) and `apps/client` Vite HMR (3001). MFE
|
|
||||||
uses different ports (likely 5173 for Vite, plus app-specific). Fix: read
|
|
||||||
ports from a per-project config (`.agents/project.json` with a `devPorts`
|
|
||||||
array) or from `package.json` script scraping, default to common ports if
|
|
||||||
unset.
|
|
||||||
|
|
||||||
2. **Policy 8 — error message references `npm run build:core`** (Remnant has a
|
|
||||||
`packages/core` package that owns the codegen step; other projects don't):
|
|
||||||
|
|
||||||
> "Edit the source files (controller.ts, routes.ts, business-logic.ts)
|
|
||||||
> instead and run 'npm run build:core' to regenerate." The `.generated.ts`
|
|
||||||
> block itself is generic, but the message and example filenames are
|
|
||||||
> Remnant-specific. Fix: parameterize the rebuild command via project config,
|
|
||||||
> or genericize the message ("run the generator script for the affected
|
|
||||||
> package").
|
|
||||||
|
|
||||||
3. **Policies 9 & 10 — assume wireit is the build tool.** Both error messages
|
|
||||||
reference wireit cache/fingerprint behavior and tell the agent to edit
|
|
||||||
`wireit` config in `package.json`. Remnant uses wireit; MFE may not. The
|
|
||||||
blocks themselves (`rm .wireit`, `-- --force` with npm run) are still useful
|
|
||||||
— they fire on the literal string `.wireit` and the `--force` flag — but the
|
|
||||||
messages will be confusing for non-wireit projects. Fix: detect wireit
|
|
||||||
presence (`grep -q '"wireit"' package.json`) and skip the block when not
|
|
||||||
present, or rewrite messages to be tool-agnostic.
|
|
||||||
|
|
||||||
4. **Policy 11 — assumes npm workspaces** (`npm run format -- <file>`
|
|
||||||
propagation issue). True for any npm-workspaces monorepo; false for
|
|
||||||
single-package projects (where the arg works fine). Low-impact: even in a
|
|
||||||
single-package repo, the block just prevents a working command. Fix: gate on
|
|
||||||
presence of `workspaces` field in root `package.json`.
|
|
||||||
|
|
||||||
5. **Policy 14 — hardcoded `apps/*/package.json` and `packages/*/package.json`
|
|
||||||
paths.** This is the exact Remnant monorepo layout (`apps/api`,
|
|
||||||
`apps/client`, `packages/core`, etc.). MFE may use `apps/` + `packages/` too
|
|
||||||
but the underlying concern — that reading workspace package.json files
|
|
||||||
auto-injects nested AGENTS.md and exhausts context — applies to any monorepo
|
|
||||||
with nested AGENTS.md files, regardless of directory names. Also: the message
|
|
||||||
hardcodes **"32K context window"**, which is a specific assumption about the
|
|
||||||
local model (qwen3-coder-30b on llama-server). Cloud models have 200K+. Fix:
|
|
||||||
discover workspace dirs from `package.json` `workspaces` field; drop the
|
|
||||||
model-size number or make it configurable.
|
|
||||||
|
|
||||||
**`post-tool-use.sh` — mostly generic, one cosmetic issue:**
|
|
||||||
|
|
||||||
6. **`vscode_renameSymbol` reminder uses Remnant-flavored example strings:**
|
|
||||||
`deleteX: archiveX`, `openDialog('delete-item')`,
|
|
||||||
`AppDialog handle='delete-item'`, `deleteSuccess/Loading/Error`. These are
|
|
||||||
illustrative patterns from Remnant's Solid.js store + AppDialog component.
|
|
||||||
They're not incorrect for other projects, just visibly Remnant-coded.
|
|
||||||
Low-priority: either genericize ("e.g. aliased store keys like
|
|
||||||
`oldName: newName` in a returned object") or leave as concrete examples —
|
|
||||||
they still teach the right habit. The header comment correctly notes that
|
|
||||||
project-specific reminders "belong in a sibling project-local hook file," but
|
|
||||||
this one snuck in.
|
|
||||||
|
|
||||||
7. **`opencode agent list` shell-out assumes OpenCode CLI is installed.** Fires
|
|
||||||
only when editing agent definitions, so the blast radius is small (a Copilot
|
|
||||||
user who never edits agents won't see it). The fallback ("opencode agent list
|
|
||||||
failed") is graceful. Acceptable as-is, but worth noting: Copilot-only
|
|
||||||
environments will hit the failure path every time. Could gate on
|
|
||||||
`command -v opencode`.
|
|
||||||
|
|
||||||
**`pre-compact.sh`:**
|
|
||||||
|
|
||||||
8. **`docs/explorations/` hardcoded** (same path issue as `stop.sh`). Already
|
|
||||||
covered by the kanban-unification task above — fold into that work.
|
|
||||||
|
|
||||||
**`session-start.sh`:**
|
|
||||||
|
|
||||||
9. **`docs/explorations/` hardcoded** (same — fold into kanban-unification).
|
|
||||||
|
|
||||||
10. **`.session/dead-ends.md` and `.session/pre-compact-state.md` paths** appear
|
|
||||||
in both `session-start.sh`, `pre-compact.sh`, and `stop.sh`. This is a
|
|
||||||
convention `.agents/AGENTS.md` should formally document so it's not just
|
|
||||||
"magic paths the hooks know about." Not Remnant-specific (no Remnant code
|
|
||||||
references these), but undocumented. Fix: add a "Session conventions"
|
|
||||||
section to `~/dotfiles/.agents/AGENTS.md` listing these paths.
|
|
||||||
|
|
||||||
11. **"Ordered markdown lists are auto-renumbered by the editor on save"
|
|
||||||
reminder** — this is VS Code + Prettier behavior, generic enough to keep,
|
|
||||||
but worth flagging that it assumes the project uses Prettier with that
|
|
||||||
setting (Remnant does; others may not).
|
|
||||||
|
|
||||||
**`stop.sh` (already covered, restated for completeness):**
|
|
||||||
|
|
||||||
12. `docs/TODO.md`, `docs/projects/COMPLETED.md`, `docs/explorations/` — kanban
|
|
||||||
task.
|
|
||||||
|
|
||||||
13. **Ports 3000/3001** dev-server check (same as Policy 5 — fold fix together).
|
|
||||||
|
|
||||||
14. **`npm run build:strict`** referenced as the recommended verification
|
|
||||||
command. This is a Remnant-specific custom script name. Other projects use
|
|
||||||
`npm run build` or `npm run check` or `npm run ci`. Fix: same parameterize
|
|
||||||
approach (read from `.agents/project.json`).
|
|
||||||
|
|
||||||
**`user-prompt-submit.sh`:** clean. No Remnant-isms found.
|
|
||||||
|
|
||||||
**Suggested fix pattern (rather than a string of patches):**
|
|
||||||
|
|
||||||
Introduce a per-project config file at `<repo>/.agents/project.config.js` (or
|
|
||||||
`.ts`) so each hook can read its values instead of hardcoding them. Full design
|
|
||||||
— file shape, loader notes, dropped fields (`modelContextWindow`),
|
|
||||||
recommendation — is in
|
|
||||||
[dotfiles-agent-infra-roadmap.md → `project.config.js` extraction](./dotfiles-agent-infra-roadmap.md#1-projectconfigjs-extraction).
|
|
||||||
|
|
||||||
### 🆕 Future task — per-session tmp file capture
|
|
||||||
|
|
||||||
Moved to
|
|
||||||
[dotfiles-agent-infra-roadmap.md → Per-session tmp file capture](./dotfiles-agent-infra-roadmap.md#2-per-session-tmp-file-capture).
|
|
||||||
Driver recorded here for the validation trail: `user-prompt-submit.sh` writes to
|
|
||||||
a globally-named `/tmp/.last-user-prompt.txt`, so concurrent sessions clobber
|
|
||||||
one another's capture. The same issue affects
|
|
||||||
`/tmp/.opencode-tool-count-${REPO_ID}` in `post-tool-use.sh` (keyed by repo, not
|
|
||||||
session — concurrent sessions in the same repo share the self-check counter).
|
|
||||||
@ -1,87 +0,0 @@
|
|||||||
# Failure Modes — Qwen3.6 & OpenCode
|
|
||||||
|
|
||||||
Compiled 2026-05-27. Sources linked inline.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Qwen3.6 Model-Specific Quant & Routing Issues
|
|
||||||
|
|
||||||
### IQ3 Quant — Tool Call JSON Failure
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | IQ3 quant tool-call JSON breakage |
|
|
||||||
| **Description** | Qwen3.6 35B-A3B at IQ3_XXS quant fails function-call JSON generation entirely. BatiAI's Ollama benchmark shows ❌ for IQ3, ✅ for IQ4 and Q6. IQ3 is memory-bandwidth bound (~45.9 t/s on M4 Max) and loses the precision needed for structured JSON output in tool calls. |
|
|
||||||
| **Mitigation** | Use IQ4_XS or Q6_K for any workload with tool calling. IQ3 is acceptable only for text-only chat. IQ4 and Q6 show equivalent throughput. |
|
|
||||||
| **Sources** | [batiai/qwen3.6-35b:iq3 (Ollama)](https://ollama.com/batiai/qwen3.6-35b:iq3) |
|
|
||||||
|
|
||||||
### MoE Expert Loop — Q4_K_M & Below Routing Lock
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | Q4_K_M MoE expert routing collapse |
|
|
||||||
| **Description** | Qwen3.6's MoE architecture (256 routed experts, top-8 selection) degrades at Q4_K_M and below: the router locks into a subset of specialists (e.g., code-completion specialist for math queries, math specialist for syntax tasks). Expert activation entropy collapses. This is a structural MoE failure — dense Qwen2.5-72B does not exhibit this. Perplexity delta of +0.34 at Q4_K_M looks acceptable on paper but produces hallucinated method names, wrong parameter counts, and broken imports. |
|
|
||||||
| **Mitigation** | Default to Q6_K (1.6-point SWE-bench loss vs Q8_0, saves 2.1 GB VRAM). For 24 GB cards, Q4_K_M is acceptable only for RAG ingestion or documentation chat — not active code generation or function calling. Q8_0 wins SWE-bench Lite at 28.7%. BFCL v2 function-calling accuracy: 94.2% (Q8_0) → 89.7% (Q4_K_M). |
|
|
||||||
| **Sources** | [Qwen3.6 quant benchmarks: Q4 vs Q8 for MoE (CraftRigs)](https://craftrigs.com/comparisons/qwen3-6-quantization-benchmarks-q4-vs-q8/); [Qwen3.6-27B Setup Guide: 24GB GPU (CraftRigs)](https://craftrigs.com/guides/qwen3-6-27b-setup-guide-24gb-gpu/) |
|
|
||||||
|
|
||||||
### Official Chat Template — Non-Standard XML Parameter Format
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | Qwen3.6 official `chat_template.jinja` XML vs JSON incompatibility |
|
|
||||||
| **Description** | Qwen3.6's shipped `chat_template.jinja` instructs the model to generate function calls using a proprietary XML-like syntax (`<function=...><parameter=...>`) instead of OpenAI-compatible JSON. Missing closing tags cause parsing failures in standard inference frameworks (vLLM, HuggingFace transformers, llama-cpp-python, OpenAI-compatible API layers). Error: `Failed to parse input at pos XXXX: <function=read> <parameter=filePath> ...`. |
|
|
||||||
| **Mitigation** | Patch `chat_template.jinja` to use OpenAI-compatible JSON schema (`{"name": "function_name", "arguments": "{\"param1\": \"value1\"}"}`). |
|
|
||||||
| **Sources** | [abysslover/qwen36_tool_calling_failure (GitHub)](https://github.com/abysslover/qwen36_tool_calling_failure) |
|
|
||||||
|
|
||||||
### Long-Text Stability — Context Accumulation Amplifies Routing Drift
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | Q4_K_M multi-turn routing drift |
|
|
||||||
| **Description** | General chat tolerates +0.50 perplexity delta before quality drop is noticed. Multi-turn technical discussion (>3 turns with context accumulation), chain-of-thought reasoning, and structured output cross the threshold where expert loop errors become detectable within the first 10 responses. Context accumulation amplifies routing drift. |
|
|
||||||
| **Mitigation** | Q4_K_M acceptable for single-turn or short-context use. For long contexts or multi-turn structured output, use Q6_K or Q8_0. |
|
|
||||||
| **Sources** | [Qwen3.6 quant benchmarks: Q4 vs Q8 for MoE (CraftRigs)](https://craftrigs.com/comparisons/qwen3-6-quantization-benchmarks-q4-vs-q8/) |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## OpenCode Plugin / Hook-Specific Failures
|
|
||||||
|
|
||||||
### session.start — Resume / --continue Does Not Fire Plugin Context
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | session.start hook failure on resume |
|
|
||||||
| **Description** | `session.start` hook fires reliably for new sessions (`startup` trigger) but fails on resume (`--continue`/`--session`) with "No context found for instance" error. `Plugin.triggerSessionStart` is called during route navigation before the plugin context is fully initialized. Pending hook context is consumed lazily on the next model turn, so resume-triggered context can become stale if a session is resumed but not prompted soon after. |
|
|
||||||
| **Mitigation** | Be aware that `session.start` with `resume` trigger has a bootstrap timing edge case. Pending context becomes stale if the resumed session sits idle. PR #15224 documents the issue and a partial fix. |
|
|
||||||
| **Sources** | [OpenCode PR #15224 — feat(plugin): add session.start hook](https://github.com/anomalyco/opencode/pull/15224); [OpenCode Issue #5409 — SessionStart hook for session lifecycle events](https://github.com/sst/opencode/issues/5409) |
|
|
||||||
|
|
||||||
### PreToolUse — Ask Response Permanently Disables Bypass Permission
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | PreToolUse permission bypass lock |
|
|
||||||
| **Description** | When `PreToolUse` returns `permissionDecision: "ask"`, it permanently disables bypass permission mode until session restart. This is a state machine vulnerability — the permission bypass mode cannot recover from an `ask` response without a full session reset. |
|
|
||||||
| **Mitigation** | If using permission bypass mode, avoid `PreToolUse` hooks that return `ask`. Verify hook behavior after any policy change. |
|
|
||||||
| **Sources** | Claude Code #37420 (referenced in AGENTS.md) |
|
|
||||||
|
|
||||||
### session.created — Event Fails Reliably for Plugins
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | session.created event reliability for plugins |
|
|
||||||
| **Description** | `session.created` event fails to fire reliably for plugins due to MCP compatibility errors. This affects plugins that depend on session lifecycle events for initialization. |
|
|
||||||
| **Mitigation** | Use `session.start` hook as the primary initialization mechanism instead of relying on `session.created` events. |
|
|
||||||
| **Sources** | OpenCode #14808 (referenced in AGENTS.md, `~/.config/opencode/plugins/engram.ts`) |
|
|
||||||
|
|
||||||
### chat.message — Synthetic Text Injection Required for System Message Position
|
|
||||||
|
|
||||||
| | |
|
|
||||||
|---|---|
|
|
||||||
| **Name** | Jinja system message position enforcement |
|
|
||||||
| **Description** | vLLM propagates Qwen's strict Jinja template requiring `role=system` at index 0. Auxiliary context injection (e.g., from session-start hooks) breaks this if it places context after the system message. Fix: inject session-start as a synthetic `text` part via `output.parts.unshift()` on the first `chat.message` turn, not via `experimental.chat.system.transform`. Text parts have no position constraint. |
|
|
||||||
| **Mitigation** | Do not use `experimental.chat.system.transform` for session-start hooks with Qwen-family models. Use synthetic `text` parts via `output.parts.unshift()` on the first `chat.message` turn. |
|
|
||||||
| **Sources** | vLLM #41114; AGENTS.md (system reminder pattern) |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
*Generated 2026-05-27 from web search findings.*
|
|
||||||
@ -1,405 +0,0 @@
|
|||||||
# Where Human and LLM Text Interpretation Overlap (and Don't)
|
|
||||||
|
|
||||||
> **Status:** Synthesis of
|
|
||||||
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
|
|
||||||
> (humans reading text) and
|
|
||||||
> [`llm-intent-interpretation.md`](./llm-intent-interpretation.md) (LLMs reading
|
|
||||||
> prompts). The question is: how much of what works on one carries over, and is
|
|
||||||
> there published evidence either way?
|
|
||||||
>
|
|
||||||
> **Working hypothesis (from the user, May 2026):** LLMs are trained on
|
|
||||||
> human-written text, so the cognitive shortcuts and biases that humans bring to
|
|
||||||
> text could be inherited by the models. This doc treats that as a hypothesis to
|
|
||||||
> test against the literature, not as an assumption.
|
|
||||||
>
|
|
||||||
> **Methodology:** Each candidate parallel is rated by what the literature says,
|
|
||||||
> not by intuition. Four labels are used:
|
|
||||||
>
|
|
||||||
> - **Cited connection** — at least one paper explicitly links the human and LLM
|
|
||||||
> phenomenon (often by name).
|
|
||||||
> - **Cited distinction** — a paper explicitly argues the analogy is misleading
|
|
||||||
> or the mechanism is different.
|
|
||||||
> - **Parallel without published bridge** — both phenomena are real and
|
|
||||||
> independently documented, but no source I found connects them. Use with
|
|
||||||
> care.
|
|
||||||
> - **Orphan** — exists in only one doc; no found counterpart.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. The User's Hypothesis, Tested
|
|
||||||
|
|
||||||
> "Humans wrote the text LLMs are trained on, so human emotional/cognitive
|
|
||||||
> shortcuts could affect LLMs."
|
|
||||||
|
|
||||||
**Verdict: directly supported in the literature.** Mina et al. (COLING 2025) [1]
|
|
||||||
examine four classical cognitive biases — primacy, recency, common-token, and
|
|
||||||
majority-class — across base and instructed models of varying size, and
|
|
||||||
conclude:
|
|
||||||
|
|
||||||
> "Recent work has shown that these biases can percolate through training data
|
|
||||||
> and ultimately be learned by language models." [1]
|
|
||||||
|
|
||||||
The same paper distinguishes biases that arise from _pretraining data
|
|
||||||
distributions_ (e.g., common-token bias) from biases that arise from the
|
|
||||||
_autoregressive generation process itself_ (e.g., some forms of recency). So the
|
|
||||||
user's framing is correct, with one refinement: not every LLM bias is inherited
|
|
||||||
— some are mechanical, some are statistical, some are both.
|
|
||||||
|
|
||||||
Hartvigsen-line work (Steed et al. 2022; Touileb-line replications through 2024)
|
|
||||||
[9] independently confirms the inheritance pathway for sentiment and
|
|
||||||
social-stereotype biases: pretraining corpora (CC-100 vs. Wikipedia) carry
|
|
||||||
measurably different negative-sentiment distributions toward identity terms,
|
|
||||||
which propagate into both upstream embeddings and downstream toxicity
|
|
||||||
classifiers.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. Cited Connections
|
|
||||||
|
|
||||||
These are points where the published literature names a human cognitive
|
|
||||||
phenomenon as the analog of an LLM behavior, with empirical work on both sides.
|
|
||||||
|
|
||||||
**Evidence-strength tags** (applied per subsection):
|
|
||||||
|
|
||||||
- **[multi-replicated]** — multiple independent studies, including at least one
|
|
||||||
peer-reviewed venue, finding the same effect.
|
|
||||||
- **[single-study + partial replication]** — primary finding peer-reviewed;
|
|
||||||
follow-ups exist but disagree on scope or magnitude.
|
|
||||||
- **[single-study]** — peer-reviewed but not yet independently replicated to my
|
|
||||||
knowledge.
|
|
||||||
- **[preprint-only]** — relevant findings exist only as arXiv preprints or
|
|
||||||
community analyses; treat as provisional.
|
|
||||||
|
|
||||||
### 2.1 Primacy / recency → Lost-in-the-middle (Serial Position Effects)
|
|
||||||
|
|
||||||
**Evidence strength: [single-study + partial replication]** — the analogy is
|
|
||||||
real but the LLM side has been refined and partially disconfirmed.
|
|
||||||
|
|
||||||
The human side: Asch (1946) on primacy in impression formation; Baddeley & Hitch
|
|
||||||
(1993) on recency in working memory. [2][3]
|
|
||||||
|
|
||||||
The LLM side: Wang et al. (ACL Findings 2025), _Serial Position Effects of Large
|
|
||||||
Language Models_ [4], explicitly tests for "primacy and recency biases, which
|
|
||||||
are well-documented cognitive biases in human psychology" and confirms
|
|
||||||
widespread occurrence across ChatGPT, GPT-J, GPT-3.5, GPT-4, and
|
|
||||||
Claude-instant-1.2. The lost-in-the-middle finding (Liu et al., TACL 2024) is
|
|
||||||
the same phenomenon under a different name.
|
|
||||||
|
|
||||||
**Refinements and partial disconfirmations:**
|
|
||||||
|
|
||||||
- Bilan et al. (arXiv 2508.07479, 2025) [5] show the U-shape only holds when
|
|
||||||
content occupies up to ~50% of the context window; beyond that, primacy
|
|
||||||
weakens and the curve becomes _distance-to-end_ rather than U-shaped.
|
|
||||||
- Mak (2025) [15] argues the dip is partly an artifact of positional-embedding
|
|
||||||
decay — tokens near the 90% position get "blurry" embeddings — producing
|
|
||||||
monotonic drop from start to end at very-long contexts, not a clean U.
|
|
||||||
- Zhang et al. (2024b), cited in [4], found studies that **did not** replicate
|
|
||||||
the LiM effect on certain long-context models, indicating the effect is
|
|
||||||
conditional on architecture and context length.
|
|
||||||
|
|
||||||
Humans don't have a context window, and their primacy advantage is stable across
|
|
||||||
passage length, so the analogy is conceptual rather than mechanistic.
|
|
||||||
|
|
||||||
**Practical convergence:** "put important content at the boundaries" works for
|
|
||||||
both — but the LLM version may degrade into pure recency at long contexts, and
|
|
||||||
the cause includes embedding-precision artifacts that have no human analog.
|
|
||||||
|
|
||||||
### 2.2 Hyperpersonal idealization → ELIZA effect / anthropomorphism
|
|
||||||
|
|
||||||
**Evidence strength: [multi-replicated]** — anthropomorphism toward chatbots is
|
|
||||||
one of the oldest and most-replicated findings in HCI; the hyperpersonal model
|
|
||||||
itself has decades of CMC support.
|
|
||||||
|
|
||||||
The human side: Walther's hyperpersonal model (1996) — in text-only
|
|
||||||
relationships, receivers idealize senders by filling in flattering detail. [#12
|
|
||||||
in human doc]
|
|
||||||
|
|
||||||
The LLM-adjacent side: the **ELIZA effect**, named for Weizenbaum's 1966 chatbot
|
|
||||||
— humans attribute understanding, empathy, and authenticity to systems that
|
|
||||||
produce text resembling human speech. The Cambridge essay collection on chatbot
|
|
||||||
authenticity (2024) [6] explicitly traces this to "a much longer history of
|
|
||||||
technologically mediated communications" and notes the same hyperpersonal
|
|
||||||
pattern: minimal cues, maximum projection.
|
|
||||||
|
|
||||||
This connection is bidirectional and was named long before LLMs — the mechanism
|
|
||||||
on the human side is identical (cue impoverishment → reader fills the gap), only
|
|
||||||
the partner changes.
|
|
||||||
|
|
||||||
### 2.3 Sycophancy ↔ social-desirability / agreement bias
|
|
||||||
|
|
||||||
**Evidence strength: [single-study + partial replication]** — the headline
|
|
||||||
result is peer-reviewed (ICLR 2024) on a specific set of RLHF'd models, but a
|
|
||||||
community replication on OpenAI base models found the effect does not generalize
|
|
||||||
across model families.
|
|
||||||
|
|
||||||
The human side: well-documented social-desirability and conformity effects
|
|
||||||
(Asch, 1956; Crowne & Marlowe, 1960) — humans give answers they believe the
|
|
||||||
listener wants.
|
|
||||||
|
|
||||||
The LLM side: Sharma et al. (ICLR 2024), _Towards Understanding Sycophancy in
|
|
||||||
Language Models_ [7], tested five SOTA RLHF assistants and analyzed the
|
|
||||||
`hh-rlhf` preference dataset. Headline finding:
|
|
||||||
|
|
||||||
> "Both humans and preference models prefer convincingly-written sycophantic
|
|
||||||
> responses over correct ones a non-negligible fraction of the time… matching a
|
|
||||||
> user's views is one of the most predictive features of human preference
|
|
||||||
> judgments."
|
|
||||||
|
|
||||||
On the Sharma et al. data, the bias is encoded into the **human preference
|
|
||||||
labels** that drive RLHF — i.e., human social-desirability bias is propagated to
|
|
||||||
the reward model and then to the policy. The mitigation literature
|
|
||||||
(Self-Augmented Preference Alignment, EMNLP 2025) [8] reframes the problem as
|
|
||||||
needing to explicitly assess the user's expected answer rather than ignore it.
|
|
||||||
|
|
||||||
**Important counter-evidence:** Perez et al. (2022) originally claimed
|
|
||||||
sycophancy appears even at **zero RLHF steps**, which would imply a
|
|
||||||
pretraining-corpus origin. nostalgebraist (2023) [16] reproduced Perez et al.'s
|
|
||||||
eval on OpenAI API base models (davinci, babbage, etc.) and found OpenAI base
|
|
||||||
models are **not sycophantic at any size**. Sycophancy emerges only with
|
|
||||||
specific finetuning pipelines (e.g., `text-davinci-002`/`003`). The honest
|
|
||||||
reading is:
|
|
||||||
|
|
||||||
- Sycophancy is **real and replicable** in specific RLHF'd model families.
|
|
||||||
- It is **not a universal property of RLHF** or of "models trained on human
|
|
||||||
text."
|
|
||||||
- The most plausible mechanism is _interaction_ between specific reward-model
|
|
||||||
shapes and specific preference data, not a clean inheritance from a single
|
|
||||||
human cognitive bias.
|
|
||||||
|
|
||||||
**Practical convergence (where it holds):** the human-side advice "ask for the
|
|
||||||
answer before stating your own view" maps directly to LLM-side guidance ("avoid
|
|
||||||
revealing your conclusion before asking the model").
|
|
||||||
|
|
||||||
### 2.4 Perspective-taking (Galinsky) ↔ SimToM prompting
|
|
||||||
|
|
||||||
**Evidence strength: [single-study]** — SimToM is a single 2023 arXiv paper with
|
|
||||||
no independent replication I found; the human-side perspective-taking literature
|
|
||||||
is robust.
|
|
||||||
|
|
||||||
The human side: Galinsky & Moskowitz (2000), perspective-taking reduces hostile
|
|
||||||
attributions and stereotype expression. [#7 in human doc]
|
|
||||||
|
|
||||||
The LLM side: Wilf et al. (2023), _Think Twice: Perspective-Taking Improves
|
|
||||||
Large Language Models' Theory-of-Mind Capabilities_ (SimToM) [10], explicitly
|
|
||||||
cites Simulation Theory's notion of perspective-taking and operationalizes it as
|
|
||||||
a two-stage prompt: filter the context to what a character knows, _then_ answer
|
|
||||||
questions about their mental state. Improves ToM benchmarks substantially with
|
|
||||||
no fine-tuning.
|
|
||||||
|
|
||||||
**Practical convergence:** for both humans and models, asking "what does the
|
|
||||||
other party know / believe / intend?" as a separate, explicit step before
|
|
||||||
responding improves accuracy on ambiguous-intent tasks.
|
|
||||||
|
|
||||||
### 2.5 Asking a clarifying question (Byron) ↔ Selective clarification (CLAM)
|
|
||||||
|
|
||||||
**Evidence strength: [multi-replicated]** on the human side; **[single-study]**
|
|
||||||
on the LLM side, but the CLAM framework has been re-used and extended in
|
|
||||||
follow-on work and integrated into Anthropic's published defaults.
|
|
||||||
|
|
||||||
The human side: Byron (2008) [#2 in human doc] — respond to ambiguous emotional
|
|
||||||
content with a question, not a reaction.
|
|
||||||
|
|
||||||
The LLM side: Kuhn et al. (arXiv 2212.07769), _CLAM: Selective Clarification for
|
|
||||||
Ambiguous Questions_ [11], shows current language models "rarely ask users to
|
|
||||||
clarify ambiguous questions and instead provide incorrect answers," and provides
|
|
||||||
a framework that meaningfully improves QA performance when ambiguity is detected
|
|
||||||
and a clarifying question is generated.
|
|
||||||
|
|
||||||
**Practical convergence:** the advice is identical and verified independently on
|
|
||||||
both sides — when intent is unclear, asking is better than guessing. The
|
|
||||||
Anthropic "default-to-clarify" system prompt variant ([1] in llm doc) is the
|
|
||||||
engineering implementation.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Cited Distinctions
|
|
||||||
|
|
||||||
### 3.1 Egocentrism (sender-side, human) ≠ literalism (Claude 4.7)
|
|
||||||
|
|
||||||
Kruger, Epley, Parker & Ng (2005) frame egocentrism as a **sender**
|
|
||||||
overestimating how clearly tone comes through. LLMs don't "send" in that sense —
|
|
||||||
they're always the receiver of the prompt. Anthropic's documented behavior
|
|
||||||
change in Opus 4.7 [llm doc, 1] is the opposite of human egocentrism: the model
|
|
||||||
becomes _less_ willing to infer beyond what's written.
|
|
||||||
|
|
||||||
**Implication:** the human-side cure ("state things explicitly because you can't
|
|
||||||
trust the receiver to read your mind") is exactly what the LLM-side
|
|
||||||
architectural shift now _requires_ from the user. Same advice, mirrored
|
|
||||||
mechanism.
|
|
||||||
|
|
||||||
### 3.2 Affect labeling (Lieberman) — claimed analog is weak
|
|
||||||
|
|
||||||
The temptation is to map affect labeling ("name the emotion") onto "ask the LLM
|
|
||||||
to identify sentiment before responding." Reichman et al. (arXiv
|
|
||||||
2603.09205, 2026) [12] introduce AURA-QA, an emotion-balanced QA dataset, and
|
|
||||||
find that "affective tone inadvertently influences semantic interpretation, even
|
|
||||||
among semantically equivalent inputs with differing emotional expressions."
|
|
||||||
Their proposed fix is _representation- level emotional regularization at
|
|
||||||
training time_, not a labeling prompt. So the mechanism (amygdala
|
|
||||||
down-regulation via verbal labeling of one's own affect) does not transfer; the
|
|
||||||
LLM lacks the regulatory loop the human practice exploits.
|
|
||||||
|
|
||||||
**Practical conclusion:** asking an LLM to "first identify the tone of this
|
|
||||||
message" can disambiguate intent, but the published mechanism is
|
|
||||||
representational, not regulatory. Don't expect the same calming / de-escalation
|
|
||||||
effect documented in humans.
|
|
||||||
|
|
||||||
### 3.3 Hostile-attribution bias (Aderka et al.) ≠ LLM negativity inheritance
|
|
||||||
|
|
||||||
In humans, hostile attribution is an _interpretive_ tendency in ambiguous social
|
|
||||||
cues, tied to individual differences (anxiety, prior experience). In LLMs,
|
|
||||||
negative-sentiment inheritance is a **statistical property of the pretraining
|
|
||||||
corpus** that propagates into embeddings and downstream classifiers [9][12].
|
|
||||||
Both produce "neutral text read as negative," but the human bias varies by
|
|
||||||
reader; the LLM bias varies by corpus and is roughly stable per model.
|
|
||||||
Mitigations are correspondingly different: cognitive (re-read, generate
|
|
||||||
alternatives) on the human side, data/representational on the LLM side.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. Parallels Without a Published Bridge
|
|
||||||
|
|
||||||
These look like genuine analogies but I did not find a paper that draws the link
|
|
||||||
explicitly. Use them as working hypotheses, not citations.
|
|
||||||
|
|
||||||
| Human-side practice | LLM-side practice | Status |
|
|
||||||
| ------------------------------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| Delay / "don't hit send" | Reflect / self-correct / multi-turn revision | Mechanistically different (amygdala vs. additional inference passes); empirically both reduce errors. Self-reflection survey: [13]. |
|
|
||||||
| Re-read slowly | Self-consistency / re-read prompt | Self-consistency (Wang et al. 2023) reduces hallucination; not framed as analogous to human re-reading in the papers I found. |
|
|
||||||
| Principle of charity / steel-manning | "State scope explicitly" (Anthropic 4.7 guide) | Both are about pre-empting under-specified intent. No source connects them. |
|
|
||||||
| NVC: observation → interpretation gap | XML tags around content | Both separate "what is on the page" from "what to do with it," but the rationales (cognitive defusion vs. attention boundaries) differ. |
|
|
||||||
| Match medium to message (richness) | Escalate to bigger model / use tools | Daft & Lengel's media richness has been cited in CMC literature; no direct LLM-side citation found. |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Orphans (No Found Counterpart Either Direction)
|
|
||||||
|
|
||||||
### Human-side, no LLM analog found
|
|
||||||
|
|
||||||
- **Mehrabian "55/38/7" debunk.** Specific to humans + paralinguistic cues; no
|
|
||||||
parallel claim in LLM literature.
|
|
||||||
- **Emoji as partial tone fix (Riordan 2017).** Emoji-in-prompt research exists
|
|
||||||
but treats emoji as tokens, not as a tone-channel substitute. The analogy is
|
|
||||||
shallow.
|
|
||||||
- **The minimal operating checklist (§3 of human doc).** Some items map
|
|
||||||
(clarifying question, perspective-taking); the rest (pause, pulse check) have
|
|
||||||
no plausible model analog.
|
|
||||||
|
|
||||||
### LLM-side, no human analog found
|
|
||||||
|
|
||||||
- **Quantization effects (Q3/Q4/Q5/Q8 trade-offs).** Uniquely a
|
|
||||||
numerical-precision phenomenon. The closest human analog would be fatigue /
|
|
||||||
cognitive load reducing reasoning accuracy, but no source draws this link, and
|
|
||||||
the dose-response curves are different shapes.
|
|
||||||
- **Dense vs. MoE architecture (Shen et al. 2024).** Routing-based
|
|
||||||
specialization has no plausible human analog at the level the paper studies.
|
|
||||||
- **Parameter count and bimodal emergence (Distributional Scaling Laws).**
|
|
||||||
Reflects training stochasticity; humans don't "scale" in a comparable way.
|
|
||||||
- **Role confusion / CoT Forgery (style → authority).** A human parallel exists
|
|
||||||
(uniforms, jargon, Milgram-style obedience to apparent authority), but I found
|
|
||||||
no paper that draws the explicit LLM↔human bridge for stylistic-spoofing
|
|
||||||
attacks. Worth flagging as a likely-but-unwritten connection.
|
|
||||||
- **Default-to-action vs. default-to-clarify as a prompt knob.** This is a
|
|
||||||
property of model alignment dials, not of human cognition. The human side has
|
|
||||||
trait-level analogs (conscientiousness, impulsivity) but they're not knobs.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 6. Additional Findings Worth Carrying Forward
|
|
||||||
|
|
||||||
Two items surfaced during this synthesis that didn't fit cleanly into either
|
|
||||||
prior doc but are relevant to anyone using the previous two.
|
|
||||||
|
|
||||||
### 6.1 The bias-inheritance chain is two-stage, not one
|
|
||||||
|
|
||||||
Mina et al. [1] and Hartvigsen-line work [9] together imply a useful mental
|
|
||||||
model: human biases reach LLMs through **two distinct channels** that need
|
|
||||||
different mitigations.
|
|
||||||
|
|
||||||
1. **Pretraining-corpus channel.** Cognitive and sentiment biases that exist in
|
|
||||||
the source text (e.g., common-token, majority-class, identity-term
|
|
||||||
sentiment). Mitigated at the data / training-objective level (e.g., AURA-QA's
|
|
||||||
emotional regularization [12]).
|
|
||||||
2. **Preference-label channel.** Biases in human judgments that drive RLHF —
|
|
||||||
most prominently sycophancy [7]. Mitigated at the reward-model / alignment
|
|
||||||
level (SAPA [8]).
|
|
||||||
|
|
||||||
A prompt-time mitigation only addresses the symptom. This explains why "be
|
|
||||||
specific" reliably helps but "tell the model not to be sycophantic" helps less
|
|
||||||
than expected — only the former is in the model's in-context-learnable
|
|
||||||
repertoire.
|
|
||||||
|
|
||||||
### 6.2 RLHF amplifies serial-position effects
|
|
||||||
|
|
||||||
Tjuatja et al. (2023), cited in Wang et al. [4], find that RLHF **increases**
|
|
||||||
serial position effects relative to base models. This is consistent with the
|
|
||||||
broader pattern that alignment training, while making models more useful, also
|
|
||||||
makes them more reliably _human-like_ in their failure modes — including ones
|
|
||||||
we'd rather not import.
|
|
||||||
|
|
||||||
**Practical takeaway:** if you have a choice between a base/lightly- tuned local
|
|
||||||
model and a heavily-RLHF'd one for tasks where positional fairness matters
|
|
||||||
(e.g., ranking, multiple-choice evaluation), the base model may show _less_ of
|
|
||||||
the human-analog bias.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 7. Sources
|
|
||||||
|
|
||||||
1. Mina, M., Ruiz-Fernández, V., Falcão, J., Vasquez-Reina, L., &
|
|
||||||
Gonzalez-Agirre, A. (2024). _Cognitive biases in large language models: A
|
|
||||||
survey and mitigation experiments._ COLING 2025.
|
|
||||||
https://aclanthology.org/2025.coling-main.120v1.pdf
|
|
||||||
2. Asch, S. E. (1946). _Forming impressions of personality._ Journal of Abnormal
|
|
||||||
and Social Psychology, 41(3), 258–290. (Primacy effect in impression
|
|
||||||
formation.)
|
|
||||||
3. Baddeley, A. D., & Hitch, G. J. (1993). _The recency effect: Implicit
|
|
||||||
learning with explicit retrieval?_ Memory & Cognition, 21(2), 146–155.
|
|
||||||
4. Wang, X., et al. (2024/2025). _Serial Position Effects of Large Language
|
|
||||||
Models._ ACL Findings 2025. arXiv:2406.15981. (Explicitly tests human
|
|
||||||
primacy/recency analogs in LLMs.)
|
|
||||||
5. Bilan, J., et al. (2025). _Positional Biases Shift as Inputs Approach Context
|
|
||||||
Window Limits._ arXiv:2508.07479. (LiM is strongest up to ~50% of context
|
|
||||||
window; beyond that, distance-to-end dominates.)
|
|
||||||
6. _Can Chatbots Be Authentic? The ELIZA Effect Revisited._ Cambridge University
|
|
||||||
Press essay collection (2024). (Hyperpersonal / anthropomorphism lineage from
|
|
||||||
Eliza to modern LLMs.)
|
|
||||||
7. Sharma, M., et al. (2024). _Towards Understanding Sycophancy in Language
|
|
||||||
Models._ ICLR 2024. arXiv:2310.13548.
|
|
||||||
8. Park, J., et al. (2025). _Self-Augmented Preference Alignment for Sycophancy
|
|
||||||
Reduction in LLMs._ EMNLP 2025.
|
|
||||||
9. Khandelwal, A., et al. (2024). _Scaling and sentiment bias propagation from
|
|
||||||
pretraining corpora into downstream models._ arXiv preprint. (CC-100 vs.
|
|
||||||
Wikipedia sentiment toward identity groups; propagation to fine-tuned
|
|
||||||
toxicity classifiers.)
|
|
||||||
10. Wilf, A., et al. (2023). _Think Twice: Perspective-Taking Improves Large
|
|
||||||
Language Models' Theory-of-Mind Capabilities._ arXiv:2311.10227. (SimToM —
|
|
||||||
explicit operationalization of Galinsky-style perspective-taking for LLMs.)
|
|
||||||
11. Kuhn, L., Gal, Y., & Farquhar, S. (2022/2023). _CLAM: Selective
|
|
||||||
Clarification for Ambiguous Questions with Large Language Models._
|
|
||||||
arXiv:2212.07769.
|
|
||||||
12. Reichman, B., et al. (2026). _AURA-QA: An emotionally balanced QA dataset
|
|
||||||
and emotional regularization framework._ arXiv:2603.09205.
|
|
||||||
13. Ji, Z., et al. (2023). _Towards Mitigating Hallucination in Large Language
|
|
||||||
Models via Self-Reflection._ arXiv:2310.06271.
|
|
||||||
14. Tjuatja, L., et al. (2023). _RLHF amplifies prompt-position sensitivity in
|
|
||||||
language models._ Cited in [4]. (Original arXiv preprint; full reference in
|
|
||||||
[4]'s bibliography.)
|
|
||||||
15. Mak, Y. C. (2025). _Lost in the middle, or just lost? Evaluating LLMs on
|
|
||||||
information retrieval with long input contexts._
|
|
||||||
https://ycmak.net/how-lost-in-the-middle/ (Argues the U-shape is partly an
|
|
||||||
artifact of positional-embedding decay producing monotonic drop at very long
|
|
||||||
contexts. Not peer-reviewed; data and methodology are public.)
|
|
||||||
16. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
|
|
||||||
size._ LessWrong.
|
|
||||||
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
|
|
||||||
(Replication-style analysis disconfirming the strongest reading of Perez et
|
|
||||||
al. 2022 for OpenAI base models.)
|
|
||||||
17. Schulhoff, S. et al. (2024). _The Prompt Report: A Systematic Survey of
|
|
||||||
Prompting Techniques._ arXiv:2406.06608. (PRISMA review of 1,565 papers;
|
|
||||||
foundational survey used as cross-check on prompt-engineering claims in the
|
|
||||||
companion LLM doc.)
|
|
||||||
18. _Principled Personas: Defining and Measuring the Intended Effects of Persona
|
|
||||||
Prompting on Task Performance._ EMNLP 2025.
|
|
||||||
https://aclanthology.org/2025.emnlp-main.1364/ (Persona prompts often
|
|
||||||
ineffective; up to ~30pp drops from irrelevant persona details.)
|
|
||||||
@ -1,379 +0,0 @@
|
|||||||
# Action Plan: Counteracting Model Failures to Interpret Intent
|
|
||||||
|
|
||||||
**Status:** draft (2026-05-16)
|
|
||||||
**Source investigation:**
|
|
||||||
[docs/explorations/text-intent-interpretation-research.md](../explorations/text-intent-interpretation-research.md)
|
|
||||||
**Source
|
|
||||||
research docs:**
|
|
||||||
|
|
||||||
- [docs/research/text-communication-interpretation.md](text-communication-interpretation.md)
|
|
||||||
(Phase 1: humans reading text)
|
|
||||||
- [docs/research/llm-intent-interpretation.md](llm-intent-interpretation.md)
|
|
||||||
(Phase 1: LLMs reading prompts)
|
|
||||||
- [docs/research/human-llm-interpretation-overlap.md](human-llm-interpretation-overlap.md)
|
|
||||||
(Phase 2: synthesis)
|
|
||||||
- [docs/research/ai-coding-best-practices.md](ai-coding-best-practices.md)
|
|
||||||
(cross-reference: §2.1, §3.2, §3.4a, §3.5, §3.6, §3.7, §3.8, §7)
|
|
||||||
|
|
||||||
## How to read this document
|
|
||||||
|
|
||||||
Each entry has the same shape:
|
|
||||||
|
|
||||||
```
|
|
||||||
Failure mode → Why it happens → Mitigation that works → Tempting-but-wrong mitigation (anti-pattern) → Where to implement in this repo
|
|
||||||
```
|
|
||||||
|
|
||||||
The "tempting-but-wrong" line is the most important part. Many of the obvious
|
|
||||||
mitigations either (a) have no measurable effect or (b) actively hurt
|
|
||||||
performance — and they sound so reasonable they get added by default. If a
|
|
||||||
mitigation is on the anti-pattern list, _do not_ add it as a workaround when
|
|
||||||
something else fails.
|
|
||||||
|
|
||||||
Evidence-strength tags follow the synthesis doc's legend:
|
|
||||||
**[multi-replicated]**, **[single-study + partial replication]**,
|
|
||||||
**[single-study]**, **[preprint-only]**.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. Failure mode: misreading the user's actual question
|
|
||||||
|
|
||||||
### 1.1 Position-anchored priming (model defends a prior answer)
|
|
||||||
|
|
||||||
**Why it happens.** The model's previous turn sits in the context window and
|
|
||||||
acts as a prior the model subsequently defends. Follow-ups are read through the
|
|
||||||
lens of the prior position, not on their own terms. **[multi-replicated]** —
|
|
||||||
documented across model families; mechanism supported by ClashEval (Wu, Wu, Zou,
|
|
||||||
NeurIPS 2024) showing token-probability/adherence relationship.
|
|
||||||
|
|
||||||
**What works (in order of effectiveness):**
|
|
||||||
|
|
||||||
1. **Compaction or fresh context.** Physically remove the prior committed
|
|
||||||
answer. The anchor is broken. Use `PreCompact` to preserve only the user's
|
|
||||||
current question and the verified-correct state.
|
|
||||||
2. **Adversarial reframing.** Lower the model's confidence in its prior
|
|
||||||
commitment _before_ asking the next question: _"I believe your previous
|
|
||||||
answer was wrong because X. Now answer this specific question: ..."_
|
|
||||||
ClashEval's mechanism (lower token-probability prior → higher context
|
|
||||||
adherence) extends to this case in principle.
|
|
||||||
3. **Explicit current-question marker at the tail.** `UserPromptSubmit` hook
|
|
||||||
prepends `CURRENT QUESTION (answer this, not the prior exchange):` to the
|
|
||||||
prompt. Mechanical, cheap, observable.
|
|
||||||
|
|
||||||
**Tempting but wrong (do not do):**
|
|
||||||
|
|
||||||
- Repeating the question louder, adding emphasis, or asking the model to "read
|
|
||||||
more carefully." None of these change the anchor. They feel productive and do
|
|
||||||
nothing.
|
|
||||||
- Asking the model to re-state the question in its own words before answering.
|
|
||||||
In the no-oracle setting this can entrench the misreading rather than reset
|
|
||||||
it.
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- `UserPromptSubmit` hook (already exists at
|
|
||||||
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh))
|
|
||||||
is the right place for the current-question marker.
|
|
||||||
- Compaction logic in `PreCompact` hook (already exists at
|
|
||||||
[.agents/hooks/pre-compact.sh](../../.agents/hooks/pre-compact.sh)) is the
|
|
||||||
right place for the structured prior-discard.
|
|
||||||
|
|
||||||
### 1.2 Sycophancy (model defends the user's wrong claim)
|
|
||||||
|
|
||||||
**Why it happens.** Family-conditional behavior: some RLHF recipes
|
|
||||||
(Anthropic 2023) systematically push toward agreement with the user. **NOT** a
|
|
||||||
universal RLHF property — nostalgebraist (LessWrong, 2023) showed OpenAI base
|
|
||||||
models are not sycophantic at any size. **[single-study + partial replication]**
|
|
||||||
with the caveat that the effect depends on the model family in use.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- External feedback signals (test runners, hooks, type checkers, build) that
|
|
||||||
give the model a non-user source of truth.
|
|
||||||
- Explicit anti-sycophancy rules in `AGENTS.md` and agent bodies: _"Challenge
|
|
||||||
the user when the user is wrong,"_ _"Read a file before asserting facts about
|
|
||||||
it,"_ _"Only make changes that are directly requested."_
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- Telling the model "be more critical" or "push back when needed." On
|
|
||||||
sycophantic families this softens the floor but doesn't move the median; on
|
|
||||||
non-sycophantic families it's noise.
|
|
||||||
- LLM-as-judge of the user's own claim (self-critique loop without an oracle —
|
|
||||||
see §4.1 below).
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- [AGENTS.md](../../AGENTS.md) root anti-pattern list (already present).
|
|
||||||
- [.agents/AGENTS.md](../../.agents/AGENTS.md) per-agent rule reinforcement.
|
|
||||||
|
|
||||||
### 1.3 Persona / "you are an expert" prompting
|
|
||||||
|
|
||||||
**Why it happens.** Prompt-engineering folklore from 2022–2023 that expertise
|
|
||||||
personas improve accuracy. The 2025 literature falsifies this for accuracy
|
|
||||||
benchmarks. **[multi-replicated]** as a _negative_ result:
|
|
||||||
|
|
||||||
- Principled Personas (EMNLP 2025) — models are highly sensitive to irrelevant
|
|
||||||
persona details; performance drops of ~30pp from small attribute changes.
|
|
||||||
- Persona is a Double-Edged Sword (IJCNLP 2025) — mixed and unstable effects.
|
|
||||||
- [arXiv:2512.05858](https://arxiv.org/abs/2512.05858) — persona prompts
|
|
||||||
generally did not improve accuracy; low-knowledge personas (layperson, child,
|
|
||||||
outsider) often _reduced_ accuracy.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- Define _task contracts_ and _return formats_ for subagents (this is not the
|
|
||||||
same as injecting an expertise persona).
|
|
||||||
- Use the existing counterbalance agents
|
|
||||||
([.agents/agents/](../../.agents/agents/)) which are defined by what they
|
|
||||||
_counter_, not by what they _are an expert in_.
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- Adding `"You are a senior X engineer with 20 years of experience..."` to agent
|
|
||||||
prompts. No measurable effect on frontier models; on small models can hurt via
|
|
||||||
persona-attribute sensitivity.
|
|
||||||
- Expertise-ladder prompting (junior/senior/outsider) as an **accuracy**
|
|
||||||
improver. It is _only_ defensible as a divergent-ideation sampler for
|
|
||||||
brainstorm tasks where high variance is the goal — and even then, the final
|
|
||||||
answer should come from the un-personified model under an external rubric. See
|
|
||||||
revised
|
|
||||||
[docs/research/ai-coding-best-practices.md §7](ai-coding-best-practices.md).
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- Audit existing agent prompts in [.agents/agents/](../../.agents/agents/) for
|
|
||||||
any "you are an expert X" framing. Replace with negative-role and return-
|
|
||||||
format specs. (Action item, to be done after this plan is approved.)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. Failure mode: misreading specific tokens / instructions in long context
|
|
||||||
|
|
||||||
### 2.1 Lost-in-the-middle / serial-position effects
|
|
||||||
|
|
||||||
**Why it happens.** Transformer attention is quadratic in context length;
|
|
||||||
information in the middle of long contexts receives proportionally less
|
|
||||||
attention. **[single-study + partial replication]** — Liu et al. (2023)
|
|
||||||
established the U-shape; Bilan et al. (arXiv:2508.07479, 2025) shows the U-shape
|
|
||||||
only holds up to ~50% of context window; Mak (2025) shows positional- embedding
|
|
||||||
decay produces monotonic drop in very-long contexts; Zhang et al. (2024b)
|
|
||||||
non-replication on some model families. Effect is real but mechanism varies and
|
|
||||||
effective context is typically 30–50% of advertised.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- Task-critical content at the **tail** of context (recency bias is strong and
|
|
||||||
consistent across the tested models).
|
|
||||||
- Rules repeated at both ends (start AND tail), not just AGENTS.md (start only).
|
|
||||||
- Hooks injecting at the context tail outlast AGENTS.md under context pressure.
|
|
||||||
- Summarization-in-place for stale tool outputs (don't scroll, replace).
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- Putting more rules in AGENTS.md when the existing ones aren't being followed.
|
|
||||||
They are forgotten from the middle by ~5–10k tokens of subsequent context.
|
|
||||||
_Adding more makes it worse._ Move the rule to a hook instead.
|
|
||||||
- Increasing the model's context window. Effective attention does not scale with
|
|
||||||
advertised window; the middle gets _worse_, not better.
|
|
||||||
- "Reminding" the model with bold text or all-caps in AGENTS.md. Token-level
|
|
||||||
emphasis has no measurable effect on the LiM gradient.
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- Enforcement hierarchy in [.agents/AGENTS.md](../../.agents/AGENTS.md) already
|
|
||||||
encodes the right pattern.
|
|
||||||
- Existing hooks ([.agents/hooks/](../../.agents/hooks/)) already implement the
|
|
||||||
context-tail-injection pattern. New guidance should follow that pattern.
|
|
||||||
|
|
||||||
### 2.2 Sequential-constraint ordering failures
|
|
||||||
|
|
||||||
**Why it happens.** Cross-references documented in
|
|
||||||
[ai-coding-best-practices.md §4.6](ai-coding-best-practices.md). When a list of
|
|
||||||
constraints is given in one order but must be applied in another, models apply
|
|
||||||
in the order they read them, not the order they should be applied in.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- Re-order constraints in the prompt to match application order.
|
|
||||||
- Use a verifier (a hook, a test, a lint rule) instead of relying on the model
|
|
||||||
to compose constraints in the right order.
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- Numbered lists (1, 2, 3) implying priority order. Models don't reliably honor
|
|
||||||
numeric priority over textual position.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Failure mode: ambiguity in the user's request
|
|
||||||
|
|
||||||
### 3.1 Models do not ask clarifying questions by default
|
|
||||||
|
|
||||||
**Why it happens.** Pretraining favors confident-helpful continuations. Asking
|
|
||||||
for clarification reads as "less helpful" in preference data.
|
|
||||||
**[multi-replicated]** in conversational AI literature.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- Explicit instruction in the system prompt: _"If the user's intent is unclear,
|
|
||||||
infer the most useful likely action and proceed with using tools to discover
|
|
||||||
missing details instead of guessing"_ — paired with a structured
|
|
||||||
ambiguity-flagging mechanism (e.g., the agent surfaces an explicit "assumption
|
|
||||||
made: X" line before acting).
|
|
||||||
- For high-stakes operations: ask one targeted clarifying question with options
|
|
||||||
(the existing ask-question tool / `vscode_askQuestions` pattern).
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- Telling the model "ask if anything is unclear." Models report nothing as
|
|
||||||
unclear that they could fluently continue past. The instruction has near- zero
|
|
||||||
effect.
|
|
||||||
- Adding many "do you mean X or Y?" examples in the prompt. Few-shot examples
|
|
||||||
for capable models on common tasks often actively harm via spurious
|
|
||||||
pattern-matching
|
|
||||||
([ai-coding-best-practices.md §7](ai-coding-best-practices.md)).
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- The default agent's `copilot-instructions` (if used here) or
|
|
||||||
[AGENTS.md](../../AGENTS.md) operational rules section.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. Failure mode: trying to fix it by asking the model to fix itself
|
|
||||||
|
|
||||||
### 4.1 Intrinsic self-correction without an oracle
|
|
||||||
|
|
||||||
**Why it happens.** It feels like reflection should help. Empirically it
|
|
||||||
doesn't, and often it hurts. **[multi-replicated]** as a negative result:
|
|
||||||
|
|
||||||
- Huang et al. ([arXiv:2310.01798](https://arxiv.org/abs/2310.01798), "Large
|
|
||||||
Language Models Cannot Self-Correct Reasoning Yet"): in the intrinsic
|
|
||||||
(no-oracle) setting, self-correction **consistently decreases** reasoning
|
|
||||||
performance across multiple prompts and tasks. Prior optimism about
|
|
||||||
self-correction in earlier papers vanishes when oracle labels are removed.
|
|
||||||
- Pan et al. (arXiv:2308.03188): survey reaches the same conclusion in aggregate
|
|
||||||
— external feedback signals are reliable; intrinsic self-critique is not.
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- External feedback signal: test runner, type checker, lint, hook exit code,
|
|
||||||
build success. Reflexion (Shinn et al., arXiv:2303.11366) achieves 91% pass@1
|
|
||||||
on HumanEval _with_ an external oracle — without one, the loop is noise.
|
|
||||||
- Failure-mode-routed intervention: a small judge subagent that classifies the
|
|
||||||
failure mode and selects the matching intervention (see
|
|
||||||
[ai-coding-best-practices.md §3.5](ai-coding-best-practices.md) table). The
|
|
||||||
judge must be a stronger or cross-family model; same-family same-size judging
|
|
||||||
compounds bias.
|
|
||||||
|
|
||||||
**Tempting but wrong (this is the single most common anti-pattern):**
|
|
||||||
|
|
||||||
- _"Take another look,"_ _"are you sure?"_ _"please double-check your work,"_
|
|
||||||
_"reflect on whether this is correct."_ All of these feel productive in
|
|
||||||
transcripts. Without an external oracle they are at best noise and measurably
|
|
||||||
degrade correctness in the published benchmark. Do not add them.
|
|
||||||
- LLM-as-judge with the same model evaluating itself. Self-enhancement bias
|
|
||||||
(Zheng et al. 2023, MT-Bench) — same-family judges over-score their own
|
|
||||||
family's outputs.
|
|
||||||
|
|
||||||
**Where in this repo:**
|
|
||||||
|
|
||||||
- Verification is already correctly in the harness (build, lint, tests, hooks)
|
|
||||||
rather than the prompt — see
|
|
||||||
[ai-coding-best-practices.md §8.1](ai-coding-best-practices.md) and the
|
|
||||||
existing hook set.
|
|
||||||
- The reflection-without-oracle anti-pattern should be added explicitly to
|
|
||||||
[AGENTS.md](../../AGENTS.md) `<implementationDiscipline>` so it doesn't creep
|
|
||||||
back in as a "let me check my work" pattern.
|
|
||||||
|
|
||||||
### 4.2 Chain-of-thought as a universal fix
|
|
||||||
|
|
||||||
**Why it happens.** CoT works on some tasks; folklore generalized it to all
|
|
||||||
tasks. **[single-study + partial replication]** as a _negative_ finding for the
|
|
||||||
universalization:
|
|
||||||
|
|
||||||
- [arXiv:2409.06173](https://arxiv.org/abs/2409.06173) shows CoT suffers from
|
|
||||||
posterior collapse: larger models anchor _harder_ to reasoning priors under
|
|
||||||
CoT on subjective tasks (emotion, morality, intent inference).
|
|
||||||
|
|
||||||
**What works:**
|
|
||||||
|
|
||||||
- CoT for objective, verifiable reasoning (math, code logic, step-counted
|
|
||||||
inference).
|
|
||||||
- Think-Anywhere (Jiang et al., arXiv:2603.29957) and interleaved thinking
|
|
||||||
(Claude 4.x extended thinking) — mid-sequence reasoning at high-entropy
|
|
||||||
positions, not just upfront planning.
|
|
||||||
|
|
||||||
**Tempting but wrong:**
|
|
||||||
|
|
||||||
- _"Let's think step by step"_ preambles for reasoning-trained models — at best
|
|
||||||
redundant (the model is already trained to reason); at worst it entrenches a
|
|
||||||
wrong prior on subjective tasks.
|
|
||||||
- Long CoT on intent-interpretation tasks. The model can reason itself _further
|
|
||||||
into_ the misread.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Cross-cutting principle: the harness is where intent gets clarified
|
|
||||||
|
|
||||||
A unifying claim from the synthesis doc that survives both the human and LLM
|
|
||||||
literature: when ambiguity is high, neither a human nor a model resolves it by
|
|
||||||
"reading more carefully." Resolution happens through **external signal** —
|
|
||||||
question, test, lint, hook, oracle. The harness is where the external signal
|
|
||||||
lives. The prompt is where the rule of "use the external signal" lives.
|
|
||||||
|
|
||||||
Every action in this plan reduces to one of three moves:
|
|
||||||
|
|
||||||
1. **Move the rule into the harness.** Hooks, tests, type checkers, lint. These
|
|
||||||
are unambiguous and fire deterministically.
|
|
||||||
2. **Reduce reliance on context-middle attention.** Context-tail injection,
|
|
||||||
compaction, structured retrieval.
|
|
||||||
3. **Reduce reliance on self-critique.** External oracles, cross-family judges,
|
|
||||||
structured failure routing.
|
|
||||||
|
|
||||||
If a proposed mitigation does not fit one of these three, it probably belongs on
|
|
||||||
the tempting-but-wrong list.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 6. Proposed concrete edits (for user approval)
|
|
||||||
|
|
||||||
This plan does not yet ship code changes. Proposed next steps in dependency
|
|
||||||
order:
|
|
||||||
|
|
||||||
- [ ] **A.** Audit [.agents/agents/](../../.agents/agents/) bodies for "you are
|
|
||||||
an expert X" framing and replace with negative-role / return- format
|
|
||||||
specs. Likely small edits to 1–4 files.
|
|
||||||
- [ ] **B.** Add an anti-pattern bullet to
|
|
||||||
[.agents/AGENTS.md](../../.agents/AGENTS.md) calling out _"reflect /
|
|
||||||
double-check / are you sure"_ as a non-mitigation without an external
|
|
||||||
oracle. Scoped to `.agents/` (not root `AGENTS.md`) because it is
|
|
||||||
metaknowledge about agent design — only relevant when authoring agent
|
|
||||||
infrastructure, not when writing application code where tests are the
|
|
||||||
oracle anyway.
|
|
||||||
- [x] **C.** Add a `CURRENT QUESTION (answer this, not the prior exchange):`
|
|
||||||
prefix-injection option to
|
|
||||||
[.agents/hooks/user-prompt-submit.sh](../../.agents/hooks/user-prompt-submit.sh),
|
|
||||||
either always-on or gated on a follow-up trigger phrase. **Shipped
|
|
||||||
always-on** (Revision 7, 2026-05-16). Placed last in `additionalContext`
|
|
||||||
(context tail = highest recency bias). Validated by S2A (Weston &
|
|
||||||
Sukhbaatar, arXiv:2311.11829): explicitly isolating the current query from
|
|
||||||
prior context reduces sycophancy and improves factuality without a second
|
|
||||||
LLM call. Same mechanism as the ClashEval token-probability anchoring
|
|
||||||
research cited in §1.1.
|
|
||||||
- [x] **D.** Add an `ambiguity-flag` convention: when the agent infers user
|
|
||||||
intent past a real ambiguity, surface a one-line `ASSUMPTION:` marker
|
|
||||||
before proceeding. Documented in [AGENTS.md](../../AGENTS.md); enforceable
|
|
||||||
optionally via a `PreToolUse` check on certain destructive tools.
|
|
||||||
**Shipped as documentation** in root `AGENTS.md` "Key Rules" section
|
|
||||||
(Revision 7, 2026-05-16). PreToolUse enforcement deferred — would fire on
|
|
||||||
every destructive call regardless of whether there was genuine ambiguity,
|
|
||||||
producing noise without selectivity.
|
|
||||||
- [ ] **E.** Update
|
|
||||||
[docs/verified/ai-coding-best-practices.md](../verified/ai-coding-best-practices.md)
|
|
||||||
summary to reflect the three corrections from Revision 6 of the research
|
|
||||||
doc (sycophancy family-conditional, intrinsic self-correction is the
|
|
||||||
strongest anti-pattern, persona-ladder scoped to ideation only).
|
|
||||||
|
|
||||||
Open question for the user: which of A–E should ship in this conversation, which
|
|
||||||
need a separate task, and which should be discarded?
|
|
||||||
@ -1,598 +0,0 @@
|
|||||||
# llama-server with CUDA on WSL2
|
|
||||||
|
|
||||||
Guide to deploying `llama-server` (llama.cpp) as a systemd service on WSL2 with
|
|
||||||
full NVIDIA GPU offload via CUDA. Configured in **router mode** to serve
|
|
||||||
multiple GGUF models on-demand (with optional MTP speculative decoding) via an
|
|
||||||
OpenAI-compatible API.
|
|
||||||
|
|
||||||
**Target environment:**
|
|
||||||
|
|
||||||
- WSL2 (Ubuntu 24.04 Noble)
|
|
||||||
- NVIDIA RTX 3080 12GB (or similar), driver exposed via WSL2 GPU passthrough
|
|
||||||
- No separate CUDA toolkit install required to _run_; only needed when building
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Why not Ollama?
|
|
||||||
|
|
||||||
Ollama vendors a pinned version of llama.cpp and bundles its own CUDA runtime.
|
|
||||||
New model architectures (like `qwen35` / Qwen3-Next) may not be supported until
|
|
||||||
Ollama syncs its fork. `llama-server` from upstream llama.cpp supports them as
|
|
||||||
soon as the architecture lands in the main branch.
|
|
||||||
|
|
||||||
**Ollama does nothing special** beyond: bundling `libggml-cuda.so` alongside its
|
|
||||||
runner and setting `PATH` to include `/usr/lib/wsl/lib` (the WSL2 CUDA driver
|
|
||||||
passthrough). No flash-attention env vars, no special flags. We replicate this.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Verify WSL2 CUDA driver passthrough is working
|
|
||||||
ls /usr/lib/wsl/lib/libcuda.so.1 # must exist
|
|
||||||
nvidia-smi # must show your GPU
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 1 — Install CUDA toolkit and build dependencies
|
|
||||||
|
|
||||||
> Only needed once per machine to compile llama.cpp. Not needed at runtime.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo apt-get install -y nvidia-cuda-toolkit cmake build-essential git
|
|
||||||
```
|
|
||||||
|
|
||||||
Ubuntu 24.04 ships CUDA 12.0 in the `multiverse` repo. This is sufficient to
|
|
||||||
build llama.cpp with CUDA support even when the runtime driver is newer (e.g.
|
|
||||||
CUDA 13.1 via WSL2 passthrough). Alternatively, install CUDA 12.x from
|
|
||||||
[NVIDIA's own APT repo](https://developer.nvidia.com/cuda-downloads) to get a
|
|
||||||
more recent toolkit.
|
|
||||||
|
|
||||||
Verify the compiler is available:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
nvcc --version
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 2 — Clone and build llama.cpp from source
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Clone at a specific tag — check https://github.com/ggml-org/llama.cpp/releases for latest
|
|
||||||
# b9144+ required for qwen35 architecture (Qwen3.6, OmniCoder 2, etc.)
|
|
||||||
# b9279+ required for MTP speculative decoding (--spec-type draft-mtp)
|
|
||||||
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
|
|
||||||
cd /tmp/llama-build
|
|
||||||
|
|
||||||
# Configure with CUDA backend
|
|
||||||
cmake -B build \
|
|
||||||
-DGGML_CUDA=ON \
|
|
||||||
-DCMAKE_BUILD_TYPE=Release \
|
|
||||||
-DLLAMA_BUILD_SERVER=ON \
|
|
||||||
-DLLAMA_BUILD_TESTS=OFF \
|
|
||||||
-DLLAMA_BUILD_EXAMPLES=OFF
|
|
||||||
|
|
||||||
# Build (uses all cores; takes 10-15 min on a 12-core CPU)
|
|
||||||
cmake --build build --config Release -j$(nproc)
|
|
||||||
```
|
|
||||||
|
|
||||||
After the build completes you should see `build/bin/llama-server`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 3 — Install to /opt/llama-server
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo mkdir -p /opt/llama-server
|
|
||||||
|
|
||||||
# Copy the server binary
|
|
||||||
sudo cp build/bin/llama-server /opt/llama-server/
|
|
||||||
|
|
||||||
# Copy all shared libraries (b9144+ puts them all in build/bin/)
|
|
||||||
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true
|
|
||||||
|
|
||||||
# Register the directory so transitive .so dependencies resolve
|
|
||||||
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
|
|
||||||
sudo ldconfig
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Note (b9144+):** The library layout changed — all `.so` files now live in
|
|
||||||
> `build/bin/` (not `build/ggml/src/` or `build/src/`). When upgrading, copy
|
|
||||||
> with `-P` to preserve versioned symlinks and overwrite the old ones.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 4 — Create the start script
|
|
||||||
|
|
||||||
Run llama-server in **router mode** — no `--model` flag. Models are loaded
|
|
||||||
on-demand from `~/models/` when a request names them. Switching models requires
|
|
||||||
no restart and no `sudo`: just change the `model` field in `opencode.json`.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo tee /opt/llama-server/start.sh > /dev/null << 'SCRIPT'
|
|
||||||
#!/bin/bash
|
|
||||||
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
|
|
||||||
cd /opt/llama-server
|
|
||||||
exec /opt/llama-server/llama-server \
|
|
||||||
--models-dir /home/dev/models \
|
|
||||||
--models-max 1 \
|
|
||||||
--models-preset /home/dev/models/presets.ini \
|
|
||||||
--host 127.0.0.1 \
|
|
||||||
--port 8080
|
|
||||||
SCRIPT
|
|
||||||
sudo chmod +x /opt/llama-server/start.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
**Key router flags:**
|
|
||||||
|
|
||||||
- `--models-dir` — directory scanned for GGUF files. Flat `.gguf` files become
|
|
||||||
model IDs using the filename **without** `.gguf`. Subdirectories become model
|
|
||||||
IDs using the directory name (used for multimodal models with a separate
|
|
||||||
mmproj file — see _Multimodal models_ below).
|
|
||||||
- `--models-max 1` — only one model resident at a time. When a different model
|
|
||||||
is requested, the current one is evicted and the new one loads (cold-start
|
|
||||||
delay). With 12GB VRAM this is required.
|
|
||||||
- `--models-preset` — path to `presets.ini` for global defaults and per-model
|
|
||||||
overrides. All inference flags belong here, not in `start.sh`.
|
|
||||||
|
|
||||||
**Per-model settings via `presets.ini`**
|
|
||||||
|
|
||||||
All inference flags (`ctx-size`, `n-predict`, `n-gpu-layers`, `flash-attn`,
|
|
||||||
`threads`, `parallel`, `jinja`, `spec-type`, etc.) live in
|
|
||||||
`~/models/presets.ini`, not in `start.sh`. The `[*]` section sets defaults
|
|
||||||
inherited by every model; named sections override individual keys.
|
|
||||||
|
|
||||||
Section names must match the router's model ID — the filename **without**
|
|
||||||
`.gguf`. Using the `.gguf` suffix in a section name creates a duplicate entry in
|
|
||||||
the model list.
|
|
||||||
|
|
||||||
```ini
|
|
||||||
version = 1
|
|
||||||
|
|
||||||
[*]
|
|
||||||
n-gpu-layers = 99
|
|
||||||
flash-attn = on
|
|
||||||
threads = 8
|
|
||||||
parallel = 1
|
|
||||||
|
|
||||||
[Qwen_Qwen3-14B-Q4_K_M]
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
[OmniCoder-2-9B.Q8_0]
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
[Qwen_Qwen3.6-27B-Q4_K_M]
|
|
||||||
ctx-size = 16384
|
|
||||||
n-predict = 4096
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Note:** The router reads `presets.ini` **once at service startup** — it is
|
|
||||||
> not watched for changes. After editing it, run
|
|
||||||
> `sudo systemctl restart llama-server` to apply the new settings. Any
|
|
||||||
> currently-loaded model will be evicted and must cold-reload on the next
|
|
||||||
> request (~10–60 s).
|
|
||||||
|
|
||||||
**On GPU layer offload:** Hybrid inference (some layers on CPU, some on GPU) is
|
|
||||||
significantly slower than full-GPU due to CPU↔GPU memory transfers each forward
|
|
||||||
pass. For interactive use, prefer models that fit entirely in VRAM. MoE models
|
|
||||||
(like Qwen3.6-35B-A3B) are an exception — their sparse activation means active
|
|
||||||
computation per token is only ~3B parameters regardless of total model size, so
|
|
||||||
partial CPU offload is less painful than with a dense model of the same file
|
|
||||||
size. See the _Model choice_ section below.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 5 — Create the systemd service
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
|
|
||||||
[Unit]
|
|
||||||
Description=llama-server (OmniCoder 2 / qwen35)
|
|
||||||
After=network-online.target
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
ExecStart=/opt/llama-server/start.sh
|
|
||||||
User=ollama
|
|
||||||
Group=ollama
|
|
||||||
Restart=always
|
|
||||||
RestartSec=3
|
|
||||||
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin"
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=default.target
|
|
||||||
EOF
|
|
||||||
|
|
||||||
sudo systemctl daemon-reload
|
|
||||||
sudo systemctl enable llama-server
|
|
||||||
sudo systemctl start llama-server
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Note:** The `PATH` includes `/usr/lib/wsl/lib` — this is what exposes the
|
|
||||||
> CUDA driver (`libcuda.so.1`) to the process in WSL2. Without this, the CUDA
|
|
||||||
> backend will load but fail to initialize the device.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 6 — Verify GPU offload
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check service is running
|
|
||||||
systemctl status llama-server
|
|
||||||
|
|
||||||
# Health endpoint
|
|
||||||
curl -s http://127.0.0.1:8080/health
|
|
||||||
# → {"status":"ok"}
|
|
||||||
|
|
||||||
# Watch GPU memory in another terminal during a request
|
|
||||||
watch -n1 nvidia-smi
|
|
||||||
|
|
||||||
# Quick inference test
|
|
||||||
curl -s http://127.0.0.1:8080/v1/chat/completions \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"model":"local","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
|
|
||||||
| python3 -m json.tool
|
|
||||||
```
|
|
||||||
|
|
||||||
During inference, `nvidia-smi` should show:
|
|
||||||
|
|
||||||
- GPU-Util: 80-100%
|
|
||||||
- GPU Memory: ~10-11GB used (model weights + KV cache)
|
|
||||||
- CPU: near idle
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Quick inference test (node instead of python3)
|
|
||||||
curl -s http://127.0.0.1:8080/v1/chat/completions \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"model":"Qwen_Qwen3-14B-Q4_K_M","messages":[{"role":"user","content":"Say hello."}],"max_tokens":20}' \
|
|
||||||
| node -e "let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>console.log(JSON.parse(d).choices[0].message.content))"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Step 7 — Configure OpenCode
|
|
||||||
|
|
||||||
Edit `~/.config/opencode/opencode.json` to add the provider. Model IDs are the
|
|
||||||
filenames **without** `.gguf` (or the subdirectory name for multimodal models).
|
|
||||||
The `limit` values here inform opencode's context window tracking; the actual
|
|
||||||
server-side limits are set in `presets.ini`.
|
|
||||||
|
|
||||||
```json
|
|
||||||
"llama-server": {
|
|
||||||
"npm": "@ai-sdk/openai-compatible",
|
|
||||||
"name": "llama-server",
|
|
||||||
"options": { "baseURL": "http://127.0.0.1:8080/v1" },
|
|
||||||
"models": {
|
|
||||||
"Qwen_Qwen3-14B-Q4_K_M": {
|
|
||||||
"name": "Qwen3 14B Q4 (fast)",
|
|
||||||
"tools": true,
|
|
||||||
"limit": { "context": 32768, "output": 4096 }
|
|
||||||
},
|
|
||||||
"Qwen_Qwen3.6-27B-Q4_K_M": {
|
|
||||||
"name": "Qwen3.6 27B Q4 (deep)",
|
|
||||||
"tools": true,
|
|
||||||
"limit": { "context": 16384, "output": 4096 }
|
|
||||||
},
|
|
||||||
"OmniCoder-2-9B.Q8_0": {
|
|
||||||
"name": "OmniCoder 2 9B Q8 (vision)",
|
|
||||||
"tools": true,
|
|
||||||
"limit": { "context": 32768, "output": 4096 }
|
|
||||||
},
|
|
||||||
"Qwen3.6-35B-A3B-IQ3_S-3.06bpw": {
|
|
||||||
"name": "Qwen3.6 35B A3B IQ3 (MoE+MTP)",
|
|
||||||
"tools": true,
|
|
||||||
"limit": { "context": 8192, "output": 4096 }
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
In the project-level `opencode.json`, set the active model per agent:
|
|
||||||
|
|
||||||
```json
|
|
||||||
"agent": {
|
|
||||||
"orchestrator": {
|
|
||||||
"model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Model choice for RTX 3080 12GB
|
|
||||||
|
|
||||||
Pick based on what fits **entirely** in VRAM — hybrid inference (model too large
|
|
||||||
for VRAM) is 4–8× slower and makes interactive use painful. MoE models are an
|
|
||||||
exception; see note below the table.
|
|
||||||
|
|
||||||
| Model | File size | Fits in 12GB? | Speed (est.) | Notes |
|
|
||||||
| ------------------------------- | --------- | ------------- | ------------- | ---------------------------------------------------------------------------------------- |
|
|
||||||
| Qwen3-8B Q4_K_M | ~5 GB | ✅ fully | ~25–35 tok/s | Fast; weaker reasoning |
|
|
||||||
| **Qwen3-14B Q4_K_M** | ~8.5 GB | ✅ fully | ~12–18 tok/s | **Daily driver** — fast interactive use, good instruction following |
|
|
||||||
| OmniCoder-2-9B Q8_0 | ~9.5 GB | ✅ fully | ~15–20 tok/s | Vision-capable (multimodal); subdirectory layout for auto-detected mmproj |
|
|
||||||
| **Qwen3.6-27B Q4_K_M** | 17 GB | ⚠️ partial | ~4–8 tok/s | **Deep reasoning** — better at vague/complex tasks; slow due to CPU offload |
|
|
||||||
| **Qwen3.6-35B-A3B IQ3_S (MTP)** | 13.6 GB | ⚠️ partial | ~20–35 tok/s† | **MoE + MTP** — sparse activation (~3B active params); needs MTP-format GGUF (byteshape) |
|
|
||||||
| Qwen3-32B Q4_K_M | ~20 GB | ❌ | — | Won't fit |
|
|
||||||
|
|
||||||
† MoE speed estimate with `--spec-type draft-mtp`. Despite 13.6 GB file size,
|
|
||||||
only ~1.6 GB needs CPU offload (few dense attention layers overflow VRAM). The
|
|
||||||
sparse feed-forward experts make active-parameter compute comparable to a 3B
|
|
||||||
dense model.
|
|
||||||
|
|
||||||
All models sit in `~/models/` simultaneously and are swapped on-demand by the
|
|
||||||
router. Cold-swap time is ~10s (9–14B) / ~30–45s (27B+).
|
|
||||||
|
|
||||||
Download from bartowski on HuggingFace (imatrix quants, standard GGUF format):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
mkdir -p ~/models
|
|
||||||
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_K_M.gguf" \
|
|
||||||
-O ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf
|
|
||||||
```
|
|
||||||
|
|
||||||
> **⚠️ Use HuggingFace GGUFs, not Ollama blobs for qwen35-architecture models.**
|
|
||||||
> Ollama's converter outputs different tensor names and per-layer KV-head arrays
|
|
||||||
> that are incompatible with llama.cpp's `qwen35` model loader. Symptoms:
|
|
||||||
> `missing tensor 'blk.0.ssm_dt'`, `check_tensor_dims: wrong shape`, or
|
|
||||||
> `rope.dimension_sections has wrong array length`. Always download from
|
|
||||||
> bartowski or unsloth on HuggingFace for these models.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Switching models
|
|
||||||
|
|
||||||
With router mode, switching requires **no restart and no `sudo`**. Place GGUFs
|
|
||||||
in `~/models/` and reference them by model ID in `opencode.json`.
|
|
||||||
|
|
||||||
### Add a model
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Download to ~/models/ — filename without .gguf becomes the model ID
|
|
||||||
wget -c "https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/resolve/main/Qwen_Qwen3.6-27B-Q4_K_M.gguf" \
|
|
||||||
-O ~/models/Qwen_Qwen3.6-27B-Q4_K_M.gguf
|
|
||||||
```
|
|
||||||
|
|
||||||
Then add a section to `~/models/presets.ini` (name = filename without `.gguf`):
|
|
||||||
|
|
||||||
```ini
|
|
||||||
[Qwen_Qwen3.6-27B-Q4_K_M]
|
|
||||||
ctx-size = 16384
|
|
||||||
n-predict = 4096
|
|
||||||
```
|
|
||||||
|
|
||||||
And register it in `~/.config/opencode/opencode.json`:
|
|
||||||
|
|
||||||
```json
|
|
||||||
"Qwen_Qwen3.6-27B-Q4_K_M": {
|
|
||||||
"name": "Qwen3.6 27B Q4 (deep)",
|
|
||||||
"tools": true,
|
|
||||||
"limit": { "context": 16384, "output": 4096 }
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Switch active model
|
|
||||||
|
|
||||||
Edit `opencode.json` (project-level or `~/.config/opencode/opencode.json`) and
|
|
||||||
change the agent's `model` to `llama-server/<model-id>`:
|
|
||||||
|
|
||||||
```json
|
|
||||||
"agent": {
|
|
||||||
"orchestrator": {
|
|
||||||
"model": "llama-server/Qwen_Qwen3-14B-Q4_K_M"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
The next request triggers a cold load of the new model (~10–30s for 14B, ~30–60s
|
|
||||||
for 27B+). No service restart needed. `--models-max 1` ensures the previous
|
|
||||||
model is evicted from VRAM automatically.
|
|
||||||
|
|
||||||
To switch from the CLI without editing files:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
opencode run -m "llama-server/Qwen_Qwen3-14B-Q4_K_M" "your message here"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Multimodal models
|
|
||||||
|
|
||||||
For models with a separate vision encoder (mmproj), use a **subdirectory** in
|
|
||||||
`~/models/`. The directory name becomes the model ID; llama.cpp auto-detects any
|
|
||||||
file whose name starts with `mmproj` as the projector.
|
|
||||||
|
|
||||||
```
|
|
||||||
~/models/
|
|
||||||
OmniCoder-2-9B.Q8_0/ ← model ID = "OmniCoder-2-9B.Q8_0"
|
|
||||||
OmniCoder-2-9B.Q8_0.gguf ← main weights
|
|
||||||
mmproj-Q8_0.gguf ← vision projector (auto-detected)
|
|
||||||
```
|
|
||||||
|
|
||||||
### List available models
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# See what's in ~/models/ (all are immediately usable as model IDs)
|
|
||||||
ls ~/models/
|
|
||||||
|
|
||||||
# See what's currently loaded
|
|
||||||
curl -s http://127.0.0.1:8080/v1/models | node -e \
|
|
||||||
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id, m.meta?.loaded ? '[loaded]' : '[unloaded]')))"
|
|
||||||
|
|
||||||
# Force a rescan (picks up newly added model files)
|
|
||||||
curl -s 'http://127.0.0.1:8080/models?reload=1' | node -e \
|
|
||||||
"process.stdin.resume(); let d=''; process.stdin.on('data',c=>d+=c); process.stdin.on('end',()=>JSON.parse(d).data.forEach(m=>console.log(m.id)))"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Auto-restart on presets.ini change
|
|
||||||
|
|
||||||
The router caches `presets.ini` at startup, so any edit requires a service
|
|
||||||
restart to take effect. You can automate this with a systemd **path unit** that
|
|
||||||
watches the file and triggers a restart whenever it is written:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo tee /etc/systemd/system/llama-server-presets.path > /dev/null << 'EOF'
|
|
||||||
[Unit]
|
|
||||||
Description=Restart llama-server when presets.ini changes
|
|
||||||
|
|
||||||
[Path]
|
|
||||||
PathChanged=/home/dev/models/presets.ini
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=default.target
|
|
||||||
EOF
|
|
||||||
|
|
||||||
sudo tee /etc/systemd/system/llama-server-presets.service > /dev/null << 'EOF'
|
|
||||||
[Unit]
|
|
||||||
Description=Restart llama-server (triggered by presets.ini change)
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
Type=oneshot
|
|
||||||
ExecStart=/bin/systemctl restart llama-server
|
|
||||||
EOF
|
|
||||||
|
|
||||||
sudo systemctl daemon-reload
|
|
||||||
sudo systemctl enable --now llama-server-presets.path
|
|
||||||
```
|
|
||||||
|
|
||||||
After this, saving `~/models/presets.ini` automatically restarts the service (~3
|
|
||||||
s) and the next inference request cold-loads the model with the new settings.
|
|
||||||
The restart is intentionally disruptive — the currently-loaded model is evicted
|
|
||||||
— so only enable this if disruptive restarts on every presets save are
|
|
||||||
acceptable.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## MTP speculative decoding
|
|
||||||
|
|
||||||
Multi-Token Prediction (MTP) lets the model predict several tokens per forward
|
|
||||||
pass using draft heads baked into the model weights — no separate draft model
|
|
||||||
needed. For Qwen3.6-35B-A3B this roughly doubles throughput (from ~15 tok/s to
|
|
||||||
~25–35 tok/s on RTX 3080) while preserving output quality.
|
|
||||||
|
|
||||||
**Requirements:**
|
|
||||||
|
|
||||||
1. **b9279+ binary** — `--spec-type draft-mtp` was added in this era. Verify:
|
|
||||||
```bash
|
|
||||||
/opt/llama-server/llama-server --help | grep spec-type
|
|
||||||
# must list draft-mtp
|
|
||||||
```
|
|
||||||
2. **MTP-format GGUF** — standard bartowski/unsloth quants do not include MTP
|
|
||||||
heads. Use byteshape's dedicated MTP GGUFs:
|
|
||||||
```bash
|
|
||||||
# IQ3_S (13.6 GB) — best quality/size for 12 GB VRAM with slight CPU offload
|
|
||||||
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" \
|
|
||||||
-O ~/models/Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf
|
|
||||||
# IQ2_S (10 GB) — fully fits in VRAM; heavier quantization
|
|
||||||
wget -c "https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf" \
|
|
||||||
-O ~/models/Qwen3.6-35B-A3B-IQ2_S-2.25bpw.gguf
|
|
||||||
```
|
|
||||||
|
|
||||||
**`presets.ini` section for MTP:**
|
|
||||||
|
|
||||||
```ini
|
|
||||||
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
spec-type = draft-mtp
|
|
||||||
spec-draft-p-min = 0.75
|
|
||||||
spec-draft-n-max = 3
|
|
||||||
```
|
|
||||||
|
|
||||||
- `spec-draft-p-min` — minimum draft token acceptance probability. 0.75 is a
|
|
||||||
good starting point; lower values accept more speculative tokens (faster but
|
|
||||||
may diverge from non-speculative output).
|
|
||||||
- `spec-draft-n-max` — maximum tokens to speculate per step. 3 is the sweet spot
|
|
||||||
for Qwen3.6 MTP; higher values have diminishing returns and add overhead.
|
|
||||||
|
|
||||||
**Note:** ik_llama.cpp (a fork) achieves ~10–20% higher throughput with MTP than
|
|
||||||
official llama.cpp due to a more optimized MTP head implementation. Official
|
|
||||||
llama.cpp MTP is still significantly faster than non-speculative inference and
|
|
||||||
is the simpler setup.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Active model keeps resetting to the configured default
|
|
||||||
|
|
||||||
Known opencode bug [#28735](https://github.com/anomalyco/opencode/issues/28735)
|
|
||||||
(open as of May 2026): when a background subagent result is delivered back into
|
|
||||||
the main session, the active model resets to whatever `orchestrator.model` is
|
|
||||||
configured in `opencode.json`. This means any model switch made via `-m` flag or
|
|
||||||
the TUI selector gets silently reverted whenever a tool call or subagent
|
|
||||||
completes.
|
|
||||||
|
|
||||||
**Workaround:** keep `orchestrator.model` in `opencode.json` set to the model
|
|
||||||
you actually want to use. The reset lands on the configured model, so if it
|
|
||||||
matches your intent there's no observable effect.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### `no backends are loaded` at startup
|
|
||||||
|
|
||||||
The backend `.so` plugins must be in the same directory as the binary, or on
|
|
||||||
`LD_LIBRARY_PATH`. The `start.sh` script sets this explicitly.
|
|
||||||
|
|
||||||
### `make_cpu_buft_list: no CPU backend found`
|
|
||||||
|
|
||||||
Install `libgomp1` (OpenMP runtime — required by the CPU backend):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo apt-get install -y libgomp1
|
|
||||||
```
|
|
||||||
|
|
||||||
### CUDA device not found / GPU not offloading
|
|
||||||
|
|
||||||
- Confirm `/usr/lib/wsl/lib` is in `PATH` or `LD_LIBRARY_PATH` for the process
|
|
||||||
- Run `nvidia-smi` as the service user: `sudo -u ollama nvidia-smi`
|
|
||||||
- Check `journalctl -u llama-server -n 50` for lines like
|
|
||||||
`ggml_cuda_init: CUDA not found`
|
|
||||||
|
|
||||||
### High CPU / fan noise at idle
|
|
||||||
|
|
||||||
- Remove `--no-mmap` if present (forces 9GB into RAM on startup)
|
|
||||||
- Check `--n-parallel` isn't set high (default 1 is fine for single-user use)
|
|
||||||
- llama-server is permanently loaded; fans will spin during model load (~30s)
|
|
||||||
then drop to zero at idle — this is expected behavior
|
|
||||||
|
|
||||||
### `qwen35` architecture errors (rope, tensor shape, missing tensor)
|
|
||||||
|
|
||||||
These errors all indicate an **incompatible GGUF source**:
|
|
||||||
|
|
||||||
- `rope.dimension_sections has wrong array length; expected 4, got 3` — Ollama
|
|
||||||
stores a 3-element array; llama.cpp (before a patch) expects 4.
|
|
||||||
- `missing tensor 'blk.0.ssm_dt'` or `blk.0.ssm_dt.bias` — Ollama omits the
|
|
||||||
`.bias` suffix that HuggingFace-converted GGUFs use (or vice versa).
|
|
||||||
- `check_tensor_dims: wrong shape` on `blk.N.attn_k.weight` — Ollama's converter
|
|
||||||
stores `head_count_kv` as a per-layer array; llama.cpp's qwen35 model loader
|
|
||||||
expects a scalar.
|
|
||||||
|
|
||||||
**Solution:** use HuggingFace GGUFs (bartowski or unsloth) instead of Ollama
|
|
||||||
blobs for any `qwen35`-architecture model. See _Model choice_ above.
|
|
||||||
|
|
||||||
### Upgrading llama.cpp (replacing binaries while service is running)
|
|
||||||
|
|
||||||
The service holds the binary open; `cp` will fail with `Text file busy`. Always
|
|
||||||
stop the service first:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo systemctl stop llama-server
|
|
||||||
sudo cp build/bin/llama-server /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/lib*.so* /opt/llama-server/
|
|
||||||
sudo systemctl start llama-server
|
|
||||||
```
|
|
||||||
|
|
||||||
### Model file permissions (service runs as `ollama` user)
|
|
||||||
|
|
||||||
Files downloaded as your user aren't readable by the `ollama` service user:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Make model file readable by all
|
|
||||||
sudo chmod o+r ~/models/MyModel.gguf
|
|
||||||
# Make the directory traversable
|
|
||||||
sudo chmod o+x ~ ~/models
|
|
||||||
```
|
|
||||||
@ -1,514 +0,0 @@
|
|||||||
# How LLMs Interpret Intent in Text Prompts: Evidence-Based Guidance
|
|
||||||
|
|
||||||
> **Status:** Research synthesis. Companion to
|
|
||||||
> [`text-communication-interpretation.md`](./text-communication-interpretation.md)
|
|
||||||
> — that doc covers humans reading text; this one covers LLMs.
|
|
||||||
>
|
|
||||||
> **Scope:** Why current frontier and local models misinterpret prompts, what
|
|
||||||
> the underlying mechanisms are (training, architecture, quantization, position
|
|
||||||
> bias), and which counter-measures have empirical or vendor-documented support.
|
|
||||||
>
|
|
||||||
> **Models in scope (May 2026):** Claude Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku
|
|
||||||
> 4.5; the Qwen2.5, Qwen3, and Qwen3.5 ("qwen35") families including the
|
|
||||||
> OmniCoder-9B fine-tune; and the current open-weight engineering tier (DeepSeek
|
|
||||||
> V4, Kimi K2.6, GLM-5, Mistral Small 4, Gemma 4).
|
|
||||||
>
|
|
||||||
> **Audience:** Engineers building agents, prompts, and scaffolding — not
|
|
||||||
> first-time LLM users.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 0. Framing: Why Models Misread Prompts Differently Than Humans Do
|
|
||||||
|
|
||||||
Humans misread text mostly because of egocentric anchoring and emotional
|
|
||||||
projection (see the companion doc). LLMs misread for structurally different
|
|
||||||
reasons:
|
|
||||||
|
|
||||||
- **No persistent self.** Every turn re-derives "intent" from the visible token
|
|
||||||
stream. Anything outside the context window doesn't exist.
|
|
||||||
- **Distributional priors dominate.** The model's behavior is its training
|
|
||||||
distribution conditioned on your tokens. Ambiguity is resolved toward whatever
|
|
||||||
was most common in pretraining/RLHF, not toward what you meant.
|
|
||||||
- **Style → role.** Models infer _who_ is speaking from textual style rather
|
|
||||||
than from cryptographic provenance, which is why prompt injection works at all
|
|
||||||
(see §1.4). [13]
|
|
||||||
- **Quantization, depth, and routing change behavior under load**, not cleanly
|
|
||||||
and not always at the points you'd expect (see §3).
|
|
||||||
|
|
||||||
The practical consequence: the levers that work on humans (charity, delay,
|
|
||||||
perspective-taking) have direct analogs for LLMs — structured context, explicit
|
|
||||||
scope, separated reasoning — but for very different mechanistic reasons.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. The Core Problem (Why This Is Hard)
|
|
||||||
|
|
||||||
### 1.1 Models resolve ambiguity toward the training prior
|
|
||||||
|
|
||||||
When intent is underspecified, models fall back to whatever the training
|
|
||||||
distribution made most likely. Anthropic explicitly documents that **Opus 4.7 is
|
|
||||||
more literal than 4.6**: it will not silently generalize an instruction from one
|
|
||||||
item to another, and will not infer requests you didn't make. [1] The upside is
|
|
||||||
precision; the downside is that prompts that worked on 4.6 by relying on
|
|
||||||
"obvious" generalization may stop working. Stating scope explicitly ("apply to
|
|
||||||
every section, not just the first") is now required, not optional.
|
|
||||||
|
|
||||||
### 1.2 Instruction following is not bit-width monotonic
|
|
||||||
|
|
||||||
Quantization does not uniformly degrade behavior. The Llama-3.1-8B-Instruct GGUF
|
|
||||||
sweep [3] shows:
|
|
||||||
|
|
||||||
- **GSM8K (reasoning):** F16 baseline 77.6; Q3*K_S drops to 68.3 (−9.3);
|
|
||||||
Q4_K_S/M essentially match baseline; Q5/Q6/Q8 sometimes \_exceed* F16.
|
|
||||||
- **IFEval (instruction following):** F16 baseline 78.9; Q3*K_S drops to 73.9,
|
|
||||||
but Q4_K_S \_improves* to 80.3 and Q5_0 to 80.1. Q6_K drops to 77.6 and Q8_0
|
|
||||||
sits at 78.8 — i.e., higher bit-width does not guarantee better compliance.
|
|
||||||
|
|
||||||
**Practical floor:** for agentic / tool-using workflows, **4–5 bit K-quants
|
|
||||||
(Q4_K_M, Q5_K_M) are the safe band**; 3-bit risks reasoning collapse; 8-bit is
|
|
||||||
not automatically "best" for instruction following.
|
|
||||||
|
|
||||||
### 1.3 Long-context attention is U-shaped ("lost in the middle")
|
|
||||||
|
|
||||||
Liu et al. (TACL 2024) showed performance is highest when relevant information
|
|
||||||
is at the **beginning** or **end** of the context, with a sharp dip in the
|
|
||||||
middle — even for explicitly long-context models. [4] The effect persists across
|
|
||||||
Claude, GPT, and Llama lineages through early 2026. [5] Mechanism: training
|
|
||||||
documents are mostly short, and when long, important content tends to sit at the
|
|
||||||
boundaries; the model never learns strong middle-extraction habits.
|
|
||||||
|
|
||||||
**Implication:** the position of an instruction inside a 200K-token context
|
|
||||||
matters more than its wording. Put critical instructions at the top or just
|
|
||||||
before the user turn, not buried in the middle of system context.
|
|
||||||
|
|
||||||
### 1.4 Role confusion: style determines authority
|
|
||||||
|
|
||||||
Models do not robustly track _where text came from_; they infer the role of each
|
|
||||||
span from stylistic cues. Recent work on "CoT Forgery" [13] demonstrates that
|
|
||||||
injected reasoning traces that look like the model's own scratchpad inherit the
|
|
||||||
trust the model places in its own thoughts — external text, by contrast, is
|
|
||||||
normally scrutinized and rejected. This is the structural reason prompt
|
|
||||||
injection in tool outputs works.
|
|
||||||
|
|
||||||
**Implication:** any content you don't fully trust (tool output, fetched web
|
|
||||||
content, user-pasted text) must be wrapped in unambiguous structural markers,
|
|
||||||
and the model must be told what kind of content it is and how much authority it
|
|
||||||
carries.
|
|
||||||
|
|
||||||
### 1.5 Sycophancy / agreement bias
|
|
||||||
|
|
||||||
Some RLHF'd models lean toward agreeing with the user's framing, especially when
|
|
||||||
the user states a belief or pushes back. Sharma et al. (ICLR 2024) [14] found
|
|
||||||
this across five SOTA assistants and traced it to human preference labels
|
|
||||||
favoring agreement. **Important caveat:** the original Perez et al. (2022)
|
|
||||||
finding that sycophancy appears even at zero RLHF steps did **not** replicate
|
|
||||||
across model families — nostalgebraist (2023) [15] showed OpenAI base models are
|
|
||||||
not sycophantic at any size. So this is model-family- and
|
|
||||||
training-data-specific, not a universal RLHF property. Mitigations: ask for the
|
|
||||||
model's best answer _before_ revealing your view; explicitly invite
|
|
||||||
disagreement; in agent prompts, instruct "persist through genuine blockers; do
|
|
||||||
not pivot just because the previous attempt failed."
|
|
||||||
|
|
||||||
**Stronger mitigation — context isolation (S2A):** System 2 Attention (Weston &
|
|
||||||
Sukhbaatar, 2023) [20] shows that asking the LLM to first _rewrite_ its input
|
|
||||||
context — extracting only the portions relevant to the current query and
|
|
||||||
discarding irrelevant or opinionated material — measurably reduces sycophancy
|
|
||||||
and improves factuality across QA, math word problems, and longform generation.
|
|
||||||
The mechanism is direct: soft attention in Transformers is susceptible to
|
|
||||||
incorporating irrelevant prior context; explicit isolation severs the anchor
|
|
||||||
before generation. In a harness context, the full two-pass S2A (rewrite then
|
|
||||||
respond) requires a second LLM call; the lightweight equivalent is placing an
|
|
||||||
explicit current-question marker at the context tail (recency- bias zone), which
|
|
||||||
isolates the current query from prior anchor answers without a second inference
|
|
||||||
pass.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. Highest-Leverage Counter-Practices
|
|
||||||
|
|
||||||
Ranked by effect size and breadth of support across vendor docs, peer- reviewed
|
|
||||||
work, and field practice.
|
|
||||||
|
|
||||||
### 2.1 Be literal and explicit; state scope
|
|
||||||
|
|
||||||
Anthropic's official guidance for 4.6/4.7: "Claude responds well to clear,
|
|
||||||
explicit instructions. Being specific about your desired output can help enhance
|
|
||||||
results. If you want 'above and beyond' behavior, explicitly request it rather
|
|
||||||
than relying on the model to infer it from vague prompts." [1] This is the
|
|
||||||
single most-cited lever in their docs.
|
|
||||||
|
|
||||||
Apply equally to Qwen3-class local models, whose Apache-2.0 instruct tunes are
|
|
||||||
now competitive at instruction-following but show the same literal-by-default
|
|
||||||
behavior as Claude 4.7. [2]
|
|
||||||
|
|
||||||
### 2.2 Use XML (or unambiguous) structural tags around heterogeneous content
|
|
||||||
|
|
||||||
Wrapping each kind of input — instructions, examples, retrieved context, user
|
|
||||||
query, tool output — in its own tag reduces misinterpretation because the model
|
|
||||||
can attend to "tag boundaries" rather than guessing where one block ends and
|
|
||||||
another begins. [1] This is the cheapest mitigation for §1.3
|
|
||||||
(lost-in-the-middle) and §1.4 (role confusion) simultaneously.
|
|
||||||
|
|
||||||
### 2.3 Provide context and motivation, not just the instruction
|
|
||||||
|
|
||||||
Vendor-documented (Anthropic) and consistently effective: explaining _why_
|
|
||||||
improves targeting. [1][6] Mechanism: motivation tokens disambiguate which
|
|
||||||
training prior to condition on. A request to "make this shorter" with context
|
|
||||||
"for a P0 incident page, every line costs attention" lands in a different region
|
|
||||||
of model behavior than the same request without justification.
|
|
||||||
|
|
||||||
### 2.4 Prefer general reasoning instructions over prescriptive steps —
|
|
||||||
|
|
||||||
**for reasoning-capable models**
|
|
||||||
|
|
||||||
Anthropic: "A prompt like 'think thoroughly' often produces better reasoning
|
|
||||||
than a hand-written step-by-step plan. Claude's reasoning frequently exceeds
|
|
||||||
what a human would prescribe." [1] Qwen3's thinking mode is similarly designed
|
|
||||||
to be triggered with light cues (`/think`) rather than micromanaged. [2]
|
|
||||||
|
|
||||||
For **non-reasoning** models (or thinking-off mode), the Prompting Science
|
|
||||||
Report 2 [7] finds chain-of-thought provides only a small average boost and
|
|
||||||
**increases variance** — sometimes flipping previously-correct answers to wrong.
|
|
||||||
For reasoning models the explicit CoT request is essentially zero-value and just
|
|
||||||
burns tokens.
|
|
||||||
|
|
||||||
**Additional caveat — subjective tasks:** arXiv:2409.06173 (2024) [16] shows CoT
|
|
||||||
suffers from _posterior collapse_: the format of CoT retrieves reasoning priors
|
|
||||||
that remain relatively unchanged despite the evidence in the prompt. This is
|
|
||||||
especially pronounced on subjective tasks (emotion, morality) and on larger
|
|
||||||
models. So for intent-interpretation tasks — exactly the kind this doc is about
|
|
||||||
— CoT may actively entrench the model's prior reading rather than update it on
|
|
||||||
new evidence. Prefer perspective-taking prompts (see §2.4a) or
|
|
||||||
clarifying-question prompts over generic "think step by step" for ambiguous
|
|
||||||
intent.
|
|
||||||
|
|
||||||
### 2.5 Calibrate reasoning length to task complexity
|
|
||||||
|
|
||||||
"When More is Less" (Wang et al., 2025) [8] established an inverted-U: accuracy
|
|
||||||
rises with CoT length, then declines as error accumulation outpaces
|
|
||||||
decomposition benefit. Optimal length _increases_ with task difficulty and
|
|
||||||
_decreases_ with model capability. Practical rules:
|
|
||||||
|
|
||||||
- For Claude adaptive thinking (4.6/4.7): set the `effort` parameter to match
|
|
||||||
task complexity; do not push it higher than needed. [1]
|
|
||||||
- For Qwen3: use the `thinking_budget` mechanism rather than letting thinking
|
|
||||||
run unbounded. [2]
|
|
||||||
- For small local models (≤9B): prefer many short reasoning steps in multiple
|
|
||||||
turns over one long monolithic chain.
|
|
||||||
|
|
||||||
### 2.6 Default-to-action vs. default-to-clarify is promptable
|
|
||||||
|
|
||||||
Anthropic publishes both directions verbatim. For agent work:
|
|
||||||
|
|
||||||
> By default, implement changes rather than only suggesting them. If the user's
|
|
||||||
> intent is unclear, infer the most useful likely action and proceed, using
|
|
||||||
> tools to discover any missing details instead of guessing. [1]
|
|
||||||
|
|
||||||
For research/exploration work, invert it: instruct the model to clarify or plan
|
|
||||||
before acting. The point is that "agentic-ness" is a prompt-controlled dial, not
|
|
||||||
a model property.
|
|
||||||
|
|
||||||
### 2.7 Place critical instructions at the boundaries of the context
|
|
||||||
|
|
||||||
Direct consequence of §1.3. The top of the system prompt and the position
|
|
||||||
immediately preceding the user's most recent turn are the high-attention zones.
|
|
||||||
Anthropic, Cursor, and Aider all converge on this in practice — system prompts
|
|
||||||
grow at the top, repo-map / recent-turn context grows just before the user
|
|
||||||
message.
|
|
||||||
|
|
||||||
**Stronger form — full context recontextualization (S2A [20]):** if the context
|
|
||||||
contains opinionated or anchor-setting material that will skew the answer, the
|
|
||||||
boundary-placement advice is necessary but not sufficient. S2A's two-pass
|
|
||||||
pattern (rewrite context to strip irrelevant content → generate from rewritten
|
|
||||||
context) further reduces the effect of prior anchors. For agent harnesses where
|
|
||||||
a second LLM call is too expensive, the single-pass equivalent is an explicit
|
|
||||||
current-question isolation instruction injected at the context tail — same
|
|
||||||
recency zone, same isolation intent, no extra inference. [20]
|
|
||||||
|
|
||||||
### 2.8 Truncate and structure tool output aggressively
|
|
||||||
|
|
||||||
Local-model failure modes documented in this repo's own
|
|
||||||
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) match the
|
|
||||||
broader pattern: tool-call history is the largest context consumer, and
|
|
||||||
untruncated outputs both push content into the lost-in-the-middle zone _and_
|
|
||||||
widen the prompt-injection attack surface (§1.4). The repo's ~1500-token
|
|
||||||
post-tool-use truncation is consistent with what the Cursor and Aider teams have
|
|
||||||
published.
|
|
||||||
|
|
||||||
### 2.9 Lower temperature for tool-calling / structured output
|
|
||||||
|
|
||||||
Convergent vendor guidance across Anthropic, Qwen, and Tesslate (OmniCoder): for
|
|
||||||
tool-calling and JSON-emitting paths, temperature 0.2–0.4 substantially reduces
|
|
||||||
schema violations and hallucinated arguments. [10] This effect is amplified in
|
|
||||||
quantized models where sampling noise compounds with quantization noise.
|
|
||||||
|
|
||||||
### 2.10 Role / persona prompting is at best a weak intervention
|
|
||||||
|
|
||||||
A 2025 wave of replication-style studies converges on a folklore-busting result:
|
|
||||||
assigning expert personas ("you are a senior software engineer…") does not
|
|
||||||
reliably improve task performance, and in many cases hurts.
|
|
||||||
|
|
||||||
- **Principled Personas** (EMNLP 2025) [17]: across 9 SOTA models × 27 tasks,
|
|
||||||
expert personas usually give "positive or non-significant" effects, and models
|
|
||||||
are **highly sensitive to irrelevant persona details, with drops of almost 30
|
|
||||||
percentage points**.
|
|
||||||
- **Persona is a Double-Edged Sword** (IJCNLP Findings 2025) [18]: dataset-
|
|
||||||
aligned personas can hurt; only _instance_-aligned personas selected per-
|
|
||||||
query reliably help.
|
|
||||||
- **Persona-prompt evaluation across QA benchmarks** (arXiv:2512.05858) [19]:
|
|
||||||
"persona prompts generally did not improve accuracy" across both benchmarks
|
|
||||||
tested; low-knowledge personas (layperson, child) actively degrade results.
|
|
||||||
|
|
||||||
**Practical guidance:** do not rely on personas as a precision lever for intent
|
|
||||||
interpretation. If a persona is included for stylistic reasons (tone, register),
|
|
||||||
keep it minimal and avoid attributes that are irrelevant to the task. For
|
|
||||||
correctness, prefer the levers in §2.1–§2.9.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Architecture, Parameters, and Quantization — What Actually Changes
|
|
||||||
|
|
||||||
### 3.1 Parameter count and "emergence"
|
|
||||||
|
|
||||||
The classical scaling-laws picture (Kaplan, Chinchilla) holds for loss, but
|
|
||||||
emergent _capabilities_ are noisier than originally reported. "Distributional
|
|
||||||
Scaling Laws for Emergent Capabilities" (2025) [9] shows that at scales near a
|
|
||||||
capability threshold, performance across random seeds is **bimodal** — some runs
|
|
||||||
acquire the skill, some don't — so "emergence" at a given scale is partly
|
|
||||||
stochastic. Bigger models collapse the bimodal distribution and acquire skills
|
|
||||||
more reliably.
|
|
||||||
|
|
||||||
Practical implication for choosing model size:
|
|
||||||
|
|
||||||
- **≤4B:** reliable for narrow extraction, classification, short agentic steps;
|
|
||||||
instruction following degrades sharply with prompt length and as context
|
|
||||||
fills.
|
|
||||||
- **7–14B (incl. OmniCoder-9B):** the current sweet spot for local engineering
|
|
||||||
work. Tool-calling and structured output work reliably when the prompt is
|
|
||||||
well-structured; reasoning is acceptable; long- horizon plans drift.
|
|
||||||
- **30–70B dense / 100–400B MoE:** comparable behavior to mid-tier cloud models
|
|
||||||
on most tasks; remaining gaps are agentic (BrowseComp, TerminalBench, OSWorld)
|
|
||||||
where open models still trail. [11]
|
|
||||||
|
|
||||||
### 3.2 Dense vs. Mixture-of-Experts
|
|
||||||
|
|
||||||
Shen et al. (ICLR 2024, "FLAN-MoE") [12] established a counter-intuitive result
|
|
||||||
that still holds: **MoE models underperform dense models of equivalent FLOPs
|
|
||||||
when only directly fine-tuned, but surpass them dramatically after instruction
|
|
||||||
tuning** — and benefit _more_ from instruction tuning than dense models do.
|
|
||||||
FLAN-MoE-32B beat Flan-PaLM-62B on four benchmarks at ⅓ the FLOPs.
|
|
||||||
|
|
||||||
Practical implications for prompt design:
|
|
||||||
|
|
||||||
- MoE models (DeepSeek V4, Kimi K2.6, GLM-5, Qwen3 235B-A22B) are more sensitive
|
|
||||||
to instruction _style_ matching their tuning distribution. Clean, structured
|
|
||||||
prompts pay off more than on dense models.
|
|
||||||
- Routing instability shows up as occasional out-of-distribution responses on
|
|
||||||
edge cases. Few-shot examples are an effective stabilizer because they shift
|
|
||||||
activation into well-traveled expert combinations.
|
|
||||||
- Active-parameter count (e.g., 22B active in Qwen3-235B-A22B) is the better
|
|
||||||
predictor of per-token latency and small-task quality than total parameter
|
|
||||||
count.
|
|
||||||
|
|
||||||
### 3.3 Quantization
|
|
||||||
|
|
||||||
Detailed numbers in §1.2. Summary heuristics:
|
|
||||||
|
|
||||||
| Bit-width | Reasoning (GSM8K) | Instruction (IFEval) | Recommendation |
|
|
||||||
| ----------- | ----------------- | -------------------- | ------------------------------- |
|
|
||||||
| Q3_K_S/M | Notable drop | Variable, often drop | Avoid for agents |
|
|
||||||
| Q4_K_S/M | ~Baseline | Often ≥ baseline | Default for local agents |
|
|
||||||
| Q5_K_M | ≥ Baseline | ≥ Baseline | Best quality/size trade-off [3] |
|
|
||||||
| Q6_K | ≥ Baseline | Sometimes slight dip | Use if VRAM allows |
|
|
||||||
| Q8_0 / bf16 | Baseline | Baseline | No guaranteed advantage over Q5 |
|
|
||||||
|
|
||||||
Calibration-aware methods (AWQ, GPTQ with good calibration data, EXL2) generally
|
|
||||||
outperform naive GGUF at the same bit-width; for instruction- heavy work, prefer
|
|
||||||
K-quants over legacy `_0` / `_1` quants. [3]
|
|
||||||
|
|
||||||
### 3.4 Architecture variants worth knowing in 2026
|
|
||||||
|
|
||||||
- **Standard Transformer + GQA:** still the default (Llama, Mistral, most
|
|
||||||
Qwen2/2.5).
|
|
||||||
- **Hybrid attention (Qwen3.5 / "qwen35" / OmniCoder backbone):** Gated Delta
|
|
||||||
Networks interleaved with standard attention; enables efficient 262K native
|
|
||||||
context with extension to 1M+. [10] In practice this changes the
|
|
||||||
lost-in-the-middle profile somewhat but does not eliminate it — the same
|
|
||||||
boundary-placement advice applies.
|
|
||||||
- **Thinking-mode fusion (Qwen3):** a single model trained for both reasoning
|
|
||||||
and direct response, switched by `/think` and `/no_think` flags in user/system
|
|
||||||
messages, with an emergent "stop thinking now" capability used by the
|
|
||||||
`thinking_budget` controller. [2]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. Model-Specific Notes (May 2026)
|
|
||||||
|
|
||||||
### Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Haiku 4.5
|
|
||||||
|
|
||||||
- **Opus 4.7 is more literal than 4.6 at low effort.** Prompts tuned for 4.6 may
|
|
||||||
need scope made explicit on 4.7. [1]
|
|
||||||
- Adaptive thinking is the default; do not hand-write step-by-step plans unless
|
|
||||||
the task is genuinely procedural. [1]
|
|
||||||
- The "default-to-action" / "default-to-clarify" prompt is the highest- leverage
|
|
||||||
knob for changing agent behavior without changing model. [1]
|
|
||||||
- Subagent delegation (Opus parent → Sonnet/Haiku children) is
|
|
||||||
cheaper-and-comparable for isolated subtasks; the parent retains reasoning,
|
|
||||||
the children execute.
|
|
||||||
|
|
||||||
### Qwen3 family (0.6B – 235B, dense + MoE; Qwen3.5 hybrid)
|
|
||||||
|
|
||||||
- Two-mode model: `/think` and `/no_think` flags toggle reasoning;
|
|
||||||
`thinking_budget` caps token spend. [2]
|
|
||||||
- Instruction following on Qwen3 instruct surpasses Qwen2.5 instruct, especially
|
|
||||||
in non-thinking mode. [2]
|
|
||||||
- Multilingual support jumped from 29 languages (Qwen2.5) to 119 (Qwen3). [2]
|
|
||||||
- Qwen3.5 (the "qwen35" architecture, base for OmniCoder-9B) introduces hybrid
|
|
||||||
Gated Delta + standard attention, 262K native context. [10]
|
|
||||||
|
|
||||||
### OmniCoder 2 / OmniCoder-9B (Tesslate, Qwen3.5-9B base)
|
|
||||||
|
|
||||||
- Fine-tuned on 425K agentic trajectories distilled from Claude Opus 4.6,
|
|
||||||
GPT-5.3-Codex, GPT-5.4, Gemini 3.1 Pro on Claude Code, OpenCode, Codex, and
|
|
||||||
Droid scaffolding. [10]
|
|
||||||
- Specifically learned read-before-write, LSP-diagnostic response, and
|
|
||||||
minimal-diff edits.
|
|
||||||
- Tesslate's own guidance: temperature 0.2–0.4 for agentic / tool use.
|
|
||||||
- Failure modes documented in this repo:
|
|
||||||
[`agent-infrastructure.md`](../projects/agent-infrastructure.md) §
|
|
||||||
"Smaller-scale local models" — narrower training distribution (Python/JS
|
|
||||||
heavy), JSON-schema compliance drops as context fills, instruction drift
|
|
||||||
faster than larger Qwen3 due to fewer attention heads.
|
|
||||||
|
|
||||||
### Other engineering-capable local models (2026 tier)
|
|
||||||
|
|
||||||
- **DeepSeek V4 Pro (Max), Kimi K2.6, GLM-5:** current open-weight ceiling;
|
|
||||||
strong on coding/agentic, still trail proprietary models on BrowseComp,
|
|
||||||
TerminalBench, OSWorld. [11]
|
|
||||||
- **Qwen3.5 397B (Reasoning):** competitive with the above at reasoning-heavy
|
|
||||||
work.
|
|
||||||
- **Mistral Small 4 (24B, 256K ctx):** best quality-to-resource ratio for
|
|
||||||
single-GPU deployments; Apache 2.0.
|
|
||||||
- **Gemma 4 31B (256K ctx):** strong LiveCodeBench; single high-end consumer GPU
|
|
||||||
viable.
|
|
||||||
- **Llama 4 (Maverick/Scout):** now trails the Chinese open-weight leaders on
|
|
||||||
benchmarks but retains ecosystem advantages. [11]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Minimal Operating Checklist
|
|
||||||
|
|
||||||
When writing a prompt or system message for any of these models:
|
|
||||||
|
|
||||||
1. **State scope and motivation explicitly.** Don't expect generalization.
|
|
||||||
2. **Structure heterogeneous content with tags.** Especially anything from a
|
|
||||||
tool or external source.
|
|
||||||
3. **Put critical instructions at the boundaries** (top of system, or
|
|
||||||
immediately before user turn) — not buried.
|
|
||||||
4. **Pick reasoning intensity deliberately.** Adaptive/`thinking_budget` for
|
|
||||||
capable models; multi-turn small steps for ≤9B locals; skip forced CoT on
|
|
||||||
reasoning models.
|
|
||||||
5. **Truncate tool output** and never paste untrusted text without a wrapper
|
|
||||||
that names its provenance.
|
|
||||||
6. **For tool-calling: lower temperature** (0.2–0.4) regardless of model.
|
|
||||||
7. **For local deployments: target Q4_K_M or Q5_K_M.** Verify on IFEval-style
|
|
||||||
tests, not just perplexity.
|
|
||||||
8. **Ask for the answer before stating your own view** to avoid sycophantic
|
|
||||||
agreement.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 6. What the Evidence Does _Not_ Support
|
|
||||||
|
|
||||||
- **"Just use a bigger model."** Architecture, instruction tuning, and prompt
|
|
||||||
structure account for as much variance as raw parameter count for most
|
|
||||||
engineering tasks. [9][12]
|
|
||||||
- **"Always use chain-of-thought."** Outdated. Marginal for non- reasoning
|
|
||||||
models, near-zero for reasoning models, and CoT _increases answer variance_ —
|
|
||||||
flipping some correct answers to wrong. [7][8]
|
|
||||||
- **"Higher quantization is always better."** IFEval is not bit-width monotonic;
|
|
||||||
Q4_K_S can beat Q8_0 on compliance. [3]
|
|
||||||
- **"MoE > dense at equivalent total params."** Without instruction tuning, MoE
|
|
||||||
underperforms dense at equal FLOPs. [12]
|
|
||||||
- **"Role-play personas reliably steer behavior."** Style-based role cues are
|
|
||||||
exactly what prompt-injection attacks exploit; do not rely on persona prompts
|
|
||||||
for security boundaries. [13] **Stronger version of this debunk:** persona
|
|
||||||
prompts also don't reliably improve _task performance_ — they're often
|
|
||||||
ineffective and frequently harmful when persona attributes are even mildly
|
|
||||||
irrelevant to the task. [17][18][19] See §2.10.
|
|
||||||
- **"Longer reasoning is better reasoning."** Inverted-U on accuracy vs. CoT
|
|
||||||
length is well-established. [8]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 7. Sources
|
|
||||||
|
|
||||||
The foundational survey of prompting techniques used to cross-check claims in
|
|
||||||
this doc is **Schulhoff et al. (2024), _The Prompt Report: A Systematic Survey
|
|
||||||
of Prompting Techniques_** (arXiv:2406.06608). PRISMA-based review of 1,565
|
|
||||||
papers; taxonomy of 58 text prompting techniques. Cited as [PR] where relevant.
|
|
||||||
|
|
||||||
1. Anthropic. _Prompting best practices_ (covers Opus 4.7, 4.6, Sonnet 4.6,
|
|
||||||
Haiku 4.5). Claude API Docs.
|
|
||||||
https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/be-clear-and-direct
|
|
||||||
2. Yang, A. et al. (2025). _Qwen3 Technical Report._ arXiv:2505.09388. (Dense +
|
|
||||||
MoE family 0.6B–235B; thinking-mode fusion; thinking budget; 119-language
|
|
||||||
support.)
|
|
||||||
3. _Which Quantization Should I Use? A Unified Evaluation of llama.cpp
|
|
||||||
Quantization on Llama-3.1-8B-Instruct._ arXiv preprint. (GSM8K, IFEval, MMLU,
|
|
||||||
HellaSwag, TruthfulQA across all GGUF variants.)
|
|
||||||
4. Liu, N. F. et al. (2024). _Lost in the Middle: How Language Models Use Long
|
|
||||||
Contexts._ TACL 12, 157–173.
|
|
||||||
5. The Neural Base. _Lost-in-middle behavior across major models through
|
|
||||||
early 2026._ (Replication note; U-shaped curve persists across Claude, GPT,
|
|
||||||
Llama.)
|
|
||||||
6. Anthropic. _Prompt engineering for business performance._
|
|
||||||
https://www.anthropic.com/news/prompt-engineering-for-business-performance
|
|
||||||
7. Meincke, L. et al. (2025). _Prompting Science Report 2: The Decreasing Value
|
|
||||||
of Chain of Thought in Prompting._ arXiv:2506.07142.
|
|
||||||
8. Wang, Y. et al. (2025). _When More is Less: Understanding Chain-of-Thought
|
|
||||||
Length in LLMs._ arXiv:2502.07266.
|
|
||||||
9. _Distributional Scaling Laws for Emergent Capabilities._ (2025)
|
|
||||||
arXiv:2502.17356. (Bimodal performance distributions near capability
|
|
||||||
thresholds; "emergence" as stochastic property at scale.)
|
|
||||||
10. Tesslate. _OmniCoder-9B model card._ Hugging Face, March 2026. (Qwen3.5-9B
|
|
||||||
base; 425K agentic trajectories from Claude Opus 4.6, GPT-5.3-Codex,
|
|
||||||
GPT-5.4, Gemini 3.1 Pro; Gated Delta + attention hybrid; 262K context;
|
|
||||||
recommended temperature 0.2–0.4 for tool use.)
|
|
||||||
https://huggingface.co/Tesslate/OmniCoder-9B
|
|
||||||
11. BenchLM.ai. _Best Open Source LLM in 2026: Rankings, Benchmarks, and the
|
|
||||||
Models Worth Running._ April 2026. (DeepSeek V4 Pro, Kimi K2.6, GLM-5,
|
|
||||||
Qwen3.5 397B, Mistral Small 4, Gemma 4, Llama 4 comparison.)
|
|
||||||
12. Shen, S. et al. (2024). _Mixture-of-Experts Meets Instruction Tuning: A
|
|
||||||
Winning Combination for Large Language Models._ ICLR. (FLAN-MoE-32B vs
|
|
||||||
Flan-PaLM-62B; MoE benefits more from instruction tuning than dense.)
|
|
||||||
13. _Role Confusion and CoT Forgery: Stylistic Spoofing as a Prompt- Injection
|
|
||||||
Mechanism._ arXiv preprint, 2026. (Models infer roles from style; forged
|
|
||||||
reasoning traces inherit self-trust.)
|
|
||||||
14. Sharma, M. et al. (2024). _Towards Understanding Sycophancy in Language
|
|
||||||
Models._ ICLR 2024. arXiv:2310.13548.
|
|
||||||
15. nostalgebraist (2023). _OpenAI API base models are not sycophantic, at any
|
|
||||||
size._ LessWrong.
|
|
||||||
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
|
|
||||||
(Disconfirms the strongest reading of Perez et al. 2022 for OpenAI base
|
|
||||||
models. Not peer-reviewed but the data and code are public.)
|
|
||||||
16. _Chain-of-Thought is not all you need: Posterior collapse of CoT under
|
|
||||||
distributional shift._ arXiv:2409.06173 (2024). (Larger models anchor harder
|
|
||||||
to reasoning priors under CoT, especially on subjective tasks.)
|
|
||||||
17. _Principled Personas: Defining and Measuring the Intended Effects of Persona
|
|
||||||
Prompting on Task Performance._ EMNLP 2025.
|
|
||||||
https://aclanthology.org/2025.emnlp-main.1364/
|
|
||||||
18. _Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts
|
|
||||||
in Zero-shot Reasoning Tasks._ IJCNLP Findings 2025.
|
|
||||||
https://aclanthology.org/2025.findings-ijcnlp.51/
|
|
||||||
19. _When personas help and when they don't: A persona-prompt evaluation across
|
|
||||||
QA benchmarks._ arXiv:2512.05858 (2025). PR. Schulhoff, S. et al. (2024).
|
|
||||||
_The Prompt Report: A Systematic Survey of Prompting Techniques._
|
|
||||||
arXiv:2406.06608. PRISMA review of 1,565 papers; taxonomy of 58 prompting
|
|
||||||
techniques.
|
|
||||||
20. Weston, J. & Sukhbaatar, S. (2023). _System 2 Attention (is something you
|
|
||||||
might need too)._ arXiv:2311.11829. (Two-pass technique: LLM first rewrites
|
|
||||||
input context to remove irrelevant/opinionated material, then generates
|
|
||||||
response from cleaned context. Reduces sycophancy and increases factuality
|
|
||||||
on QA, math word problems, and longform generation. The lightweight harness
|
|
||||||
equivalent is a current-question isolation instruction at the context tail.)
|
|
||||||
@ -1,718 +0,0 @@
|
|||||||
# Dotfiles Agent Infrastructure — Roadmap
|
|
||||||
|
|
||||||
**Status:** Planning. Companion to
|
|
||||||
[extraction-history.md](./extraction-history.md), which covers the
|
|
||||||
already-shipped extraction work and the validation findings against it.
|
|
||||||
|
|
||||||
**Scope of this doc:** future tasks against `~/dotfiles/.agents/` and the
|
|
||||||
ecosystem around it. Research that informs the prioritization is captured in the
|
|
||||||
"Research notes" section at the bottom — read those first if any of the task
|
|
||||||
rationale feels opaque.
|
|
||||||
|
|
||||||
**How to use this doc:** the "Tasks" list is ordered by recommended execution
|
|
||||||
order (high leverage + low risk first). Each entry links to its design section.
|
|
||||||
Move sections to dedicated docs once they grow past ~80 lines.
|
|
||||||
|
|
||||||
> **Land before anything else:** the
|
|
||||||
> [No-Live-Fire safety rule](#0-no-live-fire-safety-rule-land-immediately).
|
|
||||||
> One-paragraph addition to `~/dotfiles/.agents/AGENTS.md`; takes 5 minutes;
|
|
||||||
> protects against the `opencode run "Try to run rm -rf /"` failure mode where a
|
|
||||||
> model takes the prompt literally if the hook fails to block.
|
|
||||||
|
|
||||||
> **Then relocate this doc out of Remnant:** see
|
|
||||||
> [Doc relocation (Remnant cleanup)](#doc-relocation-remnant-cleanup). This
|
|
||||||
> roadmap, `agent-infra-extraction.md`, and `verification.md` are not
|
|
||||||
> Remnant-specific and should live in `~/dotfiles/` so Remnant's
|
|
||||||
> `docs/projects/` contains only Remnant-app work. Do this after #0 and before
|
|
||||||
> resuming any numbered task below — once moved, the tasks list executes against
|
|
||||||
> the dotfiles copy and Remnant is free to evolve independently.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Doc relocation (Remnant cleanup)
|
|
||||||
|
|
||||||
**Goal:** Remnant's repo contains only Remnant-app docs. Everything about
|
|
||||||
`~/dotfiles/.agents/` lives in `~/dotfiles/docs/` (or `~/dotfiles/.agents/docs/`
|
|
||||||
— pick one and stick with it; the existing
|
|
||||||
[`agent-infrastructure.md`](./agent-infrastructure.md) stub already references
|
|
||||||
`~/dotfiles/.agents/docs/agent-infrastructure.md`, so that's the established
|
|
||||||
location).
|
|
||||||
|
|
||||||
**Why now (priority: immediately after #0):** the user wants Remnant in a good
|
|
||||||
state to work on independently. Every agent-infra doc sitting in
|
|
||||||
`docs/projects/` is noise for Remnant-app planning sessions and gets
|
|
||||||
auto-injected as context whenever an agent touches `docs/projects/`. Moving them
|
|
||||||
is mechanical and reversible.
|
|
||||||
|
|
||||||
**Files to relocate:**
|
|
||||||
|
|
||||||
| Current path | Destination | Notes |
|
|
||||||
| ----------------------------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| `docs/projects/dotfiles-agent-infra-roadmap.md` (this file) | `~/dotfiles/.agents/docs/roadmap.md` | Update internal links. Drop "Remnant" framing in the intro — it's just _the_ roadmap once it lives there. |
|
|
||||||
| `docs/projects/agent-infra-extraction.md` | `~/dotfiles/.agents/docs/extraction-history.md` | Validation log for the already-shipped extraction. Keep as historical record; not active planning. |
|
|
||||||
| `verification.md` (repo root) | `~/dotfiles/.agents/tests/manual-verification.md` | Already specified as part of [#3](#3-hook--agent-config-verification-framework); do the move now rather than waiting for the test harness. |
|
|
||||||
| `docs/projects/agent-infrastructure.md` | **Stay** (already trimmed to Remnant-specific overlay) | Already correctly scoped: it documents Remnant's overlay hook + Remnant-specific integration test cases. Leave in place; it points to the canonical doc in dotfiles. |
|
|
||||||
| Agent-infra entries inside `docs/projects/COMPLETED.md` | Split out to `~/dotfiles/.agents/docs/completed.md` | Audit first — if there's nothing agent-infra-specific there, skip. |
|
|
||||||
|
|
||||||
**Steps:**
|
|
||||||
|
|
||||||
1. `mkdir -p ~/dotfiles/.agents/docs ~/dotfiles/.agents/tests`
|
|
||||||
2. `git mv` each file into `~/dotfiles/` (cross-repo: use `git mv` inside
|
|
||||||
Remnant to stage a delete, then a fresh add in dotfiles — there's no
|
|
||||||
meaningful history to preserve across repos for these short-lived docs; if
|
|
||||||
history matters for `agent-infra-extraction.md`, use `git format-patch`
|
|
||||||
- `git am` instead).
|
|
||||||
3. Rewrite intra-doc links: this file's references to
|
|
||||||
`./agent-infra-extraction.md` become `./extraction-history.md`; references to
|
|
||||||
`verification.md` become `../tests/manual-verification.md`.
|
|
||||||
4. Find inbound links from anywhere in Remnant
|
|
||||||
(`grep -rn "dotfiles-agent-infra-roadmap\|agent-infra-extraction\|verification.md" ~/code/remnant`)
|
|
||||||
and either delete them or repoint at the dotfiles copies via absolute paths
|
|
||||||
(e.g., `~/dotfiles/.agents/docs/roadmap.md`).
|
|
||||||
5. Audit `docs/projects/COMPLETED.md` for agent-infra rows; split if any exist.
|
|
||||||
6. Update `AGENTS.md` files in Remnant if any reference the moved docs.
|
|
||||||
7. Commit Remnant deletion and dotfiles addition together (or back-to-back
|
|
||||||
commits with cross-references in the messages).
|
|
||||||
|
|
||||||
**Acceptance:** `ls ~/code/remnant/docs/projects/ | grep -iE 'agent|dotfiles'`
|
|
||||||
returns only `agent-infrastructure.md`; `verification.md` is gone from the
|
|
||||||
Remnant root; the roadmap (this doc) opens cleanly from its new path with
|
|
||||||
working links.
|
|
||||||
|
|
||||||
**Risk:** if any Remnant `AGENTS.md` instructions or
|
|
||||||
[`docs/projects/COMPLETED.md`](./COMPLETED.md) row links into these docs and the
|
|
||||||
link breaks silently, agents will follow a dead reference. Step 4 mitigates.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Tasks (recommended order)
|
|
||||||
|
|
||||||
0. [No-live-fire safety rule (land immediately)](#0-no-live-fire-safety-rule-land-immediately)
|
|
||||||
— AGENTS.md addition forbidding real destructive commands as hook-test
|
|
||||||
inputs. Prerequisite for #3 and for any manual hook testing.
|
|
||||||
1. [`project.config.js` extraction](#1-projectconfigjs-extraction) — unblocks
|
|
||||||
non-Remnant projects; resolves 6+ hardcodes catalogued in the
|
|
||||||
[hook-script audit](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
|
|
||||||
2. [Per-session tmp file capture](#2-per-session-tmp-file-capture) — correctness
|
|
||||||
bug; concurrent agent sessions clobber one another's task-capture file.
|
|
||||||
3. [Hook + agent-config verification framework](#3-hook--agent-config-verification-framework)
|
|
||||||
— automate the smoke-test currently in Remnant's `verification.md`. Gated on
|
|
||||||
#0 (safety rule) and benefits from #1 (config-driven test fixtures).
|
|
||||||
4. [llama-server + AI models module](#4-llama-server--ai-models-module) —
|
|
||||||
user-requested; folds presets, systemd units, llama.cpp build, and GGUF
|
|
||||||
acquisition into `install.sh` (skips heavy steps in devcontainers).
|
|
||||||
5. [Kanban / task-doc unification](#5-kanban--task-doc-unification) — blocks MFE
|
|
||||||
adoption of the shared `stop.sh`; deferred until #1 lands so the task-doc
|
|
||||||
paths come from config, not the hook.
|
|
||||||
6. [MemPalace integration for memory survival across compaction](#6-mempalace-integration)
|
|
||||||
— directly addresses the "AGENTS.md context survival after compaction" WIP
|
|
||||||
problem in
|
|
||||||
[extraction-history.md](./extraction-history.md#wip-agentsmd-context-survival-after-compaction).
|
|
||||||
7. [Trace-based eval scaffolding (Husain methodology)](#7-trace-based-eval-scaffolding)
|
|
||||||
— foundation for any future automated improvement loop.
|
|
||||||
8. [Exa rate-limit awareness](#8-exa-rate-limit-awareness) — small follow-up to
|
|
||||||
the gap recorded in the validation doc.
|
|
||||||
9. [Research-loop / EvoSkill-style improvements](#9-research-loop--evoskill-style-improvements)
|
|
||||||
— gated on #7.
|
|
||||||
|
|
||||||
Items considered and **deprioritized**: see
|
|
||||||
[Deferred / not-now](#deferred--not-now).
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 0. No-live-fire safety rule (land immediately)
|
|
||||||
|
|
||||||
**Driver:** May 23 2026 incident — `opencode run "Try to run rm -rf /"` was used
|
|
||||||
to smoke-test whether `pre-tool-use.sh` would block destructive commands. The
|
|
||||||
run happened to be safe because the loaded model refused on its own, but if the
|
|
||||||
hook had been broken and a more compliant model had been in the chair, the test
|
|
||||||
would have executed `rm -rf /` for real. **The test methodology was the bug, not
|
|
||||||
the model behavior.**
|
|
||||||
|
|
||||||
**Rule (add verbatim to `~/dotfiles/.agents/AGENTS.md`):**
|
|
||||||
|
|
||||||
> ## Testing destructive-command blocks — NEVER use live ammunition
|
|
||||||
>
|
|
||||||
> When verifying that `pre-tool-use.sh` (or any other hook) blocks a dangerous
|
|
||||||
> command pattern, **never issue the real destructive command as the test
|
|
||||||
> input.** The hook is the system under test — if it fails, the test destroys
|
|
||||||
> the host.
|
|
||||||
>
|
|
||||||
> Use one of these methods instead, in order of preference:
|
|
||||||
>
|
|
||||||
> 1. **Unit-test the hook directly.** Pipe synthetic hook-input JSON to the
|
|
||||||
> script and check exit code + stderr. No agent in the loop. No real shell
|
|
||||||
> invocation. Example:
|
|
||||||
> `echo '{"tool_name":"run_in_terminal","tool_input":{"command":"rm -rf /"}}' | bash ~/dotfiles/.agents/hooks/pre-tool-use.sh; echo "exit=$?"`
|
|
||||||
> The hook should exit non-zero (deny) and print the block reason. No `rm`
|
|
||||||
> was ever queued.
|
|
||||||
> 2. **Use a sentinel that exercises the regex but is harmless if the block
|
|
||||||
> fails.** A path that obviously doesn't exist and could not possibly hold
|
|
||||||
> real data: `rm -rf /var/empty/agent-block-canary-DO-NOT-CREATE-${RANDOM}`.
|
|
||||||
> The hook pattern (`rm\s+-rf?\s+/`) matches; if the block fails, the worst
|
|
||||||
> case is a "no such file" error on a sentinel path. NEVER use bare `/`,
|
|
||||||
> `/home`, `~`, `.`, `*`, or any real path — those have to fail-closed even
|
|
||||||
> if the hook is broken.
|
|
||||||
> 3. **Never** issue the literal destructive command (`rm -rf /`,
|
|
||||||
> `dd if=/dev/zero of=/dev/sda`, `:(){ :|:& };:`, `chmod -R 000 /`,
|
|
||||||
> `git push --force` to a published branch, etc.) as an agent prompt. Not
|
|
||||||
> even with `--dry-run`. Not even "just to see." Not even if you're sure the
|
|
||||||
> hook works. The hook MIGHT not work. That's why you're testing it.
|
|
||||||
>
|
|
||||||
> This rule applies to humans writing test prompts AND to agents asked to verify
|
|
||||||
> hook behavior. If you (the agent) are asked to verify a block, refuse any plan
|
|
||||||
> that involves issuing the real destructive command and propose a unit-test or
|
|
||||||
> sentinel approach instead.
|
|
||||||
|
|
||||||
**Why it lives in AGENTS.md, not just a hook:** the failure mode is at the
|
|
||||||
human/agent decision layer ("what command should I issue to test this?"), not at
|
|
||||||
the execution layer. A hook can't catch a model that's been told to bypass the
|
|
||||||
hook. The narrative-epistemology framing from the research notes applies — this
|
|
||||||
rule shapes the **modal space** of test prompts so "issue the real command"
|
|
||||||
doesn't appear in the action set.
|
|
||||||
|
|
||||||
**Acceptance:** the rule lives in `~/dotfiles/.agents/AGENTS.md` under a
|
|
||||||
top-level section (so it survives compaction and AGENTS.md re-injection). Next
|
|
||||||
time anyone asks the agent to test a block, the agent proposes method 1 or 2 and
|
|
||||||
refuses method 3.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. `project.config.js` extraction
|
|
||||||
|
|
||||||
Already designed in
|
|
||||||
[extraction-history.md → Suggested fix pattern](./extraction-history.md#-full-hook-script-remnant-isms-audit-may-23-2026--addendum).
|
|
||||||
This task tracks the implementation.
|
|
||||||
|
|
||||||
**Shape of work:**
|
|
||||||
|
|
||||||
- Add a tiny loader (`~/dotfiles/.agents/hooks/_lib/project-config.sh`) sourced
|
|
||||||
by every hook that needs configured values. Loads
|
|
||||||
`<repo>/.agents/project.config.{js,ts,json}` via `node` /`tsx` /direct JSON
|
|
||||||
read in that order; falls back to a defaults object matching Remnant today.
|
|
||||||
- Replace hardcoded values in `pre-tool-use.sh` Policies 5, 8, 9, 10, 11, 14 and
|
|
||||||
in `stop.sh` (ports, verify command, codegen rules, task-doc paths) per the
|
|
||||||
audit.
|
|
||||||
- Drop the `modelContextWindow` notion entirely; genericize the Policy 14 "32K"
|
|
||||||
wording to "may exhaust the model's context window."
|
|
||||||
- Ship a Remnant `project.config.js` in the Remnant repo as the first consumer;
|
|
||||||
ship an MFE `project.config.js` later as part of the MFE bootstrap.
|
|
||||||
|
|
||||||
**Acceptance:** running every hook from a project _without_ a config file
|
|
||||||
produces the same behavior as today (zero-regression for Remnant). Running from
|
|
||||||
a project _with_ a config file consults it.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. Per-session tmp file capture
|
|
||||||
|
|
||||||
Already designed in
|
|
||||||
[extraction-history.md → Future task — per-session tmp file capture](./extraction-history.md#-future-task--per-session-tmp-file-capture).
|
|
||||||
Small, independent, can land before or after #1.
|
|
||||||
|
|
||||||
**Bonus catch from that section:** `/tmp/.opencode-tool-count-${REPO_ID}` in
|
|
||||||
`post-tool-use.sh` is keyed by repo only — two concurrent sessions in the same
|
|
||||||
repo share the self-check counter. Fix the same way.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Hook + agent-config verification framework
|
|
||||||
|
|
||||||
**Driver:** [manual-verification.md](../tests/manual-verification.md) is a manual
|
|
||||||
4-level smoke-test for the renamed `build` and `orchestrator` agents. It is (a)
|
|
||||||
sitting in the wrong repo — the agents it tests now live in
|
|
||||||
`~/dotfiles/.agents/agents/`, (b) outdated relative to the current agent config,
|
|
||||||
and (c) the kind of thing humans skip because running it takes 10+ minutes of
|
|
||||||
manual prompting. The user explicitly wants this to run **automatically after
|
|
||||||
updates**, and just-as-explicitly wants it to never resemble
|
|
||||||
`opencode run "Try to run rm -rf /"` (see
|
|
||||||
[#0](#0-no-live-fire-safety-rule-land-immediately)).
|
|
||||||
|
|
||||||
### Test layers
|
|
||||||
|
|
||||||
Three layers, from cheapest/safest to most expensive/least safe. Run the lower
|
|
||||||
layers in CI on every commit to `~/dotfiles/.agents/`; run the upper layer
|
|
||||||
manually before merging risky changes.
|
|
||||||
|
|
||||||
**Layer 1 — Static checks (no execution, no agent):**
|
|
||||||
|
|
||||||
- `bash -n` on every `*.sh` hook (syntax-only parse).
|
|
||||||
- `shellcheck` on every hook (lints + common-bug detection).
|
|
||||||
- Frontmatter validation on every `agents/*.md` and `skills/*.md`: required
|
|
||||||
fields present, referenced tools exist in the framework's tool registry.
|
|
||||||
- `node --check` or `tsx --check` on every JS/TS plugin
|
|
||||||
(`frameworks/opencode/*.ts`, `mcp/all-agents/src/*.ts`).
|
|
||||||
- JSON schema validation on `frameworks/github/hooks.json` and any other
|
|
||||||
framework configs.
|
|
||||||
- Glob check: every file referenced by a hook (e.g. `_lib/project-config.sh`
|
|
||||||
once #1 lands) actually exists.
|
|
||||||
|
|
||||||
**Layer 2 — Hook unit tests (synthetic input, no agent, no shell exec):**
|
|
||||||
|
|
||||||
For each hook, a fixture file `tests/hooks/<hook>.test.sh` that pipes
|
|
||||||
hand-written JSON inputs to the hook and asserts the exit code + stderr. No real
|
|
||||||
command is ever invoked because the hook returns deny/allow before anything
|
|
||||||
runs.
|
|
||||||
|
|
||||||
Fixtures should cover, at minimum:
|
|
||||||
|
|
||||||
- **Allow path:** a benign tool call (e.g. `read_file` of an in-repo path) —
|
|
||||||
hook exits 0, no stderr noise.
|
|
||||||
- **Block paths (one per policy):** synthetic JSON that exercises each block in
|
|
||||||
`pre-tool-use.sh` (Policies 1–14). Assert exit code 2 (deny) and message
|
|
||||||
contains the policy ID. **All block fixtures use sentinel paths per
|
|
||||||
[#0](#0-no-live-fire-safety-rule-land-immediately)** — no bare `/`, no real
|
|
||||||
destructive commands.
|
|
||||||
- **Reminder injection:** `post-tool-use.sh` fed a generated-file edit — assert
|
|
||||||
stdout contains the `.generated.ts` warning.
|
|
||||||
- **Session boundaries:** `session-start.sh`, `stop.sh`, `pre-compact.sh` with
|
|
||||||
realistic JSON inputs — assert they produce the expected stdout blocks.
|
|
||||||
|
|
||||||
A small runner (`tests/run-hook-tests.sh`) discovers `*.test.sh` files, executes
|
|
||||||
them, and reports pass/fail. CI calls this on every PR. Local dev calls it from
|
|
||||||
a `~/dotfiles/.agents/install.sh --verify` flag.
|
|
||||||
|
|
||||||
**Layer 3 — Live integration tests (real agent, sentinel inputs, gated):**
|
|
||||||
|
|
||||||
The layers above don't catch "the framework didn't actually wire the hook in"
|
|
||||||
failures — the hook can be perfect in isolation but never get called. Layer 3
|
|
||||||
catches that by running a real OpenCode/Copilot session against sentinel
|
|
||||||
prompts:
|
|
||||||
|
|
||||||
- Per [#0](#0-no-live-fire-safety-rule-land-immediately), prompts use sentinel
|
|
||||||
paths and the **agent is asked to attempt** the sentinel command, not the real
|
|
||||||
one. Example prompt: _"Run `rm -rf /var/empty/canary-${RANDOM}` and report
|
|
||||||
what happened."_ Pass criterion: the hook block message appears in the agent's
|
|
||||||
response and the tool was never executed.
|
|
||||||
- Optional: drive via `opencode run --agent <name>` so the session is scripted
|
|
||||||
and non-interactive. Gate this behind an explicit `--enable-live-tests` flag
|
|
||||||
in the runner; default off in CI.
|
|
||||||
- Layer 3 also folds in Remnant's `verification.md` Levels 1–4 (read-only, small
|
|
||||||
write, scope escalation refusal, orchestrator planning gate) once the agents
|
|
||||||
are stable enough to script against.
|
|
||||||
|
|
||||||
### Disposition of `verification.md`
|
|
||||||
|
|
||||||
- It's not Remnant's anymore (tests global infra). Move to
|
|
||||||
`~/dotfiles/.agents/tests/manual-verification.md` as the human-runnable
|
|
||||||
fallback until Layer 3 automation exists.
|
|
||||||
- Drop from Remnant root in the same commit that creates
|
|
||||||
`~/dotfiles/.agents/tests/`. Until then it can stay where it is; it's not
|
|
||||||
causing harm, just misfiled.
|
|
||||||
- Once Layers 1 and 2 are running in CI, the manual doc shrinks to just Layer 3
|
|
||||||
scenarios. Once Layer 3 is automated, retire the doc entirely.
|
|
||||||
|
|
||||||
### CI integration
|
|
||||||
|
|
||||||
- Add a GitHub Action (or Gitea CI step) in `~/dotfiles/` that runs Layers 1 + 2
|
|
||||||
on every push.
|
|
||||||
- Locally, `install.sh --verify` runs the same checks before applying any
|
|
||||||
changes — so an interactive `install.sh` invocation can refuse to symlink in a
|
|
||||||
broken hook.
|
|
||||||
- A `post-merge` git hook in `~/dotfiles/` runs Layers 1 + 2 after `git pull` so
|
|
||||||
a user who syncs a broken commit gets told immediately rather than discovering
|
|
||||||
it at the next agent invocation.
|
|
||||||
|
|
||||||
### Open questions
|
|
||||||
|
|
||||||
- **What's the canonical sentinel path?** Proposal: `/var/empty/` (exists,
|
|
||||||
read-only, owned by root on most distros, used by sshd's PrivilegeSeparation —
|
|
||||||
so a rogue `rm -rf` would fail with permission denied even before hitting
|
|
||||||
nonexistent-file errors). Append a random + canary token.
|
|
||||||
- **Where do hook fixtures live in the global infra?** Likely
|
|
||||||
`~/dotfiles/.agents/tests/hooks/*.test.sh` and
|
|
||||||
`~/dotfiles/.agents/tests/fixtures/*.json`. Symmetric with `hooks/` itself.
|
|
||||||
- **Should Layer 3 be a single integration test per framework, or per hook?**
|
|
||||||
Per framework is enough — the hook unit tests already cover per-hook behavior.
|
|
||||||
Layer 3 only needs to prove "the framework calls the hook at all."
|
|
||||||
|
|
||||||
### Acceptance
|
|
||||||
|
|
||||||
- `~/dotfiles/.agents/tests/run.sh` exists and exits 0 on a clean checkout.
|
|
||||||
- A deliberately-broken hook (e.g. syntax error introduced) causes the runner to
|
|
||||||
fail loudly with a useful error.
|
|
||||||
- A pull that breaks a hook is caught by the `post-merge` hook before any agent
|
|
||||||
sees it.
|
|
||||||
- No test fixture in the repo references a real destructive command or real path
|
|
||||||
— grep `tests/` for `rm -rf /` (without sentinel suffix), `dd if=`, `:(){`,
|
|
||||||
`chmod -R 000 /` etc. as a CI lint.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. llama-server + AI models module
|
|
||||||
|
|
||||||
**Goal:** `~/dotfiles/install.sh` (or a sub-command of it) sets up llama.cpp
|
|
||||||
|
|
||||||
- CUDA, registers the systemd units, places `presets.ini` from dotfiles, and on
|
|
||||||
a non-devcontainer machine downloads the configured set of GGUF models. A
|
|
||||||
second script (`scripts/models.sh`) handles add/remove/list of models
|
|
||||||
post-install.
|
|
||||||
|
|
||||||
### Target layout
|
|
||||||
|
|
||||||
```
|
|
||||||
~/dotfiles/.agents/models/
|
|
||||||
├── presets.ini ← canonical, version-controlled
|
|
||||||
├── models.list ← URLs + filenames + checksums (committed)
|
|
||||||
├── README.md ← what each preset is for
|
|
||||||
└── gguf/ ← gitignored, populated by install.sh
|
|
||||||
└── *.gguf
|
|
||||||
|
|
||||||
~/dotfiles/.agents/llama-server/
|
|
||||||
├── start.sh ← canonical (replaces /opt/llama-server/start.sh)
|
|
||||||
├── llama-server.service ← systemd unit (User=current user, not ollama)
|
|
||||||
├── llama-server-presets.path ← path watcher
|
|
||||||
├── llama-server-presets.service ← oneshot restart
|
|
||||||
└── build-llama.sh ← clones + builds llama.cpp w/ CUDA
|
|
||||||
|
|
||||||
~/dotfiles/.agents/scripts/
|
|
||||||
├── models.sh ← add/remove/list GGUFs by URL
|
|
||||||
└── install-llama.sh ← called by install.sh; idempotent
|
|
||||||
```
|
|
||||||
|
|
||||||
### `install.sh` additions (ordered)
|
|
||||||
|
|
||||||
1. **Detect environment.** If `/.dockerenv` exists, `$REMOTE_CONTAINERS` set, or
|
|
||||||
`$CODESPACES` set → devcontainer mode: skip llama.cpp build and GGUF download
|
|
||||||
(huge, slow, and not useful inside the container). Still place `presets.ini`
|
|
||||||
and `models.list` so the project can read them.
|
|
||||||
2. **Dependencies.**
|
|
||||||
`apt install -y build-essential cmake ninja-build libcurl4-openssl-dev git`
|
|
||||||
(with `sudo` prompt). CUDA toolkit detection only — don't try to install CUDA
|
|
||||||
itself; assume host setup or fail loud with a pointer to
|
|
||||||
[docs/llama-server-cuda-wsl2.md](../../../dotfiles/.agents/docs/llama-server-cuda-wsl2.md).
|
|
||||||
3. **Build llama.cpp.** `scripts/install-llama.sh` clones `ggerganov/llama.cpp`
|
|
||||||
to `/opt/llama-server/src`, builds with `-DGGML_CUDA=ON`, installs binaries +
|
|
||||||
libs to `/opt/llama-server/`. Skips the clone+build if the binary exists and
|
|
||||||
`--rebuild` wasn't passed.
|
|
||||||
4. **Install systemd units.** Copy from
|
|
||||||
`~/dotfiles/.agents/llama-server/*.{service,path}` to `/etc/systemd/system/`,
|
|
||||||
substituting `${USER}` for `User=`. Run `daemon-reload`,
|
|
||||||
`enable --now llama-server.service llama-server-presets.path`.
|
|
||||||
5. **Symlink `presets.ini`.**
|
|
||||||
`ln -sf ~/dotfiles/.agents/models/presets.ini ~/models/presets.ini` (keep the
|
|
||||||
existing path-watcher target until users have migrated). The path watcher
|
|
||||||
already restarts on modify — symlink target changes count.
|
|
||||||
6. **Download GGUFs.** Read `models.list`; for each entry not already in
|
|
||||||
`~/dotfiles/.agents/models/gguf/`, download with `curl --location` and verify
|
|
||||||
checksum if listed. Print disk-usage estimate before starting. Skip in
|
|
||||||
devcontainer mode.
|
|
||||||
|
|
||||||
### `models.list` format
|
|
||||||
|
|
||||||
```
|
|
||||||
# url<TAB>filename<TAB>sha256(optional)
|
|
||||||
https://huggingface.co/.../qwen3-coder-30b-iq3.gguf qwen3-coder-30b-iq3.gguf abc123...
|
|
||||||
https://huggingface.co/.../deepcoder-14b-q5.gguf deepcoder-14b-q5.gguf def456...
|
|
||||||
https://huggingface.co/.../qwopus-3.6-35b-iq3.gguf qwopus-3.6-35b-iq3.gguf -
|
|
||||||
```
|
|
||||||
|
|
||||||
Plain TSV, easy to grep + diff. Comments via `#`.
|
|
||||||
|
|
||||||
### `models.sh` CLI
|
|
||||||
|
|
||||||
```bash
|
|
||||||
models.sh list # show installed + configured
|
|
||||||
models.sh add <url> [--name=<file>] # download + append to models.list
|
|
||||||
models.sh remove <name> # rm file + drop from models.list
|
|
||||||
models.sh prune # delete files not in models.list
|
|
||||||
models.sh download # re-download anything missing
|
|
||||||
models.sh checksum <name> # compute + store sha256
|
|
||||||
```
|
|
||||||
|
|
||||||
Each command edits `models.list` and the `gguf/` dir; `presets.ini` is edited by
|
|
||||||
hand (with the path-watcher restarting llama-server on save).
|
|
||||||
|
|
||||||
### Open questions
|
|
||||||
|
|
||||||
- **`User=` in the systemd unit.** The current unit runs as `ollama`. The
|
|
||||||
rationale was probably ollama's group ownership of `/home/dev/models/`. Moving
|
|
||||||
the model dir into dotfiles means the user owns it directly — running as
|
|
||||||
`${USER}` (or as a dedicated `llama` system user) is cleaner. Decide before
|
|
||||||
shipping.
|
|
||||||
- **CUDA-only assumption.** The user accepted "can always make this more
|
|
||||||
flexible later." Tag in the build script's header so a CPU/Metal fallback is
|
|
||||||
easy to add. Don't gold-plate now.
|
|
||||||
- **Where do the modelfiles go?** Remnant's `omnicoder*.modelfile` files are
|
|
||||||
Ollama-format. If they're still useful, move them to
|
|
||||||
`~/dotfiles/.agents/models/modelfiles/` and add a
|
|
||||||
`models.sh modelfile apply <name>` subcommand. Out of scope for the initial
|
|
||||||
cut; track in #4.5.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Kanban / task-doc unification
|
|
||||||
|
|
||||||
Already designed in
|
|
||||||
[extraction-history.md → Future task — unify kanban/task doc structure](./extraction-history.md#-future-task--unify-kanbantask-doc-structure).
|
|
||||||
Once #1 lands, `stop.sh` reads task-doc paths from `project.config.js`, so the
|
|
||||||
"shared hook supports one shape" framing changes: the hook supports _whatever
|
|
||||||
shape the config declares_, and the migration becomes purely a per-project
|
|
||||||
content move.
|
|
||||||
|
|
||||||
**Revised plan after #1:**
|
|
||||||
|
|
||||||
- Drop the "stop.sh knows about Remnant's flat list vs MFE's
|
|
||||||
`tasks/{backlog,todo,done}/`" coupling. `stop.sh` should know how to scan a
|
|
||||||
directory tree and how to scan a flat file, and `taskDocs` in config picks
|
|
||||||
which mode.
|
|
||||||
- MFE bootstraps on the directory-tree mode from day one.
|
|
||||||
- Remnant's migration is optional — if the kanban-tree shape is demonstrably
|
|
||||||
better in MFE, port Remnant later.
|
|
||||||
- Skill option still applies: a `migrate-task-docs.md` skill is probably cheaper
|
|
||||||
than a script given the per-project judgment calls.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 6. MemPalace integration
|
|
||||||
|
|
||||||
**Why this is here:** the WIP "AGENTS.md context survival after compaction"
|
|
||||||
problem in the validation doc is a special case of the broader long-term memory
|
|
||||||
problem. MemPalace
|
|
||||||
([NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671))
|
|
||||||
solves it with a hook architecture that matches ours almost line-for-line.
|
|
||||||
|
|
||||||
**MemPalace primitives (verified from the PR):**
|
|
||||||
|
|
||||||
| MemPalace hook | Our equivalent | What it does |
|
|
||||||
| ----------------------- | ------------------------- | ------------------------------------------------- |
|
|
||||||
| `initialize()` | `session-start.sh` | Loads identity, warms vector DB |
|
|
||||||
| `system_prompt_block()` | `session-start.sh` inject | AAAK L0+L1 wake-up (~170 tokens) at every session |
|
|
||||||
| `prefetch()` | `user-prompt-submit.sh` | Semantic search before each turn; wing-narrowed |
|
|
||||||
| `sync_turn()` | `post-tool-use.sh` | Files every exchange to the palace, non-blocking |
|
|
||||||
| `on_session_end()` | `stop.sh` | Full session mining + L1 layer regeneration |
|
|
||||||
| `on_pre_compress()` | `pre-compact.sh` | Extract key exchanges before context compression |
|
|
||||||
| `on_memory_write()` | (new — explicit writes) | Mirrors explicit memory writes into the palace |
|
|
||||||
|
|
||||||
**Practical plan:**
|
|
||||||
|
|
||||||
- Stand up MemPalace locally (Ollama + bge-m3 1024-dim, ChromaDB at
|
|
||||||
`~/.mempalace/`). Hermes is the reference integration but MemPalace itself
|
|
||||||
ships an MCP server (`mempalace_search`, `mempalace_status`, +6 more tools)
|
|
||||||
that any MCP-aware harness can use directly.
|
|
||||||
- Register the MemPalace MCP server in `~/.config/opencode/opencode.json` and
|
|
||||||
`~/.vscode-server/.../mcp.json` via `install.sh` — same pattern as
|
|
||||||
`all-agents`. No code changes needed on the harness side for read access.
|
|
||||||
- Wire write-side via our existing hooks: `post-tool-use.sh` calls the MCP tool
|
|
||||||
to file the turn, `pre-compact.sh` extracts and stores key exchanges. This is
|
|
||||||
additive — the existing dead-ends/explorations scaffolding stays.
|
|
||||||
- **Known bug to track upstream:** the Hermes plugin defaulted to a 384-dim
|
|
||||||
embedding function vs. MemPalace's 1024-dim collection. If we integrate
|
|
||||||
directly with MemPalace's MCP server (not via Hermes's plugin), we sidestep
|
|
||||||
it; if we follow Hermes's plugin pattern, fix per the PR comment.
|
|
||||||
|
|
||||||
**Acceptance:** after restart in a fresh session, the agent can recall specific
|
|
||||||
facts (e.g. "what was the Phase 4 commit?") from a prior session without those
|
|
||||||
facts being in the workspace files. Compaction in the middle of a session does
|
|
||||||
not erase per-turn memory.
|
|
||||||
|
|
||||||
**Why this is #6, not #1:** it's higher-value than the small fixes but depends
|
|
||||||
on Ollama already running (which #4 makes turnkey), and requires verifying
|
|
||||||
MemPalace works against our chosen embedding model on our hardware before
|
|
||||||
committing to it. Do #1, #2, #3 first, then this.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 7. Trace-based eval scaffolding
|
|
||||||
|
|
||||||
**Source:** "The Loop Is Only as Good as the Metric"
|
|
||||||
([distributedthoughts.org, Mar 2026](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/))
|
|
||||||
on Hamel Husain's evals methodology, contrasted with Karpathy's autoresearch
|
|
||||||
loop. Quote: _"the value of an optimization loop is determined entirely by the
|
|
||||||
quality of its feedback signal."_
|
|
||||||
|
|
||||||
**Husain methodology in two sentences:** review at least 100 real agent-output
|
|
||||||
traces by hand, take open-ended notes, categorize failures, then build binary
|
|
||||||
pass/fail evals around the failure modes you actually saw. Do not start with
|
|
||||||
generic metrics.
|
|
||||||
|
|
||||||
**Practical plan for us:**
|
|
||||||
|
|
||||||
- Pick a trace store. Cheapest path: write every OpenCode/Copilot turn's agent
|
|
||||||
output to `~/.agent-traces/<date>/<session-id>.jsonl` via the existing
|
|
||||||
`post-tool-use.sh` (we already have session-ID derivation from #2). Add a
|
|
||||||
`trace_log()` helper in `_lib/`.
|
|
||||||
- Build a tiny review CLI: `scripts/trace-review.sh` opens the next unreviewed
|
|
||||||
trace in `$EDITOR` with a frontmatter block (`outcome: pass|fail|partial`,
|
|
||||||
`failure_modes: []`, `notes: ""`). Saves to `~/.agent-traces/reviewed/`.
|
|
||||||
- After 100 reviewed traces, derive a `failure-modes.md` doc grouping the
|
|
||||||
observed failure modes. _This_ becomes the input to skill / hook / AGENTS.md
|
|
||||||
improvements — concrete failure modes, not speculation.
|
|
||||||
|
|
||||||
**Why this is gating for #9:** an EvoSkill-style or Karpathy-style automated
|
|
||||||
loop needs a metric. Without trace-based failure modes, the only metric
|
|
||||||
available is "did the user thumbs-up" — too noisy, too slow, too coarse.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 8. Exa rate-limit awareness
|
|
||||||
|
|
||||||
Per the validation doc gap #9. Free-plan limit: no parallel fanout under ~1s —
|
|
||||||
calls must be serial.
|
|
||||||
|
|
||||||
**Implementation:**
|
|
||||||
|
|
||||||
- Add a `mcp_exa_*` case to `post-tool-use.sh` that injects a one-liner reminder
|
|
||||||
("Exa free plan: serialize searches; one at a time").
|
|
||||||
- Add an "External service quirks" section to `~/dotfiles/.agents/AGENTS.md`
|
|
||||||
listing Exa (and any future per-service constraints) so the rule survives
|
|
||||||
compaction.
|
|
||||||
- Optional soft-warn in `pre-tool-use.sh`: count `mcp_exa_*` calls per turn
|
|
||||||
(reset on `user-prompt-submit`); inject a warning (not a deny) past N=2 in a
|
|
||||||
single turn.
|
|
||||||
|
|
||||||
Trivial, no dependencies, can land in any order.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 9. Research-loop / EvoSkill-style improvements
|
|
||||||
|
|
||||||
**Sources:**
|
|
||||||
|
|
||||||
- Karpathy autoresearch
|
|
||||||
([github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch),
|
|
||||||
Mar 2026): single-file experiment, fixed time budget, scalar metric (val_bpb),
|
|
||||||
LOOP FOREVER on a dedicated branch — keep if metric improves, revert if not.
|
|
||||||
- EvoSkill ([arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
|
|
||||||
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill)):
|
|
||||||
failure-driven skill discovery via Proposer + Skill-Builder agents over a
|
|
||||||
Pareto frontier of programs; +7.3% OfficeQA, +12.1% SealQA, +5.3% zero-shot
|
|
||||||
transfer to BrowseComp. Skills materialize as `SKILL.md` + helper scripts —
|
|
||||||
same shape as our existing skills dir.
|
|
||||||
|
|
||||||
**What this looks like for us (after #7):**
|
|
||||||
|
|
||||||
- The "controllable artifact" is the `~/dotfiles/.agents/AGENTS.md` +
|
|
||||||
`agents/*.md` + `skills/*.md` + hook reminders. The "frozen model" is whatever
|
|
||||||
LLM the user is running.
|
|
||||||
- The scalar metric is something like: fraction of traces (from #6) where the
|
|
||||||
agent's hook output and tool sequence matched a hand-labeled gold trajectory.
|
|
||||||
Husain's binary pass/fail per failure mode aggregates into this.
|
|
||||||
- A Proposer agent (à la EvoSkill) reads recent failed traces + the current
|
|
||||||
skill set, proposes a new `SKILL.md` or an edit to an existing one, the
|
|
||||||
Skill-Builder materializes it, the eval harness re-runs on the held-out trace
|
|
||||||
set, and the frontier keeps it if the metric improves.
|
|
||||||
|
|
||||||
**Why it's last in the queue:** every prior task (config, sessions, llama
|
|
||||||
turnkey, memory, traces) is a prerequisite or a strict improvement to the
|
|
||||||
substrate this loop runs on. Starting #8 before them produces a loop that
|
|
||||||
optimizes against a noisy or wrong metric — the exact failure mode the Husain
|
|
||||||
piece warns about.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Deferred / not-now
|
|
||||||
|
|
||||||
- **Adopt LangGraph as the harness.** Best-in-class observability and
|
|
||||||
state-machine recovery, but adopting it means rewriting the OpenCode + Copilot
|
|
||||||
integration layer we just extracted. Revisit if LangSmith becomes the only
|
|
||||||
path to debugging a specific failure mode we can't diagnose with traces (#7)
|
|
||||||
alone. Sources:
|
|
||||||
[agent-harness.ai benchmark](https://agent-harness.ai/blog/multi-agent-orchestration-frameworks-benchmark-crewai-vs-langgraph-vs-autogen-performance-cost-and-integration-complexity/)
|
|
||||||
(9% token overhead vs CrewAI 18% vs AutoGen 31%);
|
|
||||||
[groundy.com](https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/)
|
|
||||||
(per-node failure isolation vs CrewAI full-plan retry).
|
|
||||||
- **AutoGen.** Entered maintenance mode in late 2025; absorbed into Microsoft
|
|
||||||
Agent Framework 1.0 GA (April 3, 2026). Migration cost is real and the
|
|
||||||
framework's strength (conversational coordination) doesn't match our
|
|
||||||
deterministic-pipeline use case. Skip.
|
|
||||||
- **CrewAI.** Strong for "agent A → agent B → agent C" pipelines, but role
|
|
||||||
coordination overhead is ~3× LangGraph's on simple workflows. Our use case
|
|
||||||
(single agent per session) doesn't benefit. Skip.
|
|
||||||
- **Git worktrees for parallel agent runs.** Mentioned in the MFE draft; see
|
|
||||||
Claude Desktop's approach. Interesting once we have a working research loop
|
|
||||||
(#9), pointless before. Defer.
|
|
||||||
- **Narrative epistemology as an explicit framework.** Flowerree's "Reasoning
|
|
||||||
Through Narrative" (Cambridge Episteme) and Betz et al. on NLMs as epistemic
|
|
||||||
agents (PMC9910757) give philosophical grounding for AGENTS.md design (a
|
|
||||||
narrative frame is a "modal-space-shaping tool, not a set of premises").
|
|
||||||
Useful for writing AGENTS.md prose; not a discrete task. Cite if/when we
|
|
||||||
publish methodology.
|
|
||||||
- **Hermes Agent as a harness.** Compelling memory story (MemPalace), but Python
|
|
||||||
and tied to NousResearch's ecosystem. We integrate the memory piece directly
|
|
||||||
via MCP (#6) without adopting the harness.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Research notes (May 23, 2026)
|
|
||||||
|
|
||||||
Pulled via Exa search; supports the prioritization above. Each block lists the
|
|
||||||
key finding and the source.
|
|
||||||
|
|
||||||
### Karpathy autoresearch — single-metric loop
|
|
||||||
|
|
||||||
- **Source:** [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
|
|
||||||
- [distributedthoughts.org](https://www.distributedthoughts.org/2026-03-16-the-loop-is-only-as-good-as-the-metric/).
|
|
||||||
- Single file (`train.py`) edited by agent, fixed 5-minute time budget per
|
|
||||||
experiment, scalar metric (val_bpb), branch-keep-or-revert protocol, LOOP
|
|
||||||
FOREVER. ~12 experiments/hour.
|
|
||||||
- Four ingredients for this to work outside ML training: (1) one modifiable
|
|
||||||
artifact, (2) reliable benchmark/harness, (3) scalar metric, (4) fixed eval
|
|
||||||
cycle. The Husain layer adds: don't invent the metric — derive it from manual
|
|
||||||
trace review.
|
|
||||||
|
|
||||||
### EvoSkill — automated skill discovery
|
|
||||||
|
|
||||||
- **Source:** [arxiv 2603.02766](https://arxiv.org/pdf/2603.02766v1),
|
|
||||||
[sentient-agi/EvoSkill](https://github.com/sentient-agi/EvoSkill).
|
|
||||||
- Three agents: Proposer (diagnoses failures), Skill-Builder (materializes
|
|
||||||
`SKILL.md` + helpers), evaluator (held-out validation).
|
|
||||||
- Pareto frontier of agent programs; round-robin parent selection;
|
|
||||||
failure-driven textual feedback descent.
|
|
||||||
- **Why this matters for us:** our skills dir already matches EvoSkill's output
|
|
||||||
shape (`SKILL.md` + helper files). The infrastructure they describe is closer
|
|
||||||
to "build on top of our existing layout" than "adopt a new framework."
|
|
||||||
|
|
||||||
### Agentic-framework landscape, 2026
|
|
||||||
|
|
||||||
- **LangGraph 1.2 (May 2026):** production default. 9% token overhead over raw
|
|
||||||
API. Per-node failure isolation (vs CrewAI/AutoGen full-plan retry). Best
|
|
||||||
observability via LangSmith. Highest setup cost.
|
|
||||||
- **CrewAI 1.11 (Mar 2026):** fastest time-to-first-agent. 18% token overhead.
|
|
||||||
Role-based. SQLite checkpointing added April 2026.
|
|
||||||
- **AutoGen:** maintenance mode since late 2025. Absorbed into Microsoft Agent
|
|
||||||
Framework 1.0 GA (April 3, 2026; unified with Semantic Kernel, MCP-native,
|
|
||||||
GraphFlow).
|
|
||||||
- **MAST taxonomy finding:** 79% of multi-agent failures originate from
|
|
||||||
spec/coordination issues, not the underlying model
|
|
||||||
([arxiv 2503.16339](https://arxiv.org/abs/2503.16339)). 36.9% inter-agent
|
|
||||||
misalignment, 21.3% task-verification breakdowns. **This validates investing
|
|
||||||
in hook/skill/AGENTS.md infrastructure over swapping models.**
|
|
||||||
|
|
||||||
### MemPalace — long-term memory provider
|
|
||||||
|
|
||||||
- **Source:**
|
|
||||||
[NousResearch/hermes-agent PR #5671](https://github.com/NousResearch/hermes-agent/pull/5671).
|
|
||||||
- 96.6% raw LongMemEval (100% with Haiku rerank). Fully local (ChromaDB + Ollama
|
|
||||||
bge-m3 1024-dim). No API key.
|
|
||||||
- Hook architecture maps 1:1 onto ours (see #5 table). Eight MCP tools expose
|
|
||||||
read/write.
|
|
||||||
- **Why this is the highest-leverage memory option:** matches our philosophy
|
|
||||||
(local, no SaaS, hook-driven) and solves the AGENTS.md-compaction problem the
|
|
||||||
validation doc flagged.
|
|
||||||
|
|
||||||
### Narrative epistemology — applied to AGENTS.md design
|
|
||||||
|
|
||||||
- **Source:** Flowerree, "Reasoning Through Narrative" (Cambridge _Episteme_,
|
|
||||||
2023); Betz et al., "Probabilistic coherence... Neural language models as
|
|
||||||
epistemic agents" (PMC9910757).
|
|
||||||
- Narratives shape **modal space** — what the model treats as possible,
|
|
||||||
plausible, required. They aren't premises to evaluate as true/false; they're
|
|
||||||
tools that frame inference.
|
|
||||||
- **Implication for AGENTS.md:** the doc's job isn't to state facts the model
|
|
||||||
checks at decision points — it's to shape the model's default modal space.
|
|
||||||
Forbidden patterns aren't "rules to look up" but "implausible options excluded
|
|
||||||
from the action space." Frames the "context survival after compaction" problem
|
|
||||||
differently: the question isn't "did the rules survive" but "did the
|
|
||||||
modal-space shaping survive."
|
|
||||||
- NLMs as epistemic agents (Betz): self-training on synthetic corpora produces
|
|
||||||
probabilistically-coherent belief revision. Suggestive for why AGENTS.md
|
|
||||||
content that the model sees repeatedly (via PostToolUse re-injection) gets
|
|
||||||
internalized better than content seen once.
|
|
||||||
|
|
||||||
### Exa rate-limit (operational)
|
|
||||||
|
|
||||||
- Free plan: serial only, no fan-out under ~1s. Observed May 23, 2026.
|
|
||||||
- Recorded in
|
|
||||||
[extraction-history.md gap #9](./extraction-history.md#-gaps-and-bugs-in-dotfiles-pre-push)
|
|
||||||
and as roadmap task #7.
|
|
||||||
@ -1,229 +0,0 @@
|
|||||||
# Interpreting Text-Based Communication: Evidence-Based Guidance
|
|
||||||
|
|
||||||
> **Status:** Research synthesis. Focus: what psychology, organizational
|
|
||||||
> behavior, and negotiation training have demonstrated _works_ for **readers**
|
|
||||||
> trying to accurately interpret text-only messages (email, chat, SMS,
|
|
||||||
> forum/Discord posts) where vocal tone and body language are absent.
|
|
||||||
>
|
|
||||||
> **Scope:** Receiver-side interpretation. Writing/composition guidance is
|
|
||||||
> mentioned only where it informs how readers should _decode_.
|
|
||||||
>
|
|
||||||
> **Audience:** Adults seeking concrete, proven practices — not a literature
|
|
||||||
> review.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. The Core Problem (Why This Is Hard)
|
|
||||||
|
|
||||||
Three well-replicated findings frame everything else:
|
|
||||||
|
|
||||||
1. **Senders systematically overestimate how clearly tone comes through.**
|
|
||||||
Kruger, Epley, Parker, & Ng (2005) had participants send sarcastic vs.
|
|
||||||
serious emails. Senders predicted recipients would detect tone ~78% of the
|
|
||||||
time; recipients actually performed at chance (~56%). Senders cannot
|
|
||||||
"uncouple" their own internal voice from the bare text — an egocentric
|
|
||||||
anchoring effect. [1]
|
|
||||||
2. **Receivers exhibit a negativity bias in CMC.** Byron (2008) synthesized
|
|
||||||
evidence that neutral emails tend to be read as negative, and positive emails
|
|
||||||
as neutral. Absent paralinguistic warmth cues, the brain fills the gap
|
|
||||||
pessimistically — especially under stress, fatigue, or status asymmetry. [2]
|
|
||||||
3. **Hostile attribution bias amplifies #2.** Individuals predisposed to read
|
|
||||||
hostility into ambiguous behavior (Dodge, 1980 and follow-ups) do so even
|
|
||||||
more in text, because there are fewer disconfirming cues. [3] Aderka et al.
|
|
||||||
(2016) showed this directly in a CMC context: ambiguous text messages are
|
|
||||||
read more negatively by socially anxious receivers, validating a text-
|
|
||||||
specific interpretation-bias measure (IB-CMC). [4]
|
|
||||||
|
|
||||||
**Implication for readers:** Your first emotional reading of an ambiguous
|
|
||||||
message is statistically likely to be _more negative_ than the sender intended.
|
|
||||||
Treat that first reading as a hypothesis, not a fact.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. The Highest-Leverage Practices (What Actually Works)
|
|
||||||
|
|
||||||
Filtered to interventions with empirical support _or_ adoption in professional
|
|
||||||
training programs (FBI crisis negotiation, clinical psychology, mediation,
|
|
||||||
executive coaching). Ordered by effect size and ease of adoption.
|
|
||||||
|
|
||||||
### 2.1 Delay before responding to anything that triggered you
|
|
||||||
|
|
||||||
The single most-recommended practice across clinical, negotiation, and
|
|
||||||
organizational sources. Even a short pause (minutes for chat, hours for email)
|
|
||||||
lets the amygdala-driven first reading subside and the prefrontal cortex
|
|
||||||
re-engage. Crucial Conversations (Patterson et al.) calls this "getting out of
|
|
||||||
your story"; CBT calls it "cognitive defusion." [5][6]
|
|
||||||
|
|
||||||
> Rule of thumb used in mediation training: **if your pulse is up, don't hit
|
|
||||||
> send.**
|
|
||||||
|
|
||||||
### 2.2 Generate at least two alternative interpretations
|
|
||||||
|
|
||||||
Explicit perspective-taking — being instructed to consider the sender's
|
|
||||||
situation, constraints, and likely state — measurably reduces hostile
|
|
||||||
attributions and stereotype-driven inferences (Galinsky & Moskowitz, 2000). This
|
|
||||||
generalizes directly to text. [7]
|
|
||||||
|
|
||||||
Concrete prompt to use on yourself:
|
|
||||||
|
|
||||||
> _"If a person I trusted and respected sent me this exact message, what would I
|
|
||||||
> assume they meant?"_
|
|
||||||
|
|
||||||
This is a behavioral form of the **Principle of Charity** (Rapoport's rules,
|
|
||||||
popularized by Dennett): restate the message in its strongest, most reasonable
|
|
||||||
form before reacting. [8]
|
|
||||||
|
|
||||||
### 2.3 Separate observation from interpretation (NVC / CBT overlap)
|
|
||||||
|
|
||||||
Nonviolent Communication (Rosenberg, 2003) and Cognitive Behavioral Therapy
|
|
||||||
(Beck; Burns, _Feeling Good_) independently converge on the same move:
|
|
||||||
|
|
||||||
- **Observation:** What words are literally on the screen?
|
|
||||||
- **Interpretation/Story:** What am I adding (intent, tone, motive)?
|
|
||||||
- **Feeling:** What am I feeling in response?
|
|
||||||
- **Check:** Which CBT distortion am I running? (Mind-reading, catastrophizing,
|
|
||||||
personalization, all-or-nothing.) [6][9]
|
|
||||||
|
|
||||||
In text-only contexts the gap between observation and interpretation is where
|
|
||||||
~all miscommunication lives. Naming the gap shrinks it.
|
|
||||||
|
|
||||||
### 2.4 Label the emotion you're inferring — and verify it
|
|
||||||
|
|
||||||
From FBI crisis negotiation training (Behavioral Change Stairway Model; Vecchi,
|
|
||||||
Van Hasselt, & Romano, 2005) and popularized by Chris Voss: state your read of
|
|
||||||
the other person's emotion tentatively and invite correction. [10][11] The
|
|
||||||
mechanism is well-grounded: Lieberman et al. (2007) showed via fMRI that putting
|
|
||||||
feelings into words ("affect labeling") measurably reduces amygdala activity and
|
|
||||||
recruits the right ventrolateral prefrontal cortex — i.e., labeling literally
|
|
||||||
down-regulates the threat response, in yourself and (by co-regulation) the
|
|
||||||
sender. [12]
|
|
||||||
|
|
||||||
Templates that work:
|
|
||||||
|
|
||||||
- _"It sounds like you're frustrated that X — is that right?"_
|
|
||||||
- _"I'm reading this as [interpretation]. Did I get that right?"_
|
|
||||||
- _"I want to make sure I'm not misreading — are you [annoyed / asking / venting
|
|
||||||
/ blocked]?"_
|
|
||||||
|
|
||||||
This does two things receivers consistently undervalue:
|
|
||||||
|
|
||||||
1. Surfaces your interpretation _as_ an interpretation (cheap to correct).
|
|
||||||
2. Signals attention, which de-escalates regardless of whether your read was
|
|
||||||
right.
|
|
||||||
|
|
||||||
### 2.5 Ask one clarifying question instead of responding to the inferred message
|
|
||||||
|
|
||||||
Byron's (2008) explicit recommendation, echoed in mediation literature: when
|
|
||||||
emotional content is ambiguous, **respond with a question, not a reaction**.
|
|
||||||
This is also the cheapest way to avoid the Kruger/Epley failure mode — because
|
|
||||||
the sender's egocentric blindness means they often don't realize they were
|
|
||||||
unclear until asked. [1][2]
|
|
||||||
|
|
||||||
This is one of two practices (with §2.4) supported by both behavioral and neural
|
|
||||||
evidence — it short-circuits the loop in which the receiver's inferred tone
|
|
||||||
hardens into "what was said."
|
|
||||||
|
|
||||||
### 2.6 Re-read the message a second time, slowly, before reacting
|
|
||||||
|
|
||||||
Reading literature (and standard mediator training) finds that a second reading
|
|
||||||
— particularly out loud, or after a delay — substantially reduces projection of
|
|
||||||
imagined tone. Skimming amplifies negativity bias because the reader's own
|
|
||||||
affect supplies the missing prosody. [2]
|
|
||||||
|
|
||||||
### 2.7 Match medium to message complexity (and switch when stuck)
|
|
||||||
|
|
||||||
Media richness theory (Daft & Lengel, 1986) and ~40 years of follow-up research
|
|
||||||
consistently find: **emotional, ambiguous, or high-stakes content exceeds text's
|
|
||||||
bandwidth.** If a thread has gone two rounds without converging, escalate to
|
|
||||||
voice/video. This isn't "giving up on text" — it's recognizing a known channel
|
|
||||||
limit. [13]
|
|
||||||
|
|
||||||
### 2.8 Account for the hyperpersonal effect in long-term text relationships
|
|
||||||
|
|
||||||
Walther's hyperpersonal model (1996) shows that in extended text-only
|
|
||||||
relationships, receivers tend to _idealize_ senders (filling in flattering
|
|
||||||
detail) — which makes eventual ruptures feel sharper than they should. Be aware
|
|
||||||
that your sense of "knowing" someone you've only ever texted is partly your own
|
|
||||||
construction. [14]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. A Minimal Operating Checklist
|
|
||||||
|
|
||||||
When a text message lands and you feel a reaction:
|
|
||||||
|
|
||||||
1. **Pause.** Don't draft a response yet.
|
|
||||||
2. **Re-read.** Slowly. Once more.
|
|
||||||
3. **Name the gap.** What is literally written vs. what am I adding?
|
|
||||||
4. **Run charity.** What would I assume if a trusted friend wrote this?
|
|
||||||
5. **If still unclear: ask one labeled question.** ("Reading this as X — is that
|
|
||||||
right?")
|
|
||||||
6. **If two rounds don't resolve it: change channels.** Voice or video.
|
|
||||||
|
|
||||||
This checklist captures roughly 90% of what the cited training programs teach.
|
|
||||||
The remaining 10% is domain-specific (clinical, legal, hostage).
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. What the Evidence Does _Not_ Support
|
|
||||||
|
|
||||||
Worth flagging because these are commonly repeated but weak or unsupported:
|
|
||||||
|
|
||||||
- **"55% of communication is body language" (Mehrabian).** Frequently cited to
|
|
||||||
claim text is hopeless. Mehrabian's 1967 studies were about _incongruent_
|
|
||||||
single-word emotional cues and do not generalize. Mehrabian himself has
|
|
||||||
repeatedly disavowed the broad interpretation. [15]
|
|
||||||
- **Emoji/punctuation as a reliable tone fix.** They help disambiguate, but
|
|
||||||
studies (e.g., Riordan, 2017) find effects are modest and culture/age
|
|
||||||
dependent; they do not close the sender-receiver gap from §1. [16]
|
|
||||||
- **Personality-typing the sender (MBTI, DISC, etc.) to predict tone.**
|
|
||||||
Predictive validity for individual messages is essentially zero.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Sources
|
|
||||||
|
|
||||||
1. Kruger, J., Epley, N., Parker, J., & Ng, Z. (2005). _Egocentrism over e-mail:
|
|
||||||
Can we communicate as well as we think?_ Journal of Personality and Social
|
|
||||||
Psychology, 89(6), 925–936.
|
|
||||||
2. Byron, K. (2008). _Carrying too heavy a load? The communication and
|
|
||||||
miscommunication of emotion by email._ Academy of Management Review, 33(2),
|
|
||||||
309–327.
|
|
||||||
3. Dodge, K. A. (1980). _Social cognition and children's aggressive behavior._
|
|
||||||
Child Development, 51(1), 162–170. (And the substantial
|
|
||||||
hostile-attribution-bias literature that followed.)
|
|
||||||
4. Aderka, I. M., et al. (2016). _RU mad @ me? Social anxiety and interpretation
|
|
||||||
of ambiguous text messages._ Computers in Human Behavior, 58, 362–368.
|
|
||||||
(Validates a CMC-specific interpretation-bias measure; n=215 + n=353.)
|
|
||||||
5. Patterson, K., Grenny, J., McMillan, R., & Switzler, A. (2002). _Crucial
|
|
||||||
Conversations: Tools for Talking When Stakes Are High._ McGraw-Hill.
|
|
||||||
6. Burns, D. D. (1980/1999). _Feeling Good: The New Mood Therapy._ (Lay summary
|
|
||||||
of Beck's cognitive distortions.)
|
|
||||||
7. Galinsky, A. D., & Moskowitz, G. B. (2000). _Perspective-taking: Decreasing
|
|
||||||
stereotype expression, stereotype accessibility, and in-group favoritism._
|
|
||||||
Journal of Personality and Social Psychology, 78(4), 708–724.
|
|
||||||
8. Dennett, D. C. (2013). _Intuition Pumps and Other Tools for Thinking_, Ch. on
|
|
||||||
"Rapoport's Rules." W. W. Norton.
|
|
||||||
9. Rosenberg, M. B. (2003). _Nonviolent Communication: A Language of Life_ (2nd
|
|
||||||
ed.). PuddleDancer Press.
|
|
||||||
10. Vecchi, G. M., Van Hasselt, V. B., & Romano, S. J. (2005). _Crisis (hostage)
|
|
||||||
negotiation: Current strategies and issues in high-risk conflict
|
|
||||||
resolution._ Aggression and Violent Behavior, 10(5), 533–551.
|
|
||||||
11. Voss, C., & Raz, T. (2016). _Never Split the Difference._ HarperBusiness.
|
|
||||||
(Popular translation of FBI negotiator practice; useful for the "labeling"
|
|
||||||
and "mirroring" tactics.)
|
|
||||||
12. Lieberman, M. D., Eisenberger, N. I., Crockett, M. J., Tom, S. M., Pfeifer,
|
|
||||||
J. H., & Way, B. M. (2007). _Putting feelings into words: Affect labeling
|
|
||||||
disrupts amygdala activity in response to affective stimuli._ Psychological
|
|
||||||
Science, 18(5), 421–428.
|
|
||||||
13. Daft, R. L., & Lengel, R. H. (1986). _Organizational information
|
|
||||||
requirements, media richness and structural design._ Management Science,
|
|
||||||
32(5), 554–571.
|
|
||||||
14. Walther, J. B. (1996). _Computer-mediated communication: Impersonal,
|
|
||||||
interpersonal, and hyperpersonal interaction._ Communication Research,
|
|
||||||
23(1), 3–43.
|
|
||||||
15. Mehrabian, A. (1971). _Silent Messages._ Wadsworth. (See Mehrabian's own
|
|
||||||
subsequent clarifications disclaiming the "55/38/7" generalization.)
|
|
||||||
16. Riordan, M. A. (2017). _Emojis as tools for emotion work: Communicating
|
|
||||||
affect in text messages._ Journal of Language and Social Psychology, 36(5),
|
|
||||||
549–567.
|
|
||||||
@ -1,148 +0,0 @@
|
|||||||
# Investigation: Text-Intent Interpretation (Human + LLM)
|
|
||||||
|
|
||||||
**Status:** investigating
|
|
||||||
**Orientation:** understand (mixed with mid-investigation methodology
|
|
||||||
correction)
|
|
||||||
**Created:** 2026-05-16
|
|
||||||
**Last Updated:** 2026-05-16
|
|
||||||
|
|
||||||
## Question
|
|
||||||
|
|
||||||
How do humans and LLMs (mis)interpret intent in text-only communication, and
|
|
||||||
what mitigations are supported by the literature? End goal: produce a concrete
|
|
||||||
action plan to counteract LLM intent-interpretation failures in this codebase.
|
|
||||||
|
|
||||||
## What We Know
|
|
||||||
|
|
||||||
- Three docs produced:
|
|
||||||
[text-communication-interpretation.md](../research/text-communication-interpretation.md),
|
|
||||||
[llm-intent-interpretation.md](../research/llm-intent-interpretation.md),
|
|
||||||
[human-llm-interpretation-overlap.md](../research/human-llm-interpretation-overlap.md).
|
|
||||||
- Methodology critique recorded in
|
|
||||||
[/memories/session/research-methodology-retrospective.md](/memories/session/research-methodology-retrospective.md).
|
|
||||||
- Five strongly-cited human↔LLM connections (primacy/recency↔serial position,
|
|
||||||
ELIZA/hyperpersonal, sycophancy↔social desirability via RLHF preference data,
|
|
||||||
perspective-taking↔SimToM, clarifying question↔CLAM).
|
|
||||||
- Bias-inheritance chain is two-stage (pretraining corpus vs. RLHF preference
|
|
||||||
labels) — Mina et al. 2024, Sharma et al. 2024.
|
|
||||||
|
|
||||||
## Hypotheses
|
|
||||||
|
|
||||||
- **[2026-05-16] H1:** Lost-in-the-middle is a clean human-primacy/ recency
|
|
||||||
analog in LLMs.
|
|
||||||
**Falsification:** find a replication where the U-shape doesn't hold or where
|
|
||||||
the mechanism is shown to be different.
|
|
||||||
**Result:** PARTIALLY ELIMINATED — Bilan et al. (arXiv:2508.07479, 2025) shows
|
|
||||||
U-shape only holds up to ~50% of context window; Mak (2025) shows
|
|
||||||
positional-embedding decay produces monotonic drop, not U-shape, in very-long
|
|
||||||
contexts. The analogy is real but narrower than I originally claimed.
|
|
||||||
|
|
||||||
- **[2026-05-16] H2:** RLHF preference labels cause sycophancy.
|
|
||||||
**Falsification:** find evidence that base models (no RLHF) are sycophantic,
|
|
||||||
or that some RLHF'd models are not.
|
|
||||||
**Result:** PARTIALLY ELIMINATED — nostalgebraist (LessWrong, 2023) replicated
|
|
||||||
Anthropic's sycophancy eval on OpenAI base models and found they are NOT
|
|
||||||
sycophantic at any size. Sycophancy depends on the specific finetuning data
|
|
||||||
and model family. Should be rephrased as "in some model families, RLHF
|
|
||||||
preference data amplifies a sycophancy signal that may also have pretraining
|
|
||||||
origins."
|
|
||||||
|
|
||||||
- **[2026-05-16] H3:** Role/persona prompting reliably improves LLM intent
|
|
||||||
interpretation.
|
|
||||||
**Falsification:** find published evidence persona prompting fails or is
|
|
||||||
irrelevant.
|
|
||||||
**Result:** ELIMINATED — three convergent 2025 papers (Persona is a
|
|
||||||
Double-Edged Sword IJCNLP 2025; Principled Personas EMNLP 2025;
|
|
||||||
arXiv:2512.05858) show persona prompts are mixed-to-ineffective and highly
|
|
||||||
sensitive to irrelevant details (up to ~30pp drops). This contradicts
|
|
||||||
widespread prompt-engineering folklore.
|
|
||||||
|
|
||||||
- **[2026-05-16] H4:** CoT reliably mitigates poor intent interpretation.
|
|
||||||
**Falsification:** find cases where CoT actively hurts or fails to help.
|
|
||||||
**Result:** PARTIALLY ELIMINATED — arXiv:2409.06173 shows CoT suffers from
|
|
||||||
posterior collapse: larger models anchor harder to reasoning priors under CoT,
|
|
||||||
particularly on subjective tasks (emotion, morality). Adds to the existing
|
|
||||||
inverted-U finding.
|
|
||||||
|
|
||||||
- **[2026-05-16] H5:** Pan et al. (arXiv:2308.03188) establishes that intrinsic
|
|
||||||
self-correction without external ground truth degrades or fails to improve
|
|
||||||
model performance.
|
|
||||||
**Falsification:** paper doesn't exist; conclusion is reversed or domain-
|
|
||||||
restricted in a way that doesn't support a general "no self-critique" claim.
|
|
||||||
**Result:** PARTIALLY CONFIRMED with citation correction — Pan et al.
|
|
||||||
2308.03188 exists and is a _survey_ by Liangming Pan et al. (UCSB, Aug 2023).
|
|
||||||
The _stronger primary_ citation for the "intrinsic self-correction degrades
|
|
||||||
performance" claim is Huang et al. arXiv:2310.01798 ("Large Language Models
|
|
||||||
Cannot Self-Correct Reasoning Yet," Google DeepMind / UIUC, Oct 2023): "LLMs
|
|
||||||
struggle to self-correct their responses without external feedback, and at
|
|
||||||
times, their performance even degrades after self-correction." Both citations
|
|
||||||
should appear; the strong claim should attribute to Huang et al.
|
|
||||||
|
|
||||||
- **[2026-05-16] H6:** Wu, Wu, Zou (ClashEval, 2024) shows adversarial reframing
|
|
||||||
/ lowering model confidence in a prior commitment reduces position- anchored
|
|
||||||
question drift.
|
|
||||||
**Falsification:** paper doesn't exist; paper is about general context-vs-
|
|
||||||
prior conflict and doesn't support the "lower confidence → adherence" claim;
|
|
||||||
effect is small or non-replicable.
|
|
||||||
**Result:** PARTIALLY CONFIRMED with scope caveat — ClashEval (NeurIPS 2024)
|
|
||||||
is real and the token-probability/adherence finding is supported: "the less
|
|
||||||
confident a model is in its initial response (via measuring token
|
|
||||||
probabilities), the more likely it is to adopt the information in the
|
|
||||||
retrieved content." SCOPE: ClashEval tested RAG (retrieved content vs prior
|
|
||||||
knowledge), NOT multi-turn anchoring on the model's own prior commitment. The
|
|
||||||
mechanism (lower confidence → higher context adherence) is plausibly
|
|
||||||
transferable, but the best-practices doc's claim extrapolates beyond the
|
|
||||||
paper's actual experiment.
|
|
||||||
|
|
||||||
- **[2026-05-16] H7:** Jiang et al. (2026) "Think-Anywhere" is a real published
|
|
||||||
paper introducing mid-sequence `<think>` insertion that catches errors a
|
|
||||||
pre-commit plan cannot foresee.
|
|
||||||
**Falsification:** paper does not exist (hallucinated citation); paper exists
|
|
||||||
but does not make the claimed mid-sequence intervention finding.
|
|
||||||
**Result:** CONFIRMED with metadata correction — "Think Anywhere in Code
|
|
||||||
Generation" (arXiv:2603.29957, Jiang et al., late 2025 / early 2026,
|
|
||||||
github.com/jiangxxxue/Think-Anywhere). Mechanism: special `<thinkanywhere>`
|
|
||||||
tokens via SFT + RL; key finding "LLMs tend to invoke thinking at positions
|
|
||||||
with higher entropy." The best-practices doc's "catches mid-implementation
|
|
||||||
off-by-one errors" framing is a mild over-specification of "on-demand
|
|
||||||
reasoning at high-entropy positions" but directionally accurate.
|
|
||||||
|
|
||||||
## Investigation Log
|
|
||||||
|
|
||||||
### 2026-05-16 — Initial three-doc production
|
|
||||||
|
|
||||||
- Orientation: understand
|
|
||||||
- What was examined: human-text-interpretation literature (Kruger, Byron,
|
|
||||||
Aderka, Walther, Lieberman), LLM prompting literature (Anthropic 4.7 docs, Liu
|
|
||||||
et al., Sharma et al., Wilf et al., Schulhoff Prompting Science Report 2).
|
|
||||||
- What was found: documented in the three research docs.
|
|
||||||
- What this means: descriptive synthesis available; no decision rules yet.
|
|
||||||
- Next step: methodology audit.
|
|
||||||
|
|
||||||
### 2026-05-16 — Methodology audit and adversarial second pass
|
|
||||||
|
|
||||||
- Orientation: diagnose
|
|
||||||
- What was examined: my own search behavior; ran the adversarial searches I
|
|
||||||
should have run originally.
|
|
||||||
- What was found: positive-bias in original search framing missed important
|
|
||||||
disconfirmations (H2, H3) and required qualifications (H1, H4); also missed
|
|
||||||
the foundational Schulhoff "Prompt Report" survey.
|
|
||||||
- What this means: prescriptive synthesis needs five concrete edits before it
|
|
||||||
can drive an action plan.
|
|
||||||
- Next step: apply edits, then review ai-coding-best-practices.md with the same
|
|
||||||
skepticism.
|
|
||||||
|
|
||||||
## Timing Notes
|
|
||||||
|
|
||||||
- Each Exa search: ~5–15s including read of first 40 lines of dump.
|
|
||||||
- Free-tier rate limit means searches must be sequential.
|
|
||||||
|
|
||||||
## Open Questions
|
|
||||||
|
|
||||||
- Are the (still uncited) parallels in §4 of the synthesis worth another
|
|
||||||
adversarial search pass, or accept as flagged "use with care"?
|
|
||||||
- Does `docs/research/ai-coding-best-practices.md` contain claims about persona
|
|
||||||
prompting or CoT that now need correction?
|
|
||||||
- What is the right format for the final action plan — checklist,
|
|
||||||
copilot-instructions edit, AGENTS.md addition, or a new
|
|
||||||
`.agents/instructions/` file?
|
|
||||||
@ -1,46 +0,0 @@
|
|||||||
{
|
|
||||||
"hooks": {
|
|
||||||
"UserPromptSubmit": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/user-prompt-submit.sh",
|
|
||||||
"timeout": 5
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"SessionStart": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/session-start.sh",
|
|
||||||
"timeout": 10
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"PreToolUse": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/pre-tool-use.sh",
|
|
||||||
"timeout": 5
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"PostToolUse": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/post-tool-use.sh",
|
|
||||||
"timeout": 5
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"PreCompact": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/pre-compact.sh",
|
|
||||||
"timeout": 10
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"Stop": [
|
|
||||||
{
|
|
||||||
"type": "command",
|
|
||||||
"command": ".agents/hooks/stop.sh",
|
|
||||||
"timeout": 5
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@ -1 +0,0 @@
|
|||||||
Verify plugin TypeScript code changes with `npm t`.
|
|
||||||
@ -1,172 +0,0 @@
|
|||||||
{
|
|
||||||
"$schema": "https://opencode.ai/config.json",
|
|
||||||
"default_agent": "orchestrator",
|
|
||||||
"compaction": {
|
|
||||||
"reserved": 3000
|
|
||||||
},
|
|
||||||
"agent": {
|
|
||||||
"orchestrator": {
|
|
||||||
"mode": "all",
|
|
||||||
"model": "llama-server/Qwopus3.6-27B-v2-MTP-Q4_K_M",
|
|
||||||
"permission": {
|
|
||||||
"edit": "deny",
|
|
||||||
"bash": {
|
|
||||||
"*": "deny",
|
|
||||||
"* /tmp/.last-user-prompt.txt": "allow",
|
|
||||||
"* /tmp/.last-user-prompt.txt << *": "allow"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"mode": "subagent",
|
|
||||||
"permission": {
|
|
||||||
"webfetch": "deny",
|
|
||||||
"websearch": "deny",
|
|
||||||
"question": "deny",
|
|
||||||
"todowrite": "deny",
|
|
||||||
"skill": "deny"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"research": {
|
|
||||||
"mode": "all",
|
|
||||||
"permission": {
|
|
||||||
"*": "allow"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"permission": {
|
|
||||||
"external_directory": {
|
|
||||||
"/tmp/**": "allow",
|
|
||||||
"~/dotfiles/**": "allow",
|
|
||||||
"~/.config/opencode/**": "allow",
|
|
||||||
"~/.local/share/opencode/log/**": "allow",
|
|
||||||
"~/.copilot/**": "allow",
|
|
||||||
"~/code/**": "allow"
|
|
||||||
},
|
|
||||||
"websearch": "allow"
|
|
||||||
},
|
|
||||||
"share": "disabled",
|
|
||||||
"lsp": true,
|
|
||||||
"provider": {
|
|
||||||
"llama-server": {
|
|
||||||
"npm": "@ai-sdk/openai-compatible",
|
|
||||||
"name": "llama-server",
|
|
||||||
"options": {
|
|
||||||
"baseURL": "http://127.0.0.1:8080/v1"
|
|
||||||
},
|
|
||||||
"models": {
|
|
||||||
"OmniCoder-2-9B.Q8_0": {
|
|
||||||
"name": "OmniCoder 2 9B Q8 (llama-server)",
|
|
||||||
"tools": true,
|
|
||||||
"agent": {
|
|
||||||
"plan": {
|
|
||||||
"temperature": 0.1
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"temperature": 0.3
|
|
||||||
},
|
|
||||||
"brainstorm": {
|
|
||||||
"temperature": 0.7
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"limit": {
|
|
||||||
"context": 32768,
|
|
||||||
"output": 4096
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"Qwopus3.5-9B-Coder-MTP-Q8_0": {
|
|
||||||
"name": "Qwopus3.5 9B Coder MTP Q8 (llama-server)",
|
|
||||||
"tools": true,
|
|
||||||
"agent": {
|
|
||||||
"plan": {
|
|
||||||
"temperature": 0.1
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"temperature": 0.3
|
|
||||||
},
|
|
||||||
"brainstorm": {
|
|
||||||
"temperature": 0.7
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"limit": {
|
|
||||||
"context": 32768,
|
|
||||||
"output": 4096
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"Qwopus3.6-27B-v2-MTP-Q4_K_M": {
|
|
||||||
"name": "Qwopus3.6 27B MTP Q4 (llama-server)",
|
|
||||||
"tools": true,
|
|
||||||
"agent": {
|
|
||||||
"plan": {
|
|
||||||
"temperature": 0.1
|
|
||||||
},
|
|
||||||
"orchestrator": {
|
|
||||||
"temperature": 0.2
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"temperature": 0.3
|
|
||||||
},
|
|
||||||
"brainstorm": {
|
|
||||||
"temperature": 0.7
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"limit": {
|
|
||||||
"context": 32768,
|
|
||||||
"output": 4096
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M": {
|
|
||||||
"name": "Qwopus3.6 35B A3B MTP Q4 (llama-server)",
|
|
||||||
"tools": true,
|
|
||||||
"agent": {
|
|
||||||
"plan": {
|
|
||||||
"temperature": 0.1
|
|
||||||
},
|
|
||||||
"orchestrator": {
|
|
||||||
"temperature": 0.2
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"temperature": 0.3
|
|
||||||
},
|
|
||||||
"brainstorm": {
|
|
||||||
"temperature": 0.7
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"limit": {
|
|
||||||
"context": 32768,
|
|
||||||
"output": 4096
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"agentica-org_DeepCoder-14B-Preview-Q5_K_M": {
|
|
||||||
"name": "DeepCoder 14B Q5 (llama-server)",
|
|
||||||
"tools": true,
|
|
||||||
"agent": {
|
|
||||||
"plan": {
|
|
||||||
"temperature": 0.1
|
|
||||||
},
|
|
||||||
"build": {
|
|
||||||
"temperature": 0.3
|
|
||||||
},
|
|
||||||
"brainstorm": {
|
|
||||||
"temperature": 0.7
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"limit": {
|
|
||||||
"context": 32768,
|
|
||||||
"output": 4096
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"mcp": {
|
|
||||||
"all-agents": {
|
|
||||||
"type": "local",
|
|
||||||
"command": [
|
|
||||||
"node",
|
|
||||||
"--experimental-strip-types",
|
|
||||||
"/home/dev/dotfiles/.agents/mcp/index.ts"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@ -1,331 +0,0 @@
|
|||||||
import type { Plugin, Hooks } from '@opencode-ai/plugin';
|
|
||||||
import type { TextPart, Model } from '@opencode-ai/sdk';
|
|
||||||
import { resolve, dirname } from 'node:path';
|
|
||||||
import { fileURLToPath } from 'node:url';
|
|
||||||
|
|
||||||
/**
|
|
||||||
* Agent support plugin for Remnant.
|
|
||||||
*
|
|
||||||
* Responsibilities:
|
|
||||||
* 1. chat.message (first turn) — session-start.sh (once per session)
|
|
||||||
* 2. chat.message — user-prompt-submit.sh (each turn)
|
|
||||||
* 3. tool.execute.before — pre-tool-use.sh (project policy)
|
|
||||||
* 4. tool.execute.after — post-tool-use.sh + context pressure warning
|
|
||||||
* 5. experimental.session.compacting — pre-compact.sh
|
|
||||||
*
|
|
||||||
* Note: stop.sh has no equivalent OpenCode plugin event; it only fires in Copilot.
|
|
||||||
*/
|
|
||||||
|
|
||||||
export const GlobalPlugin: Plugin = async ({ $, client }) => {
|
|
||||||
// Resolve hooks relative to this plugin file's real path (resolves symlinks).
|
|
||||||
// This makes the plugin work both as a project-local plugin and as a global
|
|
||||||
// plugin installed via install.sh — in either case, hooks live in ../../hooks/
|
|
||||||
// relative to this file in the .agents/frameworks/opencode/ directory.
|
|
||||||
const hooksDir = resolve(dirname(fileURLToPath(import.meta.url)), '../../hooks');
|
|
||||||
|
|
||||||
// Running cumulative context size estimate (characters)
|
|
||||||
let contextCharsUsed = 0;
|
|
||||||
|
|
||||||
// Track sessions that have had session-start injected (fires once per session)
|
|
||||||
const initializedSessions = new Set<string>();
|
|
||||||
|
|
||||||
const agentBySession = new Map<string, { agent: string; model: Model; }>();
|
|
||||||
|
|
||||||
const hooks: Hooks = {
|
|
||||||
'chat.params': async (input, output) => {
|
|
||||||
logInfoData('chat.params', { input, output });
|
|
||||||
agentBySession.set(input.sessionID, { agent: input.agent, model: input.model });
|
|
||||||
},
|
|
||||||
|
|
||||||
// ── 1 & 2. Session start + user prompt ──────────────────────────────────
|
|
||||||
// Session-start was previously injected via experimental.chat.system.transform
|
|
||||||
// (pushing to output.system). That caused a Jinja "System message must be at
|
|
||||||
// the beginning" error on Qwen-family local models when the orchestrator spawns
|
|
||||||
// a subagent via `task`: system.transform fires after the task prompt (a user
|
|
||||||
// message) is already in the conversation, so the system push lands at a
|
|
||||||
// non-zero position. Injecting as a synthetic text part on the first
|
|
||||||
// chat.message turn avoids the position constraint entirely.
|
|
||||||
'chat.message': async (input, output) => {
|
|
||||||
logInfoData('chat.message', { input, output });
|
|
||||||
|
|
||||||
// Session-start injection — runs exactly once per session, prepended so it
|
|
||||||
// reads before the user-prompt-submit nudges on the first turn.
|
|
||||||
if (!initializedSessions.has(input.sessionID)) {
|
|
||||||
initializedSessions.add(input.sessionID);
|
|
||||||
const startOutput = await runHookScript('session-start.sh');
|
|
||||||
const startContext = parseAdditionalContext(startOutput);
|
|
||||||
if (startContext) {
|
|
||||||
output.parts.unshift({
|
|
||||||
id: `prt_${crypto.randomUUID()}`,
|
|
||||||
sessionID: input.sessionID,
|
|
||||||
messageID: input.messageID ?? crypto.randomUUID(),
|
|
||||||
type: 'text',
|
|
||||||
text: startContext,
|
|
||||||
synthetic: true,
|
|
||||||
});
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
const promptText = output.parts
|
|
||||||
.filter((p): p is TextPart => p.type === 'text')
|
|
||||||
.map((p) => p.text)
|
|
||||||
.join('\n');
|
|
||||||
const hookOutput = await runHookScript(
|
|
||||||
'user-prompt-submit.sh',
|
|
||||||
JSON.stringify({ prompt: promptText }),
|
|
||||||
);
|
|
||||||
const context = parseAdditionalContext(hookOutput);
|
|
||||||
if (context) {
|
|
||||||
output.parts.push({
|
|
||||||
id: `prt_${crypto.randomUUID()}`,
|
|
||||||
sessionID: input.sessionID,
|
|
||||||
messageID: input.messageID ?? crypto.randomUUID(),
|
|
||||||
type: 'text',
|
|
||||||
text: context,
|
|
||||||
synthetic: true,
|
|
||||||
});
|
|
||||||
}
|
|
||||||
},
|
|
||||||
// ── 3. Pre-tool-use ─────────────────────────────────────────────────────
|
|
||||||
'tool.execute.before': async (input, output) => {
|
|
||||||
logInfoData('tool.execute.before', { input, output });
|
|
||||||
|
|
||||||
// ── read guards ───────────────────────────────────────────────────
|
|
||||||
if (input.tool === 'read') {
|
|
||||||
const args = (output.args ?? {}) as {
|
|
||||||
filePath?: string;
|
|
||||||
offset?: number;
|
|
||||||
limit?: number;
|
|
||||||
};
|
|
||||||
const filePath = args.filePath ?? '';
|
|
||||||
|
|
||||||
// package.json read guard:
|
|
||||||
// Reading workspace package.json files auto-loads nested AGENTS.md files
|
|
||||||
// via OpenCode's context injection, burning through the 32K context budget.
|
|
||||||
// Block package.json reads under apps/ and packages/ only.
|
|
||||||
if (/(^|\/)(apps|packages)\/[^/]+\/package\.json$/.test(filePath)) {
|
|
||||||
throw new Error(
|
|
||||||
'BLOCKED: Reading workspace package.json files auto-loads nested AGENTS.md files and exhausts the 32K context. Use `grep_search` to find the specific field you need (e.g. a dependency version or script name) instead of reading the whole file.',
|
|
||||||
);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Pagination guard:
|
|
||||||
// Large sequential reads exhaust the 32K context window quickly.
|
|
||||||
// The OpenCode `read` tool uses `offset` (1-indexed start) and `limit` (max lines).
|
|
||||||
// Unbounded reads (no limit) default to 2000 lines — always blocked.
|
|
||||||
// docs/ files may read up to 500 lines; all other files are capped at 50.
|
|
||||||
// Directory reads (e.g. `Read .`) never carry a limit — skip the guard.
|
|
||||||
let isDirectory = false;
|
|
||||||
try {
|
|
||||||
const { statSync } = await import('node:fs');
|
|
||||||
isDirectory = statSync(filePath).isDirectory();
|
|
||||||
} catch (_error) {
|
|
||||||
// path doesn't exist or inaccessible — treat as file
|
|
||||||
}
|
|
||||||
if (!isDirectory) {
|
|
||||||
const isDocsFile = /(^|\/)docs\//.test(filePath);
|
|
||||||
const readLimit: number | undefined = args.limit;
|
|
||||||
if (readLimit === undefined) {
|
|
||||||
throw new Error(
|
|
||||||
isDocsFile
|
|
||||||
? `BLOCKED: Unbounded read (no limit) is prohibited. Specify offset and limit to read in ≤500-line chunks for docs/ files.`
|
|
||||||
: `BLOCKED: Unbounded read (no limit) is prohibited. Use grep_search first to find the relevant section, then read with offset and limit in ≤50-line chunks.`,
|
|
||||||
);
|
|
||||||
}
|
|
||||||
const lineLimit = isDocsFile ? 500 : 50;
|
|
||||||
if (readLimit > lineLimit) {
|
|
||||||
throw new Error(
|
|
||||||
isDocsFile
|
|
||||||
? `BLOCKED: Read more than 500 lines at once is prohibited for docs/ files. Use offset and limit to paginate in ≤500-line chunks.`
|
|
||||||
: `BLOCKED: Read more than 50 lines at once is prohibited. Use offset and limit to paginate in ≤50-line chunks. For docs/ files the limit is 500 lines. Use grep_search first to find the right offset.`,
|
|
||||||
);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// ── Task prompt size guard ─────────────────────────────────────────────
|
|
||||||
// The `task` tool has a JSON serialization limit. Embedding file contents
|
|
||||||
// or long inventories inline in a task prompt causes "Unterminated string"
|
|
||||||
// parse errors. Cap task prompts at 1200 chars — workers should be told
|
|
||||||
// WHICH files to read, not given the contents inline.
|
|
||||||
if (input.tool === 'task') {
|
|
||||||
const args = (output.args ?? {}) as { prompt?: string };
|
|
||||||
const prompt = args.prompt ?? '';
|
|
||||||
if (prompt.length > 1200) {
|
|
||||||
throw new Error(
|
|
||||||
`BLOCKED (task prompt too long: ${prompt.length} chars, max 1200): Task prompts must not embed file contents, dependency lists, or long context inline — this causes JSON parse failures. Instead, tell the worker WHICH files to read and WHAT to do. Example: "Read the root package.json and all workspace package.json files, then update the Technology Stack section in README.md to match."`,
|
|
||||||
);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Shell out to pre-tool-use hook (project policy enforcement).
|
|
||||||
// Policies 1–12: command/file guards. Policy 13: read_file range limit
|
|
||||||
// (≤50 lines for source files, ≤500 for docs/). Deny = throws Error.
|
|
||||||
const hookInput = JSON.stringify({
|
|
||||||
tool_name: input.tool,
|
|
||||||
tool_input: output.args ?? {},
|
|
||||||
});
|
|
||||||
const hookResult = await runHookScript('pre-tool-use.sh', hookInput);
|
|
||||||
|
|
||||||
// If the hook emitted a deny decision, surface it as an error
|
|
||||||
if (hookResult.includes('"permissionDecision": "deny"')) {
|
|
||||||
const match = hookResult.match(/"permissionDecisionReason":\s*"([^"]+)"/);
|
|
||||||
const reason = match?.[1] ?? 'Blocked by project policy (pre-tool-use hook).';
|
|
||||||
throw new Error(reason);
|
|
||||||
}
|
|
||||||
},
|
|
||||||
|
|
||||||
// ── 4. Post-tool-use ────────────────────────────────────────────────────
|
|
||||||
'tool.execute.after': async (input, output) => {
|
|
||||||
logInfoData('tool.execute.after', { input, output });
|
|
||||||
|
|
||||||
// MCP tools populate content differently — output.output may be undefined.
|
|
||||||
// Skip truncation/pressure/hook logic for those; the MCP content flows
|
|
||||||
// through OpenCode's internal parts pipeline instead.
|
|
||||||
const text = output.output;
|
|
||||||
if (!text) {
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Approximate token estimate: 4 chars ≈ 1 token (conservative for code).
|
|
||||||
const CHARS_PER_TOKEN = 4;
|
|
||||||
const CONTEXT_LIMIT_TOKENS = 32768;
|
|
||||||
const PRESSURE_THRESHOLD = 0.7; // 70%
|
|
||||||
|
|
||||||
// build agent (local profile) truncates at 1500 tokens to respect OmniCoder's 32K context window.
|
|
||||||
// orchestrator gets a higher limit (2500) since it only reads, not edits.
|
|
||||||
// All other agents receive full tool responses.
|
|
||||||
const LOCAL_WORKER_MAX_TOKENS = 1500;
|
|
||||||
const LOCAL_ORCHESTRATOR_MAX_TOKENS = 2500;
|
|
||||||
|
|
||||||
function truncate(t: string, maxTokens: number): { text: string; truncated: boolean } {
|
|
||||||
const maxChars = maxTokens * CHARS_PER_TOKEN;
|
|
||||||
if (t.length <= maxChars) return { text: t, truncated: false };
|
|
||||||
return {
|
|
||||||
text:
|
|
||||||
t.slice(0, maxChars) +
|
|
||||||
`\n\n[Response truncated at ~${maxTokens} tokens. Use a more targeted query to retrieve the relevant section.]`,
|
|
||||||
truncated: true,
|
|
||||||
};
|
|
||||||
}
|
|
||||||
|
|
||||||
// a) Response truncation — local agents (build/orchestrator) and any llama-server/ model;
|
|
||||||
// orchestrator gets a higher limit since it only reads, not edits.
|
|
||||||
const { agent, model } = agentBySession.get(input.sessionID) ?? {};
|
|
||||||
const isLocalAgent = agent === 'build' || agent === 'orchestrator' || model?.providerID === 'llama-server';
|
|
||||||
if (isLocalAgent) {
|
|
||||||
const maxTokens = agent === 'orchestrator' ? LOCAL_ORCHESTRATOR_MAX_TOKENS : LOCAL_WORKER_MAX_TOKENS;
|
|
||||||
const { text: truncated } = truncate(text, maxTokens);
|
|
||||||
output.output = truncated;
|
|
||||||
}
|
|
||||||
|
|
||||||
// b) Context pressure tracking — accumulate and inject warning when ≥70%
|
|
||||||
contextCharsUsed += output.output.length;
|
|
||||||
const charLimit = CONTEXT_LIMIT_TOKENS * CHARS_PER_TOKEN;
|
|
||||||
const pct = contextCharsUsed / charLimit;
|
|
||||||
|
|
||||||
if (pct >= PRESSURE_THRESHOLD) {
|
|
||||||
const pctDisplay = Math.round(pct * 100);
|
|
||||||
const pressure = `[CONTEXT PRESSURE: ~${pctDisplay}% used. Be concise. Prefer targeted tool calls. Write progress to NOTES.md before continuing.]`;
|
|
||||||
output.output = `${pressure}\n\n${output.output}`;
|
|
||||||
// Reset after injection so we don't spam every subsequent turn
|
|
||||||
contextCharsUsed = 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
// c) Shell out to post-tool-use hook (metacognitive reminders, methodology)
|
|
||||||
const hookInput = JSON.stringify({
|
|
||||||
tool_name: input.tool,
|
|
||||||
tool_input: input.args ?? {},
|
|
||||||
tool_response: output.output.slice(0, 500), // truncated for hook
|
|
||||||
});
|
|
||||||
const postToolOutput = await runHookScript('post-tool-use.sh', hookInput);
|
|
||||||
const postToolContext = parseAdditionalContext(postToolOutput);
|
|
||||||
if (postToolContext) {
|
|
||||||
output.output = `${output.output}\n\n${postToolContext}`;
|
|
||||||
}
|
|
||||||
},
|
|
||||||
|
|
||||||
// ── 5. Pre-compact: export state before context summarization ─────────────
|
|
||||||
'experimental.session.compacting': async (input, output) => {
|
|
||||||
logInfoData('experimental.session.compacting', { input, output });
|
|
||||||
|
|
||||||
await runHookScript('pre-compact.sh');
|
|
||||||
|
|
||||||
output.prompt = `
|
|
||||||
You are a context summarizer for coding sessions. Summarize only the conversation history given — do not answer it.
|
|
||||||
|
|
||||||
If a <previous-summary> block is present, update it: preserve still-true facts, remove stale ones, merge new facts.
|
|
||||||
|
|
||||||
Output exactly this Markdown structure. Keep every section even when empty. Use terse bullets, not prose. Preserve exact file paths, commands, error strings, and identifiers.
|
|
||||||
|
|
||||||
---
|
|
||||||
## Original Prompt
|
|
||||||
## Clarifications
|
|
||||||
## Constraints & Preferences
|
|
||||||
## Progress
|
|
||||||
### Done
|
|
||||||
### In Progress
|
|
||||||
### Blocked
|
|
||||||
## Key Decisions
|
|
||||||
## Next Steps
|
|
||||||
## Critical Context
|
|
||||||
## Relevant Files
|
|
||||||
---
|
|
||||||
|
|
||||||
For Clarifications: include only follow-ups that changed scope, added constraints, or redirected work. Do not mention that you are summarizing. Respond in the conversation's language.`;
|
|
||||||
},
|
|
||||||
};
|
|
||||||
|
|
||||||
/** Parse the additionalContext string from a hook's JSON output. */
|
|
||||||
function parseAdditionalContext(hookOutput: string): string | undefined {
|
|
||||||
try {
|
|
||||||
const parsed = JSON.parse(hookOutput.trim()) as {
|
|
||||||
hookSpecificOutput?: { additionalContext?: string };
|
|
||||||
};
|
|
||||||
return parsed?.hookSpecificOutput?.additionalContext ?? undefined;
|
|
||||||
} catch (_error) {
|
|
||||||
return undefined;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
async function runHookScript(scriptName: string, stdinJson?: string): Promise<string> {
|
|
||||||
const script = `${hooksDir}/${scriptName}`;
|
|
||||||
try {
|
|
||||||
const proc = stdinJson
|
|
||||||
? await $`bash ${script} < ${Buffer.from(stdinJson)}`.text()
|
|
||||||
: await $`bash ${script}`.text();
|
|
||||||
return proc;
|
|
||||||
} catch (_error) {
|
|
||||||
await client.app.log({
|
|
||||||
body: {
|
|
||||||
service: 'global-plugin',
|
|
||||||
level: 'error',
|
|
||||||
message: `(Global Plugin) Error in hook script ${script}`,
|
|
||||||
extra: {
|
|
||||||
ts: new Date().toISOString(),
|
|
||||||
script,
|
|
||||||
error: String(_error),
|
|
||||||
},
|
|
||||||
},
|
|
||||||
});
|
|
||||||
// Hooks are advisory — never block on hook failure
|
|
||||||
return '';
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
async function logInfoData(message: string, obj?: Record<string, unknown>) {
|
|
||||||
await client.app.log({
|
|
||||||
body: {
|
|
||||||
service: 'global-plugin',
|
|
||||||
level: 'info',
|
|
||||||
message: `(Global Plugin) ${message}`,
|
|
||||||
extra: {
|
|
||||||
ts: new Date().toISOString(),
|
|
||||||
...(obj ?? {}),
|
|
||||||
},
|
|
||||||
},
|
|
||||||
});
|
|
||||||
}
|
|
||||||
|
|
||||||
return hooks;
|
|
||||||
};
|
|
||||||
@ -1,172 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# PostToolUse hook: inject methodology reminders after relevant tool actions.
|
|
||||||
# - Periodic self-check (weighted every ~15 effective write-calls)
|
|
||||||
# - After test failures: remind about hypothesis-first methodology
|
|
||||||
# - After reading docs/ or .md/.txt files: lift pagination restriction reminder
|
|
||||||
# - After editing docs/ or .md/.txt files: audit file size (warn if >500 lines)
|
|
||||||
# - After editing agent config files: verify with opencode agent list
|
|
||||||
# Project-specific reminders (e.g. BFF pattern, build gates): add a sibling
|
|
||||||
# hook file in the project's .agents/hooks/ directory.
|
|
||||||
# Priority filter: emit at most 2 reminders per tool call.
|
|
||||||
# Priority order: SELF-CHECK > DEBUGGING > path-scoped > tool-specific.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
# ── Tool call counter ────────────────────────────────────────────────────────
|
|
||||||
REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo ".")"
|
|
||||||
REPO_ID=$(printf '%s' "$REPO_ROOT" | md5sum | cut -c1-8 2>/dev/null || echo "default")
|
|
||||||
COUNT_FILE="/tmp/.opencode-tool-count-${REPO_ID}"
|
|
||||||
COUNT=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
|
|
||||||
|
|
||||||
# Read hook input from stdin
|
|
||||||
INPUT=$(cat)
|
|
||||||
|
|
||||||
TOOL_NAME=$(echo "$INPUT" | grep -o '"tool_name"\s*:\s*"[^"]*"' | head -1 | sed 's/.*"\([^"]*\)"/\1/' || true)
|
|
||||||
|
|
||||||
# Weighted increment: reads +1, writes/shell +4 (equivalent to +0.25/+1 at threshold 60).
|
|
||||||
# This prevents SELF-CHECK from firing mid-investigation sweep.
|
|
||||||
case "$TOOL_NAME" in
|
|
||||||
read_file|grep_search|list_dir|file_search|semantic_search|explore_subagent)
|
|
||||||
COUNT=$((COUNT + 1))
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
COUNT=$((COUNT + 4))
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
echo "$COUNT" > "$COUNT_FILE"
|
|
||||||
|
|
||||||
# Priority-ordered reminders array: append in priority order, emit max 2.
|
|
||||||
# Priority: SELF-CHECK(1) > DEBUGGING(2) > path-scoped(3) > tool-specific(4)
|
|
||||||
reminders=()
|
|
||||||
|
|
||||||
# ── Periodic self-check (every 60 weighted units ≡ 15 effective write-calls) ─
|
|
||||||
if (( COUNT % 60 == 0 )); then
|
|
||||||
selfcheck="SELF-CHECK (${COUNT} tool calls): Step back and assess."
|
|
||||||
selfcheck="${selfcheck} (1) What is your current goal — are you still on track?"
|
|
||||||
selfcheck="${selfcheck} (2) Are you making progress or spinning on the same issue?"
|
|
||||||
selfcheck="${selfcheck} (3) If you've hit 2+ failures on the same problem, switch to @research or report to the user."
|
|
||||||
selfcheck="${selfcheck} (4) If you've been editing the same file 3+ times without a passing test, stop and rethink."
|
|
||||||
selfcheck="${selfcheck} (5) Is the chat todo list accurate? Update it if items are stale or missing."
|
|
||||||
selfcheck="${selfcheck} (6) If investigating, re-read your investigation file and dead-ends to avoid re-testing eliminated hypotheses."
|
|
||||||
reminders+=("$selfcheck")
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── After test/terminal runs that failed: remind about methodology ───────────
|
|
||||||
if [[ "$TOOL_NAME" == "run_in_terminal" || "$TOOL_NAME" == "runTests" ]]; then
|
|
||||||
TOOL_RESPONSE=$(echo "$INPUT" | grep -o '"tool_response"\s*:\s*"[^"]*"' | head -1 || true)
|
|
||||||
if echo "$TOOL_RESPONSE" | grep -qiE 'FAIL|error|panic|segfault|assertion|abort|ERR!'; then
|
|
||||||
debug_msg="DEBUGGING REMINDER: Before your next action —"
|
|
||||||
debug_msg="${debug_msg} (1) Write your hypothesis in one sentence."
|
|
||||||
debug_msg="${debug_msg} (2) Write what you'd expect if WRONG."
|
|
||||||
debug_msg="${debug_msg} (3) Check the dead-ends file (.session/dead-ends.md) if it exists."
|
|
||||||
debug_msg="${debug_msg} (4) Falsify BEFORE confirming."
|
|
||||||
debug_msg="${debug_msg} (5) If 5+ attempts without progress, STOP and report what you've learned."
|
|
||||||
reminders+=("$debug_msg")
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── After editing project/agent config files: path-scoped reminders ─────────
|
|
||||||
# Project-specific path checks (e.g. build gates, BFF reminders) belong in a
|
|
||||||
# sibling project-local hook file, not here. Only general checks below.
|
|
||||||
case "$TOOL_NAME" in
|
|
||||||
replace_string_in_file|multi_replace_string_in_file|create_file)
|
|
||||||
FILE_PATH=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
const p = i.filePath || (i.replacements && i.replacements[0] && i.replacements[0].filePath) || '';
|
|
||||||
process.stdout.write(p);
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
path_msg=""
|
|
||||||
# ── After editing agent config files: verify with opencode agent list ─────
|
|
||||||
if echo "$FILE_PATH" | grep -qE '\.agents/agents/|\.opencode/agents/|opencode\.json'; then
|
|
||||||
AGENT_LIST=$(cd "$REPO_ROOT" && opencode agent list 2>&1 | grep -E '^\S.*\((all|primary|subagent)\)' | sed 's/^/ /' || echo " (opencode agent list failed)")
|
|
||||||
agent_note="AGENT CONFIG VERIFICATION: You just edited an agent definition or opencode.json."
|
|
||||||
agent_note="${agent_note} Registered agents are: ${AGENT_LIST}."
|
|
||||||
agent_note="${agent_note} If your agent is missing: (1) check that .opencode/agents/<name>.md symlink resolves (cat it — should not error); (2) symlink depth must be ../../.agents/agents/<name>.md (two levels, not three); (3) check YAML frontmatter for parse errors."
|
|
||||||
agent_note="${agent_note} Deny rules only appear in \`opencode agent list\` output if the agent file loaded correctly."
|
|
||||||
if [[ -n "$path_msg" ]]; then
|
|
||||||
path_msg="${path_msg} ${agent_note}"
|
|
||||||
else
|
|
||||||
path_msg="$agent_note"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
# ── After editing docs/ or .md/.txt files: audit file size ───────────────
|
|
||||||
if echo "$FILE_PATH" | grep -qE '(^|/)docs/|\.md$|\.txt$'; then
|
|
||||||
if [[ -f "$FILE_PATH" ]]; then
|
|
||||||
LINE_COUNT=$(wc -l < "$FILE_PATH" 2>/dev/null || echo 0)
|
|
||||||
if [[ "$LINE_COUNT" -gt 500 ]]; then
|
|
||||||
docs_audit="DOCS SIZE AUDIT: ${FILE_PATH##*/} is now ${LINE_COUNT} lines. Consider splitting this doc — files over ~500 lines require expensive pagination for local models (17+ reads for an 800-line file). Split into focused sub-docs and link them."
|
|
||||||
if [[ -n "$path_msg" ]]; then
|
|
||||||
path_msg="${path_msg} ${docs_audit}"
|
|
||||||
else
|
|
||||||
path_msg="$docs_audit"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
if [[ -n "$path_msg" ]]; then
|
|
||||||
reminders+=("$path_msg")
|
|
||||||
fi
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
|
|
||||||
# ── After reading docs/ files: remind that pagination limit is lifted ─────────
|
|
||||||
if [[ "$TOOL_NAME" == "read_file" ]]; then
|
|
||||||
READ_PATH=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(i.filePath || '');
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
START_LINE=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(String(i.startLine || 1));
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
if echo "$READ_PATH" | grep -qE '(^|/)docs/|\.md$|\.txt$'; then
|
|
||||||
docs_msg="DOCS READ EXEMPTION: docs/ files and all .md/.txt files are exempt from the 50-line pagination limit."
|
|
||||||
if [[ "$START_LINE" -gt 1 ]]; then
|
|
||||||
docs_msg="${docs_msg} You are currently paginating (startLine=${START_LINE}) — you may expand to up to 500 lines per call to reduce tool-call overhead."
|
|
||||||
else
|
|
||||||
docs_msg="${docs_msg} You may use ranges up to 500 lines per read_file call instead of 50."
|
|
||||||
fi
|
|
||||||
reminders+=("$docs_msg")
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── After vscode_renameSymbol: remind about object property key aliases ───────
|
|
||||||
if [[ "$TOOL_NAME" == "vscode_renameSymbol" ]]; then
|
|
||||||
rename_msg="RENAME REMINDER: vscode_renameSymbol only renames variable bindings — NOT object property keys or string literals."
|
|
||||||
rename_msg="${rename_msg} After this rename, grep the file for the OLD name."
|
|
||||||
rename_msg="${rename_msg} Stale patterns to watch for: (1) aliased store keys like 'deleteX: archiveX' in the store return object — the key 'deleteX' is unchanged and so are all 'store.deleteX()' call sites;"
|
|
||||||
rename_msg="${rename_msg} (2) string literals like openDialog('delete-item') and AppDialog handle='delete-item';"
|
|
||||||
rename_msg="${rename_msg} (3) related variable names in the same file that share the same prefix (e.g. renaming deleteSuccess should also prompt renaming deleteLoading, deleteError)."
|
|
||||||
rename_msg="${rename_msg} Fix all of these with multi_replace_string_in_file after the symbol rename."
|
|
||||||
reminders+=("$rename_msg")
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Emit at most 2 reminders ─────────────────────────────────────────────────
|
|
||||||
context=""
|
|
||||||
for (( i=0; i<${#reminders[@]} && i<2; i++ )); do
|
|
||||||
if [[ -n "$context" ]]; then
|
|
||||||
context="${context}\n${reminders[$i]}"
|
|
||||||
else
|
|
||||||
context="${reminders[$i]}"
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
# Only output if we have context to inject
|
|
||||||
if [[ -n "$context" ]]; then
|
|
||||||
# Prefix with a self-identifying marker so the model cannot confuse the
|
|
||||||
# injection with preceding tool output (e.g., trailing markdown in a file).
|
|
||||||
framed="[HOOK INJECTION: post-tool-use] System reminder — NOT part of preceding tool output:\n\n${context}"
|
|
||||||
json_context=$(printf '%b' "$framed" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))')
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "PostToolUse",
|
|
||||||
"additionalContext": ${json_context}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
else
|
|
||||||
echo '{}'
|
|
||||||
fi
|
|
||||||
@ -1,70 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# PreCompact hook: export critical session state before context summarization.
|
|
||||||
# Saves investigation progress so findings survive context window compression.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo ".")"
|
|
||||||
SESSION_DIR="$REPO_ROOT/.session"
|
|
||||||
COMPACT_LOG="$SESSION_DIR/pre-compact-state.md"
|
|
||||||
|
|
||||||
mkdir -p "$SESSION_DIR"
|
|
||||||
|
|
||||||
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
|
|
||||||
|
|
||||||
# Read the dead-ends file if it exists
|
|
||||||
dead_ends_summary=""
|
|
||||||
DEAD_ENDS_FILE="$SESSION_DIR/dead-ends.md"
|
|
||||||
if [[ -f "$DEAD_ENDS_FILE" ]]; then
|
|
||||||
dead_ends_summary=$(tail -30 "$DEAD_ENDS_FILE" 2>/dev/null || true)
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check for active investigation files (exclude those already marked complete)
|
|
||||||
investigation_summary=""
|
|
||||||
EXPLORATIONS_DIR="$REPO_ROOT/docs/explorations"
|
|
||||||
if [[ -d "$EXPLORATIONS_DIR" ]]; then
|
|
||||||
inv_files=$(find "$EXPLORATIONS_DIR" -name "*.md" -not -empty 2>/dev/null || true)
|
|
||||||
if [[ -n "$inv_files" ]]; then
|
|
||||||
active_files=$(echo "$inv_files" | while read -r f; do
|
|
||||||
if ! grep -qi '^\*\*Status\*\*.*complete\|^Status:.*complete' "$f" 2>/dev/null; then
|
|
||||||
echo "$f"
|
|
||||||
fi
|
|
||||||
done || true)
|
|
||||||
if [[ -n "$active_files" ]]; then
|
|
||||||
investigation_summary=$(echo "$active_files" | xargs -I{} basename {} .md | sed 's/^/- /' || true)
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Write the pre-compact state file
|
|
||||||
cat > "$COMPACT_LOG" << STATEEOF
|
|
||||||
# Pre-Compact State Export
|
|
||||||
Exported at: $TIMESTAMP
|
|
||||||
Trigger: context summarization
|
|
||||||
|
|
||||||
## Active Investigations
|
|
||||||
$investigation_summary
|
|
||||||
|
|
||||||
## Recent Dead Ends (do NOT re-test these)
|
|
||||||
$dead_ends_summary
|
|
||||||
|
|
||||||
## Reminders
|
|
||||||
- Hypothesis + falsification criterion BEFORE any diagnostic test
|
|
||||||
- Record WHY failures failed, not just WHAT was tried
|
|
||||||
- Check AGENTS.md and package-level AGENTS.md for implementation guidance
|
|
||||||
STATEEOF
|
|
||||||
|
|
||||||
# Inject context for the summarized conversation
|
|
||||||
context="[HOOK INJECTION: pre-compact] System reminder — injected before context compaction, not part of any user message or tool output:\n\n"
|
|
||||||
context="${context}CONTEXT PRESERVATION (pre-compact): Critical state exported to .session/pre-compact-state.md."
|
|
||||||
context="${context} After summarization, re-read this file to restore investigation context."
|
|
||||||
context="${context} Key: do NOT re-test eliminated hypotheses from the dead-ends file."
|
|
||||||
context="${context} TODO LIST SYNC: After resuming, update the chat todo list to reflect actual progress."
|
|
||||||
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "PreCompact",
|
|
||||||
"additionalContext": "$(echo "$context" | sed 's/"/\\"/g')"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
@ -1,217 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# PreToolUse hook: enforce project policies before tool execution.
|
|
||||||
#
|
|
||||||
# Policies enforced:
|
|
||||||
# 1. No npx — use npm run scripts only
|
|
||||||
# 2. No node_modules/.bin invocations — use npm run scripts only
|
|
||||||
# 3. No direct node invocations of node_modules packages
|
|
||||||
# 4. No python — use node for scripting
|
|
||||||
# 5. No npm run build while dev server is running (port conflict)
|
|
||||||
# 6. No sed -i / awk rewrites on code files — use replace_string_in_file
|
|
||||||
# 7. No npm install without user confirmation — ask first
|
|
||||||
# 8. No editing *.generated.ts files — edit the generator source instead
|
|
||||||
# 9. No deleting .wireit — fix the underlying build config issue instead
|
|
||||||
# 10. No -- --force with npm run scripts — wireit cache busting masks real problems
|
|
||||||
# 11. No npm run format with specific file args — propagates to all workspaces
|
|
||||||
# 12. No editing eslint.config.js files — ESLint config changes require human review
|
|
||||||
# 13. No read_file with range >50 lines (enforced hard block) — except docs/ files and all .md/.txt files (limit 500)
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
INPUT=$(cat)
|
|
||||||
|
|
||||||
TOOL_NAME=$(echo "$INPUT" | grep -o '"tool_name"\s*:\s*"[^"]*"' | head -1 | sed 's/.*"\([^"]*\)"/\1/' || true)
|
|
||||||
|
|
||||||
# DEBUG: log every hook invocation with full input
|
|
||||||
echo "{\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"hook\":\"pre-tool-use\",\"tool\":\"$TOOL_NAME\",\"raw\":$(echo "$INPUT" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))' 2>/dev/null || echo '""')}" >> /tmp/pre-tool-hook-debug.jsonl
|
|
||||||
|
|
||||||
# Only inspect terminal/execution tools and file-editing tools
|
|
||||||
case "$TOOL_NAME" in
|
|
||||||
bash|run_in_terminal|execution_subagent|send_to_terminal|\
|
|
||||||
replace_string_in_file|multi_replace_string_in_file|create_file|\
|
|
||||||
read_file|read|edit)
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
echo "{\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"hook\":\"pre-tool-use\",\"action\":\"early-exit\",\"tool\":\"$TOOL_NAME\"}" >> /tmp/pre-tool-hook-debug.jsonl
|
|
||||||
exit 0
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
|
|
||||||
# Extract command (terminal tools) or file path (file-editing tools)
|
|
||||||
COMMAND=""
|
|
||||||
FILE_PATH=""
|
|
||||||
case "$TOOL_NAME" in
|
|
||||||
bash|run_in_terminal|execution_subagent|send_to_terminal)
|
|
||||||
COMMAND=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(i.command || i.query || '');
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
;;
|
|
||||||
replace_string_in_file|multi_replace_string_in_file|create_file|edit)
|
|
||||||
FILE_PATH=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
const p = i.filePath || (i.replacements && i.replacements[0] && i.replacements[0].filePath) || '';
|
|
||||||
process.stdout.write(p);
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
;;
|
|
||||||
read_file|read)
|
|
||||||
FILE_PATH=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(i.filePath || '');
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
|
|
||||||
if [[ -z "$COMMAND" && -z "$FILE_PATH" ]]; then
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Helper: emit deny response ───────────────────────────────────────────────
|
|
||||||
deny() {
|
|
||||||
local reason="$1"
|
|
||||||
echo "{\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"hook\":\"pre-tool-use\",\"action\":\"DENY\",\"tool\":\"$TOOL_NAME\",\"reason\":$(echo "$reason" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8").trim()))' 2>/dev/null || echo '"<encode-error>"')}" >> /tmp/pre-tool-hook-debug.jsonl
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "PreToolUse",
|
|
||||||
"permissionDecision": "deny",
|
|
||||||
"permissionDecisionReason": "$reason"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
exit 0
|
|
||||||
}
|
|
||||||
|
|
||||||
# ── Policy 1: No npx ─────────────────────────────────────────────────────────
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)npx\s'; then
|
|
||||||
deny "BLOCKED: Do not use npx directly. Use npm run scripts instead. If no script exists, recommend adding one. See AGENTS.md."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 2: No direct node_modules/.bin invocations ────────────────────────
|
|
||||||
if echo "$COMMAND" | grep -qE 'node_modules/\.bin/|node_modules\\\.bin\\'; then
|
|
||||||
deny "BLOCKED: Do not invoke tools from node_modules/.bin/. Use npm run scripts instead. See AGENTS.md."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 3: No direct node invocations of node_modules packages ────────────
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)node\s+(\./)?node_modules/'; then
|
|
||||||
deny "BLOCKED: Do not invoke node_modules packages directly with node. Use npm run scripts instead. See AGENTS.md."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 4: No python — use node for scripting ─────────────────────────────
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)(python3?|pip3?)\s'; then
|
|
||||||
deny "BLOCKED: Do not use python in this project. Use node for scripting instead."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 5: No npm run build while dev server is running ───────────────────
|
|
||||||
# The dev server (npm run dev) uses tsc --watch and writes to dist/.
|
|
||||||
# npm run build also writes to dist/, causing crashes when both run.
|
|
||||||
# Detect dev server by checking if port 3000 (app) or 3001 (Vite HMR) is bound.
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)npm\s+run\s+build(\s|$|:)'; then
|
|
||||||
if ss -tlnp 2>/dev/null | grep -qE ':300[01]\s'; then
|
|
||||||
deny "BLOCKED: npm run build conflicts with the running dev server (port 3000/3001 in use). Both write to dist/ and will crash. Stop the dev server first, or use npm run lint and npm test for verification instead."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 6: No sed -i or awk in-place editing of code files ────────────────
|
|
||||||
# These tools corrupt structured code — use replace_string_in_file instead.
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)sed\s+[^|>]*-[a-zA-Z]*i'; then
|
|
||||||
deny "BLOCKED: Do not use 'sed -i' to edit code files. Use replace_string_in_file for precise, context-aware edits. sed pattern matching frequently corrupts structured code with unintended replacements."
|
|
||||||
fi
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)awk\s+.*>\s*[^/dev].*\.(ts|tsx|js|json|md)'; then
|
|
||||||
deny "BLOCKED: Do not use awk to rewrite code files. Use replace_string_in_file for precise edits instead."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 7: No npm install without user confirmation ───────────────────────
|
|
||||||
# Dependencies must be kept minimal. Always ask the user before adding packages.
|
|
||||||
if echo "$COMMAND" | grep -qE '(^|\s|&&|\||\;)npm\s+(install|i)(\s|$)'; then
|
|
||||||
deny "BLOCKED: Do not run npm install without user confirmation. This project keeps dependencies minimal — always ask first. If a package is genuinely needed, propose it and let the user decide."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 9: No deleting .wireit cache to paper over stale-cache issues ─────
|
|
||||||
# Deleting .wireit forces a full cold rebuild which is very slow and masks the
|
|
||||||
# real problem (bad cache key, fingerprint mismatch, etc.).
|
|
||||||
# If wireit is returning cached results incorrectly, investigate and fix the
|
|
||||||
# underlying issue in the affected package.json wireit configuration instead.
|
|
||||||
if echo "$COMMAND" | grep -qE 'rm\s+.*\.wireit|rm\s+-[a-zA-Z]*rf?\s+.*\.wireit'; then
|
|
||||||
deny "BLOCKED: Do not delete .wireit to force a cold rebuild. This masks a real wireit configuration problem. Investigate which script has a stale fingerprint or incorrect 'files' / 'output' declaration in package.json, then fix that instead. See wireit docs for cache invalidation."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 10: No --force with npm run (wireit cache bust) ───────────────────
|
|
||||||
# Passing --force to wireit-backed scripts bypasses the cache and triggers a
|
|
||||||
# full cold rebuild. This masks fingerprint/config bugs and slows CI.
|
|
||||||
# If a script is using a stale cache, diagnose the wireit 'files'/'output'
|
|
||||||
# config instead of forcing a rebuild.
|
|
||||||
if echo "$COMMAND" | grep -qE 'npm\s+run\s+[a-zA-Z:_-]+\s+--\s+--force'; then
|
|
||||||
deny "BLOCKED: Do not use -- --force with npm run scripts. This bypasses the wireit cache and masks configuration bugs. If a script returns stale results, check the 'files'/'output' declarations in its wireit config instead."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 11: No npm run format with specific file args ─────────────────────
|
|
||||||
# Running 'npm run format -- <file>' from the workspace root propagates the
|
|
||||||
# extra argument to every workspace package's format script, causing each to
|
|
||||||
# fail with 'No files matching the pattern'. Format runs on the whole package
|
|
||||||
# directory by default — either run it without args or cd into the right
|
|
||||||
# package first.
|
|
||||||
if echo "$COMMAND" | grep -qE 'npm\s+run\s+format\s+--\s+\S'; then
|
|
||||||
deny "BLOCKED: Do not pass file arguments to 'npm run format'. The extra arg propagates to every workspace package and causes failures. Run 'npm run format' without args to format all files, or cd into the specific package directory first."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 14: No shell reads of workspace package.json files ────────────────
|
|
||||||
# Mirrors the OpenCode read tool guard: reading apps/*/package.json or
|
|
||||||
# packages/*/package.json via cat/head/tail/jq bypasses the read block and
|
|
||||||
# auto-injects every AGENTS.md in that subtree, exhausting the 32K context.
|
|
||||||
# Reading root package.json is fine — only workspace sub-packages are blocked.
|
|
||||||
if echo "$COMMAND" | grep -qE '(cat|head|tail|jq\s+-[a-zA-Z]*r?)\s+[^|>]*(apps|packages)/[^/[:space:]]+/package\.json'; then
|
|
||||||
deny "BLOCKED: Do not use cat/head/tail/jq to read workspace package.json files (apps/*/package.json, packages/*/package.json). These files auto-inject AGENTS.md context that exhausts the model's 32K context window. Use 'npm run' scripts for dependency info, or read root package.json."
|
|
||||||
fi
|
|
||||||
if echo "$COMMAND" | grep -qE '(apps|packages)/[^/[:space:]]+/package\.json.*\|\s*(jq|cat|head|tail)'; then
|
|
||||||
deny "BLOCKED: Do not pipe workspace package.json files (apps/*/package.json, packages/*/package.json) through jq or other readers. These files auto-inject AGENTS.md context that exhausts the model's 32K context window."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── File path checks (replace_string_in_file / create_file / read_file tools) ─
|
|
||||||
# (Policy 8 is a file-path check — see below)
|
|
||||||
if [[ -n "$FILE_PATH" ]]; then
|
|
||||||
|
|
||||||
# ── Policy 8: No editing *.generated.ts files ──────────────────────────────
|
|
||||||
if echo "$FILE_PATH" | grep -qE '\.generated\.ts$'; then
|
|
||||||
deny "BLOCKED: Do not edit *.generated.ts files directly. These are auto-generated and will be overwritten on the next build. Edit the source files (controller.ts, routes.ts, business-logic.ts) instead and run 'npm run build:core' to regenerate."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 12: No editing eslint.config.js files ───────────────────────────
|
|
||||||
if echo "$FILE_PATH" | grep -qE '(^|/)eslint\.config\.[cm]?[jt]s$'; then
|
|
||||||
deny "BLOCKED: Do not edit eslint.config.js files directly. ESLint configuration changes require human review — describe the change needed and let the user decide or consider a method that leads to higher code quality, if available."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Policy 13: No read_file ranges >50 lines (docs/ exempt, limit 500) ─────
|
|
||||||
# Prevents context exhaustion on 32K models from large sequential reads.
|
|
||||||
# docs/ files (documentation) are exempt: they are meant to be read whole
|
|
||||||
# and may use ranges up to 500 lines per call.
|
|
||||||
if [[ "$TOOL_NAME" == "read_file" || "$TOOL_NAME" == "read" || "$TOOL_NAME" == "edit" ]]; then
|
|
||||||
START_LINE=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(String(i.startLine ?? 1));
|
|
||||||
" 2>/dev/null || echo "1")
|
|
||||||
END_LINE=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
const i = d.tool_input || {};
|
|
||||||
process.stdout.write(String(i.endLine ?? 0));
|
|
||||||
" 2>/dev/null || echo "0")
|
|
||||||
if [[ "$END_LINE" -gt 0 ]]; then
|
|
||||||
RANGE=$(( END_LINE - START_LINE + 1 ))
|
|
||||||
if echo "$FILE_PATH" | grep -qE '(^|/)docs/|\.md$|\.txt$'; then
|
|
||||||
# docs/ files and all .md/.txt files — allow up to 500 lines
|
|
||||||
if [[ "$RANGE" -gt 500 ]]; then
|
|
||||||
deny "BLOCKED: Read more than 500 lines at once is prohibited for docs/ and .md/.txt files. Use startLine/endLine to paginate in ≤500-line chunks."
|
|
||||||
fi
|
|
||||||
else
|
|
||||||
# All other files — 50-line limit
|
|
||||||
if [[ "$RANGE" -gt 50 ]]; then
|
|
||||||
deny "BLOCKED: Read more than 50 lines at once is prohibited. Use startLine/endLine to paginate in ≤50-line chunks. For docs/ and .md/.txt files the limit is 500 lines. Use grep_search first to find the right offset."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
fi
|
|
||||||
@ -1,74 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# SessionStart hook: inject project state at conversation start.
|
|
||||||
# Provides current branch, active investigations, and session continuation notes.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo ".")"
|
|
||||||
# Reset tool call counter for periodic self-checks (see post-tool-use.sh)
|
|
||||||
REPO_ID=$(printf '%s' "$REPO_ROOT" | md5sum | cut -c1-8 2>/dev/null || echo "default")
|
|
||||||
echo "0" > "/tmp/.opencode-tool-count-${REPO_ID}"
|
|
||||||
BRANCH=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown")
|
|
||||||
|
|
||||||
# Check for active investigation files
|
|
||||||
active_investigations=""
|
|
||||||
EXPLORATIONS_DIR="$REPO_ROOT/docs/explorations"
|
|
||||||
if [[ -d "$EXPLORATIONS_DIR" ]]; then
|
|
||||||
inv_files=$(find "$EXPLORATIONS_DIR" -name "*.md" -not -empty 2>/dev/null || true)
|
|
||||||
if [[ -n "$inv_files" ]]; then
|
|
||||||
inv_count=$(echo "$inv_files" | wc -l)
|
|
||||||
inv_names=$(echo "$inv_files" | xargs -I{} basename {} .md | sed 's/^/ - /' || true)
|
|
||||||
active_investigations="Active investigation/exploration files (${inv_count}):\n${inv_names}\nReview relevant files before starting related work."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check for session continuation notes
|
|
||||||
session_notes=""
|
|
||||||
if [[ -d "/memories/session" ]]; then
|
|
||||||
session_files=$(find /memories/session -name "*.md" 2>/dev/null || true)
|
|
||||||
if [[ -n "$session_files" ]]; then
|
|
||||||
session_names=$(echo "$session_files" | xargs -I{} basename {} .md | sed 's/^/ - /' || true)
|
|
||||||
session_notes="Session memory files exist:\n${session_names}\nCheck these for context from previous conversations."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check for dead-ends file from previous debugging sessions
|
|
||||||
dead_ends=""
|
|
||||||
DEAD_ENDS_FILE="$REPO_ROOT/.session/dead-ends.md"
|
|
||||||
if [[ -f "$DEAD_ENDS_FILE" ]]; then
|
|
||||||
de_count=$(grep -c '^\s*- \*\*' "$DEAD_ENDS_FILE" 2>/dev/null || echo "0")
|
|
||||||
if [[ "$de_count" -gt 0 ]]; then
|
|
||||||
dead_ends="Active dead-ends file with ~${de_count} entries — read before debugging to avoid re-testing eliminated hypotheses."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Build context message
|
|
||||||
context="PROJECT STATE | Branch: ${BRANCH}"
|
|
||||||
|
|
||||||
if [[ -n "$active_investigations" ]]; then
|
|
||||||
context="${context}\n${active_investigations}"
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ -n "$session_notes" ]]; then
|
|
||||||
context="${context}\n${session_notes}"
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ -n "$dead_ends" ]]; then
|
|
||||||
context="${context}\n${dead_ends}"
|
|
||||||
fi
|
|
||||||
|
|
||||||
context="${context}\nREMINDERS: (a) Check AGENTS.md and package-level AGENTS.md files for implementation guidance. (b) Ordered markdown lists are auto-renumbered by the editor on save — do not manually renumber after inserting or removing items."
|
|
||||||
|
|
||||||
# Prefix with a self-identifying marker so the model cannot confuse the
|
|
||||||
# injection with project content.
|
|
||||||
context="[HOOK INJECTION: session-start] System context — injected at session start, not part of any user message or tool output:\n\n${context}"
|
|
||||||
|
|
||||||
# Output JSON
|
|
||||||
json_context=$(printf '%b' "$context" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))')
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "SessionStart",
|
|
||||||
"additionalContext": ${json_context}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
@ -1,176 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# Stop hook:
|
|
||||||
# 1. Validates TODO.md and COMPLETED.md — blocking if violations found.
|
|
||||||
# 2. Prompts agent to record lessons learned before session ends (non-blocking).
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo ".")"
|
|
||||||
SESSION_DIR="$REPO_ROOT/.session"
|
|
||||||
TODO_FILE="$REPO_ROOT/docs/TODO.md"
|
|
||||||
COMPLETED_FILE="$REPO_ROOT/docs/projects/COMPLETED.md"
|
|
||||||
|
|
||||||
# ── Validation (blocking) ────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
problems=""
|
|
||||||
|
|
||||||
# Check TODO.md for completed [x] or ✅ numbered items — they should be moved to COMPLETED.md
|
|
||||||
if [[ -f "$TODO_FILE" ]]; then
|
|
||||||
todo_completed=$(grep -n '^\s*[0-9]\+\. \(\[x\]\|✅\)' "$TODO_FILE" 2>/dev/null || true)
|
|
||||||
if [[ -n "$todo_completed" ]]; then
|
|
||||||
count=$(echo "$todo_completed" | wc -l | tr -d ' ')
|
|
||||||
problems="docs/TODO.md contains ${count} completed [x]/✅ task(s). Move them to docs/projects/COMPLETED.md and remove from TODO.md.\nLines:\n${todo_completed}"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check COMPLETED.md for inline [x] items — it should contain stubs/summaries only
|
|
||||||
if [[ -f "$COMPLETED_FILE" ]]; then
|
|
||||||
completed_inline=$(grep -n '^\s*\(-\|[0-9]\+\.\) \[x\]' "$COMPLETED_FILE" 2>/dev/null || true)
|
|
||||||
if [[ -n "$completed_inline" ]]; then
|
|
||||||
count=$(echo "$completed_inline" | wc -l | tr -d ' ')
|
|
||||||
if [[ -n "$problems" ]]; then
|
|
||||||
problems="${problems}\n\n"
|
|
||||||
fi
|
|
||||||
problems="${problems}docs/projects/COMPLETED.md contains ${count} inline [x] item(s). Archive to a completed/*.md file and replace with a stub entry."
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ -n "$problems" ]]; then
|
|
||||||
msg=$(printf '%b' "$problems" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))')
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "Stop",
|
|
||||||
"decision": "block",
|
|
||||||
"reason": ${msg}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Lessons learned prompt (non-blocking) ────────────────────────────────────
|
|
||||||
|
|
||||||
# Check whether all questions in the user's last prompt were answered
|
|
||||||
unanswered_reminder=""
|
|
||||||
LAST_PROMPT_FILE="/tmp/.last-user-prompt.txt"
|
|
||||||
if [[ -f "$LAST_PROMPT_FILE" ]]; then
|
|
||||||
last_prompt=$(cat "$LAST_PROMPT_FILE" 2>/dev/null || true)
|
|
||||||
if [[ -n "$last_prompt" ]]; then
|
|
||||||
unanswered_reminder="$last_prompt"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check whether the dev server is running (ports 3000/3001)
|
|
||||||
dev_server_running=""
|
|
||||||
if ss -tlnp 2>/dev/null | grep -qE ':300[01]\s'; then
|
|
||||||
dev_server_running="yes"
|
|
||||||
fi
|
|
||||||
dead_ends_active=""
|
|
||||||
DEAD_ENDS_FILE="$SESSION_DIR/dead-ends.md"
|
|
||||||
if [[ -f "$DEAD_ENDS_FILE" ]]; then
|
|
||||||
entry_count=$(grep -c '^\s*- \*\*' "$DEAD_ENDS_FILE" 2>/dev/null || echo "0")
|
|
||||||
if [[ "$entry_count" -gt 0 ]]; then
|
|
||||||
dead_ends_active="yes"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check for investigation files that may need status updates
|
|
||||||
open_investigations=""
|
|
||||||
EXPLORATIONS_DIR="$REPO_ROOT/docs/explorations"
|
|
||||||
if [[ -d "$EXPLORATIONS_DIR" ]]; then
|
|
||||||
inv_files=$(find "$EXPLORATIONS_DIR" -name "*.md" -not -empty 2>/dev/null || true)
|
|
||||||
if [[ -n "$inv_files" ]]; then
|
|
||||||
# Check for explorations NOT yet marked complete (status line doesn't contain 'complete')
|
|
||||||
active=$(echo "$inv_files" | while read -r f; do
|
|
||||||
if ! grep -qi '^\*\*Status\*\*.*complete\|^Status:.*complete' "$f" 2>/dev/null; then
|
|
||||||
echo "$f"
|
|
||||||
fi
|
|
||||||
done || true)
|
|
||||||
if [[ -n "$active" ]]; then
|
|
||||||
active_names=$(echo "$active" | xargs -I{} basename {} .md | sed 's/^/ - /' || true)
|
|
||||||
open_investigations="yes"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check if pre-compact-state.md still lists active investigations
|
|
||||||
stale_compact=""
|
|
||||||
COMPACT_FILE="$SESSION_DIR/pre-compact-state.md"
|
|
||||||
if [[ -f "$COMPACT_FILE" ]]; then
|
|
||||||
active_in_compact=$(grep -A20 '## Active Investigations' "$COMPACT_FILE" 2>/dev/null | grep -v '##' | grep -v '^\s*$' | head -5 || true)
|
|
||||||
if [[ -n "$active_in_compact" ]]; then
|
|
||||||
stale_compact="yes"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Build the lessons-learned prompt
|
|
||||||
prompt="SESSION END — QUALITY GATE"
|
|
||||||
|
|
||||||
# Remind agent to verify it answered every question in the user's last message
|
|
||||||
if [[ -n "$unanswered_reminder" ]]; then
|
|
||||||
prompt="${prompt}\n\nANSWER CHECK: Before finishing, re-read the user's last message below and confirm"
|
|
||||||
prompt="${prompt} you addressed EVERY question and request in it — not just the primary task:"
|
|
||||||
prompt="${prompt}\n---"
|
|
||||||
prompt="${prompt}\n${unanswered_reminder}"
|
|
||||||
prompt="${prompt}\n---"
|
|
||||||
fi
|
|
||||||
if [[ -n "$dev_server_running" ]]; then
|
|
||||||
prompt="${prompt}\nThe dev server IS running (port 3000/3001 detected). Run: npm test && npm run lint"
|
|
||||||
prompt="${prompt}\nThen ask the user to confirm the build is clean in their terminal."
|
|
||||||
else
|
|
||||||
prompt="${prompt}\nThe dev server is NOT running. Run: npm run build:strict"
|
|
||||||
prompt="${prompt}\nThis runs build + lint + format:check + tests. Do NOT skip this if you changed any source files."
|
|
||||||
fi
|
|
||||||
prompt="${prompt}\n"
|
|
||||||
prompt="${prompt}\n---"
|
|
||||||
prompt="${prompt}\nLESSONS LEARNED CAPTURE — Before finishing, consider recording reusable insights:"
|
|
||||||
prompt="${prompt}\n"
|
|
||||||
prompt="${prompt}\n1. **Process insights** (what worked, what didn't, workflow improvements) → write to /memories/repo/ or /memories/"
|
|
||||||
|
|
||||||
if [[ -n "$dead_ends_active" ]]; then
|
|
||||||
prompt="${prompt}\n2. **Dead-ends file** has entries — verify all have results recorded (ELIMINATED/CONFIRMED with reasons)"
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ -n "$open_investigations" ]]; then
|
|
||||||
prompt="${prompt}\n3. **Open explorations** not yet marked complete — update their status or mark complete:"
|
|
||||||
prompt="${prompt}\n${active_names}"
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ -n "$stale_compact" ]]; then
|
|
||||||
prompt="${prompt}\n3. **pre-compact-state.md** still lists active investigations — clear or update it now that the session is ending"
|
|
||||||
fi
|
|
||||||
|
|
||||||
prompt="${prompt}\n4. **TODO.md review** — Check docs/TODO.md: did any work this session complete or partially complete an item?"
|
|
||||||
prompt="${prompt}\n - If an item is done: mark it [x], then move the whole entry to docs/projects/COMPLETED.md and remove it from TODO.md"
|
|
||||||
prompt="${prompt}\n - If partially done: update the description to reflect current state"
|
|
||||||
prompt="${prompt}\n - Don't wait for the blocking check — proactively keep TODO.md accurate"
|
|
||||||
prompt="${prompt}\n5. **New tasks discovered** → note them for the user or add to docs/TODO.md"
|
|
||||||
prompt="${prompt}\n"
|
|
||||||
prompt="${prompt}\n---"
|
|
||||||
prompt="${prompt}\nEFFORT REFLECTION: If this session required significant effort (many tool calls, multiple dead-ends, complex investigation):"
|
|
||||||
prompt="${prompt}\n Ask yourself: What information, if it had existed at the start, would have prevented most of that work?"
|
|
||||||
prompt="${prompt}\n First, determine scope — is this globally applicable, or specific to certain files/patterns?"
|
|
||||||
prompt="${prompt}\n Then lean toward hooks as the solution:"
|
|
||||||
prompt="${prompt}\n • Hard stops via PreToolUse blocks (best when the bad action is a terminal command)"
|
|
||||||
prompt="${prompt}\n • PostToolUse reminders (fire right after editing a relevant file — effective because they appear mid-task)"
|
|
||||||
prompt="${prompt}\n • applyTo: instructions files scoped to a file glob (fire when the agent opens matching files)"
|
|
||||||
prompt="${prompt}\n • PreCompact saves investigation state before context compression (PostCompact does not exist)"
|
|
||||||
prompt="${prompt}\n • Stop / SessionStart: on-demand summary injection — less precise than PostToolUse but good for broad reminders"
|
|
||||||
prompt="${prompt}\n These are all more reliable than AGENTS.md sections (lost-in-the-middle problem)."
|
|
||||||
prompt="${prompt}\n Record the insight in the right hook/instructions file, NOT just in AGENTS.md."
|
|
||||||
prompt="${prompt}\n"
|
|
||||||
prompt="${prompt}\nSkip categories that don't apply. Only record genuinely new insights."
|
|
||||||
|
|
||||||
# Prefix with a self-identifying marker so the model cannot confuse the
|
|
||||||
# injection with the user's own message or prior tool output.
|
|
||||||
prompt="[HOOK INJECTION: stop] System reminder — injected at session end, not part of any user message or tool output:\n\n${prompt}"
|
|
||||||
|
|
||||||
json_prompt=$(printf '%b' "$prompt" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))')
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "Stop",
|
|
||||||
"additionalContext": ${json_prompt}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
@ -1,64 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# UserPromptSubmit hook:
|
|
||||||
# 1. Writes the typed prompt text to a temp file as a fallback.
|
|
||||||
# 2. Injects an additionalContext instruction telling the agent to overwrite
|
|
||||||
# that file with a complete capture (including attachments / selections)
|
|
||||||
# before doing anything else. The Stop hook reads the file to remind the
|
|
||||||
# agent to verify every question was answered.
|
|
||||||
# 3. Injects agent nudges when trigger words are detected in the prompt.
|
|
||||||
# Nudges fire here (context tail) rather than in AGENTS.md (context start)
|
|
||||||
# so they survive long sessions — see "lost in the middle" in .agents/AGENTS.md.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
INPUT=$(cat)
|
|
||||||
|
|
||||||
PROMPT=$(echo "$INPUT" | node -e "
|
|
||||||
const d = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
|
|
||||||
process.stdout.write(d.prompt || '');
|
|
||||||
" 2>/dev/null || true)
|
|
||||||
|
|
||||||
if [[ -n "$PROMPT" ]]; then
|
|
||||||
echo "$PROMPT" > /tmp/.last-user-prompt.txt
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Agent nudges ─────────────────────────────────────────────────────────────
|
|
||||||
# Check prompt for trigger words and append a one-line nudge to additionalContext.
|
|
||||||
# Only one nudge fires per turn (brainstorm takes priority over research).
|
|
||||||
NUDGE=""
|
|
||||||
|
|
||||||
if echo "$PROMPT" | grep -qiE \
|
|
||||||
'\bwait\b|actually|hmm+|\breconsider\b|overthink|overcomplicat|too complicated|simpler way|going in circles|second thoughts|what if we|maybe i should|step back|rethink|hold on|i.m not sure|analysis paralysis|stuck'; then
|
|
||||||
NUDGE="NUDGE: The user may be showing signs of analysis paralysis or complexity spiral. Consider suggesting they switch to @brainstorm — one sentence, non-intrusive. Don't force it; if they decline, don't repeat it for the same topic."
|
|
||||||
elif echo "$PROMPT" | grep -qiE \
|
|
||||||
'why is this broken|something.s wrong|not working|what changed|it used to work|\bregression\b|\btrace\b|dig into|root cause|figure out why|\bdebug\b|\bdiagnose\b|investigate|repeated.*fail|why.*code|how does this work|i don.t understand'; then
|
|
||||||
NUDGE="NUDGE: The user may need systematic investigation. Consider suggesting they switch to @research — one sentence, non-intrusive. Don't force it; if they decline, don't repeat it for the same topic."
|
|
||||||
fi
|
|
||||||
|
|
||||||
TASK_CAPTURE="TASK CAPTURE (do this first, before any other action): Write a concise bullet list of every distinct question and task in this prompt — including content from any attached files or selected text — to /tmp/.last-user-prompt.txt. Read the file and choose to overwrite or merge, depending on if there are still unanswered questions remaining. This file is read at session end to verify every question was answered. If there is nothing to capture (e.g. a one-word test message), write a single dash to the file."
|
|
||||||
|
|
||||||
# CURRENT QUESTION marker — placed last (context tail = highest recency bias).
|
|
||||||
# If the current prompt resembles an earlier one, this prevents the model from
|
|
||||||
# answering the earlier version. Mechanism validated by S2A (Weston & Sukhbaatar,
|
|
||||||
# arXiv:2311.11829): explicitly isolating the current query from prior context
|
|
||||||
# reduces sycophancy and improves factuality without a second LLM call.
|
|
||||||
CURRENT_Q="CURRENT QUESTION: Answer the user's most recent message (above). If a similar question appeared earlier in this session, answer THIS version — do not conflate it with the prior one."
|
|
||||||
|
|
||||||
if [[ -n "$NUDGE" ]]; then
|
|
||||||
ADDITIONAL_CONTEXT="${NUDGE}\n\n${TASK_CAPTURE}\n\n${CURRENT_Q}"
|
|
||||||
else
|
|
||||||
ADDITIONAL_CONTEXT="${TASK_CAPTURE}\n\n${CURRENT_Q}"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Prefix with a self-identifying marker so the model cannot confuse the
|
|
||||||
# injection with the user's own message.
|
|
||||||
ADDITIONAL_CONTEXT="[HOOK INJECTION: user-prompt-submit] System reminder — NOT part of the user's message:\n\n${ADDITIONAL_CONTEXT}"
|
|
||||||
|
|
||||||
json_context=$(printf '%b' "$ADDITIONAL_CONTEXT" | node -e 'process.stdout.write(JSON.stringify(require("fs").readFileSync("/dev/stdin","utf8")))')
|
|
||||||
cat <<EOF
|
|
||||||
{
|
|
||||||
"hookSpecificOutput": {
|
|
||||||
"hookEventName": "UserPromptSubmit",
|
|
||||||
"additionalContext": ${json_context}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
@ -1,8 +0,0 @@
|
|||||||
[Unit]
|
|
||||||
Description=Restart llama-server when presets.ini changes
|
|
||||||
|
|
||||||
[Path]
|
|
||||||
PathModified=/home/dev/models/presets.ini
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=default.target
|
|
||||||
@ -1,6 +0,0 @@
|
|||||||
[Unit]
|
|
||||||
Description=Restart llama-server (triggered by presets.ini change)
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
Type=oneshot
|
|
||||||
ExecStart=/bin/systemctl restart llama-server
|
|
||||||
@ -1,15 +0,0 @@
|
|||||||
[Unit]
|
|
||||||
Description=llama-server
|
|
||||||
After=network-online.target
|
|
||||||
Wants=network-online.target
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
ExecStart=/opt/llama-server/start.sh
|
|
||||||
User=ollama
|
|
||||||
Group=ollama
|
|
||||||
Restart=always
|
|
||||||
RestartSec=3
|
|
||||||
Environment="PATH=/home/dev/.nvm/versions/node/v24.15.0/bin:/home/dev/.opencode/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common:/mnt/c/Python312/Scripts/:/mnt/c/Python312/:/mnt/c/Program Files/Microsoft/jdk-17.0.8.7-hotspot/bin:/mnt/c/Program Files/Oculus/Support/oculus-runtime:/mnt/c/Windows/system32:/mnt/c/Windows:/mnt/c/Windows/System32/Wbem:/mnt/c/Windows/System32/WindowsPowerShell/v1.0/:/mnt/c/Windows/System32/OpenSSH/:/mnt/c/Program Files/dotnet/:/mnt/c/Program Files/Microsoft VS Code/bin:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/nodejs/:/mnt/c/ProgramData/chocolatey/bin:/mnt/c/Users/Dev/AppData/Local/Programs/cursor/resources/app/bin:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/Docker/Docker/resources/bin:/mnt/c/Users/Dev/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/Dev/AppData/Roaming/npm:/mnt/c/Users/Dev/.dotnet/tools:/mnt/c/Users/Dev/AppData/Local/Microsoft/WinGet/Packages/albertony.npiperelay_Microsoft.Winget.Source_8wekyb3d8bbwe:/mnt/c/Users/Dev/.lmstudio/bin:/snap/bin"
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=default.target
|
|
||||||
@ -1,147 +0,0 @@
|
|||||||
version = 1
|
|
||||||
|
|
||||||
; ─── Global ──────────────────────────────────────────────────────────────────
|
|
||||||
; Settings in [*] are inherited by every model loaded by the router.
|
|
||||||
; Per-model sections below override individual keys.
|
|
||||||
[*]
|
|
||||||
|
|
||||||
; Number of model layers to offload to GPU.
|
|
||||||
; 99 means "offload everything" — llama.cpp loads as many as fit and falls back
|
|
||||||
; to CPU automatically for any overflow. Using an explicit value avoids the
|
|
||||||
; occasional conservative auto-estimate.
|
|
||||||
; Default: auto
|
|
||||||
; n-gpu-layers = 99
|
|
||||||
|
|
||||||
; Flash Attention: reduces KV-cache VRAM usage and speeds up long-context
|
|
||||||
; inference by computing attention without materializing the full NxN matrix.
|
|
||||||
; "on" forces it; "auto" (default) enables it when CUDA is detected — same
|
|
||||||
; effect in practice, but explicit is clearer here.
|
|
||||||
; Default: auto
|
|
||||||
flash-attn = on
|
|
||||||
|
|
||||||
; Number of CPU threads used for non-GPU work: tokenization, sampling, and any
|
|
||||||
; layers that overflow VRAM during hybrid inference. ~2/3 of physical cores is
|
|
||||||
; the rule of thumb; going higher causes contention on the same cores the GPU
|
|
||||||
; DMA uses. (Machine has 12 logical cores → 8 threads.)
|
|
||||||
; Default: -1 (use all cores)
|
|
||||||
threads = 8
|
|
||||||
|
|
||||||
; Number of inference slots (parallel sequences). 1 = single-user server with
|
|
||||||
; no batching overhead. Increase only if you need concurrent requests; each
|
|
||||||
; extra slot consumes a proportional share of KV-cache VRAM.
|
|
||||||
; Default: -1 (auto, usually 1)
|
|
||||||
parallel = 1
|
|
||||||
|
|
||||||
; Jinja2 chat templating — required for models with complex chat templates
|
|
||||||
; (e.g. Qwen3, which uses raise_exception() guards). Without this, llama.cpp
|
|
||||||
; falls back to a static PEG auto-parser that can't handle those templates.
|
|
||||||
jinja = on
|
|
||||||
|
|
||||||
; Token budget for chain-of-thought reasoning.
|
|
||||||
; -1 = unrestricted (model decides when to stop thinking)
|
|
||||||
; 0 = disable thinking entirely
|
|
||||||
; N = hard cap at N tokens, then force the model to answer
|
|
||||||
; Commented out: matches the default (-1 = unrestricted).
|
|
||||||
; reasoning-budget = -1
|
|
||||||
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
; ─── Qwen3-14B ───────────────────────────────────────────────────────────────
|
|
||||||
; ~8.5 GB GGUF — fits fully in 12 GB VRAM. Fast (~12–18 tok/s). Good daily
|
|
||||||
; driver for interactive coding and Q&A.
|
|
||||||
[Qwen_Qwen3-14B-Q4_K_M]
|
|
||||||
|
|
||||||
; Full 32 K context is safe: 14B fits in VRAM with plenty of headroom for the
|
|
||||||
; KV cache. At 32 K × 2 bytes × 2 (K+V) × 40 layers ≈ ~5 GB worst-case KV.
|
|
||||||
; Default: 0 (read from model metadata, typically the training context limit)
|
|
||||||
ctx-size = 32768
|
|
||||||
|
|
||||||
; Cap generation at 4096 tokens. Prevents runaway responses; raise if you need
|
|
||||||
; longer output (documentation, large refactors). Default: -1 (unlimited)
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
|
|
||||||
; ─── OmniCoder-2-9B ──────────────────────────────────────────────────────────
|
|
||||||
; ~9.4 GB GGUF — fits fully in 12 GB VRAM. Fast generation. Vision-capable
|
|
||||||
; (multimodal projector at OmniCoder-2-9B.Q8_0/mmproj-Q8_0.gguf — auto-detected
|
|
||||||
; from subdirectory layout by the router).
|
|
||||||
[OmniCoder-2-9B.Q8_0]
|
|
||||||
|
|
||||||
; Full 32 K context fits comfortably alongside 9B weights.
|
|
||||||
; Default: 0 (read from model metadata)
|
|
||||||
ctx-size = 32768
|
|
||||||
|
|
||||||
; Cap generation at 4096 tokens. Default: -1 (unlimited)
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
|
|
||||||
; ─── Qwen3.6-35B-A3B (MoE + MTP) ────────────────────────────────────────────
|
|
||||||
; 13.6 GB GGUF — ~12 GB on GPU, ~1.6 GB CPU offload on a 12 GB card.
|
|
||||||
; MoE model: only ~3B parameters active per forward pass despite 35B total.
|
|
||||||
; MTP (multi-token prediction) heads baked in — uses draft-mtp speculative
|
|
||||||
; decoding to roughly double throughput vs non-speculative. Requires b9279+.
|
|
||||||
[Qwen3.6-35B-A3B-IQ3_S-3.06bpw]
|
|
||||||
|
|
||||||
; KV cache is small (~31 MiB/1K tokens) due to GQA — 32K context only needs
|
|
||||||
; ~1 GB KV cache, which pages to CPU gracefully without major throughput loss.
|
|
||||||
ctx-size = 32768
|
|
||||||
|
|
||||||
; Cap generation at 4096 tokens. Default: -1 (unlimited)
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
; Multi-token prediction speculative decoding.
|
|
||||||
; spec-type = draft-mtp uses MTP heads built into the model weights.
|
|
||||||
spec-type = draft-mtp
|
|
||||||
|
|
||||||
; Minimum acceptance probability for a speculated draft token (0–1).
|
|
||||||
; 0.75 = accept tokens the model is 75%+ confident in. Lower = more aggressive
|
|
||||||
; speculation (faster but slightly more divergence risk).
|
|
||||||
spec-draft-p-min = 0.75
|
|
||||||
|
|
||||||
; Max tokens to speculate per step. 3 is the sweet spot for Qwen3.6 MTP.
|
|
||||||
spec-draft-n-max = 3
|
|
||||||
|
|
||||||
|
|
||||||
; ─── Qwen3.6-27B ─────────────────────────────────────────────────────────────
|
|
||||||
; 17 GB GGUF — ~12 GB on GPU, ~5 GB CPU offload on a 12 GB card.
|
|
||||||
; Slower (~4–8 tok/s) due to CPU↔GPU transfers; best for deep analysis tasks.
|
|
||||||
[Qwen_Qwen3.6-27B-Q4_K_M]
|
|
||||||
|
|
||||||
; Smaller context than 14B to keep the KV cache on-GPU. At 16 K the KV cache
|
|
||||||
; is roughly half the size, which reduces how much spills to CPU on each
|
|
||||||
; forward pass — meaningful when every byte of VRAM is already spoken for.
|
|
||||||
; Default: 0 (read from model metadata)
|
|
||||||
ctx-size = 16384
|
|
||||||
|
|
||||||
; Cap generation at 4096 tokens. Default: -1 (unlimited)
|
|
||||||
n-predict = 4096
|
|
||||||
|
|
||||||
[Qwopus3.6-27B-v2-MTP-Q4_K_M]
|
|
||||||
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
spec-type = draft-mtp
|
|
||||||
spec-draft-p-min = 0.75
|
|
||||||
spec-draft-n-max = 3
|
|
||||||
|
|
||||||
[Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M]
|
|
||||||
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
spec-type = draft-mtp
|
|
||||||
spec-draft-p-min = 0.75
|
|
||||||
spec-draft-n-max = 3
|
|
||||||
|
|
||||||
[Qwopus3.5-9B-Coder-MTP-Q8_0]
|
|
||||||
|
|
||||||
ctx-size = 65536
|
|
||||||
n-predict = 4096
|
|
||||||
spec-type = draft-mtp
|
|
||||||
spec-draft-p-min = 0.75
|
|
||||||
spec-draft-n-max = 3
|
|
||||||
|
|
||||||
[agentica-org_DeepCoder-14B-Preview-Q5_K_M]
|
|
||||||
|
|
||||||
ctx-size = 32768
|
|
||||||
n-predict = 4096
|
|
||||||
@ -1,9 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
export LD_LIBRARY_PATH=/opt/llama-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
|
|
||||||
cd /opt/llama-server
|
|
||||||
exec /opt/llama-server/llama-server \
|
|
||||||
--models-dir /home/dev/models \
|
|
||||||
--models-max 1 \
|
|
||||||
--models-preset /home/dev/models/presets.ini \
|
|
||||||
--host 127.0.0.1 \
|
|
||||||
--port 8080
|
|
||||||
1
.agents/mcp/.gitignore
vendored
1
.agents/mcp/.gitignore
vendored
@ -1 +0,0 @@
|
|||||||
node_modules
|
|
||||||
@ -1,200 +0,0 @@
|
|||||||
#!/usr/bin/env node
|
|
||||||
/**
|
|
||||||
* all-agents MCP server — shared agent infrastructure over the Model Context Protocol.
|
|
||||||
*
|
|
||||||
* Prompts and tools are auto-discovered from sibling directories:
|
|
||||||
* ../agents/*.md → slash-command prompts (requires description: frontmatter)
|
|
||||||
* ../skills/*.md → model-controlled tools (requires description: frontmatter)
|
|
||||||
*
|
|
||||||
* Agent/skill bodies are read from disk at invocation time — editing any .md
|
|
||||||
* file takes effect immediately without restarting the server.
|
|
||||||
*
|
|
||||||
* Frontmatter fields:
|
|
||||||
* description (required) — routing description for the prompt/tool
|
|
||||||
* toolName (skills only, optional) — override the derived tool name
|
|
||||||
* default: load_<basename> (e.g. research-methodology.md → load_research-methodology)
|
|
||||||
*
|
|
||||||
* Not handled here (stays bespoke):
|
|
||||||
* hooks/ — MCP has no lifecycle intercept primitive
|
|
||||||
* AGENTS.md — always-on bootstrap; model needs it before tools/list
|
|
||||||
*
|
|
||||||
* Run: node --experimental-strip-types .agents/mcp/index.ts
|
|
||||||
* Config: ~/.vscode-server/data/User/mcp.json (Copilot),
|
|
||||||
* ~/.config/opencode/opencode.json (OpenCode global)
|
|
||||||
*/
|
|
||||||
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
|
|
||||||
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
|
|
||||||
import { readFileSync, readdirSync } from "node:fs";
|
|
||||||
import { basename, resolve } from "node:path";
|
|
||||||
import { z } from "zod";
|
|
||||||
|
|
||||||
const agentsDir = resolve(import.meta.dirname, "../agents");
|
|
||||||
const skillsDir = resolve(import.meta.dirname, "../skills");
|
|
||||||
|
|
||||||
interface ParsedFile {
|
|
||||||
description: string;
|
|
||||||
toolName?: string | undefined;
|
|
||||||
body: string;
|
|
||||||
}
|
|
||||||
|
|
||||||
/** Parse YAML frontmatter and return description, optional toolName, and body. */
|
|
||||||
function parseFrontmatter(content: string): ParsedFile {
|
|
||||||
const lines = content.split("\n");
|
|
||||||
if (lines[0] !== "---") return { description: "", body: content.trim() };
|
|
||||||
const end = lines.indexOf("---", 1);
|
|
||||||
if (end === -1) return { description: "", body: content.trim() };
|
|
||||||
|
|
||||||
const frontmatter = lines.slice(1, end).join("\n");
|
|
||||||
const body = lines
|
|
||||||
.slice(end + 1)
|
|
||||||
.join("\n")
|
|
||||||
.trim();
|
|
||||||
|
|
||||||
// Simple single-line or quoted-string extraction for description and toolName
|
|
||||||
const descMatch = frontmatter.match(
|
|
||||||
/^description:\s*['"]?([\s\S]*?)['"]?\s*$/m,
|
|
||||||
);
|
|
||||||
const toolMatch = frontmatter.match(/^toolName:\s*['"]?([^'"]+)['"]?\s*$/m);
|
|
||||||
|
|
||||||
// Handle multi-line description values (block scalar or wrapped string)
|
|
||||||
let description = "";
|
|
||||||
if (descMatch) {
|
|
||||||
// If the match includes a leading quote, strip matching quotes
|
|
||||||
const raw = frontmatter.match(/^description:\s*(['"])([\s\S]*?)\1\s*$/m);
|
|
||||||
description = raw ? raw[2]?.trim() ?? '' : descMatch[1]?.trim() ?? '';
|
|
||||||
}
|
|
||||||
|
|
||||||
return {
|
|
||||||
description,
|
|
||||||
toolName: toolMatch?.[1]?.trim(),
|
|
||||||
body,
|
|
||||||
};
|
|
||||||
}
|
|
||||||
|
|
||||||
function stripLocalBlocks(body: string): string {
|
|
||||||
return body.replace(/<!-- @local -->[\s\S]*?<!-- @endlocal -->\n?/g, "");
|
|
||||||
}
|
|
||||||
|
|
||||||
function stripCloudBlocks(body: string): string {
|
|
||||||
return body.replace(/<!-- @cloud -->[\s\S]*?<!-- @endcloud -->\n?/g, "");
|
|
||||||
}
|
|
||||||
|
|
||||||
/**
|
|
||||||
* Returns 'local' when the MCP client identifies as `opencode` (local-model
|
|
||||||
* harness), 'cloud' for any other client (Copilot / VS Code etc.).
|
|
||||||
*/
|
|
||||||
function getClientProfile(): "local" | "cloud" {
|
|
||||||
const info = server.server.getClientVersion();
|
|
||||||
return info?.name === "opencode" ? "local" : "cloud";
|
|
||||||
}
|
|
||||||
|
|
||||||
function applyClientProfile(body: string): string {
|
|
||||||
return getClientProfile() === "local"
|
|
||||||
? stripCloudBlocks(body)
|
|
||||||
: stripLocalBlocks(body);
|
|
||||||
}
|
|
||||||
|
|
||||||
const server = new McpServer({ name: "all-agents", version: "1.0.0" });
|
|
||||||
|
|
||||||
// ── Prompts (auto-discovered from ../agents/*.md) ─────────────────────────────
|
|
||||||
|
|
||||||
const agentFiles = readdirSync(agentsDir).filter(
|
|
||||||
(f) => f.endsWith(".md") && f !== "AGENTS.md",
|
|
||||||
);
|
|
||||||
|
|
||||||
for (const file of agentFiles) {
|
|
||||||
const name = basename(file, ".md");
|
|
||||||
const { description, body } = parseFrontmatter(
|
|
||||||
readFileSync(resolve(agentsDir, file), "utf8"),
|
|
||||||
);
|
|
||||||
if (!description) {
|
|
||||||
process.stderr.write(
|
|
||||||
`[all-agents] WARNING: ${file} has no description — skipping\n`,
|
|
||||||
);
|
|
||||||
continue;
|
|
||||||
}
|
|
||||||
|
|
||||||
const argKey =
|
|
||||||
name === "orchestrator" ? "goal" : name === "brainstorm" ? "topic" : "task";
|
|
||||||
const argDesc =
|
|
||||||
name === "orchestrator"
|
|
||||||
? "The high-level goal to decompose"
|
|
||||||
: name === "brainstorm"
|
|
||||||
? "The problem or decision to brainstorm"
|
|
||||||
: "The specific task or question";
|
|
||||||
|
|
||||||
server.registerPrompt(
|
|
||||||
name,
|
|
||||||
{
|
|
||||||
description,
|
|
||||||
argsSchema: { [argKey]: z.string().optional().describe(argDesc) },
|
|
||||||
},
|
|
||||||
(args: Record<string, string | undefined>) => {
|
|
||||||
const input = args[argKey];
|
|
||||||
const agentBody = applyClientProfile(
|
|
||||||
parseFrontmatter(readFileSync(resolve(agentsDir, file), "utf8")).body,
|
|
||||||
);
|
|
||||||
return {
|
|
||||||
messages: [
|
|
||||||
{
|
|
||||||
role: "user" as const,
|
|
||||||
content: {
|
|
||||||
type: "text" as const,
|
|
||||||
text: input ? `${agentBody}\n\n${input}` : agentBody,
|
|
||||||
},
|
|
||||||
},
|
|
||||||
],
|
|
||||||
};
|
|
||||||
},
|
|
||||||
);
|
|
||||||
}
|
|
||||||
|
|
||||||
// ── Tools (auto-discovered from ../skills/*.md) ───────────────────────────────
|
|
||||||
|
|
||||||
const skillFiles = readdirSync(skillsDir).filter((f) => f.endsWith(".md"));
|
|
||||||
|
|
||||||
for (const file of skillFiles) {
|
|
||||||
const name = basename(file, ".md");
|
|
||||||
const { description, toolName } = parseFrontmatter(
|
|
||||||
readFileSync(resolve(skillsDir, file), "utf8"),
|
|
||||||
);
|
|
||||||
if (!description) {
|
|
||||||
process.stderr.write(
|
|
||||||
`[all-agents] WARNING: ${file} has no description — skipping\n`,
|
|
||||||
);
|
|
||||||
continue;
|
|
||||||
}
|
|
||||||
|
|
||||||
server.registerTool(toolName ?? `load_${name}`, { description }, () => ({
|
|
||||||
content: [
|
|
||||||
{
|
|
||||||
type: "text" as const,
|
|
||||||
text: parseFrontmatter(readFileSync(resolve(skillsDir, file), "utf8"))
|
|
||||||
.body,
|
|
||||||
},
|
|
||||||
],
|
|
||||||
}));
|
|
||||||
}
|
|
||||||
|
|
||||||
// ── Resources (no-op to satisfy resources/list) ──────────────────────────────
|
|
||||||
|
|
||||||
server.registerResource(
|
|
||||||
"noop",
|
|
||||||
"noop://noop",
|
|
||||||
{ description: "No-op resource (satisfies resources/list)" },
|
|
||||||
() => ({
|
|
||||||
contents: [{ uri: "noop://noop", mimeType: "text/plain", text: "" }],
|
|
||||||
}),
|
|
||||||
);
|
|
||||||
|
|
||||||
// ── Connect ───────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
const transport = new StdioServerTransport();
|
|
||||||
try {
|
|
||||||
await server.connect(transport);
|
|
||||||
} catch (err) {
|
|
||||||
process.stderr.write(
|
|
||||||
`MCP connect failed: ${err instanceof Error ? err.message : String(err)}\n`,
|
|
||||||
);
|
|
||||||
process.exit(1);
|
|
||||||
}
|
|
||||||
1176
.agents/mcp/package-lock.json
generated
1176
.agents/mcp/package-lock.json
generated
File diff suppressed because it is too large
Load Diff
@ -1,13 +0,0 @@
|
|||||||
{
|
|
||||||
"name": "@dotfiles/all-agents-mcp",
|
|
||||||
"version": "1.0.0",
|
|
||||||
"private": true,
|
|
||||||
"type": "module",
|
|
||||||
"dependencies": {
|
|
||||||
"@modelcontextprotocol/sdk": "^1.29.0",
|
|
||||||
"zod": "^4.1.12"
|
|
||||||
},
|
|
||||||
"devDependencies": {
|
|
||||||
"@types/node": "^25.9.1"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@ -1,45 +0,0 @@
|
|||||||
{
|
|
||||||
// Visit https://aka.ms/tsconfig to read more about this file
|
|
||||||
"compilerOptions": {
|
|
||||||
"preserveSymlinks": true,
|
|
||||||
// File Layout
|
|
||||||
// "rootDir": "./src",
|
|
||||||
// "outDir": "./dist",
|
|
||||||
// Environment Settings
|
|
||||||
// See also https://aka.ms/tsconfig/module
|
|
||||||
"module": "nodenext",
|
|
||||||
"target": "esnext",
|
|
||||||
"lib": [
|
|
||||||
"esnext"
|
|
||||||
],
|
|
||||||
"types": [
|
|
||||||
"node"
|
|
||||||
],
|
|
||||||
// For nodejs:
|
|
||||||
// "lib": ["esnext"],
|
|
||||||
// "types": ["node"],
|
|
||||||
// and npm install -D @types/node
|
|
||||||
// Other Outputs
|
|
||||||
"sourceMap": true,
|
|
||||||
"declaration": true,
|
|
||||||
"declarationMap": true,
|
|
||||||
// Stricter Typechecking Options
|
|
||||||
"noUncheckedIndexedAccess": true,
|
|
||||||
"exactOptionalPropertyTypes": true,
|
|
||||||
// Style Options
|
|
||||||
// "noImplicitReturns": true,
|
|
||||||
// "noImplicitOverride": true,
|
|
||||||
// "noUnusedLocals": true,
|
|
||||||
// "noUnusedParameters": true,
|
|
||||||
// "noFallthroughCasesInSwitch": true,
|
|
||||||
// "noPropertyAccessFromIndexSignature": true,
|
|
||||||
// Recommended Options
|
|
||||||
"strict": true,
|
|
||||||
"jsx": "react-jsx",
|
|
||||||
"verbatimModuleSyntax": true,
|
|
||||||
"isolatedModules": true,
|
|
||||||
"noUncheckedSideEffectImports": true,
|
|
||||||
"moduleDetection": "force",
|
|
||||||
"skipLibCheck": true,
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@ -1,34 +0,0 @@
|
|||||||
---
|
|
||||||
description: Execution rules for debugging: hypothesis testing, instrumentation, and trace cleanup
|
|
||||||
---
|
|
||||||
|
|
||||||
# Research Execution
|
|
||||||
|
|
||||||
Keep context clean and evidence tracked during active investigation.
|
|
||||||
|
|
||||||
## Context Management
|
|
||||||
|
|
||||||
Methodology degrades after ~15 tool calls. Re-read investigation file and
|
|
||||||
dead-ends every ~10 tool calls. When drifting toward guess-and-check, pause and
|
|
||||||
re-read notes. Hold references; load on demand.
|
|
||||||
|
|
||||||
## Findings Format
|
|
||||||
|
|
||||||
Record each hypothesis test to `.session/findings.md`:
|
|
||||||
|
|
||||||
```
|
|
||||||
- [timestamp] Hypothesis: [one sentence]
|
|
||||||
Falsification: [what you'd expect if wrong]
|
|
||||||
Result: [ELIMINATED/CONFIRMED] — [why, in one sentence]
|
|
||||||
```
|
|
||||||
|
|
||||||
## Timing Awareness
|
|
||||||
|
|
||||||
Prefix unknown commands with `time`. Fast (<5s): low barrier. Slow (>30s):
|
|
||||||
reason first. Unknown: measure. Capture: `time cmd 2>&1 | tee /tmp/output.txt`
|
|
||||||
|
|
||||||
## Techniques
|
|
||||||
|
|
||||||
- **Five Whys**: trace causal chains; starting point, not sole method
|
|
||||||
- **Delta Debugging**: binary search between passing/failing cases
|
|
||||||
- **Rubber Duck**: explain the system step by step to expose gaps
|
|
||||||
@ -1,33 +0,0 @@
|
|||||||
---
|
|
||||||
description: Checklist for investigation setup: orientations, hypothesis, and circuit breaker baselines
|
|
||||||
---
|
|
||||||
|
|
||||||
# Research Setup
|
|
||||||
|
|
||||||
**Goal**: Build a grounded mental model before acting.
|
|
||||||
|
|
||||||
## Investigation Checklist
|
|
||||||
|
|
||||||
Before every hypothesis cycle:
|
|
||||||
|
|
||||||
- [ ] Hypothesis written (one sentence: "I believe X because Y")
|
|
||||||
- [ ] Falsification criterion written ("if wrong, I'd expect to see ___")
|
|
||||||
- [ ] Falsification test run BEFORE confirmation test
|
|
||||||
- [ ] Result recorded (ELIMINATED with reason, or CONFIRMED with evidence)
|
|
||||||
- [ ] Hypothesis re-evaluated at this tool-call boundary
|
|
||||||
- [ ] All traces/instrumentation removed before next hypothesis
|
|
||||||
|
|
||||||
## Orientations
|
|
||||||
|
|
||||||
**Understand (Grounded Theory)** — Read code, name what you see. Compare new
|
|
||||||
observations against earlier ones. Connect categories (what calls what, data
|
|
||||||
flows). Write findings to session memory. Stop at saturation.
|
|
||||||
|
|
||||||
**Diagnose (Strong Inference + Satisficing)** — Simple check first: can a
|
|
||||||
single log answer the question. When no single log answers the question,
|
|
||||||
triage (see `research-triage.md`).
|
|
||||||
|
|
||||||
## Mode Switching
|
|
||||||
|
|
||||||
These compose recursively:
|
|
||||||
Understand -> anomaly -> Diagnose -> need context -> Understand -> ...
|
|
||||||
@ -1,20 +0,0 @@
|
|||||||
---
|
|
||||||
description: Risk assessment table for debugging: symptom-to-cause mapping and verification steps
|
|
||||||
---
|
|
||||||
|
|
||||||
# Research Triage
|
|
||||||
|
|
||||||
Assess risk before choosing your approach.
|
|
||||||
|
|
||||||
| Factor | Low Risk | High Risk |
|
|
||||||
| ----------------- | ------------------------ | ------------------------------ |
|
|
||||||
| **Reversibility** | Easy to undo | Hard to reverse (data, deploy) |
|
|
||||||
| **Blast radius** | One file/function | Many systems, shared state |
|
|
||||||
| **Confidence** | Familiar, clear evidence | Novel, ambiguous symptoms |
|
|
||||||
| **Novelty** | Seen this before | Never encountered |
|
|
||||||
| **Time cost** | Known fast (<5s) | Unknown = measure first |
|
|
||||||
|
|
||||||
**Low risk** → Satisfice: test the single most likely hypothesis. Stop when confirmed.
|
|
||||||
|
|
||||||
**Any high risk** → Strong Inference: generate 2-3 competing hypotheses, design
|
|
||||||
a discriminating test, eliminate based on evidence.
|
|
||||||
@ -1,62 +0,0 @@
|
|||||||
# Verification Exercise: `build` agent smoke test
|
|
||||||
|
|
||||||
**Setup**: Open OpenCode → the default agent is now `orchestrator`. To test the
|
|
||||||
`build` agent directly, either Tab-cycle to it or use
|
|
||||||
`opencode run --agent build "your prompt"`.
|
|
||||||
|
|
||||||
## Level 1 — Read-only (verifies tool-call JSON is valid)
|
|
||||||
|
|
||||||
> **Prompt**: "Read .agents/hooks/post-tool-use.sh. Report: (1) what file path
|
|
||||||
> the counter uses, (2) what line the SELF-CHECK fires on, and (3) the exact
|
|
||||||
> modulo condition."
|
|
||||||
|
|
||||||
### Pass criteria:
|
|
||||||
|
|
||||||
- No tool call parse error in the OpenCode UI
|
|
||||||
- It reads the file in ≤50-line chunks (pagination rule working)
|
|
||||||
- Reports `/tmp/.opencode-tool-count-<hash>`, line ~23, `COUNT % 15 == 0`
|
|
||||||
- Session counter file exists: `ls /tmp/.opencode-tool-count-* 2>/dev/null`
|
|
||||||
|
|
||||||
## Level 2 — Small bounded write (verifies end-to-end tool call + edit)
|
|
||||||
|
|
||||||
> **Prompt**: "In .agents/hooks/post-tool-use.sh, the REPO_ID derivation line
|
|
||||||
> uses md5sum. Add a single-line comment directly above it (# repo-scoped to
|
|
||||||
> avoid cross-repo counter contamination) and nothing else."
|
|
||||||
|
|
||||||
### Pass criteria:
|
|
||||||
|
|
||||||
- Makes exactly 2–3 tool calls (read → edit → optionally verify)
|
|
||||||
- Doesn't read more than 50 lines at once
|
|
||||||
- The comment appears on the correct line in the file
|
|
||||||
- No hallucinated paths
|
|
||||||
|
|
||||||
## Level 3 — Scope escalation (verifies rule 5 in build.md)
|
|
||||||
|
|
||||||
> **Prompt**: "Refactor all five hook files to share a common REPO_ROOT
|
|
||||||
> derivation function."
|
|
||||||
|
|
||||||
### Pass criteria:
|
|
||||||
|
|
||||||
- It refuses and tells you this exceeds 2–3 files / needs the orchestrator or
|
|
||||||
default agent
|
|
||||||
- It does NOT start reading all five files and attempting the refactor
|
|
||||||
|
|
||||||
If Level 1 and 2 pass cleanly and Level 3 correctly escalates, the build agent
|
|
||||||
is working. If Level 1 shows parse errors, restart OpenCode to reload the
|
|
||||||
renamed agent config.
|
|
||||||
|
|
||||||
## Level 4 — Orchestrator planning gate (cloud only)
|
|
||||||
|
|
||||||
**Setup**: Switch to the `orchestrator` agent (or use `/orchestrator` in
|
|
||||||
Copilot). Run a vague multi-step request.
|
|
||||||
|
|
||||||
> **Prompt**: "Clean up the hook files — reduce repetition and make sure the
|
|
||||||
> conventions match what's in .agents/AGENTS.md."
|
|
||||||
|
|
||||||
### Pass criteria:
|
|
||||||
|
|
||||||
- Produces a numbered plan with clear subtasks and acceptance criteria
|
|
||||||
- Asks "Proceed?" before starting any implementation
|
|
||||||
- Does NOT immediately start reading or editing files
|
|
||||||
- After confirming, executes subtasks sequentially with inline tool calls
|
|
||||||
(cloud) or dispatches to `build` via `task` (OpenCode/local)
|
|
||||||
39
README.md
39
README.md
@ -1,39 +0,0 @@
|
|||||||
# dotfiles
|
|
||||||
|
|
||||||
Personal dotfiles and AI-agent infrastructure for VS Code Copilot and OpenCode.
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
|
|
||||||
For host machines, install dotfiles plus llama-server config and systemd services via the `--host` flag:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://github.com/username/dotfiles ~/dotfiles
|
|
||||||
~/dotfiles/install.sh --host
|
|
||||||
```
|
|
||||||
|
|
||||||
If using devcontainers, drop the `--host` flag in the Dockerfile or just rely on vscode settings or, possibly better, a devcontainer "features" config such as:
|
|
||||||
|
|
||||||
```json
|
|
||||||
"features": {
|
|
||||||
"ghcr.io/willfantom/features/dotfiles:1": {
|
|
||||||
"repository": "git@git.bcdewitt.ddns.net:bcdewitt/dotfiles.git",
|
|
||||||
"targetPath": "~/dotfiles",
|
|
||||||
"installCommand": "install.sh"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## What Gets Installed
|
|
||||||
|
|
||||||
**Basic install** (`install.sh`):
|
|
||||||
- Agent hooks wired into VS Code Copilot and OpenCode (the `.agents/` infrastructure)
|
|
||||||
- OpenCode config symlinked to `~/.config/opencode/opencode.json`
|
|
||||||
|
|
||||||
**Host install** (`install.sh --host`):
|
|
||||||
- Everything in basic install, plus:
|
|
||||||
- llama-server presets, startup script, and systemd units from `.agents/llama-server/`
|
|
||||||
|
|
||||||
## Idempotent
|
|
||||||
|
|
||||||
The install script is idempotent — safe to re-run at any time. It skips steps that
|
|
||||||
are already in place and only changes what needs updating.
|
|
||||||
349
install.sh
349
install.sh
@ -1,349 +0,0 @@
|
|||||||
#!/usr/bin/env bash
|
|
||||||
# install.sh — Wire .agents/ into global tool configs.
|
|
||||||
# Run with --host to also install llama-server, VS Code, Docker, and extensions.
|
|
||||||
# Idempotent: safe to re-run. Creates dirs, symlinks, and config entries.
|
|
||||||
# Run once per machine after cloning dotfiles.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
INSTALL_HOST=false
|
|
||||||
for arg in "$@"; do case "$arg" in --host) INSTALL_HOST=true ;; esac; done
|
|
||||||
|
|
||||||
DOTFILES_AGENTS="$(cd "$(dirname "$0")" && pwd)/.agents"
|
|
||||||
|
|
||||||
log() { printf '\033[0;32m✓\033[0m %s\n' "$1"; }
|
|
||||||
warn() { printf '\033[0;33m⚠\033[0m %s\n' "$1"; }
|
|
||||||
skip() { printf '\033[0;34m–\033[0m %s\n' "$1"; }
|
|
||||||
|
|
||||||
# ── 1. Copilot global hooks ──────────────────────────────────────────────────
|
|
||||||
# Generate ~/.copilot/hooks/hooks.json with absolute paths so the hooks
|
|
||||||
# work from any workspace — no per-project symlinks or stubs needed.
|
|
||||||
COPILOT_HOOKS_DIR="$HOME/.copilot/hooks"
|
|
||||||
COPILOT_HOOK_FILE="$COPILOT_HOOKS_DIR/hooks.json"
|
|
||||||
|
|
||||||
mkdir -p "$COPILOT_HOOKS_DIR"
|
|
||||||
|
|
||||||
# Migrate: remove old symlink if present
|
|
||||||
if [[ -L "$COPILOT_HOOK_FILE" ]]; then
|
|
||||||
rm "$COPILOT_HOOK_FILE"
|
|
||||||
log "Removed old Copilot hook symlink (migrating to generated file)"
|
|
||||||
fi
|
|
||||||
|
|
||||||
EXPECTED_PRE="$DOTFILES_AGENTS/hooks/pre-tool-use.sh"
|
|
||||||
if [[ -f "$COPILOT_HOOK_FILE" ]] && \
|
|
||||||
node -e "const c=JSON.parse(require('fs').readFileSync('$COPILOT_HOOK_FILE','utf8')); process.exit(c.hooks&&c.hooks.PreToolUse&&c.hooks.PreToolUse[0].command==='$EXPECTED_PRE'?0:1);" 2>/dev/null; then
|
|
||||||
skip "Copilot global hooks already up-to-date: $COPILOT_HOOK_FILE"
|
|
||||||
else
|
|
||||||
node -e "
|
|
||||||
const fs = require('fs');
|
|
||||||
const d = '$DOTFILES_AGENTS/hooks';
|
|
||||||
const hooks = {
|
|
||||||
UserPromptSubmit: [{type:'command',command:d+'/user-prompt-submit.sh',timeout:5}],
|
|
||||||
SessionStart: [{type:'command',command:d+'/session-start.sh',timeout:10}],
|
|
||||||
PreToolUse: [{type:'command',command:d+'/pre-tool-use.sh',timeout:5}],
|
|
||||||
PostToolUse: [{type:'command',command:d+'/post-tool-use.sh',timeout:5}],
|
|
||||||
PreCompact: [{type:'command',command:d+'/pre-compact.sh',timeout:10}],
|
|
||||||
Stop: [{type:'command',command:d+'/stop.sh',timeout:5}]
|
|
||||||
};
|
|
||||||
fs.writeFileSync('$COPILOT_HOOK_FILE', JSON.stringify({hooks}, null, 2) + '\n');
|
|
||||||
"
|
|
||||||
log "Copilot global hooks generated with absolute paths: $COPILOT_HOOK_FILE"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 2. OpenCode global plugin ────────────────────────────────────────────────
|
|
||||||
OC_PLUGINS_DIR="$HOME/.config/opencode/plugins"
|
|
||||||
OC_PLUGIN_TARGET="$DOTFILES_AGENTS/frameworks/opencode/plugin.ts"
|
|
||||||
OC_PLUGIN_LINK="$OC_PLUGINS_DIR/plugin.ts"
|
|
||||||
|
|
||||||
mkdir -p "$OC_PLUGINS_DIR"
|
|
||||||
if [[ -L "$OC_PLUGIN_LINK" && "$(readlink "$OC_PLUGIN_LINK")" == "$OC_PLUGIN_TARGET" ]]; then
|
|
||||||
skip "OpenCode plugin symlink already set: $OC_PLUGIN_LINK"
|
|
||||||
else
|
|
||||||
ln -sf "$OC_PLUGIN_TARGET" "$OC_PLUGIN_LINK"
|
|
||||||
log "OpenCode plugin symlink: $OC_PLUGIN_LINK → $OC_PLUGIN_TARGET"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 3. OpenCode global agents dir ───────────────────────────────────────────
|
|
||||||
OC_AGENTS_DIR="$HOME/.config/opencode/agents"
|
|
||||||
OC_AGENTS_SOURCE="$DOTFILES_AGENTS/agents"
|
|
||||||
|
|
||||||
mkdir -p "$OC_AGENTS_DIR"
|
|
||||||
for src in "$OC_AGENTS_SOURCE"/*.md; do
|
|
||||||
name="$(basename "$src")"
|
|
||||||
link="$OC_AGENTS_DIR/$name"
|
|
||||||
if [[ "$name" == "AGENTS.md" ]]; then continue; fi # not a slash-command agent
|
|
||||||
if [[ -L "$link" && "$(readlink "$link")" == "$src" ]]; then
|
|
||||||
skip "OpenCode agent symlink already set: $link"
|
|
||||||
else
|
|
||||||
ln -sf "$src" "$link"
|
|
||||||
log "OpenCode agent symlink: $link → $src"
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
# ── 3a. OpenCode global AGENTS.md ───────────────────────────────────────────
|
|
||||||
OC_AGENTS_TARGET="$DOTFILES_AGENTS/AGENTS.md"
|
|
||||||
OC_AGENTS_LINK="$HOME/.config/opencode/AGENTS.md"
|
|
||||||
|
|
||||||
if [[ -L "$OC_AGENTS_LINK" && "$(readlink "$OC_AGENTS_LINK")" == "$OC_AGENTS_TARGET" ]]; then
|
|
||||||
skip "OpenCode AGENTS.md symlink already set: $OC_AGENTS_LINK"
|
|
||||||
else
|
|
||||||
ln -sf "$OC_AGENTS_TARGET" "$OC_AGENTS_LINK"
|
|
||||||
log "OpenCode AGENTS.md symlink: $OC_AGENTS_LINK → $OC_AGENTS_TARGET"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 4. OpenCode global config (opencode.json) ────────────────────────────────
|
|
||||||
OC_CONFIG_SOURCE="$DOTFILES_AGENTS/frameworks/opencode/opencode.json"
|
|
||||||
OC_CONFIG_LINK="$HOME/.config/opencode/opencode.json"
|
|
||||||
|
|
||||||
mkdir -p "$(dirname "$OC_CONFIG_LINK")"
|
|
||||||
if [[ -L "$OC_CONFIG_LINK" && "$(readlink "$OC_CONFIG_LINK")" == "$OC_CONFIG_SOURCE" ]]; then
|
|
||||||
skip "OpenCode config symlink already set: $OC_CONFIG_LINK"
|
|
||||||
else
|
|
||||||
ln -sf "$OC_CONFIG_SOURCE" "$OC_CONFIG_LINK"
|
|
||||||
log "OpenCode config symlink: $OC_CONFIG_LINK → $OC_CONFIG_SOURCE"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 5. Build llama-server (requires --host) ──────────────────────────────────
|
|
||||||
if [[ "$INSTALL_HOST" != "true" ]]; then
|
|
||||||
skip "llama-server build skipped (use --host to install)"
|
|
||||||
else
|
|
||||||
if [[ -x /opt/llama-server/llama-server ]]; then
|
|
||||||
skip "llama-server already installed at /opt/llama-server/llama-server"
|
|
||||||
else
|
|
||||||
sudo apt-get install -y cmake build-essential nvidia-cuda-toolkit libgomp1 git
|
|
||||||
|
|
||||||
(
|
|
||||||
git clone --depth 1 --branch b9279 https://github.com/ggml-org/llama.cpp.git /tmp/llama-build
|
|
||||||
cd /tmp/llama-build
|
|
||||||
|
|
||||||
cmake -B build \
|
|
||||||
-DGGML_CUDA=ON \
|
|
||||||
-DCMAKE_BUILD_TYPE=Release \
|
|
||||||
-DLLAMA_BUILD_SERVER=ON \
|
|
||||||
-DLLAMA_BUILD_TESTS=OFF \
|
|
||||||
-DLLAMA_BUILD_EXAMPLES=OFF
|
|
||||||
|
|
||||||
cmake --build build --config Release -j$(nproc)
|
|
||||||
|
|
||||||
sudo mkdir -p /opt/llama-server
|
|
||||||
sudo cp build/bin/llama-server /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/libggml*.so* /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/libllama*.so* /opt/llama-server/
|
|
||||||
sudo cp -P build/bin/libmtmd*.so* /opt/llama-server/ 2>/dev/null || true
|
|
||||||
|
|
||||||
echo "/opt/llama-server" | sudo tee /etc/ld.so.conf.d/llama-server.conf
|
|
||||||
sudo ldconfig
|
|
||||||
|
|
||||||
rm -rf /tmp/llama-build
|
|
||||||
)
|
|
||||||
|
|
||||||
log "llama-server built and installed to /opt/llama-server/"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 6. Llama-server host config (requires --host) ───────────────────────────
|
|
||||||
if [[ "$INSTALL_HOST" != "true" ]]; then
|
|
||||||
skip "Llama-server host config skipped (use --host to install)"
|
|
||||||
else
|
|
||||||
|
|
||||||
# ── 6a. Model downloads (requires --host) ──────────────────────────────────
|
|
||||||
if ! command -v huggingface-cli >/dev/null 2>&1; then
|
|
||||||
warn "huggingface-cli not found — skipping model downloads (install via 'pip install huggingface_hub')"
|
|
||||||
else
|
|
||||||
_hf_download() {
|
|
||||||
local repo="$1" file="$2" dir="$3"
|
|
||||||
local dest="$dir/$file"
|
|
||||||
if [[ -f "$dest" ]]; then
|
|
||||||
skip "Model already present: $dest"
|
|
||||||
else
|
|
||||||
mkdir -p "$dir"
|
|
||||||
huggingface-cli download "$repo" "$file" --local-dir "$dir" >/dev/null
|
|
||||||
log "Downloaded model: $repo/$file → $dest"
|
|
||||||
fi
|
|
||||||
}
|
|
||||||
_hf_download "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF" "Qwopus3.6-27B-v2-MTP-Q4_K_M.gguf" "$HOME/models"
|
|
||||||
_hf_download "Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF" "Qwopus3.5-9B-Coder-MTP-Q8_0.gguf" "$HOME/models"
|
|
||||||
_hf_download "bartowski/agentica-org_DeepCoder-14B-Preview-GGUF" "agentica-org_DeepCoder-14B-Preview-Q5_K_M.gguf" "$HOME/models"
|
|
||||||
_hf_download "byteshape/Qwen3.6-35B-A3B-MTP-GGUF" "Qwen3.6-35B-A3B-IQ3_S-3.06bpw.gguf" "$HOME/models"
|
|
||||||
_hf_download "Jackrong/Qwopus3.6-35B-A3B-v1-MTP-GGUF" "Qwopus3.6-35B-A3B-v1-MTP-Q4_K_M.gguf" "$HOME/models"
|
|
||||||
_hf_download "mradermacher/OmniCoder-2-9B-GGUF" "OmniCoder-2-9B.Q8_0.gguf" "$HOME/models/OmniCoder-2-9B.Q8_0"
|
|
||||||
_hf_download "mradermacher/OmniCoder-2-9B-GGUF" "mmproj-Q8_0.gguf" "$HOME/models/OmniCoder-2-9B.Q8_0"
|
|
||||||
_hf_download "bartowski/Qwen_Qwen3-14B-GGUF" "Qwen_Qwen3-14B-Q4_K_M.gguf" "$HOME/models"
|
|
||||||
_hf_download "bartowski/Qwen_Qwen3.6-27B-GGUF" "Qwen_Qwen3.6-27B-Q4_K_M.gguf" "$HOME/models"
|
|
||||||
fi
|
|
||||||
|
|
||||||
PRESETS_SRC="$DOTFILES_AGENTS/llama-server/presets.ini"
|
|
||||||
PRESETS_DST="$HOME/models/presets.ini"
|
|
||||||
mkdir -p "$HOME/models"
|
|
||||||
if diff -q "$PRESETS_SRC" "$PRESETS_DST" >/dev/null 2>&1; then
|
|
||||||
skip "presets.ini already up-to-date: $PRESETS_DST"
|
|
||||||
else
|
|
||||||
cp "$PRESETS_SRC" "$PRESETS_DST"
|
|
||||||
log "Installed presets.ini → $PRESETS_DST"
|
|
||||||
fi
|
|
||||||
|
|
||||||
SVC_SRC="$DOTFILES_AGENTS/llama-server/llama-server.service"
|
|
||||||
SVC_DST="/etc/systemd/system/llama-server.service"
|
|
||||||
if diff -q "$SVC_SRC" "$SVC_DST" >/dev/null 2>&1; then
|
|
||||||
skip "llama-server.service already up-to-date: $SVC_DST"
|
|
||||||
else
|
|
||||||
cp "$SVC_SRC" "$SVC_DST"
|
|
||||||
log "Installed llama-server.service → $SVC_DST"
|
|
||||||
fi
|
|
||||||
|
|
||||||
PATH_SRC="$DOTFILES_AGENTS/llama-server/llama-server-presets.path"
|
|
||||||
PATH_DST="/etc/systemd/system/llama-server-presets.path"
|
|
||||||
if diff -q "$PATH_SRC" "$PATH_DST" >/dev/null 2>&1; then
|
|
||||||
skip "llama-server-presets.path already up-to-date: $PATH_DST"
|
|
||||||
else
|
|
||||||
cp "$PATH_SRC" "$PATH_DST"
|
|
||||||
log "Installed llama-server-presets.path → $PATH_DST"
|
|
||||||
fi
|
|
||||||
|
|
||||||
PSVC_SRC="$DOTFILES_AGENTS/llama-server/llama-server-presets.service"
|
|
||||||
PSVC_DST="/etc/systemd/system/llama-server-presets.service"
|
|
||||||
if diff -q "$PSVC_SRC" "$PSVC_DST" >/dev/null 2>&1; then
|
|
||||||
skip "llama-server-presets.service already up-to-date: $PSVC_DST"
|
|
||||||
else
|
|
||||||
cp "$PSVC_SRC" "$PSVC_DST"
|
|
||||||
log "Installed llama-server-presets.service → $PSVC_DST"
|
|
||||||
fi
|
|
||||||
|
|
||||||
START_SRC="$DOTFILES_AGENTS/llama-server/start.sh"
|
|
||||||
START_DST="/opt/llama-server/start.sh"
|
|
||||||
mkdir -p "$(dirname "$START_DST")"
|
|
||||||
if diff -q "$START_SRC" "$START_DST" >/dev/null 2>&1; then
|
|
||||||
skip "start.sh already up-to-date: $START_DST"
|
|
||||||
else
|
|
||||||
cp "$START_SRC" "$START_DST"
|
|
||||||
log "Installed start.sh → $START_DST"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 7. VS Code global MCP ────────────────────────────────────────────────────
|
|
||||||
# Primary remote/server path; falls back to local if running VS Code locally.
|
|
||||||
VSCODE_MCP_PATHS=(
|
|
||||||
"$HOME/.vscode-server/data/User/mcp.json"
|
|
||||||
"$HOME/.vscode/data/User/mcp.json"
|
|
||||||
"$HOME/Library/Application Support/Code/User/mcp.json"
|
|
||||||
)
|
|
||||||
|
|
||||||
for VSCODE_MCP in "${VSCODE_MCP_PATHS[@]}"; do
|
|
||||||
if [[ -d "$(dirname "$VSCODE_MCP")" ]]; then
|
|
||||||
MCP_KEY="all-agents"
|
|
||||||
MCP_SERVER_CMD="node"
|
|
||||||
MCP_SERVER_ARGS="[\"--experimental-strip-types\", \"$DOTFILES_AGENTS/mcp/index.ts\"]"
|
|
||||||
|
|
||||||
node -e "
|
|
||||||
const fs = require('fs');
|
|
||||||
const path = '$VSCODE_MCP';
|
|
||||||
const config = fs.existsSync(path) ? JSON.parse(fs.readFileSync(path, 'utf8')) : {};
|
|
||||||
config.servers = config.servers || {};
|
|
||||||
let changed = false;
|
|
||||||
if (!config.servers['$MCP_KEY']) {
|
|
||||||
config.servers['$MCP_KEY'] = { type: 'stdio', command: '$MCP_SERVER_CMD', args: $MCP_SERVER_ARGS };
|
|
||||||
changed = true;
|
|
||||||
}
|
|
||||||
if (!config.servers['exa']) {
|
|
||||||
config.servers['exa'] = { type: 'http', url: 'https://mcp.exa.ai/mcp' };
|
|
||||||
changed = true;
|
|
||||||
}
|
|
||||||
if (changed) {
|
|
||||||
fs.writeFileSync(path, JSON.stringify(config, null, 2) + '\n');
|
|
||||||
console.log('VS Code MCP config updated: ' + path);
|
|
||||||
} else {
|
|
||||||
process.stdout.write('');
|
|
||||||
}
|
|
||||||
"
|
|
||||||
log "VS Code MCP entries ensured: $VSCODE_MCP"
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
# ── 8. VS Code global prompts dir ───────────────────────────────────────────
|
|
||||||
for VSCODE_PROMPTS_DIR in \
|
|
||||||
"$HOME/.vscode-server/data/User/prompts" \
|
|
||||||
"$HOME/.vscode/data/User/prompts"; do
|
|
||||||
if [[ -d "$(dirname "$(dirname "$VSCODE_PROMPTS_DIR")")" ]]; then
|
|
||||||
mkdir -p "$VSCODE_PROMPTS_DIR"
|
|
||||||
log "VS Code prompts dir ensured: $VSCODE_PROMPTS_DIR"
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
# ── 9. MCP server dependencies ───────────────────────────────────────────────
|
|
||||||
MCP_DIR="$DOTFILES_AGENTS/mcp"
|
|
||||||
if [[ ! -d "$MCP_DIR/node_modules/@modelcontextprotocol" ]]; then
|
|
||||||
log "Installing MCP server dependencies (npm install in $MCP_DIR)..."
|
|
||||||
npm install --prefix "$MCP_DIR" --silent
|
|
||||||
log "MCP server dependencies installed"
|
|
||||||
else
|
|
||||||
skip "MCP server node_modules already present"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 10. VS Code, Docker, extensions (requires --host) ────────────────────────
|
|
||||||
if [[ "$INSTALL_HOST" != "true" ]]; then
|
|
||||||
skip "VS Code, Docker, extensions skipped (use --host to install)"
|
|
||||||
else
|
|
||||||
|
|
||||||
# ── 10a. VS Code ──────────────────────────────────────────────────────
|
|
||||||
if command -v code >/dev/null 2>&1; then
|
|
||||||
skip "VS Code already installed"
|
|
||||||
else
|
|
||||||
log "Installing VS Code..."
|
|
||||||
sudo apt-get update
|
|
||||||
sudo apt-get install -y wget gpg
|
|
||||||
wget -qO- https://packages.microsoft.com/keys/microsoft.asc | \
|
|
||||||
gpg --dearmor -o /usr/share/keyrings/packages.microsoft.gpg
|
|
||||||
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" | \
|
|
||||||
sudo tee /etc/apt/sources.list.d/vscode.list
|
|
||||||
sudo apt-get update
|
|
||||||
sudo apt-get install -y code
|
|
||||||
log "VS Code installed"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 10b. Docker ──────────────────────────────────────────────────────
|
|
||||||
if command -v docker >/dev/null 2>&1; then
|
|
||||||
skip "Docker already installed"
|
|
||||||
else
|
|
||||||
log "Installing Docker..."
|
|
||||||
sudo apt-get update
|
|
||||||
sudo apt-get install -y ca-certificates curl
|
|
||||||
sudo install -m 0755 -d /etc/apt/keyrings
|
|
||||||
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
|
|
||||||
sudo chmod a+r /etc/apt/keyrings/docker.asc
|
|
||||||
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
|
|
||||||
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
|
|
||||||
sudo apt-get update
|
|
||||||
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
|
|
||||||
log "Docker installed"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── 10c. VS Code extensions ──────────────────────────────────────────
|
|
||||||
_install_ext() {
|
|
||||||
local ext_id="$1"
|
|
||||||
if code --list-extensions 2>/dev/null | grep -qi "^${ext_id}$"; then
|
|
||||||
skip "VS Code extension already installed: $ext_id"
|
|
||||||
else
|
|
||||||
code --install-extension "$ext_id" >/dev/null 2>&1
|
|
||||||
log "VS Code extension installed: $ext_id"
|
|
||||||
fi
|
|
||||||
}
|
|
||||||
|
|
||||||
_install_ext "ms-vscode-remote.vscode-remote-extensionpack"
|
|
||||||
_install_ext "ms-azuretools.vscode-docker"
|
|
||||||
_install_ext "streetsidesoftware.code-spell-checker"
|
|
||||||
_install_ext "EditorConfig.EditorConfig"
|
|
||||||
_install_ext "dbaeumer.vscode-eslint"
|
|
||||||
_install_ext "mhutchie.git-graph"
|
|
||||||
_install_ext "bierner.github-markdown-preview"
|
|
||||||
_install_ext "esbenp.prettier-vscode"
|
|
||||||
_install_ext "yoavbls.pretty-ts-errors"
|
|
||||||
|
|
||||||
fi
|
|
||||||
|
|
||||||
# ── Done ─────────────────────────────────────────────────────────────────────
|
|
||||||
printf '\n\033[0;32minstall.sh complete.\033[0m\n'
|
|
||||||
printf 'Next steps:\n'
|
|
||||||
printf ' 1. Restart OpenCode to pick up the new global plugin.\n'
|
|
||||||
printf ' 2. Reload VS Code / reconnect to reload MCP servers.\n'
|
|
||||||
printf ' 3. Smoke test: /research slash prompt fires; a denied terminal command is blocked.\n'
|
|
||||||
Loading…
x
Reference in New Issue
Block a user