zcbot/DESIGN.md

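# Personal Task Agent Design, v2

> A lightweight agent framework for three kinds of tasks: building report slide decks (PPT), writing research grant proposals, and writing code.
> Fully self-implemented, combining the strengths of nanobot / CoreCoder / better-claw / smolagents, model-agnostic, and **designed to evolve as models keep improving**.

> **What's new in v2 (2026-05)**:
> 1. Model strategy redesigned around the actual capabilities of DeepSeek V4 (released 2026-04-24)
> 2. Skill system aligned with Anthropic's open standard (published 2025-12, now an industry consensus)
> 3. Tool paradigm changed to Hybrid: JSON tool calls mixed with run_python
> 4. New Chapter 6, "Evolvability Design", so the agent upgrades along with the model
> 5. New Eval Suite framework, used as the basis for model-upgrade decisions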

---

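## 1. Design Goals and Boundaries

### 1.1 What it does

Build a locally running, task-oriented agent that can reliably handle three kinds of work:

| Task mode | Example input | Output | Key capabilities |
|---------|---------|------|---------|
| **PPT mode** | "Turn these meeting notes into a 5-slide report deck" | `.pptx` file | Outline extraction, layout design, chart generation |
| **Proposal mode** | "Write the project rationale for an NSFC Young Scientists Fund application" | `.docx` file (split by section) | Long-form writing, literature search, template formatting |
| **Coding mode** | "Fix the bug in this file" / "Implement this function" | Modified code | File editing, shell execution, iterative verification |

### 1.2 What it explicitly does not do

- ❌ Sub-agents
- ❌ IM channels (Telegram / WeChat / Discord, etc.)
- ❌ Multi-user system
- ❌ Web UI (a CLI is enough at first)
- ❌ Custom RAG / vector retrieval
- ❌ Anthropic lock-in (must stay model-agnostic)

### 1.3 Key constraints

- **Model-agnostic**: supports DeepSeek V4 / Kimi / Qwen / GPT / Claude, or any other model reachable through an OpenAI-compatible API
- **Code you control**: 1300-1600 lines in total (see 3.2), small enough to read and understand end to end
- **Task persistence**: the machine can be shut down at any moment and the task resumed next time
- **Long-task stability**: a single task can run for hours without crashing
- **Evolvability**: when the model is upgraded, the agent's capabilities improve with it, without major architectural changes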

---

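## 2. What Is Borrowed from Each Project

| Source | Borrowed design | Why borrow it |
|--------|-----------|---------|
| **CoreCoder** | Concise main agent loop | The core fits in ~150 lines and is highly readable |
| **CoreCoder** | "Unique match" constraint in the Edit tool | Keeps the LLM from editing the wrong place; an industry best practice |
| **CoreCoder** | Three-tier context compression (simplified; rarely needed in the V4 era) | Fallback mechanism |
| **Anthropic Agent Skills** | **SKILL.md + progressive disclosure standard** | Industry standard, opened 2025-12, cross-platform compatible |
| **nanobot** | Workspace + per-task isolated directory layout | Parallel tasks do not contaminate each other |
| **nanobot** | Two-tier memory (core + extended) | core is injected into the prompt; extended is read on demand |
| **better-claw** | Mid-query rotation + carryover | Engineering safety net for long tasks |
| **better-claw** | Persistent task state (state.json) | Supports resuming after interruption |
| **smolagents** | LiteLLM as the model layer | Switch among 25+ providers with one line |
| **smolagents** | `@tool` decorator style | The most concise way to define tools |
| **smolagents** | **CodeAgent idea (partially adopted)** | Hybrid paradigm implemented through the `run_python` tool |
| **CodeAct paper** | Code as the action format | Reported success rates up to 20% higher than JSON actions on data, computation, and batch tasks |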

---

## 3. Overall Architecture

### 3.1 Layered structure

```
┌────────────────────────────────────────────────────┐
│  Entry layer                                       │
│  - CLI (initially): interactive REPL               │
│  - Possible later: Streamlit / Gradio Web UI       │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│  Task routing layer                                │
│  - /ppt    → PPT mode + ppt skill                  │
│  - /doc    → proposal mode + proposal skill        │
│  - /code   → coding mode + coding skill            │
│  - default → general mode                          │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│  Agent core                                        │
│  - Loop: ReAct loop (LLM ↔ Tool)                   │
│  - Capability Manager: model probing + adaptation  │
│  - Context Manager: 3-tier compression (fallback)  │
│  - Session: persisted conversation history         │
│  - Memory: two-tier memory system                  │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│  Tool layer (Hybrid paradigm)                      │
│  General tools (JSON tool call):                   │
│  - read / write / edit / glob / grep / shell       │
│  - web_search / web_fetch                          │
│  - load_skill                                      │
│  - run_python (sandboxed; the Hybrid key piece)    │
│                                                    │
│  Skill-provided tools (standard format):           │
│  - skills/<name>/scripts/*.py                      │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│  LLM layer (LiteLLM + Model Profile)               │
│  - Default: DeepSeek V4-Flash                      │
│  - Upgrade: DeepSeek V4-Pro / Claude Opus 4.7      │
│  - Profile-based config; new model in ~5 minutes   │
└────────────────────────────────────────────────────┘
```
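The task routing layer in the diagram can be sketched as a small prefix parser. This is a minimal sketch rather than the project's actual code: the names `Route` and `route` and the `(mode, skill, text)` return shape are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    mode: str             # which mode prompt to use
    skill: Optional[str]  # skill to preload, if any
    text: str             # the user's message without the slash command


# Slash command -> (mode, skill), mirroring the routing table in the diagram.
_ROUTES = {
    "/ppt": ("ppt", "ppt"),
    "/doc": ("proposal", "proposal"),
    "/code": ("coding", "coding"),
}


def route(user_input: str) -> Route:
    """Split a leading slash command off the input and map it to a mode."""
    stripped = user_input.strip()
    command, _, rest = stripped.partition(" ")
    if command in _ROUTES:
        mode, skill = _ROUTES[command]
        return Route(mode, skill, rest.strip())
    # No recognized command: fall back to general mode with no preloaded skill.
    return Route("general", None, stripped)
```

Unknown commands deliberately fall through to general mode with the text untouched, so a typo in a command never discards the user's input.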

### 3.2 Directory layout

```
your_agent/
├── core/
│   ├── __init__.py
│   ├── loop.py              # main agent loop                  ~150 lines
│   ├── llm.py               # LiteLLM wrapper                  ~120 lines
│   ├── capabilities.py      # capability probing + adaptation  ~100 lines  ⭐ new
│   ├── context.py           # three-tier context compression   ~150 lines
│   ├── session.py           # session persistence              ~100 lines
│   └── memory.py            # two-tier memory                  ~80 lines
├── tools/
│   ├── __init__.py
│   ├── base.py              # Tool base class                  ~50 lines
│   ├── fs.py                # read/write/edit/glob/grep        ~250 lines
│   ├── shell.py             # bash execution + safety checks   ~80 lines
│   ├── web.py               # web_search + fetch               ~80 lines
│   ├── run_python.py        # sandboxed Python executor        ~100 lines  ⭐ new
│   └── skill_tool.py        # load_skill tool                  ~60 lines
├── skills/                   # ⭐ standard Anthropic Agent Skills format
│   ├── ppt/
│   │   ├── SKILL.md         # main guide (short)
│   │   ├── references/      # detailed material (read on demand)
│   │   ├── scripts/         # executable scripts
│   │   └── assets/          # templates and other resources
│   ├── proposal/
│   │   ├── SKILL.md
│   │   ├── references/
│   │   ├── scripts/
│   │   └── assets/
│   └── coding/
│       ├── SKILL.md
│       └── references/
├── prompts/                  # ⭐ versioned system prompts
│   ├── system/
│   │   ├── general_v1.md
│   │   └── general_active.md → general_v1.md
│   └── modes/
│       ├── ppt.md
│       ├── proposal.md
│       └── coding.md
├── config/
│   ├── agent.yaml           # main config
│   └── models/              # ⭐ model profile library
│       ├── _template.yaml
│       ├── deepseek_v4.yaml
│       ├── claude_4_7.yaml
│       └── gpt_5.yaml
├── evals/                   # ⭐ evaluation task set
│   ├── coding/
│   ├── ppt/
│   ├── proposal/
│   └── runner.py            # eval runner
├── workspace/                # user data (gitignored)
│   ├── tasks/
│   ├── memory/
│   │   ├── core.md
│   │   └── extended/
│   └── logs/
├── cli.py                    # CLI entry point                 ~150 lines
├── main.py                   # wiring + startup                ~50 lines
├── requirements.txt
└── README.md
```

**Estimated total: 1300-1600 lines of Python** (about 200 more than v1, because of the added capabilities and run_python modules)

---

## 4. Core Component Design

### 4.1 Agent Loop (`core/loop.py`)

```python
class AgentLoop:
    def __init__(self, llm, tools, capabilities, context_manager, session,
                 max_iterations=None):
        self.llm = llm
        self.tools = tools
        self.caps = capabilities  # ⭐ model capability profile
        self.ctx = context_manager
        self.session = session
        # The iteration cap comes from capabilities and differs per model
        self.max_iterations = max_iterations or self.caps.max_iterations

    def run(self, user_message: str) -> str:
        self.session.append({"role": "user", "content": user_message})

        for iteration in range(self.max_iterations):
            messages = self.ctx.check_and_compress(self.session.messages)

            response = self.llm.chat(
                messages=messages,
                tools=[t.schema for t in self.tools.values()],
                # ⭐ advanced features are enabled according to capabilities
                parallel_tool_calls=self.caps.parallel_tools,
                reasoning_effort=self.caps.default_reasoning_effort
            )
            msg = response.choices[0].message
            self.session.append(msg)

            # No tool calls means the model has produced its final answer
            if not msg.tool_calls:
                return msg.content

            # ⭐ parallel vs. serial execution is decided by capabilities
            if self.caps.parallel_tools and len(msg.tool_calls) > 1:
                results = self._execute_tools_parallel(msg.tool_calls)
            else:
                results = self._execute_tools_serial(msg.tool_calls)

            for tool_call, result in results:
                self.session.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })

        return "[Maximum number of iterations reached]"
```
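The loop above calls `_execute_tools_parallel` without showing it. One possible shape, written as a standalone function, is sketched below. It assumes OpenAI-style tool calls (`tool_call.function.name` plus a JSON string in `tool_call.function.arguments`) and a `tools` dict whose values expose `execute(**kwargs)`; the function name and the `max_workers` default are illustrative, not taken from the design.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def execute_tools_parallel(tool_calls, tools, max_workers=4):
    """Run independent tool calls concurrently.

    Returns (tool_call, result) pairs in the same order as `tool_calls`.
    """
    def run_one(tool_call):
        tool = tools.get(tool_call.function.name)
        if tool is None:
            return f"[Error] unknown tool: {tool_call.function.name}"
        try:
            args = json.loads(tool_call.function.arguments or "{}")
            return str(tool.execute(**args))
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            return f"[Error] {type(exc).__name__}: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with tool_calls.
        results = list(pool.map(run_one, tool_calls))
    return list(zip(tool_calls, results))
```

Threads are a reasonable fit here because most tools are I/O-bound (file reads, HTTP, subprocesses); errors are returned as strings so that one failing call does not lose the results of the others.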

### 4.2 Model Profile + Capabilities (`core/capabilities.py`) ⭐ new

**Core idea**: every model has its own capability profile, and the agent adjusts its behavior dynamically based on that profile. When a new model comes out, adding one yaml file is enough.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ModelCapabilities:
    """Model capability profile."""
    model_id: str

    # Basic capabilities
    max_context: int                  # maximum context (tokens)
    reliable_context: int             # context length measured to be reliable
    max_output: int                   # maximum output

    # Tool calling
    parallel_tools: bool              # whether parallel tool calls are supported
    tool_calling_quality: str         # "excellent" / "good" / "fair"

    # Thinking mode
    thinking_mode: bool               # whether thinking mode is supported
    reasoning_effort_levels: list     # e.g. ["low","medium","high","max"]
    default_reasoning_effort: str

    # Reasoning and code
    code_quality: str                 # how well the model suits the CodeAct paradigm
    enable_run_python: bool           # whether to enable the run_python tool

    # Engineering parameters
    max_iterations: int               # maximum number of loop iterations
    optimal_temperature: float

    # Special features
    prompt_caching: bool              # Anthropic-specific
    extended_thinking: bool           # Claude 4.x-specific

    @classmethod
    def from_yaml(cls, path: Path) -> "ModelCapabilities":
        """Load from config/models/*.yaml."""
        ...

    @classmethod
    def detect(cls, model_id: str) -> "ModelCapabilities":
        """Find the matching profile automatically from model_id.
        deepseek-v4-flash → config/models/deepseek_v4.yaml (variant=flash)
        claude-opus-4-7 → config/models/claude_4_7.yaml
        """
        ...
```
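The `detect` docstring describes a mapping from model id to a profile file and variant. A minimal sketch of that lookup follows; the rule table `_PROFILE_RULES` and the function name `resolve_profile` are hypothetical, and only the two mappings listed in the docstring are covered.

```python
import re
from pathlib import Path

# Hypothetical rule table: regex on the bare model id -> profile file stem.
# The first capture group is taken as the variant name.
_PROFILE_RULES = [
    (re.compile(r"^deepseek-v4-(flash|pro)$"), "deepseek_v4"),
    (re.compile(r"^claude-(opus)-4-7$"), "claude_4_7"),
]


def resolve_profile(model_id: str, models_dir: Path = Path("config/models")):
    """Return (yaml_path, variant) for a model id; raise KeyError if unknown."""
    name = model_id.split("/")[-1]  # drop a provider prefix such as "deepseek/"
    for pattern, stem in _PROFILE_RULES:
        match = pattern.match(name)
        if match:
            return models_dir / f"{stem}.yaml", match.group(1)
    raise KeyError(f"no model profile registered for {model_id!r}")
```

Raising on an unknown id, rather than guessing a default profile, keeps a typo in `agent.yaml` from silently running a model with the wrong limits.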

**Example model profile** (`config/models/deepseek_v4.yaml`):

```yaml
family: deepseek_v4
variants:
  flash:
    model_id: deepseek/deepseek-v4-flash
    max_context: 1048576
    reliable_context: 262144
    max_output: 384000
    parallel_tools: true
    tool_calling_quality: good
    thinking_mode: true
    reasoning_effort_levels: [non_thinking, thinking]
    default_reasoning_effort: non_thinking
    code_quality: good
    enable_run_python: true
    max_iterations: 50
    optimal_temperature: 0.3
    prompt_caching: false
    extended_thinking: false

  pro:
    model_id: deepseek/deepseek-v4-pro
    max_context: 1048576
    reliable_context: 524288
    max_output: 384000
    parallel_tools: true
    tool_calling_quality: excellent
    thinking_mode: true
    reasoning_effort_levels: [low, medium, high, max]
    default_reasoning_effort: medium
    code_quality: excellent
    enable_run_python: true
    max_iterations: 100
    optimal_temperature: 0.2
    prompt_caching: false
    extended_thinking: false
```

**Claude 4.7 profile** (`config/models/claude_4_7.yaml`):

```yaml
family: claude_4_7
variants:
  opus:
    model_id: anthropic/claude-opus-4-7
    max_context: 200000
    reliable_context: 200000
    max_output: 8192
    parallel_tools: true
    tool_calling_quality: excellent
    thinking_mode: true
    reasoning_effort_levels: [low, medium, high]
    default_reasoning_effort: medium
    code_quality: excellent
    enable_run_python: true
    max_iterations: 100
    optimal_temperature: 0.2
    prompt_caching: true        # Claude-specific
    extended_thinking: true
```

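**How to add a new model (using a future V5 as the example)**:

```bash
# 1. Copy the template
cp config/models/_template.yaml config/models/deepseek_v5.yaml

# 2. Fill in the capabilities (from the model's release blog + one capability probe run)
vim config/models/deepseek_v5.yaml

# 3. Run the eval suite to validate
python evals/runner.py --model deepseek-v5-pro

# 4. Switch the default model (profile.variant form, as in 5.1)
vim config/agent.yaml  # default_model: deepseek_v5.pro
```

The whole process requires no change to the agent's core code.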

### 4.3 LLM wrapper (`core/llm.py`)

```python
import time

import litellm
from litellm import APIConnectionError, RateLimitError


class LLM:
    def __init__(self, capabilities: ModelCapabilities, api_key: str, base_url: str = None):
        self.caps = capabilities
        self.api_key = api_key
        self.base_url = base_url
        self.token_counter = TokenCounter()

    def chat(self, messages, tools=None, parallel_tool_calls=None,
             reasoning_effort=None, max_retries=3):
        # Defaults are filled in from capabilities
        kwargs = {
            "model": self.caps.model_id,
            "messages": messages,
            "tools": tools,
            "temperature": self.caps.optimal_temperature,
            "api_key": self.api_key,
            "base_url": self.base_url,
        }

        # Enable optional features according to capabilities
        if self.caps.parallel_tools and parallel_tool_calls is not False:
            kwargs["parallel_tool_calls"] = True
        if self.caps.thinking_mode and reasoning_effort:
            kwargs["reasoning_effort"] = reasoning_effort
        if self.caps.prompt_caching:
            kwargs["extra_headers"] = {"anthropic-beta": "prompt-caching-2024-07-31"}

        # Retry transient failures with exponential backoff (1s, 2s, ...)
        for attempt in range(max_retries):
            try:
                response = litellm.completion(**kwargs)
                self.token_counter.add(response.usage, self.caps.model_id)
                return response
            except (RateLimitError, APIConnectionError):
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
```

### 4.4 Tool system (Hybrid paradigm)

#### 4.4.1 General tools (JSON tool call)

The heart of the file tools is still the **"unique match" constraint of the Edit tool** (borrowed from CoreCoder):

```python
class EditTool(Tool):
    name = "edit"
    description = "Replace a unique string in a file with another string."

    def execute(self, path: str, old_str: str, new_str: str) -> str:
        content = Path(path).read_text()
        # Refuse to edit unless old_str identifies exactly one location
        count = content.count(old_str)
        if count == 0:
            return f"[Error] old_str not found in {path}"
        if count > 1:
            return f"[Error] old_str appears {count} times, must be unique"
        Path(path).write_text(content.replace(old_str, new_str))
        return self._make_diff(content, ...)
```
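The `_make_diff` helper is elided above. One way to produce the diff that is returned to the model is the standard library's `difflib`; the sketch below is an assumption about what such a helper could look like (the signature and the `[No changes]` marker are illustrative, not the author's).

```python
import difflib


def make_diff(old: str, new: str, path: str, context: int = 3) -> str:
    """Return a unified diff between the old and new contents of `path`."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,  # number of unchanged context lines around each hunk
    )
    text = "".join(diff)
    return text or "[No changes]"
```

Returning a diff instead of the whole file keeps the tool result short while still letting the model confirm that the edit landed where it intended.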

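#### 4.4.2 The `run_python` tool ⭐ new

**The key to the Hybrid paradigm**: the agent mainly uses JSON tool calls, but can write code as an action when needed:

```python
import os
import subprocess
import sys
import tempfile


class RunPythonTool(Tool):
    name = "run_python"
    description = """Execute Python code in a sandboxed environment.

    Use for:
    - Data analysis, statistics, calculations
    - Batch file operations (process many files)
    - Document generation (PPT, Word, charts)
    - Tasks where code is more natural than tool composition

    Available libraries: pandas, numpy, matplotlib, python-pptx, python-docx,
    arxiv, requests, pypdf, pdfplumber.

    Working directory is the current task's tasks/<task_id>/.
    Files created here are automatically available to the user.
    """

    def execute(self, code: str, timeout: int = 60) -> str:
        # Stage 1 (local use): subprocess + venv + working-directory restriction
        # Stage 2 (safer): Docker container
        # Stage 3 (public service): E2B / Modal

        with tempfile.NamedTemporaryFile(suffix=".py", mode="w", delete=False) as f:
            f.write(code)
            script_path = f.name

        try:
            result = subprocess.run(
                [sys.executable, script_path],
                cwd=self.task_dir,  # restrict the working directory
                capture_output=True,
                text=True,  # decode stdout/stderr to str
                timeout=timeout,
                env=self._safe_env()  # filter out sensitive environment variables
            )
            return self._format_result(result)
        except subprocess.TimeoutExpired:
            # Report the timeout to the model instead of raising into the loop
            return f"[Error] execution timed out after {timeout}s"
        finally:
            os.unlink(script_path)
```

**Why this is a key design decision**:
- **JSON tool calls handle discrete operations** (reading files, running commands, looking up literature)
- **Code execution handles continuous logic** (crunching data, batch processing, generating documents)
- The model decides which to use and when; nothing is hard-coded by you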
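The `_safe_env` helper referenced in the tool is not shown in the design. A minimal denylist sketch is given below; the name pattern is an assumed policy chosen for illustration, and a stricter allowlist would be a reasonable alternative.

```python
import os
import re

# Assumed policy: drop any variable whose name looks like a credential.
_SENSITIVE = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|PASSWD|CREDENTIAL", re.IGNORECASE)


def safe_env(base=None):
    """Return a copy of the environment without credential-like variables."""
    source = dict(os.environ) if base is None else dict(base)
    return {k: v for k, v in source.items() if not _SENSITIVE.search(k)}
```

This only keeps API keys out of the child process's environment. It does not prevent the script from reading files outside the task directory, which is why the design treats the subprocess stage as a local-use sandbox.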

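#### 4.4.3 Tool granularity principle ⭐ new

Tools are split along "atomic operations", with no high-level wrappers:

```python
# ❌ Anti-pattern: the tool does too much, so the model cannot use it flexibly
class GenerateProposalTool:
    def execute(self, topic):
        # an 8-section workflow hard-coded inside
        ...

# ✅ Good pattern: atomic operations; the composition strategy is left to the model
class WriteSectionTool: ...    # write one section
class CompileDocxTool: ...     # merge sections into a docx
class SearchPapersTool: ...    # search the literature
class FormatBibtexTool: ...    # format BibTeX
```

**Rationale**: a stronger model will come up with better composition strategies. Wrappers that are too rigid cannot benefit from model upgrades.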

### 4.5 Skill system (aligned with Anthropic's open standard) ⭐ major change

#### 4.5.1 Standard directory layout

Each skill is a directory containing:

```
skills/proposal/
├── SKILL.md              # main guide (short, within ~3000 tokens)
├── references/           # detailed material (loaded on demand)
│   ├── nsfc_format.md
│   ├── citation_style.md
│   └── section_examples.md
├── scripts/              # executable scripts (usable as tools)
│   ├── search_papers.py
│   ├── format_bibtex.py
│   └── compile_docx.py
└── assets/               # templates, fonts, etc.
    └── templates/
        ├── nsfc_youth.docx
        ├── nsfc_general.docx
        └── nsfc_key.docx
```

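#### 4.5.2 Standard SKILL.md format

```markdown
---
name: proposal
description: Write research grant proposals (NSFC / provincial funds / industry-sponsored projects). Use when the user needs to write a project application, a project rationale, or a project plan.
---

# Research grant proposal

## Resources
- `references/nsfc_format.md`: NSFC formatting details
- `references/citation_style.md`: citation conventions
- `references/section_examples.md`: examples for each section
- `scripts/search_papers.py`: executable, literature search
- `scripts/compile_docx.py`: executable, merges sections into a docx
- `assets/templates/`: docx templates for the different fund types

## Principles
- References must be real (use search_papers; never fabricate)
- Write section by section; do not generate the full text in one pass
- Agree on the project info card with the user first

## Working directory
All output goes in `tasks/<task_id>/`:
- `project.md` - project info card
- `sections/<section_name>.md` - draft of each section
- `proposal.docx` - final output

## Length guidance
Project rationale: 5000-8000 Chinese characters; research content: 3000-5000 Chinese characters.
See references/nsfc_format.md for the exact format.
```

Note: **there is no longer a "Step 1/2/3" procedure**. Only resources, principles, and goals are written down, and the model does its own planning.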

#### 4.5.3 Three loading tiers of progressive disclosure

Following the Anthropic standard:

| Tier | When | Content | Token cost |
|---|------|------|----------|
| **Discovery** | Agent startup | Only `name + description`, loaded for every skill | A few hundred tokens |
| **Activation** | The task matches a skill | Full SKILL.md body | 1000-5000 tokens |
| **Execution** | SKILL.md points to a reference | A single reference file | Varies |

Concrete implementation:

```python
# At startup: Discovery
def build_initial_system_prompt(skills) -> str:
    skill_descriptions = []
    for name, skill in skills.items():
        meta = parse_frontmatter(skill["SKILL.md"])
        skill_descriptions.append(f"- {name}: {meta['description']}")

    return f"""
You are a task agent. Available skills:
{chr(10).join(skill_descriptions)}

Use `load_skill(name)` to load full instructions when relevant.
"""

# After the agent calls load_skill: Activation
class LoadSkillTool(Tool):
    def execute(self, name: str) -> str:
        return (skills_dir / name / "SKILL.md").read_text()

# The agent sees references/xxx.md in SKILL.md and calls the read tool itself: Execution
# This tier needs no dedicated tool; it is just the ordinary read tool
```
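The Discovery step depends on `parse_frontmatter`, which is not defined in this document. The technology stack lists `python-frontmatter`, which handles full YAML; the stdlib-only sketch below is an illustrative stand-in that assumes flat, single-line `key: value` pairs, which is all the SKILL.md example uses.

```python
def parse_frontmatter(text: str) -> dict:
    """Extract flat `key: value` pairs from a leading `---` delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the file as having no frontmatter
```

Because only `name` and `description` are read at startup, the cost of Discovery stays proportional to the number of skills rather than to their length.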

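#### 4.5.4 Skill design principles (based on Anthropic's guidance + industry experience)

1. **The description is critical**: it determines whether the model triggers the skill, so it must be clear and specific
2. **Keep the SKILL.md body under 5000 tokens / 500 lines**: beyond that, split content into references/
3. **Write WHY+WHAT, not HOW**: describe goals and resources, not steps
4. **Code is both tool and documentation**: scripts under scripts/ can be executed, and can also be read into context as documentation
5. **Keep at most 20 skills and at most 10 tools visible at once** (selection accuracy tends to drop beyond that)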

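### 4.6 Context management (simplified)

Long-context performance is much better in the V4 era, so **most tasks no longer need elaborate compression**. Three fallback tiers are kept anyway:

```python
class ContextManager:
    def __init__(self, capabilities, llm):
        # ⭐ thresholds come from capabilities and differ per model
        self.max_tokens = capabilities.reliable_context
        self.soft = self.max_tokens * 0.6      # V4 handles long context well, so thresholds are raised
        self.force = self.max_tokens * 0.85
        self.collapse = self.max_tokens * 0.95
        self.llm = llm

    def check_and_compress(self, messages):
        tokens = count_tokens(messages)
        if tokens < self.soft:
            return messages
        if tokens < self.force:
            return self._snip_old_tool_results(messages)
        if tokens < self.collapse:
            return self._microcompact(messages)
        return self._collapse(messages, llm=self.llm)
```

**Expected behavior in practice**: DeepSeek V4-Pro triggers essentially no compression within 256K tokens (its soft threshold is 0.6 × 524288 ≈ 315K). Writing a complete proposal (70-80K tokens) with V4-Flash also stays below the soft threshold (0.6 × 262144 ≈ 157K), so no compression tier is triggered.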
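The first compression tier, `_snip_old_tool_results`, is only named in the class above. A minimal sketch of one plausible behavior follows, assuming messages are plain dicts with string `content`; the `keep_recent` and `max_chars` parameters and the truncation marker are illustrative choices, not values from the design.

```python
def snip_old_tool_results(messages, keep_recent=5, max_chars=500):
    """Truncate older tool outputs, leaving the most recent ones intact.

    Returns a new list; the input messages are not mutated.
    """
    tool_indices = [
        i for i, m in enumerate(messages)
        if isinstance(m, dict) and m.get("role") == "tool"
    ]
    # Everything except the last `keep_recent` tool messages may be snipped.
    old = tool_indices[:-keep_recent] if keep_recent else tool_indices
    to_snip = set(old)

    out = []
    for i, m in enumerate(messages):
        if i in to_snip and len(m["content"]) > max_chars:
            m = {**m, "content": m["content"][:max_chars] + "\n[...older tool output truncated]"}
        out.append(m)
    return out
```

Tool results are the natural first target because they are usually the bulkiest messages and the model has typically already acted on the old ones; user and assistant turns are left untouched at this tier.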

### 4.7 Session and Memory

(Carried over from the v1 design with no major changes)

---

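## 5. Model Routing Strategy (based on V4's actual capabilities) ⭐ major change

### 5.1 Default configuration: V4-Flash as the workhorse

```yaml
# config/agent.yaml
default_model: deepseek_v4.flash

# Per-mode overrides
by_mode:
  general: deepseek_v4.flash

  coding: deepseek_v4.flash         # V4-Pro-Max scores 80.6 on SWE-Bench Verified; Flash is expected to suffice for routine fixes
  coding_hard: deepseek_v4.pro      # complex bugs, architecture design

  ppt: deepseek_v4.flash            # PPT generation does not need a top-tier model

  proposal_draft: deepseek_v4.flash
  proposal_final:
    profile: deepseek_v4.pro
    reasoning_effort: max           # strongest mode for the final draft

# Models for utility calls (cheap)
utility:
  summarize: deepseek_v4.flash
  title: deepseek_v4.flash

# Emergency upgrade path (switch manually when V4 falls short)
fallback:
  - claude_4_7.opus  # if the final NSFC draft is not good enough, switch to Claude temporarily
```

### 5.2 Cost estimates

| Task | V4-Flash | V4-Pro-Max | Claude Opus 4.7 |
|-----|---------|-----------|------------------|
| Fix one bug (~10 turns) | $0.01 | $0.05 | $0.30 |
| 5-slide report deck | $0.05 | $0.20 | $1.50 |
| One complete proposal (2-3 hours) | $0.30 | $1.50 | $10-15 |

**Conclusion**: by these estimates V4-Flash is sufficient for the vast majority of tasks (roughly 99%), critical final drafts can be upgraded to Pro, and Claude serves only as a fallback.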
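Entries under `by_mode` come in two shapes: a bare `profile.variant` string, or a mapping with `profile` and `reasoning_effort`. A small sketch of how a loader might normalize both is shown below; the function name `resolve_mode` and the returned keys are assumptions for illustration, and `config` is the already-parsed `agent.yaml` as a dict.

```python
def resolve_mode(config: dict, mode: str) -> dict:
    """Normalize a by_mode entry to family / variant / reasoning_effort."""
    # Unknown modes fall back to default_model.
    entry = config.get("by_mode", {}).get(mode, config["default_model"])
    if isinstance(entry, str):
        entry = {"profile": entry}
    # "deepseek_v4.pro" -> family "deepseek_v4", variant "pro"
    family, _, variant = entry["profile"].partition(".")
    return {
        "family": family,
        "variant": variant or None,
        "reasoning_effort": entry.get("reasoning_effort"),
    }
```

A `reasoning_effort` of `None` means "use the profile's `default_reasoning_effort`", so per-mode overrides stay optional.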

---

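## 6. Evolvability Design ⭐ new chapter

> **Core question**: models iterate every 3-6 months, so how does the agent avoid being left behind?

### 6.1 Design philosophy

**Less Scaffolding, More Trust**

The core reason older agent frameworks (early LangChain, AutoGPT) failed: **they gave the LLM too much scaffolding, and after model upgrades that scaffolding turned into shackles**.

Counter-examples to learn from:
- Forcing a three-part ReAct output format: once GPT-4 arrived, this format actually made the model perform worse
- Fighting over output format with PydanticOutputParser: redundant once Structured Output was built in
- Spelling out "how you should think" in the prompt: strong models do not need to be taught

**The right approach**: treat the LLM as a **colleague who keeps getting better**. Tell it the goal, not the steps.

### 6.2 Seven concrete principles

#### Principle 1: Prompts state WHY+WHAT, not HOW

```
❌ HOW style:
"When fixing a bug:
1. First read the file with the read tool
2. Then find the relevant location with grep
3. Then replace it with the edit tool
4. Finally run the tests..."

✅ WHY+WHAT style:
"Goal: fix the bug the user reported, with a minimal and reversible change.
Tools: read, edit, grep, run, ...
Principles: verify before changing, keep changes minimal, run tests if they exist."
```

#### Principle 2: Skills use progressive disclosure, not a full written procedure

Align directly with Anthropic's open standard. The Discovery tier holds only the description, and the better the model's comprehension, the more accurately skills are triggered. You do not have to go back and add trigger words to old descriptions.

#### Principle 3: Split tools by atomic operation, with no high-level wrappers

See 4.4.3. If the granularity is too coarse, an upgraded model has no room to show what it can do.

#### Principle 4: Model Profiles, not hard-coding

See 4.2. All model-specific parameters live in yaml, so a new model can be plugged in within about 5 minutes.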

#### Principle 5: Capability Probing (at startup)

```python
def probe_capabilities(llm) -> dict:
    """Run a few small tests at startup to verify the capabilities claimed in the yaml."""
    results = {}

    # Test 1: do parallel tool calls actually work?
    response = llm.chat([...], tools=test_tools)
    tool_calls = response.choices[0].message.tool_calls or []
    results["parallel_tools_actual"] = len(tool_calls) > 1

    # Test 2: does thinking mode produce the expected output?
    response = llm.chat([...], reasoning_effort="medium")
    reasoning = getattr(response.choices[0].message, "reasoning_content", None)
    results["thinking_works"] = reasoning is not None

    # Test 3: long-context recall (simplified needle in a haystack)
    code = random_code()  # generate once so the needle and the check use the same code
    needle = f"The secret code is {code}."
    haystack = make_long_context(needle, target_tokens=100_000)
    response = llm.chat([{"role": "user", "content": haystack + "\nWhat is the secret code?"}])
    results["long_context_100k"] = code in (response.choices[0].message.content or "")

    return results
```

If the measured capabilities do not match the yaml, warn and adjust automatically.
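The "warn and adjust" step could look roughly like the sketch below. It uses the result keys produced by `probe_capabilities` and works on plain dicts; the function name `reconcile` and the `fallback_context` value used when the 100K recall probe fails are assumptions, since the design does not say how far to back off.

```python
import warnings


def reconcile(declared: dict, probed: dict, fallback_context: int = 65_536) -> dict:
    """Return a copy of `declared` with capabilities downgraded where a probe failed."""
    effective = dict(declared)

    if declared.get("parallel_tools") and not probed.get("parallel_tools_actual", True):
        warnings.warn("profile claims parallel tool calls but the probe saw none; disabling")
        effective["parallel_tools"] = False

    if declared.get("thinking_mode") and not probed.get("thinking_works", True):
        warnings.warn("profile claims thinking mode but no reasoning output was returned; disabling")
        effective["thinking_mode"] = False

    if not probed.get("long_context_100k", True):
        # Recall failed at 100K tokens, so fall back to an assumed conservative limit.
        capped = min(declared.get("reliable_context", fallback_context), fallback_context)
        warnings.warn(f"long-context recall failed at 100K; capping reliable_context at {capped}")
        effective["reliable_context"] = capped

    return effective
```

The adjustment only ever downgrades. A probe that unexpectedly succeeds does not enable a feature the profile left off, so the yaml remains the upper bound on what the agent will try.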

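#### Principle 6: Versioned prompts with A/B switching

```
prompts/system/
├── coding_v1.md       # detailed version for older models
├── coding_v2.md       # trimmed version for newer models
└── coding_active.md → coding_v2.md
```

When the model is upgraded:
1. Write a new prompt version (leaner, trusting the model more)
2. Compare v1 vs v2 on the eval suite
3. Let the data decide, then switch the active symlink

#### Principle 7: Eval Suite as the basis for model-upgrade decisions

**This is the most important one.** Without evals, you can only upgrade models "by feel".

See the next section for details.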
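Switching the active symlink in step 3 can be done atomically, so a running agent never sees a missing prompt file. The sketch below is a minimal illustration for POSIX systems; the function name `activate_prompt` and the temporary link name are assumptions.

```python
import os
from pathlib import Path


def activate_prompt(system_dir: Path, name: str, version: str) -> Path:
    """Point <name>_active.md at <name>_<version>.md via an atomic rename."""
    target = f"{name}_{version}.md"
    if not (system_dir / target).is_file():
        raise FileNotFoundError(system_dir / target)

    active = system_dir / f"{name}_active.md"
    tmp = system_dir / f".{name}_active.md.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clean up a leftover from an interrupted switch

    os.symlink(target, tmp)  # relative link, so the prompts/ directory can be moved
    os.replace(tmp, active)  # rename over the old link; atomic on POSIX
    return active
```

Usage would be along the lines of `activate_prompt(Path("prompts/system"), "coding", "v2")` after the eval comparison, and the same call with `"v1"` rolls back.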

### 6.3 Eval Suite framework

#### 6.3.1 Directory layout

```
evals/
├── coding/
│   ├── fix_import_bug/
│   │   ├── input/           # input files
│   │   │   └── main.py
│   │   ├── prompt.txt       # instruction given to the agent
│   │   ├── expected/        # expected output
│   │   │   └── main.py
│   │   └── rubric.yaml      # scoring criteria
│   ├── implement_function/
│   └── refactor/
├── ppt/
│   ├── meeting_to_slides/
│   │   ├── input/
│   │   │   └── notes.md
│   │   ├── prompt.txt
│   │   └── rubric.yaml      # subjective scoring (LLM-as-judge)
│   └── ...
├── proposal/
│   ├── write_intro_section/
│   ├── search_and_cite/
│   └── ...
└── runner.py                 # runner
```

#### 6.3.2 Rubric examples

**Objective scoring** (coding tasks):
```yaml
# evals/coding/fix_import_bug/rubric.yaml
type: deterministic
checks:
  - type: file_diff
    path: main.py
    expected_path: expected/main.py
  - type: run_command
    command: python -c "import main"
    expect_exit_code: 0
  - type: run_tests
    command: pytest tests/
```

**Subjective scoring** (ppt/proposal tasks):
```yaml
# evals/ppt/meeting_to_slides/rubric.yaml
type: llm_judge
judge_model: claude-opus-4-7  # use a strong model as the judge
criteria:
  - "Does the slide count match the requirement (5 slides)?"
  - "Does each slide have at most 5 bullets?"
  - "Is the information density reasonable?"
  - "Is there a chart (when the data has at least 3 points)?"
score_threshold: 7  # out of 10
```
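For the `deterministic` rubric type, grading can be a plain pass fraction over the listed checks. The sketch below is one possible interpretation, not the author's grader: it treats `file_diff` as an exact text comparison, treats `run_tests` like `run_command` with an expected exit code of 0, and takes the already-parsed `checks` list as input. The function name and the 120-second timeout are assumptions.

```python
import shlex
import subprocess
from pathlib import Path


def grade_deterministic(checks: list, workspace: Path, case_dir: Path) -> float:
    """Return the fraction of rubric checks that pass (0.0 to 1.0)."""
    passed = 0
    for check in checks:
        kind = check["type"]
        if kind == "file_diff":
            # Exact comparison between the agent's file and the expected file.
            actual = workspace / check["path"]
            expected = case_dir / check["expected_path"]
            ok = actual.is_file() and actual.read_text() == expected.read_text()
        elif kind in ("run_command", "run_tests"):
            proc = subprocess.run(
                shlex.split(check["command"]),
                cwd=workspace,
                capture_output=True,
                timeout=120,
            )
            ok = proc.returncode == check.get("expect_exit_code", 0)
        else:
            ok = False  # unknown check types count as failures
        passed += int(ok)
    return passed / len(checks) if checks else 0.0
```

Exact file equality is a strict criterion for coding tasks, since a correct fix can differ textually from `expected/`; weighting the command and test checks more heavily would be a natural refinement.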

#### 6.3.3 Runner

```python
# evals/runner.py
import argparse


def run_eval_suite(model_id: str, suite: str = "all"):
    results = []
    for case_dir in find_cases(suite):
        # Start a clean agent instance
        agent = build_agent(model=model_id, workspace=tmp_workspace())

        # Run the case
        prompt = (case_dir / "prompt.txt").read_text()
        result = agent.run(prompt)

        # Score it
        rubric = load_rubric(case_dir / "rubric.yaml")
        score = grade(rubric, agent.workspace, result)

        results.append({
            "case": case_dir.name,
            "score": score,
            "tokens": agent.token_counter.total,
            "cost": agent.token_counter.cost_usd,
            "duration": agent.duration_seconds
        })

    return summarize(results)

if __name__ == "__main__":
    # Run this on every model upgrade, once per candidate model, e.g.
    #   python evals/runner.py --model deepseek-v4-flash
    #   python evals/runner.py --model deepseek-v4-pro
    # then compare which one gives the best value for money
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--suite", default="all")
    args = parser.parse_args()
    print(run_eval_suite(args.model, args.suite))
```

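#### 6.3.4 What evals are really for

On every model upgrade, you can answer these questions with data:

> Q1: V5-Flash is out. Is it worth upgrading?
> A: Run the eval suite and compare score and cost for V4-Flash vs V5-Flash.

> Q2: Claude Opus 5.0 is out. Should the main model change?
> A: Run the evals. If the score improves by less than 10% but cost goes up 10x, stay on DeepSeek.

> Q3: After changing a prompt, did things get better or worse?
> A: Run the evals.

**Without an eval suite, your "upgrade" rests entirely on imagination.**

### 6.4 In practice: model upgrade checklist

When V5, Opus 5, or GPT-6 arrives, follow this process:

```markdown
## Model upgrade checklist

- [ ] 1. Write the new model's profile yaml (5 minutes, starting from _template)
- [ ] 2. Run the capability probe to verify the yaml (10 minutes)
- [ ] 3. Run the full eval suite against the new model (30 minutes, depending on the number of tasks)
- [ ] 4. Compare score / cost / latency and decide whether to upgrade
- [ ] 5. If upgrading:
  - [ ] Adjust default_model in the config
  - [ ] Check whether existing prompts can be trimmed (strong models need less scaffolding)
  - [ ] Run the evals again as a regression pass
- [ ] 6. Upgrade individual modes as needed (for example, move only proposal_final to the new Pro)
```

The whole process **requires no change to the agent's core code**.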

---

## 7. Key Engineering Details

### 7.1 Task state (`tasks/<task_id>/state.json`)

```json
{
  "task_id": "proposal_20260102_1430",
  "mode": "proposal",
  "description": "NSFC Young Scientists Fund: LLM agents in medical consultation",
  "status": "active",
  "model_used": "deepseek-v4-pro",
  "reasoning_effort": "max",
  "created_at": 1767335400,
  "tokens_used": {"prompt": 145000, "completion": 38000},
  "cost_usd": 0.42
}
```
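Since the constraint is that the machine may be shut down at any moment, `state.json` should never be left half-written. A minimal sketch of atomic save and load follows; the function names are illustrative, and the write-to-temp-then-rename pattern is a common technique rather than something this design specifies.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(task_dir: Path, state: dict) -> None:
    """Write state.json atomically: write a temp file, then rename over the old one."""
    task_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=task_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk before renaming
        os.replace(tmp, task_dir / "state.json")
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(task_dir: Path):
    """Return the saved state dict, or None if the task has no state yet."""
    path = task_dir / "state.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

The temp file is created in the task directory itself so that the final rename stays on one filesystem, which is what makes `os.replace` atomic.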

### 7.2 Interruption recovery / cost control / safety constraints

(Carried over from the v1 design with no major changes)

---

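## 8. Implementation Roadmap

### Phase 1: Minimal usable skeleton (2 days)

- [x] `core/llm.py` + first cut of Model Profile
- [x] `core/loop.py` - main loop
- [x] `core/session.py`
- [x] `tools/base.py` + `tools/fs.py` + `tools/shell.py`
- [x] `cli.py` - basic REPL
- [x] `config/agent.yaml` + `config/models/deepseek_v4.yaml`

**Acceptance**: `python cli.py chat` lets V4-Flash fix a simple Python bug.

### Phase 2: Skill system (standard format) + three skills (2 days)

- [x] `tools/skill_tool.py` (LoadSkill)
- [x] Three skill directories, aligned with the Anthropic format
- [x] Task mode routing

**Acceptance**: all three modes can be entered and progressive disclosure works.

### Phase 3: Hybrid paradigm (1-2 days)

- [x] `tools/run_python.py` - subprocess sandbox version
- [x] PPT/Word generated through run_python (no more high-level API wrappers)
- [x] PDF / literature search scripts placed in skills/proposal/scripts/

**Acceptance**: produces .pptx and .docx, and literature search returns real results.

### Phase 4: Evolvability features (1-2 days) ⭐ new

- [x] `core/capabilities.py` - Model Profile loading
- [x] Capability Probing at startup
- [x] Versioned prompts/ directory layout
- [x] Config hot reload

**Acceptance**: switching between V4-Flash and V4-Pro needs only a config change, no code change.

### Phase 5: Eval Suite (2 days) ⭐ new

- [x] `evals/runner.py`
- [x] 3-5 test cases per task type
- [x] LLM-as-judge scoring
- [x] Report output (score / cost / latency)

**Acceptance**: `python evals/runner.py --model deepseek-v4-flash` runs every task and produces a report.

### Phase 6: Long-task engineering (2-3 days)

- [x] `core/context.py` - three-tier compression (as a fallback)
- [x] `core/memory.py` - two-tier memory
- [x] Task recovery mechanism

**Acceptance**: a complete proposal can be written without crashing and resumed after an interruption.

### Phase 7: Polish (ongoing)

- Refine the two-tier memory system
- More skills
- Web UI (optional)
- Docker sandbox (replacing subprocess)
- More model profiles (Claude / GPT / Kimi, etc.)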

---

## 9. Technology Stack

```
# requirements.txt (core)

# LLM
litellm>=1.50.0
tiktoken>=0.7.0

# Documents (used by run_python)
python-pptx>=0.6.21
python-docx>=1.1.0
pypdf>=3.17.0
pdfplumber>=0.10.0
matplotlib>=3.8.0
pandas>=2.0.0

# Literature
arxiv>=2.1.0
requests>=2.31.0

# CLI / config
click>=8.1.0
rich>=13.7.0
pydantic>=2.5.0
pyyaml>=6.0
python-frontmatter>=1.0.0

# Eval
deepdiff>=6.0      # file diff comparison

# Development
pytest>=7.4.0
ruff>=0.1.0
```

---

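## 10. Risks and Trade-offs

### 10.1 Known risks

| Risk | Mitigation |
|-----|------|
| run_python sandbox security (subprocess is not strong isolation) | Restrict the working directory + filter environment variables; move to Docker later |
| V4 still trails Claude on some complex tasks | The eval suite helps decide; fallback mechanism |
| Skill descriptions not good enough → inaccurate triggering | Refine descriptions with V4-Pro; measure trigger rate in evals |
| Long-context degradation | Capability probe measures reliable_context; do not rely on advertised values |
| A prompt gets changed once and then nobody dares touch it | Versioning + evals back every change with data |

### 10.2 Rationale for the choices

**Why adopt the Anthropic Skill standard instead of inventing one**:
- The industry standard is established and cross-platform compatible
- Ready-made resources from the Anthropic skills repo can be used directly
- Skills do not need to change if the underlying SDK is swapped later

**Why the Hybrid paradigm instead of a pure CodeAgent**:
- DeepSeek V4 is already stable enough at JSON tool calling
- Lower sandbox cost (code runs only when needed)
- Compatible with thinking mode (a pure CodeAgent does not fit thinking mode well)

**Why invest in an Eval Suite**:
- Without evals, model-upgrade decisions can only be made by feel
- One-time investment, reused for a long time
- The cost of running the evals (~$10) is far below the cost of picking the wrong model for lack of them

**Why no sub-agents**:
- The user explicitly does not need them
- Adding them makes state-management complexity explode
- A single agent + skills already covers about 95% of scenarios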

---

## 11. Differences from the v1 Design

| Dimension | v1 | v2 |
|-----|----|----|
| Default model | deepseek-chat (V3.2) | deepseek-v4-flash |
| Context thresholds | 65K, 0.5/0.7/0.85 | 256K, 0.6/0.85/0.95 |
| Tool paradigm | Pure JSON tool call | **Hybrid: JSON + run_python** |
| Skill format | Home-grown | **Anthropic open standard** |
| Skill description style | "Step 1/2/3" procedure | **WHY+WHAT style** |
| Model configuration | Scattered across config | **Model Profiles** |
| Upgrade mechanism | None | **Capability Probing + Eval Suite** |
| Prompt management | Scattered across code | **Versioned + active symlink** |
| Tool granularity | Some high-level wrappers (e.g. make_pptx) | **Atomic (run_python calling python-pptx)** |
| Code size | 1100-1300 lines | 1300-1600 lines |

---

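## 12. Next Steps

Once the design is confirmed, start implementation from Phase 1:

1. Set up the project skeleton
2. Write `core/llm.py` + `core/capabilities.py`
3. Write `config/models/deepseek_v4.yaml`
4. Get a minimal REPL running
5. Run one simple task with V4-Flash

Phase 1 is expected to take 2 days, and the result is usable as soon as it runs.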

---

## Appendix A: Anthropic Skill Standard References

Official resources:
- [Agent Skills documentation](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview)
- [Anthropic skills repository](https://github.com/anthropics/skills) (open source, can be copied directly)
- Ready-to-use skills: pdf-processing, xlsx, docx, pptx, claude-api, and others

Industry adoption:
- Claude Code: `~/.claude/skills/`
- OpenAI Codex CLI: `.agents/skills/`
- Google Gemini CLI: `.gemini/skills/`
- GitHub Copilot: followed on the same day
- **The format is fully unified**; only the paths differ

## Appendix B: Key Facts about DeepSeek V4 (2026-04-24)

Models:
- V4-Pro: 1.6T total / 49B active parameters, 1M context
- V4-Flash: 284B total / 13B active parameters, 1M context
- Three reasoning modes: non-thinking / thinking / thinking-max

Agent capabilities (V4-Pro-Max):
- SWE-Bench Verified: 80.6% (on par with Claude Opus 4.6 at 80.8%)
- Terminal-Bench 2.0: 67.9% (ahead of Claude 4.6 at 64.3%)
- MCPAtlas: 73.6% (on par with Claude 4.6 at 73.8%)

Pricing:
- Input: about $0.145 / M tokens (1/7 of Claude Opus)
- Output: about $1.74 / M tokens (1/6 of Claude Opus)

Migration:
- `deepseek-chat` / `deepseek-reasoner` go offline after 2026-07-24
- Migration to `deepseek-v4-flash` / `deepseek-v4-pro` is required

## Appendix C: Sources of Inspiration for the Evolvability Design

- **Anthropic, "Equipping agents for the real world"**: the design philosophy behind progressive skill disclosure
- **CodeAct paper (Wang et al. 2024)**: the code-as-action paradigm
- **Lessons from early LangChain**: a counter-example of excessive scaffolding
- **Karpathy's nanoGPT philosophy**: "readability over feature completeness"

---

*Last updated: 2026-05-02*
*v2 changes: DeepSeek V4 / Anthropic Skill standard / Hybrid paradigm / evolvability design*