Compare commits

2 Commits

Author SHA1 Message Date
caoqianming 1a950dedb5 Stage C Step 5: main.py sandbox check + lifespan fs quota WARN
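- main.py sandbox check subcommand: 5 independent probes + aggregated exit code
  ① docker daemon reachable ② zcbot-sandbox:latest image exists
  ③ zcbot-sandbox-net network exists (warn, not err) ④ zcbot uid in the image
  matches the host uid ⑤ the fs holding workspace/users supports quota
- core/sandbox/check.py: detect_fs_quota(path) -> (level, msg) extracted so the
  lifespan and the CLI share it; recognizes xfs+prjquota/ext4+project/zfs/btrfs/tmpfs/other
- web/app.py lifespan calls detect_fs_quota when the docker backend is enabled and
  prints a WARN to stdout (does not block startup; the application-level periodic scan still applies)
- err vs warn boundary: err = root causes that make the docker backend fail fast (daemon/image/uid),
  warn = does not block startup but must be cleared before opening to external users (network missing/fs without quota)
- run_sandbox_check uses a module-level getattr instead of a frozen CHECKS tuple, so
  unittest patches of core.sandbox.check.check_xxx take effect
- tests/test_sandbox_check.py: 19 tests cover each branch + exit code aggregation;
  unittest discover 31/31 PASS
- RUN.md adds a "pre-deployment reconciliation" subsection + rewrites "quota hardening" (fs state → action mapping table +
  4-step xfs upgrade) + 3 troubleshooting rows; DESIGN unchanged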

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:41:16 +08:00
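caoqianming dfac0acfa6 Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan reaper
- core/executor_docker.py adds DockerExecutor: composes HostExecutor+SandboxPool,
  shell/run_python go through docker exec (setsid + --user 1000:1000 + --workdir),
  other tools pass straight through to the host (§7.5 #6 trust-domain split)
- run_python tmp .py lands in <user_root>/.zcbot_tmp/<task_id>/ (dotfile, filtered
  naturally by /v1/files), which maps to /workspace/.zcbot_tmp/... in the container; unlinked after the run
- ZCBOT_SANDBOX_BACKEND=host|docker env selects the backend, default host (zero change
  for Windows dogfood); on the docker path an uninitialized pool → fail-fast, no silent degradation
- web/app.py lifespan: at docker backend startup, init_pool + shutdown_all to clear orphans +
  60s background reaper (run_in_executor calls the sync reap_idle); on shutdown, cancel + final cleanup
- pool.py clears Step 2 debt along the way: asyncio.Lock → threading.Lock, ensure becomes synchronous
  (the main caller is tool calls on the BG thread; an ephemeral loop makes asyncio.Lock ineffective across loops)
- Cancel limitation accepted: Popen.kill() only kills the docker CLI client; processes inside the container
  are left to the 5-minute idle reaper; upgrading to the PGID protocol (§7.5 #3) waits for user feedback
- tests/test_executor_docker.py: 11 tests cover the key paths (host passthrough/argv shape/
  tmp cleanup/timeout/cancel/unknown tool/enable_run_python=False)
- DESIGN.md unchanged (implemented purely per the existing §7.5 #5 #6 protocol)
- RUN.md adds the ZCBOT_SANDBOX_BACKEND env section + prerequisites for switching to docker + integration verification path
- unittest discover 12/12 PASS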

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 16:13:16 +08:00
11 changed files with 1229 additions and 35 deletions
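
The Step 5 commit message says `run_sandbox_check` resolves its probes at call time rather than from a frozen `CHECKS` tuple, so that patched probes take effect, and that the exit code is 1 only when some probe reports `err`. A minimal sketch of that idea follows. The probe bodies, the `CHECK_NAMES` tuple, and the optional `namespace` parameter are assumptions made for illustration (the commit describes a module-level `getattr`; `globals()` stands in for it here), not the code in `core/sandbox/check.py`.

```python
from typing import Callable, Dict, Optional, Tuple

CheckResult = Tuple[str, str]  # (level, message); level is "ok", "warn" or "err"


def check_daemon() -> CheckResult:
    # Hypothetical placeholder; a real probe would shell out to `docker version`.
    return ("ok", "docker daemon reachable")


def check_network() -> CheckResult:
    # Hypothetical placeholder; a missing network is only a warning.
    return ("warn", "zcbot-sandbox-net missing (lifespan creates it)")


# Names, not function objects: the functions are looked up when the check runs.
CHECK_NAMES = ("check_daemon", "check_network")


def run_sandbox_check(
    namespace: Optional[Dict[str, Callable[[], CheckResult]]] = None,
) -> int:
    """Run every probe and return 1 if any reports "err", else 0."""
    ns = globals() if namespace is None else namespace
    results = [ns[name]() for name in CHECK_NAMES]  # resolved by name at call time
    for level, msg in results:
        print(f"[{level}] {msg}")
    return 1 if any(level == "err" for level, _ in results) else 0
```

Because the lookup happens inside the call, replacing `check_daemon` (for example with `unittest.mock.patch`) changes what `run_sandbox_check` executes; a tuple of function objects captured at import time would keep pointing at the originals.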


@ -2,7 +2,7 @@
> Companion to `DESIGN.md`. This file records only phase status, decision deviations, file volume, and next steps. Each entry is 1-2 sentences: what was done + the key judgment; for details see `git log` / `git diff` / `DESIGN §7.9`
-Last updated: 2026-05-26 (Stage C Step 2: Docker per-user container pool + Dockerfile / init.sh / network ensure; code ready, not yet integrated into AgentLoop)
+Last updated: 2026-05-26 (Stage C Step 5: `main.py sandbox check` pre-deployment reconciliation + lifespan fs quota WARN + RUN.md quota-hardening section completed)
---
@ -15,7 +15,7 @@
| 5 | Eval Suite | ⏸ Not doing | dogfooding substitutes for it; probe covers health checks |
| 6 | Long-task engineering | 🟡 | task + recovery ✅; two-layer memory ✅; context compression not done |
| 7 | Polish | ❌ | Docker sandbox / more skills |
-| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C (Executor + docker sandbox) pending: hard prereq for opening to external users; until it is done, dogfood + trusted-colleague whitelist only; DoD per the 6 items of the DESIGN §7.5 rollout checklist**. |
+| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C Step 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop, `ZCBOT_SANDBOX_BACKEND=docker` switches to containers) + Step 5 pre-deployment reconciliation ✅ (`main.py sandbox check` + lifespan fs quota WARN)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still needs the egress proxy + OS-level xfs project quota hardening (§7.5 rollout checklist #2 #4)**. |
---
@ -23,6 +23,8 @@
### 2026-05-26
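+- **Stage C Step 5: `main.py sandbox check` pre-deployment reconciliation + lifespan fs quota WARN**: opening to external users carries the mandatory §7.5 #4 requirement (xfs prjquota / ext4 project quota / zfs dataset quota; otherwise "filling the shared fs between scans drags down the whole node"), and when a docker backend startup prerequisite (daemon / image / HOST_UID alignment) is wrong, the lifespan fails fast and the traceback is expensive to debug, so the "ops mental checklist" is distilled into an executable command. `main.py sandbox check` runs 5 independent probes: ① docker daemon reachable (CLI present + `docker version` rc=0) ② `zcbot-sandbox:latest` image exists ③ `zcbot-sandbox-net` network exists (missing is OK because the lifespan ensures it automatically; this item is warn, not err) ④ the zcbot uid inside the image matches the host uid (`docker run --rm --entrypoint id` fetches the image uid and compares it with `os.getuid()`; skipped automatically on Windows) ⑤ the fs holding workspace/users/ supports quota (parses `findmnt --target ... -no FSTYPE,OPTIONS`; recognizes xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / other). `detect_fs_quota(path) -> (level, msg)` is extracted for reuse by the lifespan: `web/app.py` runs it once at docker backend startup and prints a WARN to stdout (non-blocking); the application-level periodic scan still applies. **err vs warn boundary**: err = root causes that make docker backend startup fail fast (daemon / image / HOST_UID, exit 1); warn = does not block startup but must be cleared before opening to external users (network missing / fs without quota, exit 0). `tests/test_sandbox_check.py` has 19 tests covering each branch + exit code aggregation, mocking subprocess and sys.platform (`run_sandbox_check` uses a module-level lookup instead of a frozen `CHECKS` tuple so that unittest patches take effect); **full unittest discover 31/31 PASS**. RUN.md adds a "pre-deployment reconciliation" subsection (meaning of the 5 `sandbox check` items) + rewrites the "quota hardening" section (fs type → action mapping table + 4-step xfs upgrade) + 3 troubleshooting rows (sandbox init failed / fs quota warn / image not found). Rejected: (a) lifespan probe failure → fail-fast instead of WARN: at Step 5 the application-level periodic scan already exists, and OS-level quota is a hard requirement for external opening, not for dogfood, so fail-fast would block dogfood startup; (b) giving sandbox check a `quota-set` subcommand that calls `xfs_quota` directly: the integer `<pid>` ↔ user_uuid mapping needs a tracking table, and sudo + /etc/projects changes are ops actions, so Step 5 only lands the RUN.md instructions + command list and the real work happens just before external opening; (c) probing egress proxy status in sandbox check: Step 4 is not implemented, and a placeholder would suggest it has already landed. `DESIGN.md` unchanged (implemented purely per the existing §7.5 #4 protocol); `RUN.md` updated as above.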
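+- **Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan (`ZCBOT_SANDBOX_BACKEND=host|docker` env selects the backend)**: `DockerExecutor` in `core/executor_docker.py` composes `HostExecutor` + `SandboxPool`; `call_tool` dispatches by the §7.5 #6 trust domain: `shell` / `run_python` → `pool.ensure(user_id)` returns the container name, then `docker exec --user 1000:1000 --workdir /workspace/<wd_name> -e PYTHONIOENCODING=utf-8 setsid bash -c <cmd>` / `python <script>` (`setsid` wraps the command in its own process group; enabling the §7.5 #3 PGID kill protocol is left to Step 3b); the other tools (read/write/edit/glob/grep/load_skill/web_*/seedream/seedance) pass straight through to the host. **The run_python tmp .py lands host-side at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`**, which corresponds to `/workspace/.zcbot_tmp/<task_id>/<rand>.py` inside the container (visible automatically through the bind mount); the leading dot lets the `/v1/files` API filter it naturally (`web/app.py:169` `startswith(".")` already blocks it). **Cancel limitation accepted**: Popen.kill() kills the docker CLI client, and the server-side process inside the container does not terminate as a result (docker exec is designed that way); the first version relies on the 5-minute idle reaper / the `rm -f` path of the next `ensure` as a backstop, and the upgrade trigger is "user reports a cancel but CPU is still burning". `core/sandbox/__init__.py` exposes the module-level singleton `init_pool` / `get_pool`; `agent_builder._resolve_executor` selects the backend by env, and on the docker path an uninitialized pool → fail-fast (no silent fallback to host, to prevent the misjudgment "thinking there is a sandbox while actually running bare"); `web/app.py` lifespan startup hook: `init_pool(workspace/users)` + `shutdown_all` to clear orphans from the previous process + `asyncio.create_task(_reaper)` (every 60s `run_in_executor(pool.reap_idle)`); shutdown hook: cancel the reaper + `shutdown_all`. **pool.py debt cleared along the way**: `asyncio.Lock` → `threading.Lock` (the main caller is synchronous tool calls on the web BG thread, and the ephemeral loop created by each `asyncio.run` bypasses an asyncio.Lock's protection; the reaper becomes an async wrapper around `loop.run_in_executor(pool.reap_idle)`, and an all-sync pool API is more direct). **Tests**: `tests/test_executor_docker.py` has 11 tests covering host passthrough / shell argv shape / run_python tmp file cleanup / timeout / cancel / unknown tool / caps.enable_run_python=False; `unittest discover -s tests` **12/12 PASS** (the original 1 test unchanged, plus the 11 new ones). **Zero change for Windows dogfood**: the default is `ZCBOT_SANDBOX_BACKEND=host` and docker is untouched locally; the docker path only works on the Ubuntu deployment machine, and the real-container smoke test still runs there with the 5 commands in the RUN.md "Sandbox (Stage C, Ubuntu)" section. `DESIGN.md` **unchanged** (implemented purely per the existing §7.5 #5 #6 protocol); `RUN.md` adds the `ZCBOT_SANDBOX_BACKEND` env description + the startup prerequisites for switching to the docker backend. Rejected: (a) having DockerExecutor wrap `asyncio.run(pool.ensure)` in an ephemeral loop: an asyncio.Lock is not shared across loops, so serialization is lost, and each tool call adds ~5ms of loop create/destroy noise; making the pool synchronous costs less; (b) putting the `run_python` tmp .py inside the working directory: it pollutes the user's view, interferes when a SKILL teaches the model to "list the working directory with glob", and mixes crash leftovers with deliverables (whether to record this in the §7.9 trade-off log will be considered the next time the same problem comes up); (c) a separate host-side bind mount `<workspace>/.sandbox_tmp/<uid>/` mounted as `/tmp_scripts` in the container: one more mount raises complexity, and keeping the single-bind-mount protocol is more direct; (d) degrading to host when the docker backend fails: a missing sandbox means the security model has collapsed, and fail-fast matters more than "appearing to run", per the §7.5 hard rule "any missing item means the deployment is incomplete".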
- **Stage C Step 2: Docker per-user container + iptables blocklist (groundwork for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container exits, matching the §7.5 #1 hard rule "any missing item means Stage C is incomplete"; 6 IPv4 red-line ranges + ::1 IPv6 loopback with tolerated degradation + per-IP DROP from the `ZCBOT_PG_IPS` env; `exec sleep infinity` waits for `docker exec` to come in). `core/sandbox/network.py` has a single function `ensure_network()`: `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins this network too); the `SandboxPool` class in `core/sandbox/pool.py` holds a per-user `asyncio.Lock` + an in-memory `_last_active` dict. The ensure path probes with inspect → running: return directly / exists-but-stopped: `rm -f` and restart (so iptables is re-applied) / missing: `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk cleanup + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything carrying the label (used at app startup to clear orphans from the previous process). Containers are named `zcbot-sandbox-<user_id>` (the standard UUID string with dashes, visually aligned with the mount path `<workspace>/users/<user_id>/`, so `docker ps | grep zcbot-sandbox-` shows active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: the implementation-level counterpart of §7.5 #5 "the interface shape must not leak Docker assumptions"; subprocess behavior is transparent, there are zero new dependencies, and `docker ps` reconciles against reality; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleared by `shutdown_all` as a backstop (called in the lifespan startup hook); (c) **the `--internal` network is in effect from Step 2**: the iptables OUTPUT rules serve as defense-in-depth (the network layer already blocks outbound, and iptables still adds its rules per the protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, and an iptables ACCEPT exception + a default DROP implement "default deny + proxy only"; (d) **the NET_ADMIN cap is reserved for PID 1 root to run iptables**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` takes no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root + an empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so the root path never opens (guarded by code review). **Step 2 explicitly excludes**: ① AgentLoop integration (`agent_builder.py` untouched; the pool is an isolated module and is only plugged in at Step 3) ② moving shell/run_python into the container ③ egress proxy (Step 4) ④ the background reaper task (added when Step 3 wires into the web lifespan). **Verification**: `from core.sandbox import ...` imports fully + the ctor passes; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` fields are correct; `unittest discover` 1/1 PASS. Real-container verification runs on Ubuntu (the RUN.md "Sandbox (Stage C, Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (implemented purely per the existing §7.5 #1 #3 protocol); `RUN.md` adds the "Sandbox (Stage C, Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 troubleshooting rows (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + reverse lookup by label: every exec costs an extra `docker ps --filter` round-trip, readability suffers, and the privacy gain is 0; (b) per-task containers: DESIGN §7.5 already locked the per-user shared mental model (multiple tasks of one user share materials), not reopened; (c) using a docker `init container` pattern for iptables: Docker has no native support (that is k8s), compose v2 synchronization adds more complexity, and the NET_ADMIN + non-root exec pattern is more direct; (d) wiring into AgentLoop right at Step 2: it could not be dogfooded (no docker on local Windows) and would pollute the host path; the pool is committed in isolation and wired in at Step 3.
- **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (§7.5 #5 landed)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` dispatches internally to the matching `Tool.execute` and folds the three error kinds (unknown / TypeError / raised exception) into `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes an `executor` instead of a `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)` and drops the three former error emits (unknown/TypeError/Exception are now absorbed by the executor as ToolResult, leaving one emit). `agent_builder.py` wraps the finished tools dict in `HostExecutor(tools)` and passes it to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so switching to the docker backend at Step 3 needs zero changes in `AgentLoop`; only `agent_builder.py` swaps `HostExecutor` → `DockerExecutor(host_tools=..., docker_tools={shell, run_python})`. **Zero behavior change**: sanity import passes, `unittest discover -s tests` 1/1 PASS. `DESIGN.md` unchanged (implemented purely per the existing §7.5 #5 protocol, no architectural drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env is added when the Step 3 docker backend arrives). Rejected: (a) skipping the Executor and writing `if backend=='docker'` directly in `shell.py/run_python.py`: violates §7.5 #5, and a future switch to gVisor/Firecracker would scatter changes across the tool layer; (b) an Executor built on an `exec(cmd, ctx)` primitive rather than a `call_tool(name, args, ctx)` dispatcher: it does not match the DESIGN signature, and host tools (read/web_*/seedream) do not have "command" semantics; (c) using a `cancel_check` callable instead of rebuilding ExecCtx: cancel_check is currently assigned by a setter after build, so a cached ctx would point at a stale value, and building an ExecCtx per call is cheap because it is a dataclass.
- **REVISIONS.md revision-log mechanism (covers the three artifact-producing skills proposal/patent/ppt)**: `<task_dir>/REVISIONS.md` serves as a compact changelog of the artifact's iteration. The task conversation history is a coarse stream (finding last week's change among 50 messages means scrolling), while REVISIONS is the list of substantive decisions distilled jointly by the user and the LLM (5 lines are enough to reconstruct "why was this chapter written this way last week"). It complements the spec: **spec = constitution (sets the tone once), REVISIONS = implementation log (accumulates at every sticking point)**. Each of the three SKILL.md files adds (a) a drafting step "append one line after the user confirms a substantive change" + (b) a standalone "## 修订日志" (revision log) subsection (when to record / when not to record table + format convention + example + operation). The criteria for a "substantive change" are tailored per domain: proposal = technical route / assessment metrics / innovation points / topic decomposition / key citations / budget structure; patent = distinguishing technical features / key parameters / formulas / embodiments / sections; ppt = layout / primary color / pages / icons / copy highlights. Common principles: do not record the first draft / typo-level tweaks / the model's own back-and-forth edits; when in doubt, lean toward not recording, so the log does not become a running diary. The chosen format is a **single-line bullet, appended in reverse chronological order** (time first, then file:section location, then what changed and why), inserted with edit right after the header comment (not appended at the end, so the newest entry is read first). Rejected: (a) a soft constraint in the system prompt: it would impose an irrelevant constraint on non-artifact skills such as coding/research/documents/imagegen/videogen; (b) a new `record_revision` tool: during development a direct edit by the LLM is enough, a tool adds call overhead to every small change, and it can be promoted to a tool later if the LLM turns out to miss entries often; (c) splitting into multiple files per artifact (`<topic>.revisions.md`): a single file reads well and keeps one timeline across artifacts. `DESIGN.md` unchanged (no architectural change); `RUN.md` unchanged (no CLI/env change).
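
The Step 3 entry above describes a lifespan reaper: an asyncio task that every 60 s hands the synchronous `pool.reap_idle` to `run_in_executor`, with the pool guarded by a `threading.Lock`. The sketch below shows that shape under stated assumptions. `FakePool`, its `ttl` argument, and `demo` are hypothetical stand-ins for illustration (no containers are started); they are not the project's `SandboxPool`.

```python
import asyncio
import threading
import time
from typing import Dict, List


class FakePool:
    """Hypothetical stand-in for a pool with an all-sync, thread-safe API."""

    def __init__(self, ttl: float) -> None:
        self._lock = threading.Lock()  # usable from any thread or event loop
        self._last_active: Dict[str, float] = {}
        self._ttl = ttl

    def mark_active(self, user_id: str) -> None:
        with self._lock:
            self._last_active[user_id] = time.monotonic()

    def reap_idle(self) -> List[str]:
        """Drop entries idle for at least ttl seconds and return their ids."""
        now = time.monotonic()
        with self._lock:
            idle = [u for u, t in self._last_active.items() if now - t >= self._ttl]
            for user_id in idle:
                del self._last_active[user_id]
        return idle


async def reaper(pool: FakePool, interval: float) -> None:
    """Periodically run the blocking reap_idle off the event loop thread."""
    loop = asyncio.get_running_loop()
    while True:
        await asyncio.sleep(interval)
        await loop.run_in_executor(None, pool.reap_idle)


async def demo() -> int:
    pool = FakePool(ttl=0.0)  # ttl 0: every entry counts as idle immediately
    pool.mark_active("user-a")
    task = asyncio.create_task(reaper(pool, interval=0.01))
    await asyncio.sleep(0.1)  # let the reaper run a few rounds
    task.cancel()  # mirrors the shutdown hook cancelling the reaper
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(pool._last_active)
```

Keeping the pool synchronous means the same lock protects callers on worker threads and the reaper alike, which is the motivation the entry gives for replacing `asyncio.Lock`.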

95
RUN.md

@ -256,8 +256,14 @@ sudo journalctl -u zcbot -n 50 # 看新进程起没起干
## Sandbox (Stage C, Ubuntu)
> Must be completed before opening to external users. Can be skipped during the current dogfood + trusted-colleague whitelist stage: the default backend = host,
-> `shell` / `run_python` still go through subprocess (not isolated). After Step 3 integrates DockerExecutor, switch to
-> `ZCBOT_SANDBOX_BACKEND=docker` to enable it.
+> `shell` / `run_python` still go through subprocess (not isolated). Step 3 has integrated DockerExecutor:
+> `ZCBOT_SANDBOX_BACKEND=docker` switches to container execution; `host` (default) is kept for local Windows / colleague dogfood.
+>
+> Prerequisites for enabling the docker backend:
+> 1. The deployment machine has a docker daemon and the zcbot user is in the `docker` group
+> 2. The `zcbot-sandbox:latest` image is built (`HOST_UID/GID` aligned)
+> 3. `.env` has at least `ZCBOT_PG_IPS=<actual PG IP>` (§7.5 #1: PG gets its own block pass)
+> 4. A lifespan startup failure fails fast (`RuntimeError: sandbox init failed`); there is no silent fallback to host
### Image build
@ -284,6 +290,14 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
### Sandbox-related env (add to .env)
```
+# Backend selection (default host):
+#   host   = shell/run_python run as host subprocesses (local Windows / dogfood)
+#   docker = shell/run_python run via docker exec in the per-user container (deployment machine / external users)
+# ZCBOT_SANDBOX_BACKEND=docker
+# User for exec inside the container (default 1000:1000; keep in sync with the Dockerfile HOST_UID/HOST_GID build-args)
+# ZCBOT_SANDBOX_EXEC_USER=1000:1000
# Container image tag (default zcbot-sandbox:latest)
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
# Container runtime (runsc for gVisor, kata for Firecracker; default runc)
@ -295,10 +309,23 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
```
-### Verification (the part Step 2 can verify)
+### Verification
+After Step 3, integration verification is recommended (start web with the docker backend + send `shell` / `run_python` messages from the dev SPA):
+```bash
+# Start web with the docker backend (.env already sets PG_IPS / SANDBOX_BACKEND=docker)
+ZCBOT_SANDBOX_BACKEND=docker .venv/bin/python main.py web
+# After any shell / run_python message is triggered, the container should be up
+sudo -u zcbot docker ps --filter label=zcbot.product=sandbox
+# Expect zcbot-sandbox-<your-uid> with STATUS = Up ...
+# After 5 minutes without new messages the reaper removes it automatically
+```
+A test container can also be started directly to verify the hardening in isolation (no web process needed):
```bash
-# Start a test container (plain docker run, bypassing the pool; the pool is only used once Step 3 wires it in)
USER_ID=00000000-0000-0000-0000-000000000001
sudo -u zcbot docker run -d \
    --name zcbot-sandbox-$USER_ID \
@ -331,19 +358,60 @@ sudo -u zcbot docker rm -f zcbot-sandbox-$USER_ID
After Step 4 introduces the egress proxy, the full 5 red-team cases (metadata / loopback / cross-user / nohup
leftovers / 403 outside the allowlist) go into `tests/test_sandbox_redteam.py` and run automatically.
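-### Quota hardening (§7.5 #4, required before external opening)
-The application-level disk quota (introduced in Step 5) stops ordinary overuse, **but filling the shared fs between scans and dragging down the whole node**
-strictly requires **xfs / ext4 project quota or zfs dataset quota**. Before deploying to a dedicated server + opening to multiple tenants:
-```bash
-# Example (xfs project quota):
-sudo mount -o remount,prjquota /opt
-sudo xfs_quota -x -c "project -s -p /opt/zcbot/workspace/users/<uid> <pid>" /opt
-sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
-```
-The concrete scheme depends on the deployment fs (xfs recommended); skipping this step amounts to "soft quota + trusting users not to fill the disk".
+### Pre-deployment reconciliation
+Run once before switching to `ZCBOT_SANDBOX_BACKEND=docker`:
+```bash
+sudo -u zcbot .venv/bin/python main.py sandbox check
+```
+Output looks like `[ok] / [warn] / [err]` × 5 items + a summary `N/5 passed`; exit code 0 = can start / 1 = some err
+must be fixed. The 5 items are: ① docker daemon reachable ② `zcbot-sandbox:latest` image exists ③
+`zcbot-sandbox-net` network exists (it still runs if missing; the lifespan ensures it automatically) ④ the zcbot
+uid inside the image matches the host uid (a mismatch → every exec write to `/workspace` fails with EACCES) ⑤ the fs
+holding `workspace/users/` supports quota.
+At startup the lifespan also prints the item ⑤ WARN to stdout (`[startup] [warn] fs quota ...`);
+the application-level periodic scan still applies; **only before opening to external users must ⑤ be upgraded to OS-level quota**.
+### Quota hardening (§7.5 #4, required before external opening)
+The application-level disk quota stops ordinary overuse, **but filling the shared fs between scans and dragging down the whole node** strictly requires OS-level
+quota. Item ⑤ of `sandbox check` probes the current fs state:
+| Probe result | Meaning | Action |
+|---|---|---|
+| `fs quota: xfs with prjquota on ...` | ok; `xfs_quota -x` can add per-user quota directly | (none needed) |
+| `fs quota: ext4 with project quota on ...` | ok; `quota -P` is available | (none needed) |
+| `fs quota: zfs on ...` | ok; use `zfs set quota=` at the dataset level | (none needed) |
+| `fs quota: xfs ... NO prjquota mount option` | the fs supports it but it was not enabled at mount | see the xfs steps below |
+| `fs quota: ext4 ... NO project quota option` | same as above | `sudo tune2fs -O project,quota <dev>` + remount |
+| `fs quota: btrfs ...` | qgroup configuration is complex | for production, prefer a dedicated xfs partition, or validate `btrfs qgroup` yourself |
+| `fs quota: tmpfs/overlay/... ` | usually Docker-in-Docker or local dev | production must mount a dedicated partition |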
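
The mapping in the table above can be read as a small classifier over the two columns that `findmnt -no FSTYPE,OPTIONS` prints. The following is a rough sketch under stated assumptions, not the project's `detect_fs_quota`: the function names, the exact option names tested (`prjquota` / `pquota`), and the message wording are illustrative guesses, and the sketch takes the `findmnt` output as a string instead of running the command.

```python
from typing import Tuple


def parse_findmnt(line: str) -> Tuple[str, str]:
    """Split one line of `findmnt -no FSTYPE,OPTIONS` output into (fstype, options)."""
    parts = line.split(None, 1)  # columns are separated by arbitrary whitespace
    fstype = parts[0] if parts else ""
    options = parts[1].strip() if len(parts) > 1 else ""
    return fstype, options


def classify_fs_quota(fstype: str, options: str, path: str) -> Tuple[str, str]:
    """Return (level, msg); level is "ok" or "warn", never "err"."""
    opts = set(options.split(",")) if options else set()
    if fstype == "xfs":
        if opts & {"prjquota", "pquota"}:
            return ("ok", f"fs quota: xfs with prjquota on {path}")
        return ("warn", f"fs quota: xfs on {path} has NO prjquota mount option")
    if fstype == "ext4":
        # Assumption: project quota shows up as the prjquota mount option.
        if "prjquota" in opts:
            return ("ok", f"fs quota: ext4 with project quota on {path}")
        return ("warn", f"fs quota: ext4 on {path} has NO project quota option")
    if fstype == "zfs":
        return ("ok", f"fs quota: zfs on {path}")
    # btrfs, tmpfs, overlay and anything else: usable for dogfood, flagged for production.
    return ("warn", f"fs quota: {fstype} on {path}")
```

Returning `warn` rather than `err` for every non-quota case matches the stated boundary: an fs without quota does not block startup, it only has to be fixed before external opening.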
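+**xfs upgrade steps (recommended scheme)**:
+```bash
+# 1) Confirm which mount holds the workspace (assuming /opt is a dedicated xfs partition)
+findmnt --target /opt/zcbot/workspace
+# 2) Enable prjquota (also write it to /etc/fstab so it survives a reboot; XFS normally applies
+#    quota options only on a fresh mount, so if the remount has no effect, umount and mount again)
+sudo mount -o remount,prjquota /opt
+# 3) Add a project quota for one user (<pid> is a custom integer id, 1001 here; track its mapping to user_id in a table)
+echo "1001:/opt/zcbot/workspace/users/<user_uuid>" | sudo tee -a /etc/projects
+echo "zcbot_<user_uuid>:1001" | sudo tee -a /etc/projid
+sudo xfs_quota -x -c "project -s zcbot_<user_uuid>" /opt
+sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
+```
+The `<pid>` ↔ `user_uuid` mapping is maintained by hand (`/etc/projects` uses numeric ids, so zcbot needs a table to track them;
+**before the first external opening, add a `main.py sandbox quota-set --user-id <uuid> --gb 10` subcommand**
+that reads/writes /etc/projects + calls xfs_quota; that is a final pre-launch step after Step 4 / 5 and is not done now).
+Skipping this step amounts to "soft quota + trusting users not to fill the disk", which is enough for the dogfood + trusted-colleague whitelist stage,
+**but it is a hard prereq for opening to external users**.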
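
To make the hand-maintained `<pid>` ↔ `user_uuid` bookkeeping concrete, here is a small sketch that only formats the file entries and commands for one user; it does not touch `/etc/projects` or run `xfs_quota`. The function name `project_quota_plan` and its defaults are hypothetical, and the `id:path` / `name:id` line formats follow the colon-separated layout documented in xfs_quota(8). It is not the planned `quota-set` subcommand.

```python
from typing import List, Tuple


def project_quota_plan(
    project_id: int,
    user_uuid: str,
    gb: int,
    users_root: str = "/opt/zcbot/workspace/users",  # assumed layout from the steps above
    mount: str = "/opt",
) -> Tuple[str, str, List[str]]:
    """Return (/etc/projects line, /etc/projid line, xfs_quota commands) for one user."""
    name = f"zcbot_{user_uuid}"
    projects_line = f"{project_id}:{users_root}/{user_uuid}"  # numeric id -> directory tree
    projid_line = f"{name}:{project_id}"                      # project name -> numeric id
    commands = [
        f'xfs_quota -x -c "project -s {name}" {mount}',
        f'xfs_quota -x -c "limit -p bhard={gb}g {name}" {mount}',
    ]
    return projects_line, projid_line, commands
```

Generating the strings from one `(project_id, user_uuid)` pair keeps the two files and the commands consistent, which is the part most likely to drift when the mapping is edited by hand.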
---
@ -359,6 +427,9 @@ sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
| Directory still exists after deleting a task that used `--working-dir` | Two cases: ① the directory is non-empty (has user files): by design it is never rmtree'd, clean up manually with `rm -rf <dir>`; ② an external `--working-dir` (the DB stores an absolute path): not cleaned automatically, to avoid deleting the user's external project by mistake. Inside ROOT + no other task references the same working_dir + FS empty → already rmdir'd automatically on DELETE task |
| `touch /workspace/x` inside the sandbox container reports `Permission denied` | Container uid 1000 does not match the uid of the host `zcbot` user (the bind mount keeps the host owner). Rebuild the image with `docker build --build-arg HOST_UID=$(id -u zcbot)` |
| Sandbox container will not start after build, `docker logs` shows iptables errors | Missing NET_ADMIN cap (`--cap-add=NET_ADMIN` omitted) or an unsupported kernel (WSL2 / OpenVZ environments cannot run it). Ubuntu on bare metal / KVM works. Verify: `docker exec ... iptables -V` |
+| Startup reports `ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: ...` | docker daemon not running / user not in the docker group / network create failed. Run `main.py sandbox check` first to see which item is err |
+| `[startup] [warn] fs quota: <fstype> on ...` | The fs holding the workspace has no OS-level quota enabled. Ignore during dogfood; before opening to external users it must be upgraded to xfs prjquota / ext4 project / zfs (see the "Quota hardening" section of RUN.md) |
+| `docker run zcbot-sandbox:latest` reports `Unable to find image` | The image is not built. `sudo -u zcbot docker build -f deploy/sandbox/Dockerfile --build-arg HOST_UID=$(id -u zcbot) --build-arg HOST_GID=$(id -g zcbot) -t zcbot-sandbox:latest .` |
| Export reports "无可导出内容" (nothing to export) | The task has no messages (system-only does not count); send a message first, then export |
| `NoSubtaskError: working_dir ... 前缀嵌套` | §7.4 no-subtask: working_dir prefixes of one user may not nest (child / parent). For **multiple conversations on one project** use the **exact same** working_dir; otherwise make them siblings (same level) |
| curl cannot connect after `main.py web` starts | Check the proxy (`HTTP_PROXY` / `HTTPS_PROXY`): the local service is on 127.0.0.1, and a system proxy intercepting it returns 502. Temporarily `unset HTTP_PROXY HTTPS_PROXY` or use `curl --noproxy '*'`. Verify: `curl --noproxy '*' http://127.0.0.1:8765/healthz` |


@ -26,6 +26,7 @@ import yaml
from rich.console import Console
from core.capabilities import ModelCapabilities
+from core.executor_docker import DockerExecutor
from core.executor_host import HostExecutor
from core.llm import LLM
from core.loop import AgentLoop
@ -53,6 +54,39 @@ def load_config() -> dict:
    return yaml.safe_load((ROOT / "config" / "agent.yaml").read_text(encoding="utf-8")) or {}
+
+
+def _resolve_executor(
+    host: HostExecutor,
+    user_id: UUID,
+    user_root_path: Path,
+    working_dir_path: Path,
+):
+    """Select the Executor backend (§7.5 #5).
+
+    With env `ZCBOT_SANDBOX_BACKEND=docker`, build a DockerExecutor; any other value / unset → host.
+    The docker path requires the lifespan to have called `core.sandbox.init_pool`; if the pool is
+    still None this function raises RuntimeError instead of falling back to host (the web entry
+    point prints the startup log at init time, so no extra warn is emitted here).
+    """
+    import os
+    if os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower() != "docker":
+        return host
+    from core.sandbox import get_pool
+    pool = get_pool()
+    if pool is None:
+        # The lifespan did not init successfully: dying early upstream is safer than silent
+        # degradation (avoids believing code runs in the sandbox when it actually runs on the
+        # host once external users are admitted). The web entry point fails fast at startup;
+        # this is an extra reminder.
+        raise RuntimeError(
+            "ZCBOT_SANDBOX_BACKEND=docker but sandbox pool not initialized; "
+            "check web lifespan init_pool() / docker daemon availability"
+        )
+    return DockerExecutor(
+        host=host,
+        pool=pool,
+        user_id=user_id,
+        user_root=user_root_path,
+        working_dir=working_dir_path,
+    )
+
+
def resolve_workspace(workspace: Optional[str], cfg: Optional[dict] = None) -> Path:
    cfg = cfg or load_config()
    p = Path(workspace) if workspace else ROOT / cfg.get("workspace_dir", "workspace")
@ -439,9 +473,11 @@ def build_agent(
    tools[ws.name] = ws
    sink = ConsoleEventSink(console) if console else None
-    # §7.5 #5 Executor abstraction: this step is all host backend (in-process); once the Step 3 docker backend
-    # lands, switching to `ZCBOT_SANDBOX_BACKEND=docker` dispatches shell/run_python to the container.
-    executor = HostExecutor(tools)
+    # §7.5 #5/#6 Executor abstraction: env `ZCBOT_SANDBOX_BACKEND=host|docker` selects the backend.
+    # host (default) = fully in-process, used by local dogfood / Windows; docker = shell/run_python
+    # are dispatched to the per-user container (other tools stay on the host). The docker path requires the lifespan to have run `init_pool`.
+    host_executor = HostExecutor(tools)
+    executor = _resolve_executor(host_executor, uid, ur_path, working_dir_path)
    agent = AgentLoop(
        llm, executor, session, caps,
        user_id=uid, working_dir=working_dir_path, sink=sink,

239
core/executor_docker.py Normal file

@ -0,0 +1,239 @@
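"""DockerExecutor: `shell` / `run_python` go through docker exec; everything else runs in-process (§7.5 #6).

Backend split (§7.5 #6 trust domains):
- host in-process: `read/write/edit/glob/grep/load_skill/web_*/seedream/seedance`
  already hold credentials on the host (Bocha key / ARK key) or pass the `paths.py::resolve_user_path` check
  (the user-rooted security boundary already exists); moving them into the container gains nothing and costs ~200ms exec overhead × N
- container exec: `shell` / `run_python` execute arbitrary model-generated code and must be isolated in a container

Container admission (per call):
1. `pool.ensure(user_id)` gets / starts the `zcbot-sandbox-<uid>` container (already serialized by the per-user lock)
2. `docker exec --user 1000:1000 --workdir /workspace/<wd_name> <c> setsid bash -c '<cmd>'`
3. on timeout, kill the docker CLI client (Popen.kill())
4. on completion, `pool.mark_active(user_id)` refreshes the idle timer

The run_python tmp .py lands host-side at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py` (visible automatically
in the container at `/workspace/.zcbot_tmp/<task_id>/` through the bind mount) and is unlinked after execution;
the leading dot lets the `/v1/files` API filter it naturally (`web/app.py:169` startswith(".")), so the user's view stays clean.

Cancel limitation (accepted for the first version):
- after the docker exec client disconnects, the server-side process inside the container does **not** terminate; this is by docker's design
- the first version only kills the docker CLI (Popen.kill()); leftover processes in the container are handled by the
  5-minute idle reaper / the rm -f path of the next ensure
- upgrade trigger (§7.5 #3 PGID protocol): users report "cancelled but CPU is still burning" / processes pile up in the
  container after repeated cancels → enable the "ZCBOT_EXEC_ID env + PGID written to a file + second exec kill" protocol
"""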
from __future__ import annotations

import os
import secrets
import subprocess
import time
from pathlib import Path
from typing import Any, Dict, List, Optional
from uuid import UUID

from .executor import ExecCtx, Executor, ToolResult
from .executor_host import HostExecutor
from .sandbox import SandboxPool

CONTAINER_TOOLS = frozenset({"shell", "run_python"})

# Non-root user inside the container: matches the default HOST_UID/HOST_GID build-args in the Dockerfile.
# If the zcbot account on the deployment host does not have uid 1000, pass HOST_UID at image build time
# and change env `ZCBOT_SANDBOX_EXEC_USER` here to match (see the RUN.md "Sandbox" deployment section).
DEFAULT_EXEC_USER = "1000:1000"

# Host-side tmp script directory (a dotfile inside user_root, hidden by the /v1/files API)
TMP_SUBDIR = ".zcbot_tmp"


class DockerExecutor(Executor):
    """Combines HostExecutor with docker exec dispatch for shell/run_python.

    The host backend still provides the schema list and runs most tools; this class only
    takes over when shell/run_python is hit and runs it via docker exec in the per-user
    container.
    """

    def __init__(
        self,
        host: HostExecutor,
        pool: SandboxPool,
        user_id: UUID,
        user_root: Path,
        working_dir: Path,
    ) -> None:
        self.host = host
        self.pool = pool
        self.user_id = user_id
        self.user_root = user_root.resolve()
        self.working_dir = working_dir.resolve()
        # Corresponding path inside the container: /workspace/<wd_name>
        try:
            wd_rel = self.working_dir.relative_to(self.user_root)
            self.container_workdir = "/workspace/" + wd_rel.as_posix()
        except ValueError:
            # working_dir is not under user_root: defensive fallback, the normal path
            # never gets here
            self.container_workdir = "/workspace"
        self.exec_user = os.getenv("ZCBOT_SANDBOX_EXEC_USER", DEFAULT_EXEC_USER)

    # ── Executor interface ───────────────────────────────────
    def has_tool(self, name: str) -> bool:
        return self.host.has_tool(name)

    def schemas(self) -> List[Dict[str, Any]]:
        return self.host.schemas()

    def call_tool(self, name: str, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        if name not in CONTAINER_TOOLS:
            return self.host.call_tool(name, args, ctx)
        if not self.host.has_tool(name):
            # e.g. with caps.enable_run_python=False the host has no run_python
            # installed, so its schema is not exposed either
            return ToolResult(content=f"[Error] unknown tool: {name}", exit_code=2)
        try:
            if name == "shell":
                return self._exec_shell(args, ctx)
            if name == "run_python":
                return self._exec_python(args, ctx)
        except Exception as e:
            return ToolResult(
                content=f"[Error executing {name} via docker] {type(e).__name__}: {e}",
                exit_code=1,
            )
        return ToolResult(content=f"[Error] unhandled container tool: {name}", exit_code=2)

    # ── shell ────────────────────────────────────────────────
    def _exec_shell(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        cmd = args.get("command")
        if not isinstance(cmd, str) or not cmd.strip():
            return ToolResult(
                content="[Error] bad arguments to shell: command must be non-empty string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 60)
        container = self.pool.ensure(self.user_id)
        argv = self._docker_exec_argv(container) + ["setsid", "bash", "-c", cmd]
        result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
        self.pool.mark_active(self.user_id)
        return result

    # ── run_python ───────────────────────────────────────────
    def _exec_python(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        code = args.get("code")
        if not isinstance(code, str):
            return ToolResult(
                content="[Error] bad arguments to run_python: code must be string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 120)
        # The tmp .py lands on the host at `.zcbot_tmp/<task_id>/<rand>.py`;
        # inside the container that is /workspace/.zcbot_tmp/<task_id>/<rand>.py
        tmp_root = self.user_root / TMP_SUBDIR / str(ctx.task_id)
        tmp_root.mkdir(parents=True, exist_ok=True)
        rand_name = f"{int(time.time() * 1000)}-{secrets.token_hex(4)}.py"
        host_script = tmp_root / rand_name
        container_script = f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/{rand_name}"
        host_script.write_text(code, encoding="utf-8")
        try:
            container = self.pool.ensure(self.user_id)
            argv = self._docker_exec_argv(
                container,
                extra_env={
                    "PYTHONIOENCODING": "utf-8",
                    "PYTHONPATH": "/workspace",
                },
            ) + ["setsid", "python", container_script]
            result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
            self.pool.mark_active(self.user_id)
            return result
        finally:
            # Always remove the host-side script, even if docker exec failed
            try:
                host_script.unlink()
            except OSError:
                pass

    # ── helpers ──────────────────────────────────────────────
    def _docker_exec_argv(
        self, container: str, extra_env: Optional[Dict[str, str]] = None
    ) -> List[str]:
        argv = [
            "docker", "exec",
            "--user", self.exec_user,
            "--workdir", self.container_workdir,
        ]
        env: Dict[str, str] = {}
        if extra_env:
            env.update(extra_env)
        for k, v in env.items():
            argv.extend(["-e", f"{k}={v}"])
        argv.append(container)
        return argv

    def _run_subprocess(
        self, argv: List[str], timeout: int, ctx: ExecCtx
    ) -> ToolResult:
        """Run the docker exec subprocess, polling cooperatively for cancel.

        On cancel / timeout → Popen.kill() the docker CLI client; the server-side
        process inside the container is an accepted limitation (see the module docstring).
        """
        cancel_check = ctx.cancel_check
        try:
            proc = subprocess.Popen(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                text=True,
                encoding="utf-8",
                errors="replace",
            )
        except FileNotFoundError as e:
            return ToolResult(content=f"[Error] docker CLI not found: {e}", exit_code=2)
        start = time.monotonic()
        cancel_hit = False
        timeout_hit = False
        stdout: str = ""
        stderr: str = ""
        while True:
            try:
                # Wake up every 0.5s to check for cancel and the deadline
                stdout, stderr = proc.communicate(timeout=0.5)
                break
            except subprocess.TimeoutExpired:
                if cancel_check is not None and cancel_check():
                    cancel_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break
                if time.monotonic() - start > timeout:
                    timeout_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break
        if timeout_hit:
            return ToolResult(
                content=f"[Error] command timed out after {timeout}s",
                exit_code=124,
            )
        if cancel_hit:
            return ToolResult(
                content="[Error] command cancelled by user",
                exit_code=130,
            )
        parts: List[str] = []
        if stdout:
            parts.append(f"[stdout]\n{stdout.rstrip()}")
        if stderr:
            parts.append(f"[stderr]\n{stderr.rstrip()}")
        parts.append(f"[exit {proc.returncode}]")
        return ToolResult(content="\n".join(parts), exit_code=proc.returncode)
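
The polling loop in `_run_subprocess` can be tried in isolation. The following is a minimal sketch of the same idea, not the project's code: `run_with_cancel`, `SLEEPER` and `QUICK` are hypothetical names, the child process is a plain Python interpreter rather than `docker exec`, and the poll interval is shortened so the example finishes quickly.

```python
import subprocess
import sys
import time
from typing import Callable, List, Optional, Tuple


def run_with_cancel(
    argv: List[str],
    timeout: float,
    cancel_check: Optional[Callable[[], bool]] = None,
    poll: float = 0.05,
) -> Tuple[int, str]:
    """Run argv, waking every `poll` seconds to check for cancel or the deadline.

    Returns (exit_code, reason); reason is "done", "timeout" or "cancelled".
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    start = time.monotonic()
    while True:
        try:
            # communicate() may be retried after TimeoutExpired without losing output
            proc.communicate(timeout=poll)
            return proc.returncode, "done"
        except subprocess.TimeoutExpired:
            if cancel_check is not None and cancel_check():
                proc.kill()
                proc.communicate()  # reap the child and drain the pipes
                return 130, "cancelled"
            if time.monotonic() - start > timeout:
                proc.kill()
                proc.communicate()
                return 124, "timeout"


# Stand-ins for a long-running and a short-running command (assumptions for the demo)
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]
QUICK = [sys.executable, "-c", "print('hi')"]
```

As in the original, `kill()` only reaches the local child. With `docker exec` that child is the CLI client, which is why the module docstring lists processes left inside the container as an accepted limitation.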


@ -3,17 +3,48 @@
模块边界: 模块边界:
- `network.py`:Docker network ensure(`zcbot-sandbox-net`,`--internal` 隔离 outbound + cross-container) - `network.py`:Docker network ensure(`zcbot-sandbox-net`,`--internal` 隔离 outbound + cross-container)
- `pool.py`:per-user 容器生命周期(ensure / mark_active / reap_idle / shutdown_all) - `pool.py`:per-user 容器生命周期(ensure / mark_active / reap_idle / shutdown_all)
- `__init__.py`: module-level singleton (`init_pool` / `get_pool`), so that the web lifespan hook
and `agent_builder` share the same pool instance
不在本目录:`shell` / `run_python` 工具的 docker exec 调用 那是 Step 3 不在本目录:`shell` / `run_python` 工具的 docker exec 调用 那是 `core/executor_docker.py`,
`core/executor_docker.py`,调用本模块的 `pool.ensure(user_id)` 拿到容器名后再 exec 调用本模块的 `pool.ensure(user_id)` 拿到容器名后再 exec
""" """
from __future__ import annotations
from pathlib import Path
from typing import Optional
from .pool import SandboxPool, container_name, setup_pool from .pool import SandboxPool, container_name, setup_pool
from .network import NETWORK_NAME, ensure_network from .network import NETWORK_NAME, ensure_network
__all__ = [ __all__ = [
"SandboxPool", "SandboxPool",
"container_name", "container_name",
"setup_pool", "setup_pool",
"NETWORK_NAME", "NETWORK_NAME",
"ensure_network", "ensure_network",
"init_pool",
"get_pool",
] ]
# Module-level singleton. The web lifespan startup hook calls `init_pool(user_root_base)`;
# `agent_builder` calls `get_pool()` when constructing a DockerExecutor to get the same instance.
# Not initialized → `get_pool()` returns None, and agent_builder must not take the docker branch.
_pool: Optional[SandboxPool] = None


def init_pool(user_root_base: Path) -> SandboxPool:
    """Idempotently initialize the module-level pool and return the pool instance.

    The lifespan hook calls this once; ensure_network is idempotent internally as well.
    Repeated calls return the same instance (no re-creation).
    """
    global _pool
    if _pool is None:
        _pool = setup_pool(user_root_base)
    return _pool


def get_pool() -> Optional[SandboxPool]:
    return _pool

258
core/sandbox/check.py Normal file

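@ -0,0 +1,258 @@
"""Sandbox deployment pre-flight check (`main.py sandbox check`).

Five independent probes, each printing `[ok]` / `[warn]` / `[err]`; the exit code is returned
after the summary. Every item must be `[ok]` before opening to external users.

Probes and how they relate to the §7.5 protocol:
1. Docker daemon reachable -- required when ZCBOT_SANDBOX_BACKEND=docker is enabled
2. `zcbot-sandbox:latest` image present -- if missing, the docker run in pool.ensure reports
   "Unable to find image"
3. `zcbot-sandbox-net` network present -- harmless if missing (init_pool ensures it
   automatically), but worth pre-warming
4. Image HOST_UID matches the host zcbot uid -- a mismatch makes writes to /workspace fail
   with EACCES after exec
5. The fs holding user_root_base supports quota -- §7.5 #4: xfs prjquota / ext4 project / zfs;
   otherwise "filling the shared fs between scans" can take down other users on the same node
   (an attacker fills the disk far faster than the application-level periodic scan runs)
"""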
from __future__ import annotations

import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Tuple

from .pool import DEFAULT_IMAGE
from .network import NETWORK_NAME


# Status helpers: plain print() with a textual level prefix (no click context required)
def _ok(msg: str) -> None:
    print(f"[ok] {msg}")


def _warn(msg: str) -> None:
    print(f"[warn] {msg}")


def _err(msg: str) -> None:
    print(f"[err] {msg}")


def _run(argv, timeout: int = 10) -> Tuple[int, str, str]:
    """Unified subprocess.run wrapper. Missing CLI → returncode=127, with the reason in stderr."""
    if shutil.which(argv[0]) is None:
        return 127, "", f"{argv[0]} not found in PATH"
    try:
        r = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return r.returncode, r.stdout.strip(), r.stderr.strip()
    except subprocess.TimeoutExpired:
        return 124, "", f"timed out after {timeout}s"
    except Exception as e:
        return 1, "", f"{type(e).__name__}: {e}"


# -- Probes ------------------------------------------------
def check_docker_daemon() -> bool:
    rc, out, err = _run(["docker", "version", "--format", "{{.Server.Version}}"])
    if rc == 0 and out:
        _ok(f"docker daemon reachable (server={out})")
        return True
    if rc == 127:
        _err("docker CLI not found -- apt install docker.io / docker-ce")
    elif "permission denied" in err.lower():
        _err(f"docker daemon not reachable: {err} -- usermod -aG docker $USER + relogin")
    else:
        _err(f"docker daemon not reachable: {err or 'unknown'}")
    return False


def check_image_present() -> bool:
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, _, err = _run(["docker", "image", "inspect", image])
    if rc == 0:
        _ok(f"image present: {image}")
        return True
    _err(
        f"image not found: {image} -- "
        f"`docker build -f deploy/sandbox/Dockerfile "
        f"--build-arg HOST_UID=$(id -u) --build-arg HOST_GID=$(id -g) "
        f"-t {image} .`"
    )
    return False


def check_network_present() -> bool:
    rc, _, _ = _run(["docker", "network", "inspect", NETWORK_NAME])
    if rc == 0:
        _ok(f"network present: {NETWORK_NAME}")
        return True
    _warn(
        f"network missing: {NETWORK_NAME} -- the lifespan startup hook ensures it "
        f"automatically; or create it manually with "
        f"`docker network create --internal {NETWORK_NAME}`"
    )
    return True  # a warn does not count as a failure


def check_host_uid_alignment() -> bool:
    """Check that the zcbot user's uid inside the image matches the current host uid.

    With a bind mount, host fs ownership carries straight into the container. If `HOST_UID`
    was not passed at image build time, the uid inside the container defaults to 1000; if the
    account that actually runs the zcbot service on the host has a uid other than 1000, writes
    to /workspace after exec fail with EACCES. This runs
    `docker run --rm --entrypoint id <image> -u zcbot` to get the image uid and compares it
    with the host's `os.getuid()` (assuming the check subcommand is run as the zcbot user).
    """
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, out, err = _run(
        ["docker", "run", "--rm", "--entrypoint", "id", image, "-u", "zcbot"]
    )
    if rc != 0:
        _warn(
            f"image uid check skipped: {err or 'unknown'} -- "
            f"if the image is not built yet, build it first and rerun"
        )
        return True
    try:
        image_uid = int(out)
    except ValueError:
        _warn(f"image uid unexpected output: {out!r}")
        return True
    if sys.platform == "win32":
        _warn(
            f"image zcbot uid={image_uid}; host uid check skipped on Windows "
            f"(the check is only meaningful on a Linux deployment host)"
        )
        return True
    host_uid = os.getuid()  # type: ignore[attr-defined]
    if image_uid == host_uid:
        _ok(f"HOST_UID aligned: image zcbot uid={image_uid} == host uid={host_uid}")
        return True
    _err(
        f"HOST_UID mismatch: image zcbot uid={image_uid}, host uid={host_uid} -- "
        f"rebuild the image with `docker build --build-arg HOST_UID={host_uid} ...`"
    )
    return False


def detect_fs_quota(target: Path) -> Tuple[str, str]:
    """Detect whether the fs holding target supports quota; returns (level, msg).

    level is one of {"ok", "warn"}: fs quota is never treated as err (it does not block
    web startup). Shared by the CLI and the lifespan hook: the CLI prints through
    _ok/_warn, the lifespan hook uses print.

    Recognized:
    - xfs: mount options include `prjquota` or `pquota` → ok; otherwise warn (supported by
      the fs but not enabled)
    - ext4: mount options include `prjquota` or `project,quota` → ok
    - zfs: always ok (dataset quota is set with zfs set; not examined further here)
    - btrfs: warn, qgroup setup is complex
    - tmpfs / overlay / others: warn (typical of Docker-in-Docker or local dev; not meant
      for production deployments)
    """
    if sys.platform == "win32":
        return "warn", (
            "fs quota check skipped on Windows "
            "(only meaningful on a Linux deployment host)"
        )
    # findmnt ships with most Linux distributions (util-linux)
    rc, out, err = _run([
        "findmnt", "--target", str(target), "-no", "FSTYPE,OPTIONS",
    ])
    if rc != 0 or not out:
        return "warn", (
            f"fs quota check skipped: cannot detect fs for {target} "
            f"({err or 'findmnt missing'})"
        )
    parts = out.split()
    fstype = parts[0].lower() if parts else ""
    options = parts[1] if len(parts) > 1 else ""
    opts = set(options.split(","))
    if fstype == "xfs":
        if "prjquota" in opts or "pquota" in opts:
            return "ok", f"fs quota: xfs with prjquota on {target}"
        return "warn", (
            f"fs quota: xfs on {target} but NO prjquota mount option -- "
            f"`sudo mount -o remount,prjquota <mountpoint>` + `xfs_quota -x ...`"
        )
    if fstype == "ext4":
        if "prjquota" in opts or ("project" in opts and "quota" in opts):
            return "ok", f"fs quota: ext4 with project quota on {target}"
        return "warn", (
            f"fs quota: ext4 on {target} but NO project quota option -- "
            f"`tune2fs -O project,quota <dev>` + remount + `quota -P`"
        )
    if fstype == "zfs":
        return "ok", f"fs quota: zfs on {target} (dataset quota via `zfs set quota=...`)"
    if fstype == "btrfs":
        return "warn", (
            f"fs quota: btrfs on {target} -- qgroup setup is complex; xfs prjquota is "
            f"recommended for production. If btrfs is required, verify `btrfs qgroup` yourself"
        )
    return "warn", (
        f"fs quota: {fstype or '<unknown>'} on {target} -- "
        f"not a mainstream quota-capable type; move to a dedicated xfs/ext4/zfs partition "
        f"before opening to external users"
    )


def check_fs_quota_capable() -> bool:
    """CLI entry point: probe the fs holding workspace/users/. Always returns True (never err)."""
    from core.agent_builder import load_config, resolve_workspace

    try:
        cfg = load_config()
        workspace = resolve_workspace(None, cfg)
        target = (workspace / "users").resolve()
    except Exception as e:
        _warn(f"fs quota check: cannot resolve workspace path: {e}")
        return True
    level, msg = detect_fs_quota(target)
    if level == "ok":
        _ok(msg)
    else:
        _warn(msg)
    return True


# -- Summary entry point -----------------------------------
CHECK_NAMES = [
    ("docker daemon", "check_docker_daemon"),
    ("image present", "check_image_present"),
    ("network present", "check_network_present"),
    ("HOST_UID alignment", "check_host_uid_alignment"),
    ("fs quota capable", "check_fs_quota_capable"),
]


def run_sandbox_check() -> int:
    """Run all probes and return the exit code (0 = all ok or warn only; 1 = at least one err).

    err vs warn:
    - err = a root cause that makes the docker backend fail fast at startup
      (daemon / image / HOST_UID)
    - warn = does not block startup but must be cleared before opening to external users
      (network / fs without quota)

    Function references are looked up on the module at call time (not frozen into a CHECKS
    tuple), so that unittest patches of `core.sandbox.check.check_xxx` take effect here.
    """
    print("--- sandbox deployment check ---\n")
    ok_count = 0
    module = sys.modules[__name__]
    for label, fn_name in CHECK_NAMES:
        fn = getattr(module, fn_name)
        try:
            if fn():
                ok_count += 1
        except Exception as e:
            _err(f"{label}: unexpected {type(e).__name__}: {e}")
    total = len(CHECK_NAMES)
    print()
    if ok_count == total:
        print(f"[summary] {ok_count}/{total} checks passed -- docker backend ready")
        return 0
    failed = total - ok_count
    print(
        f"[summary] {ok_count}/{total} passed, {failed} failed -- "
        f"fix the [err] items above before enabling the docker backend"
    )
    return 1
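
The mount-option decision inside `detect_fs_quota` can be checked without calling `findmnt`. The sketch below pulls that decision into a pure function. `classify_fs` is a hypothetical helper, not part of `core/sandbox/check.py`, and it assumes its input is one line in the `FSTYPE OPTIONS` shape that `findmnt -no FSTYPE,OPTIONS` prints.

```python
from typing import Tuple


def classify_fs(findmnt_line: str) -> Tuple[str, str]:
    """Classify one `FSTYPE OPTIONS` line as ("ok" | "warn", fstype)."""
    parts = findmnt_line.split()
    fstype = parts[0].lower() if parts else ""
    opts = set(parts[1].split(",")) if len(parts) > 1 else set()
    if fstype == "xfs":
        # xfs needs project quota enabled at mount time
        ok = bool({"prjquota", "pquota"} & opts)
    elif fstype == "ext4":
        # either the combined option or both feature names
        ok = "prjquota" in opts or {"project", "quota"} <= opts
    elif fstype == "zfs":
        # dataset quotas are configured outside the mount options
        ok = True
    else:
        # btrfs, tmpfs, overlay and anything unknown are only ever a warning
        ok = False
    return ("ok" if ok else "warn"), fstype


if __name__ == "__main__":
    for line in ("xfs rw,relatime,prjquota,attr2", "ext4 rw,relatime", "tmpfs rw"):
        print(line, "->", classify_fs(line))
```

Keeping the classification free of I/O mirrors the reason `detect_fs_quota` returns `(level, msg)` instead of printing: both the CLI and the lifespan hook can reuse it, and tests need no subprocess mock for the decision itself.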


@ -5,27 +5,30 @@
workspace 目录) workspace 目录)
生命周期: 生命周期:
- `ensure(user_id)`:per-user `asyncio.Lock` 串行化 `docker inspect` 探测 running - `ensure(user_id)`:per-user `threading.Lock` 串行化 `docker inspect` 探测
直接返;exists-but-stopped `rm -f` 重起(保证 iptables 重新 apply);不存在 `docker run` running 直接返;exists-but-stopped `rm -f` 重起(保证 iptables 重新 apply);
不存在 `docker run`
- `mark_active(user_id)`:exec 完更新 in-memory `_last_active[uid]=now`(docker labels - `mark_active(user_id)`:exec 完更新 in-memory `_last_active[uid]=now`(docker labels
不可运行时修改 Docker 23+ 移除 `docker update --label-add` 支持) 不可运行时修改 Docker 23+ 移除 `docker update --label-add` 支持)
- `reap_idle()`:周期任务, `_last_active` dict,>`idle_ttl` `docker rm -f` - `reap_idle()`:周期任务, `_last_active` dict,>`idle_ttl` `docker rm -f`
- `shutdown_all()`:app 启动时清前驱孤儿(`docker ps --filter label=zcbot.product=sandbox`) - `shutdown_all()`:app 启动时清前驱孤儿(`docker ps --filter label=zcbot.product=sandbox`)
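The API is fully synchronous: the main callers of ensure are AgentLoop / DockerExecutor, which run
inside web BG threads and are naturally synchronous; the reaper runs on the uvicorn loop and calls
this class's sync methods through `run_in_executor`. threading.Lock works across threads, whereas
the protection of asyncio.Lock is bypassed when ephemeral loops are created / destroyed.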
幂等性: 幂等性:
- ensure 在重复调用时跨 daemon round-trip < 100ms( `docker inspect`);per-user lock - ensure 在重复调用时跨 daemon round-trip < 100ms( `docker inspect`);per-user lock
防同 user 两并发 `docker run --name` "Conflict"(虽然 docker 本身会 reject,提前 防同 user 两并发 `docker run --name` "Conflict"(虽然 docker 本身会 reject,提前
锁更干净) 锁更干净)
- reaper 只杀 dict 里有记录的容器 重启后 dict 不杀历史孤儿(这条由 startup - reaper 只杀 dict 里有记录的容器 重启后 dict 不杀历史孤儿(这条由 startup
`shutdown_all` 兜底) `shutdown_all` 兜底)
Step 2 范围: pool / lifecycleTools(shell / run_python) Step 3 接入
""" """
from __future__ import annotations from __future__ import annotations
import asyncio
import os import os
import subprocess import subprocess
import threading
import time import time
from pathlib import Path from pathlib import Path
from typing import Dict, List, Optional from typing import Dict, List, Optional
@ -97,17 +100,19 @@ class SandboxPool:
os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS)) os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
) )
self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "") self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
self._locks: Dict[UUID, asyncio.Lock] = {} self._dict_lock = threading.Lock() # 保护 _locks / _last_active 的字典级 race
self._locks: Dict[UUID, threading.Lock] = {}
self._last_active: Dict[UUID, int] = {} self._last_active: Dict[UUID, int] = {}
def _lock_for(self, user_id: UUID) -> asyncio.Lock: def _lock_for(self, user_id: UUID) -> threading.Lock:
if user_id not in self._locks: with self._dict_lock:
self._locks[user_id] = asyncio.Lock() if user_id not in self._locks:
return self._locks[user_id] self._locks[user_id] = threading.Lock()
return self._locks[user_id]
async def ensure(self, user_id: UUID) -> str: def ensure(self, user_id: UUID) -> str:
"""返回容器名;create-or-reuse 原子。""" """返回容器名;create-or-reuse 原子。同步阻塞,主调方 AgentLoop 已在 BG 线程。"""
async with self._lock_for(user_id): with self._lock_for(user_id):
name = container_name(user_id) name = container_name(user_id)
if _container_running(name): if _container_running(name):
self._last_active[user_id] = _now() self._last_active[user_id] = _now()
@ -118,7 +123,7 @@ class SandboxPool:
["docker", "rm", "-f", name], ["docker", "rm", "-f", name],
capture_output=True, check=False, capture_output=True, check=False,
) )
await asyncio.to_thread(self._docker_run, user_id, name) self._docker_run(user_id, name)
self._last_active[user_id] = _now() self._last_active[user_id] = _now()
return name return name
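
The locking change in this hunk follows a common pattern: a dict-level lock guards the registry of per-key locks, and each per-key lock serializes the create-or-reuse step for one user. A minimal sketch of that pattern follows. `KeyedLocks` and `count_creations` are hypothetical names, and the "container" is just a list entry standing in for `docker run`.

```python
import threading
from typing import Dict, Hashable, List


class KeyedLocks:
    """Registry of per-key locks, itself protected by one dict-level lock."""

    def __init__(self) -> None:
        self._dict_lock = threading.Lock()
        self._locks: Dict[Hashable, threading.Lock] = {}

    def lock_for(self, key: Hashable) -> threading.Lock:
        # Without the dict-level lock, two threads could each create a lock
        # for the same key and then fail to exclude each other.
        with self._dict_lock:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]


def count_creations(n_threads: int = 8) -> int:
    """Let n_threads race on one key; return how many 'containers' got created."""
    locks = KeyedLocks()
    created: List[str] = []

    def worker() -> None:
        with locks.lock_for("user-1"):
            if not created:  # stand-in for the "is it already running?" probe
                created.append("container")  # stand-in for `docker run`

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(created)
```

Because the check and the creation both happen while the per-key lock is held, only the first thread creates anything and the rest see the existing entry.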

20
main.py

@ -198,5 +198,25 @@ def web(host: str, port: int, reload: bool) -> None:
uvicorn.run(create_app(), host=host, port=port, log_level="info") uvicorn.run(create_app(), host=host, port=port, log_level="info")
# ─────────────── Sandbox (Stage C deployment pre-flight check) ───────────────
@cli.group()
def sandbox() -> None:
    """Sandbox container deployment check (run once before enabling `ZCBOT_SANDBOX_BACKEND=docker`)."""


@sandbox.command("check")
def sandbox_check() -> None:
    """Check the docker backend startup prerequisites (daemon / image / network / HOST_UID / fs quota).

    Non-blocking: each item independently prints `[ok]` / `[warn]` / `[err]`, followed by a
    summary. Any `err` item → exit 1; all ok / warn → exit 0. warn items do not block web
    startup, but **must be cleared before opening to external users**
    (see the DESIGN §7.5 rollout checklist).
    """
    from core.sandbox.check import run_sandbox_check

    rc = run_sandbox_check()
    sys.exit(rc)
if __name__ == "__main__": if __name__ == "__main__":
cli() cli()


@ -0,0 +1,285 @@
"""DockerExecutor unit tests.

subprocess is mocked (actually running `docker exec` is covered by the smoke test on the
deployment host; see the 5 commands in RUN.md).

Key paths covered:
- trust-domain dispatch: host tools pass through / container tools go through docker exec
- argv shape: --user / --workdir / setsid / bash -c / python <script>
- tmp .py: written to the host `.zcbot_tmp/<task_id>/`, unlinked after execution, no leftovers
- timeout / cancel: Popen.kill() as the fallback
- schemas() / has_tool() pass through to the host
"""
from __future__ import annotations

import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
from uuid import uuid4

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.executor import ExecCtx, ToolResult
from core.executor_docker import DockerExecutor, TMP_SUBDIR
from core.executor_host import HostExecutor


class FakePool:
    """SandboxPool stand-in: ensure returns a fixed container name, mark_active records calls."""

    def __init__(self):
        self.ensure_calls = []
        self.mark_active_calls = []

    def ensure(self, user_id):
        name = f"zcbot-sandbox-{user_id}"
        self.ensure_calls.append(user_id)
        return name

    def mark_active(self, user_id):
        self.mark_active_calls.append(user_id)


class FakeTool:
    """tools.base.Tool stand-in: execute returns a string, schema exposes name + empty parameters."""

    def __init__(self, name, output="ok"):
        self.name = name
        self._output = output
        self.execute_calls = []

    @property
    def schema(self):
        return {"type": "function", "function": {"name": self.name}}

    def execute(self, **kwargs):
        self.execute_calls.append(kwargs)
        return self._output


def make_executor(tools_dict=None):
    """Build a DockerExecutor + FakePool + tmp user_root. Returns (executor, pool, tmp_dir)."""
    tmp = tempfile.mkdtemp()
    user_root = Path(tmp) / "users" / "u1"
    user_root.mkdir(parents=True)
    working_dir = user_root / "demo"
    working_dir.mkdir()
    if tools_dict is None:
        tools_dict = {
            "read": FakeTool("read", "READ_OUT"),
            "shell": FakeTool("shell"),  # the host shell must not be called
            "run_python": FakeTool("run_python"),
        }
    host = HostExecutor(tools_dict)
    pool = FakePool()
    executor = DockerExecutor(
        host=host,
        pool=pool,
        user_id=uuid4(),
        user_root=user_root,
        working_dir=working_dir,
    )
    return executor, pool, Path(tmp)


def make_ctx(executor):
    return ExecCtx(
        user_id=executor.user_id,
        task_id=uuid4(),
        working_dir=executor.working_dir,
        cancel_check=None,
    )


class TestHostPassthrough(unittest.TestCase):
    """Non-container tools pass straight through to the host backend, without touching pool / subprocess."""

    def test_read_passthrough_to_host(self):
        executor, pool, _ = make_executor()
        ctx = make_ctx(executor)
        result = executor.call_tool("read", {"file": "x"}, ctx)
        self.assertEqual(result.content, "READ_OUT")
        self.assertEqual(result.exit_code, 0)
        self.assertEqual(pool.ensure_calls, [])
        self.assertEqual(pool.mark_active_calls, [])

    def test_schemas_and_has_tool_from_host(self):
        executor, _, _ = make_executor()
        names = [s["function"]["name"] for s in executor.schemas()]
        self.assertIn("read", names)
        self.assertIn("shell", names)
        self.assertTrue(executor.has_tool("shell"))
        self.assertFalse(executor.has_tool("nope"))


class TestShellExec(unittest.TestCase):
    """shell calls go through the docker exec subprocess with the correct argv shape."""

    def test_shell_invokes_docker_exec(self):
        executor, pool, _ = make_executor()
        ctx = make_ctx(executor)
        proc = MagicMock()
        proc.communicate.return_value = ("hello\n", "")
        proc.returncode = 0
        with patch("core.executor_docker.subprocess.Popen", return_value=proc) as popen:
            result = executor.call_tool("shell", {"command": "echo hello"}, ctx)
        self.assertIn("[stdout]\nhello", result.content)
        self.assertIn("[exit 0]", result.content)
        self.assertEqual(result.exit_code, 0)
        argv = popen.call_args[0][0]
        self.assertEqual(argv[:2], ["docker", "exec"])
        self.assertIn("--user", argv)
        self.assertIn("--workdir", argv)
        # workdir should be /workspace/demo (working_dir relative to user_root)
        self.assertEqual(argv[argv.index("--workdir") + 1], "/workspace/demo")
        # container name = zcbot-sandbox-<uid>
        container_idx = argv.index(f"zcbot-sandbox-{executor.user_id}")
        # setsid bash -c must appear immediately after the container name
        self.assertEqual(argv[container_idx + 1:], ["setsid", "bash", "-c", "echo hello"])
        self.assertEqual(pool.ensure_calls, [executor.user_id])
        self.assertEqual(pool.mark_active_calls, [executor.user_id])

    def test_shell_bad_args(self):
        executor, _, _ = make_executor()
        ctx = make_ctx(executor)
        result = executor.call_tool("shell", {"command": ""}, ctx)
        self.assertIn("[Error]", result.content)
        self.assertEqual(result.exit_code, 2)

    def test_shell_timeout(self):
        executor, pool, _ = make_executor()
        ctx = make_ctx(executor)
        import subprocess as real_subprocess
        proc = MagicMock()
        # The first communicate raises TimeoutExpired; the second (after kill) returns empty output
        proc.communicate.side_effect = [
            real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
            ("", "killed\n"),
        ]
        proc.returncode = -9
        with patch("core.executor_docker.subprocess.Popen", return_value=proc), \
                patch("core.executor_docker.time.monotonic", side_effect=[0, 100]):
            result = executor.call_tool("shell", {"command": "sleep 9999", "timeout": 1}, ctx)
        self.assertIn("timed out after 1s", result.content)
        self.assertEqual(result.exit_code, 124)
        proc.kill.assert_called_once()

    def test_shell_cancel(self):
        executor, _, _ = make_executor()
        ctx = ExecCtx(
            user_id=executor.user_id,
            task_id=uuid4(),
            working_dir=executor.working_dir,
            cancel_check=lambda: True,  # cancel immediately
        )
        import subprocess as real_subprocess
        proc = MagicMock()
        proc.communicate.side_effect = [
            real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
            ("", ""),
        ]
        proc.returncode = -15
        with patch("core.executor_docker.subprocess.Popen", return_value=proc):
            result = executor.call_tool("shell", {"command": "sleep 9999"}, ctx)
        self.assertIn("cancelled by user", result.content)
        self.assertEqual(result.exit_code, 130)
        proc.kill.assert_called_once()


class TestRunPython(unittest.TestCase):
    """run_python: the tmp .py lands in user_root/.zcbot_tmp/<task_id>/ and is unlinked afterwards."""

    def test_run_python_tmp_script(self):
        executor, pool, tmp_root = make_executor()
        ctx = make_ctx(executor)
        proc = MagicMock()
        proc.communicate.return_value = ("42\n", "")
        proc.returncode = 0
        captured_argv = []

        def _popen(argv, **kwargs):
            captured_argv.append(argv)
            return proc

        with patch("core.executor_docker.subprocess.Popen", side_effect=_popen):
            result = executor.call_tool(
                "run_python", {"code": "print(42)"}, ctx
            )
        self.assertIn("[stdout]\n42", result.content)
        self.assertEqual(result.exit_code, 0)
        argv = captured_argv[0]
        # Tail shape: setsid python /workspace/.zcbot_tmp/<task_id>/<rand>.py
        self.assertEqual(argv[-3], "setsid")
        self.assertEqual(argv[-2], "python")
        self.assertTrue(argv[-1].startswith(f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/"))
        self.assertTrue(argv[-1].endswith(".py"))
        # PYTHONIOENCODING / PYTHONPATH are injected
        env_kvs = [argv[i + 1] for i, a in enumerate(argv) if a == "-e"]
        self.assertIn("PYTHONIOENCODING=utf-8", env_kvs)
        self.assertIn("PYTHONPATH=/workspace", env_kvs)
        # The host-side tmp file is already unlinked (the directory may remain, which is
        # fine: the next run_python call mkdirs it again with exist_ok=True)
        tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
        leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
        self.assertEqual(leftover, [], f"tmp .py not cleaned up: {leftover}")

    def test_run_python_bad_code_type(self):
        executor, _, _ = make_executor()
        ctx = make_ctx(executor)
        result = executor.call_tool("run_python", {"code": 123}, ctx)
        self.assertIn("[Error]", result.content)
        self.assertEqual(result.exit_code, 2)

    def test_run_python_cleans_tmp_on_exception(self):
        """The tmp .py must still be cleaned up when Popen raises (finally fallback)."""
        executor, _, _ = make_executor()
        ctx = make_ctx(executor)
        with patch(
            "core.executor_docker.subprocess.Popen",
            side_effect=RuntimeError("boom"),
        ):
            result = executor.call_tool("run_python", {"code": "x"}, ctx)
        self.assertIn("[Error executing run_python via docker]", result.content)
        self.assertEqual(result.exit_code, 1)
        tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
        leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
        self.assertEqual(leftover, [])


class TestUnknownTool(unittest.TestCase):
    def test_unknown_tool_goes_to_host(self):
        executor, _, _ = make_executor(tools_dict={})  # empty host → no tools at all
        ctx = make_ctx(executor)
        result = executor.call_tool("nope", {}, ctx)
        self.assertIn("unknown tool", result.content)
        self.assertEqual(result.exit_code, 2)

    def test_container_tool_not_registered_on_host(self):
        """caps.enable_run_python=False: the host has no run_python, so docker must refuse as well."""
        executor, _, _ = make_executor(tools_dict={"read": FakeTool("read")})
        ctx = make_ctx(executor)
        result = executor.call_tool("run_python", {"code": "x"}, ctx)
        self.assertIn("unknown tool", result.content)
        self.assertEqual(result.exit_code, 2)


if __name__ == "__main__":
    unittest.main()
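
The tests above assert that a working directory under `user_root` shows up as `/workspace/demo` and that the tmp script appears under `/workspace/.zcbot_tmp/<task_id>/`. The sketch below illustrates that host-to-container path mapping on its own. `to_container_path` is a hypothetical helper, not a function from the repository, and it joins with `PurePosixPath` where `DockerExecutor.__init__` concatenates strings.

```python
from pathlib import PurePath, PurePosixPath


def to_container_path(
    user_root: PurePath, host_path: PurePath, mount: str = "/workspace"
) -> str:
    """Map a host path under user_root to its path under the bind mount."""
    try:
        rel = host_path.relative_to(user_root)
    except ValueError:
        # host_path is outside user_root: fall back to the mount root
        return mount
    # PurePosixPath drops a lone "." component, so user_root itself maps to the mount
    return str(PurePosixPath(mount) / rel.as_posix())


if __name__ == "__main__":
    root = PurePosixPath("/srv/zcbot/users/u1")
    print(to_container_path(root, root / "demo"))
    print(to_container_path(root, root / ".zcbot_tmp" / "t1" / "a.py"))
    print(to_container_path(root, PurePosixPath("/etc")))
```

The fallback branch matches the defensive `except ValueError` in `DockerExecutor.__init__`, which also resolves to `/workspace`.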

186
tests/test_sandbox_check.py Normal file

@ -0,0 +1,186 @@
"""Unit tests for the `main.py sandbox check` probe functions.

subprocess is mocked; covers:
- the branches for daemon unreachable / image / network / uid mismatch
- detect_fs_quota decisions for xfs/ext4/zfs/btrfs/others + the prjquota mount option
- the summary exit code: ok / warn / err
"""
from __future__ import annotations

import sys
import unittest
from pathlib import Path
from unittest.mock import patch

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.sandbox.check import (
    check_docker_daemon,
    check_image_present,
    check_host_uid_alignment,
    detect_fs_quota,
    run_sandbox_check,
)


def _mk_run(returns):
    """Build a `_run` stand-in that returns the (rc, stdout, stderr) items of a list in call order."""
    iter_ret = iter(returns)

    def fake_run(argv, timeout=10):
        return next(iter_ret)

    return fake_run


class TestDaemonCheck(unittest.TestCase):
    def test_daemon_ok(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "24.0.7", "")])):
            self.assertTrue(check_docker_daemon())

    def test_daemon_cli_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "docker not found in PATH")])):
            self.assertFalse(check_docker_daemon())

    def test_daemon_permission_denied(self):
        with patch(
            "core.sandbox.check._run",
            _mk_run([(1, "", "Got permission denied while trying to connect")]),
        ):
            self.assertFalse(check_docker_daemon())


class TestImageCheck(unittest.TestCase):
    def test_image_present(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "[...]", "")])):
            self.assertTrue(check_image_present())

    def test_image_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(1, "", "No such image")])):
            self.assertFalse(check_image_present())


class TestHostUidAlignment(unittest.TestCase):
    def test_uid_aligned(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        host_uid = os.getuid()  # type: ignore[attr-defined]
        with patch(
            "core.sandbox.check._run",
            _mk_run([(0, str(host_uid), "")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_uid_mismatch(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        bad = os.getuid() + 1  # type: ignore[attr-defined]
        with patch("core.sandbox.check._run", _mk_run([(0, str(bad), "")])):
            self.assertFalse(check_host_uid_alignment())

    def test_image_not_built_yet(self):
        # docker run fails → warn, not err
        with patch(
            "core.sandbox.check._run",
            _mk_run([(125, "", "Unable to find image")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_skipped_on_windows(self):
        with patch("core.sandbox.check.sys") as mock_sys, \
                patch("core.sandbox.check._run", _mk_run([(0, "1000", "")])):
            mock_sys.platform = "win32"
            self.assertTrue(check_host_uid_alignment())


class TestDetectFsQuota(unittest.TestCase):
    """detect_fs_quota: does not print, simply returns (level, msg), which is easy to assert on."""

    def test_xfs_with_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,prjquota,attr2", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt/zcbot/workspace/users"))
        self.assertEqual(level, "ok")
        self.assertIn("xfs with prjquota", msg)

    def test_xfs_without_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,attr2", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("NO prjquota", msg)

    def test_ext4_with_project_quota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "ext4 rw,prjquota", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "ok")
        self.assertIn("ext4 with project quota", msg)

    def test_zfs(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "zfs rw,xattr,noacl", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tank/zcbot"))
        self.assertEqual(level, "ok")
        self.assertIn("zfs", msg)

    def test_btrfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "btrfs rw,relatime", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("btrfs", msg)

    def test_tmpfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "tmpfs rw", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tmp"))
        self.assertEqual(level, "warn")

    def test_findmnt_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "findmnt not found in PATH")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("findmnt", msg)

    def test_windows_skipped(self):
        with patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "win32"
            level, msg = detect_fs_quota(Path("C:/"))
        self.assertEqual(level, "warn")
        self.assertIn("Windows", msg)


class TestSummaryExitCode(unittest.TestCase):
    """run_sandbox_check summary: any err → exit 1; all ok or warn-only → exit 0."""

    def test_all_ok_exits_zero(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=True), \
             patch("core.sandbox.check.check_image_present", return_value=True), \
             patch("core.sandbox.check.check_network_present", return_value=True), \
             patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
             patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 0)

    def test_any_err_exits_one(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=False), \
             patch("core.sandbox.check.check_image_present", return_value=True), \
             patch("core.sandbox.check.check_network_present", return_value=True), \
             patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
             patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 1)


if __name__ == "__main__":
    unittest.main()
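
The filesystem cases exercised by the tests above (ext4 with prjquota and zfs give "ok"; xfs without prjquota, btrfs and tmpfs give "warn") can be read as a small classification table. The following is a minimal sketch of such a classifier, not the actual code in core/sandbox/check.py, which is not shown here. It assumes the probe yields a single "FSTYPE OPTIONS" line like the mocked outputs in the tests; the helper name `classify_fs` and all message wording other than the substrings asserted in the tests are hypothetical.

```python
from typing import Tuple


def classify_fs(findmnt_line: str) -> Tuple[str, str]:
    """Map an 'FSTYPE OPTIONS' line to a (level, message) pair.

    Hypothetical helper for illustration; the real detect_fs_quota may differ.
    """
    parts = findmnt_line.strip().split(None, 1)
    if not parts:
        # Nothing to parse: treat as unknown rather than failing hard.
        return "warn", "could not determine filesystem type"

    fstype = parts[0]
    # Mount options are comma-separated, e.g. "rw,relatime,prjquota".
    options = set(parts[1].strip().split(",")) if len(parts) > 1 else set()

    if fstype == "xfs":
        # "pquota" is an accepted alias of "prjquota" for xfs mounts.
        if options & {"prjquota", "pquota"}:
            return "ok", "xfs with prjquota"
        return "warn", "xfs mounted with NO prjquota"
    if fstype == "ext4":
        if "prjquota" in options:
            return "ok", "ext4 with project quota"
        return "warn", "ext4 without project quota"
    if fstype == "zfs":
        return "ok", "zfs"
    if fstype in ("btrfs", "tmpfs"):
        # Assumption taken from the tests: both are reported as warn.
        return "warn", f"{fstype}: treated as not quota-capable in this sketch"
    return "warn", f"{fstype}: unrecognized filesystem, quota support unknown"
```

Every non-"ok" outcome is a "warn" rather than an error, which matches the commit's stated split: a filesystem without quota support does not block startup but should be resolved before opening the service to external users.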

@@ -481,7 +481,7 @@ def create_app() -> FastAPI:
     async def lifespan(app: FastAPI):
         broker.bind_loop(asyncio.get_running_loop())
         # The skill registry is scanned once at startup: the filesystem is static and does not change at runtime, so /v1/skills reads it directly
-        from core.agent_builder import load_config
+        from core.agent_builder import load_config, resolve_workspace
         from core.paths import ROOT
         from core.skills import SkillRegistry
         _cfg = load_config()
@@ -500,7 +500,68 @@ def create_app() -> FastAPI:
         )
         if result.rowcount:
             print(f"[startup] reaped {result.rowcount} stale active run(s)")
-        yield
+        # Sandbox pool (§7.5): enabled only when ZCBOT_SANDBOX_BACKEND=docker.
+        # Startup hooks: (1) init_pool (creates the docker network and the pool instance); (2) shutdown_all
+        # sweeps orphans from the previous process (leftover zcbot-sandbox-* containers; the in-memory
+        # _last_active is empty, so remove them all and start fresh); (3) a background reaper task runs reap_idle every 60s.
+        sandbox_backend = os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower()
+        sandbox_reaper_task = None
+        if sandbox_backend == "docker":
+            from core.sandbox import init_pool
+            from core.sandbox.check import detect_fs_quota
+            workspace = resolve_workspace(None, _cfg)
+            user_root_base = workspace / "users"
+            # §7.5 #4 fs quota probe: does not block startup (the application-layer periodic scan is already in
+            # place); it only prints a WARN as a reminder to upgrade to xfs prjquota / ext4 project / zfs before opening to external users.
+            try:
+                level, msg = detect_fs_quota(user_root_base.resolve())
+                print(f"[startup] {'[ok]' if level == 'ok' else '[warn]'} {msg}")
+            except Exception as e:
+                print(f"[startup] [warn] fs quota detect failed: {type(e).__name__}: {e}")
+            try:
+                pool = init_pool(user_root_base)
+                removed = pool.shutdown_all()
+                if removed:
+                    print(f"[startup] swept {len(removed)} stale sandbox container(s)")
+                async def _reaper() -> None:
+                    loop = asyncio.get_running_loop()
+                    while True:
+                        try:
+                            await asyncio.sleep(60)
+                            removed = await loop.run_in_executor(None, pool.reap_idle)
+                            if removed:
+                                print(f"[reaper] reaped {len(removed)} idle sandbox container(s)")
+                        except asyncio.CancelledError:
+                            raise
+                        except Exception as e:
+                            print(f"[reaper] error: {type(e).__name__}: {e}")
+                sandbox_reaper_task = asyncio.create_task(_reaper(), name="sandbox-reaper")
+                app.state.sandbox_pool = pool
+            except Exception as e:
+                # ensure_network failed or the docker CLI is unavailable → fail-fast. Stage C protocol: any missing
+                # hardening means the deployment is incomplete; do not fall back to host (otherwise it looks sandboxed while actually running bare).
+                raise RuntimeError(
+                    f"ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: {e}"
+                )
+        try:
+            yield
+        finally:
+            if sandbox_reaper_task is not None:
+                sandbox_reaper_task.cancel()
+                try:
+                    await sandbox_reaper_task
+                except (asyncio.CancelledError, Exception):
+                    pass
+            if sandbox_backend == "docker":
+                pool = getattr(app.state, "sandbox_pool", None)
+                if pool is not None:
+                    try:
+                        pool.shutdown_all()
+                    except Exception as e:
+                        print(f"[shutdown] sandbox shutdown_all error: {type(e).__name__}: {e}")
     app = FastAPI(
         title="zcbot api",