Stage C Step 5: main.py sandbox check + lifespan fs quota WARN

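- main.py sandbox check subcommand: 5 independent probes + aggregated exit code
  ① docker daemon reachable ② zcbot-sandbox:latest image present
  ③ zcbot-sandbox-net network present (warn, not err) ④ zcbot uid in the image
  matches the host uid ⑤ fs holding workspace/users is quota-capable
- core/sandbox/check.py: detect_fs_quota(path) -> (level, msg) factored out so
  lifespan and the CLI share it; recognizes xfs+prjquota/ext4+project/zfs/btrfs/tmpfs/other
- web/app.py lifespan calls detect_fs_quota when the docker backend is enabled and
  prints a WARN to stdout (does not block startup; the application-level periodic scan still applies)
- err vs warn boundary: err = root cause of a docker backend fail-fast (daemon/image/uid),
  warn = does not block startup but must be cleared before opening to external users (network missing/fs not quota-capable)
- run_sandbox_check uses a module-level getattr instead of a frozen CHECKS tuple so that
  unittest patches of core.sandbox.check.check_xxx take effect
- tests/test_sandbox_check.py: 19 tests cover each branch + the aggregated exit code;
  unittest discover 31/31 PASS
- RUN.md: adds a "Pre-deployment check" subsection + rewrites "Quota hardening" (fs state → action
  mapping table + 4-step xfs upgrade) + 3 troubleshooting rows; DESIGN unchanged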

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
caoqianming 2026-05-26 16:41:16 +08:00
parent dfac0acfa6
commit 1a950dedb5
6 changed files with 529 additions and 11 deletions


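@@ -2,7 +2,7 @@
> Companion to `DESIGN.md`. This file records only phase status, deviations from decisions, file counts, and next steps. Each entry is 1-2 sentences: what was done plus the key judgment call; for details see `git log` / `git diff` / `DESIGN §7.9`
Last updated: 2026-05-26 (Stage C Step 3: DockerExecutor integrated with AgentLoop + web lifespan reaper; the ZCBOT_SANDBOX_BACKEND env var switches between host/docker)
Last updated: 2026-05-26 (Stage C Step 5: `main.py sandbox check` pre-deployment check + lifespan fs quota WARN + RUN.md quota-hardening section completed)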
---
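@@ -15,7 +15,7 @@
| 5 | Eval Suite | ⏸ Skipped | Replaced by dogfooding; probes cover health checks |
| 6 | Long-task engineering | 🟡 | task + recovery ✅; two-tier memory ✅; context compression not done |
| 7 | Polish | ❌ | Docker sandbox / more skills |
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C Steps 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated with AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires egress proxy + xfs project quota (§7.5 rollout checklist #2 #4)**. |
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C Steps 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated with AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers) + Step 5 pre-deployment check ✅ (`main.py sandbox check` + lifespan fs quota WARN)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires egress proxy + OS-level hardening with xfs project quota (§7.5 rollout checklist #2 #4)**. |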
---
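@@ -23,6 +23,7 @@
### 2026-05-26
- **Stage C Step 5: `main.py sandbox check` pre-deployment check + lifespan fs quota WARN**: OS-level quota is a hard §7.5 #4 requirement for opening to external users (xfs prjquota / ext4 project quota / zfs dataset quota; otherwise "a shared fs filled up between scans takes down the whole node"), and when a docker backend startup prerequisite (daemon / image / HOST_UID alignment) is wrong, lifespan fails fast and debugging from the traceback is expensive. This step turns the "ops mental checklist" into an executable command. `main.py sandbox check` runs 5 independent probes: ① docker daemon reachable (CLI present + `docker version` rc=0) ② `zcbot-sandbox:latest` image present ③ `zcbot-sandbox-net` network present (missing is fine because lifespan ensures it automatically, so this item is warn, not err) ④ zcbot uid inside the image matches the host uid (`docker run --rm --entrypoint id` fetches the image uid and compares it with `os.getuid()`; skipped automatically on Windows) ⑤ the fs holding workspace/users/ is quota-capable (parses `findmnt --target ... -no FSTYPE,OPTIONS`; recognizes xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / other). `detect_fs_quota(path) -> (level, msg)` is factored out for reuse by lifespan: `web/app.py` runs it once at docker backend startup as well, printing a WARN to stdout (non-blocking); the application-level periodic scan stays in effect. **err vs warn boundary**: err = a root cause that makes docker backend startup fail fast (daemon / image / HOST_UID, exit 1); warn = does not block startup but must be cleared before opening to external users (network missing / fs not quota-capable, exit 0). `tests/test_sandbox_check.py` has 19 tests covering each branch + the aggregated exit code, mocking subprocess and sys.platform (`run_sandbox_check` uses a module-level lookup instead of a frozen `CHECKS` tuple so that unittest patches take effect); **full unittest discover 31/31 PASS**. RUN.md gains a "Pre-deployment check" subsection (meaning of the 5 `sandbox check` items) + a rewritten "Quota hardening" section (fs type → action mapping table + 4-step xfs upgrade) + 3 troubleshooting rows (sandbox init failed / fs quota warn / image not found). Rejected: (a) lifespan probe failure → fail-fast instead of WARN: at Step 5 the application-level periodic scan already exists, and OS-level quota is a hard requirement for external opening, not for dogfooding, so fail-fast would block dogfood startup; (b) a built-in `quota-set` subcommand in sandbox check that calls `xfs_quota` directly: the integer `<pid>` ↔ user_uuid mapping needs a tracking table, and sudo + /etc/projects changes are ops work, so Step 5 only lands the RUN.md instructions + command list, and the subcommand comes in the step just before external opening; (c) probing egress proxy status in sandbox check: Step 4 is not implemented, and a placeholder would suggest it has already landed. `DESIGN.md` unchanged (implements the existing §7.5 #4 protocol as-is); `RUN.md` updated as above.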
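- **Stage C Step 3: DockerExecutor integrated with AgentLoop + web lifespan (the `ZCBOT_SANDBOX_BACKEND=host|docker` env var selects the backend)**: `DockerExecutor` in `core/executor_docker.py` composes `HostExecutor` + `SandboxPool`; `call_tool` dispatches by trust domain per §7.5 #6: `shell` / `run_python` call `pool.ensure(user_id)` to get the container name, then run `docker exec --user 1000:1000 --workdir /workspace/<wd_name> -e PYTHONIOENCODING=utf-8 setsid bash -c <cmd>` / `python <script>` (`setsid` wraps the command in its own process group; the §7.5 #3 PGID kill protocol is left for Step 3b); the other tools (read/write/edit/glob/grep/load_skill/web_*/seedream/seedance) pass straight through to the host. **The run_python tmp .py lands on the host at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`**, which maps to `/workspace/.zcbot_tmp/<task_id>/<rand>.py` inside the container (visible automatically through the bind mount); the leading dot means the `/v1/files` API filters it out for free (`web/app.py:169` already blocks `startswith(".")`). **Cancel limitation accepted**: Popen.kill() kills the docker CLI client, and the server-side process inside the container does not terminate as a result (docker exec works this way by design); the first version relies on the 5-minute idle reaper / the `rm -f` in the next `ensure` as a fallback, and the trigger for upgrading is "a user reports cancelling but CPU is still burning". `core/sandbox/__init__.py` exposes the module-level singleton `init_pool` / `get_pool`; `agent_builder._resolve_executor` selects the backend from the env var, and on the docker path an uninitialised pool → fail-fast (no silent fallback to host, to avoid the "thought there was a sandbox but actually running bare" misjudgment); `web/app.py` lifespan startup hook: `init_pool(workspace/users)` + `shutdown_all` to clear orphans from the previous run + `asyncio.create_task(_reaper)` (every 60s `run_in_executor(pool.reap_idle)`); the shutdown hook cancels the reaper + `shutdown_all`. **Debt cleared in pool.py along the way**: `asyncio.Lock` → `threading.Lock` (the main caller is the web BG thread making synchronous tool calls, and an asyncio.Lock is bypassed by the ephemeral loop that every `asyncio.run` starts; the reaper becomes an async wrapper `loop.run_in_executor(pool.reap_idle)`, and an all-sync pool API is more direct). **Tests**: `tests/test_executor_docker.py` has 11 tests covering host passthrough / shell argv shape / run_python tmp file cleanup / timeout / cancel / unknown tool / caps.enable_run_python=False; `unittest discover -s tests` **12/12 PASS** (the original 1 test unchanged plus the 11 new ones). **Windows dogfood unchanged**: the default is `ZCBOT_SANDBOX_BACKEND=host`, so docker is untouched locally; the docker path only works on the Ubuntu deployment host, and the real-container smoke test still runs there using the 5 commands in the RUN.md "Sandbox (Stage C, Ubuntu)" section. `DESIGN.md` **unchanged** (implements the existing §7.5 #5 #6 protocol as-is); `RUN.md` gains the `ZCBOT_SANDBOX_BACKEND` env description + the startup prerequisites when switching to the docker backend. Rejected: (a) DockerExecutor wrapping `asyncio.run(pool.ensure)` in an ephemeral loop: asyncio.Lock is not shared across loops, so serialisation protection is lost, and each tool call adds ~5ms of loop create/destroy noise; making the pool synchronous costs less; (b) putting the `run_python` tmp .py inside the working directory: it pollutes the user's view, tmp files interfere when SKILL teaches the model to "list the working directory with glob", and crash leftovers mix with artifacts (see the §7.9 trade-off log; recording it there will be considered the next time the same issue comes up); (c) a separate host-side bind mount `<workspace>/.sandbox_tmp/<uid>/` mounted as `/tmp_scripts` in the container: one more mount adds complexity, and the single-bind-mount protocol stays simpler; (d) degrading to host when the docker backend fails: a missing sandbox means the security model has collapsed, and fail-fast matters more than "appearing to run", per the §7.5 hard rule "any missing item means the deployment is incomplete".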
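- **Stage C Step 2: Docker per-user containers + iptables blocklist (groundwork for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container exits, matching the §7.5 #1 hard rule "any missing item means Stage C is incomplete"; 6 IPv4 red-line ranges + tolerated degradation for the ::1 IPv6 loopback rule + per-IP DROP from the `ZCBOT_PG_IPS` env var; `exec sleep infinity` waits for `docker exec`). `core/sandbox/network.py` has a single function `ensure_network()`: `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins this network too); the `SandboxPool` class in `core/sandbox/pool.py` holds a per-user `asyncio.Lock` + an in-memory `_last_active` dict. The ensure path probes with inspect → running: return directly / exists but stopped: `rm -f` and restart (so iptables is re-applied) / absent: `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk sweeps + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything with the label (used at app startup to clear orphans from the previous run). Containers are named `zcbot-sandbox-<user_id>` (the standard UUID string with dashes, visually aligned with the mount path `<workspace>/users/<user_id>/`, so `docker ps | grep zcbot-sandbox-` shows active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: the implementation-level counterpart of §7.5 #5 "the interface shape must not leak Docker assumptions"; subprocess behaviour is transparent, adds no dependencies, and can be checked on the spot with `docker ps`; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleared by `shutdown_all` (called from the lifespan startup hook); (c) **the `--internal` network takes effect from Step 2**: the iptables OUTPUT rules act as defense-in-depth (the network layer already blocks outbound, and iptables rules are still added per protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, and an iptables ACCEPT exception plus a default DROP implement "default deny + only via proxy"; (d) **the NET_ADMIN cap is kept so PID 1 can run iptables as root**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` accepts no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root + empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so that no root path opens (enforced in code review). **Step 2 explicitly excludes**: ① AgentLoop integration (`agent_builder.py` untouched; the pool is an isolated module and gets plugged in at Step 3) ② moving shell/run_python into the container ③ egress proxy (Step 4) ④ the background reaper task (added together with the web lifespan hookup in Step 3). **Verification**: `from core.sandbox import ...` full import + ctor pass; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` fields are correct; `unittest discover` 1/1 PASS. Real-container verification runs on Ubuntu (the RUN.md "Sandbox (Stage C, Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (implements the existing §7.5 #1 #3 protocol as-is); `RUN.md` gains the "Sandbox (Stage C, Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 troubleshooting rows (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + reverse lookup by label: one extra `docker ps --filter` round-trip per exec, loss of readability, zero privacy gain; (b) per-task containers: DESIGN §7.5 already locks in the per-user shared mental model (multiple tasks of one user share materials), not reopened; (c) doing iptables with a docker `init container` pattern: Docker has no native support (that is k8s), syncing via compose v2 adds more complexity, and the NET_ADMIN + non-root exec pattern is more direct; (d) wiring into AgentLoop right away in Step 2: it could not be dogfooded once wired (no docker on local Windows) and would pollute the host path; the pool is committed in isolation and wired in at Step 3.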
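- **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (implements §7.5 #5)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` routes internally to the matching `Tool.execute` and folds the three error kinds (unknown / TypeError / raised exception) into `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes an `executor` instead of the `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)`, dropping the three former error emits (unknown/TypeError/Exception are now absorbed by the executor into ToolResult, leaving one emit). After assembling the tools dict, `agent_builder.py` wraps it in `HostExecutor(tools)` and passes that to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so when Step 3 switches to the docker backend `AgentLoop` needs zero changes and only `agent_builder.py` swaps `HostExecutor` for `DockerExecutor(host_tools=..., docker_tools={shell, run_python})`. **No behaviour change**: sanity import passes, `unittest discover -s tests` 1/1 PASS. `DESIGN.md` unchanged (implements the existing §7.5 #5 protocol as-is, no architectural drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env var waits until Step 3 introduces the docker backend). Rejected: (a) skipping the Executor abstraction and branching on `if backend=='docker'` inside `shell.py/run_python.py`: violates §7.5 #5 and scatters changes across the tool layer when switching to gVisor/Firecracker later; (b) an `exec(cmd, ctx)` primitive instead of a `call_tool(name, args, ctx)` dispatcher: it does not match the DESIGN signature, and host tools (read/web_*/seedream) are not "command" semantics; (c) a `cancel_check` callable in place of rebuilding ExecCtx: cancel_check is currently assigned by a setter after build, so a cached ctx would point at a stale one, and building ExecCtx per call is cheap since it is a dataclass.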
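
To make the Step 1 entry concrete, here is a minimal sketch of a backend-agnostic executor that folds the three error kinds into a single result type. It is an illustration only, not the project's code: the tools are plain callables rather than `Tool.execute` objects, the `schemas()` payload is a placeholder, and the specific exit codes (127 / 2 / 1) are assumptions chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ExecCtx:
    user_id: str
    task_id: str
    working_dir: str
    cancel_check: Callable[[], bool]


@dataclass
class ToolResult:
    content: str
    exit_code: int = 0


class Executor(ABC):
    """Backend-agnostic interface: nothing here mentions Docker."""

    @abstractmethod
    def schemas(self) -> List[dict]: ...

    @abstractmethod
    def call_tool(self, name: str, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult: ...


class HostExecutor(Executor):
    """In-process backend wrapping a dict of plain callables (assumption)."""

    def __init__(self, tools: Dict[str, Callable[..., str]]) -> None:
        self._tools = tools

    def schemas(self) -> List[dict]:
        # Placeholder schema: just the tool names.
        return [{"name": name} for name in self._tools]

    def call_tool(self, name: str, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        tool = self._tools.get(name)
        if tool is None:
            return ToolResult(f"[Error] unknown tool: {name}", exit_code=127)
        try:
            return ToolResult(tool(**args))
        except TypeError as e:  # wrong or missing arguments
            return ToolResult(f"[Error] bad arguments for {name}: {e}", exit_code=2)
        except Exception as e:  # anything the tool itself raised
            return ToolResult(f"[Error] {type(e).__name__}: {e}", exit_code=1)
```

With this shape the caller needs a single code path: it inspects `exit_code` and forwards `content`, whichever backend produced the result.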

60
RUN.md

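@@ -358,19 +358,60 @@ sudo -u zcbot docker rm -f zcbot-sandbox-$USER_ID
After Step 4 introduces the egress proxy, the full set of 5 red-team cases (metadata / loopback / cross-user / nohup
leftovers / 403 outside the allowlist) moves into `tests/test_sandbox_redteam.py` and runs automatically.
### Quota hardening (§7.5 #4, required before opening to external users)
### Pre-deployment check
The application-level disk quota (introduced in Step 5) stops ordinary overuse, **but a shared fs filled up between scans taking down the whole node**
strictly requires **xfs / ext4 project quota or zfs dataset quota**. Before deploying to a standalone server and opening to multiple tenants:
Run once before enabling `ZCBOT_SANDBOX_BACKEND=docker`: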
```bash
# Example (xfs project quota):
sudo mount -o remount,prjquota /opt
sudo xfs_quota -x -c "project -s -p /opt/zcbot/workspace/users/<uid> <pid>" /opt
sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
sudo -u zcbot .venv/bin/python main.py sandbox check
```
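The exact approach depends on the deployment fs (xfs recommended). Skipping this step amounts to "soft quota + trusting users not to fill the disk".
The output is `[ok] / [warn] / [err]` × 5 items + a summary `N/5 passed`; exit code 0 = ready to start / 1 = at least one err
to fix. The 5 items: ① docker daemon reachable ② `zcbot-sandbox:latest` image present ③
`zcbot-sandbox-net` network present (it also runs when missing; lifespan ensures it automatically) ④ zcbot
uid inside the image matches the host uid (a mismatch → every exec write to `/workspace` fails with EACCES) ⑤ the fs holding
`workspace/users/` is quota-capable.
At startup, lifespan also prints the WARN for item ⑤ to stdout (`[startup] [warn] fs quota ...`);
the application-level periodic scan stays in effect. **Item ⑤ has to be upgraded to OS-level quota only before opening to external users**.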
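
As a rough illustration of the exit-code rule described above (only `err` items fail the run, `warn` items do not), the aggregation can be sketched as follows. The function name `summarize` and the `(label, level)` tuple shape are assumptions made for this example and are not part of the project.

```python
from typing import Iterable, Tuple


def summarize(results: Iterable[Tuple[str, str]]) -> int:
    """Print one line per probe and return 0 unless any probe is 'err'.

    `results` holds (label, level) pairs with level in {"ok", "warn", "err"}.
    """
    results = list(results)
    passed = 0
    for label, level in results:
        print(f"[{level}] {label}")
        if level != "err":  # warn counts as passed: it does not block startup
            passed += 1
    print(f"[summary] {passed}/{len(results)} passed")
    return 0 if passed == len(results) else 1
```

A wrapper script could then gate a deploy on the return value, for example `sys.exit(summarize(probe_results))`.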
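### Quota hardening (§7.5 #4, required before opening to external users)
The application-level disk quota stops ordinary overuse, **but a shared fs filled up between scans taking down the whole node** strictly requires OS-level
quota. Item ⑤ of `sandbox check` probes the current fs state: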
| Probe result | Meaning | Action |
|---|---|---|
| `fs quota: xfs with prjquota on ...` | ok; `xfs_quota -x` can add per-user quotas directly | (none needed) |
| `fs quota: ext4 with project quota on ...` | ok; use `quota -P` | (none needed) |
| `fs quota: zfs on ...` | ok; `zfs set quota=` at the dataset level | (none needed) |
| `fs quota: xfs ... NO prjquota mount option` | fs supports it but it was not enabled at mount time | see the xfs steps below |
| `fs quota: ext4 ... NO project quota option` | same as above | `sudo tune2fs -O project,quota <dev>` + remount |
| `fs quota: btrfs ...` | qgroup setup is complex | for production, switch to a dedicated xfs partition, or verify `btrfs qgroup` yourself |
| `fs quota: tmpfs/overlay/... ` | usually Docker-in-Docker or local dev | production must mount a dedicated partition |
**xfs upgrade steps (recommended approach)**:
```bash
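# 1) Confirm which mount holds the workspace (assuming /opt is a dedicated xfs partition)
findmnt --target /opt/zcbot/workspace
# 2) Enable prjquota (also add it to /etc/fstab so it survives a reboot; XFS may only accept
#    quota options at initial mount, in which case use umount + mount instead of remount)
sudo mount -o remount,prjquota /opt
# 3) Register a project for the user (<pid> is a custom integer id; track its mapping to user_id in a table)
#    /etc/projects uses `id:path`, /etc/projid uses `name:id`
echo "1001:/opt/zcbot/workspace/users/<user_uuid>" | sudo tee -a /etc/projects
echo "zcbot_<user_uuid>:1001" | sudo tee -a /etc/projid
# 4) Initialise the project and set the hard block limit
sudo xfs_quota -x -c "project -s zcbot_<user_uuid>" /opt
sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt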
```
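The `<pid>` ↔ `user_uuid` mapping is maintained by hand (`/etc/projects` uses numeric ids, so zcbot needs a table to track them;
**before the first external opening, add a `main.py sandbox quota-set --user-id <uuid> --gb 10` subcommand**
that reads and writes /etc/projects + calls xfs_quota. That is the step right before real launch, after Steps 4 / 5, and is not done now).
Skipping this step amounts to "soft quota + trusting users not to fill the disk": enough for the dogfood + trusted-colleague allowlist stage, but
**a hard prerequisite for opening to external users**.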
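
Since the `<pid>` ↔ `user_uuid` bookkeeping is left as manual work, the following sketch shows one way the mapping and the two config lines could be derived. Everything here is hypothetical: `allocate_pid`, `quota_entries`, the in-memory dict, and the starting id 1001 are illustrations only, and a real implementation would persist the mapping in a table. The sketch relies on the `id:path` format of `/etc/projects` and the `name:id` format of `/etc/projid` documented for xfs_quota.

```python
from typing import Dict, Tuple

BASE_PID = 1001  # assumption: first project id handed out


def allocate_pid(mapping: Dict[str, int], user_uuid: str) -> int:
    """Return the existing project id for a user, or assign the next free one."""
    if user_uuid in mapping:
        return mapping[user_uuid]
    pid = max(mapping.values(), default=BASE_PID - 1) + 1
    mapping[user_uuid] = pid
    return pid


def quota_entries(
    user_uuid: str,
    pid: int,
    users_root: str = "/opt/zcbot/workspace/users",
) -> Tuple[str, str]:
    """Build the lines to append to /etc/projects and /etc/projid."""
    projects_line = f"{pid}:{users_root}/{user_uuid}"
    projid_line = f"zcbot_{user_uuid}:{pid}"
    return projects_line, projid_line
```

Keeping the allocation idempotent means a repeated call for the same user cannot register the same directory under two project ids.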
---
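@@ -386,6 +427,9 @@ sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
| With `--working-dir` set, the directory remains after the task is deleted | Two cases: ① the directory is not empty (contains user files): by design, never rmtree; clean up manually with `rm -rf <dir>`; ② an external `--working-dir` (absolute path stored in the DB): not cleaned automatically, to avoid deleting the user's external project by mistake. Inside ROOT + no other task references the same working_dir + FS empty → already rmdir'd automatically on DELETE task |
| `touch /workspace/x` inside the sandbox container reports `Permission denied` | Container uid 1000 differs from the uid of the host `zcbot` user (a bind mount keeps the host owner). Rebuild the image with `docker build --build-arg HOST_UID=$(id -u zcbot)` |
| Sandbox container does not start after build; `docker logs` shows an iptables error | The NET_ADMIN cap is missing (`--cap-add=NET_ADMIN` was left out) or the kernel does not support it (WSL2 / OpenVZ environments cannot run it). Ubuntu on bare metal / KVM works. Verify with `docker exec ... iptables -V` |
| Startup reports `ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: ...` | docker daemon not running / user not in the docker group / network create failed. Run `main.py sandbox check` first to see which item is err |
| `[startup] [warn] fs quota: <fstype> on ...` | The fs holding the workspace has no OS-level quota enabled. Ignore during dogfooding; before opening to external users, upgrade to xfs prjquota / ext4 project / zfs (see the "Quota hardening" section of RUN.md) |
| `docker run zcbot-sandbox:latest` reports `Unable to find image` | The image has not been built. `sudo -u zcbot docker build -f deploy/sandbox/Dockerfile --build-arg HOST_UID=$(id -u zcbot) --build-arg HOST_GID=$(id -g zcbot) -t zcbot-sandbox:latest .` |
| Export reports "nothing to export" | The task has no messages (a system message alone does not count); send a message first, then export |
| `NoSubtaskError: working_dir ... prefix nesting` | §7.4 no-subtask: the same user may not nest working_dirs (child / parent). For **multiple conversations in one project**, use **exactly the same** working_dir; otherwise make them siblings (same level) |
| curl cannot connect after `main.py web` starts | Check the proxy (`HTTP_PROXY` / `HTTPS_PROXY`): the local service is on 127.0.0.1, and a system proxy intercepting it returns 502. Temporarily `unset HTTP_PROXY HTTPS_PROXY` or use `curl --noproxy '*'`. Verify with `curl --noproxy '*' http://127.0.0.1:8765/healthz` |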

258
core/sandbox/check.py Normal file

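@@ -0,0 +1,258 @@
"""Sandbox pre-deployment check (`main.py sandbox check`).

Five independent probes, each printing `[ok]` / `[warn]` / `[err]`; the aggregate decides the exit code.
Every item must be `[ok]` before opening to external users.

Probes and their §7.5 protocol counterparts:
1. Docker daemon reachable -- required when ZCBOT_SANDBOX_BACKEND=docker
2. `zcbot-sandbox:latest` image present -- if missing, pool.ensure's docker run fails with "Unable to find image"
3. `zcbot-sandbox-net` network present -- missing is harmless (init_pool ensures it), but it is reported up front
4. Image HOST_UID matches the host zcbot uid -- a mismatch makes writes to /workspace fail with EACCES after exec
5. fs under user_root_base is quota-capable -- §7.5 #4, xfs prjquota / ext4 project / zfs; otherwise
   "filling the shared fs between scans" takes down other users on the node (an attacker writes far faster
   than the application-level periodic scan)
"""
from __future__ import annotations

import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Tuple

from .pool import DEFAULT_IMAGE
from .network import NETWORK_NAME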


# Plain-text level prefixes printed with print(), so no click context is required.
def _ok(msg: str) -> None:
    print(f"[ok] {msg}")


def _warn(msg: str) -> None:
    print(f"[warn] {msg}")


def _err(msg: str) -> None:
    print(f"[err] {msg}")


def _run(argv, timeout: int = 10) -> Tuple[int, str, str]:
    """Uniform subprocess.run wrapper. Missing CLI → returncode=127, with the reason in stderr."""
    if shutil.which(argv[0]) is None:
        return 127, "", f"{argv[0]} not found in PATH"
    try:
        r = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return r.returncode, r.stdout.strip(), r.stderr.strip()
    except subprocess.TimeoutExpired:
        return 124, "", f"timed out after {timeout}s"
    except Exception as e:
        return 1, "", f"{type(e).__name__}: {e}"


# -- Probes ------------------------------------------------
def check_docker_daemon() -> bool:
    rc, out, err = _run(["docker", "version", "--format", "{{.Server.Version}}"])
    if rc == 0 and out:
        _ok(f"docker daemon reachable (server={out})")
        return True
    if rc == 127:
        _err("docker CLI not found -- apt install docker.io / docker-ce")
    elif "permission denied" in err.lower():
        _err(f"docker daemon not reachable: {err} -- usermod -aG docker $USER + relogin")
    else:
        _err(f"docker daemon not reachable: {err or 'unknown'}")
    return False


def check_image_present() -> bool:
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, _, err = _run(["docker", "image", "inspect", image])
    if rc == 0:
        _ok(f"image present: {image}")
        return True
    _err(
        f"image not found: {image} -- "
        f"`docker build -f deploy/sandbox/Dockerfile "
        f"--build-arg HOST_UID=$(id -u) --build-arg HOST_GID=$(id -g) "
        f"-t {image} .`"
    )
    return False


def check_network_present() -> bool:
    rc, _, _ = _run(["docker", "network", "inspect", NETWORK_NAME])
    if rc == 0:
        _ok(f"network present: {NETWORK_NAME}")
        return True
    _warn(
        f"network missing: {NETWORK_NAME} -- lifespan startup ensures it automatically; "
        f"or create it manually with `docker network create --internal {NETWORK_NAME}`"
    )
    return True  # a warn does not count as a failure


def check_host_uid_alignment() -> bool:
    """Check that the zcbot uid inside the image matches the current host uid.

    A bind mount carries the host fs owner straight into the container. If `HOST_UID` was
    omitted at image build time, the container defaults to uid=1000; when the host account
    that actually runs the zcbot service has a different uid, writes to /workspace after
    exec fail with EACCES. This probe uses `docker run --rm --entrypoint id -u zcbot` to
    read the image uid and compares it with the host `os.getuid()` (assuming the check
    subcommand is run as the zcbot user).
    """
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, out, err = _run(
        ["docker", "run", "--rm", "--entrypoint", "id", image, "-u", "zcbot"]
    )
    if rc != 0:
        _warn(
            f"image uid check skipped: {err or 'unknown'} -- "
            f"if the image is not built yet, build it first and rerun"
        )
        return True
    try:
        image_uid = int(out)
    except ValueError:
        _warn(f"image uid unexpected output: {out!r}")
        return True
    if sys.platform == "win32":
        _warn(
            f"image zcbot uid={image_uid}; host uid check skipped on Windows "
            f"(only meaningful when check runs on the Linux deployment host)"
        )
        return True
    host_uid = os.getuid()  # type: ignore[attr-defined]
    if image_uid == host_uid:
        _ok(f"HOST_UID aligned: image zcbot uid={image_uid} == host uid={host_uid}")
        return True
    _err(
        f"HOST_UID mismatch: image zcbot uid={image_uid}, host uid={host_uid} -- "
        f"rebuild the image: `docker build --build-arg HOST_UID={host_uid} ...`"
    )
    return False


def detect_fs_quota(target: Path) -> Tuple[str, str]:
    """Detect whether the fs holding target is quota-capable; returns (level, msg).

    level is "ok" or "warn": fs quota is never treated as err (it must not block web startup).
    Shared by the CLI and lifespan: the CLI prints via _ok/_warn, lifespan via print.

    Recognized:
    - xfs: mount options contain `prjquota` or `pquota` → ok; otherwise warn (fs supports it but it is not enabled)
    - ext4: mount options contain `prjquota` or `project,quota` → ok
    - zfs: always ok (dataset quota goes through zfs set; not examined further here)
    - btrfs: warn that qgroup setup is complex
    - tmpfs / overlay / other: warn (typically Docker-in-Docker or local dev; not for production)
    """
    if sys.platform == "win32":
        return "warn", "fs quota check skipped on Windows (only meaningful on the Linux deployment host)"
    # findmnt ships with most Linux distributions (util-linux)
    rc, out, err = _run([
        "findmnt", "--target", str(target), "-no", "FSTYPE,OPTIONS",
    ])
    if rc != 0 or not out:
        return "warn", (
            f"fs quota check skipped: cannot detect fs for {target} "
            f"({err or 'findmnt missing'})"
        )
    parts = out.split()
    fstype = parts[0].lower() if parts else ""
    options = parts[1] if len(parts) > 1 else ""
    opts = set(options.split(","))
    if fstype == "xfs":
        if "prjquota" in opts or "pquota" in opts:
            return "ok", f"fs quota: xfs with prjquota on {target}"
        return "warn", (
            f"fs quota: xfs on {target} but NO prjquota mount option -- "
            f"`sudo mount -o remount,prjquota <mountpoint>` + `xfs_quota -x ...`"
        )
    if fstype == "ext4":
        if "prjquota" in opts or ("project" in opts and "quota" in opts):
            return "ok", f"fs quota: ext4 with project quota on {target}"
        return "warn", (
            f"fs quota: ext4 on {target} but NO project quota option -- "
            f"`tune2fs -O project,quota <dev>` + remount + `quota -P`"
        )
    if fstype == "zfs":
        return "ok", f"fs quota: zfs on {target} (dataset quota via `zfs set quota=...`)"
    if fstype == "btrfs":
        return "warn", (
            f"fs quota: btrfs on {target} -- qgroup setup is complex; xfs prjquota is "
            f"recommended for production. If btrfs is required, verify `btrfs qgroup` yourself"
        )
    return "warn", (
        f"fs quota: {fstype or '<unknown>'} on {target} -- "
        f"not a mainstream quota-capable type; move to a dedicated xfs/ext4/zfs partition "
        f"before opening to external users"
    )


def check_fs_quota_capable() -> bool:
    """CLI entry point: probe the fs holding workspace/users/. Always returns True (never err)."""
    from core.agent_builder import load_config, resolve_workspace
    try:
        cfg = load_config()
        workspace = resolve_workspace(None, cfg)
        target = (workspace / "users").resolve()
    except Exception as e:
        _warn(f"fs quota check: cannot resolve workspace path: {e}")
        return True
    level, msg = detect_fs_quota(target)
    if level == "ok":
        _ok(msg)
    else:
        _warn(msg)
    return True


# -- Aggregation entry point ---------------------------------------------
CHECK_NAMES = [
    ("docker daemon", "check_docker_daemon"),
    ("image present", "check_image_present"),
    ("network present", "check_network_present"),
    ("HOST_UID alignment", "check_host_uid_alignment"),
    ("fs quota capable", "check_fs_quota_capable"),
]


def run_sandbox_check() -> int:
    """Run every probe and return the exit code (0 = all ok or warn only; 1 = at least one err).

    err vs warn boundary:
    - err = a root cause that makes docker backend startup fail fast (daemon / image / HOST_UID)
    - warn = does not block startup but must be cleared before opening to external users
      (network / fs not quota-capable)

    Function references are looked up on the module at call time (not frozen into a CHECKS
    tuple), so that a unittest patch of `core.sandbox.check.check_xxx` affects this function.
    """
    print("--- sandbox deployment check ---\n")
    ok_count = 0
    module = sys.modules[__name__]
    for label, fn_name in CHECK_NAMES:
        fn = getattr(module, fn_name)
        try:
            if fn():
                ok_count += 1
        except Exception as e:
            _err(f"{label}: unexpected {type(e).__name__}: {e}")
    total = len(CHECK_NAMES)
    print()
    if ok_count == total:
        print(f"[summary] {ok_count}/{total} checks passed -- docker backend ready")
        return 0
    failed = total - ok_count
    print(
        f"[summary] {ok_count}/{total} passed, {failed} failed -- "
        f"fix the [err] items above before starting the docker backend"
    )
    return 1
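
The choice of a by-name lookup over a tuple of function references can be illustrated with a small standalone sketch. It is not the project's code: `checks` is a `types.SimpleNamespace` standing in for the module, and `run_frozen` / `run_by_name` are hypothetical names. It shows that references captured at definition time keep pointing at the original functions, while a `getattr` at call time sees whatever `unittest.mock.patch` installed.

```python
import types
from unittest.mock import patch


def _check_a() -> bool:
    return True


def _check_b() -> bool:
    return True


# Stand-in for the module that owns the check functions (assumption for the demo).
checks = types.SimpleNamespace(check_a=_check_a, check_b=_check_b)

FROZEN = (checks.check_a, checks.check_b)  # references captured once, up front
NAMES = ("check_a", "check_b")             # names resolved at call time


def run_frozen() -> int:
    return 0 if all(fn() for fn in FROZEN) else 1


def run_by_name() -> int:
    return 0 if all(getattr(checks, name)() for name in NAMES) else 1


with patch.object(checks, "check_a", return_value=False):
    frozen_rc = run_frozen()    # still calls the original function
    by_name_rc = run_by_name()  # calls the patched attribute

print(frozen_rc, by_name_rc)
```

Inside the `with` block the frozen tuple ignores the patch while the by-name variant reports the failure, which is the behaviour the summary tests rely on.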

20
main.py

@@ -198,5 +198,25 @@ def web(host: str, port: int, reload: bool) -> None:
    uvicorn.run(create_app(), host=host, port=port, log_level="info")


# ─────────────── Sandbox (Stage C pre-deployment check) ───────────────


@cli.group()
def sandbox() -> None:
    """Sandbox container deployment check (run once before enabling `ZCBOT_SANDBOX_BACKEND=docker`)."""


@sandbox.command("check")
def sandbox_check() -> None:
    """Check docker backend startup prerequisites (daemon / image / network / HOST_UID / fs quota).

    Non-blocking: each item prints `[ok]` / `[warn]` / `[err]` on its own, followed by a summary.
    Any `err` → exit 1; all ok / warn → exit 0. warn items do not block web startup, but
    **must be cleared before opening to external users** (see the DESIGN §7.5 rollout checklist).
    """
    from core.sandbox.check import run_sandbox_check

    rc = run_sandbox_check()
    sys.exit(rc)


if __name__ == "__main__":
    cli()

186
tests/test_sandbox_check.py Normal file

@@ -0,0 +1,186 @@
"""Unit tests for the `main.py sandbox check` probe functions.

Mocks subprocess and covers:
- the branches for daemon unreachable / image / network / uid mismatch
- detect_fs_quota decisions for xfs/ext4/zfs/btrfs/other + the prjquota mount option
- aggregated exit code: all ok / warn / err
"""
from __future__ import annotations

import sys
import unittest
from pathlib import Path
from unittest.mock import patch

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.sandbox.check import (
    check_docker_daemon,
    check_image_present,
    check_host_uid_alignment,
    detect_fs_quota,
    run_sandbox_check,
)


def _mk_run(returns):
    """Build a `_run` stand-in that returns the (rc, stdout, stderr) tuples from a list in call order."""
    iter_ret = iter(returns)

    def fake_run(argv, timeout=10):
        return next(iter_ret)

    return fake_run


class TestDaemonCheck(unittest.TestCase):
    def test_daemon_ok(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "24.0.7", "")])):
            self.assertTrue(check_docker_daemon())

    def test_daemon_cli_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "docker not found in PATH")])):
            self.assertFalse(check_docker_daemon())

    def test_daemon_permission_denied(self):
        with patch(
            "core.sandbox.check._run",
            _mk_run([(1, "", "Got permission denied while trying to connect")]),
        ):
            self.assertFalse(check_docker_daemon())


class TestImageCheck(unittest.TestCase):
    def test_image_present(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "[...]", "")])):
            self.assertTrue(check_image_present())

    def test_image_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(1, "", "No such image")])):
            self.assertFalse(check_image_present())


class TestHostUidAlignment(unittest.TestCase):
    def test_uid_aligned(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        host_uid = os.getuid()  # type: ignore[attr-defined]
        with patch(
            "core.sandbox.check._run",
            _mk_run([(0, str(host_uid), "")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_uid_mismatch(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        bad = os.getuid() + 1  # type: ignore[attr-defined]
        with patch("core.sandbox.check._run", _mk_run([(0, str(bad), "")])):
            self.assertFalse(check_host_uid_alignment())

    def test_image_not_built_yet(self):
        # docker run fails → warn, not err
        with patch(
            "core.sandbox.check._run",
            _mk_run([(125, "", "Unable to find image")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_skipped_on_windows(self):
        with patch("core.sandbox.check.sys") as mock_sys, \
                patch("core.sandbox.check._run", _mk_run([(0, "1000", "")])):
            mock_sys.platform = "win32"
            self.assertTrue(check_host_uid_alignment())


class TestDetectFsQuota(unittest.TestCase):
    """detect_fs_quota: does not depend on print; it just returns (level, msg), which is easy to assert on."""

    def test_xfs_with_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,prjquota,attr2", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt/zcbot/workspace/users"))
        self.assertEqual(level, "ok")
        self.assertIn("xfs with prjquota", msg)

    def test_xfs_without_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,attr2", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("NO prjquota", msg)

    def test_ext4_with_project_quota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "ext4 rw,prjquota", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "ok")
        self.assertIn("ext4 with project quota", msg)

    def test_zfs(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "zfs rw,xattr,noacl", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tank/zcbot"))
        self.assertEqual(level, "ok")
        self.assertIn("zfs", msg)

    def test_btrfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "btrfs rw,relatime", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("btrfs", msg)

    def test_tmpfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "tmpfs rw", "")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tmp"))
        self.assertEqual(level, "warn")

    def test_findmnt_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "findmnt not found in PATH")])), \
                patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("findmnt", msg)

    def test_windows_skipped(self):
        with patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "win32"
            level, msg = detect_fs_quota(Path("C:/"))
        self.assertEqual(level, "warn")
        self.assertIn("Windows", msg)


class TestSummaryExitCode(unittest.TestCase):
    """run_sandbox_check aggregation: err → exit 1; all ok / warn only → exit 0."""

    def test_all_ok_exits_zero(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=True), \
                patch("core.sandbox.check.check_image_present", return_value=True), \
                patch("core.sandbox.check.check_network_present", return_value=True), \
                patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
                patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 0)

    def test_any_err_exits_one(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=False), \
                patch("core.sandbox.check.check_image_present", return_value=True), \
                patch("core.sandbox.check.check_network_present", return_value=True), \
                patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
                patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 1)


if __name__ == "__main__":
    unittest.main()


@@ -509,9 +509,18 @@ def create_app() -> FastAPI:
         sandbox_reaper_task = None
         if sandbox_backend == "docker":
             from core.sandbox import init_pool
+            from core.sandbox.check import detect_fs_quota
             workspace = resolve_workspace(None, _cfg)
+            user_root_base = workspace / "users"
+            # §7.5 #4 fs quota probe: does not block startup (the application-level periodic scan
+            # already exists); it only prints a WARN as a reminder that xfs prjquota / ext4 project /
+            # zfs is required before opening to external users.
             try:
-                pool = init_pool(workspace / "users")
+                level, msg = detect_fs_quota(user_root_base.resolve())
+                print(f"[startup] {'[ok]' if level == 'ok' else '[warn]'} {msg}")
+            except Exception as e:
+                print(f"[startup] [warn] fs quota detect failed: {type(e).__name__}: {e}")
+            try:
+                pool = init_pool(user_root_base)
                 removed = pool.shutdown_all()
                 if removed:
                     print(f"[startup] swept {len(removed)} stale sandbox container(s)")