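Stage C Step 5: main.py sandbox check + lifespan fs quota WARN

- main.py sandbox check subcommand: 5 independent probes + aggregated exit code
  ① docker daemon reachable ② zcbot-sandbox:latest image present ③ zcbot-sandbox-net network present (warn, not err) ④ zcbot uid in the image matches the host uid ⑤ the fs holding workspace/users supports quota
- core/sandbox/check.py: detect_fs_quota(path) -> (level, msg) factored out so lifespan and the CLI share it; recognizes xfs+prjquota/ext4+project/zfs/btrfs/tmpfs/other
- web/app.py lifespan: when the docker backend is enabled, calls detect_fs_quota and prints a WARN to stdout (does not block startup; the application-level periodic scan stays in effect)
- err vs warn boundary: err = root causes of docker backend fail-fast (daemon/image/uid); warn = does not block startup but must be cleared before opening to external users (network missing/fs not quota-capable)
- run_sandbox_check uses a module-level getattr instead of a frozen CHECKS tuple, so that a unittest patch of core.sandbox.check.check_xxx takes effect
- tests/test_sandbox_check.py: 19 tests covering each branch + exit code aggregation; unittest discover 31/31 PASS
- RUN.md: adds a "Pre-deployment checks" subsection + rewrites "Quota hardening" (fs state → action mapping table + 4-step xfs upgrade) + 3 troubleshooting rows; DESIGN untouched

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>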
parent dfac0acfa6
commit 1a950dedb5
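The commit message says `run_sandbox_check` resolves each check through a module-level `getattr` rather than a frozen `CHECKS` tuple, so that `unittest` patches take effect. The sketch below illustrates why, under assumptions: `demo_check`, `check_a`, `check_b`, `run_frozen`, and `run_by_name` are hypothetical stand-ins built in memory, not names from the repository.

```python
import sys
import types
from unittest import mock

# Source of a hypothetical stand-in module; the real probes shell out to docker.
SRC = '''
import sys

def check_a():
    return True

def check_b():
    return False  # pretend this probe reports an err

# Variant 1: function objects captured once, at import time.
CHECKS_FROZEN = (check_a, check_b)

# Variant 2: only names are stored; functions are resolved on every run.
CHECK_NAMES = ("check_a", "check_b")

def run_frozen():
    results = [fn() for fn in CHECKS_FROZEN]
    return 0 if all(results) else 1

def run_by_name():
    mod = sys.modules[__name__]
    results = [getattr(mod, name)() for name in CHECK_NAMES]
    return 0 if all(results) else 1
'''

demo = types.ModuleType("demo_check")
exec(SRC, demo.__dict__)
sys.modules["demo_check"] = demo  # lets mock.patch import it by dotted name

with mock.patch("demo_check.check_b", return_value=True):
    frozen_rc = demo.run_frozen()    # the tuple still holds the original check_b
    by_name_rc = demo.run_by_name()  # getattr picks up the patched check_b

print(frozen_rc, by_name_rc)  # prints: 1 0
```

`mock.patch` replaces the attribute on the module object; a tuple built at import time keeps references to the original functions, so only the name-based lookup sees the patch.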
@ -2,7 +2,7 @@
 > Companion to `DESIGN.md`. This file records only phase status, deviations from decisions, file counts, and next steps. Each entry is 1-2 sentences: what was done + the key judgment; for details see `git log` / `git diff` / `DESIGN §7.9`.

-Last updated: 2026-05-26 (Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan reaper; the ZCBOT_SANDBOX_BACKEND env switches host/docker)
+Last updated: 2026-05-26 (Stage C Step 5: `main.py sandbox check` pre-deployment checks + lifespan fs quota WARN + reworked RUN.md quota hardening section)

 ---

@ -15,7 +15,7 @@
 | 5 | Eval Suite | ⏸ Not planned | Replaced by dogfooding; probe covers health checks |
 | 6 | Long-task engineering | 🟡 | task + recovery ✅; two-tier memory ✅; context compression not done |
 | 7 | Polish | ❌ | Docker sandbox / more skills |
-| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C Step 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires egress proxy + xfs project quota (§7.5 rollout checklist #2 #4)**. |
+| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C Step 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers) + Step 5 pre-deployment checks ✅ (`main.py sandbox check` + lifespan fs quota WARN)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires egress proxy + OS-level hardening with xfs project quota (§7.5 rollout checklist #2 #4)**. |

 ---

@ -23,6 +23,7 @@
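 ### 2026-05-26

+- **Stage C Step 5: `main.py sandbox check` pre-deployment checks + lifespan fs quota WARN**: OS-level quota is a hard requirement of §7.5 #4 for opening to external users (xfs prjquota / ext4 project quota / zfs dataset quota; otherwise "filling the shared fs between scans takes down the whole node"), and when a docker backend startup prerequisite (daemon / image / HOST_UID alignment) is wrong, lifespan fails fast and the traceback is costly to debug. This step therefore turns the "ops mental checklist" into an executable command. `main.py sandbox check` runs 5 independent probes: ① docker daemon reachable (CLI present + `docker version` rc=0) ② `zcbot-sandbox:latest` image present ③ `zcbot-sandbox-net` network present (missing is fine, lifespan ensures it automatically; this item is warn, not err) ④ zcbot uid inside the image matches the host uid (`docker run --rm --entrypoint id` reads the image uid and compares it with `os.getuid()`; skipped automatically on Windows) ⑤ the fs holding workspace/users/ supports quota (parses `findmnt --target ... -no FSTYPE,OPTIONS`; recognizes xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / other). `detect_fs_quota(path) -> (level, msg)` is factored out for reuse by lifespan: `web/app.py` runs it once at docker backend startup and prints a WARN to stdout (non-blocking); the application-level periodic scan stays in effect. **err vs warn boundary**: err = root causes that make docker backend startup fail fast (daemon / image / HOST_UID, exit 1); warn = does not block startup but must be cleared before opening to external users (network missing / fs not quota-capable, exit 0). `tests/test_sandbox_check.py` has 19 tests covering each branch + the aggregated exit code, mocking subprocess and sys.platform (`run_sandbox_check` uses a module-level lookup instead of a frozen `CHECKS` tuple so that unittest patches take effect); **full unittest discover 31/31 PASS**. RUN.md gains a "Pre-deployment checks" subsection (meaning of the 5 `sandbox check` items) + a rewritten "Quota hardening" section (fs type → action mapping table + 4-step xfs upgrade) + 3 troubleshooting rows (sandbox init failed / fs quota warn / image not found). Rejected: (a) lifespan probe failure → fail-fast instead of WARN: at Step 5 the application-level periodic scan already exists, and OS-level quota is a hard requirement for external opening, not for dogfooding, so fail-fast would block dogfood startup; (b) a built-in `quota-set` subcommand in sandbox check that calls `xfs_quota` directly: the `<pid>` integer ↔ user_uuid mapping needs a tracking table, and sudo + /etc/projects changes are ops work; Step 5 only lands the RUN.md explanation + command list, and the real implementation is the step right before external opening; (c) probing egress proxy status in sandbox check: Step 4 is not implemented, and a placeholder would suggest it has already landed. `DESIGN.md` unchanged (pure implementation of the existing §7.5 #4 protocol); `RUN.md` updated as above.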
 - **Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan (`ZCBOT_SANDBOX_BACKEND=host|docker` env switches the backend)**: `core/executor_docker.py` `DockerExecutor` composes `HostExecutor` + `SandboxPool`; `call_tool` dispatches by the §7.5 #6 trust domains: `shell` / `run_python` → `pool.ensure(user_id)` returns the container name, then `docker exec --user 1000:1000 --workdir /workspace/<wd_name> -e PYTHONIOENCODING=utf-8 setsid bash -c <cmd>` / `python <script>` (`setsid` wraps the command in its own process group; the §7.5 #3 PGID kill protocol is deferred to Step 3b); all other tools (read/write/edit/glob/grep/load_skill/web_*/seedream/seedance) pass straight through to the host. **The run_python tmp .py lands on the host at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`**, which maps to `/workspace/.zcbot_tmp/<task_id>/<rand>.py` inside the container (visible automatically through the bind mount); the leading dot makes the `/v1/files` API filter it naturally (`web/app.py:169` `startswith(".")` already blocks it). **Cancel limitation accepted**: Popen.kill() kills the docker CLI client, but the server-side process inside the container is not terminated by that (this is how docker exec is designed); the first version relies on the 5-minute idle reaper / `rm -f` on the next `ensure` as a backstop, and the upgrade trigger is "a user reports cancelling but CPU is still burning". `core/sandbox/__init__.py` exposes the module-level singleton `init_pool` / `get_pool`; `agent_builder._resolve_executor` picks the backend by env, and on the docker path an uninitialized pool → fail-fast (no silent fallback to host, to avoid the "assumed there was a sandbox but actually running bare" misjudgment); `web/app.py` lifespan startup hook: `init_pool(workspace/users)` + `shutdown_all` to clear orphans from the previous run + `asyncio.create_task(_reaper)` (every 60s `run_in_executor(pool.reap_idle)`); the shutdown hook cancels the reaper + `shutdown_all`. **Debt cleared in pool.py along the way**: `asyncio.Lock` → `threading.Lock` (the main caller is the web BG thread making synchronous tool calls; an asyncio.Lock would be bypassed by the ephemeral loop that each `asyncio.run` creates; the reaper becomes an async wrapper `loop.run_in_executor(pool.reap_idle)`, and a fully sync pool API is more direct). **Tests**: `tests/test_executor_docker.py` has 11 tests covering host passthrough / shell argv shape / run_python tmp file cleanup / timeout / cancel / unknown tool / caps.enable_run_python=False; `unittest discover -s tests` **12/12 PASS** (the original 1 test unchanged, 11 new tests added). **Zero change for Windows dogfooding**: the default is `ZCBOT_SANDBOX_BACKEND=host`, so nothing touches docker locally; the docker path only works on the Ubuntu deployment host, and the real container smoke test still follows the 5 commands in the RUN.md "Sandbox (Stage C, Ubuntu)" section on that host. `DESIGN.md` **unchanged** (pure implementation of the existing §7.5 #5 #6 protocol); `RUN.md` adds the `ZCBOT_SANDBOX_BACKEND` env description + the startup prerequisites for the docker backend. Rejected: (a) DockerExecutor wrapping `asyncio.run(pool.ensure)` in an ephemeral loop: an asyncio.Lock is not shared across loops, so serialization protection is lost, and each tool call adds ~5ms of loop create/destroy noise; making the pool sync is cheaper; (b) putting the `run_python` tmp .py inside the working directory: it pollutes the user's view, tmp files interfere when a SKILL teaches the model to "list the working directory with glob", and crash leftovers mix with outputs (see §7.9; a trade-off record will be considered the next time the same issue comes up); (c) a separate host-side bind mount `<workspace>/.sandbox_tmp/<uid>/` mounted as `/tmp_scripts` in the container: one more mount adds complexity, and the single-bind-mount protocol is more direct; (d) degrading to host when the docker backend fails: a missing sandbox = a collapsed security model, fail-fast matters more than "looks like it is running", and the §7.5 hard protocol states "any missing item means the deployment is not complete".
 - **Stage C Step 2: Docker per-user containers + iptables blocklist (groundwork for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + a non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container terminates, matching the §7.5 #1 hard protocol "any missing item means Stage C is not complete"; 6 IPv4 red-line ranges + ::1 IPv6 loopback with degradation tolerated + per-IP DROP from the `ZCBOT_PG_IPS` env; `exec sleep infinity` waits for `docker exec`). `core/sandbox/network.py` has a single function `ensure_network()`: `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins the same network); `core/sandbox/pool.py` class `SandboxPool` holds a per-user `asyncio.Lock` + an in-memory `_last_active` dict. The ensure path probes with inspect → running returns directly / exists-but-stopped is `rm -f`'d and restarted (so iptables is re-applied) / missing runs `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk cleanup + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything carrying the label (used at app startup to clear orphans from the previous run). Containers are named `zcbot-sandbox-<user_id>` (standard UUID string with dashes, visually aligned with the mount path `<workspace>/users/<user_id>/`; `docker ps | grep zcbot-sandbox-` shows active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: the implementation-level counterpart of §7.5 #5 "the interface shape does not leak Docker assumptions"; subprocess behavior is transparent, adds zero dependencies, and can be cross-checked on the spot with `docker ps`; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleaned up by `shutdown_all` (called in the lifespan startup hook); (c) **the `--internal` network takes effect from Step 2**: the iptables OUTPUT rules act as defense-in-depth (the network layer already blocks outbound, and iptables rules are still added per protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, an iptables ACCEPT exception is added, and the default changes to DROP, implementing "default deny + only via proxy"; (d) **the NET_ADMIN cap is kept so PID 1 (root) can run iptables**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` accepts no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root + empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so the root path never opens (enforced by code review). **Explicitly out of scope for Step 2**: ① AgentLoop integration (`agent_builder.py` untouched; the pool is an isolated module, wired in only at Step 3) ② switching shell/run_python to the container ③ egress proxy (Step 4) ④ reaper background task (added together with the web lifespan wiring in Step 3). **Verification**: `from core.sandbox import ...` full import + ctor pass; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` fields are correct; `unittest discover` 1/1 PASS. Real container verification runs on Ubuntu (the RUN.md "Sandbox (Stage C, Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (pure implementation of the existing §7.5 #1 #3 protocol); `RUN.md` adds the "Sandbox (Stage C, Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 troubleshooting rows (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + reverse lookup by label: one extra `docker ps --filter` round-trip per exec, lost readability, zero privacy gain; (b) per-task containers: DESIGN §7.5 has locked the per-user shared mental model (multiple tasks of one user share materials), not reopened; (c) a docker `init container` pattern for iptables: Docker has no native support (that is k8s), syncing through compose v2 adds complexity, and the NET_ADMIN + non-root exec pattern is more direct; (d) wiring into AgentLoop immediately in Step 2: it could not be dogfooded (no docker on local Windows) and would pollute the host path; the pool is committed in isolation and wired in at Step 3.
 - **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (lands §7.5 #5)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` dispatches internally to the matching `Tool.execute` and folds three error kinds (unknown / TypeError / raised exception) into `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes `executor` instead of the `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)` call and drops the three former error emits (unknown/TypeError/Exception are absorbed by the executor as ToolResult, leaving one emit). `agent_builder.py` wraps the assembled tools dict in `HostExecutor(tools)` and passes it to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so switching to the docker backend in Step 3 needs zero changes in `AgentLoop`, only replacing `HostExecutor` → `DockerExecutor(host_tools=..., docker_tools={shell, run_python})` in `agent_builder.py`. **Zero behavior change**: sanity import passes, `unittest discover -s tests` 1/1 PASS. `DESIGN.md` unchanged (pure implementation of the existing §7.5 #5 protocol, no architectural drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env is deferred until the docker backend arrives in Step 3). Rejected: (a) skipping the Executor abstraction and writing `if backend=='docker'` directly in `shell.py/run_python.py`: violates §7.5 #5, and a future switch to gVisor/Firecracker would scatter changes across the tool layer; (b) an `exec(cmd, ctx)` primitive instead of the `call_tool(name, args, ctx)` dispatcher: does not match the DESIGN signature, and host tools (read/web_*/seedream) do not have "command" semantics; (c) a `cancel_check` callable in place of rebuilding ExecCtx: cancel_check is currently assigned through a setter after build, so a cached ctx would point at a stale value, and building an ExecCtx per call is cheap because it is a dataclass.
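The Step 5 entry describes lifespan calling `detect_fs_quota` once at docker backend startup and printing a WARN without blocking. A minimal sketch of that control flow follows, assuming a generic async context manager; `make_lifespan` and `fake_detect_fs_quota` are hypothetical names, and the real `web/app.py` wiring is not shown in this commit page.

```python
import asyncio
from contextlib import asynccontextmanager
from pathlib import Path
from typing import Callable, List, Tuple


def fake_detect_fs_quota(target: Path) -> Tuple[str, str]:
    """Hypothetical stand-in for detect_fs_quota: always reports a non-quota fs."""
    return "warn", f"fs quota: overlay on {target}"


def make_lifespan(detect: Callable[[Path], Tuple[str, str]], target: Path,
                  backend: str, log: Callable[[str], None] = print):
    """Build a lifespan context manager that warns but never blocks startup."""
    @asynccontextmanager
    async def lifespan(app):
        if backend == "docker":
            level, msg = detect(target)
            if level != "ok":
                log(f"[startup] [warn] {msg}")  # warn only; startup continues
        yield  # the application would serve requests here
    return lifespan


async def demo() -> List[str]:
    lines: List[str] = []
    lifespan = make_lifespan(fake_detect_fs_quota, Path("/srv/users"),
                             "docker", lines.append)
    async with lifespan(None):
        lines.append("app running")
    return lines


startup_lines = asyncio.run(demo())
print(startup_lines)
```

Because the probe result only feeds a log call, a non-quota fs leaves startup untouched, which matches the rejected alternative (a) above: fail-fast here would block dogfooding.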
60
RUN.md
@ -358,19 +358,60 @@ sudo -u zcbot docker rm -f zcbot-sandbox-$USER_ID
 After Step 4 introduces the egress proxy, the full 5 red-team cases (metadata / loopback / cross-user / nohup
 leftovers / 403 outside the allowlist) will run automatically in `tests/test_sandbox_redteam.py`.

-### Quota hardening (§7.5 #4, required before external opening)
+### Pre-deployment checks

-The application-level disk quota (introduced in Step 5) stops ordinary overuse, **but "filling the shared fs between scans takes down the whole node"**
-strictly requires **xfs / ext4 project quota or zfs dataset quota**. Before deploying to a standalone server + opening to multiple tenants:
+Run this once before switching to `ZCBOT_SANDBOX_BACKEND=docker`:

 ```bash
-# Example (xfs project quota):
-sudo mount -o remount,prjquota /opt
-sudo xfs_quota -x -c "project -s -p /opt/zcbot/workspace/users/<uid> <pid>" /opt
-sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
+sudo -u zcbot .venv/bin/python main.py sandbox check
 ```

-The exact approach depends on the deployment fs (xfs recommended); skipping this step amounts to "soft quota + trusting users not to fill the disk".
+The output looks like `[ok] / [warn] / [err]` × 5 items + a summary `N/5 passed`; exit code 0 = ready to start / 1 = an err
+needs fixing. The 5 items are: ① docker daemon reachable ② `zcbot-sandbox:latest` image present ③
+`zcbot-sandbox-net` network present (it runs even if missing, lifespan ensures it automatically) ④ zcbot
+uid inside the image matches the host uid (mismatch → every exec write to `/workspace` fails with EACCES) ⑤ the fs holding
+`workspace/users/` supports quota.
+
+At startup, lifespan also prints the item ⑤ WARN to stdout (`[startup] [warn] fs quota ...`), and
+the application-level periodic scan stays in effect; **item ⑤ has to be upgraded to OS-level quota only before opening to external users**.
+
+### Quota hardening (§7.5 #4, required before external opening)
+
+The application-level disk quota stops ordinary overuse, **but "filling the shared fs between scans takes down the whole node"** strictly requires OS-level
+quota. Item ⑤ of `sandbox check` probes the current fs state:
+
+| Probe result | Meaning | Action |
+|---|---|---|
+| `fs quota: xfs with prjquota on ...` | ok; per-user quotas can be added directly with `xfs_quota -x` | (none needed) |
+| `fs quota: ext4 with project quota on ...` | ok; use `quota -P` | (none needed) |
+| `fs quota: zfs on ...` | ok; use `zfs set quota=` at the dataset level | (none needed) |
+| `fs quota: xfs ... NO prjquota mount option` | The fs supports it but it was not enabled at mount | See the xfs steps below |
+| `fs quota: ext4 ... NO project quota option` | Same as above | `sudo tune2fs -O project,quota <dev>` + remount |
+| `fs quota: btrfs ...` | qgroup configuration is complex | For production, prefer a dedicated xfs partition, or validate `btrfs qgroup` yourself |
+| `fs quota: tmpfs/overlay/... ` | Usually Docker-in-Docker or local dev | Production must mount a dedicated partition |
+
+**xfs upgrade steps (recommended approach)**:
+
+```bash
+# 1) Confirm which mount holds the workspace (assuming /opt is a dedicated xfs partition)
+findmnt --target /opt/zcbot/workspace
+
+# 2) Enable prjquota (also add it to /etc/fstab so it survives a reboot)
+#    Note: XFS applies quota options at mount time; if prjquota does not show up in
+#    findmnt after the remount, unmount and mount the partition again.
+sudo mount -o remount,prjquota /opt
+
+# 3) Add a project quota for a user (the project id, 1001 here, is a custom integer; track its mapping to user_id in a table)
+#    /etc/projects entries use the form <id>:<path>
+echo "1001:/opt/zcbot/workspace/users/<user_uuid>" | sudo tee -a /etc/projects
+echo "zcbot_<user_uuid>:1001" | sudo tee -a /etc/projid
+sudo xfs_quota -x -c "project -s zcbot_<user_uuid>" /opt
+sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
+```
+
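+The `<pid>` ↔ `user_uuid` mapping is maintained by hand (`/etc/projects` uses numeric ids, so zcbot needs a table to track them;
+**before the first external opening, add a `main.py sandbox quota-set --user-id <uuid> --gb 10` subcommand**
+that reads and writes /etc/projects + calls xfs_quota; this is the step right before going live after Step 4 / 5, not done now).
+
+Skipping this step amounts to "soft quota + trusting users not to fill the disk": enough for the dogfood + trusted-colleague allowlist stage, but
+**a hard prereq for opening to external users**.
+
 ---
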
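The RUN.md text above leaves the project id ↔ `user_uuid` mapping to be tracked by hand. As a rough illustration of what such tracking could look like, the sketch below allocates integer ids and renders the file entries and `xfs_quota` commands without executing anything. `allocate_project_id` and `quota_plan` are hypothetical helpers, the in-memory dict stands in for whatever table zcbot would use, and the `<id>:<path>` and `<name>:<id>` line formats follow the xfs_quota manual page.

```python
from typing import Dict


def allocate_project_id(mapping: Dict[str, int], user_uuid: str,
                        first_id: int = 1001) -> int:
    """Return the user's project id, assigning the next free integer if needed."""
    if user_uuid not in mapping:
        mapping[user_uuid] = max(mapping.values(), default=first_id - 1) + 1
    return mapping[user_uuid]


def quota_plan(user_uuid: str, project_id: int, gb: int = 10,
               users_root: str = "/opt/zcbot/workspace/users",
               mountpoint: str = "/opt") -> Dict[str, object]:
    """Render the file entries and commands for one user; nothing is executed."""
    name = f"zcbot_{user_uuid}"
    return {
        "projects_line": f"{project_id}:{users_root}/{user_uuid}",  # for /etc/projects
        "projid_line": f"{name}:{project_id}",                      # for /etc/projid
        "commands": [
            ["xfs_quota", "-x", "-c", f"project -s {name}", mountpoint],
            ["xfs_quota", "-x", "-c", f"limit -p bhard={gb}g {name}", mountpoint],
        ],
    }


mapping: Dict[str, int] = {}
pid_a = allocate_project_id(mapping, "user-a")
pid_b = allocate_project_id(mapping, "user-b")
plan = quota_plan("user-a", pid_a)
print(plan["projects_line"])  # prints: 1001:/opt/zcbot/workspace/users/user-a
```

Keeping the rendering separate from execution lets an operator review the lines before anything is written with sudo, which fits the commit's decision to leave the actual `quota-set` subcommand for later.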
@ -386,6 +427,9 @@ sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
 | Directory remains after deleting a task that used `--working-dir` | Two cases: ① the directory is non-empty (has user files): by design it is never rmtree'd, clean it manually with `rm -rf <dir>`; ② external `--working-dir` (the DB stores an absolute path): not cleaned automatically, to avoid deleting a user's external project by mistake. Inside ROOT + no other task references the same working_dir + FS empty → already rmdir'd automatically on DELETE task |
 | `touch /workspace/x` inside the sandbox container reports `Permission denied` | Container uid 1000 differs from the uid of the host `zcbot` user (bind mounts keep host ownership). Rebuild the image with `docker build --build-arg HOST_UID=$(id -u zcbot)` |
 | Sandbox container fails to start after build, `docker logs` shows iptables errors | Missing NET_ADMIN cap (`--cap-add=NET_ADMIN` was omitted) or the kernel lacks support (WSL2 / OpenVZ environments cannot run it). Ubuntu on bare metal / KVM works. Verify: `docker exec ... iptables -V` |
+| Startup reports `ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: ...` | docker daemon not running / user not in the docker group / network create failed. Run `main.py sandbox check` first to see which item is err |
+| `[startup] [warn] fs quota: <fstype> on ...` | The fs holding the workspace has no OS-level quota enabled. Ignore it during dogfooding; before opening to external users, upgrade to xfs prjquota / ext4 project / zfs (see the RUN.md "Quota hardening" section) |
+| `docker run zcbot-sandbox:latest` reports `Unable to find image` | The image has not been built. `sudo -u zcbot docker build -f deploy/sandbox/Dockerfile --build-arg HOST_UID=$(id -u zcbot) --build-arg HOST_GID=$(id -g zcbot) -t zcbot-sandbox:latest .` |
 | Export reports "无可导出内容" (nothing to export) | The task has no messages (a system message alone does not count); send a message first, then export |
 | `NoSubtaskError: working_dir ... 前缀嵌套` (prefix nesting) | §7.4 no-subtask: nested working_dir (child / parent) is not allowed for the same user. **Multiple conversations in one project** must use the **exact same** working_dir; otherwise make them siblings (same level) |
 | curl cannot connect after `main.py web` starts | Check the proxy (`HTTP_PROXY` / `HTTPS_PROXY`): the local service is on 127.0.0.1, and interception by a system proxy yields 502. Temporarily `unset HTTP_PROXY HTTPS_PROXY` or use `curl --noproxy '*'`. Verify: `curl --noproxy '*' http://127.0.0.1:8765/healthz` |

@ -0,0 +1,258 @@
+"""Sandbox pre-deployment checks (`main.py sandbox check`).
+
+Runs 5 independent probes, each printing `[ok]` / `[warn]` / `[err]`, then returns an aggregated exit code.
+Every item must be `[ok]` before opening to external users.
+
+How the probes map to the §7.5 protocol:
+  1. Docker daemon reachable -- required to enable ZCBOT_SANDBOX_BACKEND=docker
+  2. `zcbot-sandbox:latest` image present -- if missing, docker run in pool.ensure fails with "Unable to find image"
+  3. `zcbot-sandbox-net` network present -- harmless if missing (init_pool ensures it automatically), but pre-warms it
+  4. Image HOST_UID matches the host zcbot uid -- a mismatch causes EACCES on writes to /workspace after exec
+  5. user_root_base fs type supports quota -- §7.5 #4, xfs prjquota / ext4 project / zfs; otherwise
+     "filling the shared fs between scans" takes down other users on the same node (an attacker fills the disk much faster than the application-level periodic scan runs)
+"""
+from __future__ import annotations
+
+import os
+import shutil
+import subprocess
+import sys
+from pathlib import Path
+from typing import Tuple
+
+from .pool import DEFAULT_IMAGE
+from .network import NETWORK_NAME
+
+
+# Status printers: plain `[ok]` / `[warn]` / `[err]` prefixes via print() (no ANSI colors, no click context needed)
+def _ok(msg: str) -> None:
+    print(f"[ok] {msg}")
+
+
+def _warn(msg: str) -> None:
+    print(f"[warn] {msg}")
+
+
+def _err(msg: str) -> None:
+    print(f"[err] {msg}")
+
+
+def _run(argv, timeout: int = 10) -> Tuple[int, str, str]:
+    """Uniform subprocess.run wrapper. A missing executable (e.g. the docker CLI) → returncode=127, with the reason in stderr."""
+    if shutil.which(argv[0]) is None:
+        return 127, "", f"{argv[0]} not found in PATH"
+    try:
+        r = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
+        return r.returncode, r.stdout.strip(), r.stderr.strip()
+    except subprocess.TimeoutExpired:
+        return 124, "", f"timed out after {timeout}s"
+    except Exception as e:
+        return 1, "", f"{type(e).__name__}: {e}"
+
+
+# -- Probes ------------------------------------------------
+
+def check_docker_daemon() -> bool:
+    rc, out, err = _run(["docker", "version", "--format", "{{.Server.Version}}"])
+    if rc == 0 and out:
+        _ok(f"docker daemon reachable (server={out})")
+        return True
+    if rc == 127:
+        _err("docker CLI not found -- apt install docker.io / docker-ce")
+    elif "permission denied" in err.lower():
+        _err(f"docker daemon not reachable: {err} -- usermod -aG docker $USER + relogin")
+    else:
+        _err(f"docker daemon not reachable: {err or 'unknown'}")
+    return False
+
+
+def check_image_present() -> bool:
+    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
+    rc, _, err = _run(["docker", "image", "inspect", image])
+    if rc == 0:
+        _ok(f"image present: {image}")
+        return True
+    _err(
+        f"image not found: {image} -- "
+        "`docker build -f deploy/sandbox/Dockerfile "
+        "--build-arg HOST_UID=$(id -u) --build-arg HOST_GID=$(id -g) "
+        f"-t {image} .`"
+    )
+    return False
+
+
+def check_network_present() -> bool:
+    rc, _, _ = _run(["docker", "network", "inspect", NETWORK_NAME])
+    if rc == 0:
+        _ok(f"network present: {NETWORK_NAME}")
+        return True
+    _warn(
+        f"network missing: {NETWORK_NAME} -- lifespan startup ensures it automatically; "
+        f"or create it manually with `docker network create --internal {NETWORK_NAME}`"
+    )
+    return True  # a warn does not count as a failure
+
+
+def check_host_uid_alignment() -> bool:
+    """Check that the zcbot uid inside the image matches the current host uid.
+
+    A bind mount carries the host fs ownership straight into the container. If `HOST_UID`
+    was not passed at image build time, the container defaults to uid=1000; if the host
+    account that runs the zcbot service has uid≠1000, every exec write to /workspace
+    fails with EACCES. This runs `docker run --rm --entrypoint id <image> -u zcbot` to
+    read the image uid and compares it with the host `os.getuid()` (assuming the zcbot
+    user runs the check subcommand).
+    """
+    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
+    rc, out, err = _run(
+        ["docker", "run", "--rm", "--entrypoint", "id", image, "-u", "zcbot"]
+    )
+    if rc != 0:
+        _warn(
+            f"image uid check skipped: {err or 'unknown'} -- "
+            "if the image is not built yet, build it first and rerun"
+        )
+        return True
+
+    try:
+        image_uid = int(out)
+    except ValueError:
+        _warn(f"image uid unexpected output: {out!r}")
+        return True
+
+    if sys.platform == "win32":
+        _warn(
+            f"image zcbot uid={image_uid}; host uid check skipped on Windows "
+            "(the check is only meaningful on the Linux deployment host)"
+        )
+        return True
+
+    host_uid = os.getuid()  # type: ignore[attr-defined]
+    if image_uid == host_uid:
+        _ok(f"HOST_UID aligned: image zcbot uid={image_uid} == host uid={host_uid}")
+        return True
+    _err(
+        f"HOST_UID mismatch: image zcbot uid={image_uid}, host uid={host_uid} -- "
+        f"rebuild the image with `docker build --build-arg HOST_UID={host_uid} ...`"
+    )
+    return False
+
+
+def detect_fs_quota(target: Path) -> Tuple[str, str]:
+    """Detect whether the fs holding target supports quota; returns (level, msg).
+
+    level ∈ {"ok", "warn"}: fs quota is never treated as err (it must not block web startup).
+    Shared by the CLI and lifespan: the CLI prints through _ok/_warn, lifespan uses print.
+
+    Recognized:
+    - xfs: mount options contain `prjquota` or `pquota` → ok; otherwise warn (supported by the fs but not enabled)
+    - ext4: mount options contain `prjquota`, or both `project` and `quota` → ok
+    - zfs: always ok (dataset quota is set via zfs set; not inspected further here)
+    - btrfs: warn, quota groups are complex
+    - tmpfs / overlay / other: warn (typically Docker-in-Docker or local dev; should not occur in production)
+    """
+    if sys.platform == "win32":
+        return "warn", "fs quota check skipped on Windows (only meaningful on the Linux deployment host)"
+
+    # findmnt ships with most Linux distributions (util-linux)
+    rc, out, err = _run([
+        "findmnt", "--target", str(target), "-no", "FSTYPE,OPTIONS",
+    ])
+    if rc != 0 or not out:
+        return "warn", (
+            f"fs quota check skipped: cannot detect fs for {target} "
+            f"({err or 'findmnt missing'})"
+        )
+
+    # Output looks like "<fstype> <comma-separated mount options>"
+    parts = out.split()
+    fstype = parts[0].lower() if parts else ""
+    options = parts[1] if len(parts) > 1 else ""
+    opts = set(options.split(","))
+
+    if fstype == "xfs":
+        if "prjquota" in opts or "pquota" in opts:
+            return "ok", f"fs quota: xfs with prjquota on {target}"
+        return "warn", (
+            f"fs quota: xfs on {target} but NO prjquota mount option -- "
+            "`sudo mount -o remount,prjquota <mountpoint>` + `xfs_quota -x ...`"
+        )
+    if fstype == "ext4":
+        if "prjquota" in opts or ("project" in opts and "quota" in opts):
+            return "ok", f"fs quota: ext4 with project quota on {target}"
+        return "warn", (
+            f"fs quota: ext4 on {target} but NO project quota option -- "
+            "`tune2fs -O project,quota <dev>` + remount + `quota -P`"
+        )
+    if fstype == "zfs":
+        return "ok", f"fs quota: zfs on {target} (dataset quota via `zfs set quota=...`)"
+    if fstype == "btrfs":
+        return "warn", (
+            f"fs quota: btrfs on {target} -- qgroup setup is complex; xfs prjquota is "
+            "recommended for production; if btrfs is required, validate `btrfs qgroup` yourself"
+        )
+    return "warn", (
+        f"fs quota: {fstype or '<unknown>'} on {target} -- "
+        "not a mainstream quota-capable type; move to a dedicated xfs/ext4/zfs partition before opening to external users"
+    )
+
+
def check_fs_quota_capable() -> bool:
    """CLI entry point: probe the filesystem holding workspace/users/. Always returns True (never an err)."""
    from core.agent_builder import load_config, resolve_workspace

    try:
        cfg = load_config()
        workspace = resolve_workspace(None, cfg)
        target = (workspace / "users").resolve()
    except Exception as e:
        _warn(f"fs quota check: cannot resolve workspace path: {e}")
        return True

    level, msg = detect_fs_quota(target)
    if level == "ok":
        _ok(msg)
    else:
        _warn(msg)
    return True


# -- Summary entry point ---------------------------------------------

CHECK_NAMES = [
    ("docker daemon", "check_docker_daemon"),
    ("image present", "check_image_present"),
    ("network present", "check_network_present"),
    ("HOST_UID alignment", "check_host_uid_alignment"),
    ("fs quota capable", "check_fs_quota_capable"),
]


def run_sandbox_check() -> int:
    """Run every probe and return an exit code (0 = all ok or warn only; 1 = at least one err).

    err vs warn:
    - err = a root cause that makes the docker backend fail fast at startup
      (daemon / image / HOST_UID)
    - warn = does not block startup but must be cleared before opening to external users
      (network missing / fs not quota-capable)

    Functions are looked up on the module at call time (not frozen into a CHECKS tuple),
    so that a unittest patch of `core.sandbox.check.check_xxx` takes effect here.
    """
    print("--- sandbox deployment check ---\n")
    ok_count = 0
    module = sys.modules[__name__]
    for label, fn_name in CHECK_NAMES:
        fn = getattr(module, fn_name)
        try:
            if fn():
                ok_count += 1
        except Exception as e:
            _err(f"{label}: unexpected {type(e).__name__}: {e}")
    total = len(CHECK_NAMES)
    print()
    if ok_count == total:
        print(f"[summary] {ok_count}/{total} checks passed -- docker backend ready")
        return 0
    failed = total - ok_count
    print(
        f"[summary] {ok_count}/{total} passed, {failed} failed -- "
        f"fix the [err] items above before enabling the docker backend"
    )
    return 1
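The name-based lookup described in the `run_sandbox_check` docstring can be illustrated in isolation. The sketch below is not the project's code: `demo`, `check_a`, `FROZEN_CHECKS`, `run_frozen` and `run_by_name` are hypothetical names, and a `types.ModuleType` object stands in for `core.sandbox.check`. Under those assumptions it shows why a tuple of function objects would not see a `unittest.mock` patch, while a `getattr` lookup at call time does.

```python
import types
from typing import Tuple
from unittest.mock import patch

# Hypothetical stand-in for a module such as core.sandbox.check.
demo = types.ModuleType("demo_checks")


def check_a() -> bool:
    """A probe that always passes."""
    return True


demo.check_a = check_a

# Variant 1: function objects captured once, at definition time.
FROZEN_CHECKS = (demo.check_a,)

# Variant 2: only names are stored; the function is resolved on every run.
CHECK_NAMES = ("check_a",)


def run_frozen() -> int:
    """Exit code computed from the captured function objects."""
    return 0 if all(fn() for fn in FROZEN_CHECKS) else 1


def run_by_name() -> int:
    """Exit code computed by resolving each name on the module at call time."""
    return 0 if all(getattr(demo, name)() for name in CHECK_NAMES) else 1


def demo_patch_visibility() -> Tuple[int, int]:
    """Patch the module attribute and report what each variant observes."""
    with patch.object(demo, "check_a", return_value=False):
        # run_frozen still calls the original function; run_by_name sees the mock.
        return run_frozen(), run_by_name()
```

Calling `demo_patch_visibility()` returns `(0, 1)`: only the name-based runner reports the patched failure. Once the `with` block exits, both runners return `0` again.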
20
main.py
@@ -198,5 +198,25 @@ def web(host: str, port: int, reload: bool) -> None:
     uvicorn.run(create_app(), host=host, port=port, log_level="info")
 
 
+# ─────────────── Sandbox (Stage C pre-deployment check) ───────────────
+
+@cli.group()
+def sandbox() -> None:
+    """Sandbox container deployment check (run once before enabling `ZCBOT_SANDBOX_BACKEND=docker`)."""
+
+
+@sandbox.command("check")
+def sandbox_check() -> None:
+    """Check the docker backend startup prerequisites (daemon / image / network / HOST_UID / fs quota).
+
+    Non-blocking: each item prints its own `[ok]` / `[warn]` / `[err]`, followed by a summary.
+    Any `err` exits 1; all ok or warn only exits 0. Warn items do not block web startup, but
+    **must be cleared before opening to external users** (see the DESIGN §7.5 rollout checklist).
+    """
+    from core.sandbox.check import run_sandbox_check
+    rc = run_sandbox_check()
+    sys.exit(rc)
+
+
 if __name__ == "__main__":
     cli()
@@ -0,0 +1,186 @@
"""Unit tests for the `main.py sandbox check` probe functions.

Mocks subprocess and verifies:
- the branches for daemon unreachable / image missing / network missing / uid mismatch
- detect_fs_quota decisions for xfs/ext4/zfs/btrfs/other plus the prjquota mount option
- the summary exit code: all ok / warn only / any err
"""
from __future__ import annotations

import sys
import unittest
from pathlib import Path
from unittest.mock import patch

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.sandbox.check import (
    check_docker_daemon,
    check_image_present,
    check_host_uid_alignment,
    detect_fs_quota,
    run_sandbox_check,
)


def _mk_run(returns):
    """Build a `_run` stand-in: returns the (rc, stdout, stderr) items of the list in call order."""
    iter_ret = iter(returns)

    def fake_run(argv, timeout=10):
        return next(iter_ret)
    return fake_run


class TestDaemonCheck(unittest.TestCase):
    def test_daemon_ok(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "24.0.7", "")])):
            self.assertTrue(check_docker_daemon())

    def test_daemon_cli_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "docker not found in PATH")])):
            self.assertFalse(check_docker_daemon())

    def test_daemon_permission_denied(self):
        with patch(
            "core.sandbox.check._run",
            _mk_run([(1, "", "Got permission denied while trying to connect")]),
        ):
            self.assertFalse(check_docker_daemon())


class TestImageCheck(unittest.TestCase):
    def test_image_present(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "[...]", "")])):
            self.assertTrue(check_image_present())

    def test_image_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(1, "", "No such image")])):
            self.assertFalse(check_image_present())


class TestHostUidAlignment(unittest.TestCase):
    def test_uid_aligned(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        host_uid = os.getuid()  # type: ignore[attr-defined]
        with patch(
            "core.sandbox.check._run",
            _mk_run([(0, str(host_uid), "")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_uid_mismatch(self):
        if sys.platform == "win32":
            self.skipTest("getuid not on Windows")
        import os
        bad = os.getuid() + 1  # type: ignore[attr-defined]
        with patch("core.sandbox.check._run", _mk_run([(0, str(bad), "")])):
            self.assertFalse(check_host_uid_alignment())

    def test_image_not_built_yet(self):
        # docker run fails → warn, not err
        with patch(
            "core.sandbox.check._run",
            _mk_run([(125, "", "Unable to find image")]),
        ):
            self.assertTrue(check_host_uid_alignment())

    def test_skipped_on_windows(self):
        with patch("core.sandbox.check.sys") as mock_sys, \
             patch("core.sandbox.check._run", _mk_run([(0, "1000", "")])):
            mock_sys.platform = "win32"
            self.assertTrue(check_host_uid_alignment())


class TestDetectFsQuota(unittest.TestCase):
    """detect_fs_quota: does not depend on print; returns a plain (level, msg) that is easy to assert on."""

    def test_xfs_with_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,prjquota,attr2", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt/zcbot/workspace/users"))
        self.assertEqual(level, "ok")
        self.assertIn("xfs with prjquota", msg)

    def test_xfs_without_prjquota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,attr2", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("NO prjquota", msg)

    def test_ext4_with_project_quota(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "ext4 rw,prjquota", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "ok")
        self.assertIn("ext4 with project quota", msg)

    def test_zfs(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "zfs rw,xattr,noacl", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tank/zcbot"))
        self.assertEqual(level, "ok")
        self.assertIn("zfs", msg)

    def test_btrfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "btrfs rw,relatime", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("btrfs", msg)

    def test_tmpfs_warns(self):
        with patch("core.sandbox.check._run", _mk_run([(0, "tmpfs rw", "")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/tmp"))
        self.assertEqual(level, "warn")

    def test_findmnt_missing(self):
        with patch("core.sandbox.check._run", _mk_run([(127, "", "findmnt not found in PATH")])), \
             patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "linux"
            level, msg = detect_fs_quota(Path("/opt"))
        self.assertEqual(level, "warn")
        self.assertIn("findmnt", msg)

    def test_windows_skipped(self):
        with patch("core.sandbox.check.sys") as mock_sys:
            mock_sys.platform = "win32"
            level, msg = detect_fs_quota(Path("C:/"))
        self.assertEqual(level, "warn")
        self.assertIn("Windows", msg)


class TestSummaryExitCode(unittest.TestCase):
    """run_sandbox_check summary: any err → exit 1; all ok / warn only → exit 0."""

    def test_all_ok_exits_zero(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=True), \
             patch("core.sandbox.check.check_image_present", return_value=True), \
             patch("core.sandbox.check.check_network_present", return_value=True), \
             patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
             patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 0)

    def test_any_err_exits_one(self):
        with patch("core.sandbox.check.check_docker_daemon", return_value=False), \
             patch("core.sandbox.check.check_image_present", return_value=True), \
             patch("core.sandbox.check.check_network_present", return_value=True), \
             patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
             patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
            rc = run_sandbox_check()
        self.assertEqual(rc, 1)


if __name__ == "__main__":
    unittest.main()
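The `_mk_run` helper above replays a queue of results, one per call to the patched `_run`. A minimal standalone sketch of that behaviour follows; `mk_run`, `RunResult` and the `argv` values are illustrative assumptions, not part of the test file.

```python
from typing import Callable, Iterable, List, Tuple

RunResult = Tuple[int, str, str]  # (returncode, stdout, stderr)


def mk_run(returns: Iterable[RunResult]) -> Callable[..., RunResult]:
    """Build a fake runner that replays queued results in call order."""
    queued = iter(returns)

    def fake_run(argv: List[str], timeout: int = 10) -> RunResult:
        # argv and timeout are accepted but ignored; only the call order matters.
        return next(queued)

    return fake_run


# Two queued results: the first call gets the first tuple, the second call the next.
fake = mk_run([(0, "24.0.7", ""), (1, "", "No such image")])
first = fake(["docker", "version"])                                # hypothetical argv
second = fake(["docker", "image", "inspect", "zcbot-sandbox:latest"])  # hypothetical argv
```

A third call to `fake` raises `StopIteration`, so a test that queues too few results fails loudly instead of silently reusing an earlier one.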
11
web/app.py
@@ -509,9 +509,18 @@ def create_app() -> FastAPI:
         sandbox_reaper_task = None
         if sandbox_backend == "docker":
             from core.sandbox import init_pool
+            from core.sandbox.check import detect_fs_quota
             workspace = resolve_workspace(None, _cfg)
+            user_root_base = workspace / "users"
+            # §7.5 #4 fs quota probe: does not block startup (the app-level periodic scan already exists); only logs a WARN
+            # as a reminder to move to xfs prjquota / ext4 project / zfs before opening to external users.
             try:
-                pool = init_pool(workspace / "users")
+                level, msg = detect_fs_quota(user_root_base.resolve())
+                print(f"[startup] {'[ok]' if level == 'ok' else '[warn]'} {msg}")
+            except Exception as e:
+                print(f"[startup] [warn] fs quota detect failed: {type(e).__name__}: {e}")
+            try:
+                pool = init_pool(user_root_base)
                 removed = pool.shutdown_all()
                 if removed:
                     print(f"[startup] swept {len(removed)} stale sandbox container(s)")