Compare commits
No commits in common. "1a950dedb5ca29376b4aa1798eb83430cd16d61d" and "f66511ccf8c70a8a9e50016a4e74031ae310182d" have entirely different histories.
1a950dedb5...f66511ccf8
|
|
@ -2,7 +2,7 @@
|
||||||
|
|
||||||
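> Companion to `DESIGN.md`. This file records only phase status, deviations from decisions, file counts, and next steps. Each entry is 1-2 sentences: what was done plus the key judgment call; for details see `git log` / `git diff` / `DESIGN §7.9`.
|
> Companion to `DESIGN.md`. This file records only phase status, deviations from decisions, file counts, and next steps. Each entry is 1-2 sentences: what was done plus the key judgment call; for details see `git log` / `git diff` / `DESIGN §7.9`.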
|
||||||
|
|
||||||
Last updated: 2026-05-26 (Stage C Step 5: `main.py sandbox check` pre-deployment check + lifespan fs quota WARN + RUN.md quota-hardening section completed)
|
Last updated: 2026-05-26 (Stage C Step 2: Docker per-user container pool + Dockerfile / init.sh / network ensure; code ready, not yet integrated into AgentLoop)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -15,7 +15,7 @@
|
||||||
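| 5 | Eval Suite | ⏸ Not planned | Replaced by dogfooding; probes cover health checks |
|
| 5 | Eval Suite | ⏸ Not planned | Replaced by dogfooding; probes cover health checks |
|
||||||
| 6 | Long-task engineering | 🟡 | task + resume ✅; two-tier memory ✅; context compression not done |
|
| 6 | Long-task engineering | 🟡 | task + resume ✅; two-tier memory ✅; context compression not done |
|
||||||
| 7 | Polish | ❌ | Docker sandbox / more skills |
|
| 7 | Polish | ❌ | Docker sandbox / more skills |
|
||||||
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C Steps 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers) + Step 5 pre-deployment check ✅ (`main.py sandbox check` + lifespan fs quota WARN)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires the egress proxy + OS-level xfs project quota hardening (§7.5 rollout checklist #2 #4)**. |
|
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points consolidated ✅; real OIDC pending; **C (Executor + docker sandbox) pending: a hard prerequisite for opening to external users; until it is done, dogfood + trusted-colleague allowlist only; for the DoD see the 6 items of the DESIGN §7.5 rollout checklist**. |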
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -23,8 +23,6 @@
|
||||||
|
|
||||||
### 2026-05-26
|
### 2026-05-26
|
||||||
|
|
||||||
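- **Stage C Step 5: `main.py sandbox check` pre-deployment check + lifespan fs quota WARN**: opening to external users makes §7.5 #4 a hard requirement (xfs prjquota / ext4 project quota / zfs dataset quota; otherwise "a write burst between scans fills the shared fs and takes down the whole node"), and when a docker backend startup prerequisite (daemon / image / HOST_UID alignment) is wrong, lifespan fails fast and the traceback is expensive to debug. This step turns the "ops mental checklist" into an executable command. `main.py sandbox check` runs 5 independent probes: ① docker daemon reachable (CLI present + `docker version` rc=0) ② `zcbot-sandbox:latest` image exists ③ `zcbot-sandbox-net` network exists (missing is OK because lifespan ensures it automatically; this item warns rather than errs) ④ the zcbot uid inside the image matches the host uid (`docker run --rm --entrypoint id` reads the image uid and compares it with `os.getuid()`; skipped automatically on Windows) ⑤ the fs type under workspace/users/ supports quota (parses `findmnt --target ... -no FSTYPE,OPTIONS` and recognizes xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / other). `detect_fs_quota(path) -> (level, msg)` is factored out for lifespan to reuse: `web/app.py` runs it once as well when the docker backend starts and prints the WARN to stdout (non-blocking); the application-level periodic scan stays in effect. **err vs warn boundary**: err = root causes that make docker backend startup fail fast (daemon / image / HOST_UID, exit 1); warn = does not block startup but must be cleared before opening to external users (network missing / fs not quota-capable, exit 0). `tests/test_sandbox_check.py` has 19 tests covering each branch plus the aggregated exit code, mocking subprocess and sys.platform (`run_sandbox_check` now uses a module-level lookup instead of a frozen `CHECKS` tuple so that unittest patches take effect); **the full unittest discover run is 31/31 PASS**. RUN.md gains a "Pre-deployment check" subsection (meaning of the 5 `sandbox check` items), a rewritten "Quota hardening" section (fs type → action mapping table + 4-step xfs upgrade), and 3 troubleshooting rows (sandbox init failed / fs quota warn / image not found). Rejected: (a) fail-fast instead of WARN when the lifespan probe fails: in Step 5 the application-level periodic scan already exists, OS-level quota is a hard requirement for external opening rather than for dogfood, and fail-fast would block dogfood startup; (b) a built-in `quota-set` subcommand in sandbox check that calls `xfs_quota` directly: the integer `<pid>` ↔ user_uuid mapping needs a tracking table, and sudo + /etc/projects changes are ops work, so Step 5 only ships the RUN.md notes + command list, with the real implementation left to the step right before external opening; (c) probing egress proxy status in sandbox check: Step 4 is not implemented, and a placeholder would suggest it has already landed. `DESIGN.md` unchanged (implemented purely per the existing §7.5 #4 protocol); `RUN.md` updated as above.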
|
|
||||||
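- **Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan (env `ZCBOT_SANDBOX_BACKEND=host|docker` selects the backend)**: `core/executor_docker.py` `DockerExecutor` composes `HostExecutor` + `SandboxPool`; `call_tool` dispatches by the §7.5 #6 trust domains: `shell` / `run_python` → `pool.ensure(user_id)` returns the container name, then `docker exec --user 1000:1000 --workdir /workspace/<wd_name> -e PYTHONIOENCODING=utf-8 setsid bash -c <cmd>` / `python <script>` (`setsid` wraps the command in its own process group; the §7.5 #3 PGID kill protocol is deferred to Step 3b); all other tools (read/write/edit/glob/grep/load_skill/web_*/seedream/seedance) pass straight through to the host. **The run_python tmp .py is written on the host at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`**, which maps to `/workspace/.zcbot_tmp/<task_id>/<rand>.py` inside the container (visible automatically through the bind mount); the leading dot means the `/v1/files` API filters it out naturally (`web/app.py:169` `startswith(".")` already blocks it). **Cancel limitation accepted**: Popen.kill() kills the docker CLI client, but the server-side process inside the container does not terminate as a result (this is how docker exec is designed); the first version relies on the 5-minute idle reaper / `rm -f` on the next `ensure` as the backstop, and the upgrade trigger is "a user reports that a cancelled run is still burning CPU". `core/sandbox/__init__.py` exposes module-level singletons `init_pool` / `get_pool`; `agent_builder._resolve_executor` selects the backend by env, and on the docker path an uninitialized pool → fail-fast (no silent fallback to host, to avoid the "thought there was a sandbox but actually running bare" mistake); `web/app.py` lifespan startup hook: `init_pool(workspace/users)` + `shutdown_all` to clear orphans from the previous process + `asyncio.create_task(_reaper)` (every 60s `run_in_executor(pool.reap_idle)`); the shutdown hook cancels the reaper + `shutdown_all`. **Debt cleared in pool.py along the way**: `asyncio.Lock` → `threading.Lock` (the main caller is the web BG thread making synchronous tool calls, and an asyncio.Lock would be bypassed by the ephemeral loop that each `asyncio.run` starts; the reaper becomes an async wrapper `loop.run_in_executor(pool.reap_idle)`, and an all-sync pool API is more direct). **Tests**: `tests/test_executor_docker.py` has 11 tests covering host passthrough / shell argv shape / run_python tmp file cleanup / timeout / cancel / unknown tool / caps.enable_run_python=False; `unittest discover -s tests` **12/12 PASS** (the original 1 test unchanged, plus the 11 new ones). **Zero change for Windows dogfood**: the default is `ZCBOT_SANDBOX_BACKEND=host`, so docker is untouched locally; the docker path only works on the Ubuntu deployment host, and the real container smoke test still runs there using the 5 commands in the RUN.md "Sandbox (Stage C, Ubuntu)" section. `DESIGN.md` **unchanged** (implemented purely per the existing §7.5 #5 #6 protocol); `RUN.md` gains the `ZCBOT_SANDBOX_BACKEND` env description + the startup prerequisites for switching to the docker backend. Rejected: (a) DockerExecutor wrapping `asyncio.run(pool.ensure)` in an ephemeral loop: an asyncio.Lock is not shared across loops, so serialization protection is lost, and every tool call pays ~5ms of loop create/destroy noise; making the pool synchronous costs less; (b) putting the `run_python` tmp .py inside the working directory: it pollutes the user's view, the tmp files interfere when a SKILL teaches the model to "list the working directory with glob", and crash leftovers mix with real outputs (whether to record this under the §7.9 trade-offs will be considered the next time the same issue comes up); (c) a separate host-side bind mount `<workspace>/.sandbox_tmp/<uid>/` mounted as `/tmp_scripts` in the container: one more mount adds complexity, and keeping the single-bind-mount protocol is more direct; (d) degrading to host when the docker backend fails: a missing sandbox means the security model has collapsed, fail-fast matters more than "looks like it is running", and the §7.5 hard protocol says "any missing item means the deployment is incomplete".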
|
|
||||||
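- **Stage C Step 2: Docker per-user containers + iptables blocklist (foundation for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + a non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container exits, matching the §7.5 #1 hard protocol "any missing item means Stage C is incomplete"; 6 IPv4 red-line ranges + ::1 IPv6 loopback with tolerated degradation + a per-IP DROP for each entry in env `ZCBOT_PG_IPS`; `exec sleep infinity` waits for `docker exec` to come in). `core/sandbox/network.py` has a single function `ensure_network()`, `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins this same network); the `SandboxPool` class in `core/sandbox/pool.py` holds a per-user `asyncio.Lock` + an in-memory `_last_active` dict. The ensure path probes with inspect → running: return directly / exists but stopped: `rm -f` and restart (so iptables is re-applied) / absent: `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk cleanup + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything with the label (used at app startup to clear orphans from the previous process). Containers are named `zcbot-sandbox-<user_id>` (the standard dashed UUID string, visually aligned with the mount path `<workspace>/users/<user_id>/`, so `docker ps | grep zcbot-sandbox-` shows active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: this maps §7.5 #5 "the interface shape must not leak Docker assumptions" onto the implementation layer; subprocess behavior is transparent, adds zero dependencies, and can be cross-checked live with `docker ps`; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleared by `shutdown_all` as the backstop (called from the lifespan startup hook); (c) **the `--internal` network is in effect from Step 2**: the iptables OUTPUT rules act as defense-in-depth (the network layer already blocks outbound, and iptables still adds the rules per protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, and an iptables ACCEPT exception plus a default DROP policy implement "default deny + proxy only"; (d) **the NET_ADMIN cap is kept for PID 1 root to run iptables**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` takes no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root with an empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so that no root path opens up (enforced in code review). **Explicitly out of scope for Step 2**: ① AgentLoop integration (`agent_builder.py` untouched; the pool is an isolated module and is only plugged in at Step 3) ② moving shell/run_python into the container ③ egress proxy (Step 4) ④ the reaper background task (added together with the web lifespan integration in Step 3). **Verification**: `from core.sandbox import ...` imports fully and the constructor passes; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` fields are correct; `unittest discover` 1/1 PASS. Real container verification runs on Ubuntu (the RUN.md "Sandbox (Stage C, Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (implemented purely per the existing §7.5 #1 #3 protocol); `RUN.md` gains the "Sandbox (Stage C, Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 new troubleshooting rows (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + label reverse lookup: every exec costs one more `docker ps --filter` round-trip, readability suffers, and the privacy gain is 0; (b) per-task containers: DESIGN §7.5 has already locked in the per-user shared mental model (multiple tasks of one user share materials), so this is not reopened; (c) using a docker `init container` pattern for iptables: Docker has no native support (that is k8s), bringing in compose v2 adds further complexity, and the NET_ADMIN + non-root exec pattern is more direct; (d) wiring into AgentLoop immediately in Step 2: it could not be dogfooded once wired (local Windows has no docker) and would pollute the host path instead; the pool is committed in isolation and wired in at Step 3.
|
- **Stage C Step 2: Docker per-user containers + iptables blocklist (foundation for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + a non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container exits, matching the §7.5 #1 hard protocol "any missing item means Stage C is incomplete"; 6 IPv4 red-line ranges + ::1 IPv6 loopback with tolerated degradation + a per-IP DROP for each entry in env `ZCBOT_PG_IPS`; `exec sleep infinity` waits for `docker exec` to come in). `core/sandbox/network.py` has a single function `ensure_network()`, `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins this same network); the `SandboxPool` class in `core/sandbox/pool.py` holds a per-user `asyncio.Lock` + an in-memory `_last_active` dict. The ensure path probes with inspect → running: return directly / exists but stopped: `rm -f` and restart (so iptables is re-applied) / absent: `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk cleanup + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything with the label (used at app startup to clear orphans from the previous process). Containers are named `zcbot-sandbox-<user_id>` (the standard dashed UUID string, visually aligned with the mount path `<workspace>/users/<user_id>/`, so `docker ps | grep zcbot-sandbox-` shows active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: this maps §7.5 #5 "the interface shape must not leak Docker assumptions" onto the implementation layer; subprocess behavior is transparent, adds zero dependencies, and can be cross-checked live with `docker ps`; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleared by `shutdown_all` as the backstop (called from the lifespan startup hook); (c) **the `--internal` network is in effect from Step 2**: the iptables OUTPUT rules act as defense-in-depth (the network layer already blocks outbound, and iptables still adds the rules per protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, and an iptables ACCEPT exception plus a default DROP policy implement "default deny + proxy only"; (d) **the NET_ADMIN cap is kept for PID 1 root to run iptables**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` takes no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root with an empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so that no root path opens up (enforced in code review). **Explicitly out of scope for Step 2**: ① AgentLoop integration (`agent_builder.py` untouched; the pool is an isolated module and is only plugged in at Step 3) ② moving shell/run_python into the container ③ egress proxy (Step 4) ④ the reaper background task (added together with the web lifespan integration in Step 3). **Verification**: `from core.sandbox import ...` imports fully and the constructor passes; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` fields are correct; `unittest discover` 1/1 PASS. Real container verification runs on Ubuntu (the RUN.md "Sandbox (Stage C, Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (implemented purely per the existing §7.5 #1 #3 protocol); `RUN.md` gains the "Sandbox (Stage C, Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 new troubleshooting rows (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + label reverse lookup: every exec costs one more `docker ps --filter` round-trip, readability suffers, and the privacy gain is 0; (b) per-task containers: DESIGN §7.5 has already locked in the per-user shared mental model (multiple tasks of one user share materials), so this is not reopened; (c) using a docker `init container` pattern for iptables: Docker has no native support (that is k8s), bringing in compose v2 adds further complexity, and the NET_ADMIN + non-root exec pattern is more direct; (d) wiring into AgentLoop immediately in Step 2: it could not be dogfooded once wired (local Windows has no docker) and would pollute the host path instead; the pool is committed in isolation and wired in at Step 3.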
|
||||||
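- **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (lands §7.5 #5)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` routes internally to the matching `Tool.execute` and folds the three error kinds (unknown / TypeError / raised exception) into an `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes an `executor` instead of a `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)`, dropping the original three error emits (unknown/TypeError/Exception are now absorbed by the executor into a ToolResult, leaving one emit). After assembling the tools dict, `agent_builder.py` wraps it in `HostExecutor(tools)` and passes that to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so when Step 3 switches to the docker backend `AgentLoop` needs zero changes; only `HostExecutor` → `DockerExecutor(host_tools=..., docker_tools={shell, run_python})` in `agent_builder.py` changes. **Zero behavior change**: the sanity import passes, `unittest discover -s tests` 1/1 PASS. `DESIGN.md` unchanged (implemented purely per the existing §7.5 #5 protocol, no architectural drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env is added in Step 3 together with the docker backend). Rejected: (a) skipping the Executor abstraction and writing `if backend=='docker'` directly in `shell.py/run_python.py`: it violates §7.5 #5, and a future switch to gVisor/Firecracker would scatter changes across the tool layer; (b) giving Executor an `exec(cmd, ctx)` primitive instead of a `call_tool(name, args, ctx)` dispatcher: it does not match the DESIGN signature, and host tools (read/web_*/seedream) do not have "command" semantics; (c) using a `cancel_check` callable in place of rebuilding ExecCtx: cancel_check is currently assigned through a setter after build, so a cached ctx would point at a stale one, and building an ExecCtx per call is cheap because it is a dataclass.
|
- **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (lands §7.5 #5)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` routes internally to the matching `Tool.execute` and folds the three error kinds (unknown / TypeError / raised exception) into an `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes an `executor` instead of a `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)`, dropping the original three error emits (unknown/TypeError/Exception are now absorbed by the executor into a ToolResult, leaving one emit). After assembling the tools dict, `agent_builder.py` wraps it in `HostExecutor(tools)` and passes that to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so when Step 3 switches to the docker backend `AgentLoop` needs zero changes; only `HostExecutor` → `DockerExecutor(host_tools=..., docker_tools={shell, run_python})` in `agent_builder.py` changes. **Zero behavior change**: the sanity import passes, `unittest discover -s tests` 1/1 PASS. `DESIGN.md` unchanged (implemented purely per the existing §7.5 #5 protocol, no architectural drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env is added in Step 3 together with the docker backend). Rejected: (a) skipping the Executor abstraction and writing `if backend=='docker'` directly in `shell.py/run_python.py`: it violates §7.5 #5, and a future switch to gVisor/Firecracker would scatter changes across the tool layer; (b) giving Executor an `exec(cmd, ctx)` primitive instead of a `call_tool(name, args, ctx)` dispatcher: it does not match the DESIGN signature, and host tools (read/web_*/seedream) do not have "command" semantics; (c) using a `cancel_check` callable in place of rebuilding ExecCtx: cancel_check is currently assigned through a setter after build, so a cached ctx would point at a stale one, and building an ExecCtx per call is cheap because it is a dataclass.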
|
||||||
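- **REVISIONS.md revision-log mechanism (covers the three artifact-producing skills: proposal/patent/ppt)**: `<task_dir>/REVISIONS.md` is a compact changelog of an artifact's iteration. The task conversation history is a coarse stream (finding last week's change among 50 messages means scrolling), whereas REVISIONS is the list of substantive decisions that the user and the LLM settle together (5 lines are enough to reconstruct "why was this chapter written this way last week"). It complements the spec: **spec = constitution (sets the tone once), REVISIONS = implementation log (appended at each sticking point)**. Each of the three SKILL.md files gains (a) a drafting step "append one line after the user confirms a substantive change" + (b) a standalone "## 修订日志" (revision log) section (when to record / when not to record table + format conventions + examples + procedure). The "substantive change" criteria are tailored per skill: proposal = technical approach / assessment metrics / innovation points / sub-topic breakdown / key citations / budget structure; patent = distinguishing technical features / key parameters / formulas / embodiments / sections; ppt = layout / primary color / pages / icons / key copy points. Common principles: do not record the first draft / typo-level tweaks / the model's own back-and-forth edits; when in doubt, lean toward not recording, so the log does not turn into a running account. The chosen format is **single-line bullets appended in reverse chronological order** (timestamp first, file:section locator, what changed and why), inserted with edit right after the header comment (not appended at the end, so the newest entry is read first). Rejected: (a) a soft constraint in the system prompt: it would impose an irrelevant constraint on non-artifact skills such as coding/research/documents/imagegen/videogen; (b) a new `record_revision` tool: during development it is enough for the LLM to append via edit directly, a tool adds call overhead to every small change, and it can be promoted to a tool later if the LLM turns out to miss many entries; (c) splitting into one file per artifact (`<topic>.revisions.md`): a single file reads better and keeps one timeline across artifacts. `DESIGN.md` unchanged (no architectural change); `RUN.md` unchanged (no CLI/env change).
|
- **REVISIONS.md revision-log mechanism (covers the three artifact-producing skills: proposal/patent/ppt)**: `<task_dir>/REVISIONS.md` is a compact changelog of an artifact's iteration. The task conversation history is a coarse stream (finding last week's change among 50 messages means scrolling), whereas REVISIONS is the list of substantive decisions that the user and the LLM settle together (5 lines are enough to reconstruct "why was this chapter written this way last week"). It complements the spec: **spec = constitution (sets the tone once), REVISIONS = implementation log (appended at each sticking point)**. Each of the three SKILL.md files gains (a) a drafting step "append one line after the user confirms a substantive change" + (b) a standalone "## 修订日志" (revision log) section (when to record / when not to record table + format conventions + examples + procedure). The "substantive change" criteria are tailored per skill: proposal = technical approach / assessment metrics / innovation points / sub-topic breakdown / key citations / budget structure; patent = distinguishing technical features / key parameters / formulas / embodiments / sections; ppt = layout / primary color / pages / icons / key copy points. Common principles: do not record the first draft / typo-level tweaks / the model's own back-and-forth edits; when in doubt, lean toward not recording, so the log does not turn into a running account. The chosen format is **single-line bullets appended in reverse chronological order** (timestamp first, file:section locator, what changed and why), inserted with edit right after the header comment (not appended at the end, so the newest entry is read first). Rejected: (a) a soft constraint in the system prompt: it would impose an irrelevant constraint on non-artifact skills such as coding/research/documents/imagegen/videogen; (b) a new `record_revision` tool: during development it is enough for the LLM to append via edit directly, a tool adds call overhead to every small change, and it can be promoted to a tool later if the LLM turns out to miss many entries; (c) splitting into one file per artifact (`<topic>.revisions.md`): a single file reads better and keeps one timeline across artifacts. `DESIGN.md` unchanged (no architectural change); `RUN.md` unchanged (no CLI/env change).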
|
||||||
|
|
|
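The fs-type probe described in the Step 5 entry can be sketched roughly as follows. This is a minimal illustration, not the project's code: only the name `detect_fs_quota(path) -> (level, msg)`, the `findmnt --target ... -no FSTYPE,OPTIONS` call and the list of recognized filesystems come from the entry. The helper `classify_fs`, the exact message strings, and the assumption that a `prjquota` mount option marks project quota on both xfs and ext4 are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Tuple


def classify_fs(fstype: str, options: str) -> Tuple[str, str]:
    """Map a filesystem type + mount options to a (level, msg) pair.

    Hypothetical helper: level is "ok" or "warn"; the messages are illustrative.
    """
    opts = {o.strip() for o in options.split(",") if o.strip()}
    if fstype in ("xfs", "ext4"):
        # Assumption: project quota shows up as the `prjquota` mount option.
        if "prjquota" in opts:
            return "ok", f"fs quota: {fstype} with project quota enabled"
        return "warn", f"fs quota: {fstype} mounted WITHOUT project quota"
    if fstype == "zfs":
        return "ok", "fs quota: zfs (set the quota at the dataset level)"
    if fstype == "btrfs":
        return "warn", "fs quota: btrfs (qgroup setup needs manual validation)"
    return "warn", f"fs quota: {fstype or 'unknown'} is not quota-capable"


def detect_fs_quota(path: Path) -> Tuple[str, str]:
    """Probe the mount that holds `path`; never raises, degrades to a warn."""
    try:
        proc = subprocess.run(
            ["findmnt", "--target", str(path), "-no", "FSTYPE,OPTIONS"],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # findmnt missing (e.g. on Windows) or hanging: report, do not block.
        return "warn", f"fs quota: findmnt unavailable ({exc})"
    fields = proc.stdout.split()
    if proc.returncode != 0 or len(fields) < 2:
        return "warn", "fs quota: could not determine the filesystem type"
    return classify_fs(fields[0], fields[1])
```

Keeping the classification in a pure function separate from the `findmnt` call makes each branch testable without mocking subprocess, which is one plausible way to reach the per-branch coverage the entry mentions.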
||||||
91
RUN.md
|
|
@ -256,14 +256,8 @@ sudo journalctl -u zcbot -n 50 # 看新进程起没起干
|
||||||
## Sandbox (Stage C, Ubuntu)
|
## Sandbox (Stage C, Ubuntu)
|
||||||
|
|
||||||
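> Must be completed before opening to external users. It can be skipped during the current dogfood + trusted-colleague allowlist phase: the default backend is host, and
|
> Must be completed before opening to external users. It can be skipped during the current dogfood + trusted-colleague allowlist phase: the default backend is host, and
|
||||||
> `shell` / `run_python` still run as subprocesses (not isolated). Step 3 has wired in DockerExecutor:
|
> `shell` / `run_python` still run as subprocesses (not isolated). Once Step 3 wires in DockerExecutor, set
|
||||||
> `ZCBOT_SANDBOX_BACKEND=docker` switches execution to containers; `host` (default) is kept for local Windows / colleague dogfood.
|
> `ZCBOT_SANDBOX_BACKEND=docker` to enable it.
|
||||||
>
|
|
||||||
> Prerequisites for enabling the docker backend:
|
|
||||||
> 1. The deployment host has a docker daemon, and the zcbot user is in the `docker` group
|
|
||||||
> 2. The `zcbot-sandbox:latest` image has been built (with `HOST_UID/GID` aligned)
|
|
||||||
> 3. `.env` contains at least `ZCBOT_PG_IPS=<actual-PG-IP>` (§7.5 #1: PG is blocked separately)
|
|
||||||
> 4. A lifespan startup failure fails fast (`RuntimeError: sandbox init failed`) and does not silently fall back to host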
|
|
||||||
|
|
||||||
### Image build
|
### Image build
|
||||||
|
|
||||||
|
|
@ -290,14 +284,6 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
|
||||||
### Sandbox-related env (add to .env)
|
### Sandbox-related env (add to .env)
|
||||||
|
|
||||||
```
|
```
|
||||||
# Backend selection (default host):
|
|
||||||
# host = shell/run_python run as host subprocesses (local Windows / dogfood)
|
|
||||||
# docker = shell/run_python run via docker exec in per-user containers (deployment host / external users)
|
|
||||||
# ZCBOT_SANDBOX_BACKEND=docker
|
|
||||||
|
|
||||||
# exec user inside the container (default 1000:1000; keep in sync with the Dockerfile HOST_UID/HOST_GID build-args)
|
|
||||||
# ZCBOT_SANDBOX_EXEC_USER=1000:1000
|
|
||||||
|
|
||||||
# Container image tag (default zcbot-sandbox:latest)
|
# Container image tag (default zcbot-sandbox:latest)
|
||||||
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
|
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
|
||||||
# Container runtime (runsc for gVisor, kata for Firecracker; default runc)
|
# Container runtime (runsc for gVisor, kata for Firecracker; default runc)
|
||||||
|
|
@ -309,23 +295,10 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
|
||||||
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
|
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
|
||||||
```
|
```
|
||||||
|
|
||||||
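### Verification
|
### Verification (partially verifiable at Step 2)
|
||||||
|
|
||||||
From Step 3 onward, integration verification is recommended (start web with the docker backend and send `shell` / `run_python` messages from the dev SPA):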
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Start web with the docker backend (.env already sets PG_IPS / SANDBOX_BACKEND=docker)
|
|
||||||
ZCBOT_SANDBOX_BACKEND=docker .venv/bin/python main.py web
|
|
||||||
|
|
||||||
# After any shell / run_python message is triggered, the container should be up
|
|
||||||
sudo -u zcbot docker ps --filter label=zcbot.product=sandbox
|
|
||||||
# Expect zcbot-sandbox-<your-uid> with STATUS = Up ...
|
|
||||||
# After 5 minutes without new messages the reaper removes it automatically
|
|
||||||
```
|
|
||||||
|
|
||||||
Alternatively, start a test container directly to verify the hardening on its own (no web process needed):
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
# Start a test container (plain docker run, bypassing the pool; the pool is only used once Step 3 wires it in)
|
||||||
USER_ID=00000000-0000-0000-0000-000000000001
|
USER_ID=00000000-0000-0000-0000-000000000001
|
||||||
sudo -u zcbot docker run -d \
|
sudo -u zcbot docker run -d \
|
||||||
--name zcbot-sandbox-$USER_ID \
|
--name zcbot-sandbox-$USER_ID \
|
||||||
|
|
@ -358,60 +331,19 @@ sudo -u zcbot docker rm -f zcbot-sandbox-$USER_ID
|
||||||
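Once Step 4 introduces the egress proxy, the full set of 5 red-team cases (metadata / loopback / cross-user / nohup
|
Once Step 4 introduces the egress proxy, the full set of 5 red-team cases (metadata / loopback / cross-user / nohup
|
||||||
leftovers / 403 outside the allowlist) will run automatically in `tests/test_sandbox_redteam.py`.
|
leftovers / 403 outside the allowlist) will run automatically in `tests/test_sandbox_redteam.py`.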
|
||||||
|
|
||||||
### Pre-deployment check
|
|
||||||
|
|
||||||
Run this once before switching to `ZCBOT_SANDBOX_BACKEND=docker`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
sudo -u zcbot .venv/bin/python main.py sandbox check
|
|
||||||
```
|
|
||||||
|
|
||||||
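The output looks like `[ok] / [warn] / [err]` × 5 items plus a summary `N/5 passed`; exit code 0 = ready to start / 1 = there are errs
|
|
||||||
to fix. The 5 items are: ① docker daemon reachable ② `zcbot-sandbox:latest` image exists ③
|
|
||||||
`zcbot-sandbox-net` network exists (it still runs if missing; lifespan ensures it automatically) ④ the zcbot
|
|
||||||
uid inside the image matches the host uid (a mismatch → every exec write to `/workspace` fails with EACCES) ⑤ the fs type under
|
|
||||||
`workspace/users/` supports quota.
|
|
||||||
|
|
||||||
At startup, lifespan also prints the item ⑤ WARN to stdout (`[startup] [warn] fs quota ...`);
|
|
||||||
the application-level periodic scan stays in effect. **Item ⑤ only has to be upgraded to OS-level quota before opening to external users**.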
|
|
||||||
|
|
||||||
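### Quota hardening (§7.5 #4, required before external opening)
|
### Quota hardening (§7.5 #4, required before external opening)
|
||||||
|
|
||||||
The application-level disk quota stops ordinary overruns, **but a write burst between scans can still fill the shared fs and take down the whole node**; preventing that requires OS-level
|
The application-level disk quota (introduced in Step 5) stops ordinary overruns, **but a write burst between scans can still fill the shared fs and take down the whole node**; preventing that
|
||||||
quota. Item ⑤ of `sandbox check` probes the current fs state:
|
requires **xfs / ext4 project quota or zfs dataset quota**. Before deploying to a dedicated server + opening to multiple tenants: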
|
||||||
|
|
||||||
| Probe result | Meaning | Action |
|
|
||||||
|---|---|---|
|
|
||||||
| `fs quota: xfs with prjquota on ...` | ok; per-user quotas can be set directly with `xfs_quota -x` | (none needed) |
|
|
||||||
| `fs quota: ext4 with project quota on ...` | ok; can use `quota -P` | (none needed) |
|
|
||||||
| `fs quota: zfs on ...` | ok; set `zfs set quota=` at the dataset level | (none needed) |
|
|
||||||
| `fs quota: xfs ... NO prjquota mount option` | the fs supports it but it was not enabled at mount time | see the xfs steps below |
|
|
||||||
| `fs quota: ext4 ... NO project quota option` | same as above | `sudo tune2fs -O project,quota <dev>` + remount |
|
|
||||||
| `fs quota: btrfs ...` | qgroup configuration is complex | for production, prefer a separate xfs partition, or validate `btrfs qgroup` yourself |
|
|
||||||
| `fs quota: tmpfs/overlay/... ` | usually Docker-in-Docker or local dev | production must mount a dedicated partition |
|
|
||||||
|
|
||||||
**xfs upgrade steps (recommended option)**:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 1) Find which mount holds the workspace (assuming /opt is a dedicated xfs partition)
|
# Example (xfs project quota):
|
||||||
findmnt --target /opt/zcbot/workspace
|
|
||||||
|
|
||||||
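# 2) Enable prjquota (also add it to /etc/fstab so it survives a reboot; XFS usually does not apply quota options on a remount, so a full umount + mount may be needed)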
|
|
||||||
sudo mount -o remount,prjquota /opt
|
sudo mount -o remount,prjquota /opt
|
||||||
|
sudo xfs_quota -x -c "project -s -p /opt/zcbot/workspace/users/<uid> <pid>" /opt
|
||||||
# 3) Add a project quota for a user (<pid> is a self-chosen integer id; track its mapping to user_id in a table)
|
sudo xfs_quota -x -c "limit -p bhard=10g <pid>" /opt
|
||||||
echo "1001 /opt/zcbot/workspace/users/<user_uuid>" | sudo tee -a /etc/projects
|
|
||||||
echo "zcbot_<user_uuid>:1001" | sudo tee -a /etc/projid
|
|
||||||
sudo xfs_quota -x -c "project -s zcbot_<user_uuid>" /opt
|
|
||||||
sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
|
|
||||||
```
|
```
|
||||||
|
|
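||||||
The `<pid>` ↔ `user_uuid` mapping is maintained by hand (`/etc/projects` uses numeric ids, so zcbot needs a table to track them;
|
The exact approach depends on the deployment fs (xfs recommended). Skipping this step amounts to "soft quota + trusting users not to fill the disk".
|
||||||
**before the first external opening, add a `main.py sandbox quota-set --user-id <uuid> --gb 10` subcommand**
|
|
||||||
that reads/writes /etc/projects + calls xfs_quota; that is the step right before really going live after Step 4 / 5, and it is not done now).
|
|
||||||
|
|
||||||
Skipping this step amounts to "soft quota + trusting users not to fill the disk", which is enough for the dogfood + trusted-colleague allowlist phase;
|
|
||||||
**for opening to external users it is a hard prereq**.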
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -427,9 +359,6 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
|
||||||
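| With `--working-dir` set, the directory remains after the task is deleted | Two cases: ① the directory is not empty (it holds user files): by design it is never rmtree'd, clean it manually with `rm -rf <dir>`; ② an external `--working-dir` (the DB stores an absolute path): it is not cleaned automatically, to avoid deleting the user's external project by mistake. Inside ROOT + no other task referencing the same working_dir + empty on the FS → already rmdir'd automatically on DELETE task |
|
| With `--working-dir` set, the directory remains after the task is deleted | Two cases: ① the directory is not empty (it holds user files): by design it is never rmtree'd, clean it manually with `rm -rf <dir>`; ② an external `--working-dir` (the DB stores an absolute path): it is not cleaned automatically, to avoid deleting the user's external project by mistake. Inside ROOT + no other task referencing the same working_dir + empty on the FS → already rmdir'd automatically on DELETE task |
|
||||||
| `touch /workspace/x` inside the sandbox container reports `Permission denied` | The container uid 1000 differs from the host `zcbot` user's uid (a bind mount keeps the host owner). Rebuild the image with `docker build --build-arg HOST_UID=$(id -u zcbot)` |
|
| `touch /workspace/x` inside the sandbox container reports `Permission denied` | The container uid 1000 differs from the host `zcbot` user's uid (a bind mount keeps the host owner). Rebuild the image with `docker build --build-arg HOST_UID=$(id -u zcbot)` |
|
||||||
| The sandbox container builds but will not start, and `docker logs` shows iptables errors | The NET_ADMIN cap is missing (`--cap-add=NET_ADMIN` was left out) or the kernel lacks support (WSL2 / OpenVZ environments cannot run it). Ubuntu on bare metal / KVM works. Check: `docker exec ... iptables -V` |
|
| The sandbox container builds but will not start, and `docker logs` shows iptables errors | The NET_ADMIN cap is missing (`--cap-add=NET_ADMIN` was left out) or the kernel lacks support (WSL2 / OpenVZ environments cannot run it). Ubuntu on bare metal / KVM works. Check: `docker exec ... iptables -V` |
|
||||||
| Startup reports `ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: ...` | The docker daemon is not running / the user is not in the docker group / network create failed. Run `main.py sandbox check` first to see which item errs |
|
|
||||||
| `[startup] [warn] fs quota: <fstype> on ...` | The fs holding the workspace has no OS-level quota enabled. Ignore during dogfood; before opening to external users it must be upgraded to xfs prjquota / ext4 project / zfs (see the RUN.md "Quota hardening" section) |
|
|
||||||
| `docker run zcbot-sandbox:latest` reports `Unable to find image` | The image has not been built. `sudo -u zcbot docker build -f deploy/sandbox/Dockerfile --build-arg HOST_UID=$(id -u zcbot) --build-arg HOST_GID=$(id -g zcbot) -t zcbot-sandbox:latest .` |
|
|
||||||
| Export reports "无可导出内容" (nothing to export) | The task has no messages (a system message alone does not count); send a message first, then export |
|
| Export reports "无可导出内容" (nothing to export) | The task has no messages (a system message alone does not count); send a message first, then export |
|
||||||
| `NoSubtaskError: working_dir ... 前缀嵌套` (prefix nesting) | §7.4 no-subtask: nested working_dirs (child / parent) are not allowed for the same user. For **multiple conversations in one project** use the **exact same** working_dir; otherwise make them siblings (same level) |
|
| `NoSubtaskError: working_dir ... 前缀嵌套` (prefix nesting) | §7.4 no-subtask: nested working_dirs (child / parent) are not allowed for the same user. For **multiple conversations in one project** use the **exact same** working_dir; otherwise make them siblings (same level) |
|
||||||
| curl cannot connect after `main.py web` starts | Check the proxy (`HTTP_PROXY` / `HTTPS_PROXY`): the service listens on local 127.0.0.1, and interception by a system proxy returns 502. Temporarily `unset HTTP_PROXY HTTPS_PROXY` or use `curl --noproxy '*'`. To confirm: `curl --noproxy '*' http://127.0.0.1:8765/healthz` |
|
| curl cannot connect after `main.py web` starts | Check the proxy (`HTTP_PROXY` / `HTTPS_PROXY`): the service listens on local 127.0.0.1, and interception by a system proxy returns 502. Temporarily `unset HTTP_PROXY HTTPS_PROXY` or use `curl --noproxy '*'`. To confirm: `curl --noproxy '*' http://127.0.0.1:8765/healthz` |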
|
||||||
|
|
|
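How the `sandbox check` output format and exit code described in the RUN.md text could fit together, as a minimal sketch. The `[ok] / [warn] / [err]` tags, the `N/5 passed` summary and the exit-code rule (0 unless some item is err) come from that text; the function name `summarize`, the shape of the result tuples and the choice to count only `ok` items as passed are assumptions made for illustration.

```python
from typing import List, Tuple

Result = Tuple[str, str]  # (level, message); level is "ok", "warn" or "err"


def summarize(results: List[Result]) -> Tuple[List[str], int]:
    """Render one line per probe plus a summary, and derive the exit code."""
    lines = [f"[{level}] {msg}" for level, msg in results]
    # Assumption: only "ok" items count as passed in the summary line.
    passed = sum(1 for level, _ in results if level == "ok")
    lines.append(f"{passed}/{len(results)} passed")
    # warn never blocks startup; any err yields exit code 1.
    exit_code = 1 if any(level == "err" for level, _ in results) else 0
    return lines, exit_code


# Illustrative probe results (messages are made up for the demo).
demo = [
    ("ok", "docker daemon reachable"),
    ("ok", "image zcbot-sandbox:latest present"),
    ("warn", "network zcbot-sandbox-net missing (lifespan will create it)"),
    ("ok", "image uid matches host uid"),
    ("warn", "fs quota: ext4 mounted without project quota"),
]
report, code = summarize(demo)
print("\n".join(report))
print("exit code:", code)
```

With this split, a deployment script can gate on the exit code alone, while the warn lines remain visible as a to-do list for the external-opening milestone.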
||||||
|
|
@ -26,7 +26,6 @@ import yaml
|
||||||
from rich.console import Console
|
from rich.console import Console
|
||||||
|
|
||||||
from core.capabilities import ModelCapabilities
|
from core.capabilities import ModelCapabilities
|
||||||
from core.executor_docker import DockerExecutor
|
|
||||||
from core.executor_host import HostExecutor
|
from core.executor_host import HostExecutor
|
||||||
from core.llm import LLM
|
from core.llm import LLM
|
||||||
from core.loop import AgentLoop
|
from core.loop import AgentLoop
|
||||||
|
|
@ -54,39 +53,6 @@ def load_config() -> dict:
|
||||||
return yaml.safe_load((ROOT / "config" / "agent.yaml").read_text(encoding="utf-8")) or {}
|
return yaml.safe_load((ROOT / "config" / "agent.yaml").read_text(encoding="utf-8")) or {}
|
||||||
|
|
||||||
|
|
||||||
def _resolve_executor(
|
|
||||||
host: HostExecutor,
|
|
||||||
user_id: UUID,
|
|
||||||
user_root_path: Path,
|
|
||||||
working_dir_path: Path,
|
|
||||||
):
|
|
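||||||
"""Select the Executor backend (§7.5 #5).
|
|
||||||
|
|
||||||
Builds a DockerExecutor when env `ZCBOT_SANDBOX_BACKEND=docker`; any other value / unset → host.
|
|
||||||
The docker path requires that lifespan has already called `core.sandbox.init_pool` (otherwise the pool is None → RuntimeError, with no silent fallback to host;
|
|
||||||
the startup log is printed by the web entry point at init time, so no extra warn is emitted here).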
|
|
||||||
"""
|
|
||||||
import os
|
|
||||||
if os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower() != "docker":
|
|
||||||
return host
|
|
||||||
from core.sandbox import get_pool
|
|
||||||
pool = get_pool()
|
|
||||||
if pool is None:
|
|
||||||
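# lifespan did not init successfully: failing early upstream is safer than degrading silently (once open to external users, this avoids
|
|
||||||
# believing code runs in the sandbox when it is really on the host). The web entry point fails fast at startup; this adds one more reminder.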
|
|
||||||
raise RuntimeError(
|
|
||||||
"ZCBOT_SANDBOX_BACKEND=docker but sandbox pool not initialized; "
|
|
||||||
"check web lifespan init_pool() / docker daemon availability"
|
|
||||||
)
|
|
||||||
return DockerExecutor(
|
|
||||||
host=host,
|
|
||||||
pool=pool,
|
|
||||||
user_id=user_id,
|
|
||||||
user_root=user_root_path,
|
|
||||||
working_dir=working_dir_path,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def resolve_workspace(workspace: Optional[str], cfg: Optional[dict] = None) -> Path:
|
def resolve_workspace(workspace: Optional[str], cfg: Optional[dict] = None) -> Path:
|
||||||
cfg = cfg or load_config()
|
cfg = cfg or load_config()
|
||||||
p = Path(workspace) if workspace else ROOT / cfg.get("workspace_dir", "workspace")
|
p = Path(workspace) if workspace else ROOT / cfg.get("workspace_dir", "workspace")
|
||||||
|
|
@ -473,11 +439,9 @@ def build_agent(
|
||||||
tools[ws.name] = ws
|
tools[ws.name] = ws
|
||||||
|
|
||||||
sink = ConsoleEventSink(console) if console else None
|
sink = ConsoleEventSink(console) if console else None
|
||||||
# §7.5 #5/#6 Executor abstraction: env `ZCBOT_SANDBOX_BACKEND=host|docker` selects the backend.
|
# §7.5 #5 Executor abstraction: this step is all host backend (in-process); once the Step 3 docker backend
|
||||||
# host (default) = fully in-process, used by local dogfood / Windows; docker = shell/run_python
|
# lands, `ZCBOT_SANDBOX_BACKEND=docker` dispatches shell/run_python to containers.
|
||||||
# are dispatched to per-user containers (other tools stay on host). The docker path requires that lifespan has already run `init_pool`.
|
executor = HostExecutor(tools)
|
||||||
host_executor = HostExecutor(tools)
|
|
||||||
executor = _resolve_executor(host_executor, uid, ur_path, working_dir_path)
|
|
||||||
agent = AgentLoop(
|
agent = AgentLoop(
|
||||||
llm, executor, session, caps,
|
llm, executor, session, caps,
|
||||||
user_id=uid, working_dir=working_dir_path, sink=sink,
|
user_id=uid, working_dir=working_dir_path, sink=sink,
|
||||||
|
|
|
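For orientation, this is one way a `DockerExecutor` like the one constructed in `_resolve_executor` might turn a host working directory into a container path and a `docker exec` argv. It is a sketch under assumptions: the flags (`--user`, `--workdir`, `-e PYTHONIOENCODING=utf-8`, `setsid bash -c`) and the `/workspace` mount point come from the PROGRESS.md Step 3 entry, while the function names `container_workdir` and `build_shell_argv` are hypothetical and do not reflect the real method layout.

```python
from pathlib import PurePosixPath
from typing import List


def container_workdir(user_root: PurePosixPath, working_dir: PurePosixPath) -> str:
    """Map a host working_dir under user_root to its path below /workspace."""
    try:
        rel = working_dir.relative_to(user_root)
    except ValueError:
        # working_dir is outside user_root: fall back to the mount root.
        return "/workspace"
    return str(PurePosixPath("/workspace") / rel)


def build_shell_argv(container: str, workdir: str, cmd: str,
                     exec_user: str = "1000:1000") -> List[str]:
    """Build the docker exec argv as a list, so no host-side shell quoting is needed."""
    return [
        "docker", "exec",
        "--user", exec_user,           # non-root inside the container
        "--workdir", workdir,
        "-e", "PYTHONIOENCODING=utf-8",
        container,
        "setsid", "bash", "-c", cmd,   # setsid runs the command in its own session
    ]
```

Passing the argv as a list to `subprocess.Popen` keeps `cmd` as a single argument to `bash -c`, so only the shell inside the container interprets it.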
||||||
|
|
@ -1,239 +0,0 @@
|
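||||||
"""DockerExecutor: `shell` / `run_python` go through docker exec; everything else runs in-process (§7.5 #6).
|
|
||||||
|
|
||||||
Backend split (§7.5 #6 trust domains):
|
|
||||||
- host in-process: `read/write/edit/glob/grep/load_skill/web_*/seedream/seedance`
|
|
||||||
already hold credentials on the host (Bocha key / ARK key) or are validated by `paths.py::resolve_user_path`
|
|
||||||
(the user-rooted security boundary already exists); moving them into the container gains nothing and costs ~200ms of exec overhead × N calls
|
|
||||||
- container exec: `shell` / `run_python`, which execute arbitrary model-generated code and must be isolated in a container
|
|
||||||
|
|
||||||
Container admission (per call):
|
|
||||||
1. `pool.ensure(user_id)`: get / start the `zcbot-sandbox-<uid>` container (already serialized by the per-user lock)
|
|
||||||
2. `docker exec --user 1000:1000 --workdir /workspace/<wd_name> <c> setsid bash -c '<cmd>'`
|
|
||||||
3. On timeout → kill the docker CLI client (Popen.kill())
|
|
||||||
4. On completion → `pool.mark_active(user_id)` refreshes the idle timer
|
|
||||||
|
|
||||||
The run_python tmp .py is written on the host at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py` (the bind mount
|
|
||||||
makes it visible in the container at `/workspace/.zcbot_tmp/<task_id>/`) and is unlinked after execution. The leading dot means the
|
|
||||||
`/v1/files` API filters it out naturally (`web/app.py:169` startswith(".")), so the user's view stays clean.
|
|
||||||
|
|
||||||
Cancel limitation (accepted for the first version):
|
|
||||||
- After the docker exec client disconnects, the server-side process inside the container does **not** terminate as a result; this is how docker is designed
|
|
||||||
- The first version only kills the docker CLI (Popen.kill()); leftover processes in the container are handled by the 5-minute idle reaper / the
|
|
||||||
`rm -f` on the next ensure as the backstop
|
|
||||||
- Upgrade trigger (§7.5 #3 PGID protocol): users report "cancelled but still burning CPU" / processes pile up in the container
|
|
||||||
after repeated cancels → enable the "ZCBOT_EXEC_ID env + PGID written to a file + second exec kill" protocol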
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import os
|
|
||||||
import secrets
|
|
||||||
import subprocess
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Any, Dict, List, Optional
|
|
||||||
from uuid import UUID
|
|
||||||
|
|
||||||
from .executor import ExecCtx, Executor, ToolResult
|
|
||||||
from .executor_host import HostExecutor
|
|
||||||
from .sandbox import SandboxPool
|
|
||||||
|
|
||||||
|
|
||||||
CONTAINER_TOOLS = frozenset({"shell", "run_python"})
|
|
||||||
|
|
||||||
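# Non-root user inside the container: matches the default HOST_UID/HOST_GID build-args in the Dockerfile.
|
|
||||||
# If the zcbot account uid on the deployment host is not 1000, pass HOST_UID at image build time and change
|
|
||||||
# env `ZCBOT_SANDBOX_EXEC_USER` here to match (see the RUN.md "Sandbox deployment" section).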
|
|
||||||
DEFAULT_EXEC_USER = "1000:1000"
|
|
||||||
|
|
||||||
# host 侧 tmp 脚本目录(user_root 内 dotfile,被 /v1/files API 隐藏)
|
|
||||||
TMP_SUBDIR = ".zcbot_tmp"
|
|
||||||
|
|
||||||
|
|
||||||
class DockerExecutor(Executor):
    """Combines HostExecutor with docker exec dispatch for shell/run_python.

    The host backend still provides the schema list and runs most tools; this class only
    takes over when shell/run_python is called, and runs it via docker exec in the per-user container.
    """

    def __init__(
        self,
        host: HostExecutor,
        pool: SandboxPool,
        user_id: UUID,
        user_root: Path,
        working_dir: Path,
    ) -> None:
        self.host = host
        self.pool = pool
        self.user_id = user_id
        self.user_root = user_root.resolve()
        self.working_dir = working_dir.resolve()
        # Matching path inside the container: /workspace/<wd_name>
        try:
            wd_rel = self.working_dir.relative_to(self.user_root)
            self.container_workdir = "/workspace/" + wd_rel.as_posix()
        except ValueError:
            # working_dir is not under user_root: defensive fallback, not reached on the normal path
            self.container_workdir = "/workspace"
        self.exec_user = os.getenv("ZCBOT_SANDBOX_EXEC_USER", DEFAULT_EXEC_USER)

    # ── Executor interface ───────────────────────────────────

    def has_tool(self, name: str) -> bool:
        return self.host.has_tool(name)

    def schemas(self) -> List[Dict[str, Any]]:
        return self.host.schemas()

    def call_tool(self, name: str, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        if name not in CONTAINER_TOOLS:
            return self.host.call_tool(name, args, ctx)
        if not self.host.has_tool(name):
            # E.g. caps.enable_run_python=False: the host never registered run_python,
            # so its schema was not exposed either
            return ToolResult(content=f"[Error] unknown tool: {name}", exit_code=2)
        try:
            if name == "shell":
                return self._exec_shell(args, ctx)
            if name == "run_python":
                return self._exec_python(args, ctx)
        except Exception as e:
            return ToolResult(
                content=f"[Error executing {name} via docker] {type(e).__name__}: {e}",
                exit_code=1,
            )
        return ToolResult(content=f"[Error] unhandled container tool: {name}", exit_code=2)

    # ── shell ────────────────────────────────────────────────

    def _exec_shell(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        cmd = args.get("command")
        if not isinstance(cmd, str) or not cmd.strip():
            return ToolResult(
                content="[Error] bad arguments to shell: command must be non-empty string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 60)

        container = self.pool.ensure(self.user_id)
        # setsid puts the command in its own session inside the container
        argv = self._docker_exec_argv(container) + ["setsid", "bash", "-c", cmd]
        result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
        self.pool.mark_active(self.user_id)
        return result

    # ── run_python ───────────────────────────────────────────

    def _exec_python(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        code = args.get("code")
        if not isinstance(code, str):
            return ToolResult(
                content="[Error] bad arguments to run_python: code must be string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 120)

        # The tmp .py is written on the host at `.zcbot_tmp/<task_id>/<rand>.py`;
        # inside the container it appears at /workspace/.zcbot_tmp/<task_id>/<rand>.py
        tmp_root = self.user_root / TMP_SUBDIR / str(ctx.task_id)
        tmp_root.mkdir(parents=True, exist_ok=True)
        rand_name = f"{int(time.time() * 1000)}-{secrets.token_hex(4)}.py"
        host_script = tmp_root / rand_name
        container_script = f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/{rand_name}"
        host_script.write_text(code, encoding="utf-8")

        try:
            container = self.pool.ensure(self.user_id)
            argv = self._docker_exec_argv(
                container,
                extra_env={
                    "PYTHONIOENCODING": "utf-8",
                    "PYTHONPATH": "/workspace",
                },
            ) + ["setsid", "python", container_script]
            result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
            self.pool.mark_active(self.user_id)
            return result
        finally:
            # Always remove the tmp script, including on timeout / cancel / error
            try:
                host_script.unlink()
            except OSError:
                pass

    # ── helpers ──────────────────────────────────────────────

    def _docker_exec_argv(
        self, container: str, extra_env: Optional[Dict[str, str]] = None
    ) -> List[str]:
        """Build `docker exec [options] <container>`; the caller appends the command."""
        argv = [
            "docker", "exec",
            "--user", self.exec_user,
            "--workdir", self.container_workdir,
        ]
        env: Dict[str, str] = {}
        if extra_env:
            env.update(extra_env)
        for k, v in env.items():
            argv.extend(["-e", f"{k}={v}"])
        argv.append(container)
        return argv

    def _run_subprocess(
        self, argv: List[str], timeout: int, ctx: ExecCtx
    ) -> ToolResult:
        """Run the docker exec subprocess, polling cooperatively for cancel.

        On cancel or timeout, Popen.kill() kills the docker CLI client;
        the server-side process in the container is the accepted limitation (see the module docstring).
        """
        cancel_check = ctx.cancel_check
        try:
            proc = subprocess.Popen(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                text=True,
                encoding="utf-8",
                errors="replace",
            )
        except FileNotFoundError as e:
            return ToolResult(content=f"[Error] docker CLI not found: {e}", exit_code=2)

        start = time.monotonic()
        cancel_hit = False
        timeout_hit = False
        stdout: str = ""
        stderr: str = ""
        while True:
            try:
                # Wake up every 0.5s to check for cancel / timeout
                stdout, stderr = proc.communicate(timeout=0.5)
                break
            except subprocess.TimeoutExpired:
                if cancel_check is not None and cancel_check():
                    cancel_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break
                if time.monotonic() - start > timeout:
                    timeout_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break

        if timeout_hit:
            return ToolResult(
                content=f"[Error] command timed out after {timeout}s",
                exit_code=124,
            )
        if cancel_hit:
            return ToolResult(
                content="[Error] command cancelled by user",
                exit_code=130,
            )

        parts: List[str] = []
        if stdout:
            parts.append(f"[stdout]\n{stdout.rstrip()}")
        if stderr:
            parts.append(f"[stderr]\n{stderr.rstrip()}")
        parts.append(f"[exit {proc.returncode}]")
        return ToolResult(content="\n".join(parts), exit_code=proc.returncode)

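The cancel-aware polling loop in `_run_subprocess` can be tried without Docker. The following is a minimal sketch under stated assumptions: a plain child process stands in for `docker exec`, and `run_with_cancel`, its `poll_interval` parameter, and its tuple return value are illustrative names that do not exist in the project. The exit codes 124 (timeout) and 130 (cancel) follow the values used above.

```python
import subprocess
import sys
import time
from typing import Callable, List, Optional, Tuple


def run_with_cancel(
    argv: List[str],
    timeout: float,
    cancel_check: Optional[Callable[[], bool]] = None,
    poll_interval: float = 0.1,
) -> Tuple[int, str, str]:
    """Hypothetical helper: run argv and return (exit_code, stdout, stderr).

    124 means the timeout was hit, 130 means cancel_check() returned True.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    start = time.monotonic()
    while True:
        try:
            # communicate() drains both pipes, so the child cannot block on a full pipe;
            # retrying after TimeoutExpired does not lose output.
            out, err = proc.communicate(timeout=poll_interval)
            return proc.returncode, out, err
        except subprocess.TimeoutExpired:
            cancelled = cancel_check is not None and cancel_check()
            timed_out = time.monotonic() - start > timeout
            if cancelled or timed_out:
                proc.kill()
                out, err = proc.communicate()  # reap the child, keep partial output
                return (124 if timed_out else 130), out, err


if __name__ == "__main__":
    print(run_with_cancel([sys.executable, "-c", "print('hello')"], timeout=5))
    print(run_with_cancel([sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.3))
```

As in the executor, killing the client here only ends the local process. With `docker exec`, whatever was started inside the container keeps running, which is the limitation the module docstring accepts.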
@@ -3,48 +3,17 @@
 Module boundaries:
 - `network.py`: Docker network ensure (`zcbot-sandbox-net`, `--internal` isolates outbound + cross-container traffic)
 - `pool.py`: per-user container lifecycle (ensure / mark_active / reap_idle / shutdown_all)
-- `__init__.py`: module-level singleton (`init_pool` / `get_pool`), so the web lifespan and
-  `agent_builder` share the same pool instance.
-
-Not in this directory: the docker exec calls for the `shell` / `run_python` tools. Those live in `core/executor_docker.py`,
-which calls this module's `pool.ensure(user_id)` to get the container name before exec.
+
+Not in this directory: the docker exec calls for the `shell` / `run_python` tools. Those belong to Step 3's
+`core/executor_docker.py`, which calls this module's `pool.ensure(user_id)` to get the container name before exec.
 """
-from __future__ import annotations
-
-from pathlib import Path
-from typing import Optional
-
 from .pool import SandboxPool, container_name, setup_pool
 from .network import NETWORK_NAME, ensure_network
 
-
 __all__ = [
     "SandboxPool",
     "container_name",
     "setup_pool",
     "NETWORK_NAME",
     "ensure_network",
-    "init_pool",
-    "get_pool",
 ]
-
-
-# Module-level singleton. The web lifespan startup hook calls `init_pool(user_root_base)`;
-# `agent_builder` calls `get_pool()` to get the same instance when it builds a DockerExecutor.
-# Not initialized → `get_pool()` returns None, and agent_builder must not take the docker branch.
-_pool: Optional[SandboxPool] = None
-
-
-def init_pool(user_root_base: Path) -> SandboxPool:
-    """Idempotently initialize the module-level pool and return it.
-
-    The lifespan calls this once; ensure_network is idempotent too. Repeated calls return the same instance (no rebuild).
-    """
-    global _pool
-    if _pool is None:
-        _pool = setup_pool(user_root_base)
-    return _pool
-
-
-def get_pool() -> Optional[SandboxPool]:
-    return _pool
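The `init_pool` / `get_pool` pair on the removed side of this hunk is a plain module-level singleton. A minimal sketch of the same pattern, assuming a stand-in `Pool` class (hypothetical, in place of `SandboxPool` and `setup_pool`) so that it runs without Docker:

```python
from typing import Optional


class Pool:
    """Hypothetical stand-in for SandboxPool; it only remembers its base path."""

    def __init__(self, base: str) -> None:
        self.base = base


_pool: Optional[Pool] = None


def init_pool(base: str) -> Pool:
    """Create the pool on the first call; later calls return the same instance."""
    global _pool
    if _pool is None:
        _pool = Pool(base)
    return _pool


def get_pool() -> Optional[Pool]:
    """Return the pool, or None when init_pool has not run yet."""
    return _pool
```

Callers that receive `None` from `get_pool()` have to fall back to the non-docker path, as the source comment requires. Note that this sketch is not guarded against two threads calling `init_pool` at the same moment; it relies on a single startup call, as the lifespan hook in the source does.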
@@ -1,258 +0,0 @@
"""Sandbox deployment pre-flight check (`main.py sandbox check`).

Runs 5 independent probes; each prints `[ok]` / `[warn]` / `[err]`, and the summary decides the exit code.
Every item must be `[ok]` before opening to external users.

Probes and the §7.5 protocol items they map to:
1. Docker daemon reachable -- required when ZCBOT_SANDBOX_BACKEND=docker is enabled
2. `zcbot-sandbox:latest` image present -- otherwise docker run in pool.ensure fails with "Unable to find image"
3. `zcbot-sandbox-net` network present -- fine if missing (init_pool ensures it automatically), but it is warmed up early
4. Image HOST_UID matches the host zcbot uid -- a mismatch makes writes to /workspace fail with EACCES after exec
5. user_root_base fs type supports quota -- §7.5 #4: xfs prjquota / ext4 project / zfs; otherwise
   "filling the shared fs between scans" takes down other users on the same node
   (an attacker fills the disk far faster than the app-level periodic scan runs)
"""
from __future__ import annotations

import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Tuple

from .pool import DEFAULT_IMAGE
from .network import NETWORK_NAME


# Plain `print` with an `[ok]` / `[warn]` / `[err]` prefix: no ANSI colors, and no click context needed.
def _ok(msg: str) -> None:
    print(f"[ok] {msg}")


def _warn(msg: str) -> None:
    print(f"[warn] {msg}")


def _err(msg: str) -> None:
    print(f"[err] {msg}")


def _run(argv, timeout: int = 10) -> Tuple[int, str, str]:
    """Shared subprocess.run wrapper. Missing executable → returncode=127, with the reason in stderr."""
    if shutil.which(argv[0]) is None:
        return 127, "", f"{argv[0]} not found in PATH"
    try:
        r = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return r.returncode, r.stdout.strip(), r.stderr.strip()
    except subprocess.TimeoutExpired:
        return 124, "", f"timed out after {timeout}s"
    except Exception as e:
        return 1, "", f"{type(e).__name__}: {e}"


# -- Probes ------------------------------------------------

def check_docker_daemon() -> bool:
    rc, out, err = _run(["docker", "version", "--format", "{{.Server.Version}}"])
    if rc == 0 and out:
        _ok(f"docker daemon reachable (server={out})")
        return True
    if rc == 127:
        _err("docker CLI not found -- apt install docker.io / docker-ce")
    elif "permission denied" in err.lower():
        _err(f"docker daemon not reachable: {err} -- usermod -aG docker $USER + relogin")
    else:
        _err(f"docker daemon not reachable: {err or 'unknown'}")
    return False


def check_image_present() -> bool:
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, _, err = _run(["docker", "image", "inspect", image])
    if rc == 0:
        _ok(f"image present: {image}")
        return True
    _err(
        f"image not found: {image} -- "
        f"`docker build -f deploy/sandbox/Dockerfile "
        f"--build-arg HOST_UID=$(id -u) --build-arg HOST_GID=$(id -g) "
        f"-t {image} .`"
    )
    return False


def check_network_present() -> bool:
    rc, _, _ = _run(["docker", "network", "inspect", NETWORK_NAME])
    if rc == 0:
        _ok(f"network present: {NETWORK_NAME}")
        return True
    _warn(
        f"network missing: {NETWORK_NAME} -- lifespan startup ensures it automatically; "
        f"or create it manually with `docker network create --internal {NETWORK_NAME}`"
    )
    return True  # a warn does not count as a failure


def check_host_uid_alignment() -> bool:
    """Check that the zcbot user's uid inside the image matches the current host uid.

    The bind mount carries the host fs owner straight into the container. If `HOST_UID` was
    not passed at image build time, the container defaults to uid=1000; when the host account
    that actually runs the zcbot service has a different uid, every exec write to /workspace
    fails with EACCES. This runs `docker run --rm --entrypoint id <image> -u zcbot` to read
    the image uid and compares it with the host `os.getuid()` (assuming the check subcommand
    is run as the zcbot user).
    """
    image = os.getenv("ZCBOT_SANDBOX_IMAGE", DEFAULT_IMAGE)
    rc, out, err = _run(
        ["docker", "run", "--rm", "--entrypoint", "id", image, "-u", "zcbot"]
    )
    if rc != 0:
        _warn(
            f"image uid check skipped: {err or 'unknown'} -- "
            f"if the image is not built yet, build it first and rerun"
        )
        return True

    try:
        image_uid = int(out)
    except ValueError:
        _warn(f"image uid unexpected output: {out!r}")
        return True

    if sys.platform == "win32":
        _warn(
            f"image zcbot uid={image_uid}; host uid check skipped on Windows "
            f"(the check is only meaningful on a Linux deployment host)"
        )
        return True

    host_uid = os.getuid()  # type: ignore[attr-defined]
    if image_uid == host_uid:
        _ok(f"HOST_UID aligned: image zcbot uid={image_uid} == host uid={host_uid}")
        return True
    _err(
        f"HOST_UID mismatch: image zcbot uid={image_uid}, host uid={host_uid} -- "
        f"rebuild the image with `docker build --build-arg HOST_UID={host_uid} ...`"
    )
    return False


def detect_fs_quota(target: Path) -> Tuple[str, str]:
    """Detect whether the fs holding target supports quota; returns (level, msg).

    level ∈ {"ok", "warn"}: fs quota is never treated as err (it does not block web startup).
    Shared by the CLI and the lifespan: the CLI prints through _ok/_warn, the lifespan through print.

    Recognized cases:
    - xfs: mount options contain `prjquota` or `pquota` → ok; otherwise warn (supported by the fs but not enabled)
    - ext4: mount options contain `prjquota` or `project,quota` → ok
    - zfs: always ok (dataset quota lives at the zfs set layer, not examined further here)
    - btrfs: warn that quota groups are complex
    - tmpfs / overlay / other: warn (typical for Docker-in-Docker or local dev; should not be used in production)
    """
    if sys.platform == "win32":
        return "warn", "fs quota check skipped on Windows (only meaningful on a Linux deployment host)"

    # findmnt ships with most Linux distributions (util-linux)
    rc, out, err = _run([
        "findmnt", "--target", str(target), "-no", "FSTYPE,OPTIONS",
    ])
    if rc != 0 or not out:
        return "warn", (
            f"fs quota check skipped: cannot detect fs for {target} "
            f"({err or 'findmnt missing'})"
        )

    # Output has two whitespace-separated columns: FSTYPE and comma-separated OPTIONS
    parts = out.split()
    fstype = parts[0].lower() if parts else ""
    options = parts[1] if len(parts) > 1 else ""
    opts = set(options.split(","))

    if fstype == "xfs":
        if "prjquota" in opts or "pquota" in opts:
            return "ok", f"fs quota: xfs with prjquota on {target}"
        return "warn", (
            f"fs quota: xfs on {target} but NO prjquota mount option -- "
            f"`sudo mount -o remount,prjquota <mountpoint>` + `xfs_quota -x ...`"
        )
    if fstype == "ext4":
        if "prjquota" in opts or ("project" in opts and "quota" in opts):
            return "ok", f"fs quota: ext4 with project quota on {target}"
        return "warn", (
            f"fs quota: ext4 on {target} but NO project quota option -- "
            f"`tune2fs -O project,quota <dev>` + remount + `quota -P`"
        )
    if fstype == "zfs":
        return "ok", f"fs quota: zfs on {target} (dataset quota via `zfs set quota=...`)"
    if fstype == "btrfs":
        return "warn", (
            f"fs quota: btrfs on {target} -- qgroup setup is complex; xfs prjquota is "
            f"recommended for production. If btrfs is required, verify `btrfs qgroup` yourself"
        )
    return "warn", (
        f"fs quota: {fstype or '<unknown>'} on {target} -- "
        f"not a mainstream quota-capable type; move to a dedicated xfs/ext4/zfs partition "
        f"before opening to external users"
    )


def check_fs_quota_capable() -> bool:
    """CLI entry: probe the fs that holds workspace/users/. Always returns True (never err)."""
    from core.agent_builder import load_config, resolve_workspace

    try:
        cfg = load_config()
        workspace = resolve_workspace(None, cfg)
        target = (workspace / "users").resolve()
    except Exception as e:
        _warn(f"fs quota check: cannot resolve workspace path: {e}")
        return True

    level, msg = detect_fs_quota(target)
    if level == "ok":
        _ok(msg)
    else:
        _warn(msg)
    return True


# -- Summary entry point -----------------------------------

CHECK_NAMES = [
    ("docker daemon", "check_docker_daemon"),
    ("image present", "check_image_present"),
    ("network present", "check_network_present"),
    ("HOST_UID alignment", "check_host_uid_alignment"),
    ("fs quota capable", "check_fs_quota_capable"),
]


def run_sandbox_check() -> int:
    """Run every probe and return the exit code (0 = all ok or warn only; 1 = at least one err).

    err vs warn:
    - err = a root cause that makes docker backend startup fail fast (daemon / image / HOST_UID)
    - warn = does not block startup but must be cleared before opening to external users
      (network missing / fs without quota)

    Function references are looked up on the module at call time (not frozen into a CHECKS tuple),
    so a unittest patch of `core.sandbox.check.check_xxx` takes effect in this function.
    """
    print("--- sandbox deployment check ---\n")
    ok_count = 0
    module = sys.modules[__name__]
    for label, fn_name in CHECK_NAMES:
        fn = getattr(module, fn_name)
        try:
            if fn():
                ok_count += 1
        except Exception as e:
            _err(f"{label}: unexpected {type(e).__name__}: {e}")
    total = len(CHECK_NAMES)
    print()
    if ok_count == total:
        print(f"[summary] {ok_count}/{total} checks passed -- docker backend ready")
        return 0
    failed = total - ok_count
    print(
        f"[summary] {ok_count}/{total} passed, {failed} failed -- "
        f"fix the [err] items above before starting the docker backend"
    )
    return 1
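The decision table inside `detect_fs_quota` can be exercised without calling `findmnt`. A minimal sketch, assuming the input is one `FSTYPE OPTIONS` line in the shape that `findmnt -no FSTYPE,OPTIONS` prints; `classify_quota` and the sample lines are illustrative, not part of the project, and the sketch covers only the xfs / ext4 / zfs branches:

```python
from typing import List, Tuple


def classify_quota(findmnt_line: str) -> Tuple[str, str]:
    """Hypothetical helper: classify one `FSTYPE OPTIONS` line as ("ok" | "warn", reason)."""
    parts = findmnt_line.split()
    fstype = parts[0].lower() if parts else ""
    opts = set(parts[1].split(",")) if len(parts) > 1 else set()

    if fstype == "xfs":
        # xfs reports project quota as `prjquota` (alias `pquota`)
        if opts & {"prjquota", "pquota"}:
            return "ok", "xfs with project quota enabled"
        return "warn", "xfs without the prjquota mount option"
    if fstype == "ext4":
        if "prjquota" in opts:
            return "ok", "ext4 with project quota enabled"
        return "warn", "ext4 without the prjquota mount option"
    if fstype == "zfs":
        return "ok", "zfs (quota is set per dataset)"
    return "warn", f"{fstype or 'unknown'} is not treated as quota-capable"


# Sample lines are made up for illustration; real output depends on the host.
samples: List[str] = [
    "xfs rw,relatime,attr2,inode64,prjquota",
    "xfs rw,relatime",
    "ext4 rw,relatime",
    "zfs rw,xattr,noacl",
    "overlay rw,relatime,lowerdir=/a,upperdir=/b",
]
for line in samples:
    print(line.split()[0], "->", classify_quota(line))
```

Keeping the classification separate from the subprocess call is what lets the source reuse `detect_fs_quota` from both the CLI and the lifespan hook, and it also makes each branch easy to test with plain strings.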
@@ -5,30 +5,27 @@
 out of the workspace directory).
 
 Lifecycle:
-- `ensure(user_id)`: serialized by a per-user `threading.Lock` → probe with `docker inspect` →
-  return immediately if already running; if it exists but is stopped, `rm -f` first and restart (so iptables is re-applied);
-  if it does not exist, `docker run`
+- `ensure(user_id)`: serialized by a per-user `asyncio.Lock` → probe with `docker inspect` → return immediately if
+  already running; if it exists but is stopped, `rm -f` first and restart (so iptables is re-applied); if it does not exist, `docker run`
 - `mark_active(user_id)`: after an exec, update the in-memory `_last_active[uid]=now` (docker labels
   cannot be modified at runtime: Docker 23+ removed support for `docker update --label-add`)
 - `reap_idle()`: periodic task; scans the `_last_active` dict and runs `docker rm -f` on entries older than `idle_ttl`
 - `shutdown_all()`: on app startup, clears orphans left by the previous process (`docker ps --filter label=zcbot.product=sandbox`)
 
-The API is fully synchronous: the main callers of ensure are AgentLoop / DockerExecutor, which run in a web BG thread
-and are synchronous anyway; the reaper runs in the uvicorn main loop and calls this class's sync methods through `run_in_executor`.
-threading.Lock works across threads, whereas asyncio.Lock protection is bypassed when ephemeral loops are created / destroyed.
-
 Idempotency:
 - a repeated ensure costs one daemon round-trip < 100ms (a plain `docker inspect`); the per-user lock
   keeps two concurrent `docker run --name` calls for the same user from hitting "Conflict" (docker itself would reject it,
   but locking up front is cleaner)
 - the reaper only kills containers recorded in the dict: after a restart the dict is empty → historical orphans are not killed (the startup
   `shutdown_all` covers those)
+
+Step 2 scope: pool / lifecycle only. Tools (shell / run_python) are wired in at Step 3.
 """
 from __future__ import annotations
 
+import asyncio
 import os
 import subprocess
-import threading
 import time
 from pathlib import Path
 from typing import Dict, List, Optional
@@ -100,19 +97,17 @@ class SandboxPool:
             os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
         )
         self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
-        self._dict_lock = threading.Lock()  # guards dict-level races on _locks / _last_active
-        self._locks: Dict[UUID, threading.Lock] = {}
+        self._locks: Dict[UUID, asyncio.Lock] = {}
         self._last_active: Dict[UUID, int] = {}
 
-    def _lock_for(self, user_id: UUID) -> threading.Lock:
-        with self._dict_lock:
-            if user_id not in self._locks:
-                self._locks[user_id] = threading.Lock()
-            return self._locks[user_id]
+    def _lock_for(self, user_id: UUID) -> asyncio.Lock:
+        if user_id not in self._locks:
+            self._locks[user_id] = asyncio.Lock()
+        return self._locks[user_id]
 
-    def ensure(self, user_id: UUID) -> str:
-        """Return the container name; create-or-reuse is atomic. Blocks synchronously; the main caller AgentLoop already runs in a BG thread."""
-        with self._lock_for(user_id):
+    async def ensure(self, user_id: UUID) -> str:
+        """Return the container name; create-or-reuse is atomic."""
+        async with self._lock_for(user_id):
             name = container_name(user_id)
             if _container_running(name):
                 self._last_active[user_id] = _now()
@@ -123,7 +118,7 @@ class SandboxPool:
                     ["docker", "rm", "-f", name],
                     capture_output=True, check=False,
                 )
-            self._docker_run(user_id, name)
+            await asyncio.to_thread(self._docker_run, user_id, name)
             self._last_active[user_id] = _now()
             return name
 
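The `threading.Lock` side of the hunks above uses one guard lock for the registry plus one lock per user. A minimal sketch of that pattern, assuming string keys instead of UUIDs and an in-memory set in place of `docker inspect` / `docker run`; `KeyedLocks`, `ensure`, `running`, and `created` are illustrative names only:

```python
import threading
import time
from typing import Dict, List, Set


class KeyedLocks:
    """Hands out one threading.Lock per key; a second lock guards the registry itself."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            # Creating the entry under the guard means two threads can never
            # end up with two different locks for the same key.
            return self._locks.setdefault(key, threading.Lock())


locks = KeyedLocks()
running: Set[str] = set()   # stands in for "container is running"
created: List[str] = []     # stands in for "docker run was executed"


def ensure(user: str) -> None:
    with locks.lock_for(user):
        if user in running:
            return
        time.sleep(0.01)  # widen the race window that the per-key lock closes
        running.add(user)
        created.append(user)


threads = [threading.Thread(target=ensure, args=("u1",)) for _ in range(8)]
threads += [threading.Thread(target=ensure, args=("u2",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(created))
```

Each user is "started" once even with eight concurrent callers, while callers for different users do not wait on each other. These locks are shared by every thread in the process, which matches the reason the docstring gives for preferring `threading.Lock` when callers run in background threads.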
main.py
@@ -198,25 +198,5 @@ def web(host: str, port: int, reload: bool) -> None:
     uvicorn.run(create_app(), host=host, port=port, log_level="info")
 
 
-# ─────────────── Sandbox (Stage C deployment pre-flight check) ───────────────
-
-@cli.group()
-def sandbox() -> None:
-    """Sandbox container deployment check (run once before enabling `ZCBOT_SANDBOX_BACKEND=docker`)."""
-
-
-@sandbox.command("check")
-def sandbox_check() -> None:
-    """Check the docker backend startup prerequisites (daemon / image / network / HOST_UID / fs quota).
-
-    Non-blocking: each item prints `[ok]` / `[warn]` / `[err]` on its own, followed by a summary. Any `err` → exit 1;
-    all ok / warn only → exit 0. warn items do not block web startup, but **must be cleared before opening to external users**
-    (see the DESIGN §7.5 rollout checklist).
-    """
-    from core.sandbox.check import run_sandbox_check
-    rc = run_sandbox_check()
-    sys.exit(rc)
-
-
 if __name__ == "__main__":
     cli()
@@ -1,285 +0,0 @@
"""Unit tests for DockerExecutor.

subprocess is mocked (real `docker exec` runs are verified by the deployment-host smoke test; RUN.md lists 5 commands).
Key paths covered:
- trust-domain dispatch: host tools pass straight through / container tools go through docker exec
- argv shape: --user / --workdir / setsid / bash -c / python <script>
- tmp .py: written on the host under `.zcbot_tmp/<task_id>/`, unlinked after the run, nothing left behind
- timeout / cancel: Popen.kill() as the fallback
- schemas() / has_tool() pass through to the host
"""
from __future__ import annotations

import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
from uuid import uuid4

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.executor import ExecCtx, ToolResult
from core.executor_docker import DockerExecutor, TMP_SUBDIR
from core.executor_host import HostExecutor


class FakePool:
    """SandboxPool stand-in: ensure returns a fixed container name, mark_active records calls."""

    def __init__(self):
        self.ensure_calls = []
        self.mark_active_calls = []

    def ensure(self, user_id):
        name = f"zcbot-sandbox-{user_id}"
        self.ensure_calls.append(user_id)
        return name

    def mark_active(self, user_id):
        self.mark_active_calls.append(user_id)


class FakeTool:
    """tools.base.Tool stand-in: execute returns a string, schema exposes name + empty parameters."""

    def __init__(self, name, output="ok"):
        self.name = name
        self._output = output
        self.execute_calls = []

    @property
    def schema(self):
        return {"type": "function", "function": {"name": self.name}}

    def execute(self, **kwargs):
        self.execute_calls.append(kwargs)
        return self._output


def make_executor(tools_dict=None):
    """Build a DockerExecutor + FakePool + tmp user_root. Returns (executor, pool, tmp_dir)."""
    tmp = tempfile.mkdtemp()
    user_root = Path(tmp) / "users" / "u1"
    user_root.mkdir(parents=True)
    working_dir = user_root / "demo"
    working_dir.mkdir()

    if tools_dict is None:
        tools_dict = {
            "read": FakeTool("read", "READ_OUT"),
            "shell": FakeTool("shell"),  # the host shell must never be called
            "run_python": FakeTool("run_python"),
        }
    host = HostExecutor(tools_dict)
    pool = FakePool()
    executor = DockerExecutor(
        host=host,
        pool=pool,
        user_id=uuid4(),
        user_root=user_root,
        working_dir=working_dir,
    )
    return executor, pool, Path(tmp)


def make_ctx(executor):
    return ExecCtx(
        user_id=executor.user_id,
        task_id=uuid4(),
        working_dir=executor.working_dir,
        cancel_check=None,
    )


class TestHostPassthrough(unittest.TestCase):
    """Non-container tools pass straight through to the host backend, without touching pool / subprocess."""

    def test_read_passthrough_to_host(self):
        executor, pool, _ = make_executor()
        ctx = make_ctx(executor)
        result = executor.call_tool("read", {"file": "x"}, ctx)
        self.assertEqual(result.content, "READ_OUT")
        self.assertEqual(result.exit_code, 0)
        self.assertEqual(pool.ensure_calls, [])
        self.assertEqual(pool.mark_active_calls, [])

    def test_schemas_and_has_tool_from_host(self):
        executor, _, _ = make_executor()
        names = [s["function"]["name"] for s in executor.schemas()]
        self.assertIn("read", names)
        self.assertIn("shell", names)
        self.assertTrue(executor.has_tool("shell"))
        self.assertFalse(executor.has_tool("nope"))


class TestShellExec(unittest.TestCase):
|
|
||||||
"""shell 调用走 docker exec subprocess,argv 形态正确。"""
|
|
||||||
|
|
||||||
def test_shell_invokes_docker_exec(self):
|
|
||||||
executor, pool, _ = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
|
|
||||||
proc = MagicMock()
|
|
||||||
proc.communicate.return_value = ("hello\n", "")
|
|
||||||
proc.returncode = 0
|
|
||||||
|
|
||||||
with patch("core.executor_docker.subprocess.Popen", return_value=proc) as popen:
|
|
||||||
result = executor.call_tool("shell", {"command": "echo hello"}, ctx)
|
|
||||||
|
|
||||||
self.assertIn("[stdout]\nhello", result.content)
|
|
||||||
self.assertIn("[exit 0]", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 0)
|
|
||||||
|
|
||||||
argv = popen.call_args[0][0]
|
|
||||||
self.assertEqual(argv[:2], ["docker", "exec"])
|
|
||||||
self.assertIn("--user", argv)
|
|
||||||
self.assertIn("--workdir", argv)
|
|
||||||
# workdir 应是 /workspace/demo(working_dir 相对 user_root)
|
|
||||||
self.assertEqual(argv[argv.index("--workdir") + 1], "/workspace/demo")
|
|
||||||
# container name = zcbot-sandbox-<uid>
|
|
||||||
container_idx = argv.index(f"zcbot-sandbox-{executor.user_id}")
|
|
||||||
# setsid bash -c 必须出现且紧跟 container 之后
|
|
||||||
self.assertEqual(argv[container_idx + 1:], ["setsid", "bash", "-c", "echo hello"])
|
|
||||||
|
|
||||||
self.assertEqual(pool.ensure_calls, [executor.user_id])
|
|
||||||
self.assertEqual(pool.mark_active_calls, [executor.user_id])
|
|
||||||
|
|
||||||
def test_shell_bad_args(self):
|
|
||||||
executor, _, _ = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
result = executor.call_tool("shell", {"command": ""}, ctx)
|
|
||||||
self.assertIn("[Error]", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 2)
|
|
||||||
|
|
||||||
def test_shell_timeout(self):
|
|
||||||
executor, pool, _ = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
|
|
||||||
import subprocess as real_subprocess
|
|
||||||
proc = MagicMock()
|
|
||||||
# 第一次 communicate 抛 TimeoutExpired,第二次(kill 后)返空
|
|
||||||
proc.communicate.side_effect = [
|
|
||||||
real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
|
|
||||||
("", "killed\n"),
|
|
||||||
]
|
|
||||||
proc.returncode = -9
|
|
||||||
|
|
||||||
with patch("core.executor_docker.subprocess.Popen", return_value=proc), \
|
|
||||||
patch("core.executor_docker.time.monotonic", side_effect=[0, 100]):
|
|
||||||
result = executor.call_tool("shell", {"command": "sleep 9999", "timeout": 1}, ctx)
|
|
||||||
|
|
||||||
self.assertIn("timed out after 1s", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 124)
|
|
||||||
proc.kill.assert_called_once()
|
|
||||||
|
|
||||||
def test_shell_cancel(self):
|
|
||||||
executor, _, _ = make_executor()
|
|
||||||
ctx = ExecCtx(
|
|
||||||
user_id=executor.user_id,
|
|
||||||
task_id=uuid4(),
|
|
||||||
working_dir=executor.working_dir,
|
|
||||||
cancel_check=lambda: True,  # cancel immediately
|
|
||||||
)
|
|
||||||
|
|
||||||
import subprocess as real_subprocess
|
|
||||||
proc = MagicMock()
|
|
||||||
proc.communicate.side_effect = [
|
|
||||||
real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
|
|
||||||
("", ""),
|
|
||||||
]
|
|
||||||
proc.returncode = -15
|
|
||||||
|
|
||||||
with patch("core.executor_docker.subprocess.Popen", return_value=proc):
|
|
||||||
result = executor.call_tool("shell", {"command": "sleep 9999"}, ctx)
|
|
||||||
|
|
||||||
self.assertIn("cancelled by user", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 130)
|
|
||||||
proc.kill.assert_called_once()
|
|
||||||
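The timeout and cancel tests above imply a polling wait: `communicate()` is called with a short timeout, and between polls the executor checks the cancel flag and the deadline. The following is a minimal sketch of that pattern, not the code under test. The helper name `wait_with_cancel`, its parameters, and the constants are hypothetical; only the exit codes 124 (timeout) and 130 (cancel) and the kill-then-drain sequence come from the assertions above.

```python
import subprocess
import time
from typing import Callable, Optional, Tuple

EXIT_TIMEOUT = 124    # exit code the timeout test expects
EXIT_CANCELLED = 130  # exit code the cancel test expects


def wait_with_cancel(
    proc,
    timeout: Optional[float],
    cancel_check: Callable[[], bool],
    poll_interval: float = 0.5,
) -> Tuple[str, str, int]:
    """Hypothetical helper: wait for proc, polling for cancel and deadline."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            out, err = proc.communicate(timeout=poll_interval)
            return out, err, proc.returncode
        except subprocess.TimeoutExpired:
            pass  # still running; fall through to the checks below
        if cancel_check():
            code = EXIT_CANCELLED
        elif deadline is not None and time.monotonic() >= deadline:
            code = EXIT_TIMEOUT
        else:
            continue
        proc.kill()
        out, err = proc.communicate()  # drain whatever was buffered
        return out, err, code
```

Checking the cancel flag before the deadline means a user cancel wins when both conditions hold on the same poll; whether the real executor orders them this way is an assumption.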
|
|
||||||
|
|
||||||
class TestRunPython(unittest.TestCase):
|
|
||||||
"""run_python: the temp .py is written under user_root/.zcbot_tmp/<task_id>/ and unlinked after the run."""
|
|
||||||
|
|
||||||
def test_run_python_tmp_script(self):
|
|
||||||
executor, pool, tmp_root = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
|
|
||||||
proc = MagicMock()
|
|
||||||
proc.communicate.return_value = ("42\n", "")
|
|
||||||
proc.returncode = 0
|
|
||||||
|
|
||||||
captured_argv = []
|
|
||||||
|
|
||||||
def _popen(argv, **kwargs):
|
|
||||||
captured_argv.append(argv)
|
|
||||||
return proc
|
|
||||||
|
|
||||||
with patch("core.executor_docker.subprocess.Popen", side_effect=_popen):
|
|
||||||
result = executor.call_tool(
|
|
||||||
"run_python", {"code": "print(42)"}, ctx
|
|
||||||
)
|
|
||||||
|
|
||||||
self.assertIn("[stdout]\n42", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 0)
|
|
||||||
|
|
||||||
argv = captured_argv[0]
|
|
||||||
# Tail of argv: setsid python /workspace/.zcbot_tmp/<task_id>/<rand>.py
|
|
||||||
self.assertEqual(argv[-3], "setsid")
|
|
||||||
self.assertEqual(argv[-2], "python")
|
|
||||||
self.assertTrue(argv[-1].startswith(f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/"))
|
|
||||||
self.assertTrue(argv[-1].endswith(".py"))
|
|
||||||
# PYTHONIOENCODING / PYTHONPATH are injected via -e
|
|
||||||
env_kvs = [argv[i + 1] for i, a in enumerate(argv) if a == "-e"]
|
|
||||||
self.assertIn("PYTHONIOENCODING=utf-8", env_kvs)
|
|
||||||
self.assertIn("PYTHONPATH=/workspace", env_kvs)
|
|
||||||
|
|
||||||
# The host-side temp file is already unlinked (the directory may remain, which is fine: ensuring the container re-creates it with mkdir)
|
|
||||||
tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
|
|
||||||
leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
|
|
||||||
self.assertEqual(leftover, [], f"tmp .py not cleaned up: {leftover}")
|
|
||||||
|
|
||||||
def test_run_python_bad_code_type(self):
|
|
||||||
executor, _, _ = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
result = executor.call_tool("run_python", {"code": 123}, ctx)
|
|
||||||
self.assertIn("[Error]", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 2)
|
|
||||||
|
|
||||||
def test_run_python_cleans_tmp_on_exception(self):
|
|
||||||
"""The temp .py must still be cleaned up when Popen raises (the finally block is the safety net)."""
|
|
||||||
executor, _, _ = make_executor()
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
|
|
||||||
with patch(
|
|
||||||
"core.executor_docker.subprocess.Popen",
|
|
||||||
side_effect=RuntimeError("boom"),
|
|
||||||
):
|
|
||||||
result = executor.call_tool("run_python", {"code": "x"}, ctx)
|
|
||||||
|
|
||||||
self.assertIn("[Error executing run_python via docker]", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 1)
|
|
||||||
tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
|
|
||||||
leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
|
|
||||||
self.assertEqual(leftover, [])
|
|
||||||
|
|
||||||
|
|
||||||
class TestUnknownTool(unittest.TestCase):
|
|
||||||
def test_unknown_tool_goes_to_host(self):
|
|
||||||
executor, _, _ = make_executor(tools_dict={})  # empty host registry: no tools at all
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
result = executor.call_tool("nope", {}, ctx)
|
|
||||||
self.assertIn("unknown tool", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 2)
|
|
||||||
|
|
||||||
def test_container_tool_not_registered_on_host(self):
|
|
||||||
"""caps.enable_run_python=False: run_python is not registered on the host, so the docker executor should reject it too."""
|
|
||||||
executor, _, _ = make_executor(tools_dict={"read": FakeTool("read")})
|
|
||||||
ctx = make_ctx(executor)
|
|
||||||
result = executor.call_tool("run_python", {"code": "x"}, ctx)
|
|
||||||
self.assertIn("unknown tool", result.content)
|
|
||||||
self.assertEqual(result.exit_code, 2)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
unittest.main()
|
|
||||||
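The assertions in this test file pin down the shape of the `docker exec` command line: options first, then the container name, then `setsid` and the inner command. Below is a rough sketch of an argv builder that would satisfy them. The function `build_exec_argv`, its parameters, and the default user `zcbot` are assumptions made for illustration; the real `core.executor_docker` is not shown in this diff and may be structured differently.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def build_exec_argv(
    container: str,
    inner_cmd: List[str],
    workdir_rel: str = "",
    user: str = "zcbot",  # assumed in-container user name
    env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: assemble a `docker exec` argv for the sandbox."""
    # The user root is assumed to be mounted at /workspace inside the container.
    workdir = str(PurePosixPath("/workspace") / workdir_rel)
    argv = ["docker", "exec", "--user", user, "--workdir", workdir]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    # Options must precede the container name; everything after it is the
    # command, which runs under setsid so it gets its own session.
    return argv + [container, "setsid", *inner_cmd]
```

For the shell tool the inner command would be `["bash", "-c", command]`; for `run_python` it would be `["python", "/workspace/.zcbot_tmp/<task_id>/<rand>.py"]` with `PYTHONIOENCODING` and `PYTHONPATH` passed through `env`.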
|
|
@ -1,186 +0,0 @@
|
||||||
"""Unit tests for the `main.py sandbox check` probe functions.
|
|
||||||
|
|
||||||
Mocks subprocess and verifies:
|
|
||||||
- the branches for daemon unreachable / image missing / network missing / uid mismatch
|
|
||||||
- detect_fs_quota's verdict for xfs/ext4/zfs/btrfs/other plus the prjquota mount option
|
|
||||||
- the summary exit code: all ok / warn only / any err
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import unittest
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import patch
|
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
|
|
||||||
|
|
||||||
from core.sandbox.check import (
|
|
||||||
check_docker_daemon,
|
|
||||||
check_image_present,
|
|
||||||
check_host_uid_alignment,
|
|
||||||
detect_fs_quota,
|
|
||||||
run_sandbox_check,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def _mk_run(returns):
|
|
||||||
"""Build a stand-in for `_run` that returns the (rc, stdout, stderr) tuples from the list in call order."""
|
|
||||||
iter_ret = iter(returns)
|
|
||||||
|
|
||||||
def fake_run(argv, timeout=10):
|
|
||||||
return next(iter_ret)
|
|
||||||
return fake_run
|
|
||||||
|
|
||||||
|
|
||||||
class TestDaemonCheck(unittest.TestCase):
|
|
||||||
def test_daemon_ok(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "24.0.7", "")])):
|
|
||||||
self.assertTrue(check_docker_daemon())
|
|
||||||
|
|
||||||
def test_daemon_cli_missing(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(127, "", "docker not found in PATH")])):
|
|
||||||
self.assertFalse(check_docker_daemon())
|
|
||||||
|
|
||||||
def test_daemon_permission_denied(self):
|
|
||||||
with patch(
|
|
||||||
"core.sandbox.check._run",
|
|
||||||
_mk_run([(1, "", "Got permission denied while trying to connect")]),
|
|
||||||
):
|
|
||||||
self.assertFalse(check_docker_daemon())
|
|
||||||
|
|
||||||
|
|
||||||
class TestImageCheck(unittest.TestCase):
|
|
||||||
def test_image_present(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "[...]", "")])):
|
|
||||||
self.assertTrue(check_image_present())
|
|
||||||
|
|
||||||
def test_image_missing(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(1, "", "No such image")])):
|
|
||||||
self.assertFalse(check_image_present())
|
|
||||||
|
|
||||||
|
|
||||||
class TestHostUidAlignment(unittest.TestCase):
|
|
||||||
def test_uid_aligned(self):
|
|
||||||
if sys.platform == "win32":
|
|
||||||
self.skipTest("getuid not on Windows")
|
|
||||||
import os
|
|
||||||
host_uid = os.getuid() # type: ignore[attr-defined]
|
|
||||||
with patch(
|
|
||||||
"core.sandbox.check._run",
|
|
||||||
_mk_run([(0, str(host_uid), "")]),
|
|
||||||
):
|
|
||||||
self.assertTrue(check_host_uid_alignment())
|
|
||||||
|
|
||||||
def test_uid_mismatch(self):
|
|
||||||
if sys.platform == "win32":
|
|
||||||
self.skipTest("getuid not on Windows")
|
|
||||||
import os
|
|
||||||
bad = os.getuid() + 1 # type: ignore[attr-defined]
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, str(bad), "")])):
|
|
||||||
self.assertFalse(check_host_uid_alignment())
|
|
||||||
|
|
||||||
def test_image_not_built_yet(self):
|
|
||||||
# docker run fails → warn, not err
|
|
||||||
with patch(
|
|
||||||
"core.sandbox.check._run",
|
|
||||||
_mk_run([(125, "", "Unable to find image")]),
|
|
||||||
):
|
|
||||||
self.assertTrue(check_host_uid_alignment())
|
|
||||||
|
|
||||||
def test_skipped_on_windows(self):
|
|
||||||
with patch("core.sandbox.check.sys") as mock_sys, \
|
|
||||||
patch("core.sandbox.check._run", _mk_run([(0, "1000", "")])):
|
|
||||||
mock_sys.platform = "win32"
|
|
||||||
self.assertTrue(check_host_uid_alignment())
|
|
||||||
|
|
||||||
|
|
||||||
class TestDetectFsQuota(unittest.TestCase):
|
|
||||||
"""detect_fs_quota: no print dependency; it just returns (level, msg), which is easy to assert on."""
|
|
||||||
|
|
||||||
def test_xfs_with_prjquota(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,prjquota,attr2", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/opt/zcbot/workspace/users"))
|
|
||||||
self.assertEqual(level, "ok")
|
|
||||||
self.assertIn("xfs with prjquota", msg)
|
|
||||||
|
|
||||||
def test_xfs_without_prjquota(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "xfs rw,relatime,attr2", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/opt"))
|
|
||||||
self.assertEqual(level, "warn")
|
|
||||||
self.assertIn("NO prjquota", msg)
|
|
||||||
|
|
||||||
def test_ext4_with_project_quota(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "ext4 rw,prjquota", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/opt"))
|
|
||||||
self.assertEqual(level, "ok")
|
|
||||||
self.assertIn("ext4 with project quota", msg)
|
|
||||||
|
|
||||||
def test_zfs(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "zfs rw,xattr,noacl", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/tank/zcbot"))
|
|
||||||
self.assertEqual(level, "ok")
|
|
||||||
self.assertIn("zfs", msg)
|
|
||||||
|
|
||||||
def test_btrfs_warns(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "btrfs rw,relatime", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/opt"))
|
|
||||||
self.assertEqual(level, "warn")
|
|
||||||
self.assertIn("btrfs", msg)
|
|
||||||
|
|
||||||
def test_tmpfs_warns(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(0, "tmpfs rw", "")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/tmp"))
|
|
||||||
self.assertEqual(level, "warn")
|
|
||||||
|
|
||||||
def test_findmnt_missing(self):
|
|
||||||
with patch("core.sandbox.check._run", _mk_run([(127, "", "findmnt not found in PATH")])), \
|
|
||||||
patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "linux"
|
|
||||||
level, msg = detect_fs_quota(Path("/opt"))
|
|
||||||
self.assertEqual(level, "warn")
|
|
||||||
self.assertIn("findmnt", msg)
|
|
||||||
|
|
||||||
def test_windows_skipped(self):
|
|
||||||
with patch("core.sandbox.check.sys") as mock_sys:
|
|
||||||
mock_sys.platform = "win32"
|
|
||||||
level, msg = detect_fs_quota(Path("C:/"))
|
|
||||||
self.assertEqual(level, "warn")
|
|
||||||
self.assertIn("Windows", msg)
|
|
||||||
|
|
||||||
|
|
||||||
class TestSummaryExitCode(unittest.TestCase):
|
|
||||||
"""run_sandbox_check summary: any err → exit 1; all ok / warn only → exit 0."""
|
|
||||||
|
|
||||||
def test_all_ok_exits_zero(self):
|
|
||||||
with patch("core.sandbox.check.check_docker_daemon", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_image_present", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_network_present", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
|
|
||||||
rc = run_sandbox_check()
|
|
||||||
self.assertEqual(rc, 0)
|
|
||||||
|
|
||||||
def test_any_err_exits_one(self):
|
|
||||||
with patch("core.sandbox.check.check_docker_daemon", return_value=False), \
|
|
||||||
patch("core.sandbox.check.check_image_present", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_network_present", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_host_uid_alignment", return_value=True), \
|
|
||||||
patch("core.sandbox.check.check_fs_quota_capable", return_value=True):
|
|
||||||
rc = run_sandbox_check()
|
|
||||||
self.assertEqual(rc, 1)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
unittest.main()
|
|
||||||
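The `TestDetectFsQuota` cases above feed a single `FSTYPE OPTIONS` line (the shape `findmnt -no FSTYPE,OPTIONS` prints) and expect a `(level, msg)` pair back. A minimal sketch of that classification step follows. The function `classify_fs_quota` is hypothetical and covers only the parsing; the actual `detect_fs_quota` in `core.sandbox.check` also handles the Windows and missing-`findmnt` branches, and its message wording beyond the substrings asserted above is not known from this diff.

```python
from typing import Tuple


def classify_fs_quota(findmnt_line: str) -> Tuple[str, str]:
    """Hypothetical helper: map one `FSTYPE OPTIONS` line to (level, msg)."""
    parts = findmnt_line.split(None, 1)
    if not parts:
        return "warn", "could not parse findmnt output"
    fstype = parts[0]
    options = set(parts[1].strip().split(",")) if len(parts) > 1 else set()
    has_prjquota = "prjquota" in options
    if fstype == "xfs":
        if has_prjquota:
            return "ok", "xfs with prjquota"
        return "warn", "xfs mounted with NO prjquota"
    if fstype == "ext4":
        if has_prjquota:
            return "ok", "ext4 with project quota"
        return "warn", "ext4 mounted without project quota"
    if fstype == "zfs":
        return "ok", "zfs (per-dataset quota available)"
    # btrfs, tmpfs and anything else: no project quota this check relies on
    return "warn", f"{fstype}: no supported project quota"
```

Returning a pair instead of printing keeps the function usable both from the CLI summary and from the web app's startup hook, which is the reuse the tests' docstring describes.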
63
web/app.py
|
|
@ -481,7 +481,7 @@ def create_app() -> FastAPI:
|
||||||
async def lifespan(app: FastAPI):
|
async def lifespan(app: FastAPI):
|
||||||
broker.bind_loop(asyncio.get_running_loop())
|
broker.bind_loop(asyncio.get_running_loop())
|
||||||
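# Scan the skill registry once at startup: the filesystem is static and does not change at runtime; /v1/skills reads it directly
|
# Scan the skill registry once at startup: the filesystem is static and does not change at runtime; /v1/skills reads it directly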
|
||||||
from core.agent_builder import load_config, resolve_workspace
|
from core.agent_builder import load_config
|
||||||
from core.paths import ROOT
|
from core.paths import ROOT
|
||||||
from core.skills import SkillRegistry
|
from core.skills import SkillRegistry
|
||||||
_cfg = load_config()
|
_cfg = load_config()
|
||||||
|
|
@ -500,68 +500,7 @@ def create_app() -> FastAPI:
|
||||||
)
|
)
|
||||||
if result.rowcount:
|
if result.rowcount:
|
||||||
print(f"[startup] reaped {result.rowcount} stale active run(s)")
|
print(f"[startup] reaped {result.rowcount} stale active run(s)")
|
||||||
|
|
||||||
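# Sandbox pool (§7.5): enabled only when ZCBOT_SANDBOX_BACKEND=docker.
|
|
||||||
# Startup hooks: (1) init_pool (creates the docker network and the pool instance); (2) shutdown_all sweeps
|
|
||||||
# orphans from the previous process (leftover zcbot-sandbox-* containers; the in-memory _last_active is empty,
|
|
||||||
# so everything is removed and restarted); (3) a background reaper task runs reap_idle every 60s.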
|
|
||||||
sandbox_backend = os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower()
|
|
||||||
sandbox_reaper_task = None
|
|
||||||
if sandbox_backend == "docker":
|
|
||||||
from core.sandbox import init_pool
|
|
||||||
from core.sandbox.check import detect_fs_quota
|
|
||||||
workspace = resolve_workspace(None, _cfg)
|
|
||||||
user_root_base = workspace / "users"
|
|
||||||
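# §7.5 #4 fs quota probe: does not block startup (the application-level periodic scan already exists); it only prints a WARN
|
|
||||||
# as a reminder to upgrade to xfs prjquota / ext4 project quota / zfs before opening to external users.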
|
|
||||||
try:
|
|
||||||
level, msg = detect_fs_quota(user_root_base.resolve())
|
|
||||||
print(f"[startup] {'[ok]' if level == 'ok' else '[warn]'} {msg}")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"[startup] [warn] fs quota detect failed: {type(e).__name__}: {e}")
|
|
||||||
try:
|
|
||||||
pool = init_pool(user_root_base)
|
|
||||||
removed = pool.shutdown_all()
|
|
||||||
if removed:
|
|
||||||
print(f"[startup] swept {len(removed)} stale sandbox container(s)")
|
|
||||||
|
|
||||||
async def _reaper() -> None:
|
|
||||||
loop = asyncio.get_running_loop()
|
|
||||||
while True:
|
|
||||||
try:
|
|
||||||
await asyncio.sleep(60)
|
|
||||||
removed = await loop.run_in_executor(None, pool.reap_idle)
|
|
||||||
if removed:
|
|
||||||
print(f"[reaper] reaped {len(removed)} idle sandbox container(s)")
|
|
||||||
except asyncio.CancelledError:
|
|
||||||
raise
|
|
||||||
except Exception as e:
|
|
||||||
print(f"[reaper] error: {type(e).__name__}: {e}")
|
|
||||||
|
|
||||||
sandbox_reaper_task = asyncio.create_task(_reaper(), name="sandbox-reaper")
|
|
||||||
app.state.sandbox_pool = pool
|
|
||||||
except Exception as e:
|
|
||||||
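# ensure_network / docker CLI unavailable → fail fast. Stage C protocol: any missing
|
|
||||||
# hardening counts as an incomplete deployment; do not fall back to host (otherwise it looks sandboxed while actually running bare).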
|
|
||||||
raise RuntimeError(
|
|
||||||
f"ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: {e}"
|
|
||||||
)
|
|
||||||
try:
|
|
||||||
yield
|
yield
|
||||||
finally:
|
|
||||||
if sandbox_reaper_task is not None:
|
|
||||||
sandbox_reaper_task.cancel()
|
|
||||||
try:
|
|
||||||
await sandbox_reaper_task
|
|
||||||
except (asyncio.CancelledError, Exception):
|
|
||||||
pass
|
|
||||||
if sandbox_backend == "docker":
|
|
||||||
pool = getattr(app.state, "sandbox_pool", None)
|
|
||||||
if pool is not None:
|
|
||||||
try:
|
|
||||||
pool.shutdown_all()
|
|
||||||
except Exception as e:
|
|
||||||
print(f"[shutdown] sandbox shutdown_all error: {type(e).__name__}: {e}")
|
|
||||||
|
|
||||||
app = FastAPI(
|
app = FastAPI(
|
||||||
title="zcbot api",
|
title="zcbot api",
|
||||||
|
|
|
||||||
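The removed lifespan block runs a background reaper: sleep, call the blocking `reap_idle` in a thread, log, repeat, and stop cleanly when the task is cancelled at shutdown. The standalone sketch below reproduces that pattern so it can be run without Docker or FastAPI. The names `reaper_loop`, `fake_reap`, and `demo` are hypothetical stand-ins; only the loop structure mirrors the diff.

```python
import asyncio
from contextlib import suppress
from typing import Callable, List


async def reaper_loop(reap: Callable[[], List[str]], interval: float) -> None:
    """Call the blocking reap() every `interval` seconds until cancelled."""
    loop = asyncio.get_running_loop()
    while True:
        try:
            await asyncio.sleep(interval)
            # reap() may shell out to docker, so keep it off the event loop
            removed = await loop.run_in_executor(None, reap)
            if removed:
                print(f"[reaper] reaped {len(removed)} idle container(s)")
        except asyncio.CancelledError:
            raise  # let cancellation propagate so shutdown can finish
        except Exception as e:
            # one failed sweep must not kill the background task
            print(f"[reaper] error: {type(e).__name__}: {e}")


async def demo() -> int:
    calls: List[int] = []

    def fake_reap() -> List[str]:  # stand-in for pool.reap_idle
        calls.append(1)
        return ["zcbot-sandbox-demo"]

    task = asyncio.create_task(reaper_loop(fake_reap, 0.01), name="sandbox-reaper")
    await asyncio.sleep(0.1)
    task.cancel()
    with suppress(asyncio.CancelledError):
        await task
    return len(calls)
```

Re-raising `CancelledError` before the broad `except Exception` is explicit rather than strictly required on Python 3.8+, where `CancelledError` derives from `BaseException`; it documents the intent and stays correct on older interpreters.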