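Stage C Step 3: integrate DockerExecutor into AgentLoop + web lifespan reaper

- core/executor_docker.py adds DockerExecutor, composing HostExecutor + SandboxPool:
  shell/run_python go through docker exec (setsid + --user 1000:1000 + --workdir),
  all other tools pass straight through to the host (§7.5 #6 trust-domain split)
- run_python writes its tmp .py to <user_root>/.zcbot_tmp/<task_id>/ (dot-prefixed, so
  /v1/files filters it out naturally); inside the container the same file is at
  /workspace/.zcbot_tmp/..., and it is unlinked after the run
- The ZCBOT_SANDBOX_BACKEND=host|docker env var selects the backend, default host
  (zero change for Windows dogfooding); on the docker path an uninitialized pool
  fails fast instead of silently degrading
- web/app.py lifespan: with the docker backend, startup runs init_pool + shutdown_all
  (clears orphans) + a 60s background reaper (run_in_executor calling the sync
  reap_idle); shutdown cancels the reaper and does a final cleanup
- pool.py also pays off Step 2 debt: asyncio.Lock → threading.Lock, ensure becomes
  synchronous (the main callers are tool calls on BG threads, and ephemeral event
  loops make an asyncio.Lock ineffective across loops)
- Cancel limitation accepted: Popen.kill() only kills the docker CLI client; processes
  inside the container are left to the 5-minute idle reaper. The upgrade to the PGID
  protocol (§7.5 #3) waits for user feedback to trigger it
- tests/test_executor_docker.py: 11 tests cover the key paths (host passthrough / argv
  shape / tmp cleanup / timeout / cancel / unknown tool / enable_run_python=False)
- DESIGN.md unchanged (pure implementation of the existing §7.5 #5 #6 protocol)
- RUN.md gains a ZCBOT_SANDBOX_BACKEND env section + prerequisites for switching to
  docker + an integration verification path
- unittest discover 12/12 PASS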

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
caoqianming 2026-05-26 16:13:16 +08:00
parent f66511ccf8
commit dfac0acfa6
8 changed files with 703 additions and 27 deletions


@ -2,7 +2,7 @@
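> Companion to `DESIGN.md`. This file records only phase status, deviations from decisions, file counts, and next steps. Each entry is 1-2 sentences: what was done + the key judgment; for details see `git log` / `git diff` / `DESIGN §7.9`
Last updated: 2026-05-26 (Stage C Step 2: Docker per-user container pool + Dockerfile / init.sh / network ensure; code ready, not yet integrated into AgentLoop)
Last updated: 2026-05-26 (Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan reaper; the ZCBOT_SANDBOX_BACKEND env var switches host/docker)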
---
@ -15,7 +15,7 @@
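| 5 | Eval Suite | ⏸ Not doing | Replaced by dogfooding; probe covers health checks |
| 6 | Long-task engineering | 🟡 | task + recovery ✅; two-layer memory ✅; context compression not done |
| 7 | Polish | ❌ | Docker sandbox / more skills |
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C (Executor + docker sandbox) pending: a hard prerequisite for opening to external users; until it is done, dogfood + trusted-colleague allowlist only; DoD is the 6-item rollout checklist in DESIGN §7.5**. |
| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' interim auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C Step 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop; `ZCBOT_SANDBOX_BACKEND=docker` switches to containers)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires egress proxy + xfs project quota (§7.5 rollout checklist #2 #4)**. |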
---
@ -23,6 +23,7 @@
### 2026-05-26
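- **Stage C Step 3: DockerExecutor integrated into AgentLoop + web lifespan (the `ZCBOT_SANDBOX_BACKEND=host|docker` env var switches backend)**: in `core/executor_docker.py`, `DockerExecutor` composes `HostExecutor` + `SandboxPool`, and `call_tool` dispatches by the §7.5 #6 trust domain: `shell` / `run_python` → `pool.ensure(user_id)` to get the container name + `docker exec --user 1000:1000 --workdir /workspace/<wd_name> -e PYTHONIOENCODING=utf-8 setsid bash -c <cmd>` / `python <script>` (`setsid` wraps the command in its own process group; the §7.5 #3 PGID kill protocol is left for Step 3b); all other tools (read/write/edit/glob/grep/load_skill/web_*/seedream/seedance) pass straight through to the host. **The run_python tmp .py is written on the host at `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`**, which corresponds to `/workspace/.zcbot_tmp/<task_id>/<rand>.py` in the container (visible automatically through the bind mount); the leading dot lets the `/v1/files` API filter it naturally (already blocked by `startswith(".")` at `web/app.py:169`). **Cancel limitation accepted**: Popen.kill() kills the docker CLI client, and the server-side process inside the container is not terminated as a result (this is how docker exec is designed); the first version relies on the 5-minute idle reaper / the `rm -f` in the next `ensure` as the fallback, and the upgrade trigger is "users report that a cancelled run is still burning CPU". `core/sandbox/__init__.py` exposes the module-level singleton `init_pool` / `get_pool`; `agent_builder._resolve_executor` selects the backend from the env, and on the docker path an uninitialized pool → fail-fast (no silent fallback to host, to prevent the misjudgment "we think there is a sandbox but we are actually running bare"); the `web/app.py` lifespan startup hook runs `init_pool(workspace/users)` + `shutdown_all` to clear predecessor orphans + `asyncio.create_task(_reaper)` (every 60s, `run_in_executor(pool.reap_idle)`), and the shutdown hook cancels the reaper + runs `shutdown_all`. **pool.py debt cleanup along the way**: `asyncio.Lock` → `threading.Lock` (the main callers are synchronous tool calls on web BG threads, and an asyncio.Lock's protection is bypassed by the ephemeral loop that each `asyncio.run` creates; the reaper becomes an async wrapper around `loop.run_in_executor(pool.reap_idle)`, and a fully sync pool API is more direct). **Tests**: the 11 tests in `tests/test_executor_docker.py` cover host passthrough / shell argv shape / run_python tmp file cleanup / timeout / cancel / unknown tool / caps.enable_run_python=False; `unittest discover -s tests` **12/12 PASS** (the original 1 test unchanged, plus the 11 new ones). **Zero change for Windows dogfooding**: the default is `ZCBOT_SANDBOX_BACKEND=host`, so docker is untouched locally; the docker path only works on the Ubuntu deployment machine, and the real-container smoke test still runs there using the 5 commands in the RUN.md "Sandbox(Stage C,Ubuntu)" section. `DESIGN.md` **unchanged** (pure implementation of the existing §7.5 #5 #6 protocol); `RUN.md` gains the `ZCBOT_SANDBOX_BACKEND` env description + the startup prerequisites for switching to the docker backend. Rejected: (a) having DockerExecutor wrap `asyncio.run(pool.ensure)` in an ephemeral loop: an asyncio.Lock is not shared across loops, so serialization is lost, and every tool call pays ~5ms of loop create/destroy noise; making the pool synchronous is cheaper; (b) putting the `run_python` tmp .py inside the working directory: it pollutes the user's view, the tmp file interferes when a SKILL teaches the model to "list the working directory with glob", and crash leftovers mix with real outputs (recording this in the §7.9 trade-off log will be considered the next time the same issue comes up); (c) a separate host-side bind mount `<workspace>/.sandbox_tmp/<uid>/` mounted as `/tmp_scripts` in the container: one more mount raises complexity, and keeping a single-bind-mount protocol is more direct; (d) falling back to host when the docker backend fails: a missing sandbox means the security model has collapsed, fail-fast matters more than "looks like it is running", and the §7.5 hard rule is "any missing piece means the deployment is incomplete".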
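- **Stage C Step 2: Docker per-user containers + iptables blocklist (groundwork for §7.5 #1 + #3, not yet wired into AgentLoop)**: `deploy/sandbox/Dockerfile` (python:3.11-slim + tini as PID 1 + iptables/iproute2/netbase + a non-root user whose uid comes from the `HOST_UID` build-arg + the full requirements.txt installed in the container) + `deploy/sandbox/init.sh` (`set -euo pipefail`; any failing iptables rule fails fast → the container exits, matching the §7.5 #1 hard rule "any missing piece means Stage C is not done"; 6 IPv4 red-line ranges + tolerant degradation for the ::1 IPv6 loopback + a per-IP DROP driven by the `ZCBOT_PG_IPS` env; `exec sleep infinity` waits for `docker exec` to come in). `core/sandbox/network.py` has a single function `ensure_network()`, which runs `docker network create --internal zcbot-sandbox-net` (no outbound by default + cross-container isolation; when Step 4 adds the proxy, the proxy joins this same network); in `core/sandbox/pool.py`, the `SandboxPool` class holds per-user `asyncio.Lock`s + an in-memory `_last_active` dict. The ensure path probes with inspect → running: return directly / exists-but-stopped: `rm -f` and restart (so iptables is re-applied) / missing: `docker run` with the full set of hardening flags (`--read-only --tmpfs /tmp:exec --cap-drop=ALL --cap-add=NET_ADMIN --security-opt=no-new-privileges --pids-limit=256 --memory=2g --cpus=1.0` + bind mount user_root → `/workspace` + label `zcbot.product=sandbox` for bulk cleanup + `--restart=no`); `mark_active` updates the dict / `reap_idle` kills by ttl / `shutdown_all` kills everything carrying the label (used at app startup to clear predecessor orphans). Containers are named `zcbot-sandbox-<user_id>` (the standard UUID string with dashes, visually aligned with the mount path `<workspace>/users/<user_id>/`, so `docker ps | grep zcbot-sandbox-` shows the active users directly). **Key decisions**: (a) **docker CLI via subprocess rather than the docker-py SDK**: the implementation-level counterpart of §7.5 #5 "the interface shape does not leak Docker assumptions"; subprocess behavior is transparent, adds zero dependencies, and can be reconciled on the spot with `docker ps`; (b) **`docker update --label-add` is unavailable → use an in-memory dict**: Docker 23+ removed runtime label modification, so last_active lives in a Python dict; after an app restart the dict is empty → historical orphans are cleaned up by `shutdown_all` (called in the lifespan startup hook); (c) **the `--internal` network is in effect from Step 2**: the iptables OUTPUT rules act as defense-in-depth (the network layer already blocks outbound, and the iptables rules are still added per protocol); when Step 4 adds the proxy, the proxy container joins `zcbot-sandbox-net`, and an iptables ACCEPT exception + a default DROP implement "default deny + proxy only"; (d) **the NET_ADMIN cap is kept so that PID 1, running as root, can apply iptables**: the container holds NET_ADMIN for its whole lifetime, but PID 1 `sleep infinity` takes no external input, and `docker exec` sessions are pinned by `--user 1000:1000` to non-root with an empty cap_effective, which is equivalent to having no NET_ADMIN. The Step 3 DockerExecutor must hard-code --user 1000 so that no root path opens up (enforced in code review). **Explicitly out of scope for Step 2**: ① AgentLoop integration (`agent_builder.py` is untouched; the pool is an isolated module that is only plugged in at Step 3) ② moving shell/run_python into the container ③ egress proxy (Step 4) ④ the background reaper task (added together with the web lifespan wiring in Step 3). **Verification**: the full `from core.sandbox import ...` import + constructor pass; `SandboxPool(user_root_base=Path(...), pg_ips='10.x,172.x')` has the correct fields; `unittest discover` 1/1 PASS. Real-container verification runs on Ubuntu (the RUN.md "Sandbox(Stage C,Ubuntu)" section lists 5 smoke commands: build / iptables ranges / non-root uid / read-only / teardown). `DESIGN.md` unchanged (pure implementation of the existing §7.5 #1 #3 protocol); `RUN.md` gains the "Sandbox(Stage C,Ubuntu)" deployment section (image build / sandbox env / 5 verification commands / when to upgrade to xfs project quota) + 2 more troubleshooting entries (uid mismatch EACCES / missing NET_ADMIN). Rejected: (a) naming containers sha256(uid)[:12] + reverse lookup by label: every exec costs an extra `docker ps --filter` round-trip, readability suffers, and the privacy gain is 0; (b) per-task containers: DESIGN §7.5 has already locked in the per-user shared mental model (multiple tasks of one user share materials), not reopened; (c) doing iptables with a docker `init container` pattern: Docker has no native support (that is k8s), and syncing compose v2 adds complexity, so NET_ADMIN + non-root exec is more direct; (d) wiring into AgentLoop right away in Step 2: it could not be dogfooded once wired (no docker on local Windows) and would pollute the host path instead; the pool is committed in isolation and wired in together at Step 3.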
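- **Stage C Step 1: Executor interface skeleton + HostExecutor in-process backend (implements §7.5 #5)**: `core/executor.py` adds the `Executor` ABC + `ExecCtx` (user_id/task_id/working_dir/cancel_check) + `ToolResult` (content/exit_code); `core/executor_host.py` adds `HostExecutor`, which wraps the original tools dict; `call_tool` dispatches internally to the matching `Tool.execute` and folds the three error kinds (unknown / TypeError / raised exception) into an `[Error] ...` content, distinguished by exit_code. `AgentLoop.__init__` now takes an `executor` instead of the `tools` dict and gains a `working_dir` parameter; `_stream_llm` builds the LLM tools field from `executor.schemas()`; `_execute_tool_call` becomes a single `executor.call_tool(name, args, ctx)` call, dropping the three former error emits (unknown/TypeError/Exception are now absorbed by the executor into ToolResult, leaving one emit). `agent_builder.py` wraps the finished tools dict in `HostExecutor(tools)` and passes it to `AgentLoop`. **The interface shape is deliberately backend-agnostic**: it exposes no Docker assumptions such as `docker exec` / `docker cp`, so switching to the docker backend in Step 3 needs zero changes to `AgentLoop`, only a swap of `HostExecutor` → `DockerExecutor(host_tools=..., docker_tools={shell, run_python})` in `agent_builder.py`. **Zero behavior change**: the sanity import passes and `unittest discover -s tests` is 1/1 PASS. `DESIGN.md` unchanged (pure implementation of the existing §7.5 #5 protocol, no architecture drift); `RUN.md` unchanged (no new env / CLI changes; the `ZCBOT_SANDBOX_BACKEND` env is deferred until the Step 3 docker backend lands). Rejected: (a) skipping the Executor abstraction and putting `if backend=='docker'` directly in `shell.py/run_python.py`: it violates §7.5 #5, and a future switch to gVisor/Firecracker would scatter changes across the tool layer; (b) an `exec(cmd, ctx)` primitive for Executor instead of the `call_tool(name, args, ctx)` dispatcher: it does not match the DESIGN signature, and host tools (read/web_*/seedream) do not have "command" semantics; (c) using a `cancel_check` callable in place of rebuilding ExecCtx: cancel_check is currently assigned by a setter after build, so a cached ctx would point at a stale value, and building ExecCtx per call is cheap because it is a dataclass.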
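- **REVISIONS.md revision-log mechanism (covers the three artifact-producing skills: proposal/patent/ppt)**: `<task_dir>/REVISIONS.md` serves as a compact changelog of an artifact's iteration. The task conversation history is a coarse running log (finding last week's change among 50 messages means scrolling), whereas REVISIONS is the list of substantive decisions distilled jointly by the user and the LLM (5 lines are enough to reconstruct "why was this chapter written this way last week"), and it complements the spec: **spec = constitution (sets the tone once), REVISIONS = implementation log (appended at each sticking point)**. Each of the three SKILL.md files gains (a) one more drafting step, "append one line after the user confirms a substantive change", + (b) a standalone "## 修订日志" (revision log) section (a when-to-record / when-not-to-record table + format conventions + an example + the procedure). The criteria for a "substantive change" are tailored to each domain: proposal = technical approach / assessment metrics / innovation points / sub-topic breakdown / key citations / budget structure; patent = distinguishing technical features / key parameters / formulas / embodiments / sections; ppt = layout / primary color / pages / icons / key copy points. Common rule: do not record the first draft, typo-level tweaks, or the model's own back-and-forth edits; when unsure, lean toward not recording, so the file does not turn into a running log. The chosen format is **single-line bullets, newest first** (timestamp first, then a file:section locator, then what changed and why), inserted with edit as a new line right after the header comment (not appended at the end, so the latest entry is visible at a glance). Rejected: (a) a soft constraint in the system prompt: it would impose an irrelevant constraint on non-artifact skills such as coding/research/documents/imagegen/videogen; (b) a new `record_revision` tool: during development, having the LLM append with edit is enough, a tool adds call overhead to every small change, and it can be promoted to a tool later if the LLM turns out to skip entries often; (c) splitting into one file per artifact (`<topic>.revisions.md`): a single file is easier to read and keeps one timeline across artifacts. `DESIGN.md` unchanged (no architecture change); `RUN.md` unchanged (no CLI/env change).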
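
The `web/app.py` lifespan wiring described in the Step 3 entry is not among the hunks on this page, so the following is only a minimal sketch of its shape under the assumption of a plain asyncio application. `FakePool`, `reaper` and `demo` are hypothetical names introduced for illustration, and the interval is shortened from the 60s in the text; only the method names `reap_idle` and `shutdown_all` come from the entry above.

```python
import asyncio


class FakePool:
    """Hypothetical stand-in for SandboxPool: sync methods that only count calls."""

    def __init__(self) -> None:
        self.reaped = 0
        self.shutdowns = 0

    def reap_idle(self) -> None:
        self.reaped += 1

    def shutdown_all(self) -> None:
        self.shutdowns += 1


async def reaper(pool: FakePool, interval: float) -> None:
    """Call the sync reap_idle on a worker thread every `interval` seconds."""
    loop = asyncio.get_running_loop()
    while True:
        await asyncio.sleep(interval)
        # run_in_executor keeps the blocking call off the event loop thread
        await loop.run_in_executor(None, pool.reap_idle)


async def demo(interval: float = 0.01, run_for: float = 0.1) -> FakePool:
    pool = FakePool()
    pool.shutdown_all()  # startup: clear orphans left by a previous process
    task = asyncio.create_task(reaper(pool, interval))
    try:
        await asyncio.sleep(run_for)  # the app would serve requests here
    finally:
        task.cancel()  # shutdown: stop the reaper first
        try:
            await task
        except asyncio.CancelledError:
            pass
        pool.shutdown_all()  # then run the final cleanup
    return pool


if __name__ == "__main__":
    p = asyncio.run(demo())
    print(p.reaped, p.shutdowns)
```

Keeping the pool API synchronous means the same `reap_idle` can be called from a BG thread or, as here, from the event loop through an executor, without a second async code path.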

35
RUN.md

@ -256,8 +256,14 @@ sudo journalctl -u zcbot -n 50 # 看新进程起没起干
## Sandbox(Stage C,Ubuntu)
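> Must be completed before opening to external users. During the current dogfood + trusted-colleague allowlist phase it can be skipped: the default backend is host,
> `shell` / `run_python` still run as subprocesses (not isolated). Once Step 3 wires in DockerExecutor, switch to
> `ZCBOT_SANDBOX_BACKEND=docker` to enable it.
> `shell` / `run_python` still run as subprocesses (not isolated). Step 3 has wired in DockerExecutor:
> `ZCBOT_SANDBOX_BACKEND=docker` switches to container execution; `host` (default) is kept for local Windows / colleague dogfooding.
>
> Prerequisites for enabling the docker backend:
> 1. The deployment machine has a docker daemon and the zcbot user is in the `docker` group
> 2. The `zcbot-sandbox:latest` image is built (`HOST_UID/GID` aligned)
> 3. `.env` contains at least `ZCBOT_PG_IPS=<actual PG IP>` (§7.5 #1: PG is blocked separately once more)
> 4. A lifespan startup failure fails fast (`RuntimeError: sandbox init failed`); it does not silently fall back to host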
### Image build
@ -284,6 +290,14 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
### Sandbox-related env (add to .env)
```
# Backend selection (default host):
# host = shell/run_python run as host subprocesses (local Windows / dogfood)
# docker = shell/run_python run via docker exec in a per-user container (deployment machine / external users)
# ZCBOT_SANDBOX_BACKEND=docker
# User for exec inside the container (default 1000:1000; keep in sync with the Dockerfile's HOST_UID/HOST_GID build-args)
# ZCBOT_SANDBOX_EXEC_USER=1000:1000
# Container image tag (default zcbot-sandbox:latest)
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
# Container runtime (runsc for gVisor, kata for Firecracker; default runc)
@ -295,10 +309,23 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
```
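### Verification (partially verifiable at Step 2)
### Verification
After Step 3, integration verification is recommended (start web with the docker backend + send `shell` / `run_python` messages from the dev SPA):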
```bash
# Start web with the docker backend (.env already sets PG_IPS / SANDBOX_BACKEND=docker)
ZCBOT_SANDBOX_BACKEND=docker .venv/bin/python main.py web
# After any shell / run_python message has been triggered, the container should be up
sudo -u zcbot docker ps --filter label=zcbot.product=sandbox
# Expect zcbot-sandbox-<your-uid> with STATUS = Up ...
# After 5 minutes with no new messages the reaper removes it automatically
```
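
The manual `docker ps` check above can also be scripted. The sketch below is one possible shape, not part of this commit: `list_sandbox_containers` and its injectable `run` parameter are hypothetical names, and it assumes the caller already has permission to talk to the docker daemon (the `sudo -u zcbot` step is left out).

```python
import subprocess
from typing import Callable, List

LABEL = "zcbot.product=sandbox"  # label taken from the docker ps command above


def _docker_ps(argv: List[str]) -> str:
    """Default runner: call the real docker CLI and return its stdout."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def list_sandbox_containers(run: Callable[[List[str]], str] = _docker_ps) -> List[str]:
    """Return the names of running containers that carry the sandbox label."""
    argv = ["docker", "ps", "--filter", f"label={LABEL}", "--format", "{{.Names}}"]
    return [line.strip() for line in run(argv).splitlines() if line.strip()]
```

Passing a fake `run` callable makes the parsing testable on a machine without docker, which matches the Windows dogfood setup described earlier.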
You can also start a standalone test container to verify the hardening on its own (no dependency on the web process):
```bash
# Start a test container (plain docker run, bypassing the pool; the pool is only used once Step 3 is wired in)
USER_ID=00000000-0000-0000-0000-000000000001
sudo -u zcbot docker run -d \
--name zcbot-sandbox-$USER_ID \


@ -26,6 +26,7 @@ import yaml
from rich.console import Console
from core.capabilities import ModelCapabilities
from core.executor_docker import DockerExecutor
from core.executor_host import HostExecutor
from core.llm import LLM
from core.loop import AgentLoop
@ -53,6 +54,39 @@ def load_config() -> dict:
return yaml.safe_load((ROOT / "config" / "agent.yaml").read_text(encoding="utf-8")) or {}
def _resolve_executor(
    host: HostExecutor,
    user_id: UUID,
    user_root_path: Path,
    working_dir_path: Path,
):
    """Select the Executor backend (§7.5 #5).

    Builds a DockerExecutor when env `ZCBOT_SANDBOX_BACKEND=docker`; any other value,
    or an unset variable, returns the host executor.
    The docker path requires the web lifespan to have called `core.sandbox.init_pool`
    first. If the pool is still None, this raises RuntimeError instead of falling back
    to host (the web entry point prints the startup log at init time, so no warning is
    repeated here).
    """
    import os
    if os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower() != "docker":
        return host
    from core.sandbox import get_pool
    pool = get_pool()
    if pool is None:
        # lifespan init did not succeed: dying early is safer than degrading silently
        # (otherwise, once opened to external users, code could run on the host while
        # everyone believes it is sandboxed). The web entry point already fails fast
        # at startup; this is one more reminder.
        raise RuntimeError(
            "ZCBOT_SANDBOX_BACKEND=docker but sandbox pool not initialized; "
            "check web lifespan init_pool() / docker daemon availability"
        )
    return DockerExecutor(
        host=host,
        pool=pool,
        user_id=user_id,
        user_root=user_root_path,
        working_dir=working_dir_path,
    )
def resolve_workspace(workspace: Optional[str], cfg: Optional[dict] = None) -> Path:
cfg = cfg or load_config()
p = Path(workspace) if workspace else ROOT / cfg.get("workspace_dir", "workspace")
@ -439,9 +473,11 @@ def build_agent(
tools[ws.name] = ws
sink = ConsoleEventSink(console) if console else None
# §7.5 #5 Executor abstraction: this step is all host backend (in-process); once the Step 3 docker
# backend lands, switching to `ZCBOT_SANDBOX_BACKEND=docker` dispatches shell/run_python to a container.
executor = HostExecutor(tools)
# §7.5 #5/#6 Executor abstraction: env `ZCBOT_SANDBOX_BACKEND=host|docker` selects the backend.
# host (default) = fully in-process, used by local dogfood / Windows; docker = shell/run_python are
# dispatched to the per-user container (other tools stay on host). The docker path requires lifespan to have run `init_pool`.
host_executor = HostExecutor(tools)
executor = _resolve_executor(host_executor, uid, ur_path, working_dir_path)
agent = AgentLoop(
llm, executor, session, caps,
user_id=uid, working_dir=working_dir_path, sink=sink,

239
core/executor_docker.py Normal file

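@ -0,0 +1,239 @@
"""DockerExecutor: `shell` / `run_python` go through docker exec, everything else runs in-process (§7.5 #6).

Backend split (§7.5 #6 trust domains):
- host in-process: `read/write/edit/glob/grep/load_skill/web_*/seedream/seedance`
  already hold credentials on the host (Bocha key / ARK key) or are validated by
  `paths.py::resolve_user_path` (the user-rooted security boundary already exists);
  moving them into the container gains nothing and costs ~200ms exec overhead × N
- container exec: `shell` / `run_python` execute arbitrary model-generated code and
  must be isolated in a container

Container admission (per call):
1. `pool.ensure(user_id)` gets or starts the `zcbot-sandbox-<uid>` container (already
   serialized by the per-user lock)
2. `docker exec --user 1000:1000 --workdir /workspace/<wd_name> <c> setsid bash -c '<cmd>'`
3. On timeout, kill the docker CLI client (Popen.kill())
4. On completion, `pool.mark_active(user_id)` refreshes the idle timer

The run_python tmp .py is written on the host to `<user_root>/.zcbot_tmp/<task_id>/<rand>.py`
(the bind mount makes it visible in the container under `/workspace/.zcbot_tmp/<task_id>/`)
and is unlinked after execution. The leading dot lets the `/v1/files` API filter it
naturally (`web/app.py:169` startswith(".")), so the user's view is not polluted.

Cancel limitation (accepted for the first version):
- After the docker exec client disconnects, the server-side process in the container is
  **not** terminated as a result; this is how docker is designed
- The first version kills only the docker CLI (Popen.kill()); leftover in-container
  processes are handled by the 5-minute idle reaper / the `rm -f` in the next ensure
- Upgrade triggers (§7.5 #3 PGID protocol): users report "cancelled but still burning CPU" /
  processes pile up in the container after repeated cancels → enable the
  "ZCBOT_EXEC_ID env + PGID written to a file + second exec to kill" protocol
"""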
from __future__ import annotations
import os
import secrets
import subprocess
import time
from pathlib import Path
from typing import Any, Dict, List, Optional
from uuid import UUID
from .executor import ExecCtx, Executor, ToolResult
from .executor_host import HostExecutor
from .sandbox import SandboxPool
CONTAINER_TOOLS = frozenset({"shell", "run_python"})
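# Non-root user inside the container: matches the default HOST_UID/HOST_GID build-args of the Dockerfile.
# If the zcbot account on the deployment host does not have uid 1000, pass HOST_UID at image build time
# and change the env `ZCBOT_SANDBOX_EXEC_USER` here to match (see the "Sandbox 部署" section of RUN.md).
DEFAULT_EXEC_USER = "1000:1000"
# Host-side tmp script directory (dot-prefixed inside user_root, hidden by the /v1/files API)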
TMP_SUBDIR = ".zcbot_tmp"
class DockerExecutor(Executor):
    """Composes HostExecutor + docker exec dispatch for shell/run_python.

    The host backend still provides the schema list and executes most tools; this class
    only takes over when shell/run_python is hit, running them via docker exec in the
    per-user container.
    """

    def __init__(
        self,
        host: HostExecutor,
        pool: SandboxPool,
        user_id: UUID,
        user_root: Path,
        working_dir: Path,
    ) -> None:
        self.host = host
        self.pool = pool
        self.user_id = user_id
        self.user_root = user_root.resolve()
        self.working_dir = working_dir.resolve()
        # Matching path inside the container: /workspace/<wd_name>
        try:
            wd_rel = self.working_dir.relative_to(self.user_root)
            self.container_workdir = "/workspace/" + wd_rel.as_posix()
        except ValueError:
            # working_dir is not under user_root: defensive fallback, the normal path never gets here
            self.container_workdir = "/workspace"
        self.exec_user = os.getenv("ZCBOT_SANDBOX_EXEC_USER", DEFAULT_EXEC_USER)

    # ── Executor interface ───────────────────────────────────
    def has_tool(self, name: str) -> bool:
        return self.host.has_tool(name)

    def schemas(self) -> List[Dict[str, Any]]:
        return self.host.schemas()

    def call_tool(self, name: str, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        if name not in CONTAINER_TOOLS:
            return self.host.call_tool(name, args, ctx)
        if not self.host.has_tool(name):
            # e.g. caps.enable_run_python=False: host has no run_python, so no schema was exposed either
            return ToolResult(content=f"[Error] unknown tool: {name}", exit_code=2)
        try:
            if name == "shell":
                return self._exec_shell(args, ctx)
            if name == "run_python":
                return self._exec_python(args, ctx)
        except Exception as e:
            return ToolResult(
                content=f"[Error executing {name} via docker] {type(e).__name__}: {e}",
                exit_code=1,
            )
        return ToolResult(content=f"[Error] unhandled container tool: {name}", exit_code=2)

    # ── shell ────────────────────────────────────────────────
    def _exec_shell(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        cmd = args.get("command")
        if not isinstance(cmd, str) or not cmd.strip():
            return ToolResult(
                content="[Error] bad arguments to shell: command must be non-empty string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 60)
        container = self.pool.ensure(self.user_id)
        argv = self._docker_exec_argv(container) + ["setsid", "bash", "-c", cmd]
        result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
        self.pool.mark_active(self.user_id)
        return result

    # ── run_python ───────────────────────────────────────────
    def _exec_python(self, args: Dict[str, Any], ctx: ExecCtx) -> ToolResult:
        code = args.get("code")
        if not isinstance(code, str):
            return ToolResult(
                content="[Error] bad arguments to run_python: code must be string",
                exit_code=2,
            )
        timeout = int(args.get("timeout") or 120)
        # The tmp .py is written on the host to `.zcbot_tmp/<task_id>/<rand>.py`;
        # inside the container it is /workspace/.zcbot_tmp/<task_id>/<rand>.py
        tmp_root = self.user_root / TMP_SUBDIR / str(ctx.task_id)
        tmp_root.mkdir(parents=True, exist_ok=True)
        rand_name = f"{int(time.time() * 1000)}-{secrets.token_hex(4)}.py"
        host_script = tmp_root / rand_name
        container_script = f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/{rand_name}"
        host_script.write_text(code, encoding="utf-8")
        try:
            container = self.pool.ensure(self.user_id)
            argv = self._docker_exec_argv(
                container,
                extra_env={
                    "PYTHONIOENCODING": "utf-8",
                    "PYTHONPATH": "/workspace",
                },
            ) + ["setsid", "python", container_script]
            result = self._run_subprocess(argv, timeout=timeout, ctx=ctx)
            self.pool.mark_active(self.user_id)
            return result
        finally:
            # Always remove the tmp script, even when ensure / Popen raised
            try:
                host_script.unlink()
            except OSError:
                pass

    # ── helpers ──────────────────────────────────────────────
    def _docker_exec_argv(
        self, container: str, extra_env: Optional[Dict[str, str]] = None
    ) -> List[str]:
        argv = [
            "docker", "exec",
            "--user", self.exec_user,
            "--workdir", self.container_workdir,
        ]
        env: Dict[str, str] = {}
        if extra_env:
            env.update(extra_env)
        for k, v in env.items():
            argv.extend(["-e", f"{k}={v}"])
        argv.append(container)
        return argv

    def _run_subprocess(
        self, argv: List[str], timeout: int, ctx: ExecCtx
    ) -> ToolResult:
        """Run the docker exec subprocess, polling cooperatively for cancel.

        On cancel or timeout, Popen.kill() kills the docker CLI client; leftover
        server-side processes in the container are an accepted limitation (see the
        module docstring).
        """
        cancel_check = ctx.cancel_check
        try:
            proc = subprocess.Popen(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                text=True,
                encoding="utf-8",
                errors="replace",
            )
        except FileNotFoundError as e:
            return ToolResult(content=f"[Error] docker CLI not found: {e}", exit_code=2)
        start = time.monotonic()
        cancel_hit = False
        timeout_hit = False
        stdout: str = ""
        stderr: str = ""
        while True:
            try:
                # Wake up every 0.5s so cancel and timeout can be checked
                stdout, stderr = proc.communicate(timeout=0.5)
                break
            except subprocess.TimeoutExpired:
                if cancel_check is not None and cancel_check():
                    cancel_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break
                if time.monotonic() - start > timeout:
                    timeout_hit = True
                    proc.kill()
                    stdout, stderr = proc.communicate()
                    break
        if timeout_hit:
            return ToolResult(
                content=f"[Error] command timed out after {timeout}s",
                exit_code=124,
            )
        if cancel_hit:
            return ToolResult(
                content="[Error] command cancelled by user",
                exit_code=130,
            )
        parts: List[str] = []
        if stdout:
            parts.append(f"[stdout]\n{stdout.rstrip()}")
        if stderr:
            parts.append(f"[stderr]\n{stderr.rstrip()}")
        parts.append(f"[exit {proc.returncode}]")
        return ToolResult(content="\n".join(parts), exit_code=proc.returncode)
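
The host-to-container path mapping that `DockerExecutor` relies on (working directory and tmp script both live under the single `/workspace` bind mount) can be isolated into a small helper. This is a sketch for illustration only: `to_container_path` is a hypothetical name, and it uses `PurePosixPath` so that it behaves the same on any platform, whereas the code above works with resolved `Path` objects.

```python
from pathlib import PurePosixPath

CONTAINER_ROOT = PurePosixPath("/workspace")  # bind-mount target named in the source


def to_container_path(user_root: PurePosixPath, host_path: PurePosixPath) -> str:
    """Map a host path under user_root to its path under /workspace.

    Falls back to /workspace itself when host_path is outside user_root,
    like the defensive branch in DockerExecutor.__init__.
    """
    try:
        rel = host_path.relative_to(user_root)
    except ValueError:
        return str(CONTAINER_ROOT)
    return str(CONTAINER_ROOT / rel)
```

Because there is only one bind mount, a single relative-path computation covers both `--workdir` and the `run_python` script path; a second mount would need its own mapping rule.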


@ -3,17 +3,48 @@
Module boundaries:
- `network.py`: Docker network ensure (`zcbot-sandbox-net`; `--internal` isolates outbound + cross-container traffic)
- `pool.py`: per-user container lifecycle (ensure / mark_active / reap_idle / shutdown_all)
- `__init__.py`: module-level singleton (`init_pool` / `get_pool`), so that the web lifespan and
  `agent_builder` share the same pool instance
Not in this directory: the docker exec calls for the `shell` / `run_python` tools. That is Step 3's
`core/executor_docker.py`, which calls this module's `pool.ensure(user_id)` to get the container name and then execs
Not in this directory: the docker exec calls for the `shell` / `run_python` tools. That lives in `core/executor_docker.py`,
which calls this module's `pool.ensure(user_id)` to get the container name and then execs
"""
from __future__ import annotations
from pathlib import Path
from typing import Optional
from .pool import SandboxPool, container_name, setup_pool
from .network import NETWORK_NAME, ensure_network
__all__ = [
"SandboxPool",
"container_name",
"setup_pool",
"NETWORK_NAME",
"ensure_network",
"init_pool",
"get_pool",
]
# Module-level singleton. The web lifespan startup hook calls `init_pool(user_root_base)`;
# `agent_builder` calls `get_pool()` when building a DockerExecutor to get the same instance.
# Uninitialized → `get_pool()` returns None, and agent_builder must not take the docker branch in that case.
_pool: Optional[SandboxPool] = None


def init_pool(user_root_base: Path) -> SandboxPool:
    """Idempotently initialize the module-level pool and return it.

    Lifespan calls this once; ensure_network is itself idempotent, and repeated calls
    return the same instance (no rebuild).
    """
    global _pool
    if _pool is None:
        _pool = setup_pool(user_root_base)
    return _pool


def get_pool() -> Optional[SandboxPool]:
    return _pool


@ -5,27 +5,30 @@
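workspace directory)
Lifecycle:
- `ensure(user_id)`: serialized by a per-user `asyncio.Lock` → `docker inspect` probe → running:
  return directly; exists-but-stopped: `rm -f` and restart (so iptables is re-applied); missing: `docker run`
- `ensure(user_id)`: serialized by a per-user `threading.Lock` → `docker inspect` probe →
  running: return directly; exists-but-stopped: `rm -f` and restart (so iptables is re-applied);
  missing: `docker run`
- `mark_active(user_id)`: after an exec, updates the in-memory `_last_active[uid]=now` (docker labels
  cannot be modified at runtime: Docker 23+ removed `docker update --label-add` support)
- `reap_idle()`: periodic task; scans the `_last_active` dict and runs `docker rm -f` on entries idle for more than `idle_ttl`
- `shutdown_all()`: clears predecessor orphans at app startup (`docker ps --filter label=zcbot.product=sandbox`)
The API is fully synchronous: the main callers of ensure are AgentLoop / DockerExecutor, which run inside web BG threads
and are naturally synchronous; the reaper runs on the uvicorn loop and calls this class's sync methods through a `run_in_executor` wrapper.
threading.Lock works across threads, whereas an asyncio.Lock's protection is bypassed as ephemeral loops are created / destroyed.
Idempotency:
- A repeated ensure call costs one daemon round-trip of < 100ms (a single `docker inspect`); the per-user lock
  prevents two concurrent `docker run --name` calls for the same user from hitting "Conflict" (docker itself would reject
  the second one, but locking up front is cleaner)
- The reaper only kills containers recorded in the dict; after a restart the dict is empty, so historical orphans are not
  reaped (the startup `shutdown_all` covers that case)
Step 2 scope: pool / lifecycle only. Tools (shell / run_python) are wired in at Step 3.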
from __future__ import annotations
import asyncio
import os
import subprocess
import threading
import time
from pathlib import Path
from typing import Dict, List, Optional
@ -97,17 +100,19 @@ class SandboxPool:
os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
)
self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
self._locks: Dict[UUID, asyncio.Lock] = {}
self._dict_lock = threading.Lock() # guards dictionary-level races on _locks / _last_active
self._locks: Dict[UUID, threading.Lock] = {}
self._last_active: Dict[UUID, int] = {}
def _lock_for(self, user_id: UUID) -> asyncio.Lock:
def _lock_for(self, user_id: UUID) -> threading.Lock:
with self._dict_lock:
if user_id not in self._locks:
self._locks[user_id] = asyncio.Lock()
self._locks[user_id] = threading.Lock()
return self._locks[user_id]
async def ensure(self, user_id: UUID) -> str:
"""Return the container name; create-or-reuse is atomic."""
async with self._lock_for(user_id):
def ensure(self, user_id: UUID) -> str:
"""Return the container name; create-or-reuse is atomic. Blocks synchronously; the main caller, AgentLoop, is already on a BG thread."""
with self._lock_for(user_id):
name = container_name(user_id)
if _container_running(name):
self._last_active[user_id] = _now()
@ -118,7 +123,7 @@ class SandboxPool:
["docker", "rm", "-f", name],
capture_output=True, check=False,
)
await asyncio.to_thread(self._docker_run, user_id, name)
self._docker_run(user_id, name)
self._last_active[user_id] = _now()
return name
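
The per-user locking scheme in this hunk (one `threading.Lock` per key, created under a dictionary-level lock) can be exercised in isolation. The sketch below is an illustration under stated assumptions, not the pool's code: `KeyedLocks`, `run_demo` and the sleep that stands in for the docker round-trip are hypothetical.

```python
import threading
import time
from typing import Dict


class KeyedLocks:
    """Per-key threading.Lock registry, guarded by one dictionary-level lock."""

    def __init__(self) -> None:
        self._dict_lock = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, key: str) -> threading.Lock:
        with self._dict_lock:  # avoid two threads creating two locks for one key
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]


def run_demo(workers: int = 8) -> int:
    """Return the highest number of threads seen inside the critical section."""
    locks = KeyedLocks()
    counter_lock = threading.Lock()
    state = {"inside": 0, "peak": 0}

    def ensure(key: str) -> None:
        with locks.lock_for(key):
            with counter_lock:
                state["inside"] += 1
                state["peak"] = max(state["peak"], state["inside"])
            time.sleep(0.01)  # stands in for the docker inspect / docker run round-trip
            with counter_lock:
                state["inside"] -= 1

    threads = [threading.Thread(target=ensure, args=("user-1",)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With every worker using the same key, `run_demo()` reports a peak of 1, which is the serialization that `ensure` needs for concurrent tool calls of one user; workers with different keys would not block each other.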


@ -0,0 +1,285 @@
"""DockerExecutor 单元测试。
mock subprocess(`docker exec` 命令的实际跑由部署机 smoke ,RUN.md 5 条命令)
覆盖关键路径:
- 信任域 dispatch:host 工具直通 / container 工具走 docker exec
- argv 形态:--user / --workdir / setsid / bash -c / python <script>
- tmp .py:写到 host `.zcbot_tmp/<task_id>/`,执行完 unlink,无残留
- timeout / cancel:Popen.kill() 兜底
- schemas() / has_tool() 透传 host
"""
from __future__ import annotations
import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
from uuid import uuid4
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from core.executor import ExecCtx, ToolResult
from core.executor_docker import DockerExecutor, TMP_SUBDIR
from core.executor_host import HostExecutor
class FakePool:
"""SandboxPool 替身:ensure 返固定容器名,mark_active 记录调用。"""
def __init__(self):
self.ensure_calls = []
self.mark_active_calls = []
def ensure(self, user_id):
name = f"zcbot-sandbox-{user_id}"
self.ensure_calls.append(user_id)
return name
def mark_active(self, user_id):
self.mark_active_calls.append(user_id)
class FakeTool:
"""tools.base.Tool 替身:execute 返串,schema 暴露 name + 空 parameters。"""
def __init__(self, name, output="ok"):
self.name = name
self._output = output
self.execute_calls = []
@property
def schema(self):
return {"type": "function", "function": {"name": self.name}}
def execute(self, **kwargs):
self.execute_calls.append(kwargs)
return self._output
def make_executor(tools_dict=None):
"""构造 DockerExecutor + FakePool + tmp user_root。返回 (executor, pool, tmp_dir)。"""
tmp = tempfile.mkdtemp()
user_root = Path(tmp) / "users" / "u1"
user_root.mkdir(parents=True)
working_dir = user_root / "demo"
working_dir.mkdir()
if tools_dict is None:
tools_dict = {
"read": FakeTool("read", "READ_OUT"),
"shell": FakeTool("shell"), # host shell 不应被调用
"run_python": FakeTool("run_python"),
}
host = HostExecutor(tools_dict)
pool = FakePool()
executor = DockerExecutor(
host=host,
pool=pool,
user_id=uuid4(),
user_root=user_root,
working_dir=working_dir,
)
return executor, pool, Path(tmp)
def make_ctx(executor):
return ExecCtx(
user_id=executor.user_id,
task_id=uuid4(),
working_dir=executor.working_dir,
cancel_check=None,
)
class TestHostPassthrough(unittest.TestCase):
"""非 container tool 直通 host backend,不调 pool / subprocess。"""
def test_read_passthrough_to_host(self):
executor, pool, _ = make_executor()
ctx = make_ctx(executor)
result = executor.call_tool("read", {"file": "x"}, ctx)
self.assertEqual(result.content, "READ_OUT")
self.assertEqual(result.exit_code, 0)
self.assertEqual(pool.ensure_calls, [])
self.assertEqual(pool.mark_active_calls, [])
def test_schemas_and_has_tool_from_host(self):
executor, _, _ = make_executor()
names = [s["function"]["name"] for s in executor.schemas()]
self.assertIn("read", names)
self.assertIn("shell", names)
self.assertTrue(executor.has_tool("shell"))
self.assertFalse(executor.has_tool("nope"))
class TestShellExec(unittest.TestCase):
"""shell 调用走 docker exec subprocess,argv 形态正确。"""
def test_shell_invokes_docker_exec(self):
executor, pool, _ = make_executor()
ctx = make_ctx(executor)
proc = MagicMock()
proc.communicate.return_value = ("hello\n", "")
proc.returncode = 0
with patch("core.executor_docker.subprocess.Popen", return_value=proc) as popen:
result = executor.call_tool("shell", {"command": "echo hello"}, ctx)
self.assertIn("[stdout]\nhello", result.content)
self.assertIn("[exit 0]", result.content)
self.assertEqual(result.exit_code, 0)
argv = popen.call_args[0][0]
self.assertEqual(argv[:2], ["docker", "exec"])
self.assertIn("--user", argv)
self.assertIn("--workdir", argv)
# workdir 应是 /workspace/demo(working_dir 相对 user_root)
self.assertEqual(argv[argv.index("--workdir") + 1], "/workspace/demo")
# container name = zcbot-sandbox-<uid>
container_idx = argv.index(f"zcbot-sandbox-{executor.user_id}")
# setsid bash -c 必须出现且紧跟 container 之后
self.assertEqual(argv[container_idx + 1:], ["setsid", "bash", "-c", "echo hello"])
self.assertEqual(pool.ensure_calls, [executor.user_id])
self.assertEqual(pool.mark_active_calls, [executor.user_id])
def test_shell_bad_args(self):
executor, _, _ = make_executor()
ctx = make_ctx(executor)
result = executor.call_tool("shell", {"command": ""}, ctx)
self.assertIn("[Error]", result.content)
self.assertEqual(result.exit_code, 2)
def test_shell_timeout(self):
executor, pool, _ = make_executor()
ctx = make_ctx(executor)
import subprocess as real_subprocess
proc = MagicMock()
# 第一次 communicate 抛 TimeoutExpired,第二次(kill 后)返空
proc.communicate.side_effect = [
real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
("", "killed\n"),
]
proc.returncode = -9
with patch("core.executor_docker.subprocess.Popen", return_value=proc), \
patch("core.executor_docker.time.monotonic", side_effect=[0, 100]):
result = executor.call_tool("shell", {"command": "sleep 9999", "timeout": 1}, ctx)
self.assertIn("timed out after 1s", result.content)
self.assertEqual(result.exit_code, 124)
proc.kill.assert_called_once()
def test_shell_cancel(self):
executor, _, _ = make_executor()
ctx = ExecCtx(
user_id=executor.user_id,
task_id=uuid4(),
working_dir=executor.working_dir,
cancel_check=lambda: True, # 立即 cancel
)
import subprocess as real_subprocess
proc = MagicMock()
proc.communicate.side_effect = [
real_subprocess.TimeoutExpired(cmd="docker", timeout=0.5),
("", ""),
]
proc.returncode = -15
with patch("core.executor_docker.subprocess.Popen", return_value=proc):
result = executor.call_tool("shell", {"command": "sleep 9999"}, ctx)
self.assertIn("cancelled by user", result.content)
self.assertEqual(result.exit_code, 130)
proc.kill.assert_called_once()
class TestRunPython(unittest.TestCase):
    """run_python: the tmp .py lands in user_root/.zcbot_tmp/<task_id>/ and is unlinked after the run."""

    def test_run_python_tmp_script(self):
        executor, pool, tmp_root = make_executor()
        ctx = make_ctx(executor)
        proc = MagicMock()
        proc.communicate.return_value = ("42\n", "")
        proc.returncode = 0
        captured_argv = []

        def _popen(argv, **kwargs):
            captured_argv.append(argv)
            return proc

        with patch("core.executor_docker.subprocess.Popen", side_effect=_popen):
            result = executor.call_tool(
                "run_python", {"code": "print(42)"}, ctx
            )
        self.assertIn("[stdout]\n42", result.content)
        self.assertEqual(result.exit_code, 0)
        argv = captured_argv[0]
        # Tail of argv: setsid python /workspace/.zcbot_tmp/<task_id>/<rand>.py
        self.assertEqual(argv[-3], "setsid")
        self.assertEqual(argv[-2], "python")
        self.assertTrue(argv[-1].startswith(f"/workspace/{TMP_SUBDIR}/{ctx.task_id}/"))
        self.assertTrue(argv[-1].endswith(".py"))
        # PYTHONIOENCODING / PYTHONPATH are injected via -e
        env_kvs = [argv[i + 1] for i, a in enumerate(argv) if a == "-e"]
        self.assertIn("PYTHONIOENCODING=utf-8", env_kvs)
        self.assertIn("PYTHONPATH=/workspace", env_kvs)
        # The host-side tmp file is already unlinked (the directory may remain,
        # which is harmless: ensuring the container re-runs mkdir)
        tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
        leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
        self.assertEqual(leftover, [], f"tmp .py not cleaned up: {leftover}")

    def test_run_python_bad_code_type(self):
        executor, _, _ = make_executor()
        ctx = make_ctx(executor)
        result = executor.call_tool("run_python", {"code": 123}, ctx)
        self.assertIn("[Error]", result.content)
        self.assertEqual(result.exit_code, 2)

    def test_run_python_cleans_tmp_on_exception(self):
        """The tmp .py must still be cleaned up when Popen raises (finally fallback)."""
        executor, _, _ = make_executor()
        ctx = make_ctx(executor)
        with patch(
            "core.executor_docker.subprocess.Popen",
            side_effect=RuntimeError("boom"),
        ):
            result = executor.call_tool("run_python", {"code": "x"}, ctx)
        self.assertIn("[Error executing run_python via docker]", result.content)
        self.assertEqual(result.exit_code, 1)
        tmp_subroot = executor.user_root / TMP_SUBDIR / str(ctx.task_id)
        leftover = list(tmp_subroot.glob("*.py")) if tmp_subroot.exists() else []
        self.assertEqual(leftover, [])
class TestUnknownTool(unittest.TestCase):
    def test_unknown_tool_goes_to_host(self):
        executor, _, _ = make_executor(tools_dict={})  # empty host registry → no tools at all
        ctx = make_ctx(executor)
        result = executor.call_tool("nope", {}, ctx)
        self.assertIn("unknown tool", result.content)
        self.assertEqual(result.exit_code, 2)

    def test_container_tool_not_registered_on_host(self):
        """caps.enable_run_python=False: the host has no run_python registered, so docker must reject it too."""
        executor, _, _ = make_executor(tools_dict={"read": FakeTool("read")})
        ctx = make_ctx(executor)
        result = executor.call_tool("run_python", {"code": "x"}, ctx)
        self.assertIn("unknown tool", result.content)
        self.assertEqual(result.exit_code, 2)


if __name__ == "__main__":
    unittest.main()
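
The timeout and cancel tests above suggest a polling pattern: `communicate` is called with a short timeout, and on each `TimeoutExpired` the caller checks for cancellation and for the overall deadline before killing the process and draining its output. The `DockerExecutor` source is not shown here, so the following is only a minimal sketch of that pattern under stated assumptions. The function name `run_with_deadline`, the `poll` parameter, and the message format are hypothetical; only the exit codes 124 and 130 and the "timed out" / "cancelled by user" wording are taken from the assertions above.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence, Tuple

EXIT_TIMEOUT = 124    # exit code the timeout test above asserts
EXIT_CANCELLED = 130  # exit code the cancel test above asserts


def run_with_deadline(
    argv: Sequence[str],
    timeout: float,
    cancel_check: Callable[[], bool],
    poll: float = 0.5,
) -> Tuple[int, str]:
    """Hypothetical helper: run argv, polling for cancel and timeout.

    Returns (exit_code, combined_text). Not the real DockerExecutor code.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Short wait so cancel/timeout are noticed within `poll` seconds.
            out, err = proc.communicate(timeout=poll)
            return proc.returncode, out + err
        except subprocess.TimeoutExpired:
            if cancel_check():
                reason, code = "cancelled by user", EXIT_CANCELLED
            elif time.monotonic() >= deadline:
                reason, code = f"timed out after {timeout:g}s", EXIT_TIMEOUT
            else:
                continue  # still running and still allowed to run
            # Kill once, then drain the pipes with a second communicate().
            proc.kill()
            out, err = proc.communicate()
            return code, f"[{reason}]\n{out}{err}"


if __name__ == "__main__":
    code, text = run_with_deadline(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        timeout=1,
        cancel_check=lambda: False,
    )
    print(code, text.strip())
```

As the commit message notes, when the child is a `docker exec` client, `kill()` only stops the CLI process on the host; anything still running inside the container is left to the idle reaper.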


@@ -481,7 +481,7 @@ def create_app() -> FastAPI:
    async def lifespan(app: FastAPI):
        broker.bind_loop(asyncio.get_running_loop())
        # Scan the skill registry once at startup: the filesystem is static and
        # does not change at runtime; /v1/skills reads it directly
        from core.agent_builder import load_config
        from core.agent_builder import load_config, resolve_workspace
        from core.paths import ROOT
        from core.skills import SkillRegistry
        _cfg = load_config()
@@ -500,7 +500,59 @@ def create_app() -> FastAPI:
)
if result.rowcount:
print(f"[startup] reaped {result.rowcount} stale active run(s)")
        # Sandbox pool (§7.5): enabled only when ZCBOT_SANDBOX_BACKEND=docker.
        # Startup hooks: (1) init_pool (creates the docker network and the pool instance);
        # (2) shutdown_all sweeps orphans left by the previous process (zcbot-sandbox-*
        # containers; the in-memory _last_active is empty, so remove them all and start fresh);
        # (3) a background reaper task runs reap_idle every 60s.
        sandbox_backend = os.getenv("ZCBOT_SANDBOX_BACKEND", "host").lower()
        sandbox_reaper_task = None
        if sandbox_backend == "docker":
            from core.sandbox import init_pool
            workspace = resolve_workspace(None, _cfg)
            try:
                pool = init_pool(workspace / "users")
                removed = pool.shutdown_all()
                if removed:
                    print(f"[startup] swept {len(removed)} stale sandbox container(s)")

                async def _reaper() -> None:
                    loop = asyncio.get_running_loop()
                    while True:
                        try:
                            await asyncio.sleep(60)
                            removed = await loop.run_in_executor(None, pool.reap_idle)
                            if removed:
                                print(f"[reaper] reaped {len(removed)} idle sandbox container(s)")
                        except asyncio.CancelledError:
                            raise
                        except Exception as e:
                            print(f"[reaper] error: {type(e).__name__}: {e}")

                sandbox_reaper_task = asyncio.create_task(_reaper(), name="sandbox-reaper")
                app.state.sandbox_pool = pool
            except Exception as e:
                # ensure_network failed or the docker CLI is unavailable → fail fast.
                # Stage C protocol: any missing hardening means the deployment is
                # incomplete; do not degrade to host (otherwise we would believe we
                # are sandboxed while actually running bare).
                raise RuntimeError(
                    f"ZCBOT_SANDBOX_BACKEND=docker but sandbox init failed: {e}"
                )
        try:
            yield
        finally:
            if sandbox_reaper_task is not None:
                sandbox_reaper_task.cancel()
                try:
                    await sandbox_reaper_task
                except (asyncio.CancelledError, Exception):
                    pass
            if sandbox_backend == "docker":
                pool = getattr(app.state, "sandbox_pool", None)
                if pool is not None:
                    try:
                        pool.shutdown_all()
                    except Exception as e:
                        print(f"[shutdown] sandbox shutdown_all error: {type(e).__name__}: {e}")

    app = FastAPI(
        title="zcbot api",