Stage C wrap-up pack: resource yaml + disk quota + network opening + persistent package mirrors inside the container

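During the dogfood + trusted-colleague allowlist phase, the full Step 4 egress proxy is deferred (recorded as upgrade
trigger signals: any unknown user registers / abnormal model outbound traffic / someone not closely known appears on the trusted allowlist → Step 4 becomes mandatory).
This batch contains 3 items: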

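(A) Container resources moved to yaml (tunable without a rebuild):
- agent.yaml gains a sandbox section (memory/cpus/pids_limit)
- SandboxPool ctor gains three fields, priority env > yaml > default (2g/1.0/256)
- setup_pool/init_pool pass sandbox_cfg through
- sandbox check output gains 4 [info] lines so ops can reconcile at a glance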

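(B) Application-layer disk quota (§7.5 #4 soft quota):
- migration 0008 user_disk_usage, one row per user
- core/storage/disk_quota.py: parse_bytes("5gb"/int) + scan_user_dir
  (os.scandir, skipping top-level .zcbot_tmp / .memory) + upsert ON CONFLICT
  + check_disk_quota + serial scan_all_users
- lifespan _disk_scanner background task (runs once at startup, then every 15 min by default)
- DockerExecutor write/edit check the gate first; over quota returns [Error] without calling the container
- /v1/files/upload uses the same gate; over quota returns HTTP 413
- yaml `quotas.disk_bytes_per_user: 5gb` + `disk_scan_interval_seconds: 900`
- accepted race: writes between scans can slightly exceed the limit (same race-tolerant approach as the image/video quotas);
  OS-level xfs prjquota is the backstop before opening to external users
- 11 tests cover parse_bytes / scan / dotfile skipping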

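(C) Network opening + persistent package mirrors inside the container:
- network.py drops the --internal flag; containers use the default docker bridge with NAT outbound
- an existing internal network is not removed automatically, only warned about; RUN.md gives the migration command (avoids breaking existing containers)
- the iptables red-line ranges are unchanged (169.254/127/10/172.16/192.168/100.64/PG_IP DROP),
  blocking cloud metadata + intranet scanning + loopback; this baseline does not depend on the proxy
- Dockerfile adds /etc/pip.conf (global index-url + timeout 60) + /etc/npmrc
  (global registry) so that runtime `pip install foo` / `npm install bar` by the model
  also go through the mirror (previously --build-arg only took effect at build time)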
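
The red-line ranges in (C) can be read as a simple membership test. The sketch below is illustrative only and is not the init.sh rule set: the commit lists prefixes without mask lengths, so the CIDR widths here are assumed to be the standard link-local, loopback, RFC 1918, and CGNAT blocks, and `is_blocked` / `extra_ips` are hypothetical names.

```python
import ipaddress
from typing import Iterable

# Assumption: standard mask lengths for the prefixes named in the commit message.
BLOCKED_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "169.254.0.0/16",  # link-local, includes the cloud metadata address
        "127.0.0.0/8",     # loopback
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "100.64.0.0/10",   # carrier-grade NAT
    )
]


def is_blocked(addr: str, extra_ips: Iterable[str] = ()) -> bool:
    """Return True if addr is in a red-line range or listed in extra_ips (e.g. PG IPs)."""
    ip = ipaddress.ip_address(addr)
    if any(ip == ipaddress.ip_address(extra) for extra in extra_ips):
        return True
    return any(ip in net for net in BLOCKED_NETS)
```

A public address such as `8.8.8.8` falls outside every listed range, which is what lets `pip install` reach a mirror while metadata and intranet addresses stay unreachable.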

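unittest discover 46/46 PASS (35 existing + 11 new).
DESIGN unchanged (the deferral decision still falls within the §7.7 Stage C phase semantics; the trigger signals are recorded in
PROGRESS / RUN); RUN.md adds the env list + network migration + quota + 3 troubleshooting rows.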

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
caoqianming 2026-05-27 08:35:53 +08:00
parent 792366d9fc
commit eaf7f3ea1e
14 changed files with 575 additions and 28 deletions


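@ -2,7 +2,7 @
 > Companion to `DESIGN.md`. This file records only phase status, decision deviations, file volume, and next steps. Each entry is 1-2 sentences: what was done + the key judgment; for details see `git log` / `git diff` / `DESIGN §7.9`
-Last updated: 2026-05-26 (Stage C Step 3d: fs tools (read/write/edit/glob/grep) moved into the container + DESIGN §7.5 #6 rewritten, a physical boundary replaces code guardrails)
+Last updated: 2026-05-27 (Stage C wrap-up pack: container resources in yaml + disk quota (scan+gate) + network opened for dogfood + persistent pip/npm mirrors inside the container; the full Step 4 egress proxy is deferred until just before opening to external users)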
---
@ -15,14 +15,15 @@
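 | 5 | Eval Suite | ⏸ Not doing | replaced by dogfooding; probe covers health checks |
 | 6 | Long-task engineering | 🟡 | task + recovery ✅; two-layer memory ✅; context compression not done |
 | 7 | Polish | ❌ | Docker sandbox / more skills |
-| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C Step 1-3 ✅ (Executor interface + Docker pool + DockerExecutor integrated into AgentLoop, `ZCBOT_SANDBOX_BACKEND=docker` switches to containers) + Step 5 pre-deployment check ✅ (`main.py sandbox check` + lifespan fs quota WARN)**; Step 4 egress proxy + Step 3b PGID kill protocol pending; **opening to external users still requires the egress proxy + xfs project quota OS-level hardening (§7.5 checklist #2 #4)**. |
+| §7 SaaS | DESIGN §7 roadmap | 🟡 | A event streaming ✅; B complete ✅; D `/v1` JSON API ✅; D' transitional auth + dev SPA ✅; single-active-run lock + cancel ✅; 0004 schema slimming ✅; entry points relocated ✅; real OIDC pending; **C Step 1-3 + 3d ✅ (Executor + Docker pool + DockerExecutor + fs tools in the container) + Step 5 pre-deployment check ✅ + container resource yaml + application-layer disk quota (scan+gate) ✅ + network opened for dogfood + persistent pip/npm mirrors inside the container ✅**; **the full Step 4 egress proxy + Step 3b PGID kill protocol are deferred until just before opening to external users**; **opening to external users still requires the egress proxy + xfs project quota OS-level hardening (§7.5 checklist #2 #4)**. |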
---
 ## Completed key capabilities
-### 2026-05-26
+### 2026-05-27
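+- **Stage C wrap-up pack: container resources in yaml + application-layer disk quota + network opened for dogfood + persistent pip/npm mirrors inside the container**: The full Step 4 egress proxy (allowlist + audit + byte metering) is 1-2 days of engineering and **does not need to come first during the dogfood + trusted-colleague allowlist phase**, which is consistent with the DESIGN §7.7 phase semantics; it is recorded as upgrade trigger signals (any unknown user registers / dogfood reveals abnormal model outbound traffic / someone not closely known appears on the trusted allowlist → Step 4 is mandatory). This batch does 3 things: ① **Container resources in yaml**: `config/agent.yaml` gains a `sandbox` section (memory/cpus/pids_limit), `SandboxPool.__init__` gains three fields, priority env > yaml > default (2g/1.0/256); `setup_pool` / `init_pool` pass sandbox_cfg through; `main.py sandbox check` prints 4 extra `[info]` lines (memory/cpus/pids_limit/disk_bytes_per_user) so ops can reconcile at a glance. ② **Application-layer disk quota**: migration `0008_user_disk_usage` (one row per user, bytes_used/file_count/scanned_at) + `core/storage/disk_quota.py` (`parse_bytes` ("5gb"/"500mb"/int) + `scan_user_dir` (os.scandir, skipping the top-level dotfiles `.zcbot_tmp` `.memory`) + `upsert_user_usage` ON CONFLICT + `check_disk_quota` (returns a Chinese message when over quota) + `scan_all_users`, which scans all users serially) + the web/app.py lifespan `_disk_scanner` background task (runs once at startup, then every 15 min by default via `run_in_executor`) + `DockerExecutor._exec_fs_tool` write/edit start with `_check_user_disk_quota` and return `[Error]` without calling the container when over quota + `/v1/files/upload` uses the same gate and returns HTTP 413 when over quota. yaml `quotas.disk_bytes_per_user: 5gb` + `disk_scan_interval_seconds: 900`; ≤0 means unlimited, and before the first scan the check short-circuits to allow so that a cold start does not block writes. Accepted race: writes between scans can slightly exceed the limit (same race-tolerant approach as the image/video quotas). ③ **Network opening + persistent mirrors inside the container**: `core/sandbox/network.py` drops the `--internal` flag (switching to the default docker bridge with NAT outbound; during dogfood the model can `pip install foo` / `curl https://...`); an existing internal network is not removed automatically, only warned about (to avoid breaking existing containers; RUN.md gives the migration command). The Dockerfile adds `/etc/pip.conf` (writes `[global]\nindex-url=${PIP_INDEX_URL}` + timeout 60) + `/etc/npmrc` (writes `registry=${NPM_REGISTRY}`) so that runtime pip / npm install also go through the mirror (previously `--build-arg` only took effect at build time). The iptables red-line ranges are unchanged: `169.254/127/10/172.16/192.168/100.64/PG_IP` are still DROPped, blocking cloud metadata + intranet scanning + loopback; this baseline does not depend on the proxy. **Tests**: `tests/test_disk_quota.py` has 11 tests covering parse_bytes units / scan_user_dir skipping dotfiles / empty directory / nonexistent path; **unittest discover 46/46 PASS** (35 existing + 11 new). **DESIGN §7.5 #2 is to gain a "Step 4 deferral + upgrade trigger table" paragraph in a later commit** (this commit does not touch DESIGN: DESIGN changes only when the architecture changes, the deferral decision still falls within the §7.7 Stage C phase semantics, and the trigger signals are recorded in PROGRESS / RUN); RUN.md adds the yaml sandbox section + network migration + quota commands + 2 troubleshooting rows (internal network legacy / disk 413). Rejected: (a) automatic rm + recreate when the network's internal setting has to change: destructive, breaks existing container connections; warn instead and let ops acknowledge; (b) a live du before each write: when user_root is large it costs several seconds per write, which is unacceptable; a sticky periodic-scan table + pre-write table lookup is the same pattern as the image/video quotas; (c) doing the full Step 4 at the same time: 1-2 days of heavy work that does not block dogfood; opening the network so the model can pip install is more urgent (in practice, the ability to install packages / fetch resources is a product threshold); (d) a hard disk-quota block on all writes (including run_python / shell): intercepting syscalls is too heavy; the write/edit + upload gates already cover 95% (skill output paths), and files written by run_python / shell are picked up by later scans (new writes are blocked at the check after the next cycle); (e) yaml `sandbox.memory` defaulting to 4g/2cpu: the Tencent Cloud Lighthouse box has 4 cores / 8 GB and the host must keep room for web + PG + nginx, so 2g/1cpu is a reasonable baseline; users with extreme tasks can raise it in yaml.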
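 - **Stage C Step 3d: fs tools (read/write/edit/glob/grep) moved into the container + DESIGN §7.5 #6 rewritten**: After the first switch to the docker backend during Ubuntu dogfood, the host tools were found to leak through `Path.cwd()`: the model used glob `*` and listed the host's `/home/lighthouse/zcbot/.git/.venv/config/core/...`, i.e. zcbot's own source. Checking back, DESIGN §7.5 #6 said "host tools are validated through `paths.py::resolve_user_path`", but a grep of the code showed **that function does not exist at all**, a false premise; `Tool._resolve` is actually `base_dir / path` with base_dir=`Path.cwd()` (= web startup directory = zcbot repo root), absolute paths are not blocked at all, and the model could read `/etc/passwd` / write zcbot's own source. **Comparison of fixes**: Phase A (change cwd → working_dir, a 1-line hack) fixes UX but not security; Phase B (host tools enforce a user_root check + skills/ allowlist, ~80 lines) is secure but fragile (symlink/`..`/Windows paths each need a case-by-case block, and missing one breaks it); **option 3 (fs tools inside the container) replaces code guardrails with a physical boundary, and is the one chosen**. `core/sandbox/tool_runner.py` adds an in-container helper (~80 lines; reads JSON args from stdin and calls the `tools/fs.py` Tool subclasses, base_dir=cwd passed in via docker exec --workdir, user_root=/workspace); `DockerExecutor` adds the `FS_TOOLS = {read,write,edit,glob,grep}` trust domain + an `_exec_fs_tool` method that runs `docker exec -i ... python /sandbox/tool_runner.py <name>` + feeds JSON args on stdin (CJK paths pass through transparently without being split by shell metachars); `_run_subprocess` gains a stdin parameter + an is_fs_tool path that returns stdout directly (not wrapped in [stdout]/[exit], preserving the original model-facing semantics), and on exit≠0 uses stderr as the ToolResult content. `SandboxPool` gains a `repo_root` field, `_docker_run` adds the `<repo>/skills:/sandbox/skills:ro` mount (so that an in-container read can resolve `references/foo.md` referenced from SKILL.md); the `web/app.py` lifespan passes `ROOT` through; the `Dockerfile` does `COPY tools/ /sandbox/tools/ + tool_runner.py` so the image carries a copy of the tools source (build-time COPY rather than a mount: in-container code should not follow host repo edits and restarts). **Tools that stay on the host**: `load_skill` (in-memory SkillRegistry lookup, no fs escape) / `web_search` / `web_fetch` / `seedream` / `seedance` (they hold the Bocha/ARK API keys, which are not put in the container env; to be revisited after the Step 4 egress proxy). **Tests**: `tests/test_executor_docker.py` changes to `test_load_skill_passthrough_to_host` (the original `test_read_passthrough_to_host` no longer holds since read moved into the container) + adds 4 fs path tests (read argv shape / CJK path passed transparently via stdin JSON / grep exit≠0 stderr passthrough / glob timeout kills the docker CLI), `unittest discover 35/35 PASS`. **DESIGN §7.5 #6 rewritten**: from "tool dichotomy (host fs + container code)" to "almost all tools go into the container; the host keeps only those that hold keys or span users" + an annotated 2026-05-26 correction record (tracing the original false premise). **Cost**: each fs tool call adds ~200ms of docker exec overhead; at conversation level N≤15 that totals 1-3s, which is noise next to 5-30s of LLM inference; the image build's COPY tools/ adds ~5s. **Upgrade trigger** (§7.9 upgrade table): if the metric `docker_exec_overhead / total_tool_time > 30%` persists for two weeks, or the model starts a "long-running service inside the container" workflow, enable an in-container tool-runner unix socket RPC (removing the per-exec overhead). Rejected: (a) the Phase B path validator: symmetric with §7.9 § "aesthetic uniformity ≠ a reason to upgrade", **security is only solid when it is "physical, not code"**; (b) `COPY core/ tools/ ...` putting all of zcbot core into the image: tool_runner only needs `tools/fs.py` + base.py, and extra code in the container enlarges the attack surface; (c) having tool_runner.py take JSON args via argv: CJK / quotes / path separators all risk being split as shell metachars, so stdin is the stable choice; (d) letting the host backend also keep the fs tools behind a user_root check as "double insurance": two sources = a source of drift; the docker backend is the entire basis of the §7.5 argument, and since the host backend is not used in external-user scenarios, the docker backend alone is enough.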
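 - **Stage C Step 3 hotfix: exec_user switched to a username that follows the build_arg + Dockerfile adds Node/Chromium/mermaid-cli**: Dogfood on Ubuntu exposed two real problems. ① **uid mismatch**: DockerExecutor hard-coded `--user 1000:1000`, but the image is built with `docker build --build-arg HOST_UID=$(id -u)` following the host's actual uid (the lighthouse user on Tencent Cloud Lighthouse has uid=1001), so docker exec entered the container as uid=1000 → the bind mount `/workspace/<wd>/` is owned by 1001 → every file write hit EACCES and files landed in `/tmp/`. Changed to `DEFAULT_EXEC_USER = "zcbot"` (a username; docker looks up the uid in the container's /etc/passwd), so whether HOST_UID was built as 1000/1001/anything else it matches the bind mount owner, and moving to another deployment machine needs no env change. ② **the proposal/patent skills lack Node for rendering mermaid**: in `skills/proposal/scripts/render_diagrams.py`, `render_via_mmdc` relies on `shutil.which("mmdc")`; the container did not have it installed → fallback to the public mermaid.ink API → but the sandbox container's `--internal` network denies outbound by default, so the API was unreachable too → the ASCII fallback produced a docx without diagrams, which is unusable. The Dockerfile adds an apt install of `chromium nodejs npm` (Debian bookworm ships node 18.x, which is new enough) + `npm install -g @mermaid-js/mermaid-cli@latest`; the image grows by ~400MB (accepted). Chromium inside the container fails without the setuid sandbox and with a too-small `/dev/shm`, so the image ships `/sandbox/puppeteer-config.json` (`--no-sandbox` / `--disable-setuid-sandbox` / `--disable-dev-shm-usage` + executablePath=/usr/bin/chromium) + ENV `MERMAID_PUPPETEER_CONFIG=/sandbox/puppeteer-config.json`, and `render_via_mmdc` now reads the env and appends `-p <config>` to the mmdc call; on the host, where the env is unset, behavior is unchanged. `PUPPETEER_SKIP_DOWNLOAD=true` + `PUPPETEER_EXECUTABLE_PATH` make puppeteer use the container's chromium instead of downloading its bundled Chrome (saving ~300MB at build). The npm registry gets `--build-arg NPM_REGISTRY=https://mirrors.cloud.tencent.com/npm/` (Tencent Cloud intranet) to avoid slow builds inside mainland China. `DESIGN.md` unchanged (pure implementation-level bug fix + skill dependency); `RUN.md` adds an NPM_REGISTRY section + 3 troubleshooting rows (EACCES uid mismatch / mmdc fails to launch chromium / slow npm). Rejected: (a) having DockerExecutor probe `os.getuid()` at startup and use the host uid for `--user`: hard-coding the username and letting docker consult passwd is more direct than probing in the application, and when the deployment machine's uid occasionally changes (reinstalled from 1000 to 1001) nothing needs editing; (b) installing Node 20 LTS in the container from the NodeSource repo: Debian bookworm's 18.x already meets mermaid-cli's requirement (>=14.x), and an extra external fetch slows things down; (c) not installing chromium and waiting to use mermaid.ink after the Step 4 egress proxy: proposal is a capability that must ship early, and waiting for Step 4 (not yet started) is unrealistic; (d) injecting the puppeteer config by modifying the mmdc launcher script: mmdc supports `-p` by default, so having render_diagrams.py read the env is enough without touching mmdc internals.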
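 - **Stage C Step 5: `main.py sandbox check` pre-deployment check + lifespan fs quota WARN**: Opening to external users makes OS-level quota a §7.5 #4 hard requirement (xfs prjquota / ext4 project quota / zfs dataset quota; otherwise "filling the shared fs between scans drags down the whole node"), and when a docker backend startup prerequisite (daemon/image/HOST_UID alignment) is wrong the lifespan fails fast and the traceback is costly to investigate, so the "ops mental checklist" is turned into an executable command. `main.py sandbox check` runs 5 independent probes: ① the docker daemon is reachable (CLI present + `docker version` rc=0) ② the `zcbot-sandbox:latest` image exists ③ the `zcbot-sandbox-net` network exists (missing is OK since the lifespan ensures it automatically; this item warns rather than errors) ④ the zcbot uid inside the image matches the host uid (`docker run --rm --entrypoint id` gets the image uid and compares it with `os.getuid()`; skipped automatically on Windows) ⑤ the fs type under workspace/users/ supports quota (parses `findmnt --target ... -no FSTYPE,OPTIONS`, recognizing xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / other). `detect_fs_quota(path) -> (level, msg)` is factored out for reuse by the lifespan: `web/app.py` also runs it once at docker backend startup and prints a WARN to stdout (non-blocking); the application-layer periodic scan still applies. **err vs warn boundary**: err = a root cause that makes docker backend startup fail fast (daemon/image/HOST_UID, exit 1); warn = does not block startup but must be cleared before opening to external users (missing network / fs without quota support, exit 0). `tests/test_sandbox_check.py` has 19 tests covering each branch + the aggregate exit code, mocking subprocess and sys.platform (`run_sandbox_check` uses a module-level lookup instead of a frozen `CHECKS` tuple so that unittest patches take effect); **full unittest discover 31/31 PASS**. RUN.md adds a "pre-deployment check" subsection (meaning of the 5 `sandbox check` items) + a rewritten "quota hardening" section (fs type → action mapping table + 4 xfs upgrade steps) + 3 troubleshooting rows (sandbox init failed / fs quota warn / image not found). Rejected: (a) lifespan probe failure → fail-fast instead of WARN: at Step 5 the application-layer periodic scan already exists, and OS-level quota is a hard requirement for external opening, not for dogfood, so fail-fast would obstruct dogfood startup; (b) giving sandbox check a `quota-set` subcommand that calls `xfs_quota` directly: the `<pid>` integer ↔ user_uuid mapping would need a tracking table, and sudo + /etc/projects changes are ops actions; Step 5 only lands the RUN.md notes + command list, and the real work happens one step before external opening; (c) probing egress proxy status in sandbox check: Step 4 is not implemented, and a placeholder would make people think it has landed. `DESIGN.md` unchanged (pure implementation of the existing §7.5 #4 protocol); `RUN.md` updated as above.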
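
The "run once at startup, then periodically via `run_in_executor`" scanner mentioned in the 2026-05-27 entry could look roughly like the sketch below. It is an assumption-laden illustration, not the `_disk_scanner` in web/app.py: `disk_scanner`, `scan_fn`, and `interval_seconds` are hypothetical names, and the blocking scan is modeled as a plain callable.

```python
import asyncio
from typing import Callable


async def disk_scanner(scan_fn: Callable[[], int], interval_seconds: float) -> None:
    """Run a blocking scan immediately, then again after every interval.

    The scan is pushed to the default thread pool so the event loop keeps
    serving requests while directories are walked.
    """
    loop = asyncio.get_running_loop()
    while True:
        try:
            await loop.run_in_executor(None, scan_fn)
        except Exception as exc:  # one failed cycle must not kill the task
            print(f"[warn] disk scan failed: {type(exc).__name__}: {exc}")
        await asyncio.sleep(interval_seconds)
```

In a lifespan hook the coroutine would typically be started with `asyncio.create_task(...)` and cancelled at shutdown; cancellation propagates out of the `await` points as `asyncio.CancelledError`.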

11
RUN.md

@ -320,8 +320,8 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
 # docker = shell/run_python run via per-user container docker exec (deployment machine / external users)
 # ZCBOT_SANDBOX_BACKEND=docker
-# User for exec inside the container (default 1000:1000; kept aligned with the Dockerfile's HOST_UID/HOST_GID build-args)
-# ZCBOT_SANDBOX_EXEC_USER=1000:1000
+# User for exec inside the container (default zcbot; docker looks up the uid in the container's /etc/passwd)
+# ZCBOT_SANDBOX_EXEC_USER=zcbot
 # Container image tag (default zcbot-sandbox:latest)
 # ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
@ -329,6 +329,10 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
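 # ZCBOT_SANDBOX_RUNTIME=
 # Seconds of idleness before a container is reclaimed (default 300)
 # ZCBOT_SANDBOX_IDLE_TTL=300
+# Resource limits (priority env > yaml `sandbox.*` > default); after a change, restart web and newly started containers pick it up
+# ZCBOT_SANDBOX_MEMORY=2g
+# ZCBOT_SANDBOX_CPUS=1.0
+# ZCBOT_SANDBOX_PIDS_LIMIT=256
 # Actual PG IPs, comma-separated. Defense-in-depth: even if they fall within the three private ranges (§7.5 #1),
 # init.sh adds the DROP rules once more. Required in production deployments.
 ZCBOT_PG_IPS=10.1.2.3,10.1.2.4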
@ -476,6 +480,9 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
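 | `systemctl restart zcbot` hangs for 10s before exiting | There are long-lived SSE connections and uvicorn's graceful shutdown waits for in-flight requests. The unit already sets `TimeoutStopSec=10` as a SIGKILL backstop, so this is normal; if truly urgent use `systemctl kill -s KILL zcbot` |
 | `POST /v1/files/rename` returns 409 `folder has active run(s)` | The top-level directory is held by a running/cancelling task; cancel first, wait for the stream to finish, then rename |
 | `POST /v1/files/rename` returns 409 `... 前缀嵌套` (prefix nesting) | After the rename the path would nest with another task's working_dir; choose a non-conflicting new_name |
+| `POST /v1/files/upload` returns 413 `已达磁盘配额上限` (disk quota reached) | per-user 5GB (yaml `quotas.disk_bytes_per_user`). Have the user delete old outputs / large files in the file panel on the right of the dev SPA, or raise the yaml limit and restart web |
+| `[warn] network zcbot-sandbox-net is --internal (legacy)` | The previous version created the sandbox network with `--internal` (outbound fully disabled); the current dogfood phase opens it up. Run `docker stop $(docker ps -aq -f label=zcbot.product=sandbox) ; docker network rm zcbot-sandbox-net`, then restart web to recreate it automatically as non-internal |
+| tool write/edit returns `[Error] 已达磁盘配额上限` (disk quota reached) | Same as upload 413, see above |
 | Startup reports `PLATFORM_KEY env not set` / `JWT_SECRET env not set` | The D' transitional auth requires both env vars. Generate one value for each with `python -c "import secrets;print(secrets.token_urlsafe(48))"`, write them to `.env`, and restart |
 | `/v1/auth/login_password` returns 403 `invalid email or password` | The email does not exist / the `password_hash` column is empty (user created via the platform_key entry) / wrong password. Verify with `SELECT user_id, email, password_hash IS NOT NULL AS has_pw FROM users WHERE email=...`; no row → `main.py user add`; row without password → `UPDATE users SET password_hash=...` (compute it with `.venv/Scripts/python.exe -c "from web.auth import hash_password;print(hash_password('xxx'))"`) or `user add --user-id` to attach to the existing user_id |
 | `main.py user add` reports `IntegrityError ... uq_users_email` | The email already exists; change the email or first `DELETE FROM users WHERE email=...` (clear that user's tasks first) |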


@ -12,3 +12,16 @@ system_prompt: prompts/system/general_v1.md
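 quotas:
   images_per_day: 20 # cap on calls to image tools such as seedream
   videos_per_day: 5 # cap on calls to video tools such as seedance
+  # Total byte limit for the per-user working directory (uploads + everything written by tools); ≤ 0 means unlimited.
+  # Gate before writes (/v1/files/upload + DockerExecutor write/edit); over quota returns [Error] and blocks.
+  # The measured usage comes from the lifespan background scan (every 15 min) into the user_disk_usage table; slight overshoot between scans is accepted
+  # (consistent with the race-tolerant image/video quotas); OS-level xfs prjquota is added as a backstop before opening to external users.
+  disk_bytes_per_user: 5gb # accepts 5gb / 500mb / 1073741824 (integer bytes)
+  disk_scan_interval_seconds: 900 # background scan period, default 15 minutes
+# Sandbox container resource limits (docker run flags, overridable via env); restart web after a change,
+# newly started containers use the new values, running ones are unchanged (they pick them up on the next start after the 5 min idle reclaim).
+sandbox:
+  memory: 2g # --memory (env: ZCBOT_SANDBOX_MEMORY)
+  cpus: 1.0 # --cpus (env: ZCBOT_SANDBOX_CPUS)
+  pids_limit: 256 # --pids-limit (env: ZCBOT_SANDBOX_PIDS_LIMIT)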
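
The "env > yaml > default" priority for the `sandbox` section can be sketched as a small pure function. This is an illustration under assumptions rather than the project's resolver: `resolve_sandbox_limits` and its `env` parameter are hypothetical, while the env variable names and defaults are the ones listed in this commit.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULTS = {"memory": "2g", "cpus": "1.0", "pids_limit": "256"}
ENV_NAMES = {
    "memory": "ZCBOT_SANDBOX_MEMORY",
    "cpus": "ZCBOT_SANDBOX_CPUS",
    "pids_limit": "ZCBOT_SANDBOX_PIDS_LIMIT",
}


def resolve_sandbox_limits(
    yaml_cfg: Optional[Mapping[str, Any]],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Resolve each limit with priority env > yaml `sandbox.*` > default."""
    sandbox = (yaml_cfg or {}).get("sandbox") or {}
    resolved: Dict[str, str] = {}
    for key, default in DEFAULTS.items():
        env_value = env.get(ENV_NAMES[key])
        yaml_value = sandbox.get(key)
        if env_value:                    # a non-empty env var wins
            resolved[key] = env_value
        elif yaml_value is not None:     # then the yaml value
            resolved[key] = str(yaml_value)
        else:                            # finally the built-in default
            resolved[key] = default
    return resolved
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.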


@ -47,6 +47,10 @@ from .executor_host import HostExecutor
 from .sandbox import SandboxPool
+# write/edit go through the quota gate; read/glob/grep consume no disk and pass through
+_FS_TOOLS_WRITE = frozenset({"write", "edit"})
 # Trust-domain classification (§7.5 #6, corrected 2026-05-26):
 # - SHELL_LIKE: executes arbitrary code; Popen is fed cmd / script directly, wrapped in setsid
 # - FS_TOOLS: fs operations, docker exec → /sandbox/tool_runner.py + JSON args fed on stdin
@ -187,7 +191,15 @@ class DockerExecutor(Executor):
         fs tools use different defaults from shell/run_python for both cancel and timeout:
         - timeout (30s): fs operations do not run long; a hang means a mount problem / a large directory scan
         - cancel poll (the model may grep user_root and then the user stops it, so the response is immediate)
+        write/edit check the disk quota first (§7.5 #4); over quota returns [Error] without calling the container.
+        read/glob/grep consume no disk and pass through.
         """
+        if name in _FS_TOOLS_WRITE:
+            err = _check_user_disk_quota(self.user_id)
+            if err is not None:
+                return ToolResult(content=err, exit_code=2)
         timeout = int(args.get("timeout") or 30) if name == "grep" else 30
         container = self.pool.ensure(self.user_id)
@ -311,3 +323,23 @@ class DockerExecutor(Executor):
parts.append(f"[stderr]\n{stderr.rstrip()}")
parts.append(f"[exit {proc.returncode}]")
return ToolResult(content="\n".join(parts), exit_code=proc.returncode)
+def _check_user_disk_quota(user_id: UUID):
+    """Gate before write/edit: reads the yaml quota and looks up the user_disk_usage table.
+    It lives here (module-level helper) rather than as a DockerExecutor method because the
+    host_executor path (/v1/files/upload) reuses the same gate: implemented once, used in two places.
+    """
+    try:
+        from core.agent_builder import load_config
+        from core.storage.disk_quota import check_disk_quota, parse_bytes
+        cfg = load_config() or {}
+        quotas = cfg.get("quotas") or {}
+        limit = parse_bytes(quotas.get("disk_bytes_per_user"))
+        if limit is None or limit <= 0:
+            return None
+        return check_disk_quota(user_id, limit)
+    except Exception:
+        # A failed quota lookup does not block the main path (the write is still allowed; logging is left to the caller)
+        return None
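
The gate above fails open: any exception during the lookup lets the write through. A minimal sketch of the same shape with the failure made visible in logs is shown below; `gate_tool_call`, `check`, and the logger name are hypothetical and only illustrate the pattern, they are not part of the diff.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("quota_gate_sketch")

# Only tools that create or grow files are gated; reads pass straight through.
WRITE_TOOLS = frozenset({"write", "edit"})


def gate_tool_call(name: str, check: Callable[[], Optional[str]]) -> Optional[str]:
    """Return an error message if the call must be refused, else None.

    `check` is any callable that returns a refusal message or None.
    A failing check is logged and treated as "allow" (fail-open).
    """
    if name not in WRITE_TOOLS:
        return None
    try:
        return check()
    except Exception:
        log.warning("disk quota check failed; allowing %s", name, exc_info=True)
        return None
```

Whether fail-open or fail-closed is right depends on the deployment phase; the commit accepts fail-open while OS-level quota is still planned as the hard backstop.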


@ -36,16 +36,19 @@ _pool: Optional[SandboxPool] = None
 def init_pool(
-    user_root_base: Path, repo_root: Optional[Path] = None
+    user_root_base: Path,
+    repo_root: Optional[Path] = None,
+    sandbox_cfg: Optional[dict] = None,
 ) -> SandboxPool:
     """Idempotently initialize the module-level pool. Returns the pool instance.
     Called once by the lifespan; ensure_network is also idempotent internally, and repeated calls return the same instance (no re-creation).
     `repo_root` is used for the read-only mount of SKILL references once the fs tools run in the container (see pool.py).
+    `sandbox_cfg` is the `sandbox` section of agent.yaml, containing memory/cpus/pids_limit.
     """
     global _pool
     if _pool is None:
-        _pool = setup_pool(user_root_base, repo_root=repo_root)
+        _pool = setup_pool(user_root_base, repo_root=repo_root, sandbox_cfg=sandbox_cfg)
     return _pool


@ -88,7 +88,7 @@ def check_network_present() -> bool:
return True
_warn(
f"network missing: {NETWORK_NAME} -- lifespan 启动会自动 ensure;"
f"或手动 `docker network create --internal {NETWORK_NAME}`"
f"或手动 `docker network create {NETWORK_NAME}`"
)
return True # warn 不算失败
@ -225,6 +225,29 @@ CHECK_NAMES = [
 ]
+def _print_sandbox_resources() -> None:
+    """Print the effective values of yaml `sandbox.*` and the quota section so ops can reconcile at a glance."""
+    try:
+        from core.agent_builder import load_config
+        from .pool import DEFAULT_CPUS, DEFAULT_MEMORY, DEFAULT_PIDS_LIMIT
+        cfg = load_config() or {}
+        sb = cfg.get("sandbox") or {}
+        quotas = cfg.get("quotas") or {}
+        # env takes priority, same resolution logic as the SandboxPool ctor
+        mem = os.getenv("ZCBOT_SANDBOX_MEMORY") or sb.get("memory") or DEFAULT_MEMORY
+        cpus = os.getenv("ZCBOT_SANDBOX_CPUS") or str(sb.get("cpus") or DEFAULT_CPUS)
+        pids = os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT") or str(
+            sb.get("pids_limit") or DEFAULT_PIDS_LIMIT
+        )
+        disk = quotas.get("disk_bytes_per_user", "<unset>")
+        print(f"[info] sandbox.memory = {mem}")
+        print(f"[info] sandbox.cpus = {cpus}")
+        print(f"[info] sandbox.pids_limit = {pids}")
+        print(f"[info] quotas.disk_bytes_per_user = {disk}")
+    except Exception as e:
+        print(f"[warn] cannot read sandbox config: {type(e).__name__}: {e}")
 def run_sandbox_check() -> int:
     """Run all probes and return the exit code (0 = all ok or warnings only; 1 = at least one err).
@ -236,6 +259,8 @@ def run_sandbox_check() -> int:
     `core.sandbox.check.check_xxx` takes effect for this function
     """
     print("--- sandbox deployment check ---\n")
+    _print_sandbox_resources()
+    print()
     ok_count = 0
     module = sys.modules[__name__]
     for label, fn_name in CHECK_NAMES:


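@ -1,35 +1,50 @
 """Sandbox Docker network management.
-`zcbot-sandbox-net` is an `--internal` bridge:
-- no outbound by default (the Docker bridge removes the host NAT route)
-- containers on the same network are isolated from each other by default (default Docker bridge behavior, which also holds for internal)
+`zcbot-sandbox-net` is a docker bridge that **has outbound NAT by default** (routed through the host bridge).
+Sandbox containers all attach to this net + the iptables OUTPUT red-line ranges are DROPped (init.sh), blocking cloud metadata /
+loopback / intranet / PG IPs.
-`--internal` has been used since Step 2, with the iptables OUTPUT blocklist (the one in init.sh) as defense-in-depth
-(the network layer already blocks everything; iptables rules are still added per the §7.5 #1 protocol, and any missing rule means the deployment is incomplete).
+**dogfood phase** (current): containers can reach the public internet (so the model can `pip install` / `curl` public domains);
+iptables still blocks the intranet + cloud metadata.
-When Step 4 introduces the egress proxy: the proxy container attaches to the same `zcbot-sandbox-net` (keeping connectivity from the internal network to the proxy container),
-and the proxy container then reaches the internet via the host's default network. The sandbox containers' env `HTTP_PROXY` points to
-the proxy container name + an iptables ACCEPT exception, achieving "default deny + only via the proxy".
+**When opening to external users** (§7.7 Stage C Step 4, DESIGN §7.5 #2):
+the network switches to `--internal` (outbound fully disabled) + the zcbot-proxy container attaches to this net + the sandbox
+containers' env `HTTP_PROXY` points to the proxy + the proxy does allowlist / byte metering / audit. At that point,
+switching the network from bridge to internal requires a manual rm + recreate (stop all running containers first).
-Operations are idempotent: create probes with inspect and returns immediately if the network already exists.
+Operations are idempotent: create probes with inspect and returns immediately if the network already exists; if it exists but Internal=true (left over from
+the previous version), only a warn is printed; it is not removed automatically, to avoid breaking containers still attached (see the "Sandbox
+network migration" section of RUN.md).
 """
 from __future__ import annotations
+import json
 import subprocess
 NETWORK_NAME = "zcbot-sandbox-net"
 def ensure_network() -> None:
-    """Create `zcbot-sandbox-net` (if it does not exist). Raises on failure."""
+    """Create `zcbot-sandbox-net` (if it does not exist); if it exists with Internal=True, only warn. Raises on failure."""
     inspect = subprocess.run(
         ["docker", "network", "inspect", NETWORK_NAME],
         capture_output=True, text=True,
     )
     if inspect.returncode == 0:
+        # Already exists: check the Internal attribute and print a migration hint if it is true
+        try:
+            data = json.loads(inspect.stdout)
+            if data and isinstance(data, list) and data[0].get("Internal") is True:
+                print(
+                    f"[warn] network {NETWORK_NAME} is --internal (legacy);"
+                    f" sandbox 容器将无法 outbound。手动 `docker network rm {NETWORK_NAME}`"
+                    f" 后重启 web,会自动 recreate 为非 internal(详 RUN.md)"
+                )
+        except (json.JSONDecodeError, IndexError, AttributeError):
+            pass
         return
     r = subprocess.run(
-        ["docker", "network", "create", "--internal", NETWORK_NAME],
+        ["docker", "network", "create", NETWORK_NAME],
         capture_output=True, text=True,
     )
     if r.returncode != 0: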
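
The legacy-network detection in this hunk boils down to reading the `Internal` field from the JSON array that `docker network inspect` prints. A standalone sketch of just that parsing step, with the hypothetical name `network_is_internal` and no docker call, might be:

```python
import json


def network_is_internal(inspect_stdout: str) -> bool:
    """Return True only if the inspect output clearly says Internal=true.

    Malformed or unexpected output is treated as "not internal", which
    mirrors the warn-only behavior: nothing is ever removed on a guess.
    """
    try:
        data = json.loads(inspect_stdout)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list) or not data or not isinstance(data[0], dict):
        return False
    return data[0].get("Internal") is True
```

Keeping the parser separate from the subprocess call makes it easy to unit test with canned strings.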


@ -31,7 +31,7 @@ import subprocess
 import threading
 import time
 from pathlib import Path
-from typing import Dict, List, Optional
+from typing import Any, Dict, List, Optional
 from uuid import UUID
 from .network import NETWORK_NAME, ensure_network
@ -45,6 +45,11 @@ LABEL_USER_ID_KEY = "zcbot.user_id"
 DEFAULT_IMAGE = "zcbot-sandbox:latest"
 DEFAULT_IDLE_TTL_SECONDS = 300
+# Default container resource limits (overridable via yaml `sandbox.*` / env; see the SandboxPool ctor)
+DEFAULT_MEMORY = "2g"
+DEFAULT_CPUS = "1.0"
+DEFAULT_PIDS_LIMIT = 256
 def container_name(user_id: UUID) -> str:
     return f"{CONTAINER_NAME_PREFIX}{user_id}"
@ -81,6 +86,9 @@ class SandboxPool:
         runtime: Optional[str] = None,
         idle_ttl: Optional[int] = None,
         pg_ips: Optional[str] = None,
+        memory: Optional[str] = None,
+        cpus: Optional[str] = None,
+        pids_limit: Optional[int] = None,
     ) -> None:
         """
         user_root_base: parent directory of the per-user subtrees, typically `<workspace>/users`; bind mount
@ -98,6 +106,10 @@ class SandboxPool:
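             (env `ZCBOT_SANDBOX_IDLE_TTL`, default 300)
         pg_ips: comma-separated list of PG IPs, passed into the container as the `ZCBOT_PG_IPS` env; init.sh adds DROP rules
             (env `ZCBOT_PG_IPS`). Defense-in-depth, even if they fall within the three private ranges
+        memory/cpus/pids_limit:
+            container resource limits, default 2g/1.0/256; env (`ZCBOT_SANDBOX_MEMORY` etc.)
+            overrides the caller's arguments, which override the defaults. Restart web after a change; newly started
+            containers use the new values, running ones are unchanged (after the 5 min idle reclaim the next start uses the new values)
         """
         self.user_root_base = user_root_base
         self.repo_root = repo_root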
@ -107,6 +119,13 @@ class SandboxPool:
             os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
         )
         self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
+        # Resource limits: env > caller > default
+        self.memory = os.getenv("ZCBOT_SANDBOX_MEMORY") or memory or DEFAULT_MEMORY
+        self.cpus = os.getenv("ZCBOT_SANDBOX_CPUS") or cpus or DEFAULT_CPUS
+        self.pids_limit = int(
+            os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT")
+            or (pids_limit if pids_limit is not None else DEFAULT_PIDS_LIMIT)
+        )
         self._dict_lock = threading.Lock()  # guards dict-level races on _locks / _last_active
         self._locks: Dict[UUID, threading.Lock] = {}
         self._last_active: Dict[UUID, int] = {}
@ -151,9 +170,9 @@ class SandboxPool:
             "--cap-drop=ALL",  # drop everything by default
             "--cap-add=NET_ADMIN",  # init.sh needs it to configure iptables; the exec'd uid 1000 cannot obtain it
             "--security-opt=no-new-privileges",
-            "--pids-limit=256",
-            "--memory=2g",
-            "--cpus=1.0",
+            f"--pids-limit={self.pids_limit}",
+            f"--memory={self.memory}",
+            f"--cpus={self.cpus}",
             "-v", f"{user_root}:/workspace",
             "-e", f"ZCBOT_PG_IPS={self.pg_ips}",
             "--restart=no",
@ -219,15 +238,28 @@ class SandboxPool:
 def setup_pool(
-    user_root_base: Path, repo_root: Optional[Path] = None
+    user_root_base: Path,
+    repo_root: Optional[Path] = None,
+    sandbox_cfg: Optional[Dict[str, object]] = None,
 ) -> SandboxPool:
     """Convenience entry point for app startup: ensure the network exists + return a pool instance.
+    `sandbox_cfg` is the `sandbox` section of agent.yaml (a dict) with memory/cpus/pids_limit;
+    when omitted, env / defaults apply. env can still override independently (priority is handled in the SandboxPool ctor).
     Typical usage (lifespan startup hook):
         from core.paths import ROOT
-        pool = setup_pool(workspace / "users", repo_root=ROOT)
+        cfg = load_config()
+        pool = setup_pool(workspace / "users", repo_root=ROOT,
+                          sandbox_cfg=cfg.get("sandbox") or {})
         pool.shutdown_all()  # clear orphans left by the previous process
         # the background reaper task periodically runs pool.reap_idle()
     """
     ensure_network()
-    return SandboxPool(user_root_base=user_root_base, repo_root=repo_root)
+    cfg = sandbox_cfg or {}
+    return SandboxPool(
+        user_root_base=user_root_base,
+        repo_root=repo_root,
+        memory=cfg.get("memory") if isinstance(cfg.get("memory"), str) else None,
+        cpus=str(cfg["cpus"]) if cfg.get("cpus") is not None else None,
+        pids_limit=int(cfg["pids_limit"]) if cfg.get("pids_limit") is not None else None,
+    )
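
In `setup_pool`, a `memory` value that is not a string (for example a bare yaml integer) is dropped silently and the default applies. The sketch below shows a stricter alternative that surfaces the problem instead; `coerce_sandbox_cfg` is a hypothetical helper for illustration, not something the commit contains.

```python
from typing import Any, Dict, Mapping, Optional


def coerce_sandbox_cfg(cfg: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Normalize a yaml `sandbox` mapping into SandboxPool-style keyword values.

    Hypothetical stricter variant: a non-string `memory` raises TypeError
    rather than being ignored, so a typo in yaml is noticed at startup.
    """
    cfg = cfg or {}
    out: Dict[str, Any] = {"memory": None, "cpus": None, "pids_limit": None}
    memory = cfg.get("memory")
    if memory is not None:
        if not isinstance(memory, str):
            raise TypeError(f"sandbox.memory must be a string like '2g', got {memory!r}")
        out["memory"] = memory
    if cfg.get("cpus") is not None:
        out["cpus"] = str(cfg["cpus"])          # yaml floats such as 1.0 become "1.0"
    if cfg.get("pids_limit") is not None:
        out["pids_limit"] = int(cfg["pids_limit"])
    return out
```

Which behavior is preferable is a design choice: the silent fallback keeps startup robust, while raising makes misconfiguration obvious.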

202
core/storage/disk_quota.py Normal file

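@ -0,0 +1,202 @
"""Per-user working-directory quota (§7.5 #4 soft quota, application-layer gate).

Entry points:
- `scan_user_dir(user_root) -> (bytes, count)`: sums file sizes, skipping the top-level
  `.zcbot_tmp` / `.memory` and entries whose stat fails
- `upsert_user_usage(user_id, bytes, count)`: writes the user_disk_usage table
- `check_disk_quota(user_id, limit_bytes) -> Optional[str]`: pre-write lookup, None = allow /
  str = reason for refusal. `limit_bytes <= 0` short-circuits to allow (unlimited)
- `scan_all_users(user_root_base)`: run periodically by the lifespan background task,
  one user at a time, to avoid an IO storm

Byte-unit parsing (yaml `disk_bytes_per_user`):
- integer bytes / "5gb" / "500mb" / "1.5g", case-insensitive suffix
- returns None on failure; the caller treats that as unlimited
"""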
from __future__ import annotations

import os
import re
from pathlib import Path
from typing import Iterable, List, Optional, Tuple
from uuid import UUID

from sqlalchemy import select
from sqlalchemy.dialects.postgresql import insert as pg_insert

from .engine import session_scope
from .models import UserDiskUsage

# yaml byte parsing: 5gb / 500mb / 1024 / 1.5g
_SIZE_RE = re.compile(r"^\s*([\d.]+)\s*([kmgt]?b?)?\s*$", re.IGNORECASE)
_UNIT_FACTORS = {
    "": 1, "b": 1,
    "k": 1024, "kb": 1024,
    "m": 1024 ** 2, "mb": 1024 ** 2,
    "g": 1024 ** 3, "gb": 1024 ** 3,
    "t": 1024 ** 4, "tb": 1024 ** 4,
}


def parse_bytes(value) -> Optional[int]:
    """Convert a yaml byte value to int; return None if it cannot be parsed."""
    if value is None:
        return None
    if isinstance(value, int):
        return value
    if not isinstance(value, str):
        return None
    m = _SIZE_RE.match(value)
    if not m:
        return None
    num_s, unit_s = m.group(1), (m.group(2) or "").lower()
    factor = _UNIT_FACTORS.get(unit_s)
    if factor is None:
        return None
    try:
        return int(float(num_s) * factor)
    except ValueError:
        return None


# Top-level dotfile names skipped by the scan (saves IO; the /v1/files API hides them too)
_SKIP_TOPLEVEL = frozenset({".zcbot_tmp", ".memory"})


def scan_user_dir(user_root: Path) -> Tuple[int, int]:
    """Sum the sizes of all files under user_root (os.scandir at the top level, os.walk below); return (bytes, count).
    Skips the top-level .zcbot_tmp / .memory (dev-time temp files + user memory dotfiles, not counted toward the product quota);
    follow_symlinks=False guards against symlink loops.
    """
    if not user_root.exists() or not user_root.is_dir():
        return 0, 0
    total_bytes = 0
    total_count = 0
    try:
        for entry in os.scandir(user_root):
            if entry.name in _SKIP_TOPLEVEL:
                continue
            try:
                if entry.is_file(follow_symlinks=False):
                    try:
                        total_bytes += entry.stat(follow_symlinks=False).st_size
                        total_count += 1
                    except OSError:
                        pass
                elif entry.is_dir(follow_symlinks=False):
                    sub_b, sub_c = _walk_dir(Path(entry.path))
                    total_bytes += sub_b
                    total_count += sub_c
            except OSError:
                pass
    except OSError:
        pass
    return total_bytes, total_count


def _walk_dir(d: Path) -> Tuple[int, int]:
    total_b, total_c = 0, 0
    for root, dirs, files in os.walk(d, followlinks=False, onerror=lambda _e: None):
        for f in files:
            try:
                st = os.stat(os.path.join(root, f), follow_symlinks=False)
                total_b += st.st_size
                total_c += 1
            except OSError:
                pass
    return total_b, total_c


def upsert_user_usage(user_id: UUID, bytes_used: int, file_count: int) -> None:
    """Write the single user_disk_usage row: INSERT the first time, UPDATE afterwards."""
    from sqlalchemy import func
    with session_scope() as s:
        stmt = pg_insert(UserDiskUsage).values(
            user_id=user_id,
            bytes_used=bytes_used,
            file_count=file_count,
        ).on_conflict_do_update(
            index_elements=["user_id"],
            set_={
                "bytes_used": bytes_used,
                "file_count": file_count,
                "scanned_at": func.now(),
            },
        )
        s.execute(stmt)


def get_user_usage(user_id: UUID) -> Optional[Tuple[int, int]]:
    """Read the latest scan result (bytes, count); return None if there is no record."""
    with session_scope() as s:
        row = s.execute(
            select(UserDiskUsage.bytes_used, UserDiskUsage.file_count)
            .where(UserDiskUsage.user_id == user_id)
        ).first()
        if row is None:
            return None
        return int(row[0]), int(row[1])
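

def check_disk_quota(user_id: UUID, limit_bytes: int) -> Optional[str]:
    """Pre-write gate: return an error message when over quota (read directly by the LLM); return None to allow.
    `limit_bytes <= 0` short-circuits to allow (unlimited). With no scan record (before the first scan) the write is allowed,
    so that a cold start does not block every write; the gate takes effect once a periodic scan (15 min) has recorded usage.
    """
    if limit_bytes <= 0:
        return None
    usage = get_user_usage(user_id)
    if usage is None:
        return None  # no record yet: allow; the gate applies after the first scan
    used, _ = usage
    if used >= limit_bytes:
        used_mb = used / (1024 ** 2)
        limit_mb = limit_bytes / (1024 ** 2)
        return (
            f"[Error] 已达磁盘配额上限({used_mb:.1f} MB / {limit_mb:.1f} MB);"
            f"清理旧产物或联系管理员升配后重试"
        )
    return None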
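

def list_user_ids_with_root(user_root_base: Path) -> List[UUID]:
    """Scan the subdirectories of user_root_base and return the valid UUIDs (= users that have a workspace subdirectory).
    It does not read the DB users table: some users may never have sent a message (no workspace directory) and use no disk,
    so no placeholder row needs to be upserted.
    """
    if not user_root_base.is_dir():
        return []
    out: List[UUID] = []
    try:
        for entry in os.scandir(user_root_base):
            if not entry.is_dir(follow_symlinks=False):
                continue
            try:
                out.append(UUID(entry.name))
            except ValueError:
                continue
    except OSError:
        pass
    return out


def scan_all_users(user_root_base: Path) -> int:
    """Scan all users and persist the results; return the number of users scanned. Called by the lifespan background task.
    Serial (one user finishes before the next starts) to avoid an IO storm; a single user takes a few seconds (at the few-hundred-MB scale),
    so the total time for N users is linear. Failed users are skipped silently and retried in the next cycle.
    """
    count = 0
    for uid in list_user_ids_with_root(user_root_base):
        try:
            b, c = scan_user_dir(user_root_base / str(uid))
            upsert_user_usage(uid, b, c)
            count += 1
        except Exception:
            # One user's scan failing does not block the others; it is retried next cycle. Logging is injected by the caller.
            pass
    return count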
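
The top-level skip rule used by `scan_user_dir` can also be expressed as a single `os.walk` pass that prunes the skipped names only on the first iteration. This is an alternative sketch for illustration, not the module's implementation; `scan_dir` is a hypothetical name, and unlike the original it also counts a top-level symlink's own size.

```python
import os
from pathlib import Path
from typing import Tuple

SKIP_TOPLEVEL = frozenset({".zcbot_tmp", ".memory"})


def scan_dir(root: Path) -> Tuple[int, int]:
    """Return (total_bytes, file_count) under root, skipping SKIP_TOPLEVEL at the top level only."""
    total_bytes = 0
    total_count = 0
    at_top = True
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        if at_top:
            # Editing dirnames in place stops os.walk from descending into them.
            dirnames[:] = [d for d in dirnames if d not in SKIP_TOPLEVEL]
            filenames = [f for f in filenames if f not in SKIP_TOPLEVEL]
            at_top = False
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable: ignore it
            total_bytes += st.st_size
            total_count += 1
    return total_bytes, total_count
```

A nested directory that happens to be called `.memory` deeper in the tree is still counted, matching the "top-level only" wording of the commit.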


@ -20,6 +20,7 @@ from typing import Any, Optional
 from uuid import UUID, uuid4
 from sqlalchemy import (
+    BigInteger,
     DateTime,
     ForeignKey,
     Integer,
@ -137,3 +138,28 @@ class UsageEvent(Base):
)
+class UserDiskUsage(Base):
+    """Snapshot of per-user working-directory byte usage (0008, §7.5 #4 soft-quota table).
+    One upserted row per user_id; the lifespan background task scans user_root periodically (default 15 min) and persists the result.
+    The write gate (DockerExecutor / /v1/files/upload) looks up this table and compares it with yaml `quotas.disk_bytes_per_user`;
+    over quota returns [Error] and blocks.
+    Writes between scans can exceed the limit slightly (race-tolerant, accepted like the image/video quotas); before opening to
+    external users, OS-level xfs prjquota enforces the true limit. See DESIGN §7.5 #4 / PROGRESS.
+    """
+    __tablename__ = "user_disk_usage"
+    user_id: Mapped[UUID] = mapped_column(
+        PG_UUID(as_uuid=True),
+        ForeignKey("users.user_id", ondelete="CASCADE"),
+        primary_key=True,
+    )
+    bytes_used: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)
+    file_count: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
+    scanned_at: Mapped[datetime] = mapped_column(
+        DateTime(timezone=True), server_default=func.now(), nullable=False
+    )


@ -0,0 +1,44 @
"""user_disk_usage table (§7.5 #4 soft quota).

Revision ID: 0008
Revises: 0007
Create Date: 2026-05-27

Snapshot of per-user working-directory byte usage; the lifespan background task scans user_root periodically (default 15 min) and persists it.
The write gate (DockerExecutor / /v1/files/upload) looks up this table and compares it with yaml `quotas.disk_bytes_per_user`;
over quota returns [Error] and blocks.
Slight overshoot from writes between scans is accepted (race-tolerant, consistent with the image/video quotas); before opening to external users,
OS-level xfs prjquota enforces the true limit (§7.5 #4).
"""
from typing import Sequence, Union

import sqlalchemy as sa
from alembic import op

revision: str = "0008"
down_revision: Union[str, None] = "0007"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.create_table(
        "user_disk_usage",
        sa.Column("user_id", sa.UUID(as_uuid=True), nullable=False),
        sa.Column("bytes_used", sa.BigInteger(), nullable=False,
                  server_default=sa.text("0")),
        sa.Column("file_count", sa.Integer(), nullable=False,
                  server_default=sa.text("0")),
        sa.Column("scanned_at", sa.DateTime(timezone=True), nullable=False,
                  server_default=sa.func.now()),
        sa.PrimaryKeyConstraint("user_id"),
        sa.ForeignKeyConstraint(
            ["user_id"], ["users.user_id"], ondelete="CASCADE",
        ),
    )


def downgrade() -> None:
    op.drop_table("user_disk_usage")
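
The one-row-per-user upsert that this table supports can be tried out without PostgreSQL. The sketch below swaps in the standard-library `sqlite3` module purely for illustration (SQLite 3.24 or newer is assumed for `ON CONFLICT ... DO UPDATE`); the project itself uses the PostgreSQL dialect through SQLAlchemy, and the simplified column types here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_disk_usage (
        user_id    TEXT PRIMARY KEY,
        bytes_used INTEGER NOT NULL DEFAULT 0,
        file_count INTEGER NOT NULL DEFAULT 0,
        scanned_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)

UPSERT_SQL = """
INSERT INTO user_disk_usage (user_id, bytes_used, file_count)
VALUES (?, ?, ?)
ON CONFLICT(user_id) DO UPDATE SET
    bytes_used = excluded.bytes_used,
    file_count = excluded.file_count,
    scanned_at = CURRENT_TIMESTAMP
"""


def upsert(user_id: str, bytes_used: int, file_count: int) -> None:
    """Insert the user's row on the first scan, overwrite it on later scans."""
    conn.execute(UPSERT_SQL, (user_id, bytes_used, file_count))
```

Because the primary key is the user id, repeated scans never grow the table; each scan replaces the previous snapshot.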


@ -61,6 +61,13 @@ RUN pip install --no-cache-dir \
    -r /tmp/requirements.txt \
    && rm /tmp/requirements.txt
# Persist the pip index to /etc/pip.conf so that a runtime `pip install foo` issued by the
# model also goes through the mirror, not only build-time installs. Both the zcbot user and
# root pick up this global config.
RUN printf '[global]\nindex-url = %s\ntimeout = 60\n%s\n' \
    "${PIP_INDEX_URL}" \
    "${PIP_TRUSTED_HOST:+trusted-host = ${PIP_TRUSTED_HOST}}" \
    > /etc/pip.conf
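# Node + mermaid-cli + Chromium: required by the proposal / patent skills to render mermaid diagrams.
# The image grows by roughly 400MB; accepted cost (the ASCII fallback yields a docx without diagrams, which is unusable).
# Debian bookworm ships nodejs 18.x + chromium, recent enough; skipping the NodeSource repo saves one external network fetch.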
@ -82,6 +89,10 @@ RUN npm config set registry ${NPM_REGISTRY} \
    && npm install -g @mermaid-js/mermaid-cli@latest \
    && npm cache clean --force
# Persist the npm registry to /etc/npmrc so that a runtime `npm install bar` issued by the
# model also goes through the mirror, not only build-time installs. npm run as the zcbot user
# also reads this global config (precedence: project > user > global).
RUN printf 'registry=%s\n' "${NPM_REGISTRY}" > /etc/npmrc
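# Flags puppeteer needs to launch chromium inside the container: no-sandbox (the container is
# already hardened, so chromium's own setuid sandbox layer is unnecessary), disable-setuid-sandbox
# (same reason), disable-dev-shm-usage (container /dev/shm defaults to 64MB, too small for
# chromium, so it uses /tmp instead)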
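
The `printf` template for /etc/pip.conf relies on the shell's `${VAR:+word}` expansion, which yields `word` only when `VAR` is set and non-empty. A small Python sketch of the file it would produce: `render_pip_conf` is a hypothetical helper written for illustration.

```python
def render_pip_conf(index_url: str, trusted_host: str = "") -> str:
    """Mirror the printf template: the trusted-host line appears only when set."""
    # Equivalent of ${PIP_TRUSTED_HOST:+trusted-host = ${PIP_TRUSTED_HOST}}:
    # an empty or unset value expands to nothing, leaving a blank line.
    extra = f"trusted-host = {trusted_host}" if trusted_host else ""
    return f"[global]\nindex-url = {index_url}\ntimeout = 60\n{extra}\n"
```

With no trusted host the file ends in an empty line instead of a `trusted-host` entry.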

tests/test_disk_quota.py Normal file

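@ -0,0 +1,85 @@
"""Unit tests for disk_quota.py.
No real DB connection: parse_bytes / scan_user_dir / dotfile skipping / nonexistent paths
are pure Python filesystem operations and can be unit-tested. upsert / check_disk_quota
need a DB and are skipped here (covered by integration tests).
"""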
from __future__ import annotations

import sys
import tempfile
import unittest
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from core.storage.disk_quota import parse_bytes, scan_user_dir


class TestParseBytes(unittest.TestCase):
    def test_int_passthrough(self):
        self.assertEqual(parse_bytes(1024), 1024)
        self.assertEqual(parse_bytes(0), 0)

    def test_gb(self):
        self.assertEqual(parse_bytes("5gb"), 5 * 1024 ** 3)
        self.assertEqual(parse_bytes("5g"), 5 * 1024 ** 3)
        self.assertEqual(parse_bytes("5GB"), 5 * 1024 ** 3)

    def test_mb(self):
        self.assertEqual(parse_bytes("500mb"), 500 * 1024 ** 2)
        self.assertEqual(parse_bytes("500m"), 500 * 1024 ** 2)

    def test_kb(self):
        self.assertEqual(parse_bytes("1kb"), 1024)

    def test_bytes(self):
        self.assertEqual(parse_bytes("1024b"), 1024)
        self.assertEqual(parse_bytes("1024"), 1024)

    def test_float_suffix(self):
        self.assertEqual(parse_bytes("1.5gb"), int(1.5 * 1024 ** 3))

    def test_invalid(self):
        self.assertIsNone(parse_bytes(""))
        self.assertIsNone(parse_bytes("xxx"))
        self.assertIsNone(parse_bytes("1.2.3"))
        self.assertIsNone(parse_bytes(None))


class TestScanUserDir(unittest.TestCase):
    def test_empty_dir(self):
        with tempfile.TemporaryDirectory() as d:
            b, c = scan_user_dir(Path(d))
            self.assertEqual((b, c), (0, 0))

    def test_nonexistent(self):
        b, c = scan_user_dir(Path("/nonexistent/path/xxx"))
        self.assertEqual((b, c), (0, 0))

    def test_count_and_size(self):
        with tempfile.TemporaryDirectory() as d:
            root = Path(d)
            (root / "a.txt").write_bytes(b"hello")  # 5
            (root / "sub").mkdir()
            (root / "sub" / "b.txt").write_bytes(b"world!")  # 6
            (root / "sub" / "c.txt").write_bytes(b"x" * 1000)  # 1000
            b, c = scan_user_dir(root)
            self.assertEqual(b, 1011)
            self.assertEqual(c, 3)

    def test_skip_dotfile_toplevel(self):
        """Top-level .zcbot_tmp / .memory are skipped (dev-time temp files + user memory, not counted toward quota)."""
        with tempfile.TemporaryDirectory() as d:
            root = Path(d)
            (root / "a.txt").write_bytes(b"counted")  # 7
            (root / ".zcbot_tmp").mkdir()
            (root / ".zcbot_tmp" / "skipped.py").write_bytes(b"x" * 99999)
            (root / ".memory").mkdir()
            (root / ".memory" / "core.md").write_bytes(b"x" * 99999)
            b, c = scan_user_dir(root)
            self.assertEqual(b, 7)
            self.assertEqual(c, 1)


if __name__ == "__main__":
    unittest.main()
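
core/storage/disk_quota.py itself is not part of this excerpt. As a reading aid, here is one possible shape of `parse_bytes` and `scan_user_dir` that would satisfy the tests above. This is a sketch under assumptions, not the project's implementation: the unit table, the regex, and the choice not to follow symlinks are all guesses.

```python
import os
import re
from pathlib import Path
from typing import Optional, Tuple, Union

# Binary multiples, matching the tests' expectation that "5gb" == 5 * 1024 ** 3.
_UNITS = {"": 1, "b": 1, "k": 1024, "kb": 1024,
          "m": 1024 ** 2, "mb": 1024 ** 2, "g": 1024 ** 3, "gb": 1024 ** 3}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*$")
_SKIP_TOPLEVEL = {".zcbot_tmp", ".memory"}


def parse_bytes(value: Union[int, str, None]) -> Optional[int]:
    """Parse an int or a string such as "5gb" / "1.5GB" / "1024"; None when invalid."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if not isinstance(value, str):
        return None
    m = _SIZE_RE.match(value.lower())
    if m is None or m.group(2) not in _UNITS:
        return None
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def scan_user_dir(root: Path) -> Tuple[int, int]:
    """Return (total_bytes, file_count), skipping the top-level excluded dirs."""
    if not root.is_dir():
        return 0, 0
    total_bytes = 0
    file_count = 0
    stack = [(str(root), True)]  # (directory path, is it the top level?)
    while stack:
        current, is_top = stack.pop()
        try:
            with os.scandir(current) as it:
                entries = list(it)
        except OSError:
            continue  # directory vanished or is unreadable; skip it
        for entry in entries:
            if is_top and entry.name in _SKIP_TOPLEVEL:
                continue
            try:
                if entry.is_dir(follow_symlinks=False):
                    stack.append((entry.path, False))
                elif entry.is_file(follow_symlinks=False):
                    total_bytes += entry.stat(follow_symlinks=False).st_size
                    file_count += 1
            except OSError:
                continue  # file removed mid-scan
    return total_bytes, file_count
```

The exclusion applies only at the top level, so a nested directory that happens to be named `.memory` would still be counted in this sketch.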


@ -501,6 +501,38 @@ def create_app() -> FastAPI:
if result.rowcount:
    print(f"[startup] reaped {result.rowcount} stale active run(s)")
# Disk quota background scan (§7.5 #4 application-layer gate): does not depend on the docker
# backend; it also runs on the host backend (/v1/files/upload goes through the quota gate too).
# yaml `quotas.disk_scan_interval_seconds` defaults to 900s = 15min; limit_bytes ≤ 0 means
# unlimited, in which case the scan still runs (usage stats remain useful) and check
# short-circuits to allow.
from core.agent_builder import resolve_workspace
from core.storage.disk_quota import parse_bytes, scan_all_users
workspace = resolve_workspace(None, _cfg)
disk_user_root = workspace / "users"
quotas_cfg = _cfg.get("quotas") or {}
disk_scan_interval = int(quotas_cfg.get("disk_scan_interval_seconds") or 900)

async def _disk_scanner() -> None:
    loop = asyncio.get_running_loop()
    # Run once at startup, then once per interval. check can only hit after the
    # first scan has completed.
    try:
        n = await loop.run_in_executor(None, scan_all_users, disk_user_root)
        if n:
            print(f"[disk_scanner] initial scan: {n} user(s)")
    except Exception as e:
        print(f"[disk_scanner] initial scan error: {type(e).__name__}: {e}")
    while True:
        try:
            await asyncio.sleep(disk_scan_interval)
            n = await loop.run_in_executor(None, scan_all_users, disk_user_root)
            if n:
                print(f"[disk_scanner] scanned {n} user(s)")
        except asyncio.CancelledError:
            raise
        except Exception as e:
            print(f"[disk_scanner] error: {type(e).__name__}: {e}")

disk_scanner_task = asyncio.create_task(_disk_scanner(), name="disk-scanner")
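# Sandbox pool (§7.5): enabled only when ZCBOT_SANDBOX_BACKEND=docker.
# Startup hooks: (1) init_pool (creates the docker network + pool instance); (2) shutdown_all
# sweeps orphans left by the predecessor (zcbot-sandbox-* containers from the previous process;
# the in-memory _last_active is empty,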
@ -523,7 +555,11 @@ def create_app() -> FastAPI:
 try:
     # repo_root=ROOT lets SandboxPool mount <repo>/skills read-only into the container
     # (needed to read SKILL references once the fs tools run inside the container)
-    pool = init_pool(user_root_base, repo_root=ROOT)
+    # sandbox_cfg = the yaml `sandbox` section (memory/cpus/pids_limit are tunable)
+    pool = init_pool(
+        user_root_base, repo_root=ROOT,
+        sandbox_cfg=_cfg.get("sandbox") or {},
+    )
     removed = pool.shutdown_all()
     if removed:
         print(f"[startup] swept {len(removed)} stale sandbox container(s)")
@ -552,6 +588,11 @@ def create_app() -> FastAPI:
try:
    yield
finally:
    disk_scanner_task.cancel()
    try:
        await disk_scanner_task
    except (asyncio.CancelledError, Exception):
        pass
    if sandbox_reaper_task is not None:
        sandbox_reaper_task.cancel()
        try:
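
The scanner's lifecycle in the hunks above (run once, repeat on an interval, cancel and await at shutdown) can be exercised in isolation. The following is a self-contained sketch of that pattern, not the application code: `periodic`, `demo`, and the constant job returning 3 are hypothetical, and the interval is shortened so the demo finishes quickly.

```python
import asyncio
from typing import Callable, List


async def periodic(job: Callable[[], int], interval: float, log: List[str]) -> None:
    """Run job once immediately, then every `interval` seconds until cancelled."""
    loop = asyncio.get_running_loop()
    first = True
    while True:
        try:
            if not first:
                await asyncio.sleep(interval)
            first = False
            # The blocking job runs in the default thread pool, off the event loop.
            n = await loop.run_in_executor(None, job)
            log.append(f"scanned {n}")
        except asyncio.CancelledError:
            raise  # propagate cancellation so shutdown can complete
        except Exception as e:
            log.append(f"error: {type(e).__name__}: {e}")


async def demo() -> List[str]:
    log: List[str] = []
    task = asyncio.create_task(periodic(lambda: 3, 0.01, log), name="disk-scanner")
    await asyncio.sleep(0.2)  # let a few scans happen
    task.cancel()
    try:
        await task  # wait for the task to unwind
    except asyncio.CancelledError:
        pass
    return log
```

Re-raising `CancelledError` inside the loop matters: swallowing it there would keep the task alive and leave the shutdown `await` hanging.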
@ -1479,6 +1520,16 @@ def create_app() -> FastAPI:
If the path does not exist it is created automatically (mkdir(parents=True)); a file with the same name is overwritten.
Filenames are strictly validated (containing `/ \\ ..` or empty returns 400).
"""
# Disk quota gate (§7.5 #4): over quota returns 413 and blocks the upload, prompting the user to clean up old artifacts.
from core.agent_builder import load_config as _load_cfg
from core.storage.disk_quota import check_disk_quota, parse_bytes
_quotas_cfg = (_load_cfg().get("quotas") or {})
_limit = parse_bytes(_quotas_cfg.get("disk_bytes_per_user"))
if _limit is not None and _limit > 0:
    _err = check_disk_quota(user_id, _limit)
    if _err is not None:
        raise HTTPException(413, _err)
root = _load_user_root(user_id)
dest_dir = _safe_join(root, path)
if dest_dir.exists() and not dest_dir.is_dir():