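Stage C wrap-up: resource yaml + disk quota + network opening + in-container mirror persistence

During the dogfood + trusted-colleague allowlist phase, the full Step 4 egress proxy is deferred. The upgrade triggers are recorded: any unknown user registering, abnormal model outbound traffic, or someone not closely known appearing on the trusted allowlist means Step 4 must ship.

This batch has 3 parts:

(A) Container resources in yaml (tunable without a rebuild):
- agent.yaml gains a sandbox section (memory/cpus/pids_limit)
- SandboxPool ctor gains three fields; priority env > yaml > default (2g/1.0/256)
- setup_pool/init_pool pass sandbox_cfg through
- sandbox check output gains 4 [info] lines so ops can reconcile at a glance

(B) Application-level disk quota (§7.5 #4 soft quota):
- migration 0008 user_disk_usage, one row per user
- core/storage/disk_quota.py: parse_bytes("5gb"/int) + scan_user_dir (os.scandir, skipping top-level .zcbot_tmp / .memory) + upsert ON CONFLICT + check_disk_quota + serial scan_all_users
- lifespan _disk_scanner background task (runs once at startup, then every 15 min by default)
- DockerExecutor write/edit gate up front: over quota returns [Error] without calling the container
- /v1/files/upload uses the same gate: over quota returns HTTP 413
- yaml `quotas.disk_bytes_per_user: 5gb` + `disk_scan_interval_seconds: 900`
- accepted race: writes between scans can slightly exceed the limit (same race-tolerant model as the image/video quotas); OS-level xfs prjquota is the backstop before opening to external users
- 11 tests cover parse_bytes / scan / dotfile skipping

(C) Network opening + in-container mirror persistence:
- network.py drops the --internal flag; containers use the default docker bridge with NAT outbound
- an existing internal network is not removed automatically, only warned about; RUN.md gives the migration command (avoids breaking existing containers)
- iptables red-line ranges untouched (169.254/127/10/172.16/192.168/100.64/PG_IP DROP), blocking cloud metadata + intranet scanning + loopback; this baseline does not depend on the proxy
- Dockerfile adds /etc/pip.conf (global index-url + timeout 60) + /etc/npmrc (global registry), so runtime `pip install foo` / `npm install bar` by the model also go through the mirror (previously --build-arg only applied at build time)

unittest discover 46/46 PASS (35 existing + 11 new).

DESIGN unchanged (the deferral stays within the §7.7 Stage C phase semantics; the triggers are recorded in PROGRESS / RUN); RUN.md gains the env list + network migration + quota + 3 troubleshooting rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>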
parent 792366d9fc
commit eaf7f3ea1e
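The commit message keeps the iptables red-line ranges unchanged while opening outbound traffic. To illustrate which destinations that baseline is meant to cover, here is a minimal sketch using Python's `ipaddress` module. The real enforcement lives in iptables rules (init.sh), not in Python; the CIDR prefix lengths below are assumptions (the message lists only the leading octets), and `is_red_line` is a hypothetical helper that does not exist in the repo.

```python
import ipaddress
from typing import Iterable

# Assumed CIDR widths for the ranges named in the commit message.
RED_LINE_NETS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in (
        "169.254.0.0/16",  # link-local, includes cloud metadata endpoints
        "127.0.0.0/8",     # loopback
        "10.0.0.0/8",      # private
        "172.16.0.0/12",   # private
        "192.168.0.0/16",  # private
        "100.64.0.0/10",   # carrier-grade NAT space
    )
)


def is_red_line(dest: str, pg_ips: Iterable[str] = ()) -> bool:
    """Return True if `dest` falls in a red-line range or equals a PG IP."""
    addr = ipaddress.ip_address(dest)
    if any(addr in net for net in RED_LINE_NETS):
        return True
    return any(addr == ipaddress.ip_address(ip) for ip in pg_ips)
```

With these assumed ranges, a public mirror address passes while the metadata endpoint and any configured PG address do not, which is the behavior the dogfood phase relies on without a proxy.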
|
|
@ -2,7 +2,7 @@
|
||||||
|
|
||||||
> 配合 `DESIGN.md`。本文件只记 phase 状态、决策偏差、文件量、下一步。每条 1-2 句:做了啥 + 关键判断;细节查 `git log` / `git diff` / `DESIGN §7.9`。
|
> 配合 `DESIGN.md`。本文件只记 phase 状态、决策偏差、文件量、下一步。每条 1-2 句:做了啥 + 关键判断;细节查 `git log` / `git diff` / `DESIGN §7.9`。
|
||||||
|
|
||||||
最后更新:2026-05-26(Stage C Step 3d:fs 工具(read/write/edit/glob/grep)进容器 + DESIGN §7.5 #6 重写,物理边界替代代码护栏)
|
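Last updated: 2026-05-27 (Stage C wrap-up: container resources in yaml + disk quota (scan+gate) + dogfood network opening + in-container pip/npm mirror persistence; the full Step 4 egress proxy is deferred until before opening to external users)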
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -15,14 +15,15 @@
|
||||||
| 5 | Eval Suite | ⏸ 不做 | dogfooding 替代,probe 覆盖健康检查 |
|
| 5 | Eval Suite | ⏸ 不做 | dogfooding 替代,probe 覆盖健康检查 |
|
||||||
| 6 | 长任务工程化 | 🟡 | task + 恢复 ✅;双层记忆 ✅;context 压缩未做 |
|
| 6 | 长任务工程化 | 🟡 | task + 恢复 ✅;双层记忆 ✅;context 压缩未做 |
|
||||||
| 7 | 打磨 | ❌ | Docker 沙盒 / 更多 skill |
|
| 7 | 打磨 | ❌ | Docker 沙盒 / 更多 skill |
|
||||||
| §7 SaaS | DESIGN §7 路线 | 🟡 | A 事件流化 ✅;B 完工 ✅;D `/v1` JSON API ✅;D' 过渡 auth + dev SPA ✅;单活 run 锁 + cancel ✅;0004 schema 瘦身 ✅;入口归位 ✅;真 OIDC 待;**C Step 1-3 ✅(Executor 接口 + Docker 池 + DockerExecutor 集成 AgentLoop,`ZCBOT_SANDBOX_BACKEND=docker` 切容器)+ Step 5 部署前置对账 ✅(`main.py sandbox check` + lifespan fs quota WARN)**;Step 4 egress proxy + Step 3b PGID kill 协议待;**外部用户开放仍需 egress proxy + xfs project quota OS 层硬化(§7.5 落地清单 #2 #4)**。 |
|
| §7 SaaS | DESIGN §7 路线 | 🟡 | A 事件流化 ✅;B 完工 ✅;D `/v1` JSON API ✅;D' 过渡 auth + dev SPA ✅;单活 run 锁 + cancel ✅;0004 schema 瘦身 ✅;入口归位 ✅;真 OIDC 待;**C Step 1-3 + 3d ✅(Executor + Docker 池 + DockerExecutor + fs 工具进容器)+ Step 5 部署前置对账 ✅ + 容器资源 yaml + 应用层磁盘配额(scan+gate)✅ + dogfood 网络放开 + 容器内 pip/npm 源持久化 ✅**;**Step 4 完整 egress proxy + Step 3b PGID kill 协议延后到外部用户开放前**;**外部用户开放仍需 egress proxy + xfs project quota OS 层硬化(§7.5 落地清单 #2 #4)**。 |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 已完成关键能力
|
## 已完成关键能力
|
||||||
|
|
||||||
### 2026-05-26
|
### 2026-05-27
|
||||||
|
|
||||||
|
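- **Stage C wrap-up: container resources in yaml + application-level disk quota + dogfood network opening + in-container pip/npm mirror persistence**: the full Step 4 egress proxy (allowlist + audit + byte metering) is 1-2 days of work and **does not need to come first during the dogfood + trusted-colleague allowlist phase**, consistent with the DESIGN §7.7 phase semantics; it is recorded as upgrade triggers (any unknown user registering / dogfood revealing abnormal model outbound traffic / someone not closely known appearing on the trusted allowlist → Step 4 must ship). This batch has 3 parts: ① **Container resources in yaml**: `config/agent.yaml` gains a `sandbox` section (memory/cpus/pids_limit), `SandboxPool.__init__` gains three fields, priority env > yaml > default (2g/1.0/256); `setup_pool` / `init_pool` pass sandbox_cfg through; `main.py sandbox check` prints 4 extra `[info]` lines (memory/cpus/pids_limit/disk_bytes_per_user) so ops can reconcile at a glance. ② **Application-level disk quota**: migration `0008_user_disk_usage` (one row per user: bytes_used/file_count/scanned_at) + `core/storage/disk_quota.py` (`parse_bytes` ("5gb"/"500mb"/int) + `scan_user_dir` (os.scandir, skipping the top-level dotfiles `.zcbot_tmp` `.memory`) + `upsert_user_usage` ON CONFLICT + `check_disk_quota` (returns a Chinese message when over quota) + `scan_all_users`, which scans all users serially) + the web/app.py lifespan `_disk_scanner` background task (runs once at startup, then every 15 min by default via `run_in_executor`) + `DockerExecutor._exec_fs_tool` calling `_check_user_disk_quota` before write/edit and returning `[Error]` without calling the container when over quota + the same gate on `/v1/files/upload`, returning HTTP 413 when over quota. yaml `quotas.disk_bytes_per_user: 5gb` + `disk_scan_interval_seconds: 900`; ≤0 means unlimited; before the first scan the check short-circuits and allows writes, so a cold start cannot block them. Accepted race: writes between scans can slightly exceed the limit (same race-tolerant model as the image/video quotas). ③ **Network opening + in-container mirror persistence**: `core/sandbox/network.py` drops the `--internal` flag (now the default docker bridge with NAT outbound; during dogfood the model can `pip install foo` / `curl https://...`); an existing internal network is not removed automatically, only warned about (avoids breaking existing containers; RUN.md gives the migration command). The Dockerfile adds `/etc/pip.conf` (writes `[global]\nindex-url=${PIP_INDEX_URL}` + timeout 60) + `/etc/npmrc` (writes `registry=${NPM_REGISTRY}`) so runtime pip / npm install also go through the mirror (previously `--build-arg` only applied at build time). The iptables red-line ranges are untouched: `169.254/127/10/172.16/192.168/100.64/PG_IP` are still DROPped, blocking cloud metadata + intranet scanning + loopback; this baseline does not depend on the proxy. **Tests**: `tests/test_disk_quota.py` has 11 tests covering parse_bytes units / scan_user_dir skipping dotfiles / empty directory / nonexistent path; **unittest discover 46/46 PASS** (35 existing + 11 new). **DESIGN §7.5 #2 is to gain a "Step 4 deferred + upgrade trigger table" paragraph in a later commit** (this commit does not touch DESIGN: DESIGN changes only when the architecture changes, the deferral stays within the §7.7 Stage C phase semantics, and the triggers are recorded in PROGRESS / RUN); RUN.md gains the yaml sandbox section + network migration + quota commands + 3 troubleshooting rows (upload 413 / legacy internal network / write-edit quota `[Error]`). Rejected: (a) automatically rm + recreate the network when its internal setting changes: destructive, breaks existing container connections; warn instead and let ops acknowledge; (b) a live du before each write: with a large user_root, several seconds per write is unacceptable; a periodically scanned table plus a pre-write lookup is the same pattern as the image/video quotas; (c) doing the full Step 4 at the same time: 1-2 days of heavy work that does not block dogfood; opening the network so the model can pip install is more urgent (in practice, installing packages / fetching resources is a product threshold); (d) hard-blocking all writes with the disk quota (including run_python / shell): intercepting syscalls is too heavy; the write/edit + upload gates already cover 95% (skill artifact paths), and files written by run_python / shell are picked up by later scans (the next periodic check blocks further writes); (e) yaml `sandbox.memory` defaulting to 4g/2cpu: the Tencent Cloud lightweight instance has 4 cores / 8G and the host must still run web + PG + nginx, so 2g/1cpu is a reasonable baseline; users with extreme tasks raise it in yaml.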
|
||||||
- **Stage C Step 3d:fs 工具(read/write/edit/glob/grep)进容器 + DESIGN §7.5 #6 重写**:Ubuntu dogfood 第一次切 docker backend 后发现 host 工具 `Path.cwd()` 漏底 —— 模型用 glob `*` 列出了 host `/home/lighthouse/zcbot/.git/.venv/config/core/...`,即 zcbot 源码自身。回查 DESIGN §7.5 #6 写"host 工具走 `paths.py::resolve_user_path` 校验",grep 代码**根本没那个函数**,假命题;`Tool._resolve` 实际是 `base_dir / path`,base_dir=`Path.cwd()`(= web 启动目录 = zcbot repo 根),绝对路径完全不挡,模型能 read `/etc/passwd` / write zcbot 源码自己。**修法对比**:Phase A(改 cwd → working_dir,1 行 hack)修 UX 不修安全;Phase B(host 工具加 user_root 强制校验 + skills/ 白名单,~80 行)安全但脆弱(symlink/`..`/Windows path 都得 case 挡,漏一个就破);**方案 3(fs 工具进容器)物理边界替代代码护栏,选这条**。`core/sandbox/tool_runner.py` 新增容器内 helper(~80 行,from stdin 接 JSON args 调 `tools/fs.py` Tool 子类,base_dir=cwd 走 docker exec --workdir 传入,user_root=/workspace);`DockerExecutor` 加 `FS_TOOLS = {read,write,edit,glob,grep}` 信任域 + `_exec_fs_tool` 方法 `docker exec -i ... python /sandbox/tool_runner.py <name>` + stdin 喂 JSON args(CJK 路径透明传不被 shell metachar 切);`_run_subprocess` 加 stdin 参数 + is_fs_tool 路径返 stdout 直透(不包 [stdout]/[exit],原模型语义保持),exit≠0 把 stderr 当 ToolResult content。`SandboxPool` 加 `repo_root` 字段,`_docker_run` 加 `<repo>/skills:/sandbox/skills:ro` mount(SKILL.md 内引用 `references/foo.md` 时容器内 read 能解析);`web/app.py` lifespan 透传 `ROOT`;`Dockerfile` `COPY tools/ /sandbox/tools/ + tool_runner.py` 让镜像内有一份 tools 源(build-time COPY 而非 mount —— 容器内代码不应跟随 host repo 修改重启)。**留 host 的工具**:`load_skill`(SkillRegistry 内存查找,无 fs 越界)/ `web_search` / `web_fetch` / `seedream` / `seedance`(持 Bocha/ARK API key,key 不入容器 env;Step 4 egress proxy 后再讨论)。**测试**:`tests/test_executor_docker.py` 改 `test_load_skill_passthrough_to_host`(原 `test_read_passthrough_to_host` 不再成立 —— read 进容器了)+ 加 4 个 fs 路径测试(read argv 形态 / CJK 路径 stdin JSON 透明传 / grep exit≠0 stderr 透传 / glob timeout 杀 docker CLI),`unittest discover 35/35 PASS`。**DESIGN §7.5 #6 重写**:从"工具二分(host fs + container code)"改"几乎所有工具进容器,host 只留持 key + 跨 user 的"+ 标注 2026-05-26 修正记录(原假命题溯源)。**代价**:每个 fs tool call 多 
~200ms docker exec overhead,对话级 N≤15 总 1-3s,LLM 推理 5-30s 下噪声;镜像 build COPY tools/ ~5s 增量。**升级触发**(§7.9 升级表):若 metric `docker_exec_overhead / total_tool_time > 30%` 持续两周,或模型出现"在容器内起长驻服务"工作流,启用容器内 tool-runner unix socket RPC(消除每次 exec 开销)。否决:(a) Phase B path validator —— 跟 §7.9 § "美学统一性 ≠ 升级理由"对称,**安全要"物理 ≠ 代码"才稳**;(b) `COPY core/ tools/ ...` 把整个 zcbot core 进镜像 —— tool_runner 只需要 `tools/fs.py` + base.py,容器内多余代码增加攻击面;(c) tool_runner.py 用 argv 传 JSON args —— CJK / 引号 / 路径分隔符全是 shell metachar 切风险,stdin 喂稳;(d) 让 host backend 也保留 fs 工具走 user_root 校验作"双保险" —— 双源 = 漂移源,docker backend 是 §7.5 的全部论证基础,host backend 不在外部用户场景有它就够。
|
- **Stage C Step 3d:fs 工具(read/write/edit/glob/grep)进容器 + DESIGN §7.5 #6 重写**:Ubuntu dogfood 第一次切 docker backend 后发现 host 工具 `Path.cwd()` 漏底 —— 模型用 glob `*` 列出了 host `/home/lighthouse/zcbot/.git/.venv/config/core/...`,即 zcbot 源码自身。回查 DESIGN §7.5 #6 写"host 工具走 `paths.py::resolve_user_path` 校验",grep 代码**根本没那个函数**,假命题;`Tool._resolve` 实际是 `base_dir / path`,base_dir=`Path.cwd()`(= web 启动目录 = zcbot repo 根),绝对路径完全不挡,模型能 read `/etc/passwd` / write zcbot 源码自己。**修法对比**:Phase A(改 cwd → working_dir,1 行 hack)修 UX 不修安全;Phase B(host 工具加 user_root 强制校验 + skills/ 白名单,~80 行)安全但脆弱(symlink/`..`/Windows path 都得 case 挡,漏一个就破);**方案 3(fs 工具进容器)物理边界替代代码护栏,选这条**。`core/sandbox/tool_runner.py` 新增容器内 helper(~80 行,from stdin 接 JSON args 调 `tools/fs.py` Tool 子类,base_dir=cwd 走 docker exec --workdir 传入,user_root=/workspace);`DockerExecutor` 加 `FS_TOOLS = {read,write,edit,glob,grep}` 信任域 + `_exec_fs_tool` 方法 `docker exec -i ... python /sandbox/tool_runner.py <name>` + stdin 喂 JSON args(CJK 路径透明传不被 shell metachar 切);`_run_subprocess` 加 stdin 参数 + is_fs_tool 路径返 stdout 直透(不包 [stdout]/[exit],原模型语义保持),exit≠0 把 stderr 当 ToolResult content。`SandboxPool` 加 `repo_root` 字段,`_docker_run` 加 `<repo>/skills:/sandbox/skills:ro` mount(SKILL.md 内引用 `references/foo.md` 时容器内 read 能解析);`web/app.py` lifespan 透传 `ROOT`;`Dockerfile` `COPY tools/ /sandbox/tools/ + tool_runner.py` 让镜像内有一份 tools 源(build-time COPY 而非 mount —— 容器内代码不应跟随 host repo 修改重启)。**留 host 的工具**:`load_skill`(SkillRegistry 内存查找,无 fs 越界)/ `web_search` / `web_fetch` / `seedream` / `seedance`(持 Bocha/ARK API key,key 不入容器 env;Step 4 egress proxy 后再讨论)。**测试**:`tests/test_executor_docker.py` 改 `test_load_skill_passthrough_to_host`(原 `test_read_passthrough_to_host` 不再成立 —— read 进容器了)+ 加 4 个 fs 路径测试(read argv 形态 / CJK 路径 stdin JSON 透明传 / grep exit≠0 stderr 透传 / glob timeout 杀 docker CLI),`unittest discover 35/35 PASS`。**DESIGN §7.5 #6 重写**:从"工具二分(host fs + container code)"改"几乎所有工具进容器,host 只留持 key + 跨 user 的"+ 标注 2026-05-26 修正记录(原假命题溯源)。**代价**:每个 fs tool call 多 
~200ms docker exec overhead,对话级 N≤15 总 1-3s,LLM 推理 5-30s 下噪声;镜像 build COPY tools/ ~5s 增量。**升级触发**(§7.9 升级表):若 metric `docker_exec_overhead / total_tool_time > 30%` 持续两周,或模型出现"在容器内起长驻服务"工作流,启用容器内 tool-runner unix socket RPC(消除每次 exec 开销)。否决:(a) Phase B path validator —— 跟 §7.9 § "美学统一性 ≠ 升级理由"对称,**安全要"物理 ≠ 代码"才稳**;(b) `COPY core/ tools/ ...` 把整个 zcbot core 进镜像 —— tool_runner 只需要 `tools/fs.py` + base.py,容器内多余代码增加攻击面;(c) tool_runner.py 用 argv 传 JSON args —— CJK / 引号 / 路径分隔符全是 shell metachar 切风险,stdin 喂稳;(d) 让 host backend 也保留 fs 工具走 user_root 校验作"双保险" —— 双源 = 漂移源,docker backend 是 §7.5 的全部论证基础,host backend 不在外部用户场景有它就够。
|
||||||
- **Stage C Step 3 hotfix:exec_user 改 username 跟随 build_arg + Dockerfile 加 Node/Chromium/mermaid-cli**:Ubuntu 上 dogfood 暴露两个真问题。① **uid 错配**:DockerExecutor 写死 `--user 1000:1000`,但镜像 `docker build --build-arg HOST_UID=$(id -u)` 跟随 host 实际 uid(腾讯云轻量 lighthouse 用户 uid=1001),docker exec 进容器 uid=1000 → bind mount `/workspace/<wd>/` owner 1001 → 写文件全 EACCES,文件落 `/tmp/`。改 `DEFAULT_EXEC_USER = "zcbot"`(username,docker 自动查容器 /etc/passwd 拿 uid),无论 HOST_UID build 成 1000/1001/其他都跟 bind mount owner 对齐,且未来切其他部署机不用改 env。② **proposal/patent skill 渲 mermaid 缺 Node**:`skills/proposal/scripts/render_diagrams.py` `render_via_mmdc` 调 `shutil.which("mmdc")`,容器没装 → 退到 mermaid.ink 公网 API → 但 sandbox 容器 `--internal` 默 deny outbound,API 也走不通 → ASCII fallback 出 docx 没图不能用。Dockerfile 加 `chromium nodejs npm` apt 装(Debian bookworm 自带 node 18.x 够新)+ `npm install -g @mermaid-js/mermaid-cli@latest`,镜像 +~400MB(接受)。容器内 chromium 缺 setuid sandbox + `/dev/shm` 不够大会跪,镜像落 `/sandbox/puppeteer-config.json`(`--no-sandbox` / `--disable-setuid-sandbox` / `--disable-dev-shm-usage` + executablePath=/usr/bin/chromium)+ ENV `MERMAID_PUPPETEER_CONFIG=/sandbox/puppeteer-config.json`,`render_via_mmdc` 改读 env 拼 `-p <config>` 注入 mmdc;host 上跑 env 没设行为零变化。`PUPPETEER_SKIP_DOWNLOAD=true` + `PUPPETEER_EXECUTABLE_PATH` 让 puppeteer 用容器 chromium 不再下载它自带的 Chrome(省 ~300MB build)。npm 源加 `--build-arg NPM_REGISTRY=https://mirrors.cloud.tencent.com/npm/`(腾讯云内网)防境内 build 慢。`DESIGN.md` 不动(纯实施层 bug fix + skill 依赖);`RUN.md` 加 NPM_REGISTRY 段 + 故障兜底 3 行(EACCES uid 错配 / mmdc 报 launch chromium / npm 慢)。否决:(a) 让 DockerExecutor 启动时探测 `os.getuid()` 自动取 host uid 作 `--user` —— 写死 username 让 docker 查 passwd 比应用层探测更直接,且 部署机 uid 偶尔变(从 1000 重装成 1001)不用改任何东西;(b) 容器走 NodeSource repo 装 Node 20 LTS —— Debian bookworm 自带 18.x 已满足 mermaid-cli 要求(>=14.x),多一步外网拖速度;(c) 不装 chromium 等 Step 4 egress proxy 后用 mermaid.ink —— proposal 是早期就要交付的能力,等 Step 4(还没动手)不现实;(d) puppeteer config 注入靠改 mmdc 启动脚本 —— mmdc 默支持 `-p`,改 render_diagrams.py 读 env 就够,不动 mmdc 内部。
|
- **Stage C Step 3 hotfix:exec_user 改 username 跟随 build_arg + Dockerfile 加 Node/Chromium/mermaid-cli**:Ubuntu 上 dogfood 暴露两个真问题。① **uid 错配**:DockerExecutor 写死 `--user 1000:1000`,但镜像 `docker build --build-arg HOST_UID=$(id -u)` 跟随 host 实际 uid(腾讯云轻量 lighthouse 用户 uid=1001),docker exec 进容器 uid=1000 → bind mount `/workspace/<wd>/` owner 1001 → 写文件全 EACCES,文件落 `/tmp/`。改 `DEFAULT_EXEC_USER = "zcbot"`(username,docker 自动查容器 /etc/passwd 拿 uid),无论 HOST_UID build 成 1000/1001/其他都跟 bind mount owner 对齐,且未来切其他部署机不用改 env。② **proposal/patent skill 渲 mermaid 缺 Node**:`skills/proposal/scripts/render_diagrams.py` `render_via_mmdc` 调 `shutil.which("mmdc")`,容器没装 → 退到 mermaid.ink 公网 API → 但 sandbox 容器 `--internal` 默 deny outbound,API 也走不通 → ASCII fallback 出 docx 没图不能用。Dockerfile 加 `chromium nodejs npm` apt 装(Debian bookworm 自带 node 18.x 够新)+ `npm install -g @mermaid-js/mermaid-cli@latest`,镜像 +~400MB(接受)。容器内 chromium 缺 setuid sandbox + `/dev/shm` 不够大会跪,镜像落 `/sandbox/puppeteer-config.json`(`--no-sandbox` / `--disable-setuid-sandbox` / `--disable-dev-shm-usage` + executablePath=/usr/bin/chromium)+ ENV `MERMAID_PUPPETEER_CONFIG=/sandbox/puppeteer-config.json`,`render_via_mmdc` 改读 env 拼 `-p <config>` 注入 mmdc;host 上跑 env 没设行为零变化。`PUPPETEER_SKIP_DOWNLOAD=true` + `PUPPETEER_EXECUTABLE_PATH` 让 puppeteer 用容器 chromium 不再下载它自带的 Chrome(省 ~300MB build)。npm 源加 `--build-arg NPM_REGISTRY=https://mirrors.cloud.tencent.com/npm/`(腾讯云内网)防境内 build 慢。`DESIGN.md` 不动(纯实施层 bug fix + skill 依赖);`RUN.md` 加 NPM_REGISTRY 段 + 故障兜底 3 行(EACCES uid 错配 / mmdc 报 launch chromium / npm 慢)。否决:(a) 让 DockerExecutor 启动时探测 `os.getuid()` 自动取 host uid 作 `--user` —— 写死 username 让 docker 查 passwd 比应用层探测更直接,且 部署机 uid 偶尔变(从 1000 重装成 1001)不用改任何东西;(b) 容器走 NodeSource repo 装 Node 20 LTS —— Debian bookworm 自带 18.x 已满足 mermaid-cli 要求(>=14.x),多一步外网拖速度;(c) 不装 chromium 等 Step 4 egress proxy 后用 mermaid.ink —— proposal 是早期就要交付的能力,等 Step 4(还没动手)不现实;(d) puppeteer config 注入靠改 mmdc 启动脚本 —— mmdc 默支持 `-p`,改 render_diagrams.py 读 env 就够,不动 mmdc 内部。
|
||||||
- **Stage C Step 5:`main.py sandbox check` 部署前置对账 + lifespan fs quota WARN**:外部用户开放是 §7.5 #4 magnetic 要求(xfs prjquota / ext4 project quota / zfs dataset quota,否则"扫描间隙打满共享 fs 拖死同节点"),且 docker backend 启动前置(daemon/镜像/HOST_UID 对齐)出错时 lifespan 直接 fail-fast、traceback 排查贵 —— 把"运维心智清单"沉淀成可执行命令。`main.py sandbox check` 跑 5 项独立探测:① docker daemon 可达(CLI 存在 + `docker version` rc=0)② `zcbot-sandbox:latest` 镜像存在 ③ `zcbot-sandbox-net` network 存在(缺也 OK,lifespan 自动 ensure,这一项 warn 不 err)④ 镜像内 zcbot uid 与 host uid 对齐(`docker run --rm --entrypoint id` 拿镜像 uid 比对 `os.getuid()`;Windows 自动 skip)⑤ workspace/users/ 所在 fs 类型可 quota(`findmnt --target ... -no FSTYPE,OPTIONS` 解析,识别 xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / 其他)。`detect_fs_quota(path) -> (level, msg)` 抽出来给 lifespan 复用:`web/app.py` docker backend 启动时同样跑一次,WARN 打 stdout(不阻塞),应用层周期扫描仍生效。**err vs warn 分界**:err = docker backend 启动会 fail-fast 的根因(daemon/镜像/HOST_UID,exit 1);warn = 不阻塞启动但外部用户开放前要清(network 缺 / fs 不可 quota,exit 0)。`tests/test_sandbox_check.py` 19 测试覆盖各分支 + 汇总 exit code,mock subprocess 与 sys.platform(`run_sandbox_check` 改用 module-level lookup 而非固化 `CHECKS` 元组,让 unittest patch 生效);**全套 unittest discover 31/31 PASS**。RUN.md 加"部署前置对账"小节(`sandbox check` 5 项含义)+ "配额硬化"段重写(fs 类型 → 处理动作映射表 + xfs 升级 4 步)+ 故障兜底 3 行(sandbox init failed / fs quota warn / image not found)。否决:(a) lifespan 探测失败 → fail-fast 而非 WARN —— Step 5 阶段应用层周期扫描已有,OS 层 quota 是外部开放硬要求不是 dogfood 硬要求,fail-fast 会阻碍 dogfood 启动;(b) sandbox check 自带 `quota-set` 子命令直接调 `xfs_quota` —— `<pid>` 整数 ↔ user_uuid 映射要建表跟踪,且 sudo + /etc/projects 改动属于运维操作,Step 5 阶段只落 RUN.md 说明 + 命令清单,真要做时在外部开放前一步;(c) 在 sandbox check 里探测 egress proxy 状态 —— Step 4 未实施,占位会让人误以为已落地。`DESIGN.md` 不动(纯按 §7.5 #4 既有协议实施);`RUN.md` 更新如上。
|
- **Stage C Step 5:`main.py sandbox check` 部署前置对账 + lifespan fs quota WARN**:外部用户开放是 §7.5 #4 magnetic 要求(xfs prjquota / ext4 project quota / zfs dataset quota,否则"扫描间隙打满共享 fs 拖死同节点"),且 docker backend 启动前置(daemon/镜像/HOST_UID 对齐)出错时 lifespan 直接 fail-fast、traceback 排查贵 —— 把"运维心智清单"沉淀成可执行命令。`main.py sandbox check` 跑 5 项独立探测:① docker daemon 可达(CLI 存在 + `docker version` rc=0)② `zcbot-sandbox:latest` 镜像存在 ③ `zcbot-sandbox-net` network 存在(缺也 OK,lifespan 自动 ensure,这一项 warn 不 err)④ 镜像内 zcbot uid 与 host uid 对齐(`docker run --rm --entrypoint id` 拿镜像 uid 比对 `os.getuid()`;Windows 自动 skip)⑤ workspace/users/ 所在 fs 类型可 quota(`findmnt --target ... -no FSTYPE,OPTIONS` 解析,识别 xfs+prjquota / ext4+project quota / zfs / btrfs / tmpfs / 其他)。`detect_fs_quota(path) -> (level, msg)` 抽出来给 lifespan 复用:`web/app.py` docker backend 启动时同样跑一次,WARN 打 stdout(不阻塞),应用层周期扫描仍生效。**err vs warn 分界**:err = docker backend 启动会 fail-fast 的根因(daemon/镜像/HOST_UID,exit 1);warn = 不阻塞启动但外部用户开放前要清(network 缺 / fs 不可 quota,exit 0)。`tests/test_sandbox_check.py` 19 测试覆盖各分支 + 汇总 exit code,mock subprocess 与 sys.platform(`run_sandbox_check` 改用 module-level lookup 而非固化 `CHECKS` 元组,让 unittest patch 生效);**全套 unittest discover 31/31 PASS**。RUN.md 加"部署前置对账"小节(`sandbox check` 5 项含义)+ "配额硬化"段重写(fs 类型 → 处理动作映射表 + xfs 升级 4 步)+ 故障兜底 3 行(sandbox init failed / fs quota warn / image not found)。否决:(a) lifespan 探测失败 → fail-fast 而非 WARN —— Step 5 阶段应用层周期扫描已有,OS 层 quota 是外部开放硬要求不是 dogfood 硬要求,fail-fast 会阻碍 dogfood 启动;(b) sandbox check 自带 `quota-set` 子命令直接调 `xfs_quota` —— `<pid>` 整数 ↔ user_uuid 映射要建表跟踪,且 sudo + /etc/projects 改动属于运维操作,Step 5 阶段只落 RUN.md 说明 + 命令清单,真要做时在外部开放前一步;(c) 在 sandbox check 里探测 egress proxy 状态 —— Step 4 未实施,占位会让人误以为已落地。`DESIGN.md` 不动(纯按 §7.5 #4 既有协议实施);`RUN.md` 更新如上。
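The disk-quota entry for 2026-05-27 describes `scan_user_dir` as an `os.scandir` walk that skips the top-level `.zcbot_tmp` and `.memory` directories. The real function is not shown in this diff, so the following is only a sketch of that idea: `scan_dir_usage` is a hypothetical name, and the choices to skip exactly those two names, to ignore symlinks, and to treat a missing path as empty are assumptions.

```python
import os
from pathlib import Path
from typing import Tuple

# Assumption: only these two top-level names are excluded from usage.
SKIP_TOP_LEVEL = frozenset({".zcbot_tmp", ".memory"})


def scan_dir_usage(root: Path) -> Tuple[int, int]:
    """Return (bytes_used, file_count) for everything under `root`."""
    total_bytes = 0
    file_count = 0
    stack = [(Path(root), True)]  # (directory, is_top_level)
    while stack:
        current, is_top = stack.pop()
        try:
            with os.scandir(current) as it:
                entries = list(it)
        except (FileNotFoundError, NotADirectoryError, PermissionError):
            continue  # a missing or unreadable directory counts as empty
        for entry in entries:
            if is_top and entry.name in SKIP_TOP_LEVEL:
                continue
            try:
                if entry.is_dir(follow_symlinks=False):
                    stack.append((Path(entry.path), False))
                elif entry.is_file(follow_symlinks=False):
                    total_bytes += entry.stat(follow_symlinks=False).st_size
                    file_count += 1
            except OSError:
                continue  # entry vanished mid-scan; skip it
    return total_bytes, file_count
```

An explicit stack avoids recursion limits on deep trees, and the skip applies only at the top level, so a nested directory that happens to be named `.memory` is still counted.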
|
||||||
|
|
|
||||||
11
RUN.md
|
|
@ -320,8 +320,8 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
|
||||||
# docker = shell/run_python 走 per-user 容器 docker exec(部署机 / 外部用户)
|
# docker = shell/run_python 走 per-user 容器 docker exec(部署机 / 外部用户)
|
||||||
# ZCBOT_SANDBOX_BACKEND=docker
|
# ZCBOT_SANDBOX_BACKEND=docker
|
||||||
|
|
||||||
# 容器内 exec 用户(默 1000:1000;Dockerfile 的 HOST_UID/HOST_GID build-arg 同步对齐)
|
# 容器内 exec 用户(默 zcbot,docker 查容器 /etc/passwd 拿 uid)
|
||||||
# ZCBOT_SANDBOX_EXEC_USER=1000:1000
|
# ZCBOT_SANDBOX_EXEC_USER=zcbot
|
||||||
|
|
||||||
# 容器镜像 tag(默 zcbot-sandbox:latest)
|
# 容器镜像 tag(默 zcbot-sandbox:latest)
|
||||||
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
|
# ZCBOT_SANDBOX_IMAGE=zcbot-sandbox:latest
|
||||||
|
|
@ -329,6 +329,10 @@ sudo -u zcbot docker network create --internal zcbot-sandbox-net
|
||||||
# ZCBOT_SANDBOX_RUNTIME=
|
# ZCBOT_SANDBOX_RUNTIME=
|
||||||
# 空闲多少秒回收(默 300)
|
# 空闲多少秒回收(默 300)
|
||||||
# ZCBOT_SANDBOX_IDLE_TTL=300
|
# ZCBOT_SANDBOX_IDLE_TTL=300
|
||||||
|
# Resource limits (priority env > yaml `sandbox.*` > default); after a change, restart web and newly started containers pick it up
|
||||||
|
# ZCBOT_SANDBOX_MEMORY=2g
|
||||||
|
# ZCBOT_SANDBOX_CPUS=1.0
|
||||||
|
# ZCBOT_SANDBOX_PIDS_LIMIT=256
|
||||||
# PG 实际 IP,逗号分隔。defense-in-depth ── 即便落内网三段(§7.5 #1),
|
# PG 实际 IP,逗号分隔。defense-in-depth ── 即便落内网三段(§7.5 #1),
|
||||||
# init.sh 再加一遍 DROP 规则。生产部署必填。
|
# init.sh 再加一遍 DROP 规则。生产部署必填。
|
||||||
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
|
ZCBOT_PG_IPS=10.1.2.3,10.1.2.4
|
||||||
|
|
@ -476,6 +480,9 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
|
||||||
| `systemctl restart zcbot` 卡 10s 才退 | 有 SSE 长连接,uvicorn graceful shutdown 等 in-flight。unit 已设 `TimeoutStopSec=10` 兜 SIGKILL,正常现象;真急用 `systemctl kill -s KILL zcbot` |
|
| `systemctl restart zcbot` 卡 10s 才退 | 有 SSE 长连接,uvicorn graceful shutdown 等 in-flight。unit 已设 `TimeoutStopSec=10` 兜 SIGKILL,正常现象;真急用 `systemctl kill -s KILL zcbot` |
|
||||||
| `POST /v1/files/rename` 返 409 `folder has active run(s)` | 顶层目录被某 running/cancelling 的 task 占用;先 cancel 等流式 done 再 rename |
|
| `POST /v1/files/rename` 返 409 `folder has active run(s)` | 顶层目录被某 running/cancelling 的 task 占用;先 cancel 等流式 done 再 rename |
|
||||||
| `POST /v1/files/rename` 返 409 `... 前缀嵌套` | 改名后会与其他 task 的 working_dir 形成嵌套;换不冲突的 new_name |
|
| `POST /v1/files/rename` 返 409 `... 前缀嵌套` | 改名后会与其他 task 的 working_dir 形成嵌套;换不冲突的 new_name |
|
||||||
|
| `POST /v1/files/upload` returns 413 `已达磁盘配额上限` | Per-user limit is 5GB (yaml `quotas.disk_bytes_per_user`). Have the user delete old artifacts / large files in the file panel on the right of the dev SPA, or raise the yaml value and restart web |
| `[warn] network zcbot-sandbox-net is --internal (legacy)` | The previous version created the sandbox network with `--internal` (outbound fully disabled); the current dogfood phase opens it. Run `docker stop $(docker ps -aq -f label=zcbot.product=sandbox) ; docker network rm zcbot-sandbox-net`, then restart web to recreate it automatically as non-internal |
| tool write/edit returns `[Error] 已达磁盘配额上限` | Same as the upload 413 row above |
|
||||||
| 启动报 `PLATFORM_KEY env not set` / `JWT_SECRET env not set` | D' 过渡 auth 强制双 env 必填。生成 `python -c "import secrets;print(secrets.token_urlsafe(48))"` 各填一,写 `.env` 重起 |
|
| 启动报 `PLATFORM_KEY env not set` / `JWT_SECRET env not set` | D' 过渡 auth 强制双 env 必填。生成 `python -c "import secrets;print(secrets.token_urlsafe(48))"` 各填一,写 `.env` 重起 |
|
||||||
| `/v1/auth/login_password` 返 403 `invalid email or password` | 邮箱不存在 / `password_hash` 列为空(platform_key 入口建的 user) / 密码错。`SELECT user_id, email, password_hash IS NOT NULL AS has_pw FROM users WHERE email=...` 核对;无行 → `main.py user add`;有行无密码 → `UPDATE users SET password_hash=...`(用 `.venv/Scripts/python.exe -c "from web.auth import hash_password;print(hash_password('xxx'))"` 算)或 `user add --user-id` 接到现有 user_id |
|
| `/v1/auth/login_password` 返 403 `invalid email or password` | 邮箱不存在 / `password_hash` 列为空(platform_key 入口建的 user) / 密码错。`SELECT user_id, email, password_hash IS NOT NULL AS has_pw FROM users WHERE email=...` 核对;无行 → `main.py user add`;有行无密码 → `UPDATE users SET password_hash=...`(用 `.venv/Scripts/python.exe -c "from web.auth import hash_password;print(hash_password('xxx'))"` 算)或 `user add --user-id` 接到现有 user_id |
|
||||||
| `main.py user add` 报 `IntegrityError ... uq_users_email` | 邮箱已存在,改 email 或先 `DELETE FROM users WHERE email=...`(先清该 user 的 tasks) |
|
| `main.py user add` 报 `IntegrityError ... uq_users_email` | 邮箱已存在,改 email 或先 `DELETE FROM users WHERE email=...`(先清该 user 的 tasks) |
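The quota rows in the table above depend on a background scan that runs once at startup and then on a fixed interval (15 minutes by default, per the commit message). The real `_disk_scanner` task is not shown in this diff, so here is a minimal asyncio sketch of that shape. The `stop` event, the `scan_all` callable, and the print-based warning are all assumptions made for illustration.

```python
import asyncio
from typing import Callable


async def disk_scanner(
    scan_all: Callable[[], None],
    interval_seconds: float,
    stop: asyncio.Event,
) -> None:
    """Run `scan_all` immediately, then once per interval until `stop` is set."""
    loop = asyncio.get_running_loop()
    while not stop.is_set():
        try:
            # The scan is blocking filesystem work, so keep it off the event loop.
            await loop.run_in_executor(None, scan_all)
        except Exception as exc:
            # One failed scan should not kill the periodic task.
            print(f"[warn] disk scan failed: {type(exc).__name__}: {exc}")
        try:
            # Sleep for one interval, but wake early on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass
```

Waiting on the event instead of a plain `asyncio.sleep` lets shutdown finish promptly rather than waiting out the rest of a 15-minute interval.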
|
||||||
|
|
|
||||||
|
|
@ -12,3 +12,16 @@ system_prompt: prompts/system/general_v1.md
|
||||||
quotas:
|
quotas:
|
||||||
images_per_day: 20 # seedream 等图像 tool 调用上限
|
images_per_day: 20 # seedream 等图像 tool 调用上限
|
||||||
videos_per_day: 5 # seedance 等视频 tool 调用上限
|
videos_per_day: 5 # seedance 等视频 tool 调用上限
|
||||||
|
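# Per-user total byte limit for the working directory (uploads plus everything tools write); ≤ 0 means unlimited.
# Gate before writes (/v1/files/upload + DockerExecutor write/edit); over quota is hard-blocked ([Error] for tools, HTTP 413 for upload).
# Usage comes from the lifespan background scan (every 15 min) into the user_disk_usage table; slight overshoot between scans is accepted
# (same race-tolerant model as the image/video quotas); OS-level xfs prjquota is added as a backstop before opening to external users.
disk_bytes_per_user: 5gb # accepts 5gb / 500mb / 1073741824 (integer bytes)
disk_scan_interval_seconds: 900 # background scan interval, default 15 minutes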
|
||||||
|
|
||||||
|
# Sandbox container resource limits (docker run flags, overridable via env); takes effect after restarting web.
# Newly started containers use the new values; running ones are unchanged (until reclaimed after 5 min idle and started again).
sandbox:
  memory: 2g # --memory (env: ZCBOT_SANDBOX_MEMORY)
  cpus: 1.0 # --cpus (env: ZCBOT_SANDBOX_CPUS)
  pids_limit: 256 # --pids-limit (env: ZCBOT_SANDBOX_PIDS_LIMIT)
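The `disk_bytes_per_user` key accepts values such as `5gb`, `500mb`, or a plain integer byte count. The repo's `parse_bytes` is not shown in this diff, so the following is a sketch of one way to parse such values. `parse_bytes_sketch` is a hypothetical name, and treating units as binary multiples (1 gb = 1024³ bytes) is an assumption; the real helper may use decimal units.

```python
import re
from typing import Optional, Union

# Assumption: binary multiples. The real parse_bytes may differ.
_UNITS = {"": 1, "b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
_SIZE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([a-z]*)\s*$")


def parse_bytes_sketch(value: Union[str, int, None]) -> Optional[int]:
    """Turn '5gb' / '500mb' / 1073741824 into a byte count; None stays None."""
    if value is None:
        return None
    if isinstance(value, bool):
        raise ValueError("boolean is not a valid size")
    if isinstance(value, int):
        return value
    match = _SIZE_RE.match(str(value).lower())
    if match is None:
        raise ValueError(f"cannot parse size: {value!r}")
    number, unit = match.groups()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {value!r}")
    return int(float(number) * _UNITS[unit])
```

Returning `None` for an unset key lets the caller treat "no quota configured" the same way as a non-positive limit, which the yaml comment defines as unlimited.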
|
||||||
|
|
|
||||||
|
|
@ -47,6 +47,10 @@ from .executor_host import HostExecutor
|
||||||
from .sandbox import SandboxPool
|
from .sandbox import SandboxPool
|
||||||
|
|
||||||
|
|
||||||
|
# write/edit go through the quota gate; read/glob/grep consume no disk and pass through
|
||||||
|
_FS_TOOLS_WRITE = frozenset({"write", "edit"})
|
||||||
|
|
||||||
|
|
||||||
# 信任域分类(§7.5 #6,2026-05-26 修正):
|
# 信任域分类(§7.5 #6,2026-05-26 修正):
|
||||||
# - SHELL_LIKE:执行任意代码,Popen 直接喂 cmd / script,setsid 包一层
|
# - SHELL_LIKE:执行任意代码,Popen 直接喂 cmd / script,setsid 包一层
|
||||||
# - FS_TOOLS:fs 操作,docker exec → /sandbox/tool_runner.py + stdin 喂 JSON args
|
# - FS_TOOLS:fs 操作,docker exec → /sandbox/tool_runner.py + stdin 喂 JSON args
|
||||||
|
|
@ -187,7 +191,15 @@ class DockerExecutor(Executor):
|
||||||
fs 工具的 cancel / timeout 都用与 shell/run_python 不同的默认值:
|
fs 工具的 cancel / timeout 都用与 shell/run_python 不同的默认值:
|
||||||
- timeout 短(30s),fs 操作不会跑很久,卡住就说明撞 mount / 大目录扫描
|
- timeout 短(30s),fs 操作不会跑很久,卡住就说明撞 mount / 大目录扫描
|
||||||
- cancel 仍 poll(模型可能 grep 全 user_root 然后用户停止,响应即时)
|
- cancel 仍 poll(模型可能 grep 全 user_root 然后用户停止,响应即时)
|
||||||
|
|
||||||
|
write/edit 起手 check 磁盘配额(§7.5 #4),超额返 [Error] 不调容器。
|
||||||
|
read/glob/grep 不消耗磁盘放行。
|
||||||
"""
|
"""
|
||||||
|
if name in _FS_TOOLS_WRITE:
    err = _check_user_disk_quota(self.user_id)
    if err is not None:
        return ToolResult(content=err, exit_code=2)
|
||||||
|
|
||||||
timeout = int(args.get("timeout") or 30) if name == "grep" else 30
|
timeout = int(args.get("timeout") or 30) if name == "grep" else 30
|
||||||
|
|
||||||
container = self.pool.ensure(self.user_id)
|
container = self.pool.ensure(self.user_id)
|
||||||
|
|
@ -311,3 +323,23 @@ class DockerExecutor(Executor):
|
||||||
parts.append(f"[stderr]\n{stderr.rstrip()}")
|
parts.append(f"[stderr]\n{stderr.rstrip()}")
|
||||||
parts.append(f"[exit {proc.returncode}]")
|
parts.append(f"[exit {proc.returncode}]")
|
||||||
return ToolResult(content="\n".join(parts), exit_code=proc.returncode)
|
return ToolResult(content="\n".join(parts), exit_code=proc.returncode)
|
||||||
|
|
||||||
|
|
||||||
|
def _check_user_disk_quota(user_id: UUID):
    """Gate before write/edit; reads the yaml quota and looks up the user_disk_usage table.

    Lives here as a module-level helper rather than a DockerExecutor method because
    the host_executor path (/v1/files/upload) reuses the same gate: implemented once, used in two places.
    """
    try:
        from core.agent_builder import load_config
        from core.storage.disk_quota import check_disk_quota, parse_bytes
        cfg = load_config() or {}
        quotas = cfg.get("quotas") or {}
        limit = parse_bytes(quotas.get("disk_bytes_per_user"))
        if limit is None or limit <= 0:
            return None
        return check_disk_quota(user_id, limit)
    except Exception:
        # A failed quota lookup must not block the main path (the write is still allowed; logging is left to the caller)
        return None
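The helper above delegates to `check_disk_quota`, whose body is not part of this diff. From the descriptions elsewhere in the commit (a non-positive limit means unlimited, a user with no scan row yet is allowed through, over quota returns a message), its decision logic might look roughly like this sketch. The in-memory dict stands in for the `user_disk_usage` table; `UsageRow`, `check_quota_sketch`, the English message, and the `>=` comparison are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UsageRow:
    """Hypothetical stand-in for one row of the user_disk_usage table."""
    bytes_used: int
    file_count: int


def check_quota_sketch(
    usage: Dict[str, UsageRow], user_id: str, limit: int
) -> Optional[str]:
    """Return an error message if the user is at or over quota, else None."""
    if limit <= 0:
        return None  # non-positive limit means unlimited
    row = usage.get(user_id)
    if row is None:
        return None  # no scan yet: allow the write instead of blocking cold starts
    if row.bytes_used >= limit:
        return f"[Error] disk quota reached: {row.bytes_used} of {limit} bytes used"
    return None
```

Because the decision reads only the last scanned value, a write between two scans can push usage slightly past the limit, which is the race the commit explicitly accepts.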
|
||||||
|
|
|
||||||
|
|
@ -36,16 +36,19 @@ _pool: Optional[SandboxPool] = None
|
||||||
|
|
||||||
|
|
||||||
def init_pool(
|
def init_pool(
|
||||||
user_root_base: Path, repo_root: Optional[Path] = None
|
user_root_base: Path,
|
||||||
|
repo_root: Optional[Path] = None,
|
||||||
|
sandbox_cfg: Optional[dict] = None,
|
||||||
) -> SandboxPool:
|
) -> SandboxPool:
|
||||||
"""幂等初始化 module-level pool。返回 pool 实例。
|
"""幂等初始化 module-level pool。返回 pool 实例。
|
||||||
|
|
||||||
lifespan 调一次;ensure_network 内部也幂等。重复调用返回同一实例(不重新建)。
|
lifespan 调一次;ensure_network 内部也幂等。重复调用返回同一实例(不重新建)。
|
||||||
`repo_root` 给 fs 工具进容器后 SKILL references 的 ro mount(详 pool.py)。
|
`repo_root` 给 fs 工具进容器后 SKILL references 的 ro mount(详 pool.py)。
|
||||||
|
`sandbox_cfg` is the `sandbox` section of agent.yaml, containing memory/cpus/pids_limit.
|
||||||
"""
|
"""
|
||||||
global _pool
|
global _pool
|
||||||
if _pool is None:
|
if _pool is None:
|
||||||
_pool = setup_pool(user_root_base, repo_root=repo_root)
|
_pool = setup_pool(user_root_base, repo_root=repo_root, sandbox_cfg=sandbox_cfg)
|
||||||
return _pool
|
return _pool
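`init_pool` forwards `sandbox_cfg` so the pool can resolve each limit with the priority env > yaml > default (2g/1.0/256). The `SandboxPool` constructor is not shown in this diff, so the following is only a sketch of that resolution order. `resolve_sandbox_limits` and `docker_resource_flags` are hypothetical helpers; the env names, defaults, and docker flags are the ones quoted in the yaml comments.

```python
import os
from typing import Dict, List, Mapping, Optional

DEFAULTS = {"memory": "2g", "cpus": "1.0", "pids_limit": "256"}
ENV_NAMES = {
    "memory": "ZCBOT_SANDBOX_MEMORY",
    "cpus": "ZCBOT_SANDBOX_CPUS",
    "pids_limit": "ZCBOT_SANDBOX_PIDS_LIMIT",
}


def resolve_sandbox_limits(
    sandbox_cfg: Optional[Mapping[str, object]],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Pick each limit from env first, then yaml, then the built-in default."""
    cfg = sandbox_cfg or {}
    resolved = {}
    for key, default in DEFAULTS.items():
        # `or` treats empty strings and 0 as unset, falling through to the next source.
        resolved[key] = str(env.get(ENV_NAMES[key]) or cfg.get(key) or default)
    return resolved


def docker_resource_flags(limits: Mapping[str, str]) -> List[str]:
    """Translate resolved limits into `docker run` resource flags."""
    return [
        "--memory", limits["memory"],
        "--cpus", limits["cpus"],
        "--pids-limit", limits["pids_limit"],
    ]
```

Passing `env` as a parameter keeps the resolution testable without mutating the process environment.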
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -88,7 +88,7 @@ def check_network_present() -> bool:
|
||||||
return True
|
return True
|
||||||
_warn(
|
_warn(
|
||||||
f"network missing: {NETWORK_NAME} -- lifespan 启动会自动 ensure;"
|
f"network missing: {NETWORK_NAME} -- lifespan 启动会自动 ensure;"
|
||||||
f"或手动 `docker network create --internal {NETWORK_NAME}`"
|
f"或手动 `docker network create {NETWORK_NAME}`"
|
||||||
)
|
)
|
||||||
return True # warn 不算失败
|
return True # warn 不算失败
|
||||||
|
|
||||||
|
|
@ -225,6 +225,29 @@ CHECK_NAMES = [
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _print_sandbox_resources() -> None:
    """Print the effective yaml `sandbox.*` and quota values so ops can reconcile at a glance."""
    try:
        from core.agent_builder import load_config
        from .pool import DEFAULT_CPUS, DEFAULT_MEMORY, DEFAULT_PIDS_LIMIT
        cfg = load_config() or {}
        sb = cfg.get("sandbox") or {}
        quotas = cfg.get("quotas") or {}
        # env takes priority, same resolution logic as the SandboxPool ctor
        mem = os.getenv("ZCBOT_SANDBOX_MEMORY") or sb.get("memory") or DEFAULT_MEMORY
        cpus = os.getenv("ZCBOT_SANDBOX_CPUS") or str(sb.get("cpus") or DEFAULT_CPUS)
        pids = os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT") or str(
            sb.get("pids_limit") or DEFAULT_PIDS_LIMIT
        )
        disk = quotas.get("disk_bytes_per_user", "<unset>")
        print(f"[info] sandbox.memory = {mem}")
        print(f"[info] sandbox.cpus = {cpus}")
        print(f"[info] sandbox.pids_limit = {pids}")
        print(f"[info] quotas.disk_bytes_per_user = {disk}")
    except Exception as e:
        print(f"[warn] cannot read sandbox config: {type(e).__name__}: {e}")
|
||||||
|
|
||||||
|
|
||||||
def run_sandbox_check() -> int:
|
def run_sandbox_check() -> int:
|
||||||
"""跑所有探测,返 exit code(0=全 ok 或仅 warn;1=有 err)。
|
"""跑所有探测,返 exit code(0=全 ok 或仅 warn;1=有 err)。
|
||||||
|
|
||||||
|
|
@ -236,6 +259,8 @@ def run_sandbox_check() -> int:
|
||||||
`core.sandbox.check.check_xxx` 对本函数生效。
|
`core.sandbox.check.check_xxx` 对本函数生效。
|
||||||
"""
|
"""
|
||||||
print("--- sandbox deployment check ---\n")
|
print("--- sandbox deployment check ---\n")
|
||||||
|
_print_sandbox_resources()
|
||||||
|
print()
|
||||||
ok_count = 0
|
ok_count = 0
|
||||||
module = sys.modules[__name__]
|
module = sys.modules[__name__]
|
||||||
for label, fn_name in CHECK_NAMES:
|
for label, fn_name in CHECK_NAMES:
|
||||||
|
|
|
||||||
|
|
@ -1,35 +1,50 @@
|
||||||
"""Sandbox Docker network 管理。
|
"""Sandbox Docker network 管理。
|
||||||
|
|
||||||
`zcbot-sandbox-net` 是 `--internal` bridge:
|
`zcbot-sandbox-net` 是 docker bridge,**默有 outbound NAT**(走 host 默 bridge 路由)。
|
||||||
- 默认无 outbound(Docker bridge 移除 host NAT 路由)
|
sandbox 容器同接此 net + iptables OUTPUT 红线段 DROP(init.sh)挡 cloud metadata /
|
||||||
- 同网络下容器之间默认隔离(Docker bridge 默认行为,internal 也成立)
|
loopback / 内网 / PG IP。
|
||||||
|
|
||||||
Step 2 起即用 `--internal`,iptables OUTPUT blocklist(init.sh 里的)作为 defense-in-depth
|
**dogfood 阶段**(当前):容器可访问公网(让模型能 `pip install` / `curl` 公开域名),
|
||||||
(网络层已堵死,iptables 仍按 §7.5 #1 协议加规则,任一缺失视为部署未完成)。
|
iptables 仍挡内网 + cloud metadata。
|
||||||
|
|
||||||
Step 4 引入 egress proxy 时:proxy 容器同接 `zcbot-sandbox-net`(从内部网到 proxy 容器
|
**外部用户开放时**(§7.7 Stage C Step 4,DESIGN §7.5 #2):
|
||||||
保持联通),proxy 容器再走 host 默认网出网。sandbox 容器 env `HTTP_PROXY` 指向
|
network 改 `--internal`(完全禁 outbound)+ 起 zcbot-proxy 容器接此 net + sandbox
|
||||||
proxy 容器名 + iptables 加 ACCEPT 例外,实现"默认 deny + 仅经 proxy"。
|
容器 env `HTTP_PROXY` 指向 proxy + proxy 做 allowlist / 字节计量 / audit。届时
|
||||||
|
network 从 bridge 改 internal,需手动 rm + recreate(已 running 的容器先全停)。
|
||||||
|
|
||||||
操作幂等:create 前 inspect 探测,已存在直接返。
|
操作幂等:create 前 inspect 探测,已存在直接返;若已存在但 Internal=true(上一版
|
||||||
|
遗留),打 warn 提示 ── 不自动 rm 避免破坏现有连着的容器(详 RUN.md "Sandbox
|
||||||
|
网络迁移"段)。
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
import subprocess
|
import subprocess
|
||||||
|
|
||||||
NETWORK_NAME = "zcbot-sandbox-net"
|
NETWORK_NAME = "zcbot-sandbox-net"
|
||||||
|
|
||||||
|
|
||||||
def ensure_network() -> None:
|
def ensure_network() -> None:
|
||||||
"""创建 `zcbot-sandbox-net`(若不存在)。失败 raise。"""
|
"""创建 `zcbot-sandbox-net`(若不存在);若已存在且 Internal=True 仅 warn。失败 raise。"""
|
||||||
inspect = subprocess.run(
|
inspect = subprocess.run(
|
||||||
["docker", "network", "inspect", NETWORK_NAME],
|
["docker", "network", "inspect", NETWORK_NAME],
|
||||||
capture_output=True, text=True,
|
capture_output=True, text=True,
|
||||||
)
|
)
|
||||||
if inspect.returncode == 0:
|
if inspect.returncode == 0:
|
||||||
|
# 已存在 ── 检测 Internal 属性,若 true 给迁移提示
|
||||||
|
try:
|
||||||
|
data = json.loads(inspect.stdout)
|
||||||
|
if data and isinstance(data, list) and data[0].get("Internal") is True:
|
||||||
|
print(
|
||||||
|
f"[warn] network {NETWORK_NAME} is --internal (legacy);"
|
||||||
|
f" sandbox 容器将无法 outbound。手动 `docker network rm {NETWORK_NAME}`"
|
||||||
|
f" 后重启 web,会自动 recreate 为非 internal(详 RUN.md)"
|
||||||
|
)
|
||||||
|
except (json.JSONDecodeError, IndexError, AttributeError):
|
||||||
|
pass
|
||||||
return
|
return
|
||||||
r = subprocess.run(
|
r = subprocess.run(
|
||||||
["docker", "network", "create", "--internal", NETWORK_NAME],
|
["docker", "network", "create", NETWORK_NAME],
|
||||||
capture_output=True, text=True,
|
capture_output=True, text=True,
|
||||||
)
|
)
|
||||||
if r.returncode != 0:
|
if r.returncode != 0:
|
||||||
|
|
|
||||||
|
|
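Review note: the legacy-network check in `ensure_network()` comes down to reading one field from the JSON array that `docker network inspect` prints. The parsing step can be sketched in isolation as below. `is_internal_network` and the two sample strings are hypothetical names for illustration, not part of this commit.

```python
import json
from typing import Any


def is_internal_network(inspect_stdout: str) -> bool:
    """Return True only when the first inspected network reports Internal=true."""
    try:
        data: Any = json.loads(inspect_stdout)
    except json.JSONDecodeError:
        return False  # unparseable output: treat as "not known to be internal"
    if not isinstance(data, list) or not data:
        return False
    first = data[0]
    return isinstance(first, dict) and first.get("Internal") is True


# Sample payloads (assumed shape: a JSON array with one object per network).
legacy = '[{"Name": "zcbot-sandbox-net", "Internal": true}]'
current = '[{"Name": "zcbot-sandbox-net", "Internal": false}]'
print(is_internal_network(legacy), is_internal_network(current))
```

Keeping the parse in a pure function like this would let the warn path be unit-tested without a Docker daemon.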
@ -31,7 +31,7 @@ import subprocess
|
||||||
import threading
|
import threading
|
||||||
import time
|
import time
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Dict, List, Optional
|
from typing import Any, Dict, List, Optional
|
||||||
from uuid import UUID
|
from uuid import UUID
|
||||||
|
|
||||||
from .network import NETWORK_NAME, ensure_network
|
from .network import NETWORK_NAME, ensure_network
|
||||||
|
|
@ -45,6 +45,11 @@ LABEL_USER_ID_KEY = "zcbot.user_id"
|
||||||
DEFAULT_IMAGE = "zcbot-sandbox:latest"
|
DEFAULT_IMAGE = "zcbot-sandbox:latest"
|
||||||
DEFAULT_IDLE_TTL_SECONDS = 300
|
DEFAULT_IDLE_TTL_SECONDS = 300
|
||||||
|
|
||||||
|
# 容器资源限制默认值(可被 yaml `sandbox.*` / env override,详 SandboxPool ctor)
|
||||||
|
DEFAULT_MEMORY = "2g"
|
||||||
|
DEFAULT_CPUS = "1.0"
|
||||||
|
DEFAULT_PIDS_LIMIT = 256
|
||||||
|
|
||||||
|
|
||||||
def container_name(user_id: UUID) -> str:
|
def container_name(user_id: UUID) -> str:
|
||||||
return f"{CONTAINER_NAME_PREFIX}{user_id}"
|
return f"{CONTAINER_NAME_PREFIX}{user_id}"
|
||||||
|
|
@ -81,6 +86,9 @@ class SandboxPool:
|
||||||
runtime: Optional[str] = None,
|
runtime: Optional[str] = None,
|
||||||
idle_ttl: Optional[int] = None,
|
idle_ttl: Optional[int] = None,
|
||||||
pg_ips: Optional[str] = None,
|
pg_ips: Optional[str] = None,
|
||||||
|
memory: Optional[str] = None,
|
||||||
|
cpus: Optional[str] = None,
|
||||||
|
pids_limit: Optional[int] = None,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""
|
"""
|
||||||
user_root_base: per-user 子树父目录,典型 `<workspace>/users`。bind mount 源
|
user_root_base: per-user 子树父目录,典型 `<workspace>/users`。bind mount 源
|
||||||
|
|
@ -98,6 +106,10 @@ class SandboxPool:
|
||||||
(env `ZCBOT_SANDBOX_IDLE_TTL`,默 300)
|
(env `ZCBOT_SANDBOX_IDLE_TTL`,默 300)
|
||||||
pg_ips: 逗号分隔的 PG IP 串,塞容器 `ZCBOT_PG_IPS` env,init.sh 加 DROP 规则
|
pg_ips: 逗号分隔的 PG IP 串,塞容器 `ZCBOT_PG_IPS` env,init.sh 加 DROP 规则
|
||||||
(env `ZCBOT_PG_IPS`)。defense-in-depth ── 即便落内网三段。
|
(env `ZCBOT_PG_IPS`)。defense-in-depth ── 即便落内网三段。
|
||||||
|
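memory/cpus/pids_limit:
|
||||||
|
Container resource limits, default 2g/1.0/256. Precedence: env (`ZCBOT_SANDBOX_MEMORY` etc.)
|
||||||
|
> caller argument > default. A change takes effect after restarting web and applies only to
|
||||||
|
newly started containers; a running container keeps its limits until it is reaped after the idle TTL (default 300 s).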
|
||||||
"""
|
"""
|
||||||
self.user_root_base = user_root_base
|
self.user_root_base = user_root_base
|
||||||
self.repo_root = repo_root
|
self.repo_root = repo_root
|
||||||
|
|
@ -107,6 +119,13 @@ class SandboxPool:
|
||||||
os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
|
os.getenv("ZCBOT_SANDBOX_IDLE_TTL", str(DEFAULT_IDLE_TTL_SECONDS))
|
||||||
)
|
)
|
||||||
self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
|
self.pg_ips = pg_ips if pg_ips is not None else os.getenv("ZCBOT_PG_IPS", "")
|
||||||
|
# 资源限制:env > caller > 默
|
||||||
|
self.memory = os.getenv("ZCBOT_SANDBOX_MEMORY") or memory or DEFAULT_MEMORY
|
||||||
|
self.cpus = os.getenv("ZCBOT_SANDBOX_CPUS") or cpus or DEFAULT_CPUS
|
||||||
|
self.pids_limit = int(
|
||||||
|
os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT")
|
||||||
|
or (pids_limit if pids_limit is not None else DEFAULT_PIDS_LIMIT)
|
||||||
|
)
|
||||||
self._dict_lock = threading.Lock() # 保护 _locks / _last_active 的字典级 race
|
self._dict_lock = threading.Lock() # 保护 _locks / _last_active 的字典级 race
|
||||||
self._locks: Dict[UUID, threading.Lock] = {}
|
self._locks: Dict[UUID, threading.Lock] = {}
|
||||||
self._last_active: Dict[UUID, int] = {}
|
self._last_active: Dict[UUID, int] = {}
|
||||||
|
|
@ -151,9 +170,9 @@ class SandboxPool:
|
||||||
"--cap-drop=ALL", # 默全丢
|
"--cap-drop=ALL", # 默全丢
|
||||||
"--cap-add=NET_ADMIN", # init.sh 配 iptables 需要;exec 进来的 uid 1000 拿不到
|
"--cap-add=NET_ADMIN", # init.sh 配 iptables 需要;exec 进来的 uid 1000 拿不到
|
||||||
"--security-opt=no-new-privileges",
|
"--security-opt=no-new-privileges",
|
||||||
"--pids-limit=256",
|
f"--pids-limit={self.pids_limit}",
|
||||||
"--memory=2g",
|
f"--memory={self.memory}",
|
||||||
"--cpus=1.0",
|
f"--cpus={self.cpus}",
|
||||||
"-v", f"{user_root}:/workspace",
|
"-v", f"{user_root}:/workspace",
|
||||||
"-e", f"ZCBOT_PG_IPS={self.pg_ips}",
|
"-e", f"ZCBOT_PG_IPS={self.pg_ips}",
|
||||||
"--restart=no",
|
"--restart=no",
|
||||||
|
|
@ -219,15 +238,28 @@ class SandboxPool:
|
||||||
|
|
||||||
|
|
||||||
def setup_pool(
|
def setup_pool(
|
||||||
user_root_base: Path, repo_root: Optional[Path] = None
|
user_root_base: Path,
|
||||||
|
repo_root: Optional[Path] = None,
|
||||||
|
sandbox_cfg: Optional[Dict[str, object]] = None,
|
||||||
) -> SandboxPool:
|
) -> SandboxPool:
|
||||||
"""app 启动便捷入口:ensure 网络存在 + 返回 pool 实例。
|
"""app 启动便捷入口:ensure 网络存在 + 返回 pool 实例。
|
||||||
|
|
||||||
|
`sandbox_cfg` 是 agent.yaml 的 `sandbox` 段(dict),含 memory/cpus/pids_limit;
|
||||||
|
没传走 env / 默认值。env 仍可独立 override(SandboxPool ctor 里处理优先级)。
|
||||||
|
|
||||||
典型用法(lifespan 启动钩子):
|
典型用法(lifespan 启动钩子):
|
||||||
from core.paths import ROOT
|
from core.paths import ROOT
|
||||||
pool = setup_pool(workspace / "users", repo_root=ROOT)
|
cfg = load_config()
|
||||||
|
pool = setup_pool(workspace / "users", repo_root=ROOT,
|
||||||
|
sandbox_cfg=cfg.get("sandbox") or {})
|
||||||
pool.shutdown_all() # 清前驱孤儿
|
pool.shutdown_all() # 清前驱孤儿
|
||||||
# 后台 reaper task 周期跑 pool.reap_idle()
|
|
||||||
"""
|
"""
|
||||||
ensure_network()
|
ensure_network()
|
||||||
return SandboxPool(user_root_base=user_root_base, repo_root=repo_root)
|
cfg = sandbox_cfg or {}
|
||||||
|
return SandboxPool(
|
||||||
|
user_root_base=user_root_base,
|
||||||
|
repo_root=repo_root,
|
||||||
|
memory=cfg.get("memory") if isinstance(cfg.get("memory"), str) else None,
|
||||||
|
cpus=str(cfg["cpus"]) if cfg.get("cpus") is not None else None,
|
||||||
|
pids_limit=int(cfg["pids_limit"]) if cfg.get("pids_limit") is not None else None,
|
||||||
|
)
|
||||||
|
|
|
||||||
|
|
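Review note: the precedence described for `SandboxPool` (env > yaml > default) can be sketched as one pure function. This is a simplified illustration, not the commit's code: `resolve_limits`, `DEFAULTS` and `ENV_NAMES` are hypothetical names, and every value is kept as a string here, whereas the ctor in the diff converts `pids_limit` to `int`.

```python
from typing import Dict, Mapping, Optional

# Defaults and env variable names as they appear in this diff.
DEFAULTS: Dict[str, str] = {"memory": "2g", "cpus": "1.0", "pids_limit": "256"}
ENV_NAMES: Dict[str, str] = {
    "memory": "ZCBOT_SANDBOX_MEMORY",
    "cpus": "ZCBOT_SANDBOX_CPUS",
    "pids_limit": "ZCBOT_SANDBOX_PIDS_LIMIT",
}


def resolve_limits(
    env: Mapping[str, str],
    yaml_cfg: Optional[Mapping[str, object]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: take each limit from env, then yaml, then the default."""
    cfg = yaml_cfg or {}
    resolved: Dict[str, str] = {}
    for key, default in DEFAULTS.items():
        env_value = env.get(ENV_NAMES[key])
        yaml_value = cfg.get(key)
        if env_value:  # an empty env var counts as unset, like `os.getenv(...) or ...`
            resolved[key] = env_value
        elif yaml_value is not None:
            resolved[key] = str(yaml_value)
        else:
            resolved[key] = default
    return resolved


print(resolve_limits({"ZCBOT_SANDBOX_MEMORY": "4g"}, {"memory": "3g", "cpus": 2}))
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the precedence rule testable without touching process state.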
@ -0,0 +1,202 @@
|
||||||
|
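"""Per-user working-directory quota (§7.5 #4 soft quota, application-layer gate).
|
||||||
|
|
||||||
|
Entry points:
|
||||||
|
- `scan_user_dir(user_root) -> (bytes, count)` ── sums file sizes (os.scandir at the top level, os.walk below); skips top-level `.zcbot_tmp` / `.memory` and files whose stat fails
|
||||||
|
- `upsert_user_usage(user_id, bytes, count)` ── writes the row to the user_disk_usage table
|
||||||
|
- `check_disk_quota(user_id, limit_bytes) -> Optional[str]` ── checked before a write; returns None to allow or
|
||||||
|
a str with the rejection reason. `limit_bytes <= 0` short-circuits to allow (unlimited)
|
||||||
|
- `scan_all_users(user_root_base)` ── run periodically by the lifespan background task;
|
||||||
|
users are scanned one after another to avoid an IO storm
|
||||||
|
|
||||||
|
Byte-size parsing (yaml `disk_bytes_per_user`):
|
||||||
|
- integer bytes, or "5gb" / "500mb" / "1.5g" style values with a case-insensitive suffix
|
||||||
|
- returns None on failure; the caller treats that as unlimited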
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Iterable, List, Optional, Tuple
|
||||||
|
from uuid import UUID
|
||||||
|
|
||||||
|
from sqlalchemy import select
|
||||||
|
from sqlalchemy.dialects.postgresql import insert as pg_insert
|
||||||
|
|
||||||
|
from .engine import session_scope
|
||||||
|
from .models import UserDiskUsage
|
||||||
|
|
||||||
|
|
||||||
|
# yaml 字节解析:5gb / 500mb / 1024 / 1.5g
|
||||||
|
_SIZE_RE = re.compile(r"^\s*([\d.]+)\s*([kmgt]?b?)?\s*$", re.IGNORECASE)
|
||||||
|
_UNIT_FACTORS = {
|
||||||
|
"": 1, "b": 1,
|
||||||
|
"k": 1024, "kb": 1024,
|
||||||
|
"m": 1024 ** 2, "mb": 1024 ** 2,
|
||||||
|
"g": 1024 ** 3, "gb": 1024 ** 3,
|
||||||
|
"t": 1024 ** 4, "tb": 1024 ** 4,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def parse_bytes(value) -> Optional[int]:
|
||||||
|
"""yaml 字节值 → int;无法解析返 None。"""
|
||||||
|
if value is None:
|
||||||
|
return None
|
||||||
|
if isinstance(value, int) and not isinstance(value, bool):
|
||||||
|
return value
|
||||||
|
if not isinstance(value, str):
|
||||||
|
return None
|
||||||
|
m = _SIZE_RE.match(value)
|
||||||
|
if not m:
|
||||||
|
return None
|
||||||
|
num_s, unit_s = m.group(1), (m.group(2) or "").lower()
|
||||||
|
factor = _UNIT_FACTORS.get(unit_s)
|
||||||
|
if factor is None:
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
return int(float(num_s) * factor)
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# 扫描跳过的 dotfile 顶层名(节省 IO,且 /v1/files API 也隐藏)
|
||||||
|
_SKIP_TOPLEVEL = frozenset({".zcbot_tmp", ".memory"})
|
||||||
|
|
||||||
|
|
||||||
|
def scan_user_dir(user_root: Path) -> Tuple[int, int]:
|
||||||
|
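"""Sum the sizes of all files under user_root and return (bytes, count).
|
||||||
|
|
||||||
|
Top-level .zcbot_tmp / .memory are skipped (dev-time temp files and user memory dotfiles do not count toward the product quota);
|
||||||
|
symlinks are never followed, so a symlink loop cannot hang the scan.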
|
||||||
|
"""
|
||||||
|
if not user_root.exists() or not user_root.is_dir():
|
||||||
|
return 0, 0
|
||||||
|
|
||||||
|
total_bytes = 0
|
||||||
|
total_count = 0
|
||||||
|
try:
|
||||||
|
for entry in os.scandir(user_root):
|
||||||
|
if entry.name in _SKIP_TOPLEVEL:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
if entry.is_file(follow_symlinks=False):
|
||||||
|
try:
|
||||||
|
total_bytes += entry.stat(follow_symlinks=False).st_size
|
||||||
|
total_count += 1
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
elif entry.is_dir(follow_symlinks=False):
|
||||||
|
sub_b, sub_c = _walk_dir(Path(entry.path))
|
||||||
|
total_bytes += sub_b
|
||||||
|
total_count += sub_c
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
return total_bytes, total_count
|
||||||
|
|
||||||
|
|
||||||
|
def _walk_dir(d: Path) -> Tuple[int, int]:
|
||||||
|
total_b, total_c = 0, 0
|
||||||
|
for root, dirs, files in os.walk(d, followlinks=False, onerror=lambda _e: None):
|
||||||
|
for f in files:
|
||||||
|
try:
|
||||||
|
st = os.stat(os.path.join(root, f), follow_symlinks=False)
|
||||||
|
total_b += st.st_size
|
||||||
|
total_c += 1
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
return total_b, total_c
|
||||||
|
|
||||||
|
|
||||||
|
def upsert_user_usage(user_id: UUID, bytes_used: int, file_count: int) -> None:
|
||||||
|
"""落 user_disk_usage 单行;首次 INSERT,后续 UPDATE。"""
|
||||||
|
from sqlalchemy import func
|
||||||
|
with session_scope() as s:
|
||||||
|
stmt = pg_insert(UserDiskUsage).values(
|
||||||
|
user_id=user_id,
|
||||||
|
bytes_used=bytes_used,
|
||||||
|
file_count=file_count,
|
||||||
|
).on_conflict_do_update(
|
||||||
|
index_elements=["user_id"],
|
||||||
|
set_={
|
||||||
|
"bytes_used": bytes_used,
|
||||||
|
"file_count": file_count,
|
||||||
|
"scanned_at": func.now(),
|
||||||
|
},
|
||||||
|
)
|
||||||
|
s.execute(stmt)
|
||||||
|
|
||||||
|
|
||||||
|
def get_user_usage(user_id: UUID) -> Optional[Tuple[int, int]]:
|
||||||
|
"""读最近一次扫描结果 (bytes, count);无记录返 None。"""
|
||||||
|
with session_scope() as s:
|
||||||
|
row = s.execute(
|
||||||
|
select(UserDiskUsage.bytes_used, UserDiskUsage.file_count)
|
||||||
|
.where(UserDiskUsage.user_id == user_id)
|
||||||
|
).first()
|
||||||
|
if row is None:
|
||||||
|
return None
|
||||||
|
return int(row[0]), int(row[1])
|
||||||
|
|
||||||
|
|
||||||
|
def check_disk_quota(user_id: UUID, limit_bytes: int) -> Optional[str]:
|
||||||
|
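"""Pre-write gate: returns an error message (worded for the LLM to read directly) when over quota, None to allow.
|
||||||
|
|
||||||
|
`limit_bytes <= 0` short-circuits to allow (unlimited). A user with no scan record yet is also allowed,
|
||||||
|
so writes are not blocked during cold start; the gate applies once a scan (at startup, then every 15 min by default) has recorded usage.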
|
||||||
|
"""
|
||||||
|
if limit_bytes <= 0:
|
||||||
|
return None
|
||||||
|
usage = get_user_usage(user_id)
|
||||||
|
if usage is None:
|
||||||
|
return None # 首次,放行,首次扫描后下次 gate 才生效
|
||||||
|
used, _ = usage
|
||||||
|
if used >= limit_bytes:
|
||||||
|
used_mb = used / (1024 ** 2)
|
||||||
|
limit_mb = limit_bytes / (1024 ** 2)
|
||||||
|
return (
|
||||||
|
f"[Error] 已达磁盘配额上限({used_mb:.1f} MB / {limit_mb:.1f} MB);"
|
||||||
|
f"清理旧产物或联系管理员升配后重试"
|
||||||
|
)
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def list_user_ids_with_root(user_root_base: Path) -> List[UUID]:
|
||||||
|
"""扫 user_root_base 子目录,返合法 UUID 列表(=有 workspace 子目录的 user)。
|
||||||
|
|
||||||
|
不去 DB 查 users 全表 —— 有些 user 可能从未发消息(无 workspace 目录),无 disk 占用,
|
||||||
|
无需 upsert 占位行。
|
||||||
|
"""
|
||||||
|
if not user_root_base.is_dir():
|
||||||
|
return []
|
||||||
|
out: List[UUID] = []
|
||||||
|
try:
|
||||||
|
for entry in os.scandir(user_root_base):
|
||||||
|
if not entry.is_dir(follow_symlinks=False):
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
out.append(UUID(entry.name))
|
||||||
|
except ValueError:
|
||||||
|
continue
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def scan_all_users(user_root_base: Path) -> int:
|
||||||
|
"""扫所有 user 落库,返扫描的 user 数。lifespan 后台 task 调。
|
||||||
|
|
||||||
|
串行(per user 跑完下一个)避免 IO 风暴;单 user 几秒(几百 MB 量级),N user 总耗时
|
||||||
|
线性。失败的 user 静默跳过,下次周期再试。
|
||||||
|
"""
|
||||||
|
count = 0
|
||||||
|
for uid in list_user_ids_with_root(user_root_base):
|
||||||
|
try:
|
||||||
|
b, c = scan_user_dir(user_root_base / str(uid))
|
||||||
|
upsert_user_usage(uid, b, c)
|
||||||
|
count += 1
|
||||||
|
except Exception:
|
||||||
|
# 单 user 扫挂不阻塞其他 user;下次周期重试。日志靠 caller 注入。
|
||||||
|
pass
|
||||||
|
return count
|
||||||
|
|
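Review note: the gate in `check_disk_quota` makes three decisions in order: unlimited, no scan record yet, over quota. A minimal sketch of that logic with the DB lookup factored out follows. `quota_decision` is a hypothetical name, and the message text is an English stand-in for the one in the diff.

```python
from typing import Optional

MIB = 1024 ** 2


def quota_decision(used_bytes: Optional[int], limit_bytes: int) -> Optional[str]:
    """Return None to allow the write, or an error string to reject it."""
    if limit_bytes <= 0:
        return None  # a non-positive limit means "unlimited"
    if used_bytes is None:
        return None  # no scan recorded yet: allow rather than block cold-start writes
    if used_bytes >= limit_bytes:
        return (
            f"[Error] disk quota reached "
            f"({used_bytes / MIB:.1f} MB / {limit_bytes / MIB:.1f} MB)"
        )
    return None


limit = 5 * 1024 ** 3
for used in (None, limit - 1, limit):
    print(used, "->", quota_decision(used, limit))
```

Because the check reads the last scanned value rather than live disk usage, a write that lands between two scans can still push a user slightly over the limit, which is the race the commit message accepts.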
@ -20,6 +20,7 @@ from typing import Any, Optional
|
||||||
from uuid import UUID, uuid4
|
from uuid import UUID, uuid4
|
||||||
|
|
||||||
from sqlalchemy import (
|
from sqlalchemy import (
|
||||||
|
BigInteger,
|
||||||
DateTime,
|
DateTime,
|
||||||
ForeignKey,
|
ForeignKey,
|
||||||
Integer,
|
Integer,
|
||||||
|
|
@ -137,3 +138,28 @@ class UsageEvent(Base):
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class UserDiskUsage(Base):
|
||||||
|
"""per-user 工作目录字节使用快照(0008,§7.5 #4 软配额表)。
|
||||||
|
|
||||||
|
每个 user_id 单行 upsert,lifespan 后台 task 周期(默 15min)扫描 user_root 落库;
|
||||||
|
write 前 gate(DockerExecutor / /v1/files/upload)查这表对比 yaml `quotas.disk_bytes_per_user`,
|
||||||
|
超额返 [Error] 硬阻。
|
||||||
|
|
||||||
|
扫描间隙写入会突破上限一点(race-tolerant,跟 image/video 配额一致接受);外部用户
|
||||||
|
开放前 OS 层 xfs prjquota 兜底真上限。详 DESIGN §7.5 #4 / PROGRESS。
|
||||||
|
"""
|
||||||
|
|
||||||
|
__tablename__ = "user_disk_usage"
|
||||||
|
|
||||||
|
user_id: Mapped[UUID] = mapped_column(
|
||||||
|
PG_UUID(as_uuid=True),
|
||||||
|
ForeignKey("users.user_id", ondelete="CASCADE"),
|
||||||
|
primary_key=True,
|
||||||
|
)
|
||||||
|
bytes_used: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)
|
||||||
|
file_count: Mapped[int] = mapped_column(Integer, nullable=False, default=0)
|
||||||
|
scanned_at: Mapped[datetime] = mapped_column(
|
||||||
|
DateTime(timezone=True), server_default=func.now(), nullable=False
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,44 @@
|
||||||
|
"""user_disk_usage 表(§7.5 #4 软配额).
|
||||||
|
|
||||||
|
Revision ID: 0008
|
||||||
|
Revises: 0007
|
||||||
|
Create Date: 2026-05-27
|
||||||
|
|
||||||
|
per-user 工作目录字节使用快照,lifespan 后台 task 周期(默 15min)扫描 user_root 落库;
|
||||||
|
write 前 gate(DockerExecutor / /v1/files/upload)查这表对比 yaml `quotas.disk_bytes_per_user`,
|
||||||
|
超额返 [Error] 硬阻。
|
||||||
|
|
||||||
|
扫描间隙写入轻微突破上限接受(race-tolerant,跟 image/video 配额一致);外部用户开放前
|
||||||
|
OS 层 xfs prjquota 兜底真上限(§7.5 #4)。
|
||||||
|
"""
|
||||||
|
from typing import Sequence, Union
|
||||||
|
|
||||||
|
import sqlalchemy as sa
|
||||||
|
from alembic import op
|
||||||
|
|
||||||
|
|
||||||
|
revision: str = "0008"
|
||||||
|
down_revision: Union[str, None] = "0007"
|
||||||
|
branch_labels: Union[str, Sequence[str], None] = None
|
||||||
|
depends_on: Union[str, Sequence[str], None] = None
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade() -> None:
|
||||||
|
op.create_table(
|
||||||
|
"user_disk_usage",
|
||||||
|
sa.Column("user_id", sa.UUID(as_uuid=True), nullable=False),
|
||||||
|
sa.Column("bytes_used", sa.BigInteger(), nullable=False,
|
||||||
|
server_default=sa.text("0")),
|
||||||
|
sa.Column("file_count", sa.Integer(), nullable=False,
|
||||||
|
server_default=sa.text("0")),
|
||||||
|
sa.Column("scanned_at", sa.DateTime(timezone=True), nullable=False,
|
||||||
|
server_default=sa.func.now()),
|
||||||
|
sa.PrimaryKeyConstraint("user_id"),
|
||||||
|
sa.ForeignKeyConstraint(
|
||||||
|
["user_id"], ["users.user_id"], ondelete="CASCADE",
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def downgrade() -> None:
|
||||||
|
op.drop_table("user_disk_usage")
|
||||||
|
|
@ -61,6 +61,13 @@ RUN pip install --no-cache-dir \
|
||||||
-r /tmp/requirements.txt \
|
-r /tmp/requirements.txt \
|
||||||
&& rm /tmp/requirements.txt
|
&& rm /tmp/requirements.txt
|
||||||
|
|
||||||
|
# 持久化 pip 源到 /etc/pip.conf ── 让运行时模型用 `pip install foo` 也走 mirror,
|
||||||
|
# 不只 build 时。zcbot user / root 都吃这个 global 配置。
|
||||||
|
RUN printf '[global]\nindex-url = %s\ntimeout = 60\n%s\n' \
|
||||||
|
"${PIP_INDEX_URL}" \
|
||||||
|
"${PIP_TRUSTED_HOST:+trusted-host = ${PIP_TRUSTED_HOST}}" \
|
||||||
|
> /etc/pip.conf
|
||||||
|
|
||||||
# Node + mermaid-cli + Chromium ── proposal / patent skill 渲 mermaid 图必备
|
# Node + mermaid-cli + Chromium ── proposal / patent skill 渲 mermaid 图必备
|
||||||
# 镜像膨胀约 +400MB,接受成本(ASCII fallback 出 docx 没图不能用)
|
# 镜像膨胀约 +400MB,接受成本(ASCII fallback 出 docx 没图不能用)
|
||||||
# Debian bookworm 自带 nodejs 18.x + chromium,够新;不走 NodeSource repo 减一步外网
|
# Debian bookworm 自带 nodejs 18.x + chromium,够新;不走 NodeSource repo 减一步外网
|
||||||
|
|
@ -82,6 +89,10 @@ RUN npm config set registry ${NPM_REGISTRY} \
|
||||||
&& npm install -g @mermaid-js/mermaid-cli@latest \
|
&& npm install -g @mermaid-js/mermaid-cli@latest \
|
||||||
&& npm cache clean --force
|
&& npm cache clean --force
|
||||||
|
|
||||||
|
# 持久化 npm 源到 /etc/npmrc ── 让运行时模型用 `npm install bar` 也走 mirror,
|
||||||
|
# 不只 build 时。zcbot user 跑 npm 也吃这个 global 配置(优先级:proj > user > global)。
|
||||||
|
RUN printf 'registry=%s\n' "${NPM_REGISTRY}" > /etc/npmrc
|
||||||
|
|
||||||
# 容器内 puppeteer 启动 chromium 必备:no-sandbox(容器已 hardening 不需要 chromium 自家
|
# 容器内 puppeteer 启动 chromium 必备:no-sandbox(容器已 hardening 不需要 chromium 自家
|
||||||
# sandbox 再叠一层 setuid)、disable-setuid-sandbox(同上)、disable-dev-shm-usage
|
# sandbox 再叠一层 setuid)、disable-setuid-sandbox(同上)、disable-dev-shm-usage
|
||||||
# (容器 /dev/shm 默 64MB 不够 chromium,让它走 /tmp)
|
# (容器 /dev/shm 默 64MB 不够 chromium,让它走 /tmp)
|
||||||
|
|
|
||||||
|
|
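Review note: to sanity-check what the `printf` above writes to `/etc/pip.conf`, the same content can be rendered and parsed in Python. This is a sketch under assumptions: `render_pip_conf` is a hypothetical helper, the mirror URL is a placeholder, and unlike the `printf` it omits the trailing blank line when no trusted host is set.

```python
import configparser


def render_pip_conf(index_url: str, trusted_host: str = "") -> str:
    """Build a [global] pip config with an index URL, a timeout, and an optional trusted host."""
    lines = ["[global]", f"index-url = {index_url}", "timeout = 60"]
    if trusted_host:
        lines.append(f"trusted-host = {trusted_host}")
    return "\n".join(lines) + "\n"


text = render_pip_conf("https://mirror.example/simple")  # placeholder mirror URL
parsed = configparser.ConfigParser(interpolation=None)
parsed.read_string(text)
print(dict(parsed["global"]))
```

pip reads its config files in INI format, so a round trip through `configparser` is a cheap way to catch a malformed line before rebuilding the image.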
@ -0,0 +1,85 @@
|
||||||
|
"""disk_quota.py 单元测试。
|
||||||
|
|
||||||
|
不连真 DB ── parse_bytes / scan_user_dir / skip dotfile 行为 / 不存在路径,
|
||||||
|
都纯 Python 文件系统操作可单测。upsert / check_disk_quota 需要 DB,跳过(集成测覆盖)。
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import sys
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
|
||||||
|
|
||||||
|
from core.storage.disk_quota import parse_bytes, scan_user_dir
|
||||||
|
|
||||||
|
|
||||||
|
class TestParseBytes(unittest.TestCase):
|
||||||
|
def test_int_passthrough(self):
|
||||||
|
self.assertEqual(parse_bytes(1024), 1024)
|
||||||
|
self.assertEqual(parse_bytes(0), 0)
|
||||||
|
|
||||||
|
def test_gb(self):
|
||||||
|
self.assertEqual(parse_bytes("5gb"), 5 * 1024 ** 3)
|
||||||
|
self.assertEqual(parse_bytes("5g"), 5 * 1024 ** 3)
|
||||||
|
self.assertEqual(parse_bytes("5GB"), 5 * 1024 ** 3)
|
||||||
|
|
||||||
|
def test_mb(self):
|
||||||
|
self.assertEqual(parse_bytes("500mb"), 500 * 1024 ** 2)
|
||||||
|
self.assertEqual(parse_bytes("500m"), 500 * 1024 ** 2)
|
||||||
|
|
||||||
|
def test_kb(self):
|
||||||
|
self.assertEqual(parse_bytes("1kb"), 1024)
|
||||||
|
|
||||||
|
def test_bytes(self):
|
||||||
|
self.assertEqual(parse_bytes("1024b"), 1024)
|
||||||
|
self.assertEqual(parse_bytes("1024"), 1024)
|
||||||
|
|
||||||
|
def test_float_suffix(self):
|
||||||
|
self.assertEqual(parse_bytes("1.5gb"), int(1.5 * 1024 ** 3))
|
||||||
|
|
||||||
|
def test_invalid(self):
|
||||||
|
self.assertIsNone(parse_bytes(""))
|
||||||
|
self.assertIsNone(parse_bytes("xxx"))
|
||||||
|
self.assertIsNone(parse_bytes("1.2.3"))
|
||||||
|
self.assertIsNone(parse_bytes(None))
|
||||||
|
|
||||||
|
|
||||||
|
class TestScanUserDir(unittest.TestCase):
|
||||||
|
def test_empty_dir(self):
|
||||||
|
with tempfile.TemporaryDirectory() as d:
|
||||||
|
b, c = scan_user_dir(Path(d))
|
||||||
|
self.assertEqual((b, c), (0, 0))
|
||||||
|
|
||||||
|
def test_nonexistent(self):
|
||||||
|
b, c = scan_user_dir(Path("/nonexistent/path/xxx"))
|
||||||
|
self.assertEqual((b, c), (0, 0))
|
||||||
|
|
||||||
|
def test_count_and_size(self):
|
||||||
|
with tempfile.TemporaryDirectory() as d:
|
||||||
|
root = Path(d)
|
||||||
|
(root / "a.txt").write_bytes(b"hello") # 5
|
||||||
|
(root / "sub").mkdir()
|
||||||
|
(root / "sub" / "b.txt").write_bytes(b"world!") # 6
|
||||||
|
(root / "sub" / "c.txt").write_bytes(b"x" * 1000) # 1000
|
||||||
|
b, c = scan_user_dir(root)
|
||||||
|
self.assertEqual(b, 1011)
|
||||||
|
self.assertEqual(c, 3)
|
||||||
|
|
||||||
|
def test_skip_dotfile_toplevel(self):
|
||||||
|
"""顶层 .zcbot_tmp / .memory 被跳过(开发期临时 + 用户记忆,不算配额)。"""
|
||||||
|
with tempfile.TemporaryDirectory() as d:
|
||||||
|
root = Path(d)
|
||||||
|
(root / "a.txt").write_bytes(b"counted") # 7
|
||||||
|
(root / ".zcbot_tmp").mkdir()
|
||||||
|
(root / ".zcbot_tmp" / "skipped.py").write_bytes(b"x" * 99999)
|
||||||
|
(root / ".memory").mkdir()
|
||||||
|
(root / ".memory" / "core.md").write_bytes(b"x" * 99999)
|
||||||
|
b, c = scan_user_dir(root)
|
||||||
|
self.assertEqual(b, 7)
|
||||||
|
self.assertEqual(c, 1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
53
web/app.py
|
|
@ -501,6 +501,38 @@ def create_app() -> FastAPI:
|
||||||
if result.rowcount:
|
if result.rowcount:
|
||||||
print(f"[startup] reaped {result.rowcount} stale active run(s)")
|
print(f"[startup] reaped {result.rowcount} stale active run(s)")
|
||||||
|
|
||||||
|
# 磁盘配额后台扫描(§7.5 #4 应用层 gate)── 不依赖 docker backend,host
|
||||||
|
# backend 也跑(/v1/files/upload 也走配额 gate)。yaml `quotas.disk_scan_interval_seconds`
|
||||||
|
# 默 900s = 15min;limit_bytes ≤ 0 视为不限,scan 仍跑(用量统计有用),check 短路放行。
|
||||||
|
from core.agent_builder import resolve_workspace
|
||||||
|
from core.storage.disk_quota import parse_bytes, scan_all_users
|
||||||
|
workspace = resolve_workspace(None, _cfg)
|
||||||
|
disk_user_root = workspace / "users"
|
||||||
|
quotas_cfg = _cfg.get("quotas") or {}
|
||||||
|
disk_scan_interval = int(quotas_cfg.get("disk_scan_interval_seconds") or 900)
|
||||||
|
|
||||||
|
async def _disk_scanner() -> None:
|
||||||
|
loop = asyncio.get_running_loop()
|
||||||
|
# 启动时跑一次,后续按 interval。首次扫完 check 才能命中。
|
||||||
|
try:
|
||||||
|
n = await loop.run_in_executor(None, scan_all_users, disk_user_root)
|
||||||
|
if n:
|
||||||
|
print(f"[disk_scanner] initial scan: {n} user(s)")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[disk_scanner] initial scan error: {type(e).__name__}: {e}")
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
await asyncio.sleep(disk_scan_interval)
|
||||||
|
n = await loop.run_in_executor(None, scan_all_users, disk_user_root)
|
||||||
|
if n:
|
||||||
|
print(f"[disk_scanner] scanned {n} user(s)")
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[disk_scanner] error: {type(e).__name__}: {e}")
|
||||||
|
|
||||||
|
disk_scanner_task = asyncio.create_task(_disk_scanner(), name="disk-scanner")
|
||||||
|
|
||||||
# Sandbox pool(§7.5):仅当 ZCBOT_SANDBOX_BACKEND=docker 时启用。
|
# Sandbox pool(§7.5):仅当 ZCBOT_SANDBOX_BACKEND=docker 时启用。
|
||||||
# 启动钩子:① init_pool(创建 docker network + pool 实例)② shutdown_all 清
|
# 启动钩子:① init_pool(创建 docker network + pool 实例)② shutdown_all 清
|
||||||
# 前驱孤儿(上次进程留下的 zcbot-sandbox-* 容器,内存 _last_active 为空,
|
# 前驱孤儿(上次进程留下的 zcbot-sandbox-* 容器,内存 _last_active 为空,
|
||||||
|
|
@ -523,7 +555,11 @@ def create_app() -> FastAPI:
|
||||||
try:
|
try:
|
||||||
# repo_root=ROOT 让 SandboxPool 把 <repo>/skills 只读 mount 进容器
|
# repo_root=ROOT 让 SandboxPool 把 <repo>/skills 只读 mount 进容器
|
||||||
# (fs 工具进容器后 read SKILL references 需要)
|
# (fs 工具进容器后 read SKILL references 需要)
|
||||||
pool = init_pool(user_root_base, repo_root=ROOT)
|
# sandbox_cfg=yaml `sandbox` 段(memory/cpus/pids_limit 可调)
|
||||||
|
pool = init_pool(
|
||||||
|
user_root_base, repo_root=ROOT,
|
||||||
|
sandbox_cfg=_cfg.get("sandbox") or {},
|
||||||
|
)
|
||||||
removed = pool.shutdown_all()
|
removed = pool.shutdown_all()
|
||||||
if removed:
|
if removed:
|
||||||
print(f"[startup] swept {len(removed)} stale sandbox container(s)")
|
print(f"[startup] swept {len(removed)} stale sandbox container(s)")
|
||||||
|
|
@ -552,6 +588,11 @@ def create_app() -> FastAPI:
|
||||||
try:
|
try:
|
||||||
yield
|
yield
|
||||||
finally:
|
finally:
|
||||||
|
disk_scanner_task.cancel()
|
||||||
|
try:
|
||||||
|
await disk_scanner_task
|
||||||
|
except (asyncio.CancelledError, Exception):
|
||||||
|
pass
|
||||||
if sandbox_reaper_task is not None:
|
if sandbox_reaper_task is not None:
|
||||||
sandbox_reaper_task.cancel()
|
sandbox_reaper_task.cancel()
|
||||||
try:
|
try:
|
||||||
|
|
@ -1479,6 +1520,16 @@ def create_app() -> FastAPI:
|
||||||
路径不存在自动 mkdir(parents=True);重名直接覆盖。
|
路径不存在自动 mkdir(parents=True);重名直接覆盖。
|
||||||
文件名严格校验(含 `/ \\ ..` 或为空 → 400)。
|
文件名严格校验(含 `/ \\ ..` 或为空 → 400)。
|
||||||
"""
|
"""
|
||||||
|
# 磁盘配额 gate(§7.5 #4):超额 413 阻止上传,提示 user 清旧产物
|
||||||
|
from core.agent_builder import load_config as _load_cfg
|
||||||
|
from core.storage.disk_quota import check_disk_quota, parse_bytes
|
||||||
|
_quotas_cfg = (_load_cfg().get("quotas") or {})
|
||||||
|
_limit = parse_bytes(_quotas_cfg.get("disk_bytes_per_user"))
|
||||||
|
if _limit is not None and _limit > 0:
|
||||||
|
_err = check_disk_quota(user_id, _limit)
|
||||||
|
if _err is not None:
|
||||||
|
raise HTTPException(413, _err)
|
||||||
|
|
||||||
root = _load_user_root(user_id)
|
root = _load_user_root(user_id)
|
||||||
dest_dir = _safe_join(root, path)
|
dest_dir = _safe_join(root, path)
|
||||||
if dest_dir.exists() and not dest_dir.is_dir():
|
if dest_dir.exists() and not dest_dir.is_dir():
|
||||||
|
|
|
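Review note: `_disk_scanner` follows a common asyncio pattern: run blocking work in the default executor once at startup, repeat on an interval, and let cancellation end the loop at shutdown. A self-contained sketch of that pattern is below. `run_periodically` and `demo` are hypothetical names, and the short interval is only for demonstration.

```python
import asyncio
from typing import Callable, List


async def run_periodically(job: Callable[[], object], interval: float) -> None:
    """Run `job` once immediately, then again after every `interval` seconds."""
    loop = asyncio.get_running_loop()
    while True:
        try:
            # Blocking work goes to the default thread pool so the event loop stays responsive.
            await loop.run_in_executor(None, job)
        except Exception as exc:
            # CancelledError derives from BaseException, so cancellation is not swallowed here.
            print(f"[periodic] error: {type(exc).__name__}: {exc}")
        await asyncio.sleep(interval)


async def demo() -> int:
    calls: List[int] = []
    task = asyncio.create_task(run_periodically(lambda: calls.append(1), 0.01))
    await asyncio.sleep(0.05)
    task.cancel()  # mirrors the lifespan `finally` block
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(calls)


runs = asyncio.run(demo())
print(f"job ran {runs} time(s) before cancellation")
```

Running the job before the first sleep is what gives the "scan once at startup" behaviour; cancelling the task does not interrupt a scan already running in the worker thread, it only stops further iterations.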
||||||