sandbox: yaml sandbox.dns 显式 --dns 注入(绕开 daemon 上游探测)

腾讯云轻量等场景 docker daemon 探测 host systemd-resolved 上游 DNS 不稳,
即使 init.sh ACCEPT 127.0.0.11:53 例外,embedded DNS 自己 forward 不出去
仍跪。显式 docker run --dns 8.8.8.8 --dns 114.114.114.114 直接写容器
/etc/resolv.conf 绕开上游探测。

- agent.yaml 加 sandbox.dns 列表,默 [8.8.8.8, 114.114.114.114]
- SandboxPool 加 dns 字段(env: ZCBOT_SANDBOX_DNS 逗号分隔 override),
  _docker_run 每个 ip 加 --dns flag
- RUN.md 故障兜底 DNS 失败那行补充第二层根因 + 解法

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
caoqianming 2026-05-27 11:28:53 +08:00
parent bbe79ce36d
commit 3a3f8d86cc
3 changed files with 24 additions and 2 deletions

2
RUN.md
View File

@ -483,7 +483,7 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
| `POST /v1/files/upload` 返 413 `已达磁盘配额上限` | per-user 5GB(yaml `quotas.disk_bytes_per_user`)。让用户在 dev SPA 右侧文件栏删旧产物 / 大文件,或改 yaml 升配重启 web | | `POST /v1/files/upload` 返 413 `已达磁盘配额上限` | per-user 5GB(yaml `quotas.disk_bytes_per_user`)。让用户在 dev SPA 右侧文件栏删旧产物 / 大文件,或改 yaml 升配重启 web |
| `[warn] network zcbot-sandbox-net is --internal (legacy)` | 上一版 sandbox network 创建时带了 `--internal`(完全禁 outbound),当前 dogfood 阶段放开。`docker stop $(docker ps -aq -f label=zcbot.product=sandbox) ; docker network rm zcbot-sandbox-net`,重启 web 自动 recreate 为非 internal | | `[warn] network zcbot-sandbox-net is --internal (legacy)` | 上一版 sandbox network 创建时带了 `--internal`(完全禁 outbound),当前 dogfood 阶段放开。`docker stop $(docker ps -aq -f label=zcbot.product=sandbox) ; docker network rm zcbot-sandbox-net`,重启 web 自动 recreate 为非 internal |
| tool write/edit 返 `[Error] 已达磁盘配额上限` | 同 upload 413,见上 | | tool write/edit 返 `[Error] 已达磁盘配额上限` | 同 upload 413,见上 |
| 容器内 `curl https://www.baidu.com``Temporary failure in name resolution` | iptables `127.0.0.0/8 DROP` 把 docker embedded DNS(`127.0.0.11:53`)也挡了;init.sh 已加 `127.0.0.11/32 udp/tcp 53 ACCEPT` 在 DROP 之前。重 build 镜像后,容器 `cat /etc/resolv.conf` 应该是 `nameserver 127.0.0.11`,`getent hosts www.baidu.com` 应该返 IP | | 容器内 `curl https://www.baidu.com``Temporary failure in name resolution` | 两层:① iptables `127.0.0.0/8 DROP` 把 docker embedded DNS(`127.0.0.11:53`)挡了 ── init.sh 已加 `127.0.0.11/32 udp/tcp 53 ACCEPT`;② docker embedded DNS 上游探测 host systemd-resolved 失败(腾讯云轻量等场景常见)── yaml `sandbox.dns` 显式指定 `8.8.8.8 / 114.114.114.114`,docker run `--dns` 直接注入容器 `/etc/resolv.conf`,绕过上游探测路径。重 build 镜像 + `docker rm -f $(docker ps -aq -f label=zcbot.product=sandbox)` 让新 image + 新 dns 配置生效 |
| 启动报 `PLATFORM_KEY env not set` / `JWT_SECRET env not set` | D' 过渡 auth 强制双 env 必填。生成 `python -c "import secrets;print(secrets.token_urlsafe(48))"` 各填一,写 `.env` 重起 | | 启动报 `PLATFORM_KEY env not set` / `JWT_SECRET env not set` | D' 过渡 auth 强制双 env 必填。生成 `python -c "import secrets;print(secrets.token_urlsafe(48))"` 各填一,写 `.env` 重起 |
| `/v1/auth/login_password` 返 403 `invalid email or password` | 邮箱不存在 / `password_hash` 列为空(platform_key 入口建的 user) / 密码错。`SELECT user_id, email, password_hash IS NOT NULL AS has_pw FROM users WHERE email=...` 核对;无行 → `main.py user add`;有行无密码 → `UPDATE users SET password_hash=...`(用 `.venv/Scripts/python.exe -c "from web.auth import hash_password;print(hash_password('xxx'))"` 算)或 `user add --user-id` 接到现有 user_id | | `/v1/auth/login_password` 返 403 `invalid email or password` | 邮箱不存在 / `password_hash` 列为空(platform_key 入口建的 user) / 密码错。`SELECT user_id, email, password_hash IS NOT NULL AS has_pw FROM users WHERE email=...` 核对;无行 → `main.py user add`;有行无密码 → `UPDATE users SET password_hash=...`(用 `.venv/Scripts/python.exe -c "from web.auth import hash_password;print(hash_password('xxx'))"` 算)或 `user add --user-id` 接到现有 user_id |
| `main.py user add``IntegrityError ... uq_users_email` | 邮箱已存在,改 email 或先 `DELETE FROM users WHERE email=...`(先清该 user 的 tasks) | | `main.py user add``IntegrityError ... uq_users_email` | 邮箱已存在,改 email 或先 `DELETE FROM users WHERE email=...`(先清该 user 的 tasks) |

View File

@ -24,4 +24,11 @@ quotas:
sandbox: sandbox:
memory: 2g # --memory (env: ZCBOT_SANDBOX_MEMORY) memory: 2g # --memory (env: ZCBOT_SANDBOX_MEMORY)
cpus: 1.0 # --cpus (env: ZCBOT_SANDBOX_CPUS) cpus: 1.0 # --cpus (env: ZCBOT_SANDBOX_CPUS)
pids_limit: 256 # --pids-limit (env: ZCBOT_SANDBOX_PIDS_LIMIT) pids_limit: 256 # --pids-limit (env: ZCBOT_SANDBOX_PIDS_LIMIT)
# 容器 DNS server 显式配置(docker run --dns,容器 /etc/resolv.conf 直接写,
# 绕过 docker daemon 上游 DNS 探测路径;腾讯云轻量 / 部分云上 daemon 探测
# systemd-resolved 上游会失败,导致 embedded DNS 127.0.0.11 forward 出去也跪)。
# 默公共 DNS,国内访问通畅;留空(`dns: []`)走 docker 默认探测。
dns:
- "8.8.8.8" # Google
- "114.114.114.114" # 国内通用

View File

@ -89,6 +89,7 @@ class SandboxPool:
memory: Optional[str] = None, memory: Optional[str] = None,
cpus: Optional[str] = None, cpus: Optional[str] = None,
pids_limit: Optional[int] = None, pids_limit: Optional[int] = None,
dns: Optional[List[str]] = None,
) -> None: ) -> None:
""" """
user_root_base: per-user 子树父目录,典型 `<workspace>/users`bind mount user_root_base: per-user 子树父目录,典型 `<workspace>/users`bind mount
@ -126,6 +127,12 @@ class SandboxPool:
os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT") os.getenv("ZCBOT_SANDBOX_PIDS_LIMIT")
or (pids_limit if pids_limit is not None else DEFAULT_PIDS_LIMIT) or (pids_limit if pids_limit is not None else DEFAULT_PIDS_LIMIT)
) )
# DNS:env(逗号分隔)> caller > 默空(让 docker 自己探测)
env_dns = os.getenv("ZCBOT_SANDBOX_DNS", "").strip()
if env_dns:
self.dns: List[str] = [x.strip() for x in env_dns.split(",") if x.strip()]
else:
self.dns = list(dns) if dns else []
self._dict_lock = threading.Lock() # 保护 _locks / _last_active 的字典级 race self._dict_lock = threading.Lock() # 保护 _locks / _last_active 的字典级 race
self._locks: Dict[UUID, threading.Lock] = {} self._locks: Dict[UUID, threading.Lock] = {}
self._last_active: Dict[UUID, int] = {} self._last_active: Dict[UUID, int] = {}
@ -177,6 +184,10 @@ class SandboxPool:
"-e", f"ZCBOT_PG_IPS={self.pg_ips}", "-e", f"ZCBOT_PG_IPS={self.pg_ips}",
"--restart=no", "--restart=no",
] ]
# 显式 DNS(绕过 docker daemon 上游探测,腾讯云轻量等场景下 daemon 探测
# host systemd-resolved 上游不稳)
for dns_ip in self.dns:
cmd += ["--dns", dns_ip]
# repo skills 只读 mount ── fs 工具进容器后(read/glob/grep)能 access # repo skills 只读 mount ── fs 工具进容器后(read/glob/grep)能 access
# SKILL.md 内引用的 references/*.md。host 上 zcbot/skills/ 是项目代码, # SKILL.md 内引用的 references/*.md。host 上 zcbot/skills/ 是项目代码,
# 跟用户 working_dir 正交,只读防容器内进程改 skill 实现。 # 跟用户 working_dir 正交,只读防容器内进程改 skill 实现。
@ -256,10 +267,14 @@ def setup_pool(
""" """
ensure_network() ensure_network()
cfg = sandbox_cfg or {} cfg = sandbox_cfg or {}
dns_cfg = cfg.get("dns") or []
if not isinstance(dns_cfg, list):
dns_cfg = []
return SandboxPool( return SandboxPool(
user_root_base=user_root_base, user_root_base=user_root_base,
repo_root=repo_root, repo_root=repo_root,
memory=cfg.get("memory") if isinstance(cfg.get("memory"), str) else None, memory=cfg.get("memory") if isinstance(cfg.get("memory"), str) else None,
cpus=str(cfg["cpus"]) if cfg.get("cpus") is not None else None, cpus=str(cfg["cpus"]) if cfg.get("cpus") is not None else None,
pids_limit=int(cfg["pids_limit"]) if cfg.get("pids_limit") is not None else None, pids_limit=int(cfg["pids_limit"]) if cfg.get("pids_limit") is not None else None,
dns=[str(x) for x in dns_cfg],
) )