feat(web): systemctl restart gracefully drains in-flight runs, no longer mislabels them as error

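Previously, restart hard-killed the BG run threads, and on the next start the reaper marked every running/cancelling run as error: server restarted before run finished, so with more users you could no longer restart at will. Single-instance stopgap, zero DB changes:

- lifespan: add draining (Event) + an inflight registry (also fixes the old pitfall where create_task keeps no reference and the task may be GC'd); in finally, first reject new runs → await in-flight runs → past drain_timeout, switch to cooperative cancel (= the user pressing stop: marked idle, not error, can be resent) → anything still running past cancel_grace is left to SIGKILL (worst case degrades to pre-change behavior)
- POST /messages: returns 503 + Retry-After while draining; registers the run in inflight on start
- main.py: uvicorn gets timeout_graceful_shutdown=5 (otherwise long-lived SSE connections block ahead of drain)
- config/agent.yaml: add a shutdown section (drain 30s / grace 15s; shorter is safer)
- dev SPA chat.js: send wrapped in backoff retry (both 503 backpressure and connection refused during handover are retried, ~26s)

Tightly coupled to deployment: unit TimeoutStopSec 10→90 (must be > drain + grace + sandbox cleanup margin), already written into the RUN.md unit + troubleshooting. B blue-green (zero 503 window) is deferred until a trigger signal appears.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>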
2026-06-10 10:54:43 +08:00 · 2026-06-10 10:54:43 +08:00 · 4f6e879050
parent c02d20b005
commit 4f6e879050
6 changed files with 107 additions and 11 deletions
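The shutdown sequence described in the commit message can be sketched in isolation roughly as follows. This is a minimal sketch, not the code in web/app.py: `drain`, `fake_run`, `demo`, and the `cancelled` set (standing in for `broker.request_cancel`) are hypothetical names, and the timeouts are shrunk to fractions of a second so the example finishes quickly.

```python
import asyncio


async def drain(inflight, request_cancel, drain_timeout, cancel_grace):
    """Wait for in-flight tasks, then escalate to cooperative cancel.

    `inflight` maps asyncio.Task -> run id. Returns the set of tasks that
    are still running after the cancel grace period (left for SIGKILL).
    """
    if not inflight:
        return set()
    _, pending = await asyncio.wait(list(inflight), timeout=drain_timeout)
    if not pending:
        return set()
    for task in pending:
        run_id = inflight.get(task)
        if run_id is not None:
            request_cancel(run_id)  # cooperative: the run must poll for this
    _, still = await asyncio.wait(pending, timeout=cancel_grace)
    return still


async def fake_run(run_id, duration, cancelled, honours_cancel=True):
    """Hypothetical stand-in for a BG run that checks a cancel flag between chunks."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + duration
    while loop.time() < deadline:
        if honours_cancel and run_id in cancelled:
            return "cancelled"
        await asyncio.sleep(0.01)
    return "done"


async def demo():
    cancelled = set()
    inflight = {}
    specs = {"fast": (0.02, True), "slow": (5.0, True), "stuck": (5.0, False)}
    for run_id, (duration, honours) in specs.items():
        task = asyncio.create_task(fake_run(run_id, duration, cancelled, honours))
        inflight[task] = run_id  # the dict holds a strong reference to the task
    by_id = {run_id: task for task, run_id in inflight.items()}

    still = await drain(inflight, cancelled.add, drain_timeout=0.1, cancel_grace=0.1)
    outcome = {
        "fast": by_id["fast"].result(),
        "slow": by_id["slow"].result(),
        "left_for_sigkill": sorted(inflight[t] for t in still),
    }
    # Clean up the run that ignored the cancel request so the loop can close.
    for task in still:
        task.cancel()
    await asyncio.gather(*still, return_exceptions=True)
    return outcome


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The three fake runs correspond to the three outcomes in the commit message: a run that finishes inside the drain window, a run that exits once cancel is requested, and a run that never polls for cancel and is left to SIGKILL.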
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -21,6 +21,10 @@
 ## Completed key capabilities
 ### 2026-06-10
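 - **`systemctl restart` gracefully drains in-flight runs (single-instance stopgap, no more mislabeling as error)**: previously restart hard-killed the BG run threads, and on the next start the reaper marked every `running/cancelling` run as `error: server restarted before run finished`, so with more users you could no longer restart at will. Implemented purely in-process with **zero DB changes**: ① lifespan adds `app.state.draining` (asyncio.Event) + `app.state.inflight` (`{asyncio.Task: task_id}`, which also fixes the old pitfall where `create_task` keeps no reference and the task may be GC'd); ② POST `/messages` registers the run on start with a done callback that removes it, and returns 503 + `Retry-After` while draining is set; ③ lifespan `finally` first sets draining to reject new runs, waits for completion with `asyncio.wait(inflight, drain_timeout)`, converts timed-out runs to cooperative cancel via `broker.request_cancel` (exits at the next chunk gap, marked idle rather than error), and leaves anything still running after `cancel_grace` to SIGKILL (worst case degrades to pre-change behavior); ④ `main.py` passes `timeout_graceful_shutdown=5` to uvicorn (otherwise long-lived SSE connections block indefinitely ahead of drain); ⑤ `config/agent.yaml` adds a `shutdown` section (drain_timeout 30s / cancel_grace 15s; timing out into cancel equals the user pressing stop and the message can be resent, hence the short values); ⑥ dev SPA `chat.js` wraps send in backoff retry (both 503 backpressure and the connection-refused TypeError during handover are retried for ~26s, showing "服务更新中" ("service updating"), with a friendly message once retries are exhausted). **Tightly coupled to deployment**: unit `TimeoutStopSec` raised from 10 to 90 (must be > drain + grace + sandbox cleanup margin, otherwise SIGKILL cuts the drain short), already written into the RUN.md unit + troubleshooting. B blue-green (zero 503 window) is deferred until a trigger signal appears; its prerequisite is an instance-aware reaper (§7.8).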
 ### 2026-06-09
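 - **In-browser PPTX preview (LibreOffice→PDF, DESIGN §8.3 Stage 1)**: previously clicking a `.pptx` in the file panel could only download it (`preview.js._categorize` classed it as fallback). Key insight: the frontend already has a PDF iframe path (`_showPdf`), so once the backend converts pptx to PDF the **frontend barely changes**. Implemented: ① new `web/pptx_render.py`: `pptx_to_pdf()` is synchronous and cacheable, calls `soffice --headless --convert-to pdf`, uses a **separate temporary `-env:UserInstallation` profile per call** to sidestep the single-profile lock, and kills on a 60s timeout; soffice path discovery reuses the render_bg approach; the cache lives beside the source in `.preview/<stem>.<hash>.pdf` (hash = mtime + size, so any source change invalidates it; a dotdir keeps it out of file listings), and `_prune_stale` clears old hashes. ② new endpoint `GET /v1/files/preview_pdf`: reuses `_safe_join` for auth and path-traversal protection + `.ppt(x)` only + a per-path `asyncio.Lock` to prevent concurrent reconversion + `run_in_executor` to keep the event loop unblocked; returns 501 if soffice is missing / 500 if conversion fails. ③ `preview.js` adds a `ppt` group, with main/mini sharing `_showPptAsPdf` (fetch PDF→iframe, with spinner loading + fallback to download on failure), and `dev.html` adds `.preview-spinner` (reusing `@keyframes spin`). **Conversion runs in the web host process, not in the sandbox** (the sandbox should not carry LibreOffice; preview targets any pptx under user_root and is decoupled from deck generation). Deployment: on the host `apt install libreoffice-impress fonts-noto-cjk` (already written into RUN.md one-time setup + troubleshooting); the sandbox Dockerfile is untouched. **Not done** (Stage 2): a resident soffice listener to eliminate cold starts, eager pre-conversion after deck generation, thumbnail navigation.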
--- a/RUN.md
+++ b/RUN.md
@@ -208,9 +208,12 @@ ExecStart=/opt/zcbot/.venv/bin/python main.py web --host 0.0.0.0 --port 8765
 Restart=on-failure
 RestartSec=2
 KillSignal=SIGTERM
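-# uvicorn graceful shutdown waits for in-flight requests (including long-lived SSE connections);
+# ★ Graceful drain: after SIGTERM, zcbot first rejects new runs (503) and waits for running runs to finish (see
-# after 10s systemd falls back to SIGKILL so SSE cannot stall the restart
+# config/agent.yaml `shutdown` section: drain_timeout 30s + cancel_grace 15s).
-TimeoutStopSec=10
+# TimeoutStopSec must be > drain_timeout + cancel_grace + margin (also budget for sandbox container cleanup
 # + uvicorn graceful 5s); otherwise systemd SIGKILLs mid-drain, in-flight runs are still
 # marked error, and the effort is wasted. When changing those two values in agent.yaml, adjust this too.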
 TimeoutStopSec=90
 KillMode=mixed
 StandardOutput=journal
 StandardError=journal
@@ -226,7 +229,7 @@ sudo systemctl daemon-reload
 sudo systemctl enable --now zcbot
 sudo systemctl status zcbot | head
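 sudo journalctl -u zcbot -f                   # live logs
-sudo systemctl restart zcbot                  # restart (REST blips for ~2s, SSE connections drop)
+sudo systemctl restart zcbot                  # restart: drain running runs first, then switch to the new version; new messages get 503 meanwhile (clients retry automatically)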
 sudo systemctl stop zcbot
 ```
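@@ -234,6 +237,10 @@ sudo systemctl stop zcbot
 ### Seamless updates (minimizing disruption to SSE as well)
 **Foundation: `systemctl restart` now gracefully drains running runs** (2026-06-10). After SIGTERM, zcbot first sets draining to reject new `POST /messages` (returns 503 + `Retry-After`), then waits for all running runs to finish naturally before switching to the new version; runs exceeding `drain_timeout` (config/agent.yaml `shutdown` section, default 30s) are converted to cooperative cancel (= the user pressing stop: marked idle, not error, can be resent), and only those still running after `cancel_grace` (default 15s) are left to SIGKILL. **Effect**: a restart no longer marks running conversations as `error`. Cost: during deployment, a newly sent message can hit a 503 window of up to tens of seconds. The dev SPA already retries with backoff on 503 / connection refused during handover (showing "服务更新中", "service updating"); the platform frontend should add the same. Eliminating this 503 window entirely requires B (blue-green) below; A's drain is the ceiling for a single instance. **Prerequisite**: unit `TimeoutStopSec > drain_timeout + cancel_grace` (see the unit comments above).
 The two tiers below address a different dimension (smoothing REST / SSE blips) and are orthogonal to drain:
 zcbot currently serves about 5 users with long-lived SSE connections; **strict "zero interruption"** (blue-green + nginx + SSE client reconnect design) is costly and not worth it. Two cost-effective tiers:
 **A. Simple tier: `--reload`** (recommended at the current scale)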
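@@ -689,7 +696,8 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
 | `mp_*` tools do not appear in the conversation | `MP_API_KEY` is not set in `.env`, so build_agent skips registration. Set it and restart web; Materials Project online queries go through the host-side tool, and offline pymatgen is unaffected. |
 | Doubao changed its pricing | Edit the `price_cny_per_image` line in `config/media/doubao.yaml` → restart web. **Historical usage_events are unaffected** (the units jsonb holds a snapshot of the unit price at the time, so aggregate queries still use the old price); new writes use the new price. Between the price increase and the YAML edit, accounting runs low; acceptable during development |
 | `/openapi.json` shows no new endpoints after `kill -HUP <pid>` | uvicorn **does not respond to SIGHUP** (no handler installed, so Python's default termination applies; on Windows the signal itself has no effect). On Ubuntu use `systemctl restart zcbot`, or add `--reload` to the unit so uvicorn watches files and restarts automatically (see the "Deployment" section). Verify: `curl -s http://127.0.0.1:8765/openapi.json \| python3 -c 'import sys,json;print([p for p in json.load(sys.stdin)["paths"] if "auth" in p])'` |
-| `systemctl restart zcbot` hangs for 10s before exiting | Long-lived SSE connections are open and uvicorn graceful shutdown waits for in-flight requests. The unit sets `TimeoutStopSec=10` as the SIGKILL fallback, so this is normal; if truly urgent use `systemctl kill -s KILL zcbot` |
+| `systemctl restart zcbot` takes tens of seconds to exit | Normal: graceful drain is waiting for running runs to finish (`shutdown.drain_timeout`, default 30s); with no running runs it exits immediately. Seeing `[shutdown] draining N in-flight run(s)` in the journal confirms this. If truly urgent (and killing running runs is acceptable): `systemctl kill -s KILL zcbot` |
 | After deployment a running conversation is marked `error: server restarted before run finished` | The run did not finish within the drain period, did not exit within `cancel_grace` after cancel either, was SIGKILLed, and was marked by the reaper on the next start. Most likely the run was stuck in a long action that does not poll cancel (e.g. a single very long docker exec), or `TimeoutStopSec` was set below the drain budget and SIGKILL came early. First check that unit `TimeoutStopSec > drain_timeout + cancel_grace`; if runs really are that long, raise `drain_timeout` |
 | `POST /v1/files/rename` returns 409 `folder has active run(s)` | The top-level directory is held by a running/cancelling task; cancel first, wait for the stream's done event, then rename |
 | `POST /v1/files/rename` returns 409 `... 前缀嵌套` (prefix nesting) | After the rename the path would nest with another task's working_dir; choose a non-conflicting new_name |
 | `POST /v1/files/upload` returns 413 `已达磁盘配额上限` (disk quota reached) | per-user 5GB (yaml `quotas.disk_bytes_per_user`). Have the user delete old artifacts / large files in the file panel on the right of the dev SPA, or raise the quota in yaml and restart web |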
--- a/config/agent.yaml
+++ b/config/agent.yaml
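@@ -19,6 +19,16 @@ quotas:
  disk_bytes_per_user: 5gb              # accepts 5gb / 500mb / 1073741824 (integer bytes)
  disk_scan_interval_seconds: 900       # background scan interval, default 15 minutes
 # Graceful drain (SIGTERM / systemctl restart): first reject new POST /messages (503 + Retry-After,
 # covered by client backoff retry), then wait for running runs to finish naturally; runs still going past drain_timeout are converted to cooperative cancel
 # (exit at the next chunk gap, marked idle rather than error); anything still running after cancel_grace is left to systemd SIGKILL
 # and marked error by the reaper on the next start (worst case = pre-change behavior). Restart web for changes to take effect.
 # ★ The systemd unit's TimeoutStopSec must be > drain_timeout + cancel_grace + margin (see the RUN.md deployment SOP).
 shutdown:
  drain_timeout_seconds: 30     # upper bound on waiting for running runs = upper bound on the deployment 503 window; converts to cancel on timeout
                                # (= the user pressing stop: marked idle and resendable, not error), so shorter is safer
  cancel_grace_seconds: 15      # extra exit grace after the timeout converts to cancel
 # Sandbox container resource limits (docker run flags, overridable via env); restart web for changes to take effect,
 # new containers use the new values, running ones are unchanged (picked up the next time one starts after the 5min idle reclaim).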
 sandbox:
--- a/main.py
+++ b/main.py
@@ -189,13 +189,19 @@ def web(host: str, port: int, reload: bool) -> None:
    """Start the web service (JSON API + dev SPA). Auth requires the PLATFORM_KEY / JWT_SECRET env vars."""
    import uvicorn
    # timeout_graceful_shutdown=5: after SIGTERM uvicorn waits at most 5s to close open HTTP requests
    # (mainly long-lived SSE GETs; clients reconnect after the drop and runs are unaffected), then enters lifespan shutdown
    # to run the actual run drain (see web/app.py finally + config/agent.yaml `shutdown` section).
    # Without it uvicorn waits indefinitely for long-lived SSE connections, blocking ahead of drain.
    if reload:
        # reload mode needs an import string + factory so uvicorn can watch files
        uvicorn.run("web.app:create_app", host=host, port=port,
-                    reload=True, factory=True, log_level="info")
+                    reload=True, factory=True, log_level="info",
                    timeout_graceful_shutdown=5)
    else:
        from web.app import create_app
-        uvicorn.run(create_app(), host=host, port=port, log_level="info")
+        uvicorn.run(create_app(), host=host, port=port, log_level="info",
                    timeout_graceful_shutdown=5)
 # ─────────────── Sandbox (Stage C pre-deployment reconciliation) ───────────────
--- a/web/app.py
+++ b/web/app.py
@@ -558,6 +558,16 @@ def create_app() -> FastAPI:
        broker.bind_loop(asyncio.get_running_loop())
        from core.agent_builder import load_config, resolve_workspace
        _cfg = load_config()
        # Graceful drain state (fallback for SIGTERM / systemctl restart, see finally below):
        # once draining is set POST /messages returns 503; inflight registers running BG run tasks,
        # which shutdown awaits. inflight also holds a strong reference for create_task so tasks are not GC'd midway.
        app.state.draining = asyncio.Event()
        app.state.inflight = {}  # dict[asyncio.Task, UUID(task_id)]
        _shutdown_cfg = _cfg.get("shutdown") or {}
        # Fallback matches the documented 30s default; 90 + cancel_grace would exceed TimeoutStopSec=90.
        drain_timeout = int(_shutdown_cfg.get("drain_timeout_seconds") or 30)
        cancel_grace = int(_shutdown_cfg.get("cancel_grace_seconds") or 15)
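        # Stale-run reaper: "running" / "cancelling" rows left by a previous process crash have no BG thread
        # to continue them; mark them error at startup so the task can accept messages again (otherwise the gate hangs forever).
        # TODO real production multi-worker: switch to heartbeat / lease and reap only this worker's orphans.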
@@ -660,6 +670,32 @@ def create_app() -> FastAPI:
        try:
            yield
        finally:
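            # ── Graceful drain: reject new runs first, wait for running runs to finish, convert to cooperative cancel on timeout ──
            # In the single-instance setup this eliminates "restart kills in-flight runs and marks them error". New POST /messages
            # get 503 meanwhile (covered by client backoff retry). Runs that finish within drain_timeout → idle, zero errors;
            # timed-out runs get broker.request_cancel → exit at the next chunk gap (marked idle); anything still running after
            # cancel_grace is left to systemd SIGKILL and marked error by the reaper on the next start (worst case = pre-change behavior).
            # ★ systemd TimeoutStopSec must be > drain_timeout + cancel_grace + margin (see RUN.md).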
            app.state.draining.set()
            inflight = app.state.inflight
            if inflight:
                print(f"[shutdown] draining {len(inflight)} in-flight run(s), "
                      f"timeout={drain_timeout}s")
                _, pending = await asyncio.wait(
                    list(inflight.keys()), timeout=drain_timeout
                )
                if pending:
                    print(f"[shutdown] {len(pending)} run(s) over drain timeout; "
                          f"signalling cooperative cancel")
                    for t in pending:
                        cid = inflight.get(t)
                        if cid is not None:
                            broker.request_cancel(cid)
                    _, still = await asyncio.wait(pending, timeout=cancel_grace)
                    if still:
                        print(f"[shutdown] {len(still)} run(s) still active after "
                              f"cancel grace; SIGKILL takes over, next start reaps them")
            disk_scanner_task.cancel()
            try:
                await disk_scanner_task
@@ -1256,6 +1292,12 @@ def create_app() -> FastAPI:
            tid = UUID(task_id)
        except ValueError:
            raise HTTPException(404, f"invalid task id: {task_id!r}")
        # Shutdown drain period: reject new runs, with Retry-After so clients back off and retry (backpressure during the deployment window).
        if getattr(app.state, "draining", None) is not None and app.state.draining.is_set():
            raise HTTPException(
                503, "server is restarting; retry shortly",
                headers={"Retry-After": "3"},
            )
        content = (body.content or "").strip()
        if not content:
            raise HTTPException(400, "empty content")
@@ -1282,10 +1324,14 @@ def create_app() -> FastAPI:
        image_variant = _resolve_image_model(body.image_model)
        video_variant = _resolve_video_model(body.video_model)
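        broker.start(tid)  # clear the previous round's done marker so new subscribers can see the stream
-        # after commit the lock is released; the BG thread takes over (the sink bridges events back to the asyncio loop through the broker)
+        # after commit the lock is released; the BG thread takes over (the sink bridges events back to the asyncio loop through the broker).
-        asyncio.create_task(asyncio.to_thread(
+        # Register in app.state.inflight: ① shutdown drain awaits it ② a strong reference keeps the task from being GC'd
        # midway (asyncio.create_task keeping no reference is a known pitfall). The done callback removes it.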
        run_task = asyncio.create_task(asyncio.to_thread(
            _run_agent_bg, tid, user_id, content, image_variant, video_variant,
        ))
        app.state.inflight[run_task] = tid
        run_task.add_done_callback(lambda t: app.state.inflight.pop(t, None))
        return {"events_url": f"/v1/tasks/{tid}/events"}
    # ───────────── Cancel current run ─────────────
--- a/web/static/js/chat.js
+++ b/web/static/js/chat.js
@@ -763,6 +763,24 @@ $("chat-stream").addEventListener("click", (e) => {
  }
 });
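 // POST /messages backoff retry: covers the deployment window of the backend's graceful drain:
 //   ① during drain the old process returns 503 (backpressure) ② in the handover gap fetch is refused (api throws TypeError, no status).
 // Both are retried while the UI shows "服务更新中"; within the ~26s budget the new process has usually taken over. If it still fails, rethrow so
 // sendMessage's catch shows a friendly message and the user resends later. Other errors (4xx etc.) are thrown immediately without retry.
 async function postMessageWithRetry(taskId, body) {
  const delays = [1000, 2000, 3000, 5000, 5000, 5000, 5000];  // 7 delays, 26s in total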
  for (let attempt = 0; ; attempt++) {
    try {
      return await api("POST", `/v1/tasks/${taskId}/messages`, body);
    } catch (e) {
      const retriable = e.status === 503 || e.name === "TypeError";  // 503 backpressure / network connection refused
      if (!retriable || attempt >= delays.length) throw e;
      $("chat-hint").textContent = "服务更新中,正在重发…";
      await new Promise((res) => setTimeout(res, delays[attempt]));
    }
  }
 }
 async function sendMessage() {
  if (!state.taskId) return;
  if (isCurrentTaskStreaming()) return;
@@ -786,7 +804,7 @@ async function sendMessage() {
    wrap.appendChild(asstCard);
    wrap.scrollTop = wrap.scrollHeight;
-    const r = await api("POST", `/v1/tasks/${taskId}/messages`, {
+    const r = await postMessageWithRetry(taskId, {
      content,
      image_model: state.imageModel || "",
      video_model: state.videoModel || "",
@@ -812,7 +830,11 @@ async function sendMessage() {
    streamSse(r.events_url, run);
  } catch (e) {
    if (e.status === 401) { logout(); return; }
-    appendErrorCard(e.message);
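+    // Retries exhausted and still 503 / connection refused → the deployment window outlasted the retry budget; show a friendly message so the user resends later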
    const msg = (e.status === 503 || e.name === "TypeError")
      ? "服务更新中,请稍后重发"
      : e.message;
    appendErrorCard(msg);
    setActionMode("idle");
    $("chat-hint").textContent = "就绪";
  }
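
For clients other than the dev SPA, the retry policy of `postMessageWithRetry` could be mirrored along these lines. This is a hedged sketch in Python: `HttpError`, `is_retriable`, `post_with_retry`, and the injected `send` / `sleep` callables are assumptions for illustration and are not part of this commit; only the delay schedule and the "retry on 503 or refused connection" rule come from chat.js.

```python
import time

# Same schedule as chat.js: 7 delays adding up to 26 seconds.
DELAYS_MS = [1000, 2000, 3000, 5000, 5000, 5000, 5000]


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed request."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retriable(exc):
    # Refused/reset connections cover the handover gap between processes;
    # 503 is the backpressure signal sent while the old process drains.
    if isinstance(exc, ConnectionError):
        return True
    return isinstance(exc, HttpError) and exc.status == 503


def post_with_retry(send, delays_ms=DELAYS_MS, sleep=time.sleep):
    """Call `send()` until it succeeds, backing off between retriable failures.

    Non-retriable errors, and the failure after the last delay, are re-raised.
    """
    attempt = 0
    while True:
        try:
            return send()
        except Exception as exc:
            if not is_retriable(exc) or attempt >= len(delays_ms):
                raise
            sleep(delays_ms[attempt] / 1000)
            attempt += 1
```

Passing `sleep` as a parameter keeps the policy testable without real waiting; a caller would wrap its actual HTTP request in `send`.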