fix(scheduler): 定时任务超时被掐断时记 error 而非误吞成 ok(bump 0.32.2)

实测 bug:isolated 定时 job 跑满 timeout_seconds 被协作式 cancel 后, _run_agent_bg 对 ok/cancelled 都把 run_status 收回 idle(DB 不可区分), 而 _execute_scheduled_job 收尾只判 run_status=="error",于是超时中断被落成 last_status="ok" —— 掩盖"跑到一半没写 sections/没推送",且不计连续失败、不触发兜底。复盘 job e621c8a6「每日水泥科研简报」:timeout=600s,task 创建→last_run_at 正好 600.0s,agent 停在"按期刊打印 38 篇摘要"(还在取数)。修:超时分支置 timed_out 标志,run 收尾后若 timed_out → record_result( status="error", 半成品不投递 notify)并直接返回。复用既有 error 语义(计入 consecutive_failures、到阈值自动停用、前端 crons 显示「上次失败」)。不动 _run_agent_bg 的 idle-on-cancel 共享语义(HTTP cancel/drain 也依赖)。配套:PROGRESS/RUN 故障兜底各加一条;诊断脚本 scripts/diag_sched_e621.py (dump 输出 scripts/_*.txt 入 gitignore)。 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 14:35:40 +08:00 · 2026-06-29 14:35:40 +08:00 · 1646205364
parent 89062d99b3
commit 1646205364
6 changed files with 116 additions and 1 deletions
--- a/.gitignore
+++ b/.gitignore
@ -57,3 +57,6 @@ col.ps1
 # 探测脚本 scripts/probe_clawbot*.py 保留作参考与复测)
 scripts/clawbot_qr*.png
 scripts/zcbot_filetest.txt
+
+# 诊断脚本的使用即弃 dump 输出(diag_*.py 写本地,不入库)
+scripts/_*.txt
--- a/PROGRESS.md
+++ b/PROGRESS.md
@ -21,6 +21,12 @@

 ## 已完成关键能力

+### 2026-06-29 / 修定时任务超时被误记成 ok(bump 0.32.2)
+
+- 实测 bug:定时 job(isolated)跑满 `timeout_seconds` 被调度器协作式 cancel 后,`_run_agent_bg` 对 ok/cancelled 都把 `run_status` 收回 `idle`(二者 DB 不可区分),而 `_execute_scheduled_job` 收尾只判 `run_status=="error"`,于是超时中断被落成 `last_status="ok"` —— 掩盖"跑到一半没写 sections / 没推送",且不计连续失败、不触发兜底。复盘 job `e621c8a6`「每日水泥科研简报」:`timeout_seconds=600`,task 创建→`last_run_at` 正好 600.0s,最后一条 agent 消息停在"按期刊分组打印 38 篇摘要"(还在取数阶段),`last_status` 却是 ok。
+- 修:`web/app.py` `_execute_scheduled_job` 在超时分支置 `timed_out` 标志,run 收尾后若 `timed_out` → `record_result(status="error", ...)` 并直接返回(不投递半成品 notify)。复用既有 error 语义:计入 `consecutive_failures`、到阈值自动停用、前端 crons.js 显示「上次失败」。不动 `_run_agent_bg` 的 idle-on-cancel 共享语义(HTTP cancel/drain 也用)。
+- 配套:该 job 真正的诱因是 600s 超时对"7 刊 38 篇带中文摘要重写 + 渲 docx"太短,需用户把 `timeout_seconds` 调大(或 0=不限)。诊断脚本 `scripts/diag_sched_e621.py`。
+
 ### 2026-06-29 / channel 长会话上下文软重置(Phase 1,bump 0.32.0)

 - 问题:微信/企业微信复用同一常驻 chat_task,`Session.load` 全量喂模型 → 越用越贵/慢,终撞 context window。业界(OpenClaw/Hermes)做法:阈值摘要 + 会话分段 + 持久记忆;IM 场景独有的「会话分段」最高杠杆且零信息损失。
--- a/RUN.md
+++ b/RUN.md
@ -756,6 +756,7 @@ sudo xfs_quota -x -c "limit -p bhard=10g zcbot_<user_uuid>" /opt
 | `kill -HUP <pid>` 后 `/openapi.json` 没新接口 | uvicorn **不响应 SIGHUP**(没装 handler,落 Python 默认终止;Windows 上信号本身无效)。Ubuntu 上用 `systemctl restart zcbot`,或 unit 加 `--reload` 让 uvicorn 监听文件自动重起(见"部署"段)。验证:`curl -s http://127.0.0.1:8765/openapi.json \| python3 -c 'import sys,json;print([p for p in json.load(sys.stdin)["paths"] if "auth" in p])'` |
 | `systemctl restart zcbot` 要等几十秒才退 | 正常 —— 优雅 drain 在等在跑的 run 收尾(`shutdown.drain_timeout` 默 30s),没在跑 run 时秒退。journal 出现 `[shutdown] draining N in-flight run(s)` 即正常。真急(不在乎杀掉在跑 run):`systemctl kill -s KILL zcbot` |
 | 部署后在跑的对话被标 `error: server restarted before run finished` | 该 run 在 drain 期内没收尾、cancel 也没在 `cancel_grace` 内退,被 SIGKILL 后下次启动 reaper 标的。多半是 run 卡在不 poll cancel 的长动作(如单次超长 docker exec)或 `TimeoutStopSec` 配得比 drain 预算还小被提前 SIGKILL。先核对 unit `TimeoutStopSec > drain_timeout + cancel_grace`;真有超长 run 把 `drain_timeout` 调大 |
+| 定时任务「跑到一半没推送」/ crons 页显示「上次失败: 运行超过超时上限 Ns 未完成」 | job 跑满 `timeout_seconds` 被协作式中断(还没写完 / 没推送)。**0.32.2 起超时记 error**(此前误记 ok 看不出来),计入连续失败、到阈值自动停用。处置:该 job 调大 `timeout_seconds`(报告类重活如多刊检索+渲 docx 建议 ≥1800,或 0=不限),被自动停用的重新 enable。诊断单个 job 用 `scripts/diag_sched_e621.py <job_id 前缀>` |
 | `POST /v1/files/rename` 返 409 `folder has active run(s)` | 顶层目录被某 running/cancelling 的 task 占用;先 cancel 等流式 done 再 rename |
 | `POST /v1/files/rename` 返 409 `... 前缀嵌套` | 改名后会与其他 task 的 working_dir 形成嵌套;换不冲突的 new_name |
 | `POST /v1/files/upload` 返 413 `已达磁盘配额上限` | per-user 5GB(yaml `quotas.disk_bytes_per_user`)。让用户在 dev SPA 右侧文件栏删旧产物 / 大文件,或改 yaml 升配重启 web |
--- a/core/init.py
+++ b/core/init.py
@ -1,3 +1,3 @@
 # zcbot 版本号单一事实源:web/app.py 的 FastAPI version、/healthz 返回、前端展示都引这里。
 # 改版本只动这一行。
-__version__ = "0.32.1"
+__version__ = "0.32.2"
--- a/scripts/diag_sched_e621.py
+++ b/scripts/diag_sched_e621.py
@ -0,0 +1,93 @@
+"""diag: 查 scheduled-e621c8a6 这个 job 为何执行到一半没推送(ASCII only, GBK safe)."""
+import os
+import sys
+from pathlib import Path
+
+env = Path(__file__).resolve().parent.parent / ".env"
+for line in env.read_text(encoding="utf-8").splitlines():
+    if line.strip().startswith("ZCBOT_DB_URL="):
+        os.environ["ZCBOT_DB_URL"] = line.split("=", 1)[1].strip()
+from sqlalchemy import create_engine, text  # noqa: E402
+import builtins  # noqa: E402
+
+_out = open(Path(__file__).resolve().parent / "_sched_e621.txt", "w", encoding="utf-8")
+
+
+def print(*a, **k):  # noqa: A001
+    builtins.print(*a, **k, file=_out)
+
+
+PREFIX = sys.argv[1] if len(sys.argv) > 1 else "e621c8a6"
+engine = create_engine(os.environ["ZCBOT_DB_URL"])
+
+
+def s(x, n=2000):
+    t = str(x if x is not None else "")
+    return t if len(t) <= n else t[:n] + f"...[+{len(t)-n}]"
+
+
+with engine.connect() as conn:
+    job = conn.execute(text(
+        "select job_id,user_id,name,mode,cron,tz,enabled,notify,timeout_seconds,"
+        "next_run_at,last_run_at,last_status,last_error,last_task_id,"
+        "consecutive_failures,run_count,bound_task_id,created_at,deleted_at "
+        "from scheduled_jobs where cast(job_id as text) like :p"),
+        {"p": PREFIX + "%"}).fetchall()
+    print(f"[JOBS matched '{PREFIX}'] {len(job)}")
+    for j in job:
+        print("-" * 60)
+        print(f"job_id={j[0]} name={j[2]!r}")
+        print(f"  mode={j[3]} cron={j[4]!r} tz={j[5]} enabled={j[6]} timeout={j[8]}")
+        print(f"  notify={j[7]}")
+        print(f"  next_run_at={j[9]}  last_run_at={j[10]}")
+        print(f"  last_status={j[11]}  consecutive_failures={j[14]}  run_count={j[15]}")
+        print(f"  last_task_id={j[13]}  bound_task_id={j[16]}")
+        print(f"  deleted_at={j[18]}  created_at={j[17]}")
+        if j[12]:
+            print(f"  last_error: {s(j[12], 1500)}")
+
+    if not job:
+        sys.exit(0)
+
+    j = job[0]
+    uid = j[1]
+    last_tid = j[13]
+
+    # 找该 job 关联的所有 task(scheduled_job_id 回填 + last_task_id)
+    tasks = conn.execute(text(
+        "select task_id,name,status,run_status,run_error,tokens_prompt,tokens_completion,"
+        "created_at,updated_at,scheduled_job_id from tasks "
+        "where scheduled_job_id = :jid order by created_at"),
+        {"jid": str(j[0])}).fetchall()
+    print("\n" + "=" * 60)
+    print(f"[TASKS with scheduled_job_id={str(j[0])[:8]}] {len(tasks)}")
+    for t in tasks:
+        print(f"  task={t[0]} name={t[1]!r} status={t[2]} run={t[3]} "
+              f"tok={t[5]}/{t[6]} created={t[7]} updated={t[8]}")
+        if t[4]:
+            print(f"    run_error: {s(t[4], 1500)}")
+
+    # dump last_task_id 的消息(执行到哪一步)
+    tid = last_tid or (tasks[-1][0] if tasks else None)
+    if tid is None:
+        print("\n[no task to dump]")
+        sys.exit(0)
+    print("\n" + "=" * 60)
+    print(f"[DUMP messages of task {tid}]")
+    msgs = conn.execute(text(
+        "select idx,payload,tokens_in,tokens_out,created_at from messages "
+        "where task_id=:t order by idx"), {"t": str(tid)}).fetchall()
+    print(f"messages: {len(msgs)}\n")
+    for idx, p, ti, to, cat in msgs:
+        role = p.get("role")
+        head = f"[{idx}] {role} tok={ti}/{to} at={cat}"
+        print(head)
+        content = p.get("content")
+        if content:
+            print("    content:", s(content, 1500))
+        for tc in p.get("tool_calls") or []:
+            fn = tc.get("function") or {}
+            print(f"    CALL {fn.get('name')}({s(fn.get('arguments'), 800)})")
+        if role == "tool":
+            print(f"    TOOL[{p.get('name')}]:", s(content, 1200))
+        print()
--- a/web/app.py
+++ b/web/app.py
@ -921,15 +921,27 @@ def create_app() -> FastAPI:
                    runner.add_done_callback(lambda t: app.state.inflight.pop(t, None))

                    timeout = int(snap.get("timeout_seconds") or 0)
+                    timed_out = False
                    if timeout > 0:
                        done, _pending = await asyncio.wait({runner}, timeout=timeout)
                        if not done:
+                            timed_out = True
                            broker.request_cancel(tid)  # 协作式停;loop 在 chunk 间 poll 到即退
                            print(f"[scheduler] job {str(job_id)[:8]} timed out ({timeout}s), cancelling")
                            await runner
                    else:
                        await runner

+                    # 超时被掐断:_run_agent_bg 对 ok/cancelled 都把 run_status 收回 idle
+                    # (二者在 DB 里不可区分),只有这里知道本次是 timeout 中断的。必须记为
+                    # error —— 否则会误判成 ok(掩盖"跑到一半没推送"),且不计入连续失败/不触发
+                    # 兜底。半成品不投递 notify,直接收尾返回。
+                    if timed_out:
+                        record_result(job_id, status="error", task_id=tid,
+                                      error=f"运行超过超时上限 {timeout}s 未完成,已中断(本次未推送)")
+                        print(f"[scheduler] job {str(job_id)[:8]} recorded as timeout-error")
+                        return
+
                    # run 终态:_run_agent_bg 收尾把 run_status 写回 idle(ok)/error
                    with session_scope() as s:
                        st = s.execute(