paper_server/apps/resm
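caoqianming 88b51f97b0 perf(resm): fix_preview_pdf multi-process concurrent scan
Reading files and parsing them with pypdf is CPU/IO intensive, and scanning 170,000 records serially is too slow. Switch to ProcessPoolExecutor
to classify in parallel; DB writes stay serial in the main process (only a few files are bad, so this is not a bottleneck, and it also avoids sharing a DB connection across child processes).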

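- Add apps/resm/pdf_utils.py: extract _pdf_page_count / _is_elsevier_preview_pdf /
  _inspect_pdf / classify_pdf_file; no Django dependency, so the process pool can import it safely under both fork and spawn
- tasks.py: import these from pdf_utils and delete the inline definitions
- Command gains --workers (default: number of CPU cores) / --batch; streams rows in batches via .values() and prints progress per batch;
  DB writes now use a single filter().update() call and no longer load model instances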

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:24:45 +08:00
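
The pattern described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_utils.py or the management command: the body of `classify_pdf_file` here is a stand-in header check (the real pypdf-based checks are not shown on this page), and `batched`, `scan`, and the `executor_cls` parameter are hypothetical names introduced for the example.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import islice
from typing import Iterable, Iterator


def classify_pdf_file(item: tuple[int, str]) -> tuple[int, str]:
    """Stand-in classifier (assumption): only checks the %PDF- magic bytes.

    It is a module-level function with no Django imports, so worker
    processes can import and unpickle it under either fork or spawn.
    """
    pk, path = item
    try:
        with open(path, "rb") as fh:
            head = fh.read(5)
    except OSError:
        return pk, "missing"
    return pk, "ok" if head == b"%PDF-" else "bad"


def batched(rows: Iterable[tuple[int, str]], size: int) -> Iterator[list[tuple[int, str]]]:
    """Yield lists of at most `size` rows without materialising all rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def scan(rows, workers=None, batch_size=1000, executor_cls=ProcessPoolExecutor) -> list[int]:
    """Classify rows in parallel; collect the ids of bad files in the parent.

    `executor_cls` is a hypothetical seam so the same loop can run on a
    thread pool in tests.
    """
    bad_ids: list[int] = []
    done = 0
    with executor_cls(max_workers=workers or os.cpu_count()) as pool:
        for chunk in batched(rows, batch_size):
            # map() returns results in input order; only the parent process
            # touches bad_ids, so workers never need a DB connection.
            for pk, status in pool.map(classify_pdf_file, chunk):
                if status != "ok":
                    bad_ids.append(pk)
            done += len(chunk)
            print(f"scanned {done}, bad so far: {len(bad_ids)}")
    return bad_ids
```

In a Django command, `rows` would presumably come from a `.values()`-style queryset iterated lazily, and the returned ids would feed one `filter(...).update(...)` call in the main process. When the default `ProcessPoolExecutor` is used from a script, the call to `scan` should sit under an `if __name__ == "__main__":` guard so that spawn-based platforms do not re-run it in the workers.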
..
management perf(resm): fix_preview_pdf multi-process concurrent scan 2026-06-29 13:24:45 +08:00
migrations feat(resm): integrate with the materials frontier briefing, adding journal/keyword monitoring 2026-06-29 13:23:55 +08:00
__init__.py
admin.py feat(resm): journal/keyword monitoring PaperMonitor + remove the daily incremental periodic task 2026-06-21 23:43:58 +08:00
apps.py
cloudflare_checkbox2.png
d_oaurl.py feat: pass cloudflare verification 2026-03-23 16:30:18 +08:00
d_scihub.py feat: pass cloudflare verification 2026-03-23 16:30:18 +08:00
filters.py feat(resm): paper query adds publication_date exact + range filters 2026-06-22 11:12:00 +08:00
models.py feat(resm): journal/keyword monitoring PaperMonitor + remove the daily incremental periodic task 2026-06-21 23:43:58 +08:00
pdf_utils.py perf(resm): fix_preview_pdf multi-process concurrent scan 2026-06-29 13:24:45 +08:00
serializers.py feat: paper list adds pdf_url / xml_url direct-link fields + pg_trgm GIN index 2026-05-21 13:48:52 +08:00
services.py
tasks.py perf(resm): fix_preview_pdf multi-process concurrent scan 2026-06-29 13:24:45 +08:00
tests.py
urls.py feat: modify pdf cloudflare verification 2026-03-24 10:34:06 +08:00
views.py feat: paper list returns abstract + add retrieve endpoint + filterset extended with year range / multiple fields 2026-05-21 13:17:46 +08:00