paper_server/apps/resm
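caoqianming 2d6df68135 feat(resm): fix_preview_pdf adds detection and cleanup of broken PDFs
Besides the 1-page Elsevier abstract preview pages, the historical records also contain
corrupted files in which an HTML error page or truncated garbage was saved as a PDF; these were likewise mislabeled has_fulltext_pdf=True.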

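- tasks.py: add the _inspect_pdf classifier (broken/preview/ok/unknown). A file is classified
  as broken only on hard evidence (the magic number is not %PDF, or pypdf is installed and parsing fails);
  if pypdf is not installed and the page count cannot be determined, the file is classified as unknown and is never deleted by mistake.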
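- fix_preview_pdf: preview-page files are deleted only with --delete-file; broken files are always deleted (except in dry-run)
  and are tagged fail_reason=pdf_broken; records without XML full text also have has_fulltext rolled back.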
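
The classification and deletion rules above can be sketched roughly as follows. This is a minimal illustration, not the code in tasks.py: the names inspect_pdf_sketch, should_delete_file and the count_pages parameter are hypothetical, and two points are assumptions not stated in the commit message: that a "preview" is a file with at most one page, and that dry-run also suppresses deletion of preview files.

```python
from pathlib import Path
from typing import Callable, Optional

try:
    from pypdf import PdfReader  # optional dependency
except ImportError:
    PdfReader = None


def _count_pages_with_pypdf(path: Path) -> int:
    """Return the page count; raises if pypdf cannot parse the file."""
    return len(PdfReader(str(path)).pages)


def inspect_pdf_sketch(
    path, count_pages: Optional[Callable[[Path], int]] = None
) -> str:
    """Classify a file as 'broken', 'preview', 'ok' or 'unknown'."""
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(4)
    # Hard evidence 1: the file does not start with the %PDF magic number.
    if head != b"%PDF":
        return "broken"
    if count_pages is None:
        if PdfReader is None:
            # No parser available: refuse to guess, so nothing gets deleted.
            return "unknown"
        count_pages = _count_pages_with_pypdf
    try:
        pages = count_pages(path)
    except Exception:
        # Hard evidence 2: a parser is available and parsing failed.
        return "broken"
    # Assumption: at most one page means an abstract preview page.
    return "preview" if pages <= 1 else "ok"


def should_delete_file(status: str, delete_file: bool, dry_run: bool) -> bool:
    """Decide whether the file on disk should be removed."""
    if dry_run:
        return False  # assumption: dry-run never deletes anything
    if status == "broken":
        return True  # broken files are always deleted
    if status == "preview":
        return delete_file  # preview files only with --delete-file
    return False  # 'ok' and 'unknown' are never deleted
```

Making the page counter injectable keeps the sketch testable without pypdf installed; the real classifier may be structured differently.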

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 09:38:05 +08:00
..
management feat(resm): fix_preview_pdf adds detection and cleanup of broken PDFs 2026-06-29 09:38:05 +08:00
migrations feat(resm): merge elsevier fetch tasks + persistent keep-alive for the fetch chain + openalex rate-limit backoff 2026-06-23 10:06:49 +08:00
__init__.py feat: add resm app 2026-01-23 10:37:41 +08:00
admin.py feat(resm): journal/keyword monitoring PaperMonitor + remove the daily incremental periodic task 2026-06-21 23:43:58 +08:00
apps.py feat: add resm app 2026-01-23 10:37:41 +08:00
cloudflare_checkbox2.png feat: add pyautogui calls 2026-02-09 15:17:02 +08:00
d_oaurl.py feat: pass cloudflare verification 2026-03-23 16:30:18 +08:00
d_scihub.py feat: pass cloudflare verification 2026-03-23 16:30:18 +08:00
filters.py feat(resm): paper query adds publication_date exact + range filters 2026-06-22 11:12:00 +08:00
models.py feat(resm): journal/keyword monitoring PaperMonitor + remove the daily incremental periodic task 2026-06-21 23:43:58 +08:00
serializers.py feat: paper list adds pdf_url / xml_url direct-link fields + pg_trgm GIN index 2026-05-21 13:48:52 +08:00
services.py feat: add download_pdf 2026-01-28 15:01:49 +08:00
tasks.py feat(resm): fix_preview_pdf adds detection and cleanup of broken PDFs 2026-06-29 09:38:05 +08:00
tests.py feat: add resm app 2026-01-23 10:37:41 +08:00
urls.py feat: update cloudflare verification for pdf 2026-03-24 10:34:06 +08:00
views.py feat: paper list returns abstract + add retrieve endpoint + filterset extended with year range / multiple fields 2026-05-21 13:17:46 +08:00