caoqianming
|
2d6df68135
|
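feat(resm): fix_preview_pdf adds detection and cleanup of broken PDFs
Besides Elsevier 1-page abstract preview pages, the historical records also contain corrupted files
where an HTML error page or truncated garbage was saved as a PDF; these were likewise mislabeled has_fulltext_pdf=True.
- tasks.py: add the _inspect_pdf classifier (broken/preview/ok/unknown). broken is assigned only on hard
evidence (no %PDF magic number, or pypdf is installed and parsing fails); when pypdf is not installed and the
page count cannot be determined, the file is classified unknown and is never deleted by mistake.
- fix_preview_pdf: preview-page files are deleted only with --delete-file; broken files are always deleted (except in dry-run)
and tagged fail_reason=pdf_broken; records without XML full text also have has_fulltext rolled back.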
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 09:38:05 +08:00 |
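The classification rules described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's `_inspect_pdf`: the names `inspect_pdf`, `count_pages`, and `pypdf_page_count` are hypothetical, any 1-page PDF is treated as a preview, and the byte-scan fallback mentioned elsewhere in the log is omitted.

```python
import io
from typing import Callable, Optional


def pypdf_page_count(data: bytes) -> int:
    """Count pages with pypdf; raises ImportError if pypdf is not installed."""
    from pypdf import PdfReader  # imported lazily because pypdf is optional

    return len(PdfReader(io.BytesIO(data)).pages)


def inspect_pdf(data: bytes,
                count_pages: Optional[Callable[[bytes], int]] = None) -> str:
    """Return 'broken', 'preview', 'ok', or 'unknown' (hypothetical sketch)."""
    # Hard evidence 1: the payload does not start with the PDF magic number.
    if not data.startswith(b"%PDF"):
        return "broken"
    # No parser supplied: the page count is undetermined, so stay at 'unknown'.
    if count_pages is None:
        return "unknown"
    try:
        pages = count_pages(data)
    except ImportError:
        # The parser library is missing; that says nothing about the file.
        return "unknown"
    except Exception:
        # Hard evidence 2: a real parser was available and could not read it.
        return "broken"
    return "preview" if pages == 1 else "ok"
```

Passing `pypdf_page_count` as `count_pages` gives the pypdf-backed behaviour; leaving it out means a file can only be classified as `broken` on the magic-number check, so nothing is deleted on weak evidence.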
caoqianming
|
97b23a2b06
|
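fix(resm): silence pypdf recovery logs when parsing broken PDFs
When _pdf_page_count reads a corrupted PDF, pypdf emits a flood of recovery logs such as incorrect header / Cannot find
/Root, polluting the output of batch jobs such as fix_preview_pdf. Set the pypdf logger to
CRITICAL to silence it; a parse failure is still treated as None (the record is skipped).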
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 09:14:23 +08:00 |
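Silencing a library logger as described in the commit above might look like the sketch below. It assumes pypdf names its loggers by module path under the `pypdf` namespace, which is my understanding of current pypdf versions; where the project actually places this call is not shown in the log.

```python
import logging

# Loggers form a dotted hierarchy, so raising the level on the parent
# "pypdf" logger also applies to children such as "pypdf._reader"
# as long as they have no level of their own.
logging.getLogger("pypdf").setLevel(logging.CRITICAL)
```

Only messages at CRITICAL level would still get through, so the warning-level recovery chatter is dropped while the caller keeps handling parse failures itself.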
caoqianming
|
e695e04de7
|
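fix(resm): detect Elsevier abstract-preview PDFs to avoid mislabeling them as full text
For unauthorized or in-press articles, the application/pdf endpoint of the Elsevier Article API returns a 1-page
preview PDF containing only the abstract (the magic number is still %PDF and the file is not small), while the full-text XML can be fetched normally. The old logic
checked only the magic number and size, so it stored the preview page and set has_fulltext_pdf=True.
- tasks.py: add _pdf_page_count / _is_elsevier_preview_pdf (pypdf preferred, falling back to
a byte scan); _elsevier_fetch_pdf and save_pdf_from_elsevier reject 1-page preview pages before saving
and tag them fail_reason=elsevier_pdf_preview_only; the re-fetch queue qs_pdf excludes this tag to avoid infinite retries
- Add the management command fix_preview_pdf: scan existing mislabeled records and roll back has_fulltext_pdf;
records without XML full text also have has_fulltext rolled back so that they re-enter the download chain
- requirements.txt: add pypdf>=4.0.0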
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 08:54:07 +08:00 |
caoqianming
|
1e54070d6d
|
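fix(resm): add timeout to openalex PDF download requests
The content.openalex.org request in save_pdf_from_openalex previously had no timeout;
a single hung request could outlast the 120s alive heartbeat, causing ensure_fetch_running to fire repeatedly
and stack up concurrent chains that hammer the API. Add timeout=(3,15), consistent with the elsevier call sites;
timeouts are caught by the outer RequestException handler and cleaned up normally.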
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 15:39:16 +08:00 |
caoqianming
|
75d02814c6
|
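fix(resm): fix openalex chain failures masking the PDF fallback download
send_download_fulltext_task used to select by fail_reason=None, so papers whose fail_reason had been written
by the openalex keep-alive chain were permanently masked, and the oa_url/elsevier/scihub fallback paths
were never tried. download_pdf now sets a stable marker, download_pdf_tried, on reaching its terminal state, and selection
excludes by that marker; this both removes the masking and prevents this chain from retrying the same paper indefinitely.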
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 10:44:34 +08:00 |
caoqianming
|
12f97fc47f
|
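feat(resm): merge elsevier fetch tasks + persistent keep-alive for the fetch chain + openalex rate-limit backoff
- Merge get_pdf_from_elsevier into get_abstract_from_elsevier: after fetching the XML for a DOI,
fetch the PDF inline if full text is found, and re-fetch PDFs for existing papers that lack one; the phase 2 batch limit is split out as pdf_number_of_task
- Add the ensure_fetch_running beat task + alive heartbeat: the self-triggering chain self-heals after a restart, crash, or idle period
- get_pdf_from_openalex: while rate-limited, refresh alive at a slow pace without hitting the API; ordinary 429s also back off
- migration 0010 registers ensure_fetch_running as a periodic task every 60s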
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 10:06:49 +08:00 |
caoqianming
|
c5636b5131
|
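feat(resm): journal/keyword monitoring PaperMonitor + remove daily incremental periodic tasks
- Add the PaperMonitor model (type=journal/search/keyword, value, name, note, is_active, days, last_run, last_count) + admin management
- Add the monitor_papers task: iterate over active monitors, journal→primary_location.source.issn / search→title_and_abstract / keyword→keywords.id, reuse _crawl_openalex_query for deduplicated ingestion, scheduled daily at 05:00
- Migration 0008 creates the table; 0009 seeds data (8 inorganic non-metallic materials journals + 5 English topic terms, note=无机非金属材料) and registers the monitoring periodic task
- Remove 0007: update_paper_meta_from_openalex/elsevier are no longer registered as daily periodic tasks (only a one-off backfill is needed, using backfill_paper_meta_from_openalex); both task functions are kept for manual/backfill calls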
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-21 23:43:58 +08:00 |
caoqianming
|
7b38d4d234
|
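feat(resm): automatic paper index updates + generic OpenAlex crawl core
- Add the generic core _crawl_openalex_query: single-query cursor pagination + per-page cursor checkpoint + stop/resume flags, shared by the full crawl, the daily incremental update, and the backfill; also fixes a bug in get_paper_meta_from_openalex where the starting cursor was written back to the cache, so a crawl interrupted partway through a year could not resume
- Add update_paper_meta_from_openalex: daily incremental update by from_publication_date (days=30). from_created_date/from_updated_date require OpenAlex Premium, which the current key does not have, so the publication date is used
- Add update_paper_meta_from_elsevier: ScienceDirect Search (loadedAfter) supplements newly published Elsevier content
- Add backfill_paper_meta_from_openalex: one-off backfill by publication date, supporting resume from a checkpoint and pause/resume when the quota runs out
- tasks.py credentials are now read from settings (centralized in the gitignored config/conf.py)
- migration 0007: register two daily incremental periodic tasks (OpenAlex 03:00 / Elsevier 04:00)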
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-21 15:12:04 +08:00 |
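The cursor pagination with a per-page checkpoint described in the commit above can be sketched as below. This is an illustration under stated assumptions, not the project's `_crawl_openalex_query`: `crawl`, `fetch_page`, `cache`, `key`, and `should_stop` are hypothetical names, the cache is a plain dict, and `fetch_page` stands in for one API request that returns the page results plus the next cursor (`None` on the last page, `"*"` as the initial cursor, following OpenAlex cursor paging).

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (results, next_cursor)


def crawl(fetch_page: Callable[[str], Page],
          cache: Dict[str, str],
          key: str,
          should_stop: Callable[[], bool] = lambda: False) -> List[dict]:
    """Fetch pages until exhausted or stopped, checkpointing after each page."""
    # Resume from the saved checkpoint if one exists, else start fresh.
    cursor: Optional[str] = cache.get(key, "*")
    collected: List[dict] = []
    while cursor is not None:
        if should_stop():
            break  # soft stop: the checkpoint already points at the next page
        results, next_cursor = fetch_page(cursor)
        collected.extend(results)
        if next_cursor is None:
            cache.pop(key, None)  # finished: clear the checkpoint
        else:
            # Save the NEXT cursor, not the one just used, so that a resumed
            # run does not start over from the same page.
            cache[key] = next_cursor
        cursor = next_cursor
    return collected
```

Writing the next cursor rather than the starting one is the point of the fix mentioned in the commit: after an interruption, the cached value identifies the first page that has not yet been processed.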
caoqianming
|
b826f8f46b
|
feat: add soft stop to get_paper_meta_from_openalex (2)
|
2026-05-06 14:17:22 +08:00 |
caoqianming
|
6d2f2a452c
|
Merge branch 'main' of http://gitea.xxhhcty.xyz:8080/zcdsj/paper_server
|
2026-05-06 13:51:58 +08:00 |
caoqianming
|
1d8d829797
|
feat: add soft stop to get_paper_meta_from_openalex
Co-authored-by: Copilot <copilot@github.com>
|
2026-05-06 13:50:25 +08:00 |
TianyangZhang
|
76bb3bb4d4
|
Merge branch 'main' of http://gitea.xxhhcty.xyz:8080/zcdsj/paper_server
|
2026-03-23 16:30:21 +08:00 |
TianyangZhang
|
d7dd606f15
|
feat: pass cloudflare verification
|
2026-03-23 16:30:18 +08:00 |
caoqianming
|
4090ce457d
|
feat: downloadpf skips openalex for now
|
2026-03-11 12:27:58 +08:00 |
caoqianming
|
54780b8ce1
|
feat: optimize get_abstract (3)
|
2026-02-13 16:18:04 +08:00 |
caoqianming
|
38c4d8109b
|
feat: optimize get_abstract (2)
|
2026-02-13 16:15:54 +08:00 |
caoqianming
|
3d08c4aeee
|
feat: optimize get_abstract
|
2026-02-13 16:13:56 +08:00 |
caoqianming
|
360456b50c
|
feat: get_paper_meta_from_openalex
|
2026-02-12 14:09:26 +08:00 |
caoqianming
|
b3ea39757e
|
feat: optimize get_pdf_from_openalex
|
2026-02-12 10:25:13 +08:00 |
caoqianming
|
b300c1779b
|
feat: openalex pdferror needs to be saved
|
2026-02-10 14:13:50 +08:00 |
caoqianming
|
3c568c076b
|
feat: save_pdf_from_openalex saves openalex_pdf_not_found
|
2026-02-10 14:06:33 +08:00 |
caoqianming
|
2690895231
|
fix: timezone bug
|
2026-02-10 14:03:37 +08:00 |
caoqianming
|
5ebf2bde24
|
feat: get_pdf_from_openalex2
|
2026-02-10 13:58:13 +08:00 |
caoqianming
|
1ddca4d34d
|
feat: get_pdf_from_openalex
|
2026-02-10 13:56:44 +08:00 |
caoqianming
|
76e8204680
|
feat: enable save_pdf_from_openalex
|
2026-02-10 12:16:07 +08:00 |
caoqianming
|
0fb8e5ff94
|
feat: optimize release_working_paper
|
2026-02-10 10:08:46 +08:00 |
caoqianming
|
33afe3af0b
|
feat: save_pdf_from_oa_url allows 202
|
2026-02-10 09:36:26 +08:00 |
caoqianming
|
5dda4efcae
|
feat: mark as oa_url_need_play for now
|
2026-02-09 16:47:43 +08:00 |
caoqianming
|
94f269626d
|
feat: add pyautogui calls
|
2026-02-09 15:17:02 +08:00 |
caoqianming
|
9efc412f7d
|
feat: get_abstract_from_elsevier performs mixed ordering first
|
2026-02-05 09:14:51 +08:00 |
caoqianming
|
be264fd558
|
feat: save_pdf_from_scihub returns info
|
2026-02-04 12:47:34 +08:00 |
caoqianming
|
d7aa8f8ada
|
feat: optimize save_pdf_from_scihub
|
2026-02-04 11:26:55 +08:00 |
caoqianming
|
b9f06b4859
|
feat: improve get_paper_meta search
|
2026-02-04 10:01:40 +08:00 |
caoqianming
|
76c8748503
|
feat: improve get_abstract_from_elsevier
|
2026-02-04 09:33:45 +08:00 |
caoqianming
|
8fbdc7c28b
|
feat: add parameters to get_abstract_from_elsevier
|
2026-02-04 08:56:13 +08:00 |
caoqianming
|
51fc1a5c5a
|
feat: add d_scihub call (2)
|
2026-02-03 15:55:10 +08:00 |
caoqianming
|
43e8dbc226
|
feat: add d_scihub call
|
2026-02-03 15:53:13 +08:00 |
caoqianming
|
33a6dbf431
|
feat: add d_scihub
|
2026-02-03 15:41:44 +08:00 |
caoqianming
|
aa95818414
|
feat: support filter_or
|
2026-02-03 09:19:53 +08:00 |
caoqianming
|
99f9cff9d5
|
feat: modify get_abstract_from_elsevier (2)
|
2026-02-02 10:26:27 +08:00 |
caoqianming
|
82b1b41422
|
feat: modify get_abstract_from_elsevier
|
2026-02-02 10:14:29 +08:00 |
caoqianming
|
3cf01d49e6
|
feat: get_abstract_from_elsevier returns err_msg
|
2026-02-02 10:12:17 +08:00 |
caoqianming
|
3c84fbba49
|
feat: save_pdf_from_elsevier uses instoken
|
2026-02-02 09:52:40 +08:00 |
caoqianming
|
b24bb64485
|
feat: get_abstract_from_elsevier uses instoken
|
2026-02-02 09:23:36 +08:00 |
caoqianming
|
e2687874eb
|
feat: get_abstract_from_elsevier returns fetch info
|
2026-01-30 14:09:09 +08:00 |
caoqianming
|
16388682b9
|
feat: release_working_paper2
|
2026-01-30 14:01:03 +08:00 |
caoqianming
|
8c0efbecc2
|
feat: release_working_paper
|
2026-01-30 13:56:20 +08:00 |
caoqianming
|
22de14fdea
|
fix: save_pdf during get_abstract_from_elsevier
|
2026-01-30 13:34:54 +08:00 |
caoqianming
|
1f5def2821
|
feat: optimize fetch_status
|
2026-01-30 10:37:29 +08:00 |
caoqianming
|
bfcc6d77fc
|
feat: fix some type errors
|
2026-01-30 09:13:09 +08:00 |