caoqianming
|
70bac5c22c
|
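feat(resm): add journal/keyword monitors to match the materials frontier briefing
Following section 3, "Frontier Technology Search List", of the "Global Materials Frontier Dynamics Briefing" (《全球材料前沿动态简报》), add 8 journals
(Nature Materials/Communications/Reviews Materials, Communications
Materials, Science Advances, Scientific Reports, Engineering Structures,
Materials Today) and 6 low-carbon building-material keyword monitors, reusing the daily 05:00
monitor_papers periodic task. Journals listed in the briefing that 0009 already includes are not added again.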
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 13:23:55 +08:00 |
caoqianming
|
2d6df68135
|
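feat(resm): fix_preview_pdf now detects and cleans up broken PDFs
Besides Elsevier 1-page abstract preview pages, the historical records also contain corrupted files where an HTML error page or truncated garbage was saved as a PDF;
these were likewise mislabeled has_fulltext_pdf=True.
- tasks.py: add the _inspect_pdf classifier (broken/preview/ok/unknown). broken is assigned only on hard evidence
(the %PDF magic number is missing, or pypdf is installed and parsing fails); when pypdf is not installed and the page count cannot be determined, the file is classified unknown
and is never deleted by mistake.
- fix_preview_pdf: preview-page files are deleted only with --delete-file; broken files are always deleted (except in dry-run)
and tagged fail_reason=pdf_broken; records without XML full text also have has_fulltext rolled back.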
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 09:38:05 +08:00 |
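The four-way classification described in the commit above could be sketched roughly as follows. This is not the repository's `_inspect_pdf`; the function name `inspect_pdf`, the injectable `page_counter` parameter, and the choice to treat any single-page file as a preview are assumptions made for illustration. Passing `page_counter=None` stands in for "pypdf is not installed".

```python
from typing import Callable, Optional

# Hypothetical type: a parser that returns a page count, returns None when it
# cannot tell, or raises when the file cannot be parsed at all.
PageCounter = Callable[[bytes], Optional[int]]


def inspect_pdf(data: bytes, page_counter: Optional[PageCounter] = None) -> str:
    """Classify raw bytes as 'broken', 'preview', 'ok', or 'unknown'."""
    # Hard evidence 1: the file does not start with the %PDF magic number.
    if not data.startswith(b"%PDF"):
        return "broken"
    # No parser available: refuse to guess, so nothing is deleted by mistake.
    if page_counter is None:
        return "unknown"
    try:
        pages = page_counter(data)
    except Exception:
        # Hard evidence 2: a real parser failed on the file.
        return "broken"
    if pages is None or pages < 1:
        return "unknown"
    # Assumption: a single page means an abstract-only preview.
    return "preview" if pages == 1 else "ok"
```

Only the `broken` result is treated as safe to delete unconditionally; `unknown` is deliberately a separate outcome so that a missing dependency never leads to data loss.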
caoqianming
|
97b23a2b06
|
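fix(resm): silence pypdf recovery logs when parsing broken PDFs
When _pdf_page_count reads a corrupted PDF, pypdf emits a flood of recovery log messages such as incorrect header / Cannot find
/Root, polluting the output of batch jobs like fix_preview_pdf. Set the pypdf logger to
CRITICAL to silence them; a parse failure is still treated as None (the record is skipped).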
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 09:14:23 +08:00 |
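A minimal sketch of the logger change described above, assuming pypdf emits its recovery messages through standard-library loggers under the `pypdf` namespace (for example `pypdf._reader`). The helper name `silence_pypdf_logs` is hypothetical.

```python
import logging


def silence_pypdf_logs() -> None:
    """Raise the 'pypdf' logger threshold so warnings and errors are dropped."""
    # Child loggers such as 'pypdf._reader' have no level of their own by
    # default, so they inherit this effective level from the parent.
    logging.getLogger("pypdf").setLevel(logging.CRITICAL)


silence_pypdf_logs()
```

Because only the logger level changes, parse failures still surface as exceptions or `None` results in the calling code; just the log noise goes away.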
caoqianming
|
e695e04de7
|
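fix(resm): detect Elsevier abstract-preview PDFs to avoid mislabeling them as full text
For unauthorized or in-press articles, the application/pdf endpoint of the Elsevier Article API returns a 1-page preview PDF containing only
the abstract (the magic number is still %PDF and the file is not small), while the full-text XML can be fetched normally. The old logic
checked only the magic number and file size, so it stored the preview page and set has_fulltext_pdf=True.
- tasks.py: add _pdf_page_count / _is_elsevier_preview_pdf (pypdf preferred, falling back to
a byte scan); _elsevier_fetch_pdf and save_pdf_from_elsevier reject 1-page preview pages before saving
and tag them fail_reason=elsevier_pdf_preview_only; the backfill queue qs_pdf excludes that tag to avoid infinite retries
- add management command fix_preview_pdf: scan existing mislabeled records and roll back has_fulltext_pdf;
records without XML full text also have has_fulltext rolled back so they re-enter the download chain
- requirements.txt: add pypdf>=4.0.0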
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-29 08:54:07 +08:00 |
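The "byte scan" fallback mentioned above is not spelled out in the commit message. One common heuristic, shown here only as an assumption about what such a fallback might look like, is to count `/Type /Page` object markers in the raw bytes. The name `page_count_by_scan` is hypothetical, and the approach cannot see page objects stored inside compressed object streams, so it returns `None` rather than 0 when nothing is found.

```python
import re
from typing import Optional

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
_PAGE_OBJ = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def page_count_by_scan(data: bytes) -> Optional[int]:
    """Estimate the page count from raw PDF bytes, or None if undetermined."""
    if not data.startswith(b"%PDF"):
        return None
    count = len(_PAGE_OBJ.findall(data))
    # Zero matches usually means the page objects are compressed, not that
    # the document is empty, so report "unknown" instead of 0.
    return count or None
```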
caoqianming
|
1e54070d6d
|
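fix(resm): add a timeout to openalex PDF download requests
The content.openalex.org request in save_pdf_from_openalex previously had no timeout;
a single hung request could outlast the 120s alive heartbeat, causing ensure_fetch_running to fire again
and stack up concurrent chains hammering the API. Add timeout=(3,15), consistent with the elsevier calls;
a timeout is caught by the outer RequestException handler and cleaned up normally.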
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 15:39:16 +08:00 |
caoqianming
|
75d02814c6
|
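fix(resm): fix openalex chain failures masking the fallback PDF download
send_download_fulltext_task used to select papers with fail_reason=None, so papers whose fail_reason had been written by the openalex
keep-alive chain were masked permanently and the oa_url/elsevier/scihub fallback paths
were never tried. Now download_pdf sets a stable marker, download_pdf_tried, on reaching a terminal state, and selection
excludes papers by that marker. This both removes the masking and prevents this chain from retrying the same paper indefinitely.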
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 10:44:34 +08:00 |
caoqianming
|
12f97fc47f
|
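feat(resm): merge elsevier fetch tasks + persistent keep-alive for the fetch chain + openalex rate-limit backoff
- merge get_pdf_from_elsevier into get_abstract_from_elsevier: after fetching the XML for a DOI,
fetch the PDF inline if full text is found, and backfill existing papers missing a PDF; the phase-2 batch limit is split out as pdf_number_of_task
- add ensure_fetch_running beat task + alive heartbeat: the self-triggering chain self-heals after a restart, crash, or idle period
- get_pdf_from_openalex: while rate-limited, refresh alive at a slow pace without calling the API; ordinary 429 responses also back off
- migration 0010 registers ensure_fetch_running as a periodic task running every 60s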
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-23 10:06:49 +08:00 |
caoqianming
|
643cb97e4a
|
feat(resm): add exact and range filters on publication_date to paper queries
- publication_date: exact date filter
- publication_date_gte / publication_date_lte: date range (endpoints inclusive)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-22 11:12:00 +08:00 |
caoqianming
|
c5636b5131
|
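feat(resm): journal/keyword monitoring via PaperMonitor + remove the daily incremental periodic tasks
- add PaperMonitor model (type=journal/search/keyword, value, name, note, is_active, days, last_run, last_count) + admin management
- add monitor_papers task: iterate over active monitors (journal→primary_location.source.issn / search→title_and_abstract / keyword→keywords.id), reuse _crawl_openalex_query to store results with deduplication, scheduled daily at 05:00
- migration 0008 creates the table; 0009 seeds it (8 inorganic non-metallic materials journals + 5 English topic terms, note=无机非金属材料, i.e. "inorganic non-metallic materials") and registers the monitoring periodic task
- remove 0007: update_paper_meta_from_openalex/elsevier are no longer registered as daily periodic tasks (only a one-off backfill is needed, via backfill_paper_meta_from_openalex); both task functions are kept for manual/backfill calls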
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-21 23:43:58 +08:00 |
caoqianming
|
7b38d4d234
|
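feat(resm): automatic paper index updates + a generic OpenAlex crawl core
- add generic core _crawl_openalex_query: single-query cursor pagination + per-page cursor checkpoint + stop/resume flags, shared by the full crawl, the daily incremental update, and the backfill; also fixes a bug in get_paper_meta_from_openalex, which wrote the starting cursor back to the cache, so a crawl interrupted partway through a year could not resume
- add update_paper_meta_from_openalex: daily incremental update by from_publication_date (days=30). from_created_date/from_updated_date require OpenAlex Premium, which the current key lacks, so the publication date is used
- add update_paper_meta_from_elsevier: ScienceDirect Search (loadedAfter) to supplement new Elsevier publications
- add backfill_paper_meta_from_openalex: one-off backfill by publication date, with checkpoint resume and pause/resume on quota exhaustion
- tasks.py credentials are now read from settings (centralized in the gitignored config/conf.py)
- migration 0007: register two daily incremental periodic tasks (OpenAlex 03:00 / Elsevier 04:00)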
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
2026-06-21 15:12:04 +08:00 |
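The per-page cursor checkpoint described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the repository's `_crawl_openalex_query`: `crawl_with_checkpoint`, `fetch_page`, `cache`, `key`, and `should_stop` are hypothetical names, and the page fetcher is injected so the sketch runs without network access. It relies on OpenAlex cursor paging starting at `*` and ending when the next cursor is null.

```python
from typing import Callable, List, MutableMapping, Optional, Tuple

# Hypothetical page type: (results on this page, next cursor or None when done).
Page = Tuple[List[dict], Optional[str]]


def crawl_with_checkpoint(
    fetch_page: Callable[[str], Page],
    cache: MutableMapping[str, str],
    key: str,
    should_stop: Callable[[], bool] = lambda: False,
) -> int:
    """Walk a cursor-paginated query, checkpointing after every page."""
    # Resume from the saved cursor if there is one, else start at "*".
    cursor: Optional[str] = cache.get(key, "*")
    saved = 0
    while cursor is not None:
        if should_stop():  # soft stop: leave the checkpoint in place
            break
        results, next_cursor = fetch_page(cursor)
        saved += len(results)  # a real task would store the records here
        if next_cursor is None:
            cache.pop(key, None)  # finished: clear the checkpoint
        else:
            # Checkpoint the NEXT cursor, not the one just used; writing the
            # starting cursor back is the bug the commit describes.
            cache[key] = next_cursor
        cursor = next_cursor
    return saved
```

After an interruption, calling the function again with the same `cache` and `key` continues from the first page that was not yet processed.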
caoqianming
|
6a5a5d7b6b
|
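feat: add pdf_url / xml_url direct-link fields to paper list + pg_trgm GIN indexes
serializers: PaperListSerializer gains pdf_url / xml_url SerializerMethodFields; the backend builds the absolute_uri from publication_date + safe_doi and returns an empty string when has_fulltext_{pdf,xml}=False or publication_date is missing. LLM clients get direct links from a single list call and need not build URLs themselves.
migration 0006: CREATE EXTENSION IF NOT EXISTS pg_trgm + GIN indexes on 3 columns (title / first_author / first_author_institution), eliminating the full-table-scan timeouts caused by SearchFilter's cross-column ILIKE '%xxx%' (high-frequency terms such as cement used to take 30s+; with the indexes they take tens of ms).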
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-21 13:48:52 +08:00 |
caoqianming
|
e8320bce05
|
feat: paper list returns abstract + add retrieve endpoint + extend filterset with year range / more fields
Provide an LLM-friendly interface for the zcbot research skill: the list endpoint includes abstract, saving the LLM a per-item GET round trip; PaperViewSet adds CustomRetrieveModelMixin, fixing the bug where GET /api/resm/paper/<id>/ returned 404; filterset_class adds publication_year_gte/lte + has_fulltext_pdf / is_oa / publication_name / first_author / openalex_id; the queryset adds select_related("abstract") to prevent N+1 queries. search_fields is unchanged (still title/first_author/first_author_institution).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-21 13:17:46 +08:00 |
caoqianming
|
326f6b35d5
|
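fix: actually exclude paper_pdf_view from swagger endpoint enumeration
The previous swagger_auto_schema(auto_schema=None) only suppressed operation rendering; the path still entered the endpoints dict and took part in the longest-common-prefix calculation, so grouping still collapsed to api. Set swagger_schema = None on .cls instead, which hits the early return in EndpointEnumerator.should_include_endpoint (generators.py:66), so the path never enters the enumeration and the common prefix returns to /api/.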
|
2026-05-06 14:28:02 +08:00 |
caoqianming
|
b8a397eef7
|
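fix: hide paper_pdf_view so swagger does not group everything under api
paper_pdf_view is an @api_view-decorated route without the api/ prefix; drf-yasg includes it in the schema, which collapses the longest common prefix to empty and puts every endpoint under the api tag. Add swagger_auto_schema(auto_schema=None) to drop it from the schema; the common prefix returns to /api/ and grouping by module is restored.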
|
2026-05-06 14:17:57 +08:00 |
caoqianming
|
b826f8f46b
|
feat: add soft stop to get_paper_meta_from_openalex 2
|
2026-05-06 14:17:22 +08:00 |
caoqianming
|
6d2f2a452c
|
Merge branch 'main' of http://gitea.xxhhcty.xyz:8080/zcdsj/paper_server
|
2026-05-06 13:51:58 +08:00 |
caoqianming
|
1d8d829797
|
feat: add soft stop to get_paper_meta_from_openalex
Co-authored-by: Copilot <copilot@github.com>
|
2026-05-06 13:50:25 +08:00 |
TianyangZhang
|
92c55e8691
|
feat: update Cloudflare verification for pdf
|
2026-03-24 10:34:06 +08:00 |
TianyangZhang
|
76bb3bb4d4
|
Merge branch 'main' of http://gitea.xxhhcty.xyz:8080/zcdsj/paper_server
|
2026-03-23 16:30:21 +08:00 |
TianyangZhang
|
d7dd606f15
|
feat: pass Cloudflare verification
|
2026-03-23 16:30:18 +08:00 |
caoqianming
|
4090ce457d
|
feat: downloadpf skips openalex for now
|
2026-03-11 12:27:58 +08:00 |
caoqianming
|
b91482609b
|
feat: add doi query filter
|
2026-03-10 09:33:49 +08:00 |
caoqianming
|
54780b8ce1
|
feat: optimize get_abstract 3
|
2026-02-13 16:18:04 +08:00 |
caoqianming
|
38c4d8109b
|
feat: optimize get_abstract 2
|
2026-02-13 16:15:54 +08:00 |
caoqianming
|
3d08c4aeee
|
feat: optimize get_abstract
|
2026-02-13 16:13:56 +08:00 |
caoqianming
|
360456b50c
|
feat: get_paper_meta_from_openalex
|
2026-02-12 14:09:26 +08:00 |
caoqianming
|
b3ea39757e
|
feat: optimize get_pdf_from_openalex
|
2026-02-12 10:25:13 +08:00 |
caoqianming
|
d5f8e43751
|
feat: change ACCURACY
|
2026-02-10 16:18:48 +08:00 |
caoqianming
|
b300c1779b
|
feat: openalex pdferror needs to be saved
|
2026-02-10 14:13:50 +08:00 |
caoqianming
|
3c568c076b
|
feat: save_pdf_from_openalex saves openalex_pdf_not_found
|
2026-02-10 14:06:33 +08:00 |
caoqianming
|
2690895231
|
fix: timezone bug
|
2026-02-10 14:03:37 +08:00 |
caoqianming
|
5ebf2bde24
|
feat: get_pdf_from_openalex2
|
2026-02-10 13:58:13 +08:00 |
caoqianming
|
1ddca4d34d
|
feat: get_pdf_from_openalex
|
2026-02-10 13:56:44 +08:00 |
caoqianming
|
76e8204680
|
feat: enable save_pdf_from_openalex
|
2026-02-10 12:16:07 +08:00 |
caoqianming
|
352966946e
|
feat: ensure pdf downloads are complete
|
2026-02-10 11:14:30 +08:00 |
caoqianming
|
0fb8e5ff94
|
feat: optimize release_working_paper
|
2026-02-10 10:08:46 +08:00 |
caoqianming
|
33afe3af0b
|
feat: save_pdf_from_oa_url allows 202
|
2026-02-10 09:36:26 +08:00 |
caoqianming
|
5dda4efcae
|
feat: mark as oa_url_need_play for now
|
2026-02-09 16:47:43 +08:00 |
caoqianming
|
fd16c6f9d2
|
feat: import inside the function
|
2026-02-09 15:34:07 +08:00 |
caoqianming
|
94f269626d
|
feat: add pyautogui call
|
2026-02-09 15:17:02 +08:00 |
caoqianming
|
9efc412f7d
|
feat: get_abstract_from_elsevier runs the mixed ordering first
|
2026-02-05 09:14:51 +08:00 |
caoqianming
|
be264fd558
|
feat: save_pdf_from_scihub returns a message
|
2026-02-04 12:47:34 +08:00 |
caoqianming
|
d7aa8f8ada
|
feat: optimize save_pdf_from_scihub
|
2026-02-04 11:26:55 +08:00 |
caoqianming
|
b9f06b4859
|
feat: improve get_paper_meata search
|
2026-02-04 10:01:40 +08:00 |
caoqianming
|
76c8748503
|
feat: improve get_abstract_from_elsevier
|
2026-02-04 09:33:45 +08:00 |
caoqianming
|
8fbdc7c28b
|
feat: add parameters to get_abstract_from_elsevier
|
2026-02-04 08:56:13 +08:00 |
caoqianming
|
51fc1a5c5a
|
feat: add d_scihub call 2
|
2026-02-03 15:55:10 +08:00 |
caoqianming
|
43e8dbc226
|
feat: add d_scihub call
|
2026-02-03 15:53:13 +08:00 |
caoqianming
|
33a6dbf431
|
feat: add d_scihub
|
2026-02-03 15:41:44 +08:00 |
caoqianming
|
aa95818414
|
feat: support filter_or
|
2026-02-03 09:19:53 +08:00 |