2026-06-13

One Token to Fool LLM-as-a-Judge 88

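Tags: AI safety, LLMs, model evaluation, reward models
Source: arXiv Computation and Language | Read the original

[Summary]
The study shows that LLMs used as automated judges are systematically vulnerable: non-word symbols or generic reasoning openers can elicit false-positive rewards, affecting models such as GPT-o1 and Claude-4. The authors propose Master Reward Models to improve robustness.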

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling 85

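Tags: LLMs, inference optimization, model release
Source: arXiv Computation and Language | Read the original

[Summary]
MaxProof combines population-level test-time scaling with generate-verify-repair capabilities, lifting the M3 model above the human gold-medal threshold on IMO 2025 and USAMO 2026, a major advance in mathematical reasoning.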

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs 85

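Tags: LLMs, inference optimization, model evaluation, AI safety
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes operadic consistency (OC), a signal that detects LLM reasoning failures by checking whether answers stay consistent when a compositional query is decomposed. Across 12 models and 4 datasets, OC correlates strongly with accuracy (r ≥ 0.86), outperforms baselines such as self-consistency, and improves selective prediction.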

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents 85

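Tags: agents, LLMs, research papers
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes the SkillCAT framework, which achieves training-free skill self-evolution for LLM agents through contrastive causal extraction, assessment-augmented evolution, and topology-aware execution, with an average improvement of 40.40% across multiple benchmarks.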

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders 82

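Tags: AI safety, LLMs, recommender systems
Source: arXiv Computation and Language | Read the original

[Summary]
The new FORGE benchmark shows that search-augmented LLMs used for recommendation can be misled by a single fake web page, recommending fake products in up to 73.8% of cases; reasoning does not defend against the attack and instead makes it worse.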

TCS and Anthropic partner to bring Claude to regulated industries 80

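Tags: company news, enterprise services, AI safety
Source: Anthropic News | Read the original

[Summary]
Anthropic has partnered with Indian IT giant TCS. TCS will deploy Claude to 50,000 employees and build compliant, Claude-based AI solutions for regulated industries such as finance and healthcare, accelerating enterprise AI adoption.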

NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark 80

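Tags: agents, inference optimization, chips and compute
Source: NVIDIA Technical Blog - Generative AI | Read the original

[Summary]
NVIDIA posts leading coding performance on the first agentic AI benchmark, a step toward standardizing agentic inference workloads and a result that highlights its hardware's strengths in complex inference scenarios.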

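Kimi releases and open-sources its latest code model, Kimi-K2.7-Code 80

Tags: model release, open-source ecosystem, code models
Source: AI HOT Picks | Read the original

[Summary]
Kimi has released and open-sourced its K2.7 code model, with significantly improved performance and 30% lower reasoning-token usage; a 6x high-speed mode is coming soon.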

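MiniMax M3 open-weight model released, now available on HuggingFace 80

Tags: model release, open-source ecosystem, agents, LLMs
Source: AI HOT Picks | Read the original

[Summary]
MiniMax has open-sourced the M3 model, with roughly 428B total parameters and 23B active, combining coding, agentic, and multimodal capabilities with sparse attention over a 1M-token context. The weights are available on HuggingFace.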

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision 80

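Tags: training methods, inference optimization, self-distillation
Source: arXiv Computation and Language | Read the original

[Summary]
Self-Distillation Zero turns binary rewards into dense token-level self-supervision, improving on baselines such as GRPO by more than 10% on math and code reasoning while training more efficiently.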

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation 80

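Tags: LLMs, model research, structured generation, preference optimization
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes TAB-PO, which uses a token-level adaptive barrier to address DPO's gradient dilution and token erosion problems in structured generation, improving scientific information extraction by 11.59% on average and outperforming frontier models.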

Operads for compositional reasoning in LLMs 80

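Tags: inference optimization, LLMs, AI safety
Source: arXiv Computation and Language | Read the original

[Summary]
Uses the mathematical framework of operads to formalize problem decomposition in LLMs and defines an operadic consistency metric. Experiments show the metric correlates strongly with multi-hop QA accuracy and outperforms self-consistency baselines.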

The Illusion of Multi-Agent Advantage 80

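Tags: agents, LLMs, research
Source: arXiv Computation and Language | Read the original

[Summary]
A new study challenges the presumed advantage of multi-agent systems (MAS): a systematic evaluation finds that automatically generated MAS underperform single-agent CoT-SC on reasoning tasks at higher cost, pointing to architectural bloat.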

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent 80

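Tags: agents, AI research, peer review
Source: arXiv Computation and Language | Read the original

[Summary]
ProReviewer models peer review as a Markov decision process and uses a structured review log to enable proactive investigation. Its 8B model outperforms larger-model baselines on five quality metrics, with gains of up to 39%.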

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension 80

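Tags: visual text comprehension, vision-language models, inference optimization, adaptive rendering
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes AGAR, which uses the VLM's own attention to guide adaptive rendering, improving visual text comprehension without any training, with consistent gains across multiple benchmarks.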

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents 80

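Tags: agents, coding agents, inference optimization, model research
Source: arXiv Computation and Language | Read the original

[Summary]
TRACE compiles user corrections to coding agents into runtime rules in real time, sharply reducing preference violations, cutting the need for users to repeat corrections, and making interactive agents more consistent.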

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback 80

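Tags: agents, AI for Science, model release
Source: arXiv Computation and Language | Read the original

[Summary]
MDForge presents an LLM-agent method for automatically designing molecular dynamics pipelines, using multi-agent debate to cope with sparse rewards. It matches human experts on three benchmarks and discovers novel picomolar-level binders, illustrating AI's potential in computational chemistry.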

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions 80

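Tags: AI safety, LLMs, research progress, academic peer review
Source: arXiv Computation and Language | Read the original

[Summary]
A new study shows that revising only a paper's presentation (such as its abstract, framing, and narrative), without changing the scientific content, can successfully attack AI review systems with a 75.1% success rate, exposing the fragility of LLMs in serious peer review.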

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization 80

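Tags: research progress, model optimization, open-source ecosystem
Source: arXiv Computation and Language | Read the original

[Summary]
Fine-tuned with QLoRA, Mistral-7B outperforms GPT-4o/5 on biomedical claim verification at very low cost. The study also identifies structural flaws in the datasets and plans to open-source all code.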

Towards More General Control of Diffusion Models Using Jeffrey Guidance 80

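Tags: diffusion models, model control, AI research
Source: arXiv Statistics - Machine Learning | Read the original

[Summary]
Proposes Jeffrey Guidance, which uses Jeffrey's rule to extend control over diffusion models, significantly lowering FID on CIFAR-10 and FFHQ and enabling fairness control, for more flexible and controllable generation.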
