2026-06-24

OpenSIR: Open-Ended Self-Improving Reasoner 85

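Tags: Large models, Reasoning/inference optimization, Research progress, Model release
Source: arXiv Computation and Language | Read the original

[Summary]
OpenSIR proposes an open-ended self-play framework in which a large model, with no external verifier or labeled data, keeps improving at mathematical reasoning through diversity rewards and difficulty calibration. It outperforms conventional supervised methods and generalizes to general reasoning tasks.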

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning 85

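Tags: Research, Reinforcement learning, Agents, Large models
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes Group-Graph Policy Optimization (G2PO), which converts sequential trajectories into a global state-transition graph and improves credit assignment in long-horizon agentic RL through graph-aggregated value estimation and edge-level advantage estimation, gaining up to 22.2% over GRPO on benchmarks such as WebShop.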

Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions 85

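Tags: Model release, Multimodal, Audio understanding, Audio generation
Source: arXiv Computation and Language | Read the original

[Summary]
Bagpiper is an 8B-parameter audio foundation model that unifies understanding and generation through pretraining on rich captions. It outperforms Qwen-2.5-Omni and CosyVoice3 on benchmarks such as MMAU, marking an important step for general-purpose audio AI.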

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning 85

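Tags: AI safety, Multimodal, Reasoning/inference optimization, Model release
Source: arXiv Computation and Language | Read the original

[Summary]
SingGuard proposes a policy-adaptive multimodal LLM guardrail that accepts rules at runtime and supports fast and slow reasoning. It reaches SOTA on 35 datasets and improves safety assessment under dynamic policies.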

Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models 85

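Tags: Research progress, Reasoning/inference optimization, Large models, AI safety
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes Answer Engineering, which uses local trajectory editing to improve large models' compliance in protocol-constrained decision making without retraining. Balanced accuracy on the SSNHL benchmark rises from 42% to 80.7%. Worth watching.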

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs 85

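Tags: Large models, AI safety, Research breakthrough
Source: arXiv Computation and Language | Read the original

[Summary]
Reveals how fragile large language models are when instruction following conflicts with pattern completion: even strong models are easily induced, and output diversity matters more than semantic understanding.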

Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers 85

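Tags: Reasoning/inference optimization, Large models, Model release
Source: arXiv Computation and Language | Read the original

[Summary]
Keyless Attention removes the key projection from the attention mechanism and uses only queries and values to implement value-space routing and value-only caching. The KV cache shrinks by 50% while decoding performance matches or exceeds standard attention, validated across multiple models.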
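
The 50% cache figure is what one would expect if keys and values have the same shape and only values are kept. A minimal back-of-the-envelope sketch, assuming a hypothetical model configuration (the layer count, head count, head dimension, sequence length, and fp16 storage below are illustrative and not taken from the paper):

```python
def cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem, tensors_per_token):
    """Bytes needed to cache `tensors_per_token` same-shaped tensors per token per layer."""
    return layers * kv_heads * head_dim * seq_len * bytes_per_elem * tensors_per_token

# Hypothetical configuration, chosen only for illustration.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192, bytes_per_elem=2)

standard = cache_bytes(**cfg, tensors_per_token=2)    # keys and values cached
value_only = cache_bytes(**cfg, tensors_per_token=1)  # values only (assumed same shape as keys)

saving = 1 - value_only / standard
print(f"standard: {standard} bytes, value-only: {value_only} bytes, saving: {saving:.0%}")
```

The saving is independent of the configuration as long as keys and values share a shape; it says nothing about how the paper's routing is computed.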

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories 85

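Tags: Large models, Model evaluation, AI safety
Source: arXiv Computation and Language | Read the original

[Summary]
BabelJudge is an open-source framework that systematically measures LLM-as-a-judge reliability with respect to position, verbosity, and order biases and cross-lingual degradation. It finds that judgments in low-resource languages are close to random, which matters for how far such evaluations can be trusted.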

Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning 85

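Tags: Agents, Reinforcement learning, Training methods
Source: arXiv Computation and Language | Read the original

[Summary]
Spark proposes dynamic branching at critical states, improving the resource efficiency and generalization of long-horizon agentic reinforcement learning. The code has been open-sourced.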

Fixed RAG Compression Collapses Measured Reader Scaling 85

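Tags: RAG, AI research, Evaluation methods
Source: arXiv Computation and Language | Read the original

[Summary]
A new study finds that fixed RAG compression hides reader upgrades and reverses model rankings, exposing an evaluation bias. It proposes the ragscale tool for quickly auditing reader scaling.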

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory 85

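Tags: Model release, Research progress, Agents, Retrieval augmentation
Source: arXiv Computation and Language | Read the original

[Summary]
EvoEmbedding proposes an embedding model with evolvable representations that performs dynamic long-context retrieval through continuously updated implicit memory. It outperforms large models such as Qwen3-Embedding-8B and improves agent system performance.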

Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning 85

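Tags: Agents, Reasoning/inference optimization, Model research
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes the LRE method, which learns to retain critical memories using only a few KB of memory and a CPU, raising accuracy on long-horizon agent tasks while cutting context overhead.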

Finetuning with Scientific Data Increases Hallucinations: A Multi-domain Factuality Evaluation of LLMs 85

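Tags: Research progress, Large models, AI safety, Factuality evaluation
Source: arXiv Computation and Language | Read the original

[Summary]
The study finds that LLMs fine-tuned on scientific data become less factually reliable across all hallucination types and scientific domains: internal confidence drops while the language grows more assertive, challenging current domain fine-tuning practice.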

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards 82

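Tags: Training methods, Reasoning/inference optimization, Large models
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes ACOER, which uses adaptive correct-only rewards to address reward collapse and over-compression when training large reasoning models. On mathematical reasoning tasks it improves accuracy and cuts generated tokens by more than 60%.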

Learning the ARTS of Search for Automated Discovery 82

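Tags: Agents, Large models, Reasoning/inference optimization, Research progress
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes ARTS, which uses reasoning language models for tree search in scientific discovery and overcomes context limits through test-time training. It outperforms existing algorithms on 22 tasks, with Qwen3-4B matching closed-source models at 5x lower inference cost.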

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes 82

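Tags: Research news, AI evaluation, Agents, Dialogue systems
Source: arXiv Computation and Language | Read the original

[Summary]
The study finds that LLM user simulators greatly overestimate the engagement of real non-buyers and underestimate their tendency to walk away, distorting sales-dialogue evaluation and calling the reliability of existing simulation methods into question.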

The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence 82

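Tags: Large models, Model evaluation
Source: arXiv Computation and Language | Read the original

[Summary]
A self-consistent LLM community benchmark that evaluates structural intelligence through a competitive word game and resists test-set leakage. It is the first to propose a gold-label-free factual accuracy evaluation based on singular value decomposition, with a 0.92 correlation with GPQA Diamond.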

Temporal Causal Prior-Data Fitted Networks for Panel Data with Learned Reliability Signals 82

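Tags: Causal inference, Time series, Zero-shot learning, AI research
Source: arXiv Statistics - Machine Learning | Read the original

[Summary]
Proposes TCPFN, a zero-shot temporal causal inference foundation model for panel data that also outputs reliability signals. It reaches clinical-grade causal discovery performance on 19 benchmark datasets and is the first to validate scalability on an industrial-scale dataset.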

Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm 82

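Tags: Reasoning/inference optimization, Large models, Research
Source: arXiv Statistics - Machine Learning | Read the original

[Summary]
E²C splits planning and execution in LLM reasoning into two stages, and this structured decoupling is more efficient: on AIME'2024 it reaches 53.3% accuracy with 12.4k tokens, beating Tree of Thoughts at 71.3k tokens and 50.0%. It also supports lightweight domain adaptation.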
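
To put the efficiency claim in proportion, the reported figures can be compared directly. The short calculation below uses only the four numbers quoted in the summary; the variable names are illustrative.

```python
# Figures quoted in the summary above (AIME'2024).
e2c = {"tokens_k": 12.4, "accuracy": 53.3}
tree_of_thoughts = {"tokens_k": 71.3, "accuracy": 50.0}

# How many times more tokens Tree of Thoughts uses than E²C.
token_ratio = tree_of_thoughts["tokens_k"] / e2c["tokens_k"]
# Fraction of tokens saved by E²C relative to Tree of Thoughts.
token_saving = 1 - e2c["tokens_k"] / tree_of_thoughts["tokens_k"]
# Accuracy difference in percentage points.
accuracy_gain = e2c["accuracy"] - tree_of_thoughts["accuracy"]

print(f"token ratio: {token_ratio:.2f}x")
print(f"token saving: {token_saving:.1%}")
print(f"accuracy gain: {accuracy_gain:.1f} points")
```

On these numbers, E²C uses roughly one sixth of the tokens while scoring about 3.3 points higher.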

GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations 82

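Tags: AI for Science, Model release, Research progress
Source: arXiv Statistics - Machine Learning | Read the original

[Summary]
GyroSwin releases a scalable 5D neural surrogate model for simulating plasma turbulence in nuclear fusion, cutting compute cost by three orders of magnitude while remaining physically verifiable.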
