2026-06-20

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence 95

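Tags: Model Release, Large Models, Inference Optimization, Open-Source Ecosystem
Source: arXiv Computation and Language | Read the original

[Summary]
DeepSeek has released the V4 series of MoE models (1.6T/284B parameters) with support for million-token contexts. Architectural innovations such as CSA and HCA sharply reduce inference FLOPs and KV cache size, performance sets a new state of the art among open-source models, and the checkpoints are open-sourced.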

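Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech 88

Tags: Model Release, Multimodal, Cross-Lingual, Research Breakthrough
Source: arXiv Computation and Language | Read the original

[Summary]
Meta has released OmniSONAR, the first cross-modal sentence embedding model to cover thousands of languages across speech, text, code, and math, with translation and retrieval performance far ahead of prior work.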

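AlphaFold Lead John Jumper Leaves Google DeepMind to Join Anthropic 85

Tags: Company News, AI Safety
Source: AI HOT Picks | Read the original

[Summary]
AlphaFold lead John Jumper has left Google DeepMind to join Anthropic. The move could reshape competition in AI for science and reflects a trend of core talent moving toward AI-safety-oriented companies.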

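Two Nature Studies: MIRA and AMIE Match or Surpass Physicians in Diagnosis and Treatment Planning 85

Tags: Large Models, Agents, AI Healthcare, Model Release
Source: AI HOT Picks | Read the original

[Summary]
Two studies in Nature: the MIRA diagnostic agent reached 88.9% accuracy on emergency diagnoses, ahead of senior physicians (78.1%), and Google's AMIE produced appropriate treatment plans in 95% of cases versus 72% for junior physicians. Both demonstrate the potential of AI in clinical decision-making while cautioning about the gap to real-world practice.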

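OpenAI: Reinforcement Learning for Broadly and Durably Beneficial Models 85

Tags: AI Safety, Model Training, Reinforcement Learning, OpenAI
Source: AI HOT Picks | Read the original

[Summary]
OpenAI used reinforcement learning to train models that improve markedly on beneficial traits such as honesty and corrigibility. The gains generalize well across dozens of alignment evaluations, yielding models that are broadly and durably beneficial.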

Large Language Models Do Not Always Need Readable Language 85

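Tags: Research, Language Models, Inference Optimization, Multi-Agent
Source: arXiv Computation and Language | Read the original

[Summary]
The study proposes BabelTele, which lets LLMs use a compact text representation that is not human-readable. Text compressed to 27.9% of its original length still retains 99.5% semantic fidelity, which lowers context overhead and is especially valuable for multi-agent communication.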

Uncertainty Decomposition for Clarification Seeking in LLM Agents 85

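Tags: Agents, Uncertainty Decomposition, LLM, Research
Source: arXiv Computation and Language | Read the original

[Summary]
The paper proposes a prompt-based uncertainty decomposition method that separates action confidence from request uncertainty, so that LLM agents can proactively ask for clarification when a task is ambiguous. It reports significant F1 gains on a newly introduced benchmark.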

Vero: An Open RL Recipe for General Visual Reasoning 85

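Tags: Model Release, Open-Source Ecosystem, Multimodal, Visual Reasoning
Source: arXiv Computation and Language | Read the original

[Summary]
Vero proposes a fully open-source reinforcement learning recipe for visual reasoning and releases Vero-600K, a 600K-sample dataset. It improves average scores by 2.9 to 5.4 points across 30 benchmarks, and its 8B model beats Qwen's thinking model by 3.8 points, showing how open RL data and rewards advance general visual reasoning.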

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation 82

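Tags: Inference Optimization, Model Release, Research Progress
Source: arXiv Computation and Language | Read the original

[Summary]
S2D2 proposes a training-free self-speculative decoding framework that achieves up to a 4.7x speedup on diffusion LLMs while also improving accuracy, with no additional training or test-time compute.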

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines 82

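Tags: Research Release, Dataset, Benchmark, AI Coding
Source: arXiv Computation and Language | Read the original

[Summary]
JamSet and JamBench, the first project-level game code framework dataset and benchmark built on the Godot engine, have been released. An evaluation of 9 frontier models finds a capability cliff on large-project code and points to architectural design as the bottleneck.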

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models 82

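Tags: Inference Optimization, Model Compression, Research Release
Source: arXiv Computation and Language | Read the original

[Summary]
The causal pruning method CAP measures each attention head's causal effect on reasoning tasks. On models such as Llama-3 it achieves up to a 61% relative accuracy improvement and outperforms existing pruning methods.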

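DeepSeek Researcher Open-Sources AutoResearch: AI Autonomously Completes the RL Research Loop on a 285B Model 80

Tags: Agents, Open-Source Ecosystem, Large Models, Research
Source: AI HOT Picks | Read the original

[Summary]
A DeepSeek researcher has open-sourced AutoResearch, in which an AI autonomously completes the full RL research loop on a 285B model with zero human intervention, marking a major advance in continual learning research.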

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias 80

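Tags: LLM Evaluation, Model Reliability, Research
Source: arXiv Computation and Language | Read the original

[Summary]
The largest systematic evaluation of LLM-as-a-Judge to date finds widespread problems with judge models in consistency, reliability, and bias. For example, exact-match agreement and Cohen's kappa differ by 33 to 41 percentage points, and some models pair high reliability with severe position bias.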
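
For readers unfamiliar with why exact-match agreement and Cohen's kappa can diverge, the following is a minimal Python sketch. The labels are invented for illustration and are not data from the paper, and the function names are hypothetical. Kappa subtracts the agreement that two raters would reach by chance given their label frequencies, so it drops well below raw agreement when one label dominates.

```python
import math
from collections import Counter


def exact_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance
    from each rater's own label frequencies.
    """
    n = len(a)
    p_o = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical labels: a human rater and a judge model on 10 items.
    human = ["pass"] * 8 + ["fail"] * 2
    judge = ["pass"] * 9 + ["fail"] * 1
    print("exact agreement:", exact_agreement(human, judge))
    print("Cohen's kappa:  ", round(cohens_kappa(human, judge), 3))
```

In this toy case the raters disagree on a single item, yet kappa is far lower than raw agreement because most of the matching "pass" labels are what chance alone would produce.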

Meta Flow Maps enable scalable reward alignment 80

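Tags: Inference Optimization, Model Release, Large Models
Source: arXiv Statistics - Machine Learning | Read the original

[Summary]
Meta proposes the Meta Flow Maps framework, which achieves scalable reward alignment through single-step stochastic posterior sampling and substantially cuts the compute cost of inference and fine-tuning for generative models.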

Diffusion Language Models: An Experimental Analysis 80

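Tags: Model Research, Large Models, Inference Optimization
Source: arXiv Computation and Language | Read the original

[Summary]
A systematic experimental analysis compares 8 diffusion language models (DLMs), evaluating the trade-off between performance and computational efficiency on reasoning, coding, translation, and other tasks, and offers guidance for deploying diffusion models in practice.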

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback 80

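Tags: Large Models, AI in Education, Framework, Agents
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes the PsyScore framework, which combines psychometric models with large language models to deliver adaptive essay scoring and personalized feedback, advancing AI in education.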

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization 80

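Tags: Model Release, Research Progress, Inference Optimization, Attention Mechanism
Source: arXiv Computation and Language | Read the original

[Summary]
HydraHead proposes a head-level hybrid attention architecture that uses interpretability-based selection to keep full attention for critical heads, substantially improving long-context handling with performance close to Qwen3.5.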

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment 80

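Tags: Model Quantization, Inference Optimization, Research
Source: arXiv Computation and Language | Read the original

[Summary]
The paper questions whether KL divergence is a valid fidelity proxy for quantized LLM deployment: it correlates with quality globally, but the correlation vanishes in the near-baseline regime and it predicts prompt-level behavior poorly. The takeaway is that the community should not rely on a single fidelity metric.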
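
As a toy illustration of why a single KL number may not tell the whole story, the Python sketch below compares two hypothetical quantized variants against a baseline next-token distribution. The logits are made up and this is not the paper's experiment or metric code; it only shows that two variants can sit at a similar KL distance from the baseline while one keeps the top-1 token and the other flips it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top1(p):
    """Index of the most probable token."""
    return max(range(len(p)), key=p.__getitem__)


if __name__ == "__main__":
    # Hypothetical next-token logits over a 3-token vocabulary.
    base = softmax([2.0, 1.9, 0.0])       # full-precision baseline
    variant_a = softmax([2.1, 1.8, 0.0])  # perturbed, same top-1 token
    variant_b = softmax([1.9, 2.0, 0.0])  # perturbed, top-1 token flips

    for name, q in (("A", variant_a), ("B", variant_b)):
        print(name, "KL:", round(kl_divergence(base, q), 5),
              "top-1 unchanged:", top1(q) == top1(base))
```

Both variants land at nearly the same small KL value, yet only variant B changes the greedy output, which is the kind of prompt-level difference a distance-only metric cannot distinguish.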

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models 80

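Tags: Model Interpretability, AI Safety, Language Models
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes a control-window law that theoretically predicts the boundary within which single-neuron interventions in language models yield coherent behavioral control, providing a quantitative framework for interpretability and safety steering.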

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning 80

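Tags: Agents, Reinforcement Learning, Large Models, Open Source
Source: arXiv Computation and Language | Read the original

[Summary]
Proposes the Connect the Dots framework, which trains LLMs with end-to-end reinforcement learning so that they keep learning and updating themselves over long-term deployment, achieving cross-domain generalization and stronger agent capabilities.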
