AI Tech Daily - 2026-05-26 | Recsys Frontier

Published: May 26, 2026 09:20

📊 Today's Overview

AI hit major milestones today: OpenAI and Google DeepMind both cracked decades-old Erdős math problems — the first time AI has made such a fundamental mathematical breakthrough. On the efficiency front, HRM-Text trained a SOTA 1B model for just $1,500, challenging the scaling law orthodoxy, while DeepSeek permanently slashed API prices by 75%, reshaping the economics of AI inference. Agent infrastructure also matured fast: Microsoft's SkillOpt formalized skill optimization as a trainable parameter, AWS MCP Server hit GA, and the auth.md protocol launched for agent authentication. Meanwhile, Chinese AI models surpassed US models in weekly token usage for the fourth consecutive week.

🔥 Trend Insights

AI cracks 80-year-old math problems: OpenAI and DeepMind independently solved Erdős conjectures — OpenAI with a simple "is Erdős wrong?" prompt, DeepMind's AlphaProof Nexus at hundreds of dollars per proof. This is AI's biggest math breakthrough yet.

Pre-training efficiency revolution: HRM-Text matches 2-7B open-source models with 96-432x less compute, while DeepSeek's permanent 75% price cut signals an industry pivot from a compute arms race to cost competition.

Agent skill optimization goes systematic: Microsoft's SkillOpt treats skill documents as trainable parameters, achieving best-or-tied results across all 52 evaluated configurations — skills become transferable across models and frameworks.

🐦 X/Twitter Highlights

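📈 Hot Topics & Trends

Grok Build Beta opens to all SuperGrok and X Premium+ users – Supports Plan Mode, image/video generation with Imagine, and CLI automation and orchestration @xai

Chinese AI models top US models in weekly usage for the 4th consecutive week – Weekly usage reached 9.223 trillion tokens, up 19.89% week over week; US models logged 4.93 trillion tokens, up 16.27%. DeepSeek-V4-Flash tops the chart @zerohedge

Google AI Studio now builds native Android apps for free, with over 250,000 created in one week – No coding required; fewer than 1% of users had prior development experience @OfficialLoganK

ByteDance/Microsoft/OpenAI updates: ByteDance Seed publishes an LLM scaling-law paper, Microsoft launches SkillOpt, and DeepSeek claims inference costs 50x lower than OpenAI's – After DeepSeek's permanent 75% price cut, 1 billion output tokens cost about $3,480, versus about $30,000 for comparable OpenAI models and about $15,000 for Claude @BullTheoryio | @fly51fly | @omarsar0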

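🔧 Tools & Products

Qwen3.7-Max ranks second on Code Arena, and implicit caching goes live – Per the official Qwen account, Qwen3.7-Max (from Alibaba's Tongyi model family) scored 1541 on Code Arena, second only to Claude; implicit caching launched the same day, cutting latency and cost with no configuration, with the option to switch to explicit caching @Alibaba_Qwen | @alibaba_cloud

Unitree releases the WVLA 2.0 model: fully automatic multi-task meeting-room cleanup – The robot model performs fully autonomous multi-task operation, filmed in a single take under strong external disturbances @UnitreeRobotics

Microsoft cancels Claude Code licenses over cost; did Uber find lower-tier models are enough? – Reports say Microsoft canceled Claude Code licenses for thousands of engineers because of token costs and moved them to GitHub Copilot; Uber exhausted its full-year AI budget by April @BullTheoryio (independent blogger)

Major update to the Codex Shim guide: run any model in Codex with full functionality – Community developer Terp published the update, which rolls in all compatibility fixes @OnlyTerp

MathCode 0.2.0 released: maximizes prompt-cache hit rate, cutting API costs by up to 90% – Yifan Zhang (MathCode author) launched the new version @yifan_zhang_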

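⚙️ Technical Practice

Google DeepMind releases AlphaProof Nexus, a Gemini-based agentic formal proof framework – It autonomously solved 9 open Erdős problems (including two unsolved for 56 years), 44 OEIS problems, and a 15-year-old open problem in algebraic geometry @pushmeet

Microsoft Research's SkillOpt treats agent skill files as trainable parameters, best or tied across 52 settings – A validation-gated optimizer edits the skill documents, yielding a 23.5-point gain on GPT-5.5; the skills transfer across models and frameworks @omarsar0

Three years of RAG, document understanding, and AI agents in one panorama: 116-slide deck open-sourced – Jerry Liu (LlamaIndex co-founder) released the full workshop materials, covering 12 pain points of naive RAG, reranking and query rewriting, and document-parsing bottlenecks @jerryjliu0

DR Tulu accepted as an ICML 2026 Oral: a method that co-evolves the agent and its reward – Rulin Shao's team shows that weaker models can also serve as evaluators @RulinShao

On-policy distillation becomes a hot post-training technique, with 183 citing papers so far – Niels Rogge (Hugging Face engineer) notes that the technique is now listed on PapersWithCode @NielsRogge

NanoGPT training world record drops to 81.2 seconds – Alex Wa used learnable XSA (a learnable per-head scalar attention subtraction), applied to 6 non-paired-head layers @_djdumpling

Sakana AI releases the CUSP benchmark: AI cannot predict scientific breakthroughs but can predict its own benchmarks – Built with Oxford, Stanford, and Allen AI, it tests frontier models on 4,760 scientific events @hardmaru

Any2Any: cross-embodiment whole-body tracking transfer for humanoid robots with just 1% of the compute/data – Transfers the Gear-Sonic policy from the Unitree G1 to the LimX Oli/Luna @YJH_GIGIYE

SkillOpt deep dive: trainable skill files plus lessons from his own experience – Garry Tan (Y Combinator president and CEO) cites community practice and raises two issues: separating a skill file's description from its body, and keeping protected sections invariant @garrytan

Token-efficiency tips for writing skills for your agent – Peter Steinberger (PSPDFKit founder) shares skill files and a tool for checking token efficiency @steipete

ByteDance Seed paper: analyzing LLM capacity and scaling laws from a Shannon perspective – The paper is titled "LLMs as Noisy Channels" @fly51fly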

⭐ Featured Content

AI cracks 80-year-old math problem: OpenAI and DeepMind both break through ｜ Both labs solve Erdős conjectures

OpenAI and Google DeepMind nearly simultaneously announced milestone breakthroughs in mathematical reasoning. OpenAI's LLM solved Paul Erdős's 1946 plane unit distance conjecture (unsolved for 80 years), with the breakthrough coming from a simple prompt: "Is Erdős wrong?" DeepMind's AlphaProof Nexus system autonomously solved nine open Erdős problems — two of which had been unsolved for 56 years — at a cost of just a few hundred dollars per inference. Unlike OpenAI's natural language approach, DeepMind uses the Lean compiler to automatically verify each proof step. Cambridge mathematician Tim Gowers commented that if a human wrote this proof, it could be published directly. This marks the first time AI has achieved such a fundamental mathematical breakthrough, with profound implications for LLM reasoning research.

Sources: New Scientist ｜ The Guardian ｜ The Decoder ｜ Phys.org

HRM-Text: $1,500 trains a 1B model rivaling 7B models ｜ A new efficiency paradigm challenging Scaling Law

HRM-Text replaces the standard Transformer with a Hierarchical Recurrent Model (HRM), using MagicNorm to stabilize deep recurrent training and a task-completion objective (PrefixLM) instead of raw-text pretraining. The 1B-parameter model was trained on just 40B tokens in 1.9 days on a $1,500 budget. It scores 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH, matching 2-7B open-source models while using 96-432x less compute. The code is open source. The work directly challenges the assumption that competitive pretraining requires massive data and compute, and offers empirical evidence that low-cost pretraining can work.

Sources: arXiv
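
To make the PrefixLM objective mentioned above more concrete, here is a minimal sketch of the attention mask usually associated with prefix language modeling: prefix positions are visible to every position, while the remaining positions are attended causally. This is the generic technique, not HRM-Text's actual code; the function names `prefix_lm_mask` and `render` are illustrative assumptions.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Generic prefix-LM rule (an assumption about the setup, not taken from
    the HRM-Text code): every position sees the whole prefix, and positions
    after the prefix additionally see earlier positions and themselves.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # Two prefix tokens (the task prompt) followed by three target tokens.
    print(render(prefix_lm_mask(seq_len=5, prefix_len=2)))
```

With `prefix_len=0` the mask reduces to an ordinary causal mask, which is one way to see PrefixLM as a generalization of standard next-token training.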

2026 open-source LLM selection guide: Specialization replaces general rankings ｜ Choose models by Coding/RAG/Agent scenarios

The 2026 open-source LLM market has entered the "year of specialization." Multiple surveys and leaderboards (LLM Stats, Stormap, CodeSOTA) show MoE architectures and small parameter models (7B-14B) surpassing general large models on specific tasks. Core insight: abandon traditional general benchmark rankings and evaluate models by specific workloads like Coding, RAG, Agents, and local deployment. Among competitive new models — GLM-5, Kimi K2.6, DeepSeek-V4-Pro-Max, Qwen3.5-397B — Kimi K2.6 is the cheapest open-source model in the top 10 ($0.95/M tok). Scale Labs' leaderboard also includes benchmark data for many undisclosed models (e.g., GPT-5.5, Muse Spark).

Sources: Stormap ｜ LLM Stats ｜ Scale Labs ｜ CodeSOTA

LLM Agent evaluation's "disclosure crisis": 12 benchmark papers average only 0.38/1.0 ｜ Systematic reproducibility failure

A meta-study audited 12 well-known LLM Agent benchmark papers using a 5-dimension scoring framework covering benchmark identity, framework specification, inference settings, cost reporting, and failure decomposition. Results show Agent benchmarks average only 0.38/1.0 on disclosure scores, far below classic static benchmarks at 0.66. The biggest gaps: inference cost (0 papers disclosed) and framework specification (no complete container images). The authors released a JSON Schema, codebook, and raw scoring sheets. Another work, AgentAtlas, proposes an evaluation framework beyond traditional outcome leaderboards, including six-state control decision classification and nine-class trajectory failure classification. For Agent practitioners, these are essential reading to understand the root causes of benchmark result variance.

Sources: arXiv (audit) ｜ arXiv (AgentAtlas)
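
A rough sketch of how a five-dimension disclosure score like the one in the audit could be computed. The dimension names come from the item above; the equal weighting, the 0-1 scale per dimension, and the example scores are assumptions made for illustration and are not taken from the paper's codebook.

```python
from statistics import mean

# The five audit dimensions named in the item above.
DIMENSIONS = (
    "benchmark_identity",
    "framework_specification",
    "inference_settings",
    "cost_reporting",
    "failure_decomposition",
)


def disclosure_score(scores: dict[str, float]) -> float:
    """Average the per-dimension scores (assumed equal weights, each in [0, 1])."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} must be in [0, 1]")
    return mean(scores[dim] for dim in DIMENSIONS)


def biggest_gaps(scores: dict[str, float], n: int = 2) -> list[str]:
    """Return the n lowest-scoring dimensions, keeping DIMENSIONS order on ties."""
    return sorted(DIMENSIONS, key=lambda dim: scores[dim])[:n]


if __name__ == "__main__":
    # Hypothetical paper: names its benchmark clearly, reports no cost data.
    paper = {
        "benchmark_identity": 1.0,
        "framework_specification": 0.5,
        "inference_settings": 0.5,
        "cost_reporting": 0.0,
        "failure_decomposition": 0.0,
    }
    print(f"disclosure score: {disclosure_score(paper):.2f}")
    print("biggest gaps:", biggest_gaps(paper))
```

Keeping the per-dimension scores alongside the average is what lets an audit point to specific gaps, such as cost reporting, rather than only a single number.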

AWS MCP Server reaches GA, MCP ecosystem goes production-grade ｜ Full API coverage + IAM governance

AWS's managed MCP Server is now generally available, offering full API coverage and IAM-based governance, and is positioned as the standard interface for AI coding agents to securely access AWS services. The server is part of the AWS Agent Toolkit and supports up-to-date documentation lookup, authenticated API access, and sandboxed script execution. Meanwhile, an in-depth analysis of MCP server maturity proposes a six-level model, from Level 1 (simple API wrappers) to Level 6 (write-intent security patterns). It finds that about 70% of servers are stuck at Level 1 and fewer than 2% reach Level 4 (domain knowledge integration). A separate ecosystem tracking report catalogs 56 production-ready MCP servers and highlights trends such as registry fragmentation and OAuth 2.1 becoming the mainstream authentication standard.

Sources: InfoQ ｜ Dev|Journal ｜ Digital Applied

DeepSeek permanently cuts prices 75%, pricing war reshapes API market ｜ V4-Pro priced at 1/9 of GPT-5.5

DeepSeek permanently slashed its flagship V4-Pro API price by 75% to $0.44 per million tokens, far below OpenAI's GPT-5.5 at $5 and Anthropic's Claude Opus 4.7 at roughly $3. The company is seeking its first external funding round at a $44 billion valuation, with OpenRouter market share rising to 23.1%. V4-Pro is the largest open-weight model (1.6T parameters), ranked 9th globally. This permanent low-price strategy directly challenges US AI lab pricing models and may reshape AI market economics, with direct impact on practitioners' API selection costs.

Sources: Caixin Global ｜ TheStreet
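
The price comparison above can be checked with a few lines of arithmetic. The sketch below uses only the per-million-token figures quoted in the item and assumes each is a single flat rate; the item does not say whether the prices refer to input or output tokens, and the helper names are illustrative.

```python
# Per-million-token prices quoted in the item above (USD).
# Assumption: each figure is treated as one flat rate, because the item
# does not distinguish input from output pricing.
PRICE_PER_MILLION = {
    "DeepSeek V4-Pro": 0.44,
    "GPT-5.5": 5.00,
    "Claude Opus 4.7": 3.00,  # quoted as "roughly $3"
}


def price_before_cut(price_after: float, cut_fraction: float) -> float:
    """Recover the pre-cut price from the post-cut price and the cut size."""
    if not 0 <= cut_fraction < 1:
        raise ValueError("cut_fraction must be in [0, 1)")
    return price_after / (1 - cut_fraction)


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * price_per_million


def price_ratio(model: str, baseline: str = "DeepSeek V4-Pro") -> float:
    """How many times more expensive `model` is than `baseline`."""
    return PRICE_PER_MILLION[model] / PRICE_PER_MILLION[baseline]


if __name__ == "__main__":
    print(f"implied pre-cut price: ${price_before_cut(0.44, 0.75):.2f}/M")
    for name, price in PRICE_PER_MILLION.items():
        print(f"{name}: ${cost_usd(1_000_000_000, price):,.0f} per 1B tokens, "
              f"{price_ratio(name):.1f}x DeepSeek")
```

With these flat rates, GPT-5.5 comes out at roughly 11x the DeepSeek price, so the 1/9 figure in the headline presumably rests on a different basis, such as output-token pricing.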

AI Agent authentication protocol and terminology standardization: auth.md and Hugging Face glossary ｜ Infrastructure moves toward normalization

WorkOS released the auth.md open protocol, solving AI Agent registration and authentication on web services. Built on OAuth standards, it defines two flows — Agent Verified (ID-JAG tokens) and User Claimed (OTP email verification) — supporting scopes, auditing, and revocation. A comparison article systematically evaluates WorkOS, Stytch, Auth0, and other platforms for MCP authentication scenarios. Meanwhile, Hugging Face published an Agent glossary, systematically defining easily confused concepts like model, scaffolding, harness, agent, context engineering, policy, and tool use, providing a unified mental model for the Agent engineering community.

Sources: MarkTechPost (auth.md) ｜ MarkTechPost (auth platform comparison) ｜ Hugging Face

📄 Paper Highlights

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Microsoft ｜ 🏷️ Agent Framework, Agentic Workflow, Fine-tuning

First systematic controllable text-space optimizer for agent skills — treats skill documents as trainable parameters, achieving best-or-tied on all 52 evaluated configurations across 6 benchmarks, 7 models, and 3 execution harnesses.
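
As a rough illustration of what treating a skill document as a trainable parameter behind a validation gate might look like, the sketch below keeps a candidate edit only when a validation score improves. It is a toy greedy loop under stated assumptions and not SkillOpt's actual algorithm: `validation_gated_optimize`, the edit functions, and the keyword-based `toy_validate` scorer are all hypothetical stand-ins for model-proposed edits and held-out task evaluation.

```python
from typing import Callable, Iterable

Edit = Callable[[str], str]


def validation_gated_optimize(
    skill: str,
    candidate_edits: Iterable[Edit],
    validate: Callable[[str], float],
) -> tuple[str, float]:
    """Greedy text-space search: accept an edit only if validation improves."""
    best_skill, best_score = skill, validate(skill)
    for edit in candidate_edits:
        candidate = edit(best_skill)
        score = validate(candidate)
        if score > best_score:  # the validation gate
            best_skill, best_score = candidate, score
    return best_skill, best_score


def toy_validate(skill: str) -> float:
    """Hypothetical scorer: counts which required practices the skill mentions."""
    required = ("retry", "timeout")
    return float(sum(keyword in skill for keyword in required))


if __name__ == "__main__":
    edits: list[Edit] = [
        lambda s: s + "\n- retry on failure",  # helps, should be accepted
        lambda s: "",                          # destructive, should be rejected
        lambda s: s + "\n- set a timeout",     # helps, should be accepted
    ]
    skill, score = validation_gated_optimize(
        "# Skill: call the API", edits, toy_validate
    )
    print(score)
    print(skill)
```

The gate is what keeps a destructive edit from replacing a working skill document; in a realistic setup the scorer would run the agent on validation tasks rather than match keywords.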

FastKernels: Benchmarking GPU Kernel Generation in Production

Snowflake AI Research ｜ 🏷️ Agent Framework, Inference, Benchmark

Production-grade GPU kernel benchmark covering 96.2% of HuggingFace architectures. Even the strongest kernel agent reaches only a 0.94x speedup over baselines (that is, slower than the baseline), exposing a critical benchmark-production misalignment.

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Appen ｜ 🏷️ Safety, Multimodal, RLHF/DPO

First systematic cross-lingual multimodal red-teaming study — finds safety rankings don't transfer across languages, with Qwen Omni overtaking Pixtral Large as most vulnerable in Spanish, proving language and modality alignment failures operate through distinct mechanisms.

🐙 GitHub Trending

claude-cookbooks ｜ Official Claude usage guide

Anthropic's official collection of Jupyter Notebooks covering function calling, multi-step reasoning, and Agent workflows. Run-to-learn examples — the most authoritative starting point for Claude best practices.

GitHub ｜ ⭐ 44,202 ｜ 🗣️ Jupyter Notebook ｜ 🏷️ LLM, Agent, DevTool

OpenBB ｜ Financial data platform for AI Agents

Unified interface for stock, crypto, and market data with natural language querying and Agent integration. Modular data connectors, extensible tool-calling capabilities, and built-in quantitative analysis make it ideal for financial LLM application development.

GitHub ｜ ⭐ 68,104 ｜ 🗣️ Python ｜ 🏷️ Agent, Data, DevTool