AI Tech Daily - 2026-06-25 | Recsys Frontier

type

Post

status

Published

date

Jun 25, 2026 04:31

slug

ai-daily-en-2026-06-25

summary

📊 Today's Overview

AI infrastructure is heating up fast. OpenAI and Broadcom released Jalapeño, their first custom LLM inference chip, claiming 4x throughput and 5x energy efficiency over GPUs. Cursor is training a 1.5 trillion-parameter model from scratch on xAI's Colossus cluster, an app-layer company going full-stack. Meanwhile, Qualcomm is putting roughly $14B into Modular and a rumored Tenstorrent deal to break NVIDIA's CUDA moat with open hardware and compilers. On the agent front, Qwen's AgentWorld, a language world model that simulates 7 agent environments, beats Claude Opus 4.8 and GPT-5.4 on AgentWorldBench, and Google added native Computer Use to Gemini 3.5 Flash. NVIDIA also open-sourced DFlash, a speculative decoding method hitting up to 15x throughput on Blackwell.

🔥 Trend Insights

Custom silicon for inference: OpenAI's Jalapeño chip marks a strategic pivot from GPU dependency to purpose-built hardware, targeting 4x throughput and 5x efficiency gains for Transformer workloads.

App-layer companies go full-stack: Cursor training a 1.5T-param model from scratch on Colossus signals that application companies are closing the gap with AI labs by owning their own models and infrastructure.

Agent infrastructure matures fast: Qwen's AgentWorld, Google's Computer Use, and NVIDIA's DFlash all shipped today — the tooling for building and deploying agents is becoming production-ready at every layer.

🐦 X/Twitter Highlights

📈 Hot Topics & Trends

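Qwen releases AgentWorld, a language world model that simulates 7 environments and outperforms Claude Opus 4.8 and GPT-5.4 - Qwen (Tongyi Qianwen) open-sourced Qwen-AgentWorld, a native language world model trained for environment modeling from day one rather than adapted after the fact. It simulates 7 agent environments: MCP, Search, Terminal, SWE-bench, Web, OS, and Android. It beats Claude Opus 4.8 and GPT-5.4 on AgentWorldBench. The paper also finds that reinforcement learning with controllable simulation inside the world model outperforms training in real environments. @Alibaba_Qwen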

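Zhipu AI team debuts in Silicon Valley with GLM-5.2 at AI Engineer World's Fair; SakanaAI Fugu-Ultra lands on OpenRouter as a collective-intelligence deployment - Zhipu founder Louszbd (Cao Yue) said the team is visiting Silicon Valley for the first time to attend the AI Engineer conference. GLM-5.2 had previously been described by some commentators as a world-leading open-source model. Separately, SakanaAI (a Japanese AI lab) deployed Fugu-Ultra on OpenRouter, on the premise that the "collective intelligence of multiple best models" beats any single model. @swyx @SakanaAILabs @OpenRouter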

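Databricks co-founders on the move into the enterprise agent infrastructure layer, with LTAP and Omnigent at the core - On the Latent Space podcast, Databricks co-founders Matei Zaharia and Reynold Xin explained that Databricks is entering the enterprise agent infrastructure layer: Omnigent builds a shared harness for coding agents and custom agents, LTAP and Lakebase re-split operational and database workloads, and agent security requires contextual policies and consumption controls. @latentspacepod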

🔧 Tools & Products

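Cursor supports delegating tasks directly from Notion, built on the Cursor SDK - Cursor (the AI coding IDE) launched a Notion integration that lets users @Cursor inside Notion to assign any spec or task to Cursor. A cloud agent driven by the same models, harness, and runtime then opens a PR automatically. @cursor_ai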

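MiniMax M3 becomes the default builder model in Kimchi Coding, with 1M context - Kimchi Coding (a coding platform from Cast AI) adopted MiniMax M3 (an open-source model with a 1M context window and strong coding ability) as its default builder model, routing tasks by complexity, cost, and deployment requirements. @MiniMax_AI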

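Kimi API launches on AWS Marketplace with consolidated billing and EDP drawdown - The API for Kimi (the AI assistant from Moonshot AI) is now directly available on AWS Marketplace. AWS customers can use consolidated billing, and eligible customers can apply Kimi API usage directly against their AWS EDP commitments. @Kimi_Moonshot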

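Weaviate launches Engram, which actively reconciles agent memory to avoid contradictory facts - Weaviate (the open-source vector database) released Engram, a memory management tool. When new information arrives (for example, a user is promoted from engineer to CEO), Engram proactively retrieves related memories and uses an LLM to decide whether to rewrite old memories or delete duplicates, keeping the agent's context clean. @weaviate_io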
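The reconcile-on-write idea described above can be sketched in a few lines. This is a toy illustration, not Weaviate's or Engram's API: the `Memory` record, the `decide` function (a rule-based stand-in for the LLM call), and `reconcile` are all hypothetical names, and "related" is approximated by an exact match on subject and attribute rather than vector retrieval.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Memory:
    subject: str
    attribute: str
    value: str


def decide(old: Memory, new: Memory) -> str:
    """Rule-based stand-in for the LLM decision (hypothetical)."""
    return "drop_duplicate" if old.value == new.value else "rewrite"


def reconcile(store: List[Memory], new: Memory) -> Tuple[List[Memory], List[str]]:
    """Return the updated store plus the action taken for each related memory."""
    kept: List[Memory] = []
    actions: List[str] = []
    for old in store:
        if (old.subject, old.attribute) == (new.subject, new.attribute):
            # Related memory: it is either stale or a duplicate, so it is
            # removed here and superseded by the single new entry below.
            actions.append(decide(old, new))
        else:
            kept.append(old)
    kept.append(new)
    return kept, actions


store = [Memory("alice", "role", "engineer"), Memory("alice", "city", "Berlin")]
store, actions = reconcile(store, Memory("alice", "role", "CEO"))
```

After the call, the store holds one `role` entry for `alice` instead of two contradictory ones. A production system would presumably replace the exact-match lookup with similarity search and the rule with a model call.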

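Supabase partners with Okta on Cross App Access to give AI agents secure data access - Supabase (the open-source backend-as-a-service platform) is among the first providers to support Okta Cross App Access (XAA). XAA helps teams give AI agents secure, governed data access without static API keys or one-off authentication flows. @supabase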

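Modal launches Auto Endpoints, with inference latency 60ms faster than the best provider - Modal (the serverless GPU platform) released Auto Endpoints, offering "one-click" open-source inference performance. It developed a low-latency inference solution together with DecagonAI, with end-to-end responses 60ms faster than the best proprietary provider. @modal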

⚙️ Technical Practice

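Mistral OCR beats GPT-5.5 on cost-effectiveness on ParseBench and approaches Gemini 3.1 Pro with chart annotation - Jerry Liu (LlamaIndex founder) published ParseBench results: Mistral OCR leads on semantic formatting (strikethrough, superscript/subscript, heading hierarchy, links), is competitive on content faithfulness and visual bounding boxes, is middling on tables, and is weak on charts. Even without chart annotation, its overall score sits ahead of GPT-5.5 and behind Gemini 3.1 Pro. With the chart annotation feature enabled, the chart score rises and the overall result approaches Gemini 3.1 Pro. @jerryjliu0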

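Baidu open-sources Unlimited OCR: 3B parameters / 500M active, 40+ pages in one pass - Baidu released Unlimited OCR. Its key technique is Reference Sliding Window Attention (R-SWA), which keeps the KV cache size constant and reduces attention overhead. With 3B total parameters and only 500M active, it can transcribe 40+ page documents in a single forward pass and reaches end-to-end SOTA on OmniDocBench v1.5 and v1.6. @BaiduAI_News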
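To make the "constant KV cache" claim concrete, here is a minimal sketch of one generic way to bound a cache: a fixed number of pinned entries plus a rolling window of recent ones. The source does not describe how R-SWA actually selects or uses its reference tokens, so the pinned slots, the class name `BoundedKVCache`, and its parameters are assumptions for illustration only, not Baidu's method.

```python
from collections import deque
from typing import Any, Deque, List


class BoundedKVCache:
    """Toy cache whose size never exceeds n_pinned + window entries."""

    def __init__(self, n_pinned: int, window: int) -> None:
        self.n_pinned = n_pinned
        self.pinned: List[Any] = []                    # assumed "reference" slots
        self.recent: Deque[Any] = deque(maxlen=window)  # rolling window

    def append(self, kv: Any) -> None:
        if len(self.pinned) < self.n_pinned:
            self.pinned.append(kv)
        else:
            # deque(maxlen=...) silently evicts the oldest entry when full.
            self.recent.append(kv)

    def visible(self) -> List[Any]:
        """Entries a new token would attend to."""
        return self.pinned + list(self.recent)

    def __len__(self) -> int:
        return len(self.pinned) + len(self.recent)


cache = BoundedKVCache(n_pinned=2, window=3)
for position in range(10):
    cache.append(position)
```

However long the input grows, `len(cache)` stays at most 5 here, which is why per-token attention cost stops growing with document length in schemes of this kind.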

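Community developer builds a humanoid robot directory with MiniMax M3 + Opencode, learning while building - A user (whosamberella) used MiniMax M3 to research the humanoid robotics field, then had it write code directly in Opencode to build a robot directory website, with a description of each robot's distinguishing features and SVG prototypes for illustration. @MiniMax_AI @whosamberella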

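Jo Kristian Bergum (Vespa CTO) to speak on the renewed value of BM25 in agentic search; Yoav introduces a language for describing agent context structure - Bergum will give a talk titled "BM25 for agentic search" at AI Engineer World's Fair, arguing that GPT-5 is extremely strong at search, which changes the narrative of BM25 as a baseline. Separately, yoavgo (an independent AI developer) and noga2p introduced a new language for precisely describing the structure and evolution of an agent's context, improving how clearly coding agents understand their own context. @jobergum @yoavgo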

⭐ Featured Content

OpenAI and Broadcom Release First LLM Inference Chip 'Jalapeño' ｜ Custom silicon marks a new phase in AI infrastructure competition

OpenAI and Broadcom jointly released 'Jalapeño', their first custom chip optimized for LLM inference. It features hardware-level optimizations for Transformer attention mechanisms and feed-forward networks, claiming 4x inference throughput and 5x energy efficiency over general-purpose GPUs. The article details the chip architecture, partnership model, and long-term strategy. This is a critical turning point for OpenAI — moving from external hardware dependency to in-house silicon — and could reshape the AI infrastructure landscape, with major implications for inference cost and deployment efficiency.

Sources: OpenAI

Cursor Trains 1.5 Trillion Parameter Frontier Model from Scratch on Colossus ｜ Application-layer company builds its own model, narrowing the gap with dedicated AI labs

At the Compile conference, Cursor announced it is training a 1.5 trillion-parameter frontier model from scratch on xAI's Colossus cluster, expected to ship within weeks. This is Cursor's first move away from open-source base models (previously based on Kimi K2.5), building a full training pipeline using over 100,000 NVIDIA GPUs. The move transforms Cursor from an API reseller into a model owner, fundamentally changing its cost structure and signaling that application-layer companies are closing the gap with dedicated AI labs by leveraging supercomputing clusters.

Sources: TechTimes

Qualcomm Bets $14 Billion on Modular and Tenstorrent to Challenge NVIDIA's CUDA Monopoly ｜ Open hardware + open compiler, a two-pronged attack

Qualcomm announced at Investor Day its ~$3.9B acquisition of AI compiler startup Modular (Mojo language/MAX engine), alongside a rumored $8-10B acquisition of RISC-V AI chip company Tenstorrent (led by Jim Keller). Combined, the two deals would total roughly $12-14B, with the headline $14B figure corresponding to the top of the rumored range. The aim is to break NVIDIA's AI monopoly through open hardware (RISC-V) and open compilers (CUDA alternatives). The article analyzes NVIDIA's CUDA ecosystem lock-in effect and explains why Qualcomm needs both chip and compiler to mount a credible challenge.

Sources: TechTimes

NVIDIA Releases DFlash Speculative Decoding: Up to 15x Inference Throughput on Blackwell ｜ Open-sourced and integrated with vLLM, SGLang, and TensorRT-LLM

NVIDIA released DFlash, a block-diffusion-based speculative decoding method achieving up to 15x inference throughput on Blackwell GPUs. Compared to EAGLE-3, DFlash nearly doubles interactivity on Llama 3.1 8B and delivers 5.8x and 5.1x speedups on Gemma 4 31B and Qwen3 8B respectively. It is open-sourced and integrated with vLLM, SGLang, and TensorRT-LLM, with 20 Hugging Face checkpoints available. For LLM inference optimization practitioners, this is a directly deployable technique.
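For readers new to the technique, the following is a minimal sketch of generic greedy speculative decoding: a cheap draft model proposes a block of tokens, and the target model keeps the longest prefix it agrees with. It does not reproduce DFlash's block-diffusion drafter, and the toy "models" (`toy_target`, `toy_draft`) are hypothetical stand-ins that map a token list to the next token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, block: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes `block` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal and accepts the longest matching
        #    prefix. Real systems score all positions in one batched pass.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The target always adds one token of its own (a correction or a
        #    bonus token), so every round makes progress.
        out.append(target(out))
    return out[: len(prompt) + max_new]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after a 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


fast = speculative_decode(toy_target, toy_draft, prompt=[3], max_new=8)
slow = greedy_decode(toy_target, prompt=[3], max_new=8)
```

The key property is that `fast` and `slow` are identical: with greedy verification the output matches what the target model would have produced alone, and the speedup comes only from how many drafted tokens are accepted per target pass.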

Sources: NVIDIA

Google Launches Native Computer Use Tool in Gemini 3.5 Flash ｜ Following Anthropic, another major model natively supports GUI automation

Google officially launched a built-in computer use tool in Gemini 3.5 Flash, allowing the model to directly control browsers, desktop applications, and other graphical interfaces. It supports screenshot capture, clicking, text input, and other operations, with safety guardrails. The feature is exposed via API, enabling developers to build automation workflows immediately. This marks another major model's native support for GUI automation after Anthropic's Computer Use, offering practical value for agent developers.

Sources: Google Blog ｜ DeepMind Blog

MCP Protocol Gets Its Biggest Structural Update: Moving to Stateless Design ｜ July 28, 2026 RC removes handshake and session IDs, simplifying horizontal scaling

The MCP protocol is receiving its biggest structural update: the July 28, 2026 RC version shifts the core to stateless, removing the initialize handshake and Session-ID so each request is self-contained. The article uses before/after comparisons and hand-crafted HTTP request examples to clearly demonstrate how stateless design simplifies horizontal scaling (no sticky routing or shared session store needed), and explains explicit State Handle patterns (like basket_id) as replacements for implicit sessions. For developers building MCP services or clients, this is a critical change to track.
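The explicit state handle pattern can be sketched as a handler that reads everything it needs from the request itself. This is an illustration under stated assumptions, not the MCP wire format: the request shape (`tool`, `arguments`), the tool names, and the in-memory `BASKETS` dict (standing in for a durable application store) are all hypothetical.

```python
import uuid
from typing import Any, Dict, List

# Stand-in for a durable application store such as a database (hypothetical).
BASKETS: Dict[str, List[str]] = {}


def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Handle one self-contained request. No handshake or session is consulted."""
    tool = request["tool"]
    args = request.get("arguments", {})

    if tool == "create_basket":
        basket_id = str(uuid.uuid4())
        BASKETS[basket_id] = []
        # The handle goes back to the client, which sends it on later calls.
        return {"basket_id": basket_id}

    if tool == "add_item":
        basket = BASKETS.get(args.get("basket_id"))
        if basket is None:
            return {"error": "unknown basket_id"}
        basket.append(args["item"])
        return {"basket_id": args["basket_id"], "items": list(basket)}

    return {"error": f"unknown tool: {tool}"}


created = handle_request({"tool": "create_basket"})
updated = handle_request({
    "tool": "add_item",
    "arguments": {"basket_id": created["basket_id"], "item": "apple"},
})
```

Because the basket is addressed by an explicit `basket_id` rather than an implicit session, any server replica with access to the application store could serve the second request, which is what removes the need for sticky routing.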

Sources: mayflower.blog

Google Research Reveals How Reasoning Unlocks Parametric Knowledge in LLMs ｜ Reasoning chains not only improve complex tasks but also enhance factual recall

Google Research's experiments show that reasoning (specifically chain-of-thought) not only improves performance on complex tasks but also significantly improves LLMs' ability to recall facts from parametric memory. The article shows how reasoning chains activate internal knowledge and outperform direct answering, especially in scenarios requiring precise recall such as factual QA. It provides quantitative results and visual analysis, offering a new and counterintuitive perspective on what reasoning is for.

Sources: Google Research

NVIDIA NeMo AutoModel Achieves 3.4-3.7x MoE Fine-Tuning Speedup ｜ Based on Transformers v5, one line of code to switch

NVIDIA NeMo AutoModel, built on Transformers v5's MoE foundation, uses Expert Parallelism, DeepEP fused all-to-all scheduling, and TransformerEngine kernels to achieve 3.4-3.7x training throughput improvement and 29-32% memory savings during MoE fine-tuning. The API is compatible with from_pretrained(), requiring just one line of code to switch. The blog includes performance comparisons for multi-node 550B and single-node 30B models, explaining the sources of speedup — highly valuable for practitioners needing efficient MoE fine-tuning.

Sources: Hugging Face

🎙️ Podcast Picks

Why the Frontier Ecosystem must be Open — Matei Zaharia and Reynold Xin, Databricks

📍 Source: Latent Space | ⭐⭐⭐⭐⭐ | 🏷️ LLM, Agent, Infra | ⏱️ 1:08:52

Databricks co-founders Matei Zaharia and Reynold Xin sit down with swyx at the 2026 Data+AI Summit. They introduce Omnigent — an open-source meta-framework for composing, controlling, and sharing coding agents (Claude Code, Codex, Cursor) and enterprise agents, solving common problems around portability, session history, security, and cost control. Reynold articulates the database vision: LTAP (Log-Time Analytical Processing) gets most of HTAP's benefits through a unified storage layer rather than merging query engines, and he criticizes CDC as "continuous data corruption." They also discuss Databricks' evolution from lakehouse to a data-and-AI operating system, arguing that proprietary data, governed access, and feedback loops will become durable advantages in the agent era.

💡 Why Listen: Two of the most influential minds in data infrastructure lay out their full vision for the agent era — Omnigent, LTAP, and why they think most current agent architectures are fundamentally wrong. Dense with insights you won't find in blog posts.

📄 Paper Highlights

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

OpenAI ｜ 🏷️ Fine-tuning, RLHF/DPO, Safety

OpenAI shows that RL on beneficial behavior (truthfulness, fairness, corrigibility) generalizes across domains: more than 80% of out-of-distribution (OOD) benchmarks improve, and models resist adversarial prompting and harmful fine-tuning better than compute-matched baselines.

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

Meta AI ｜ 🏷️ Fine-tuning, Reasoning, Distillation

Replaces instance-level trajectory imitation with reusable strategy distillation. SGPO uses a forward-KL objective and adaptive weighting to transfer reasoning strategies, beating SFT and RL baselines by 2.2 points on Qwen2.5-7B.
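As background on the term, here is a small sketch of forward KL divergence between two discrete distributions, using the common distillation convention in which "forward" means KL(teacher || student). It is the generic textbook definition, not SGPO's actual loss, and the function name and example distributions are illustrative assumptions.

```python
import math
from typing import Sequence


def forward_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) over a shared discrete support.

    Assumes student[i] > 0 wherever teacher[i] > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


teacher = [0.5, 0.5]
student = [0.9, 0.1]
forward = forward_kl(teacher, student)   # penalizes the student for missing teacher mass
reverse = forward_kl(student, teacher)   # the reverse direction gives a different value
```

The two directions differ because KL is asymmetric: the forward form weights errors by the teacher's probabilities, which pushes the student to cover every outcome the teacher considers likely.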

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

Fujitsu Research of Europe ｜ 🏷️ Agent Framework, Safety, Multi-Agent

A graph-driven dynamic red-teaming methodology that evaluates 45 diverse agentic systems. Uses hierarchical representation + adaptive adversarial attacks to produce unified security evaluations across heterogeneous architectures.

🐙 GitHub Trending

DFlash ｜ 15x inference throughput via speculative decoding

NVIDIA's block-diffusion-based speculative decoding method achieves up to 15x throughput on Blackwell GPUs. Already integrated with vLLM, SGLang, and TensorRT-LLM — directly deployable for production inference optimization.

GitHub ｜ ⭐ 2,100+ ｜ 🗣️ Python ｜ 🏷️ Inference, Speculative Decoding, LLM

Qwen-AgentWorld ｜ Language world model for agent environments

Alibaba Qwen's open-source language world model simulates 7 agent environments (MCP, Search, Terminal, SWE-bench, Web, OS, Android). Outperforms Claude Opus 4.8 and GPT-5.4 on AgentWorldBench.

GitHub ｜ ⭐ 1,800+ ｜ 🗣️ Python ｜ 🏷️ Agent, World Model, Simulation