Why Did LayerNorm + AdamW Become the Standard Configuration for Deep Networks? From Scale Invariance to Gradient Dynamics

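Deep networks rely on LayerNorm (or RMSNorm), which creates local scale invariance, and that invariance brings its own distinctive gradient dynamics. In this regime, several machine-learning intuitions are overturned. The weight norm no longer represents feature strength; it becomes a knob that controls learning progress. In theory the norm grows steadily, so SGD comes with a built-in learning-rate decay, but it brakes so hard that learning stops early. Weight decay, meanwhile, evolves from a regularization term into a dynamic valve on the effective learning rate. How AdamW became the default: Adam keeps the step size constant with respect to gradient magnitude, so the effective learning rate brakes gently; warmup handles the early-training problems of overly small weights (exploding gradients) and inaccurate second-moment estimates; and AdamW fixes the problem with L2 regularization by introducing decoupled weight decay, splitting "direction updates" and "progress control" into two clean knobs.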
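The scale invariance described above can be checked numerically. The following is a minimal sketch, not the post's own code: it assumes a toy linear layer followed by RMSNorm (the names `rms_norm`, `loss`, and `numerical_grad` are hypothetical) and uses finite differences to show that scaling the weights by `c` leaves the loss unchanged, keeps the gradient orthogonal to the weights, and shrinks the gradient by a factor of `c`.

```python
import numpy as np

def rms_norm(z):
    # RMSNorm without a learnable gain or epsilon, so invariance is exact.
    return z / np.sqrt(np.mean(z ** 2))

def loss(w, x, target):
    # Linear layer followed by RMSNorm: rescaling w does not change the output.
    return 0.5 * np.sum((rms_norm(w @ x) - target) ** 2)

def numerical_grad(f, w, h=1e-5):
    # Central finite differences, one weight entry at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = h
        g[idx] = (f(w + step) - f(w - step)) / (2 * h)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
x = rng.normal(size=3)
target = rng.normal(size=4)
f = lambda w_: loss(w_, x, target)

c = 10.0  # assumed scale factor, chosen only for illustration
g_small = numerical_grad(f, w)
g_large = numerical_grad(f, c * w)

loss_gap = abs(f(w) - f(c * w))                        # close to 0
radial = abs(np.sum(g_small * w))                      # close to 0
ratio = np.linalg.norm(g_small) / np.linalg.norm(g_large)  # close to c
print(loss_gap, radial, ratio)
```

Because the gradient norm scales as 1/‖w‖ while each SGD step moves w by an absolute amount, the relative change in direction per step scales as lr/‖w‖², which is why a growing norm acts like a learning-rate decay in this setting.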

Recommendation Algorithms Can Add Icing to the Cake, but They Cannot Rescue You

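While working with many product and operations teams, I often have to play the one who pours cold water on things, especially when everyone has high hopes for the recommendation algorithm. When I hear a strategic plan like "Our goal next year is 80% growth, and the recommender system is the key to it," my view is blunt: if your growth strategy depends so heavily on the recommendation algorithm that the target collapses the moment the algorithm underperforms, then it is fundamentally a bad strategy. When it comes to growing scale, a recommendation algorithm cannot rescue you; it can only add icing on top of scale you already have.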

From RL Forgetting Less Than SFT to Rethinking the Flaws of Recommender Systems

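A number of recent studies find that, in LLMs, RL is less prone to catastrophic forgetting than SFT. They point out clearly that RL's on-policy nature keeps the parameters stable, whereas SFT pushes the model parameters in directions that differ greatly from the pre-training distribution, which causes forgetting (as shown in the figure, forgetting is measured as the drop in average performance on old tasks as new tasks are learned). This clear conclusion shed light on many things for me: problems in recommender systems that used to look isolated may now connect into a whole, with deeper support beneath them. This post covers:
• A walkthrough of the LLM work showing that RL causes less catastrophic forgetting than SFT
• Recommender systems are standard off-policy supervised learning, and (as a conjecture) many of their flaws should stem from this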

How Large a Model Can a Recommender System Run Online?

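This post does not approach the deployment and optimization of complex models from a systems-optimization angle. Instead, it takes an industry cost perspective and asks how complex a model can be served online while still meeting cost and ROI requirements. The assumptions:
• The e-commerce recommendation industry, mainly because I am more familiar with its cost accounting
• A standard Transformer deployed as the ranking model, following the OneTrans architecture
• Parameter scales aligned with the qwen2 model family, to see more intuitively which size can run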

Talent Dilution Roofline: Your Algorithm Team May Not Need to Hire Anymore?

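The Roofline model is an intuitive model from high-performance computing for analyzing a program's performance bottleneck, named for the roof-like shape of its plot. As the figure below shows, the horizontal axis is the algorithm's arithmetic intensity in FLOP/Byte (the number of floating-point operations divided by the amount of memory traffic), and the vertical axis is attained performance in FLOP/s. It describes how performance rises linearly as arithmetic intensity increases (memory-bound) until the intensity passes the hardware's ridge point, after which performance approaches the hardware's peak (compute-bound). The core questions it answers: what actually limits your program, compute capability or memory bandwidth, and where should you optimize?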
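The roofline relationship can be written as attainable performance = min(peak compute, memory bandwidth × arithmetic intensity). Below is a minimal sketch of that formula; the function names and the hardware numbers (100 TFLOP/s peak, 1 TB/s bandwidth) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound in FLOP/s for a kernel with the given FLOP/byte intensity."""
    return min(peak_flops, mem_bandwidth * intensity)

def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_flops / mem_bandwidth

# Hypothetical hardware, not a real device.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # 1 TB/s

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
for intensity in (1, 10, 100, 1000):
    bound = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOP/byte -> {bound / 1e12:.0f} TFLOP/s ({regime})")
```

With these assumed numbers, any kernel below 100 FLOP/byte is limited by memory bandwidth, so raising its arithmetic intensity helps more than adding raw compute.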

OneTrans: Aligning Sequence Processing and Feature Interaction in Recommender Systems

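Ever since fine-ranking switched to deep learning, industry has split research on ranking model architecture into two basic parts: sequence processing and feature interaction. At some companies the ranking group is even divided into two teams, one for behavior sequences and one for feature interaction. In the earliest days, for example when DIN handled the sequence, the sequence was compressed into one or more vector representations, which then took part in interactions with the other features. We can think of this as MLP(concat(DIN, Features)). Most model research today still evolves the two parts separately: replacing the MLP with DCN, adding LHUC, or making it more complex with Rank Mixer or a Transformer on one side; stacking MHA on top of DIN or replacing it outright with a Transformer on the other. The result can be written as RankMixer(concat(Transformer, Features)).

From MLP(concat(DIN, Features)) to RankMixer(concat(Transformer, Features)), the essence has not changed: sequence processing and feature interaction form an implicit two-stage pipeline, and the sequence is compressed into a vector space before it interacts with the features. What makes LLMs interesting is that the interactions used in next-token prediction happen in the token space of the word sequence. The lesson for recommendation ranking models is that every feature interaction should happen in the token space of the user sequence.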
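The contrast between the two-stage pipeline and interaction in token space can be illustrated with a toy example. This is only a sketch under strong assumptions and not the OneTrans architecture: mean pooling stands in for DIN-style compression, `self_attention` is a hypothetical single-head attention with identity projections, and all shapes are made up.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head attention with identity Q/K/V projections, for illustration only.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
d = 8
seq_tokens = rng.normal(size=(5, d))    # 5 user-behavior tokens (assumed)
feat_tokens = rng.normal(size=(3, d))   # 3 non-sequence features embedded to width d (assumed)

# Two-stage: the sequence is compressed to one vector before it meets the features.
seq_vector = self_attention(seq_tokens).mean(axis=0)
two_stage_input = np.concatenate([seq_vector, feat_tokens.ravel()])

# Token space: features join the behavior tokens, so attention lets every
# feature interact with every individual behavior, not with a pooled summary.
unified_tokens = np.concatenate([seq_tokens, feat_tokens], axis=0)
unified_out = self_attention(unified_tokens)

print(two_stage_input.shape, unified_out.shape)
```

In the two-stage path a downstream network only ever sees the pooled `seq_vector`, while in the unified path each feature token attends to each behavior token directly.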

AI Weekly 2026-W26

This week in AI centers on a single core narrative: capability breakthroughs at the massive infrastructure layer are accelerating the shift from lab to production. OpenAI dropped two bombs on the same day — its in-house inference chip Jalapeño and GPT-5.6 Sol — covering the full stack from hardware to model. These aren't isolated launches; they're coordinated moves up and down the stack: the chip optimizes inference cost, the model pushes the capability ceiling, and both share the same infrastructure. The second thread is Agent engineering moving from experiments to production governance. Stripe published a real-world case on financial compliance agents, AWS posted three consecutive blogs on MCP agent layers and data governance, and GitHub shared benchmarking data on Copilot's agentic harness. Meanwhile, Anthropic's Claude Slack Tag positions the LLM as a persistent organizational member — Karpathy called it "the third major LLM UI/UX design paradigm." Agents are no longer one-shot conversations but continuously running roles inside companies. The third thread is post-training evolving from manual exploration to automated, systematic processes. Amazon released A-Evolve, achieving autonomous post-training on a 30B model with no human intervention. OpenAI verified that beneficial-behavior RL generalizes out-of-distribution durably. Qwen's landmark language world model provides a scalable training environment for agent RL. These works collectively signal: RL is no longer just a fine-tuning step after SFT — it's becoming the main engine for expanding model capabilities.

RecSys Weekly 2026-W26

Of the 12 papers this week, industrial deployments dominate — 8 come from first-tier platforms like YouTube, TikTok, Kuaishou, Tencent, and Walmart, all with online A/B experiment metrics. Research clusters around three overlapping directions: generative recommendation with LLM augmentation, GPU acceleration for large-scale retrieval, and industrial system architecture and attribution optimization. Generative recommendation moves from "generating item IDs" to "generating physical items": Kuaishou's RaG unifies generative recommendation with video generation, achieving +1.87% ad revenue on a 400M DAU platform. YouTube's TokenMinds extends Semantic ID from the item side to the user side, producing both discrete user tokens and dense embeddings, covering full user traffic. Both routes point to the same judgment — generative recommendation is moving from offline consistency verification to online revenue realization. User modeling accelerates its shift from dense vectors to discrete semantic IDs: Kuaishou and YouTube published SID-based frameworks almost simultaneously. This isn't just a change in representation form — it means that the underlying token space of recommendation systems is beginning to align with that of the LLM world, substantially lowering the cost of cross-scenario unification (short-form video / long-form video, recommendation / advertising). Industrial attribution and scaling methodology move toward precision: TikTok's Attribution Correction Framework aligns causal experiments with daily production attribution, reducing measured cannibalization by roughly 15 percentage points. Tencent's NOVA uses an agent to automate architecture evolution, achieving +2.02% GMV on L3 tasks online. Kuaishou's UniFormer proposes a model-centric scaling framework that explicitly decomposes the modeling space into feature and task dimensions. Together, these three reveal a pattern: as model architectures converge, engineering automation and measurement accuracy become th

AI Tech Daily - 2026-06-29

AI infrastructure hit new milestones today: Microsoft's $7.3B Fairwater campus links hundreds of thousands of Blackwell GPUs into a single supercomputer via 800G Ethernet. DeepSeek V4's DSpark framework slashes inference latency by 80% with full-stack open source, while SubQ's dynamic sparse attenti

AI Tech Daily - 2026-06-27

A massive day in AI: OpenAI previewed GPT-5.6 Sol with a new architecture and 1M context, but the release was held back by the Commerce Department requiring per-customer approval — a regulatory first that could reshape how frontier models ship. Meanwhile, GLM-5.2 became the first open-weight model t

AI Tech Daily - 2026-06-26

Agent infrastructure funding hit new highs today: Sail raised $80M for long-running agent inference, and PimDeWitte closed $320M at a $2.3B valuation for world model data. SWE-bench Pro replaced the compromised SWE-bench Verified, while OpenAI's economic report revealed Codex consumes 99.8% of its o

AI Tech Daily - 2026-06-25

AI infrastructure is heating up fast. OpenAI and Broadcom released Jalapeño, their first custom LLM inference chip, claiming 4x throughput and 5x energy efficiency over GPUs. Cursor is training a 1.5 trillion-parameter model from scratch on xAI's Colossus cluster — an app-layer company going full-st