type
Post
status
Published
date
Jun 29, 2026 07:07
slug
rec-weekly-en-2026-W26
tags
Recommendation Systems
Weekly
Papers
category
Rec Tech Report
icon
📚
priority
1
Weekly Overview
Of the 12 papers this week, industrial deployments dominate — 8 come from first-tier platforms like YouTube, TikTok, Kuaishou, Tencent, and Walmart, all with online A/B experiment metrics. Research clusters around three overlapping directions: generative recommendation with LLM augmentation, GPU acceleration for large-scale retrieval, and industrial system architecture and attribution optimization.
Generative recommendation moves from "generating item IDs" to "generating the items themselves": Kuaishou's RaG unifies generative recommendation with video generation, achieving +1.87% ad revenue on a 400M DAU platform. YouTube's TokenMinds extends Semantic ID from the item side to the user side, producing both discrete user tokens and dense embeddings and serving full user traffic. Both routes point to the same conclusion: generative recommendation is moving from offline consistency verification to online revenue realization.
User modeling accelerates its shift from dense vectors to discrete semantic IDs: Kuaishou and YouTube published SID-based frameworks almost simultaneously. This isn't just a change in representation form — it means that the underlying token space of recommendation systems is beginning to align with that of the LLM world, substantially lowering the cost of cross-scenario unification (short-form video / long-form video, recommendation / advertising).
Industrial attribution and scaling methodology move toward precision: TikTok's Attribution Correction Framework aligns causal experiments with daily production attribution, reducing measured cannibalization by roughly 15 percentage points. Tencent's NOVA uses an agent to automate architecture evolution; its L3 candidates achieve up to +2.02% GMV online. Kuaishou's UniFormer proposes a model-centric scaling framework that explicitly decomposes the modeling space into feature and task dimensions. Together, these three reveal a pattern: as model architectures converge, engineering automation and measurement accuracy become the new moats in industrial competition.
Generative Recommendation & User Representation: From Semantic IDs to Video Generation
The densest cluster of industrial deployments this week sits in the generative recommendation paradigm. Three papers from top video and search platforms cover it. The two video-platform papers take distinct technical routes: one uses semantic IDs as a unified language, the other connects recommendation directly to multimodal generation models. The third, from Walmart, skips semantic IDs and distills structured intent attributes with an LLM.
RaG (Kuaishou) proposes a Recommendation-as-Generation paradigm. The core idea: if generative recommendation (e.g., GRM) can predict the next item a user is interested in from a sequence of semantic IDs, why not feed that sequence's output directly as input to video generation? RaG connects the recommendation pipeline to the video generation pipeline through shared semantic IDs. Specifically, each video is encoded into two types of SID — content semantic SID and creative style SID — and a user's interaction sequence is modeled as a preference distribution over these two SID types. Then, Video Generation Agents receive the inferred target SIDs and hierarchically plan visual composition, audio alignment, and artistic effect enhancement. For end-to-end optimization, RaG introduces cross-domain cooperative reward learning that jointly measures interest alignment, user feedback, and video quality. Deployed on an industrial platform with 400M DAU, RaG lifts revenue by +1.87% in advertising scenarios, outperforming the already-deployed strong baseline GRM. The deeper implication: the output of a recommendation system is no longer limited to "selecting items from a candidate pool" — it can now "create items on demand," upending the passive relationship between recommendation systems and content consumption.
TokenMinds (YouTube) takes a pure representation route with SIDs. Previous work PLUM used RQ-VAE to generate hierarchical semantic IDs for items, but the user side always lacked a corresponding discrete representation. TokenMinds fills this gap: it extends PLUM's encoder-decoder architecture from item retrieval to user modeling, outputting two things — a set of discrete SID user tokens and a dense user embedding. The practical value of the dual-output design lies in compatibility: downstream ranking models can continue to use the dense embedding for feature engineering while also leveraging the discrete tokens for cross-scenario modeling. YouTube previously had to train and maintain separate models for short-form and long-form video, but the shared SID vocabulary allows them to be merged, substantially reducing training and inference costs. TokenMinds is deployed on YouTube's full user traffic (billions of users) and decouples representation generation from downstream scoring via an asynchronous architecture. Unlike RaG's direct invocation of generation models, TokenMinds' contribution is that it gives the user side of the recommendation system a discrete semantic space isomorphic to the item side, providing a cleaner interface for cross-scenario transfer and multi-task sharing.
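To make the idea of a shared discrete space concrete, here is a minimal sketch of residual quantization, the coarse-to-fine encoding step that RQ-VAE-style semantic IDs rely on. It is not TokenMinds' or PLUM's implementation: the codebooks are hand-written toy values (a trained model would learn them), and the function names are hypothetical. The point is only that a user vector and an item vector pushed through the same codebooks land in the same token vocabulary.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def residual_quantize(vec: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Encode vec as one code index per level, coarse to fine."""
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        sid.append(idx)
        # The next level only has to explain what this level missed.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid


# Toy codebooks, assumed for illustration only.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],                # level 1: coarse direction
    [[0.2, 0.0], [0.0, 0.2], [-0.2, 0.0]],   # level 2: refinement
]

user_sid = residual_quantize([0.9, 0.3], CODEBOOKS)
item_sid = residual_quantize([0.1, 1.2], CODEBOOKS)
print(user_sid, item_sid)  # [0, 1] [1, 1]
```

Because both sides share `CODEBOOKS`, the second-level token `1` means the same refinement for the user and the item, which is the property that makes cross-scenario reuse possible.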
Walmart's INSPIRE is an industrial case of LLM distillation for retrieval, but its technical route differs from the two above. Instead of using SIDs, it distills user queries and product titles into structured intent attributes (brand, taste, dietary preference, etc.) using an LLM, then fuses these attribute features into the representation layer of a two-tower model. Online A/B tests show +12.4% ad revenue and +5.8% CTR. RaG and TokenMinds validate the feasibility of SID in video recommendation scenarios; INSPIRE demonstrates that in e-commerce, fine-grained attribute distillation remains an efficient path to improving intent matching.
Taken together, these three papers advance the proposition of "representation unification in recommendation systems" from different angles. RaG aligns recommendation output with video generation, TokenMinds aligns the semantic spaces of users and items, and INSPIRE aligns query intent with item attributes. The common trend: the core demand of generative recommendation is shifting from "generating sequences" to "generating representations that synchronize meaning."
Industrial System Architecture Automation & Attribution Calibration
As model architectures converge, the incremental gains of industrial systems increasingly depend on two directions — automating the architecture evolution process itself, and ensuring measurement accuracy. Tencent's NOVA and TikTok's attribution correction address these two problems respectively.
NOVA (Tencent) tackles the "silent failure" problem during recommendation model architecture upgrades. Industrial recommendation systems constantly need to turn research prototypes (e.g., RankMixer, TokenMixer-Large, MixFormer) into production code, but this process depends heavily on expert experience. AutoML only tunes hyperparameters, and although LLM coding agents can generate runnable code, that code is not guaranteed to be an effective recommendation architecture: a candidate that passes local tests can silently degrade to zero gain online. NOVA's core technical innovation is the "architecture gradient," an SGD-inspired, non-differentiable update signal that aggregates prior modification records, verification diagnostics, metric feedback, and trajectory memory to guide the next architecture modification. On top of that, NOVA builds a four-level verification cascade (L1–L4): structural semantic checking (L1), local executability (L2), offline effectiveness (L3), and online impact (L4). Invalid candidates are intercepted early, and their failure modes are recorded as "forbidden directions." High-risk L4 tasks are automatically routed to a Copilot for human review. Deployed in Tencent's advertising system, NOVA achieves a 54.5% effective pass rate on L2 ScaleUp tasks and 60.0% on L3 Literature-to-Production tasks. L3 candidates that pass L1–L4 verification go online, improving GMV by +1.25%, +1.70%, and +2.02% on three pCVR targets while reducing pCVR bias by 37.3%–66.7%. Human time from paper to production per cycle shrinks by more than 13×. NOVA can be seen as an industrial instantiation of the LLM agent self-evolution concept proposed in Self-Evolving Recommendation System, with one critical addition: a verification layer that keeps autonomously generated architectures from producing negative returns.
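The cascade logic described above (cheap checks first, stop at the first failure, remember what failed) can be sketched in a few lines. This is an illustration under stated assumptions, not NOVA's code: the class name, the candidate fields, and the four lambda checks are all hypothetical stand-ins, and real L3/L4 checks would launch training jobs and online experiments rather than read a precomputed number.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Dict[str, object]
Check = Callable[[Candidate], bool]


@dataclass
class VerificationCascade:
    """Run checks in order of cost and stop at the first failure."""
    levels: List[Tuple[str, Check]]
    forbidden: List[Tuple[str, str]] = field(default_factory=list)

    def verify(self, cand: Candidate) -> Optional[str]:
        """Return the first failed level, or None if every level passes."""
        for name, check in self.levels:
            if not check(cand):
                # Record the failure so later proposals can avoid this direction.
                self.forbidden.append((str(cand["change"]), name))
                return name
        return None


# Hypothetical checks keyed on precomputed fields of the candidate.
cascade = VerificationCascade(levels=[
    ("L1_structure", lambda c: bool(c["shapes_consistent"])),
    ("L2_executable", lambda c: bool(c["runs_locally"])),
    ("L3_offline", lambda c: float(c["offline_auc_delta"]) > 0.0),
    ("L4_online", lambda c: float(c["online_gmv_delta"]) > 0.0),
])

good = {"change": "add token-mixing block", "shapes_consistent": True,
        "runs_locally": True, "offline_auc_delta": 0.002, "online_gmv_delta": 0.012}
silent_failure = {"change": "widen MLP 4x", "shapes_consistent": True,
                  "runs_locally": True, "offline_auc_delta": 0.0, "online_gmv_delta": 0.0}

print(cascade.verify(good))            # None
print(cascade.verify(silent_failure))  # L3_offline
print(cascade.forbidden)               # [('widen MLP 4x', 'L3_offline')]
```

The `silent_failure` candidate is the case the paragraph warns about: it is structurally valid and runs locally, so only the offline effectiveness level catches it before any online traffic is spent.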
Attribution Correction Framework (TikTok) addresses an equally classic problem — the deviation between ad attribution data and true incrementality. Among the daily new users (DNU) acquired through TikTok's paid channels, some would have arrived anyway through brand search or organic channels even without advertising. These are counted as incremental but are actually cannibalization. Direct causal experiments (incrementality tests) exist, but they are sparse and cannot cover every channel and business level every day. The paper's approach: use incrementality experiments as causal anchors to convert sparse lift measurements into daily corrected estimates; then, under structural consistency constraints, allocate the corrected cannibalization to each business level. Offline forward validation shows the method substantially reduces calibration error. After deployment across multiple TikTok markets globally, the adjusted budget and traffic delivery strategy reduced measured cannibalization by roughly 15 percentage points. This method differs from traditional Shapley value or multi-touch attribution — it doesn't try to attribute "which channel caused the conversion," but answers "which channel drove true incrementality." The answer to that question directly determines the correctness of budget allocation decisions.
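The two steps above (anchor on experiments, then allocate under a consistency constraint) can be illustrated with a deliberately simple sketch. The paper's actual estimator is not described here, so everything below is an assumption: a single pooled incrementality ratio stands in for the causal anchor, proportional allocation stands in for the structural consistency constraint, and all function names and numbers are made up.

```python
from typing import Dict, List


def incrementality_ratio(experiments: List[Dict[str, float]]) -> float:
    """Pooled share of attributed users that experiments found to be truly incremental."""
    attributed = sum(e["attributed"] for e in experiments)
    incremental = sum(e["incremental"] for e in experiments)
    return incremental / attributed


def corrected_daily(attributed_by_day: Dict[str, float], ratio: float) -> Dict[str, float]:
    """Turn sparse experiment evidence into a corrected estimate for every day."""
    return {day: n * ratio for day, n in attributed_by_day.items()}


def allocate_cannibalization(total: float, attributed_by_level: Dict[str, float]) -> Dict[str, float]:
    """Split cannibalization across business levels so the parts sum to the total."""
    denom = sum(attributed_by_level.values())
    return {lvl: total * n / denom for lvl, n in attributed_by_level.items()}


# Made-up numbers: two incrementality tests and two days of production attribution.
experiments = [{"attributed": 1000.0, "incremental": 800.0},
               {"attributed": 500.0, "incremental": 400.0}]
ratio = incrementality_ratio(experiments)

daily_attributed = {"mon": 200.0, "tue": 300.0}
daily_incremental = corrected_daily(daily_attributed, ratio)

tue_cannibalized = daily_attributed["tue"] - daily_incremental["tue"]
by_level = allocate_cannibalization(tue_cannibalized, {"search": 100.0, "social": 200.0})
print(round(ratio, 3), {k: round(v, 2) for k, v in by_level.items()})
```

Whatever allocation rule is used, the invariant worth testing is the one shown here: per-level cannibalization must add back up to the corrected total for the parent level.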
UniFormer (Kuaishou) addresses the direction of scaling at the model architecture level. Previous work like HyFormer and OneTrans attempted cross-module joint scaling but remained confined to the feature space. UniFormer decomposes the overall modeling space into a feature space and a task space, modeled respectively by a stacked feature-space interaction module and a task-space interaction module. To improve inference efficiency, UniFormer introduces semantic tokenization: once the user's historical behavior is tokenized, its computation is decoupled from the current request's item token (request-level inference acceleration). To prevent preference collapse, it uses multi-sequence cross-attention to capture heterogeneous behavior patterns separately, then strengthens their interaction through self-attention. In online experiments on Kuaishou and Kuaishou Lite, watch time increased by +0.729% and +1.113% respectively. UniFormer's value lies in providing a clear scaling methodology: rather than "scale everything up," it channels parameter expansion for different modeling objectives into two orthogonal spaces.
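The request-level acceleration mentioned above comes down to one pattern: compute everything that depends only on the user's history once per request, then reuse it for every candidate item. The sketch below shows that pattern only. It does not reproduce UniFormer's modules; mean pooling and a dot product are stand-ins for the expensive history encoder and the per-candidate interaction, and the class name is hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


class RequestScorer:
    """Encode the user history once per request, then score each candidate cheaply."""

    def __init__(self) -> None:
        self.encoder_calls = 0  # counts the expensive, candidate-independent step

    def encode_history(self, history_tokens: Sequence[Vector]) -> List[float]:
        """Stand-in for the heavy history encoder: mean-pool the token vectors."""
        self.encoder_calls += 1
        dim = len(history_tokens[0])
        n = len(history_tokens)
        return [sum(tok[d] for tok in history_tokens) / n for d in range(dim)]

    def score_request(self, history_tokens: Sequence[Vector],
                      candidates: Sequence[Vector]) -> List[float]:
        user_state = self.encode_history(history_tokens)  # once per request
        # Only this cheap step runs per candidate item.
        return [sum(u * c for u, c in zip(user_state, cand)) for cand in candidates]


scorer = RequestScorer()
history = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
scores = scorer.score_request(history, candidates)
print(scores, scorer.encoder_calls)  # [1.0, 2.0, 2.0] 1
```

With a naive design the encoder would run once per candidate; here `encoder_calls` stays at 1 no matter how many candidates the request carries.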
LLM-Annotated Data: An Industrial Path to Replace Manual and Click Signals
Three papers this week, from different e-commerce/search platforms, systematically answer the same question: how to use LLMs to generate high-quality training/evaluation data, replacing costly manual annotation and biased click signals. The deployment scenarios are Walmart's sponsored search, Capital One's financial services, and Walmart's search evaluation.
Scaling Dense Retrieval with LLM-Annotated Training Data (Walmart) is the most complete case. The starting point is an intuitive insight: heterogeneous retrieval systems have substantial disagreement on retrieved items, and this disagreement itself serves as a natural labeling signal. Specifically, three production-grade retrieval systems (semantic, lexical, hybrid) produce — from their common candidate results — "easy positives" that all systems agree on, "hard positives" that only the lexical system finds, and "hard negatives" that just one system was fooled by. After extracting these heterogeneous signals as structured training material, Walmart further uses a three-model cascade (184M cross-encoder → 2B LLM → 8B LLM) for graded relevance annotation, achieving 89.1% agreement with human annotators. The training phase uses a three-stage progressive curriculum — BCE → MNR → Triplet — organizing 240M+ training samples into 5 difficulty levels. The final deployed two-tower BERT model achieves +5.1% NDCG@10 over the click training baseline, with the most notable gains on long-tail queries; "awkward retrieval" (rating 0) drops from 8.7% to 3.5%. A 14-day online A/B test shows +2.80% higher ad spend, +1.4% CTR, +2.8% eCPM, and +2.9% CVR.
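The disagreement-mining step can be sketched with plain set logic. The exact rules Walmart uses are not given here, so this is one plausible reading: items found by all three systems become easy positives, items found only by the lexical system and judged relevant become hard positives, and items found by exactly one system and judged irrelevant become hard negatives. The `is_relevant` callback is a hypothetical stand-in for the graded annotation cascade.

```python
from typing import Callable, Dict, Iterable


def mine_labels(semantic: Iterable[str], lexical: Iterable[str], hybrid: Iterable[str],
                is_relevant: Callable[[str], bool]) -> Dict[str, str]:
    """Bucket retrieved items by which systems found them, then by relevance."""
    systems = {"semantic": set(semantic), "lexical": set(lexical), "hybrid": set(hybrid)}
    labels: Dict[str, str] = {}
    for item in set().union(*systems.values()):
        found_by = sorted(name for name, hits in systems.items() if item in hits)
        if len(found_by) == 3:
            labels[item] = "easy_positive"      # every system agrees
        elif found_by == ["lexical"] and is_relevant(item):
            labels[item] = "hard_positive"      # only lexical matching surfaced it
        elif len(found_by) == 1 and not is_relevant(item):
            labels[item] = "hard_negative"      # a single system was fooled
        # Items found by exactly two systems stay unlabeled in this sketch.
    return labels


# Toy retrieval results for one query; item IDs and relevance are made up.
relevant_items = {"a", "c"}
labels = mine_labels(
    semantic=["a", "b"],
    lexical=["a", "c"],
    hybrid=["a", "b", "d"],
    is_relevant=lambda item: item in relevant_items,
)
print(sorted(labels.items()))
# [('a', 'easy_positive'), ('c', 'hard_positive'), ('d', 'hard_negative')]
```

The useful property is that the labels come from system behavior plus a relevance judgment, not from clicks, so position and exposure bias never enter the training signal.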
AutoRelAnnotator (Walmart) focuses on efficiency optimization for search relevance annotation. Its core finding: accuracy and cost can be orthogonally optimized — domain fine-tuning contributes +20 accuracy points, a cascade model (small first, large second) halves computation while maintaining accuracy, and per-class isotonic calibration provides an additional +0.6 points. In Walmart's production system, AutoRelAnnotator has processed over 150M annotations, accelerating experimentation cycles. This work and the previous one form Walmart's dual-line layout on annotation data this week — one paper addresses training data, the other addresses evaluation data.
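The "small first, large second" cascade is easy to show in miniature: accept the small model's label when it is confident, and escalate only the uncertain pairs. The sketch below assumes a single confidence threshold and uses lookup tables as stand-ins for the two models; the threshold value, names, and data are illustrative and not taken from the paper. Calibration is left out.

```python
from typing import Callable, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def cascade_annotate(pairs: Sequence[str],
                     small_model: Callable[[str], Prediction],
                     large_model: Callable[[str], Prediction],
                     threshold: float = 0.9) -> Tuple[List[str], int]:
    """Label every pair; call the large model only when the small one is unsure."""
    labels: List[str] = []
    large_calls = 0
    for pair in pairs:
        label, confidence = small_model(pair)
        if confidence < threshold:
            label, _ = large_model(pair)  # escalate the uncertain case
            large_calls += 1
        labels.append(label)
    return labels, large_calls


# Toy stand-ins for the two models, keyed by "query|product".
SMALL = {"q1|p1": ("relevant", 0.97), "q1|p2": ("irrelevant", 0.55), "q2|p3": ("irrelevant", 0.93)}
LARGE = {"q1|p2": ("relevant", 0.88)}


def small_model(pair: str) -> Prediction:
    return SMALL[pair]


def large_model(pair: str) -> Prediction:
    return LARGE[pair]


labels, large_calls = cascade_annotate(["q1|p1", "q1|p2", "q2|p3"], small_model, large_model)
print(labels, large_calls)  # ['relevant', 'relevant', 'irrelevant'] 1
```

Raising `threshold` trades compute for accuracy: more pairs reach the large model, which is why a well-calibrated confidence score matters for choosing the cut-off.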
Cross-Platform Session Embeddings (Capital One) demonstrates the application of LLM distillation in cross-platform user modeling. The financial services scenario faces a unique challenge: users browse products on the web before logging in, then manage accounts in the app after logging in, and behavior differs sharply between the two settings. This work uses a self-supervised Transformer to compress raw clickstreams into compact session embeddings, while an LLM distillation pipeline generates interpretable intent labels (e.g., "comparing credit card annual fees"). In online tests, session embeddings improve Recall@1 by 1.88% and reduce Log Loss by 13.38% on the homepage layout ranking task; on conversion prediction, the distilled intent labels reach an F1 only 7% below the LLM's, at near-zero latency.
Taken together, LLM-annotated data is moving from "whether it exists" to "how good it is." Walmart's methodology is particularly noteworthy — it provides three reusable principles: (1) use disagreement among heterogeneous systems to generate unbiased training signals, (2) use curriculum learning to organize difficulty levels, and (3) use cascading + calibration to control costs. These principles can be transferred to most industrial search and recommendation systems.
Directions to Watch
Integration of generative recommendation and multimodal generative models. RaG validated the economic feasibility of "Recommendation as Generation" on a 400M DAU platform, with +1.87% ad revenue. This number may seem small next to some offline gains reported for end-to-end generative recommendation (e.g., double-digit Recall improvements), but it hints at a change in the recommendation system's business model: when a recommendation system starts "creating" content instead of "selecting" it, the supply constraint for recommended ads shifts from "limited inventory" to "limited computation." YouTube's TokenMinds, while not directly connected to video generation, lays a foundation for similar integration through its SID user tokens. Kuaishou's joint optimization of video quality, user feedback, and interest alignment in cross-domain cooperative reward learning is also a practical multi-objective problem that generative recommendation must solve. This direction deserves close attention, especially RaG's separation of the "creative style SID" from the "content semantic SID," which could be the key to controllable generation.
Unified SID for user and item representation. TokenMinds extends SID from the item side to the user side; Kuaishou's Gryphon is simultaneously pursuing a similar direction. This means recommendation systems have the opportunity to handle user and item representations the way LLMs handle text tokens — unified, discrete, transferable. Unified modeling across scenarios (short-form video / long-form video / live streaming) can substantially save training and inference resources. However, the quality of SID representations heavily depends on RQ-VAE encoding quality and the definition of the semantic space. Two subsequent concerns: how to ensure the fidelity of item reconstruction from SID, and how to maintain semantic consistency of SID across scenarios.
Automation of industrial architecture evolution. NOVA's 13× speedup and 60% L3 pass rate in Tencent's advertising system provide the first quantifiable baseline for "architecture evolution automation." Its "architecture gradient + verification cascade" framework can be seen as an AutoML tailored for recommendation systems, but it differs from traditional NAS: NOVA operates on production-level code modifications rather than searching a parameter space of model structures, which is closer to an engineer's real workflow. The AutoML field has long lacked deployment cases in recommendation scenarios; NOVA demonstrates the feasibility of this direction in an industrial setting. Extending NOVA's paradigm to the retrieval or pre-ranking stage is a viable direction for follow-up research.
Paper Roundup
Generative Recommendation & User Representation
RaG (Kuaishou) — Proposes the Recommendation-as-Generation paradigm, unifying generative recommendation with video generation through shared semantic IDs. Deployed on a 400M DAU platform, ad revenue +1.87%.
TokenMinds (YouTube) — Extends Semantic ID from item retrieval to user modeling, generating both discrete user tokens and dense embeddings. Deployed on full YouTube user traffic, validating that cross-scenario unified models reduce training costs.
INSPIRE (Walmart) — Distills structured intent attributes (brand, taste, dietary preference) via LLM and fuses them into a two-tower retrieval model. Online ad revenue +12.4%, CTR +5.8%.
Industrial Architecture & Attribution Optimization
NOVA (Tencent) — A verification-aware agent framework that automates recommendation model architecture evolution via architecture gradients and a four-level verification cascade. 60.0% effective pass rate on L3 Literature-to-Production tasks, online GMV +1.25%~+2.02%, human time reduced 13×.
Attribution Correction Framework (TikTok) — Uses incrementality experiments as causal anchors to correct daily attribution data, allocating cannibalization to business levels under structural consistency constraints. After deployment, cannibalization rate reduced by ~15 percentage points.
UniFormer (Kuaishou) — A unified model-centric scaling framework that decomposes the modeling space into feature and task spaces and introduces semantic tokenization for inference acceleration. Watch time improvements: Kuaishou +0.729%, Kuaishou Lite +1.113%.
LLM-Annotated Data & Methods
Scaling Dense Retrieval with LLM-Annotated Data (Walmart) — Combines multi-channel retrieval mining + LLM cascade annotation (89.1% human agreement) + three-stage curriculum training, replacing click training baseline. Online CTR +1.4%, CVR +2.9%.
AutoRelAnnotator (Walmart) — Calibrated model cascade for cost-efficiency: fine-tuning brings +20 accuracy points, cascade halves computation, per-class isotonic calibration adds +0.6 points. Processed 150M+ annotations.
Cross-Platform Session Embeddings (Capital One) — Self-supervised Transformer + LLM distillation dual output: session embedding improves homepage ranking Recall@1 by 1.88%; intent labels achieve F1 only 7% lower than the LLM on conversion prediction with near-zero latency.
GPU-Accelerated Retrieval & Indexing
TileMaxSim — IO-aware Triton MaxSim kernel using multi-query SRAM tiling, dimension tiling, and fused PQ scoring. Achieves 80.2% peak bandwidth on H100, 71.6M docs/sec, 220× faster than loop baseline.
GPUSparse — GPU-accelerated exact sparse retrieval with parallel inverted index and fused Triton kernels. 1.27ms/query on MS MARCO, 235× faster than Pyserini CPU, no recall loss.
IRENE (Microsoft) — Meta-classification framework that synthesizes zero-shot item classifiers on the fly via a meta-classifier. CTR +4.2%, Recall@10 +15% on Bing Ads retrieval tasks.
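For readers unfamiliar with the MaxSim operation that the TileMaxSim entry accelerates, here is a reference-level sketch of the scoring rule itself: each query token takes its best-matching document token, and the per-token maxima are summed. This is the plain definition in pure Python, with none of the tiling, fusion, or product quantization the kernel adds; the vectors are toy values.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.5, 0.5], [1.0, 0.0]]
doc_b = [[0.0, 0.25]]

print(maxsim(query, doc_a))  # 1.5
print(maxsim(query, doc_b))  # 0.25
```

The nested loop touches every query-token and document-token pair, which is why memory traffic dominates at scale and why an IO-aware kernel pays off.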
Other
EMA-FS (PayPal) — EMA gain-aware feature selection limiting histogram construction to top-K high-gain features. 1.45× speedup on IEEE-CIS fraud data (30% retention rate), ~120 lines of C++ code.
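The selection rule in the EMA-FS entry can be sketched as follows: keep an exponential moving average of each feature's split gain across boosting rounds, and build histograms only for the current top-K features. This is a Python illustration of the idea, not the paper's C++ code; the smoothing factor, function names, and gain values are assumptions.

```python
from typing import Dict, List


def update_ema(ema: Dict[str, float], gains: Dict[str, float], alpha: float = 0.5) -> Dict[str, float]:
    """Blend this round's per-feature gains into the running average."""
    features = set(ema) | set(gains)
    return {f: (1 - alpha) * ema.get(f, 0.0) + alpha * gains.get(f, 0.0) for f in features}


def top_k_features(ema: Dict[str, float], k: int) -> List[str]:
    """Features whose histograms would be built next round (ties broken by name)."""
    return sorted(ema, key=lambda f: (-ema[f], f))[:k]


ema: Dict[str, float] = {}
ema = update_ema(ema, {"f0": 4.0, "f1": 2.0, "f2": 0.0})  # round 1 gains
ema = update_ema(ema, {"f0": 0.0, "f1": 6.0, "f2": 4.0})  # round 2 gains
selected = top_k_features(ema, k=2)
print(selected)  # ['f1', 'f2']
```

Using a moving average rather than the latest round's gain keeps a feature such as `f0` from being dropped or promoted on the basis of a single noisy round.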