2024 Netflix Workshop on Personalization, Recommendation and Search (PRS)

Friday, May 31 | 8:30 AM – 6:00 PM PDT

Event Details

The eighth Netflix workshop on Personalization, Recommendation and Search (PRS) aims to bring together practitioners and researchers working in these domains to share ideas, information and approaches, and to build bridges between these communities.

Please register in advance using the RSVP button above. Registrations will close when we reach capacity (which we have in prior years) or by Friday, May 24th. So if you're interested, don't delay.

If you are interested in presenting a poster during the workshop, please fill out this form before Friday, May 10th. Authors of accepted posters will be notified by Friday, May 17th.

The event will be in-person only, at our beautiful Netflix campus in Los Gatos, CA.

This @NetflixResearch workshop is organized by:

Justin Basilico - jbasilico[at]netflix.com

Grace Huang - ghuang[at]netflix.com

Sudarshan Lamkhede - slamkhede[at]netflix.com

Kriti Kohli - kritik[at]netflix.com

Aish Fenton - afenton[at]netflix.com

Nathan Kallus - nkallus[at]netflix.com

Linas Baltrunas - lbaltrunas[at]netflix.com

For questions, contact prs-organizers[at]netflix.com

Previous PRS workshops: 2023, 2022, 2021, 2019, 2018, 2017, 2016.

Agenda

8:30 AM PDT

Registration Opens

Breakfast & Coffee

9:20 AM PDT

Welcome & Opening remarks

Workshop Organizers (Netflix)

9:30 AM PDT

LLMs as Agents [recording]

Alane Suhr (UC Berkeley)

10:00 AM PDT

Applying Language Models to Recommendation Experiences: Challenges and Lessons [slides, recording]

Eugene Yan (Amazon)

10:30 AM PDT

Break

Registration Closes

11:00 AM PDT

Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

Jiaqi Zhai (Meta)

11:30 AM PDT

Conversational Recommender Systems [slides, recording]

Harald Steck (Netflix)

12:00 PM PDT

Lunch

12:30 PM PDT

Poster Session

1:30 PM PDT

Long-Term Value of Exploration: Measurements, Findings and Algorithms

Yi Su (Google DeepMind)

2:00 PM PDT

Beyond the binge: recommending for long-term member satisfaction at Netflix [slides, recording]

Jiangwei Pan (Netflix)

2:30 PM PDT

Break

3:00 PM PDT

Toward Practical Robustness in AI

Alex Beutel (OpenAI)

3:30 PM PDT

Building Airbnb Categories with ML and Human in the loop [recording]

Mihajlo Grbovic (Airbnb)

4:00 PM PDT

Personalization at Spotify [slides, recording]

Maria Dimakopoulou (Spotify)

4:30 PM PDT

Closing & Happy hour

6:00 PM PDT

End of event

Speakers

Alex Beutel

OpenAI

Alex Beutel is the technical lead for model safety at OpenAI. Prior to joining OpenAI, he was a senior staff research scientist, tech lead, and manager at Google Research, co-leading a Responsible ML team and driving research spanning recommender systems, fairness, robustness, reinforcement learning, and machine learning for databases, together resulting in numerous papers and >50 launches across multiple products. He has a PhD in Computer Science from Carnegie Mellon University, where his thesis on user behavior modeling, fraud detection, and recommender systems received KDD’s Doctoral Dissertation Award Runner-Up. He also received the Best Paper Award at ACM KDD and ACM GIS.

Alane Suhr

UC Berkeley

Alane Suhr recently joined EECS and BAIR at UC Berkeley as an Assistant Professor. Alane's work focuses on building language-using systems that communicate with and learn from human users in collaborative, situated interactions. Prior to joining Berkeley, Alane completed a PhD in Computer Science at Cornell University / Cornell Tech and spent a year afterwards as a Young Investigator at the Allen Institute for AI.

Eugene Yan

Amazon

Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He's currently a Senior Applied Scientist at Amazon. Previously, he led machine learning at Lazada (acquired by Alibaba) and a Healthtech Series A Startup. He writes & speaks about ML, RecSys, LLMs, and engineering at eugeneyan.com and ApplyingML.com.

Harald Steck

Netflix

Harald Steck is a research scientist at Netflix, working on recommender systems, search algorithms and related topics. Prior to that he conducted research in machine learning at Bell Labs, Siemens, ETH Zurich and MIT, after obtaining his PhD from the Technical University of Munich.

Jiangwei Pan

Netflix

Jiangwei Pan is a research scientist at Netflix, working on recommendation algorithms for the Netflix homepage. Prior to that, he was a research scientist at Facebook and worked on news feed ranking problems. He obtained his PhD in theoretical computer science from Duke University.

Jiaqi Zhai

Meta

Jiaqi Zhai is a Distinguished Engineer at Meta. He leads efforts to improve recommendation systems across Facebook and Instagram, with a mission to connect billions of people to informative, entertaining, and insightful content. His team developed multiple state-of-the-art foundational technologies, including the first trillion-parameter scale generative recommenders used in production. Prior to Meta, he spent 6 years at Google and developed the cross-platform user understanding system used in Search, Chrome, and YouTube, Google's first billion-user scale online learning system with minute-level latency, and the first generative model deployed on Google Search. His work has been published in top conferences including KDD, WWW, and SIGMOD.

Maria Dimakopoulou

Spotify

Maria Dimakopoulou is the Director of Machine Learning & Head of Homepage Personalization at Spotify, leading a team of 60 people responsible for generating, ranking and distributing personalized content recommendations across music, podcasts and audiobooks on the Homepage of 600+ million listeners. Before Spotify, she was at Netflix, researching causal ML for personalization and subsequently building and leading the Adaptive Experimentation working group, leveraging multi-armed bandits and causal inference for experimentation. Before that, she received a PhD in reinforcement learning and causal inference at Stanford, advised by Benjamin Van Roy and Susan Athey. Before that, she worked at Google Research on large-scale optimization algorithms for Technical Infrastructure and Ad Exchange.

Mihajlo Grbovic

Airbnb

Mihajlo Grbovic, Ph.D. is a Machine Learning Scientist at Airbnb with more than 15 years of technical experience in applied Machine Learning, including leading numerous successful projects. At Airbnb, he focuses on Search & Recommendation problems, including building its first Search Autocomplete algorithm, a Machine Learning-powered Search for Airbnb Experiences, and algorithms that power Airbnb Categories for the homepage. Currently, he is creating an AI Travel Concierge. Prior to Airbnb, he worked at Yahoo on integrating Machine Learning in various products, including building Ad Targeting for Tumblr, Email Classification for Yahoo Mail and query-ad matching for Yahoo Search Ads. He holds a PhD in Machine Learning from Temple University.

Yi Su

Google DeepMind

Yi Su is a Research Scientist at Google DeepMind, where she works on machine learning for interactive systems, with a focus on bandits and RL. Prior to Google, she finished her post-doc at UC Berkeley, working on data-driven optimization and offline RL algorithms. She received her PhD in Statistics from Cornell University and was awarded the Bloomberg Data Science Fellowship and selected for Rising Stars in EECS 2020. She has served on the program committee for various ML conferences & workshops, including co-organizing the KDD workshop on Online and Adaptive Recommender Systems.

Talks & Abstracts

Title: Applying Language Models to Recommendation Experiences: Challenges and Lessons

Speaker: Eugene Yan (Amazon)

Abstract:

Advancements in language models have made it possible to provide richer recommendation experiences to users. Nonetheless, applying them comes with its own set of challenges. In this talk, we’ll briefly introduce how language models can help with recommendations, such as extracting metadata, summarizing items, or helping with explainability. We’ll then discuss some challenges of using language models, such as building reliable evals, scaling with low latency and cost, and detecting hallucinations.

Title: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

Speaker: Jiaqi Zhai (Meta)

Abstract:

Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on a huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by the success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework ("Generative Recommenders"), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint needed for future model development, and further paves the way for the first foundational models in recommendations.
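For readers unfamiliar with the NDCG metric cited in the abstract above, the following is a minimal sketch of NDCG@k in Python. The function names and the sample relevance lists are illustrative assumptions and do not come from the talk or the underlying paper. It uses the linear-gain form of DCG, which is only one of several common variants.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear-gain variant)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels, listed in the order a model ranked the items.
best_first = [3, 2, 1, 0]
worst_first = [0, 1, 2, 3]

if __name__ == "__main__":
    print(ndcg_at_k(best_first, 4))
    print(ndcg_at_k(worst_first, 4))
```

A ranking that places the most relevant items first scores 1.0, and any other ordering of the same items scores lower.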

Title: Long-Term Value of Exploration: Measurements, Findings and Algorithms

Speaker: Yi Su (Google DeepMind)

Abstract:

Effective exploration is believed to positively influence the long-term user experience on recommendation platforms. Determining its exact benefits, however, has been challenging. Regular A/B tests on exploration often measure neutral or even negative engagement metrics while failing to capture its long-term benefits. In this talk, we will introduce our study of the value of exploration through the lens of measurements, findings and algorithms. We will introduce new experiment designs to formally quantify the long-term value of exploration by examining its effects on content corpus, and connecting content corpus growth to the long-term user experience from real-world experiments. Having established the value of exploration, we investigate the Neural Linear Bandit algorithm as a general framework to introduce exploration into any deep-learning-based ranking system. We conduct live experiments on one of the largest short-form video recommendation platforms, which serves billions of users, to validate the new experiment designs, quantify the long-term value of exploration, and verify the effectiveness of the adopted Neural Linear Bandit algorithm for exploration.

Title: Beyond the binge: recommending for long-term member satisfaction at Netflix

Speaker: Jiangwei Pan (Netflix)

Abstract:

Many recommender systems are trained on user engagements due to their abundance, immediacy of feedback, and the insights they provide into user preferences. However, engagements may not align with the underlying, often long-term, objective. At Netflix, the objective of our recommender systems is long-term member satisfaction. We highlight a practical approach to meet this objective that augments engagement data with reward signals that align with the desired long-term outcomes. We term this iterative process of identifying, evaluating, and integrating reward signals as reward innovation. In this work, we formalize the problem, describe the approach, share the practical challenges we encountered, and highlight valuable lessons learned.

Title: Building Airbnb Categories with ML and Human in the loop

Speaker: Mihajlo Grbovic (Airbnb)

Abstract:

Online travel search hasn’t changed much in the last 25 years. The traveler enters her destination, dates, and the number of guests into a search interface, which dutifully returns a list of options that best meet the criteria. The biggest shortcoming of these approaches is that the traveler must have a specific destination in mind. Even travelers who are flexible get funneled to a similar set of well-known destinations, reinforcing the cycle of mass tourism. Introducing Airbnb Categories. In our recent release, we flipped the travel search experience on its head by having the inventory dictate the destinations, not the other way around. In this way, we sought to inspire the traveler to book unique stays in places they might not think to search for. By leading with our unique places to stay, grouped together into cohesive “categories”, we inspired our guests to find some incredible places to stay off the beaten path. Though our goal was an intuitive browsing experience, it required considerable work behind the scenes to pull this off. In this talk I will cover the approaches we took to come up with categories, classify our entire inventory into them, set up a human-in-the-loop approach to labeling and self-improving Machine Learning as well as how to display and rank them on the homepage.

Title: Conversational Recommender Systems

Speaker: Harald Steck (Netflix)

Abstract:

Coming soon.

Title: Personalization at Spotify

Speaker: Maria Dimakopoulou (Spotify)

Abstract:

At the scale of 600M+ listeners and a catalog of 100M+ tracks, 4B+ playlists, 5M+ podcast titles and 350K+ audiobooks, personalization is at the heart of what we do at Spotify. In fact, when we ask our listeners what they like most about Spotify, more than 81% cite our personalization. The teams in Spotify’s Personalization Mission innovate in a range of ML disciplines (generative AI, adaptive learning, deep learning, multi-objective optimization, causal inference & more), as well as ML infrastructure, data and backend systems at scale, with the goal of serving recommendations that maximize the listeners’ satisfaction while supporting Spotify’s business strategy. In this talk, we will focus on the science behind some of Spotify’s personalized experiences: how we view Spotify’s Homepage recommendations through a causal contextual bandit lens, how we accelerate creator audience growth in a distributed recommendations setup, how we balance multiple content types (music, podcasts, audiobooks) on Spotify’s Homepage, how we recommend a cold-starting audio format such as audiobooks, and how we optimize our recommendations for the long term.

Title: LLMs as Agents

Speaker: Alane Suhr (UC Berkeley)

Abstract:

The increasing capability of LLMs makes them appealing for adoption in labor-intensive human tasks. For example, significant efforts have recently focused on developing agents -- systems that map observations and instructions to executable actions -- and their benchmarks in real-world tasks like web navigation. In this talk, I will discuss recent work in developing better evaluations for these agents, which in turn can be used to automatically improve agent performance without requiring any demonstration data or human annotation. However, in developing systems like this, and in applying LLMs and other large pre-trained models to real-world problems, we should be aware of their fundamental limitations; for example, their sensitivity to design considerations like prompt formatting. I will detail recent work where we find that LLMs can be incredibly sensitive to arbitrary design decisions, like choices of separators or multiple choice labels.

Title: Toward Practical Robustness in AI

Speaker: Alex Beutel (OpenAI)

Abstract:

Coming soon.

Posters

The following posters will be presented from 12:30 PM to 1:30 PM at the workshop:

Experimenting, Fast and Slow: Bayesian Optimization of Long-term Outcomes with Online Experiments [poster]

Qing Feng, Sam Daulton, Benjamin Letham, Maximilian Balandat, Eytan Bakshy (Meta)

Internet systems commonly have parameters that are tuned via online experiments, also known as online A/B tests. Such experiments have a multitude of applications including optimizing recommender system ranking and retrieval policies, infrastructure, and streaming controllers. Decision-makers generally wish to optimize for long-term treatment effects with respect to these changes, but measuring long-term effects can require running experiments for a significant amount of time because short-term measurements can be misleading due to non-stationarity in treatment effects over time. Thus, experimentation strategies that run a sequence of experiments, each measuring the long-term treatment effect, may be prohibitively time-consuming when optimizing over large decision-spaces. We describe a novel approach that combines short experiments (e.g. biased experiments that only run for a few hours) and/or proxies (e.g. off-policy evaluation) with long-term experiments to perform Bayesian optimization over large action spaces in a short amount of time.

Collaborative Large Language Model for Recommender System [poster]

Yaochen Zhu (Netflix & University of Virginia)

Recently, there has been growing interest in developing the next-generation recommender systems (RSs) based on pretrained large language models (LLMs). However, the semantic gap between natural language and recommendation tasks is still not well addressed, leading to multiple issues such as spuriously correlated user/item descriptors, ineffective language modeling on user/item data, inefficient recommendations via auto-regression, etc. In this paper, we propose CLLM4Rec, the first generative RS that tightly integrates the LLM paradigm and ID paradigm of RSs, aiming to address the above challenges simultaneously. We first extend the vocabulary of pretrained LLMs with user/item ID tokens to faithfully model user/item collaborative and content semantics. Accordingly, a novel soft+hard prompting strategy is proposed to effectively learn user/item collaborative/content token embeddings via language modeling on RS-specific corpora, where each document is split into a prompt consisting of heterogeneous soft (user/item) tokens and hard (vocab) tokens and a main text consisting of homogeneous item tokens or vocab tokens to facilitate stable and effective language modeling. In addition, a novel mutual regularization strategy is introduced to encourage CLLM4Rec to capture recommendation-related information from noisy user/item content. Finally, we propose a novel recommendation-oriented finetuning strategy for CLLM4Rec, where an item prediction head with multinomial likelihood is added to the pretrained CLLM4Rec backbone to predict hold-out items based on soft+hard prompts established from masked user-item interaction history, where recommendations of multiple items can be generated efficiently without hallucination.

Enhancing Roku's Recommendation System with LLM-Generated Explanations

Jin Bao, Jing Xie, Abhishek Bambha (Roku)

Training base large language models (LLMs) involves extensive development cycles and significant data requirements. Once trained, these models can be refined for specific applications in a few days. At Roku, we have augmented our recommendation system with LLM-generated explanations to improve user experience. This integration includes using LLMs for metadata extraction and summarization tasks. To ensure content safety, we employ advanced moderation models and content filtering systems to keep explanations free from explicit language. We also employ a comprehensive list of content metadata to avoid mismatches and have developed a heuristic method to ensure reliability in metadata extraction by using the LLM only for data predating the LLM training data cutoff date; otherwise, we revert to our precise internal data sources. Through ongoing iterative enhancements, we have customized generation processes to align with our business needs. Our system has been rigorously A/B tested and successfully implemented across various business domains.

Planning adaptive experiments: a math programming approach [poster]

Ethan Che (Columbia), Hongseok Namkoong (Columbia), Daniel Jiang (Meta), Jimmy Wang (Columbia)

Adaptive experimentation can improve statistical power significantly, but typical algorithms overlook important issues that arise in practice: multiple objectives, non-stationarity, batched/delayed feedback, constraints, and personalization. Moving away from developing bespoke algorithms for each setting, we present a mathematical programming view of adaptive experimentation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures. By formulating a dynamic program in the batched limit, our modeling framework enables the use of scalable optimization methods (e.g., SGD and auto-differentiation) to solve for treatment allocations. To spur algorithmic progress, we build a suite of benchmark problems based on hundreds of real A/B tests at ASOS that model key practical issues such as non-stationarity, personalization, multi-objectives, and constraints. Our empirical results show standard Thompson sampling-based policies fail to reliably improve upon static designs, and demonstrate the effectiveness of a simple planning approach.
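To make the comparison in the abstract above concrete, the following is a minimal sketch of the two baselines it names: a Thompson sampling policy and a static design. It is not the authors' planning approach. The function names, the Beta-Bernoulli reward model, and the conversion rates are assumptions chosen for illustration, and the simulation is stationary and unbatched, so it does not reproduce the practical issues the poster addresses.

```python
import random


def thompson_allocate(true_rates, rounds, rng):
    """Beta-Bernoulli Thompson sampling; returns the number of pulls per arm."""
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its Beta(1 + s, 1 + f) posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(n_arms)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


def static_allocate(n_arms, rounds):
    """Static design: split the traffic as evenly as possible up front."""
    base, extra = divmod(rounds, n_arms)
    return [base + (1 if a < extra else 0) for a in range(n_arms)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    hypothetical_rates = [0.1, 0.5]  # made-up conversion rates for two treatments
    print("Thompson sampling pulls:", thompson_allocate(hypothetical_rates, 2000, rng))
    print("Static design pulls:", static_allocate(2, 2000))
```

In this easy stationary setting, Thompson sampling shifts most traffic to the better arm, while the static design keeps an even split. The abstract's point is that this advantage is not reliable once non-stationarity, batching, and constraints enter.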

RAGSys: Item-Cold-Start Recommender as RAG System [poster]

Emile Contal, Garrin McGoldrick (Crossing Minds)

Large Language Models (LLM) hold immense promise for real-world applications, but their generic knowledge often falls short of domain-specific needs. Fine-tuning, a common approach, can suffer from catastrophic forgetting and hinder generalizability. In-Context Learning (ICL) offers an alternative, which can leverage Retrieval-Augmented Generation (RAG) to provide LLMs with relevant demonstrations for few-shot learning tasks. This research explores the desired qualities of a demonstration retrieval system for ICL. We argue that ICL retrieval in this context resembles item-cold-start recommender systems, prioritizing discovery and maximizing information gain over strict relevance. We propose a novel evaluation method that measures the LLM's subsequent performance on NLP tasks, eliminating the need for subjective diversity scores. Our findings demonstrate the critical role of diversity and quality bias in retrieved demonstrations for effective ICL, and highlight the potential of recommender system techniques in this domain.

Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits [poster]

Lequn Wang (Netflix), Akshay Krishnamurthy (Microsoft), Aleksandrs Slivkins (Microsoft)

We consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first general oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.

Measuring Local Accuracies to Assess Evaluation Metrics [poster]

Athiya Deviyani, Fernando Diaz (CMU)

The design and evaluation of automated evaluation metrics prove to be challenging yet crucial to appropriately benchmark decision-making systems. In this work, we explore the effectiveness of measuring local metric accuracies to compare metrics by evaluating how well they can differentiate between systems within the same subsets and how their capabilities vary as the system outputs shift when the subset changes. We show that a metric's absolute and relative local accuracy changes in different subsets through various metrics spanning three tasks: machine translation, automated speech recognition, and ranking for recommender systems. Following our results, we hope to show that measuring local accuracies provides a different perspective and a valuable angle from which to evaluate existing evaluation metrics.
