Bronze Medal Solution on Kaggle OTTO — Multi-Objective Recommender System

Aji Samudra
6 min read · Feb 8, 2023

Top 6% solution out of 2587 teams competing

The Competition

Link to competition page: https://www.kaggle.com/competitions/otto-recommender-system/overview

143rd out of 2587 teams

Solution GitHub repo: https://github.com/ajisamudra/kaggle-otto-multi-objective-recsys

The Task

Clipped from competition overview

Online shoppers have their pick of millions of products from large retailers. While such variety may be impressive, having so many options to explore can be overwhelming, resulting in shoppers leaving with empty carts. This neither benefits shoppers seeking to make a purchase nor retailers that missed out on sales. This is one reason online retailers rely on recommender systems to guide shoppers to products that best match their interests and motivations. Using data science to enhance retailers’ ability to predict which products each customer actually wants to see, add to their cart, and order at any given moment of their visit in real-time could improve your customer experience the next time you shop online with your favorite retailer.

Hence, the task is

“Given a sequence of customer activities in real-time, predict the top 20 items each customer wants to click, add-to-cart, and order.”

Illustration from OTTO about the model we want to build.

ref: https://github.com/otto-de/recsys-dataset/blob/main/.readme/ground_truth.png

Evaluation metric
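The metric is Recall@20, computed separately for clicks, carts, and orders and then combined with a weighted average that puts most of the weight on orders:

score = 0.10 · R_clicks + 0.30 · R_carts + 0.60 · R_orders

where each R is the Recall@20 of the corresponding event type.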

The Data

The data consists of 4 columns, i.e. session, aid, ts, and type. It looks simple, but we can approach it with various algorithms. OTTO made the dataset publicly available here https://github.com/otto-de/recsys-dataset.

session - the unique session id
events - the time ordered sequence of events in the session
aid - the article id (product code) of the associated event
ts - the Unix timestamp of the event
type - the event type, i.e., whether a product was clicked, added to the user's cart, or ordered during the session

Given this dataset format, it’s similar to the H&M competition, where the data is not ready to train on: we need to design a data pipeline and preprocess the data before we can call model.fit().

Dataset Statistics

Looking at the data in more detail, the train set contains 12.89m sessions and 1.86m items. A session is equivalent to one unique user in the given observation period. There is a total of 216m events, consisting of 194m click events, 16.9m cart events, and only 5m order events. That’s a huge dataset with a small proportion of order events.

What about statistics at the session level? The distribution of the number of events per session is as follows:

number of events per session
mean: 16.80
std: 33.58
min: 2
50%: 6
75%: 15
90%: 39
95%: 68
max: 500

That’s a wide range of values. The model should be able to recommend relevant items whether a session has as few as 2 events or as many as 500.

The Solution

The solution was a two-stage model consisting of (1) Candidate Retrieval and (2) Ranking.

Candidate Retrieval

Candidate Retrieval focuses more on recall: it aims to filter 1.86m items down to a few hundred that are relevant for a session.

I used various strategies for the retrieval stage:

1. All past items in the session
2. Click item-covisitation
3. Buys item-covisitation
4. Buy2buy item-covisitation
5. Popular items in the week
6. Nearest neighbors item in Word2Vec embedding
7. Nearest neighbors item in Matrix Factorization embedding

Cumulative recall over different strategies

With these strategies, the weighted-average recall@150 is 0.6339, with per-event recall as follows: click recall@150 = 0.62512, cart recall@150 = 0.50515, and order recall@150 = 0.6998.
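To illustrate the covisitation idea, here is a minimal polars sketch of a click item-covisitation matrix: count pairs of items that co-occur in the same session within a short time window and keep the most frequent pairs per item. The time window, the top-20 cut-off, and the toy data are assumptions for illustration; in practice the events are processed in chunks, since a full self-join on 216m events would not fit in memory.

```python
import polars as pl

# Toy events frame with the dataset's fields; in practice this is the full 216m-row log.
events = pl.DataFrame({
    "session": [1, 1, 1, 2, 2, 2],
    "aid":     [10, 11, 12, 10, 12, 13],
    "ts":      [0, 30, 90, 0, 40, 70],
})

# Pair up distinct items that co-occur in the same session within a time window
# (here 1 hour, assuming ts is in seconds), and count each pair's frequency.
pairs = (
    events.join(events, on="session", suffix="_right")
    .filter(
        (pl.col("aid") != pl.col("aid_right"))
        & ((pl.col("ts_right") - pl.col("ts")).abs() <= 60 * 60)
    )
    .group_by(["aid", "aid_right"])
    .agg(pl.len().alias("weight"))
)

# For each item, keep its 20 most frequently co-visited items as retrieval candidates.
top_covisited = (
    pairs.sort(["aid", "weight"], descending=[False, True])
    .group_by("aid", maintain_order=True)
    .head(20)
)
print(top_covisited)
```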

Ranking

Ranking focuses more on precision: the model aims to order the hundreds of candidates correctly and select only the top 20 items.

I have one model for each future event, i.e. click, cart, and order. For each event, I trained CatBoostRanker, CatBoostClassifier, and LGBMClassifier. Based on the experiments, CatBoostRanker had a better CV score than the others. I use the scores from these 3 different algorithms to make the ensemble score.

two-stage recommender system
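As a rough sketch of how one of these rankers can be trained, here is a CatBoostRanker fit on toy grouped data; the features, hyperparameters, and loss function below are illustrative assumptions, not the repo’s exact configuration.

```python
import numpy as np
from catboost import CatBoostRanker, Pool

# Toy data: 100 sessions x 10 candidates each, 5 features, binary relevance labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
group_id = np.repeat(np.arange(100), 10)  # same id for every candidate of one session

# Candidates are grouped by session so the ranker learns to order items
# within each session rather than across the whole dataset.
train_pool = Pool(data=X, label=y, group_id=group_id)

ranker = CatBoostRanker(
    loss_function="YetiRank",  # a pairwise ranking objective
    iterations=100,
    learning_rate=0.05,
    verbose=False,
)
ranker.fit(train_pool)

scores = ranker.predict(X)  # higher score = candidate ranked earlier
```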

Negative Sampling
In the context of ranking models, the training data must contain both positive and negative samples. Positive samples are actual events from the session; we label them 1 (clicked/added-to-cart/ordered). Negative samples are labeled 0 (retrieved in stage 1, but not actually clicked/added-to-cart/ordered). Why do we need negative samples? Because we’ll use the ranker not only to sort the items a session did click/add-to-cart/order but also the items it did not. I randomly sampled the negative candidates so that the positive:negative ratio is 1:10. I experimented with different ratios (1:10, 1:15, 1:20), but there were no significant differences in CV score, so I decided to use 1:10 because it’s the smallest and saves some memory.
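A minimal polars sketch of the 1:10 downsampling, assuming a candidates table with a binary label column; the column names, toy data, and global (rather than per-session) sampling are assumptions for illustration.

```python
import polars as pl

def downsample_negatives(candidates: pl.DataFrame, ratio: int = 10, seed: int = 42) -> pl.DataFrame:
    """Keep all positive candidates and roughly `ratio` negatives per positive."""
    positives = candidates.filter(pl.col("label") == 1)
    negatives = candidates.filter(pl.col("label") == 0)
    n_keep = min(len(negatives), len(positives) * ratio)
    sampled_negatives = negatives.sample(n=n_keep, seed=seed)
    # Shuffle so positives and negatives are interleaved in the training table.
    return pl.concat([positives, sampled_negatives]).sample(fraction=1.0, shuffle=True, seed=seed)

# Toy candidates table: one row per (session, candidate aid) with a binary label.
candidates = pl.DataFrame({
    "session": [1] * 30,
    "aid": list(range(30)),
    "label": [1, 1] + [0] * 28,
})
train_rows = downsample_negatives(candidates, ratio=10)  # 2 positives + 20 negatives
```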

Features
There are 4 types of features (a small polars sketch of the session features follows the list):
1. Session features. groupby("session").agg(). Examples: how long the session lasts in minutes, how many unique item_ids appear in the session, how many clicks/carts/orders it has.

2. Item features. groupby("aid").agg(). Examples: how many clicks/carts/orders in the week, the click-to-order conversion rate, avg/std number of clicks/carts/orders in the past week (to capture anomalous trends).

3. Session-Item features. groupby(["session", "aid"]).agg(). Examples: how many clicks/carts/orders, time elapsed since the last event, recency score.

4. Similarity features (cosine similarity, Euclidean distance) between the last item in the session and the candidate item. Example: cosine similarity in the Word2Vec/Matrix Factorization embedding space.
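As a small sketch of the first feature type, here is how session features could be computed with polars; the toy frame and feature names are illustrative, and the repo’s real feature set is much richer.

```python
import polars as pl

# Toy events frame; `type` is assumed to use the raw string labels of the dataset.
df = pl.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 11, 10, 12, 13],
    "ts":      [0, 60, 600, 0, 30],
    "type":    ["clicks", "carts", "orders", "clicks", "clicks"],
})

session_features = (
    df.group_by("session")
    .agg(
        ((pl.col("ts").max() - pl.col("ts").min()) / 60).alias("duration_mins"),  # ts assumed in seconds
        pl.col("aid").n_unique().alias("n_unique_aids"),
        (pl.col("type") == "clicks").sum().alias("n_clicks"),
        (pl.col("type") == "carts").sum().alias("n_carts"),
        (pl.col("type") == "orders").sum().alias("n_orders"),
    )
)
print(session_features)
```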

Because the majority of positive candidates came from `past items in the session`, there’s one important feature that the ranker relies on: the exponentially decayed recency score. It gives more weight to recent items than to old ones. Another variant of this feature is the type-weighted exponentially decayed recency score, where I add a weight for the different event types click/add-to-cart/order.
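A minimal sketch of one possible form of this score; the exact decay formula and the type weights used in the repo may differ from the illustrative values below.

```python
import numpy as np

def recency_scores(n_events: int) -> np.ndarray:
    """Exponentially decayed recency weights for a session's items,
    ordered from oldest (index 0) to newest (index n_events - 1)."""
    if n_events == 1:
        return np.ones(1)
    # Newest item gets weight 1.0; older items decay towards ~0.07.
    return 2 ** np.linspace(0.1, 1.0, n_events) - 1

# Type-weighted variant: scale the recency weight by the event type.
TYPE_WEIGHT = {"clicks": 0.1, "carts": 0.3, "orders": 0.6}  # illustrative weights only

print(recency_scores(5))  # ~[0.07, 0.25, 0.46, 0.71, 1.00]
```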

Ensemble

I used a weighted average of the normalized models’ scores. Because each algorithm’s scores had a different range of values, I needed to normalize them so the range was between 0 and 1.

the difference in scores distribution
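For example, with min-max normalization per model; the blend weights and scores below are hypothetical, not the values used for the final submission.

```python
import numpy as np

def minmax(scores: np.ndarray) -> np.ndarray:
    """Rescale one model's scores into [0, 1] so the models can be blended."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

# Scores of the three models for the same candidate rows (toy values).
catboost_ranker_scores = np.array([2.1, -0.3, 0.7])
catboost_clf_scores = np.array([0.92, 0.10, 0.55])
lgbm_clf_scores = np.array([0.88, 0.05, 0.60])

weights = [0.5, 0.25, 0.25]  # hypothetical blend weights
ensemble_scores = (
    weights[0] * minmax(catboost_ranker_scores)
    + weights[1] * minmax(catboost_clf_scores)
    + weights[2] * minmax(lgbm_clf_scores)
)
```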

Learnings & Closing

The competition taught me a lot about real-world recommendation systems, and one of the challenges was the huge dataset. Given limited memory (2x 8GB), I needed to think carefully about how I should process the data. The solution was to sample and chunk the data, sampling in such a way that the sample’s distribution stays the same as the population’s distribution.
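A tiny sketch of the chunking idea: split the unique sessions into chunks, then build candidates and features one chunk at a time and persist each chunk to disk. The chunk count, toy frame, and file names are illustrative.

```python
import numpy as np
import polars as pl

# Toy stand-in for the events frame; in practice this is the 216m-row table.
df = pl.DataFrame({"session": [1, 1, 2, 3, 3, 4], "aid": [10, 11, 12, 13, 10, 14]})

N_CHUNKS = 2  # illustrative; in practice chosen to fit the available RAM
sessions = df.get_column("session").unique().sort().to_numpy()

for chunk_id, chunk_sessions in enumerate(np.array_split(sessions, N_CHUNKS)):
    chunk_df = df.filter(pl.col("session").is_in(chunk_sessions.tolist()))
    # ...generate candidates and features for this chunk only, then persist it.
    chunk_df.write_parquet(f"chunk_{chunk_id:02d}.parquet")
```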

I also experimented with several new tools that I would definitely use in my work.

1. polars. I used it for most of the data transformations; it’s way faster and more memory efficient than pandas.

2. Annoy. I used it for candidate retrieval and similarity feature generation. It’s faster than querying raw keyed vectors, with a trade-off in accuracy (a small example follows this list).

3. Treelite. I used it for faster LGBM inference time.
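As a small example of the Annoy usage mentioned above, here is how an approximate nearest-neighbour index could be built over item embeddings; the embedding size, tree count, and random vectors are placeholders for the actual Word2Vec/Matrix Factorization vectors.

```python
import numpy as np
from annoy import AnnoyIndex

dim = 32  # illustrative embedding size
rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(1000, dim))  # stand-in for Word2Vec item vectors

index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity
for item_id, vector in enumerate(item_vectors):
    index.add_item(item_id, vector.tolist())
index.build(15)  # more trees -> better accuracy, slower build

# Approximate nearest neighbours of item 0, e.g. extra candidates for a session
# whose last event was item 0 (the first neighbour is usually the item itself).
neighbours = index.get_nns_by_item(0, 20)
```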

As I spent more time in the competition, my code got more and more complex. That’s natural when you start with a baseline pipeline and keep adding components to it. So it’s better to structure the pipeline’s repository to be as modular as possible from the start: it’s easier to add components than to refactor the code once you already have too many of them.

I hope this post can be a useful reference for anyone who wants to implement a similar project in the future. Thanks for reading all the way through!
