Silver Medal Solution on Kaggle H&M Personalized Fashion Recommendations
Our 89th-place solution out of 2,952 competing teams
The competition was hosted by H&M Group on Kaggle. I joined the competition with a colleague from work for fun. We wanted to learn about recommendation systems and agreed that the best way to do so was to work on an exciting project like this competition.
Link to competition page: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview
H&M’s online store offers shoppers an extensive selection of products to browse through. But with too many choices, customers might not quickly find what interests them or what they are looking for, and ultimately, they might not make a purchase. To enhance the shopping experience, product recommendations are essential.
Hence, the task is: “Given previous transactions and customer & product metadata, including product descriptions and images, recommend the 12 products that are most relevant to each customer.” Since we need to sort/rank the 12 most relevant products, we can approach this as either a ranking or a classification problem.
Participants were free to choose which data to use: only the historical transactions, the product description text, or even the product images. The solution was not restricted to specific algorithms. We were only expected to generate the top 12 relevant products per customer, which would then be evaluated with the same metric: Mean Average Precision @ 12 (MAP@12).
The top 12 predictions are compared to the products customers actually bought in the following 7 days. MAP@12 is a common evaluation metric for ranking problems. One way to describe it is: “MAP is a metric that tells you how much of the relevant documents (products, in this case) are concentrated in the highest-ranked predictions.” In simple words, the more actually-bought products we place at the top of the ranking, the better the score.
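For illustration, MAP@12 can be computed in a few lines of Python. This is a sketch of the standard definition; the competition used its own evaluation code:

```python
import numpy as np

def apk(actual, predicted, k=12):
    """Average precision at k for one customer."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant product appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)

def mapk(actuals, predicteds, k=12):
    """Mean average precision at k over all customers."""
    return float(np.mean([apk(a, p, k) for a, p in zip(actuals, predicteds)]))
```

Notice how a relevant product at rank 1 contributes a full point of precision, while the same product at rank 12 contributes only 1/12 — this is what rewards putting actual purchases at the top.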
For more information regarding this metric, go to this link.
The competition was rather different from previous Kaggle competitions in that the datasets were not ready to train on. We needed to design a pipeline and preprocess the data before we could call the model.fit() method. This made the setup much closer to a real-world data science project.
We were given three datasets + 1 collection of image data.
1. Article (product) metadata. It consists of the product name, categorical columns describing the product’s colour, department, and garment group, and text descriptions.
2. Customer metadata. It consists of club member status, whether they subscribe to fashion news, and age.
3. Historical transactions.
4. Article images. We didn’t use these in our solution, since we only had limited computation power (the free Kaggle notebook quota of 30 hrs with a 16 GB GPU, and a MacBook with 8 GB of RAM).
The datasets contain 106k products and 1.37M customers. If we included every product as a candidate for every customer, we would have 106k × 1.37M ≈ 145 billion rows of data, not to mention the size of the features. We needed a way to reduce the data, or it would never fit into RAM.
One great analysis from a fellow Kaggler here helped us focus on only the relevant products. It showed, for each week, the number of articles grouped by the last time they had been bought before that week. The majority of transactions involve articles that were also bought within the previous week. Based on this, we decided to use only articles that had been bought within the last 6 weeks as candidates. Later we filter these candidates again to reduce the number of candidates per customer.
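In pandas, that candidate filter is essentially one line over the transactions table. A sketch with toy data; the column names follow the competition’s transactions_train.csv:

```python
import pandas as pd

# Toy stand-in for transactions_train.csv (t_dat, customer_id, article_id, ...).
trx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-20", "2020-07-01", "2020-09-18"]),
    "article_id": [111, 222, 333],
})

cutoff = trx["t_dat"].max() - pd.Timedelta(weeks=6)

# Keep only articles that were still being bought in the last 6 weeks.
candidate_articles = trx.loc[trx["t_dat"] > cutoff, "article_id"].unique()
```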
Our solution was inspired by the well-known two-step architecture: (1) Candidate Retrieval and (2) Ranking. Candidate Retrieval focuses on recall: it aims to filter the 106k products down to ~30 that are relevant to each customer. Ranking focuses on precision: it aims to order those ~30 candidates correctly and select only the top 12 products.
Candidate Retrieval (Recall)
The cold-start problem arises when a new customer appears and we have no information about them. We addressed it by recommending the 12 most popular products from the last week.
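A sketch of that fallback: count purchases from the most recent week and take the 12 best sellers (toy data here; the real code runs over the full transactions table):

```python
import pandas as pd

trx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-20"] * 3 + ["2020-09-19"] * 2),
    "article_id": [1, 1, 2, 3, 3],
})

week_ago = trx["t_dat"].max() - pd.Timedelta(days=7)

# Top-12 best sellers of the last week, used for customers with no history.
popular = (
    trx.loc[trx["t_dat"] > week_ago, "article_id"]
    .value_counts()
    .head(12)
    .index.tolist()
)
```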
Rule-based candidate retrieval
We used several strategies as our Recall model.
1. Popular products from last 1-week group by customer segment
2. Products that were previously purchased by customer
3. Products that are bought together
4. Products with similar price
5. TensorFlow Recommenders Retrieval
With these strategies we reached ~7% recall. There are techniques we didn’t use: collaborative filtering, similarity from image embeddings, and similarity from product description embeddings.
From the start, we decided to develop our ranking models separately. We aimed for diverse models so that we could later ensemble the predictions. Our approaches differed even in how we prepared the training data.
In the context of ranking ML, the training data must contain both positive and negative samples. Positive samples are actual purchases from the historical transactions; we label them 1 (purchased). Negative samples are samples we label 0 (not purchased). Why do we need negative samples? Because we will use the ranking model not only to sort the products a customer bought but also the products they did not buy. Remember that our Candidate Retrieval only has ~7% recall, meaning most candidates were not actually bought. So how do we generate negative samples?
We used two techniques to generate negative samples.
1. Use the output from Candidate Retrieval since it has both positive and negative samples.
2. Randomly select N available products as negative samples.
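Technique 2 can be sketched as follows (a hypothetical helper; the real pipeline works on DataFrames rather than Python sets):

```python
import random

def sample_negatives(purchased, all_articles, n, seed=42):
    """Randomly pick n articles the customer did NOT buy, to be labelled 0."""
    rng = random.Random(seed)
    pool = [a for a in all_articles if a not in purchased]
    return rng.sample(pool, min(n, len(pool)))
```

Technique 1 needs no extra code: label each retrieved candidate 1 if it appears in the customer’s future purchases, and 0 otherwise.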
In the context of ranking ML, there are three types of features.
1. User features. Derived only from customer data. Examples: age, num_trx_last_90d, mean_price_last_90d, etc.
2. Item features. Derived only from product data. Examples: product_colour, product_category, product_price, num_bought_last_7d, etc.
3. User-item features. Derived from the interaction between customer and product. Examples: the difference between user_mean_price and product_price, or how many times the user purchased products of the same colour/category/etc. Good user-item features are key to a well-performing model.
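As an illustration, the price-gap user-item feature could be built like this (a sketch with toy data; our actual feature code was more involved):

```python
import pandas as pd

trx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [10, 11, 10],
    "price": [0.05, 0.03, 0.05],
})

# User feature: mean price the customer pays.
user_mean = trx.groupby("customer_id")["price"].mean().rename("user_mean_price")
# Item feature: (mean) price of the article.
item_price = trx.groupby("article_id")["price"].mean().rename("article_price")

# User-item feature: gap between the customer's usual price and the article price.
pairs = trx[["customer_id", "article_id"]].drop_duplicates()
pairs = pairs.merge(user_mean, on="customer_id").merge(item_price, on="article_id")
pairs["price_gap"] = pairs["user_mean_price"] - pairs["article_price"]
```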
We used varying observation windows for feature engineering, i.e. all time and the last 8/4/1 weeks. The following are the features we used.
Since we labelled the data 1/0, we could use either a ranker algorithm or a classification algorithm. We mainly used two libraries:
1. LGBM. (LGBMRanker & LGBMClassifier)
2. CatBoost. (CatBoostClassifier)
The final submission was an ensemble of 7 different models. We varied how we generated the negative samples in the training data, using several positive-to-negative sample ratios, so we ended up with quite diverse models.
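One simple way to blend several models’ top-12 lists is weighted reciprocal-rank voting. This is a sketch of one common heuristic, not necessarily the exact blending we used; weights would come from validation scores:

```python
from collections import defaultdict

def blend(rankings, k=12, weights=None):
    """Merge ranked lists: each model votes 1/rank for every product it ranks."""
    weights = weights or [1.0] * len(rankings)
    score = defaultdict(float)
    for w, ranking in zip(weights, rankings):
        for pos, article in enumerate(ranking):
            score[article] += w / (pos + 1)
    return sorted(score, key=score.get, reverse=True)[:k]
```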
It’s very important to set up a robust validation strategy for a Kaggle competition. We used the last 7 days of the training data as our validation set. It would have been better to have more than one validation fold, but given compute and time constraints we settled for a single fold. We observed that even a single validation fold correlated well with the leaderboard score.
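The split itself is just a date cut on the transactions table (a sketch with toy dates):

```python
import pandas as pd

trx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-01", "2020-09-20", "2020-09-22"]),
    "customer_id": ["c1", "c1", "c2"],
})

valid_start = trx["t_dat"].max() - pd.Timedelta(days=7)

train = trx[trx["t_dat"] < valid_start]   # used for features and training labels
valid = trx[trx["t_dat"] >= valid_start]  # held-out last 7 days, mimics MAP@12 setup
```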
The competition taught us a lot. Special thanks to my teammate, Hervind Philipe; together we earned a silver medal and valuable experience implementing machine learning solutions for a recommendation system. We hope this post is a useful reference for anyone who wants to build similar projects in the future. Thanks for reading!