Strategy Design Pattern for Effective ML Pipeline

Aji Samudra
6 min read · Feb 9, 2022

The Strategy and Factory design patterns in Python help us break complex problems into smaller pieces that are easier to extend and modify, which makes for an effective ML pipeline.

Motivation

Given an ML problem, as a Data Scientist you will want to experiment with different aspects of your ML pipeline, e.g. Feature Engineering and Learning Algorithms. Each aspect offers many options. If you experiment with learning algorithms, you might try everything from the simplest (e.g. a linear model) to the most complex (e.g. a Neural Network or Gradient Boosting). If you experiment with feature engineering, there are many possible combinations of features, and you need to evaluate how each affects model performance. It gets harder still if you want to experiment with both learning algorithms and feature engineering at once.

Doing experimentation in ML training is a complex problem.

Photo by Markus Winkler on Unsplash

Learning Objective

We will learn about two design patterns that come in handy in Python Data Science projects: Strategy and Factory.

Code used in this blog post is here

Python Design Pattern

Factory

"Abstract Factory is a creational design pattern, which solves the problem of creating entire product families without specifying their concrete classes. Factory defines an interface for creating all distinct products but leaves the actual product creation to concrete factory classes." (Refactoring Guru)

In other words, the Factory pattern as used here involves two kinds of classes: an Abstract Class and Concrete Classes. Each Concrete Class is a subclass of the Abstract Class. The Abstract Class declares the interface, as methods decorated with `abstractmethod`, without providing an implementation. The Concrete Class provides that implementation. When the Abstract Class derives from `abc.ABC`, `abstractmethod` enforces the contract: a Concrete Class that does not implement every abstract method cannot be instantiated. With this, we can create as many Concrete Classes as we want without worrying about breaking the code that uses them, because they all share the same interface.
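The enforcement described above can be sketched in a few lines of plain Python. The class names (`Algorithm`, `MeanRegressor`, `IncompleteAlgorithm`) are hypothetical and chosen only for illustration; they are not taken from the post's original code.

```python
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Abstract Class: declares the interface, provides no implementation."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class MeanRegressor(Algorithm):
    """Concrete Class (hypothetical): implements every abstract method."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Predict the training mean for every input row.
        return [self.mean_ for _ in X]


class IncompleteAlgorithm(Algorithm):
    """Forgets to implement predict, so it cannot be instantiated."""

    def fit(self, X, y):
        return self


model = MeanRegressor().fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4], [5]]))  # [20.0, 20.0]

try:
    IncompleteAlgorithm()
except TypeError as err:
    # Python refuses to create an object with unimplemented abstract methods.
    print(f"TypeError: {err}")
```

The error is raised at instantiation time, not at class definition time, so a missing method is caught as soon as the pipeline tries to build the object.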

Example case that fits the Factory pattern:

Example 1:
Scikit-learn's well-known `fit` and `predict` methods. They are a good example of a standard interface shared across the algorithms available in a library.

Example 2:
Let’s say you want to create a training pipeline for two different kinds of libraries, i.e. Scikit-learn and lightgbm/xgboost/catboost. Even though both expose `fit` and `predict` methods, those methods are not used in the same way. In lightgbm/xgboost/catboost, `fit` accepts additional parameters such as `eval_set` and, depending on the library and version, `early_stopping_rounds`, which help prevent the model from overfitting. On the other hand, Scikit-learn linear models (LinearRegression, LogisticRegression, etc.) typically benefit from feature scaling before `fit` is called, since scaling helps iterative solvers converge to the optimal solution faster. So even with only 2 different algorithms (linear model and gradient boosting), the pipeline needs a different process for each.

Gradient boosting: (1) read training data (2) split data into train and test (3) fit model (4) predict (5) evaluate model
Linear model: (1) read training data (2) split data into train and test (3) scale features (4) fit model (5) predict (6) evaluate model

The Factory pattern lets us create a standard interface for different ML libraries by wrapping their different processes in wrapper `fit` and `predict` methods.

Strategy

"Strategy is a behavioral design pattern that turns a set of behaviors into objects and makes them interchangeable inside original context object. The original object, called context, holds a reference to a strategy object and delegates it executing the behavior. In order to change the way the context performs its work, other objects may replace the currently linked strategy object with another one." (Refactoring Guru)

In other words, in the Strategy pattern we plug in the different Concrete Classes from the Factory pattern. This lets us switch the behavior/strategy interchangeably within the same problem context.

Example case that fits the Strategy pattern:
In an ML training pipeline, we separate the pipeline into several parts: (1) the pipeline (2) the data (3) the algorithm (4) the feature engineering, etc.

  1. We frame the pipeline as the Context object.
  2. We can have different Algorithms as different behaviors/strategies.
  3. We can have different Feature Engineering steps or Datasets as different strategies.

Later, if we want to add a new Algorithm, Feature Engineering step, or Dataset, we just need to create a new Concrete Class for it, without modifying the pipeline (the Context object). In short, our code is easy to modify and extend.
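As a minimal, library-free sketch of this idea, the Context below holds a reference to a feature engineering strategy and delegates to it. All names here (`FeatureEngineering`, `RawFeatures`, `SquaredFeatures`, `Pipeline`) are hypothetical and only illustrate the shape of the pattern.

```python
from abc import ABC, abstractmethod


class FeatureEngineering(ABC):
    """Strategy interface: every feature step exposes `transform`."""

    @abstractmethod
    def transform(self, rows):
        ...


class RawFeatures(FeatureEngineering):
    """Strategy 1: pass the features through unchanged."""

    def transform(self, rows):
        return [list(row) for row in rows]


class SquaredFeatures(FeatureEngineering):
    """Strategy 2: append the square of each value to the row."""

    def transform(self, rows):
        return [list(row) + [value ** 2 for value in row] for row in rows]


class Pipeline:
    """Context: holds a reference to a strategy and delegates to it."""

    def __init__(self, feature_engineering: FeatureEngineering):
        self.feature_engineering = feature_engineering

    def prepare(self, rows):
        return self.feature_engineering.transform(rows)


rows = [[1.0, 2.0], [3.0, 4.0]]
pipeline = Pipeline(RawFeatures())
print(pipeline.prepare(rows))  # [[1.0, 2.0], [3.0, 4.0]]

# Swap the strategy; the Pipeline class itself is untouched.
pipeline.feature_engineering = SquaredFeatures()
print(pipeline.prepare(rows))  # [[1.0, 2.0, 1.0, 4.0], [3.0, 4.0, 9.0, 16.0]]
```

Adding a third feature step means writing one more subclass; `Pipeline` never needs to change.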

Implementation

Let’s go to code!
In this example, we use the Iris dataset. It has 3 classes and we frame the problem as a multi-class classification problem. We use two algorithms from different libraries for this purpose:

  1. LogisticRegression from Scikit-learn
  2. LGBMClassifier from lightgbm

As mentioned before, we need to implement different processes for each algorithm.
LGBMClassifier: (1) read training data (2) split data into train and test (3) fit model (4) predict (5) evaluate model
LogisticRegression: (1) read training data (2) split data into train and test (3) scale features (4) fit model (5) predict (6) evaluate model

Before

How do we create a pipeline that is able to train different algorithms? The easiest way is to write a single function with an if-else statement, in a procedural style. Each if-else block defines the unique process for one algorithm, so the code ends up as one function with a branch per algorithm.
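A minimal sketch of such an if-else pipeline is shown here. It is not the post's original code. To keep the example dependent on scikit-learn only, it assumes scikit-learn's `GradientBoostingClassifier` as a stand-in for `LGBMClassifier`. The function name `train`, the algorithm keys, and the split settings are also assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train(algorithm: str) -> float:
    # Read training data and split it into train and test sets.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    if algorithm == "logistic_regression":
        # Scale features, fitting the scaler on the training split only.
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        # Evaluate model.
        return accuracy_score(y_test, y_pred)
    elif algorithm == "gradient_boosting":
        # Stand-in for LGBMClassifier (assumption); no scaling needed.
        model = GradientBoostingClassifier(random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Evaluate model (the same code as in the other branch).
        return accuracy_score(y_test, y_pred)
    else:
        raise ValueError(f"Unknown algorithm: {algorithm}")


for name in ("logistic_regression", "gradient_boosting"):
    print(name, train(name))
```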

You might notice that we didn’t follow the DRY principle here, because the "evaluate model" step is repeated in every branch. We can’t simply merge the branches, because LogisticRegression needs feature scaling while gradient boosting doesn’t. Now imagine adding a new algorithm to the pipeline: you need a new if-else block for it, and you will probably repeat the same code again. This is hard to maintain when you have a lot of algorithms, and we end up with a very long function, which is not good practice. Let’s see what the Strategy and Factory patterns can do.

After

To implement the Strategy and Factory patterns, we need to switch from a procedural style to object-oriented programming. There are 2 high-level objects that we need: Context and Strategy. In our case, the training pipeline is the Context object. The Context can take various Strategy objects, which are defined using the Factory pattern.

The Context object takes an Algorithm (i.e. a Strategy object) as input. The Algorithm strategy is interchangeable, so if we want to add a new algorithm to the pipeline, we don’t need to change the Context object.

Notice that the Algorithm object now only exposes the wrapper `fit` and `predict` methods. The different processes behind `fit` and `predict` are defined later, using the Factory pattern.
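A Context along these lines might look like the sketch below. The class name `TrainingPipeline`, its `run` method, and the split settings are assumptions rather than the post's original code. A plain scikit-learn `KNeighborsClassifier` stands in for the strategy, simply because it already exposes `fit` and `predict`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


class TrainingPipeline:
    """Context: runs the shared steps and delegates fit/predict to a strategy."""

    def __init__(self, algorithm):
        # Any object exposing fit(X, y) and predict(X) works here.
        self.algorithm = algorithm

    def run(self) -> float:
        # Read training data and split it into train and test sets.
        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        # The Context does not know what happens inside fit and predict.
        self.algorithm.fit(X_train, y_train)
        y_pred = self.algorithm.predict(X_test)
        # Evaluation is written once, for every algorithm.
        return accuracy_score(y_test, y_pred)


# Stand-in strategy (assumption): any fit/predict object can be passed in.
pipeline = TrainingPipeline(KNeighborsClassifier())
accuracy = pipeline.run()
print(f"accuracy: {accuracy:.3f}")
```

The evaluation step now lives in exactly one place, which removes the repetition seen in the if-else version.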

How about the implementation of the Algorithm strategy using the Factory pattern?
Here, we need to define an Abstract Class and the Concrete Classes. The Abstract Class only defines the interface: we declare the wrapper `fit` and `predict` methods with `abstractmethod`.

Each Concrete Class defines the implementation for one algorithm. For example, for LogisticRegression we need to scale the features in X before fitting the model, and then keep the fitted scaler object so that we can reuse it later in the `predict` method. The Context object doesn’t know about these differences, because it only needs the wrapper `fit` and `predict` methods.
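The Abstract Class and two Concrete Classes could be sketched as follows. The class names are hypothetical, and scikit-learn's `GradientBoostingClassifier` is again assumed as a stand-in for `LGBMClassifier` so that the example needs only one library.

```python
from abc import ABC, abstractmethod

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


class Algorithm(ABC):
    """Abstract Class: only declares the wrapper interface."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class LogisticRegressionAlgorithm(Algorithm):
    """Concrete Class: scales features, then fits a linear model."""

    def __init__(self):
        self.scaler = StandardScaler()
        self.model = LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        # Fit the scaler here and keep it on the object for predict.
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        return self

    def predict(self, X):
        # Reuse the scaler fitted during training; do not refit it.
        return self.model.predict(self.scaler.transform(X))


class GradientBoostingAlgorithm(Algorithm):
    """Concrete Class: stand-in for LGBMClassifier (assumption), no scaling."""

    def __init__(self):
        self.model = GradientBoostingClassifier(random_state=42)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


X, y = load_iris(return_X_y=True)
for algorithm in (LogisticRegressionAlgorithm(), GradientBoostingAlgorithm()):
    predictions = algorithm.fit(X, y).predict(X[:5])
    print(type(algorithm).__name__, predictions)
```

Either object can be handed to a pipeline that only calls `fit` and `predict`; the scaling stays hidden inside `LogisticRegressionAlgorithm`.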

We have now implemented different Strategies for the Algorithm. With this setup, we can also add new Strategy objects for Feature Engineering or Dataset, so it is easier to add new blocks of Strategies around the Pipeline (the Context object).

Conclusion

Strategy and Factory patterns are useful, especially when you have a lot of possible “strategies” given a certain context.

Reference

  1. Refactoring Guru: https://refactoring.guru/
  2. Arjan Codes — Write Better Python Code: https://github.com/ArjanCodes/betterpython
