Object-Oriented Machine Learning — Creating a Production-Grade Machine Learning Pipeline Using OOPs
Data science today is a melting pot of several fields. While it is common to see professionals from fields like statistics, economics and computer engineering, you will also come across people from cellular biology, physics and even chemistry working as data scientists across industries. Because of this diverse mix of professionals, a wide variety of practices is followed. A statistician, for instance, may have a good grounding in the statistical aspects of modeling data and be good at looking into issues like bias, multicollinearity and heteroscedasticity, but he/she might not be well versed in established software engineering practices like Object Oriented Programming Systems (OOPs).
An ML model is only as good as its application and actionability in the real world. This article focuses on approaching ML architecture using OOPs and on its benefits in creating an end-to-end, production-level, reproducible machine learning pipeline. We will build our code structure step by step from scratch, based on a time-series problem template.
Defining the Skeleton
Let’s start by creating a main function, which acts as the driver of the whole code base. It takes a string variable called “ml_stage” as input, which defines which of the following stages of the machine learning pipeline we want to trigger.
- Train — we train our model at this stage; the input data may be balanced, depending on the target variable
- Eval — we evaluate our trained model on unseen, unbiased data
- Serve — we send the predictions to the relevant platform for actionability
The idea is that this main function will be called with an argument passed to it from the command line interface. In turn, it will call upon a class object that has all the necessary methods and variables packed in a structured manner. Let’s start by conceptualizing which variables this class will have.
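To make this concrete, here is a minimal sketch of what such a class could look like; the attribute names model_id, grain, stage and pipeline_ensemble follow this article, while the column-list attributes (col_target, col_features) are illustrative placeholders.

```python
class App:
    """Container for one end-to-end run of the ML pipeline."""

    def __init__(self, model_id, grain, stage):
        self.model_id = model_id        # which model/problem this run belongs to
        self.grain = grain              # time grain, e.g. "monthly" or "daily"
        self.stage = stage              # the ml_stage: "train", "eval" or "serve"

        self.pipeline_ensemble = None   # the fitted pipeline, kept in memory
        self.col_target = None          # name of the target column
        self.col_features = []          # feature columns used during training
```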
Here, model_id is the model you want to run; this way, for your next machine learning problem you can reuse the same code with just a different model_id. Since we are developing this application for a time-series problem, it is good to make the grain dynamic as well: in the case of monthly predictions this will be “monthly”, for daily predictions “daily”, and so on. “stage”, again, is the ml_stage we discussed above.
It is also good to save your model in one of your class variables (provided you are not building terabytes of model dumps, in which case keep it in external storage for later use). For this, we use “pipeline_ensemble” (because I like ensemble pipelines).
Lastly, we will have a list of all the columns we use during training, so we can restrict our eval and serve stages to the columns used in the train stage.
Now we can start by creating our object of the App class inside the main function described above.
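A sketch of the driver, assuming the stage is read from a command line argument with argparse; the model_id and grain values passed to App are placeholders.

```python
import argparse


def main(ml_stage: str):
    # Instantiate the App object for this run; "my_model" and "monthly"
    # are placeholder values for model_id and grain.
    app = App(model_id="my_model", grain="monthly", stage=ml_stage)
    # The rest of the pipeline is wired in here, stage by stage,
    # in the sections that follow.


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ml_stage", choices=["train", "eval", "serve"], required=True)
    args = parser.parse_args()
    main(args.ml_stage)
```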
Create Pre-Model Dataset (Pre-Processing)
Now, let’s create the model-usable data and save it, then load it back. For the scope of this article, I have not shown how the premodel data is created, but it is important to understand what df_cohort and df_ads are below. “df_cohort” is a data set of IDs (customers/patients/others) at a monthly level (because our grain is monthly) that qualify for our particular model_id. df_ads has all kinds of features from any known source, and again each row represents a particular ID in a particular month; hence this is a monthly ADS. One important difference between the two is that df_cohort is model_id specific. This way, we can reuse the bulk of the common features in df_ads for multiple problems.
It is also important to extract the column lists at this point. This way, we will only refer to the self.col_* variables throughout our pipeline.
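A hedged sketch of how these steps might look in PySpark; the storage paths, the join keys (id, month) and the assumed “target” column name are purely illustrative.

```python
class App:
    # (continuing the App class sketch; __init__ shown earlier)

    def create_premodel_data(self, spark):
        """Join the model-specific cohort with the common ADS and persist the result."""
        # df_cohort: IDs that qualify for this model_id, one row per ID per month
        df_cohort = spark.read.parquet(f"premodel/cohort/{self.model_id}")
        # df_ads: shared monthly feature store, one row per ID per month
        df_ads = spark.read.parquet("premodel/ads/monthly")

        df_premodel = df_cohort.join(df_ads, on=["id", "month"], how="left")
        df_premodel.write.mode("overwrite").parquet(f"premodel/{self.model_id}/{self.stage}")

    def load_premodel_data(self, spark):
        return spark.read.parquet(f"premodel/{self.model_id}/{self.stage}")

    def extract_column_lists(self, df):
        """Freeze the column lists once, so eval/serve use exactly the train columns."""
        self.col_target = "target"        # assumed name of the label column
        id_cols = ["id", "month"]
        self.col_features = [c for c in df.columns if c not in id_cols + [self.col_target]]
```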
Adding the above code to the main function call:
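The main function could then grow to something like this; the .cache() call is the caching referred to at the end of the article, since the premodel DataFrame is reused by several stages.

```python
from pyspark.sql import SparkSession


def main(ml_stage: str):
    spark = SparkSession.builder.appName("oop_ml_pipeline").getOrCreate()
    app = App(model_id="my_model", grain="monthly", stage=ml_stage)

    app.create_premodel_data(spark)
    df = app.load_premodel_data(spark).cache()  # cached because it is reused below
    app.extract_column_lists(df)
```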
Create/Train/Predict Machine Learning Pipeline
Now, let’s create a pipeline. This pipeline should have all the pre-processing steps as estimators (read up on pipelines in sklearn/PySpark). Imputation of missing values, custom transformations (better if they take the form of a custom class), scaling and feature selection are all done in the “fit” method through its call to “create_pipeline”.
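Here is one way to sketch this with pyspark.ml. The concrete stages (Imputer, VectorAssembler, StandardScaler, GBTClassifier) are just one possible choice and assume all-numeric features; a custom transformer class or a feature-selection stage would slot into the same stages list.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, StandardScaler, VectorAssembler
from pyspark.ml.classification import GBTClassifier


class App:
    # (continuing the App class sketch)

    def create_pipeline(self):
        """Assemble pre-processing and the model into a single Pipeline object."""
        imputed_cols = [f"{c}_imputed" for c in self.col_features]
        imputer = Imputer(inputCols=self.col_features, outputCols=imputed_cols)
        assembler = VectorAssembler(inputCols=imputed_cols, outputCol="features")
        scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
        model = GBTClassifier(featuresCol="scaled_features", labelCol=self.col_target)
        return Pipeline(stages=[imputer, assembler, scaler, model])

    def fit(self, df_train):
        """Create the pipeline and fit every stage on the training data."""
        self.pipeline_ensemble = self.create_pipeline().fit(df_train)
```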
Adding to main:
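A sketch of that addition; the helper methods it calls are introduced just below.

```python
def main(ml_stage: str):
    # ... pre-model steps from the previous snippet ...
    if app.stage == "train":
        df_train, df_test = app.split_train_test(df)
        app.fit(df_train)
        app.save_model()
    else:
        app.load_model()
```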
As you can see, if we are training, we create the pipeline and train it, and then we also save it. This is good practice: “de-coupling” the process in the middle lets us debug any issue at a certain stage and enables “restart-ability” from the point of failure. So we now create save_model and load_model methods, along with split_train_test.
The reason save_model and load_model are separate methods (you could always merge them into one by branching on self.stage) is that if our external storage changes from, say, S3 to GCS, we don’t have to alter much of our code, only the relevant methods. We can also support multiple external storages this way. split_train_test, on the other hand, is a separate method because we might want to experiment with our split ratio. Again, the idea is that whatever we might want to change remains a separate module, for ease of testing.
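A sketch of the three methods; the local models/ path and the 80/20 default split are assumptions for illustration, and an S3 or GCS URI would drop into the same place.

```python
from pyspark.ml import PipelineModel


class App:
    # (continuing the App class sketch)

    def save_model(self):
        # Persist the fitted pipeline to external storage (local path here;
        # an S3/GCS URI would work the same way).
        self.pipeline_ensemble.write().overwrite().save(f"models/{self.model_id}")

    def load_model(self):
        self.pipeline_ensemble = PipelineModel.load(f"models/{self.model_id}")

    def split_train_test(self, df, train_fraction=0.8):
        # Kept as its own method so the split ratio/strategy can change in isolation.
        return df.randomSplit([train_fraction, 1 - train_fraction], seed=42)
```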
Now, let’s add predict and get_metrics (evaluate) to our main function body.
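A sketch of the two methods; the BinaryClassificationEvaluator assumes a binary target, so substitute whichever evaluator matches your problem. Their wiring into main appears in the consolidated sketch further below.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator


class App:
    # (continuing the App class sketch)

    def predict(self, df):
        """Score a DataFrame with the fitted pipeline."""
        return self.pipeline_ensemble.transform(df)

    def get_metrics(self, df_scored):
        """Evaluate predictions; AUC is just one example metric."""
        evaluator = BinaryClassificationEvaluator(labelCol=self.col_target)
        return {"auc": evaluator.evaluate(df_scored)}
```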
Deliver Results for Action
Last but not least, our results have no value if they are not connected to a campaign management platform or some other platform where action can be taken. Normally, each platform has its own specific export format requirement: some may ask for a JSON file in a certain format, others for a CSV, or even a direct database upload over ODBC/JDBC connections. By creating a separate method, “convert_to_export_schema”, we can enable different formats just by changing this one function. After that, “map_and_export” can take care of the eventual upload.
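A minimal sketch of the two methods, assuming a JSON file export; the selected columns and the output path are placeholders, and a CSV, JDBC or API upload would sit behind the same interface.

```python
class App:
    # (continuing the App class sketch)

    def convert_to_export_schema(self, df_scored):
        """Reshape scored output into the schema the target platform expects."""
        # Illustrative: keep only the ID, grain and score columns.
        return df_scored.selectExpr("id", "month", "prediction as score")

    def map_and_export(self, df_export):
        """Push the formatted output to the downstream platform."""
        # Illustrative: a JSON dump; a JDBC write or an API call could live here instead.
        df_export.toPandas().to_json(f"export/{self.model_id}.json", orient="records")
```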
Adding to main to finally complete our code structure. As you can see, we don’t need to upload the train stage output, as it might be balanced and therefore not a correct reflection of real-world data.
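Putting the pieces together, the complete driver might look roughly like this (a consolidated sketch of the snippets above; paths, names and the choice of model remain placeholders):

```python
import argparse

from pyspark.sql import SparkSession


def main(ml_stage: str):
    spark = SparkSession.builder.appName("oop_ml_pipeline").getOrCreate()
    app = App(model_id="my_model", grain="monthly", stage=ml_stage)

    # Pre-model data
    app.create_premodel_data(spark)
    df = app.load_premodel_data(spark).cache()
    app.extract_column_lists(df)

    # Model
    if app.stage == "train":
        df_train, df_test = app.split_train_test(df)
        app.fit(df_train)
        app.save_model()
        df_scored = app.predict(df_test)
    else:
        app.load_model()
        df_scored = app.predict(df)

    # Metrics only where labels are available; export only for eval/serve
    if app.stage in ("train", "eval"):
        print(app.get_metrics(df_scored))
    if app.stage in ("eval", "serve"):
        app.map_and_export(app.convert_to_export_schema(df_scored))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ml_stage", choices=["train", "eval", "serve"], required=True)
    args = parser.parse_args()
    main(args.ml_stage)
```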
Have a look at the complete pseudo code here.
Now, to train, all you need to do is call the main function with ml_stage “train”. Once done, you can switch to “eval” and “serve”. I have not included detailed production code, as this article is about software architecture. As you can see, I’m using PySpark (see the caching in main), but you can always use the same structure with plain Python as well. Admittedly, I have not covered some important aspects of the ML development cycle, like feature selection, categorical feature conversion and outlier detection, because each is an expansive world in itself. From what I’ve personally seen and practiced, it is better to deal with outliers in a custom class that is part of our pipeline, whereas feature engineering and selection can be done as part of the model pipeline object.
Hope this helps you develop structured and reproducible machine learning solutions. Let me know in the comments if you see any ambiguities or have questions.