Solving Large-Scale Machine Learning Problems in Python
This guide demonstrates the xLearn library, which is built for machine learning tasks on large-scale data (data with a huge number of rows and columns).
xLearn official site: https://xlearn-doc.readthedocs.io/en/latest/index.html
xLearn library Documentation: https://xlearn-doc.readthedocs.io/_/downloads/en/latest/pdf/
Let’s Get Started!
First, we need to install the xLearn library using the pip command:
pip install xlearn
To import the xLearn library, we use import xlearn as xl.
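To confirm that the installation worked, a small check like the one below can help. It is only a sketch: the helper name load_xlearn is made up for this post, and the alias xl is a convention, not a requirement.

```python
def load_xlearn():
    """Return the xlearn module, or None if it cannot be imported."""
    try:
        import xlearn as xl  # conventional alias used in this post
        return xl
    except Exception as err:  # broad on purpose: any import-time failure is reported
        print("xLearn could not be imported:", err)
        return None


xl = load_xlearn()
if xl is not None:
    print("xLearn is ready to use")
```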
If you run into an issue while importing xLearn, you can follow the steps given in this answer on GitHub: https://github.com/aksnzhy/xlearn/issues/71#issuecomment-741799045
ML Models in xLearn
xLearn currently supports three machine learning algorithms:
- linear model
- factorization machine (FM)
- field-aware factorization machine (FFM)
Let’s look at each of these separately.
1) Linear Model
You can create a linear model in xLearn by calling xl.create_linear().
The linear model finds the line (more generally, the weighted sum of the features) that best describes the relationship between the X and Y variables. Linear models can be applied when the Y variable is continuous, and X can be continuous or categorical.
2) Factorization Machine (FM)
You can create the FM model in xLearn by calling xl.create_fm().
The FM model can be used for both regression and classification. It extends the linear model by also learning interactions between pairs of features, and it works well on high-dimensional datasets.
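3) Field-aware Factorization Machine (FFM)
You can create the FFM model in xLearn by calling xl.create_ffm().
The FFM model improves on the FM model: instead of one latent vector per feature, it learns a separate latent vector for each feature and field pair, so a feature can interact differently with features from different fields.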
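For readers who prefer formulas, the three models can be summarised as below. This is the standard formulation from the FM and FFM literature, not something taken from xLearn's source code. Here x is the feature vector, w_0 and w_i are the bias and linear weights, v_i is the latent vector of feature i, and f_j is the field that feature j belongs to.

```latex
\begin{aligned}
\text{Linear:}\quad & \phi(x) = w_0 + \sum_{i=1}^{n} w_i x_i \\
\text{FM:}\quad & \phi(x) = w_0 + \sum_{i=1}^{n} w_i x_i
    + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j \\
\text{FFM:}\quad & \phi(x) = w_0 + \sum_{i=1}^{n} w_i x_i
    + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i,f_j}, v_{j,f_i} \rangle \, x_i x_j
\end{aligned}
```

The only difference between FM and FFM is the latent vector used in the pairwise term: FM uses the same v_i for every partner feature, while FFM picks the vector that matches the partner's field.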
If you want to develop a core understanding of factorization machines and field-aware factorization machines, this very detailed blog covers the topic: https://wngaw.github.io/field-aware-factorization-machines-with-xlearn/
A key point mentioned in the xLearn documentation:
For LR and FM, the input data format can be CSV or libsvm. For FFM, the input data should be in the libffm format.
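To make the difference between the two text formats concrete, here is a small sketch. A libsvm row looks like "label feature:value ...", while a libffm row adds a field index in front of every feature, "label field:feature:value ...". The sample row and the helper name libffm_to_libsvm are invented for illustration.

```python
def libffm_to_libsvm(line: str) -> str:
    """Drop the field index from every 'field:feature:value' token."""
    label, *tokens = line.split()
    converted = []
    for token in tokens:
        _field, feature, value = token.split(":")
        converted.append(f"{feature}:{value}")
    return " ".join([label] + converted)


ffm_line = "1 0:3:1.0 1:7:0.5 2:12:1.0"  # made-up libffm row
svm_line = libffm_to_libsvm(ffm_line)
print(svm_line)  # prints: 1 3:1.0 7:0.5 12:1.0
```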
Simple Example
Suppose we are working on a binary classification problem where we want to predict whether a user is going to click on an ad on a website. This type of problem usually involves a high-dimensional dataset, because the data includes many features describing each user, which help in understanding that user’s click behavior.
Dataset link: https://github.com/aksnzhy/xlearn/tree/master/demo/classification/criteo_ctr
In this blog, small_train.txt is renamed as the train data and small_test.txt is renamed as the test data.
We will be using the xLearn library to solve this ML problem.
First, we import the xLearn library with import xlearn as xl.
Then we have to pick one of xLearn’s ML models. Since this is a classification problem with many interacting features, the two factorization models are the natural options:
- Factorization Machine (FM)
- Field-aware Factorization Machine (FFM)
The linear model also accepts the binary task in xLearn, but it cannot capture interactions between features, so we leave it out here. We are going with the second option, i.e., FFM.
Let’s create that model by calling xl.create_ffm() and keeping the result in a variable, which we will call ffm_model.
We could use sklearn’s train_test_split function to split the data into train and test sets, but here I have already created a separate file for each, so we do not need sklearn in this case.
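If you only have a single libffm text file, a plain-Python split of its rows is one alternative to sklearn. This is a substitute sketch, not what was done for this post; the helper name split_lines, the 20% test fraction, and the sample rows are all made up.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    rows = list(lines)
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


rows = [f"{i % 2} 0:{i}:1.0" for i in range(10)]  # made-up libffm rows
train_rows, test_rows = split_lines(rows)
print(len(train_rows), len(test_rows))  # prints: 8 2
```

Each list can then be written to its own file, one row per line, and passed to xLearn as the train and test data.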
The setTrain() method tells the FFM model where our training dataset is.
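Putting the last two steps together, creating the model and pointing it at the training file might look like the sketch below. The function name make_ffm and the file path in the usage line are placeholders, so adjust them to your own setup.

```python
def make_ffm(train_path):
    """Create an FFM model and register the training file with it."""
    # Imported inside the function so this snippet can be loaded
    # even on a machine where xlearn is not installed yet.
    import xlearn as xl

    ffm_model = xl.create_ffm()
    ffm_model.setTrain(train_path)
    return ffm_model
```

Usage, assuming the training file sits in the current directory: ffm_model = make_ffm("./small_train.txt")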
Now we define the parameters for the FFM model as a Python dictionary with the keys task, lr, and lambda.
Each parameter’s purpose:
- task: tells xLearn the type of problem you are solving. We are working on a binary classification problem, so we pass binary; if you are using the FFM model for a regression problem, you need to pass reg instead.
- lr (learning rate): you may have used it before, and its purpose is the same here. It controls how large a step the model takes on each update during training.
- lambda: the regularization strength. You will usually be working with a high-dimensional dataset, so lambda plays an important role in reducing overfitting by shrinking the weights of the least important features.
If you are working with regression, the task value changes to reg and the remaining parameters keep the same meaning.
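As a sketch, the two parameter sets can be written as plain dictionaries. The keys task, lr, and lambda are the ones xLearn reads; the numeric values are only illustrative starting points, not tuned settings.

```python
# Parameters for the click prediction problem (binary classification).
binary_param = {"task": "binary", "lr": 0.2, "lambda": 0.002}

# Same settings for a regression problem: only the task changes.
reg_param = {**binary_param, "task": "reg"}

print(binary_param)
print(reg_param)
```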
Now we need to fit our model to the training dataset.
We fit the model using the fit() method, passing the param dictionary along with a path for the trained model, for example ./model.out. A new file named model.out then gets created in your current directory, and it is going to be used to predict on our test dataset.
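The training step can be sketched as a small helper. The function name train is made up, and it assumes ffm_model was created with xl.create_ffm() and already had setTrain() called on it.

```python
def train(ffm_model, param, model_path="./model.out"):
    """Fit the model; xLearn writes the trained model to model_path."""
    ffm_model.fit(param, model_path)
    return model_path
```

Usage, assuming a param dictionary like the one described earlier: train(ffm_model, {"task": "binary", "lr": 0.2, "lambda": 0.002})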
When you run the code, xLearn prints its progress while it trains the model on our train dataset. Once training is complete, the model is saved in the current directory.
Now we will use that model.out file to predict on our test data.
We tell the model about our test dataset with setTest(), then call predict() with the model file and an output path, which saves the predicted values in the output.txt file.
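For reference, the prediction step can be sketched as follows. The helper name predict_to_file and the default paths are placeholders chosen for this post.

```python
def predict_to_file(ffm_model, test_path,
                    model_path="./model.out", output_path="./output.txt"):
    """Score the test file with a trained model and save one value per row."""
    ffm_model.setTest(test_path)                # where the test data lives
    ffm_model.predict(model_path, output_path)  # writes scores to output_path
    return output_path
```

Usage, assuming the test file sits in the current directory: predict_to_file(ffm_model, "./small_test.txt")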
This is what the output.txt file looks like:
-1.58631
-0.393496
-0.638334
-0.38465
-1.15343
Negative values mean that the user is not expected to click on the ad, while positive values mean that the user is expected to click on it. If you want your output to be 0 or 1, you can call the setSign() method on the model before predicting.
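Here is a sketch of that variant. The helper name predict_labels is made up, and the plain-Python part at the end rests on my assumption that setSign() simply thresholds the raw score at zero; it reuses the five scores shown above.

```python
def predict_labels(ffm_model, test_path,
                   model_path="./model.out", output_path="./output.txt"):
    """Same as a normal prediction, but the output file holds 0/1 labels."""
    ffm_model.setTest(test_path)
    ffm_model.setSign()  # has to be set before predict() is called
    ffm_model.predict(model_path, output_path)
    return output_path


# The same idea by hand (assumed rule): positive score -> 1, otherwise 0.
raw_scores = [-1.58631, -0.393496, -0.638334, -0.38465, -1.15343]
labels = [1 if score > 0 else 0 for score in raw_scores]
print(labels)  # prints: [0, 0, 0, 0, 0]
```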
Now the output.txt file looks like this:
0
0
0
0
This is a very brief guide to the xLearn library, and there is a lot more to cover. I will be uploading more blogs on Machine Learning.
If you have any queries, feel free to ask me!