Solve Large-Scale Machine Learning Problems in Python

Fareed Khan
5 min read · Sep 24, 2022

This guide demonstrates the xLearn library, which is used to apply machine learning to large-scale data (data with a huge number of rows and columns).

xLearn official site: https://xlearn-doc.readthedocs.io/en/latest/index.html

xLearn library Documentation: https://xlearn-doc.readthedocs.io/_/downloads/en/latest/pdf/

Let’s Get Started!

First, we need to install the xLearn library using the pip command:

pip install xlearn

Next, we import the xLearn library, conventionally under the alias xl.
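Here is a minimal sketch of the import. The try/except wrapper and the xlearn_available flag are my own additions for illustration, so that the snippet reports a problem instead of crashing when the library is missing or its compiled part fails to load:

```python
try:
    import xlearn as xl  # "xl" is the alias commonly used for xLearn
    xlearn_available = True
except (ImportError, OSError) as err:
    # ImportError: the package is not installed.
    # OSError: the package is installed but its compiled library failed to load.
    xl = None
    xlearn_available = False
    print(f"xLearn could not be imported: {err}")
```

If xlearn_available ends up False, the GitHub answer linked in this section is a good place to start.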

If you run into an issue while importing xLearn, you can follow the steps given in this answer on GitHub: https://github.com/aksnzhy/xlearn/issues/71#issuecomment-741799045

ML Models in xLearn

xLearn currently supports three machine learning algorithms:

  • linear model
  • factorization machine (FM)
  • field-aware factorization machine (FFM)

Let’s look at each of these separately.

1) Linear Model

You can create a linear model in xLearn with the create_linear() function.
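A minimal sketch of creating the linear model follows. The helper name make_linear_model is hypothetical; the import sits inside the function so that defining it does not require xLearn to be installed:

```python
def make_linear_model():
    """Return a new xLearn linear model (requires `pip install xlearn`)."""
    import xlearn as xl  # imported lazily, only when the function is called

    return xl.create_linear()
```

With xLearn installed, linear_model = make_linear_model() is the same as calling xl.create_linear() directly.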

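The linear model fits a line (more generally, a weighted sum of the features) that best describes the relationship between the X and Y variables. In xLearn it can be used for regression when the Y variable is continuous and, with the task set to binary, for classification, where it behaves like logistic regression. X can be continuous or categorical.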

2) Factorization Machine (FM)

You can create the FM model in xLearn with the create_fm() function.
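A minimal sketch of creating the FM model follows. The helper name make_fm_model is hypothetical, and xLearn is imported inside the function so the definition itself runs without the library:

```python
def make_fm_model():
    """Return a new xLearn factorization machine (requires `pip install xlearn`)."""
    import xlearn as xl  # imported lazily, only when the function is called

    return xl.create_fm()
```

With xLearn installed, fm_model = make_fm_model() is the same as calling xl.create_fm() directly.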

The FM model can be used for regression and classification. It extends the linear model by also learning interactions between pairs of features, and it works well on high-dimensional datasets.
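To make "interaction between features" concrete, here is a small pure-Python sketch of the standard second-order FM score: a bias, a linear part, and a pairwise part in which each pair of features is weighted by the dot product of their latent vectors. The function names and all numbers are made up for illustration; this is not xLearn's internal code.

```python
from itertools import combinations


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def fm_score(x, w0, w, v):
    """Raw FM score: w0 + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i]*x[j]."""
    linear = w0 + dot(w, x)
    pairwise = sum(
        dot(v[i], v[j]) * x[i] * x[j]
        for i, j in combinations(range(len(x)), 2)
    )
    return linear + pairwise


# Toy example: 3 features, latent vectors of size 2 (illustrative values).
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, -0.2, 0.3]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

score = fm_score(x, w0, w, v)
print(score)
```

Only the pair (feature 0, feature 2) contributes to the pairwise part here, because feature 1 is zero. This is why FM copes well with sparse data: absent features drop out of the sum.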

3) Field-aware factorization machine (FFM)

You can create the FFM model in xLearn with the create_ffm() function.
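A minimal sketch of creating the FFM model follows. The helper name make_ffm_model is hypothetical, and xLearn is imported inside the function so the definition itself runs without the library:

```python
def make_ffm_model():
    """Return a new xLearn field-aware FM (requires `pip install xlearn`)."""
    import xlearn as xl  # imported lazily, only when the function is called

    return xl.create_ffm()
```

With xLearn installed, ffm_model = make_ffm_model() is the same as calling xl.create_ffm() directly.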

The FFM model refines the FM model: instead of one latent vector per feature that is shared across all interactions, each feature learns a separate latent vector for every field it interacts with.

If you want to develop a core understanding of factorization machines and field-aware factorization machines, a very detailed blog has been written on this topic: https://wngaw.github.io/field-aware-factorization-machines-with-xlearn/

The key point mentioned on their site:

For LR and FM, the input data format can be CSV or libsvm. For FFM, the input data should be in the libffm format.
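To show how the two text formats differ, here is a small sketch that parses one line of each. A libsvm line looks like "label index:value ...", while a libffm line adds the field in front: "label field:index:value ...". The parser functions and the sample lines are made up for illustration and are not part of xLearn.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, [(index, value), ...])."""
    label, *tokens = line.split()
    features = []
    for token in tokens:
        index, value = token.split(":")
        features.append((int(index), float(value)))
    return float(label), features


def parse_libffm_line(line):
    """Parse '<label> <field>:<index>:<value> ...' into (label, [(field, index, value), ...])."""
    label, *tokens = line.split()
    features = []
    for token in tokens:
        field, index, value = token.split(":")
        features.append((int(field), int(index), float(value)))
    return float(label), features


svm_row = parse_libsvm_line("1 3:0.5 7:1")
ffm_row = parse_libffm_line("0 0:3:0.5 1:7:1")
print(svm_row)
print(ffm_row)
```

The extra field number is what lets FFM learn a different latent vector per field.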

Simple Example

Suppose we are working on a binary classification problem in which we want to predict whether a user is going to click on a website ad. This type of problem typically involves a high-dimensional dataset, because the data includes many features about each user, which gives the model a better understanding of that user’s click behavior.

Dataset link: https://github.com/aksnzhy/xlearn/tree/master/demo/classification/criteo_ctr

In this blog, small_train.txt is renamed as the train data and small_test.txt is renamed as the test data.

We will be using the xLearn library to solve this ML problem.

First, we import the xLearn library, conventionally under the alias xl.

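Then we have to pick one of xLearn’s ML models. For a click-prediction problem like this one, the two models that capture interactions between features are:

  • Factorization Machine (FM)
  • Field-aware Factorization Machine (FFM)

The linear model can also handle binary classification (with the task set to binary it behaves like logistic regression), but it does not model interactions between features, so we skip it here. We are going with the second option, i.e., FFM.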

Let’s create that model using xLearn’s create_ffm() function.

We could use sklearn’s train_test_split function to split the data into train and test sets, but I have already created a separate file for each, so we won’t need sklearn in this case.

The setTrain() method tells the FFM model where our training dataset is.
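The two steps above can be sketched as follows. The helper name build_ffm_model and the file name train.txt are assumptions (use whatever name you gave your renamed small_train.txt file); create_ffm() and setTrain() are the xLearn calls.

```python
def build_ffm_model(train_path="train.txt"):
    """Create an FFM model and point it at the training file.

    Requires `pip install xlearn`; "train.txt" is a placeholder file name.
    """
    import xlearn as xl  # imported lazily, only when the function is called

    ffm_model = xl.create_ffm()
    ffm_model.setTrain(train_path)  # path to a libffm-formatted text file
    return ffm_model
```

Nothing is read from disk yet at this point; the file is only loaded when training starts.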

Now we define the parameters for the FFM model.
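A parameter dictionary for binary classification can look like the sketch below. The keys task, lr and lambda are the ones xLearn expects; the numeric values are illustrative starting points and should be tuned for your data.

```python
# Parameters for a binary classification task (values are illustrative).
param = {
    "task": "binary",  # type of problem: "binary" or "reg"
    "lr": 0.2,         # learning rate
    "lambda": 0.002,   # regularization strength
}
print(param)
```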

Each parameter’s purpose:

  1. Task: tells xLearn the type of problem you are solving. We are currently working on a binary classification problem, so we pass binary. If you are working on a regression problem with the FFM model, you need to pass reg in the task variable.
  2. Learning rate (lr): you may have already used it before, and the purpose is the same here. It controls the size of each update step during training: larger values learn faster but can overshoot, while smaller values are more stable but slower.
  3. Lambda: the regularization strength. You will usually be working with a high-dimensional dataset, so lambda plays an important role here: it shrinks the weights of the least important variables, which helps against overfitting.
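To get a feel for how the learning rate and lambda act together, here is a tiny sketch of one plain gradient-descent update with L2 regularization. This is a simplification for intuition only: xLearn's optimizers differ in detail, and the function name and numbers are made up.

```python
def sgd_step(weight, gradient, lr, lam):
    """One plain SGD update with L2 regularization: w <- w - lr * (grad + lam * w)."""
    return weight - lr * (gradient + lam * weight)


w = 1.0

# Same gradient, two learning rates: the larger one moves the weight further.
w_small_lr = sgd_step(w, gradient=0.5, lr=0.02, lam=0.002)
w_large_lr = sgd_step(w, gradient=0.5, lr=0.2, lam=0.002)

# Zero gradient: only lambda acts, pulling the weight slightly towards zero.
w_only_reg = sgd_step(w, gradient=0.0, lr=0.2, lam=0.002)

print(w_small_lr, w_large_lr, w_only_reg)
```

A feature that rarely helps the prediction gets small gradients, so over many steps the lambda term keeps nudging its weight towards zero.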

If you are working with regression, set the task parameter to reg.
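A regression version of the parameter dictionary can be sketched like this. Only the task value changes; the other values are again illustrative, not recommendations.

```python
# Parameters for a regression task (values are illustrative).
param = {
    "task": "reg",    # regression instead of binary classification
    "lr": 0.2,        # learning rate
    "lambda": 0.002,  # regularization strength
}
print(param)
```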

Now we need to fit our model to the training dataset.

We fit our model using the .fit() method, passing the param variable along with a path for the model file. A new file named model.out will be created in your current directory. That model.out file is then used to make predictions on our test dataset.
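The training step described above can be sketched as one function. The helper name train_ffm, the file name train.txt and the default parameter values are assumptions for illustration; create_ffm(), setTrain() and fit() are the xLearn calls.

```python
def train_ffm(train_path="train.txt", model_path="model.out", param=None):
    """Train an FFM model and write it to model_path.

    Requires `pip install xlearn`; file names and defaults are placeholders.
    """
    import xlearn as xl  # imported lazily, only when the function is called

    if param is None:
        # Illustrative defaults for binary classification.
        param = {"task": "binary", "lr": 0.2, "lambda": 0.002}

    ffm_model = xl.create_ffm()
    ffm_model.setTrain(train_path)
    ffm_model.fit(param, model_path)  # trains and saves the model file
    return model_path
```

Calling train_ffm() with the renamed training file in the current directory should leave a model.out file next to it.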

When you run the code, xLearn prints a training log to the console while it trains the model on our train dataset. Once training completes, the model is saved in the current directory.

Now we will use that model.out file to predict our test data:
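A minimal sketch of the prediction step follows. The helper name predict_with_ffm and the file name test.txt are assumptions (use the name of your renamed small_test.txt file); setTest() and predict() are the xLearn calls.

```python
def predict_with_ffm(model_path="model.out", test_path="test.txt", out_path="output.txt"):
    """Score the test file with a saved model and write one value per line.

    Requires `pip install xlearn`; file names are placeholders.
    """
    import xlearn as xl  # imported lazily, only when the function is called

    ffm_model = xl.create_ffm()
    ffm_model.setTest(test_path)             # where the test data lives
    ffm_model.predict(model_path, out_path)  # load the model, write predictions
    return out_path
```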

Here we tell the model where our test dataset is with setTest(), run the prediction with predict(), and save the predicted values in the output.txt file.

This is what the output.txt file looks like:

-1.58631
-0.393496
-0.638334
-0.38465
-1.15343

A negative value means the user is predicted not to click on the ad, while a positive value means the user is predicted to click on it. If you want your output to be 0 or 1, you can call the setSign() method before predicting.
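To see what this conversion amounts to, the sketch below applies a sign threshold, and for comparison a sigmoid, to the raw scores shown above in plain Python. The helper names are made up, and the threshold at zero is my reading of the behaviour rather than xLearn's exact code; the last function shows where setSign() goes in the xLearn workflow.

```python
import math

# Raw scores copied from the output.txt shown above.
raw_scores = [-1.58631, -0.393496, -0.638334, -0.38465, -1.15343]


def to_label(score):
    """Positive score -> 1 (click), otherwise 0 (no click)."""
    return 1 if score > 0 else 0


def to_probability(score):
    """Squash a raw score into (0, 1) with the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-score))


labels = [to_label(s) for s in raw_scores]
probabilities = [to_probability(s) for s in raw_scores]
print(labels)


def predict_labels(model_path="model.out", test_path="test.txt", out_path="output.txt"):
    """Same prediction step as before, with 0/1 output (requires xlearn)."""
    import xlearn as xl  # imported lazily, only when the function is called

    ffm_model = xl.create_ffm()
    ffm_model.setTest(test_path)
    ffm_model.setSign()  # ask xLearn to write 0/1 instead of raw scores
    ffm_model.predict(model_path, out_path)
    return out_path
```

All five scores are negative, so every label comes out as 0. If you prefer probabilities over hard labels, xLearn also offers setSigmoid() in place of setSign().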

Now the output.txt file looks like this:

0
0
0
0

This is a very brief guide to the xLearn library; there is a lot more to cover. I will be uploading more blogs on Machine Learning.

If you want to read how to apply 40 Machine Learning models in a single line of code you can read that blog from the below link:

If you have any queries feel free to ask me!
