Introducing xpanseML

Automated Machine Learning For Everyone

Anup Vasudev
xpanseML
6 min read · Mar 26, 2020


Automated Machine Learning (AutoML) has become a buzzword recently. There are many services, tools and APIs available to perform various kinds of machine learning. xpanseML is a machine learning platform for training and publishing predictive models.

But what does that mean? Organizations collect vast amounts of customer or process-related data. These historical data-sets can be used for predictive analysis to identify the likelihood of future outcomes. The xpanseML platform allows you to use these data-sets as training data to build predictive machine learning models. It has a simple wizard-style user interface to upload data, scale data, split data, configure a model, train it and publish it. Most machine learning services stop at the creation of the model. It's up to the user to load and use the model programmatically, which is cumbersome for a business user. On the xpanseML platform, publishing a model generates a user interface that lets users enter values in their original form and get the predicted output!

So how does the xpanseML platform work? Simply follow the steps below.

Upload Data

In this step, upload your raw data in csv format. Before uploading, you are advised to remove unnecessary columns to keep the file size small. Once the file is uploaded, the first 100 rows will be displayed. Scan through the data to make sure the columns are represented accurately. Sometimes, due to errors in the csv file format, the data will not load properly. In such situations, correct the file and upload it again. You will be able to proceed to the next step only after successfully uploading a csv file.

Limitations

The csv file should have a header containing the column names.

The column names should not contain special characters and should not start with a number. Such columns will be renamed automatically during upload.

Date and time data and time series forecasting are currently not supported. Support is planned for the future.

Re-uploading a file with the same file name will overwrite the file and remove all related models and configurations.
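
To make the column renaming mentioned in the limitations above more concrete, here is a minimal Python sketch of how a header could be cleaned up before upload. The exact rule xpanseML applies is not described in this post, so the function name `sanitize_column_name`, the underscore replacement and the `col_` prefix are all assumptions made for illustration.

```python
import re

def sanitize_column_name(name: str) -> str:
    """Hypothetical renaming rule, not necessarily the one xpanseML uses."""
    # Replace each run of characters other than letters, digits and
    # underscores with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    cleaned = cleaned.strip("_")
    # A name must not be empty or start with a number, so add a prefix.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "col_" + cleaned
    return cleaned

header = ["Customer Name", "2019 Sales ($)", "age"]
print([sanitize_column_name(c) for c in header])
# ['Customer_Name', 'col_2019_Sales', 'age']
```

Renaming the columns yourself before uploading keeps the names in the published model predictable.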

Feature Engineering

In this step, select your label (the output field you want to predict) and the features (the input fields you want to use for prediction).

Training a machine learning model requires the fields to be numerical. However, in the real world, data is not always numerical. Based on the type of the non-numerical data, choose the ‘Type’ as either ‘Ordinal’ or ‘Nominal’. Choose ‘Ordinal’ if the data represents an order, for example high, medium and low, which can be safely converted to numerical values like 3, 2 and 1 without losing their meaning. Choose ‘Nominal’ if the data represents a category, like male and female. Other types of non-numeric data should be excluded from selection. Note that if a selected label represents a category but contains tens or hundreds of different classes, select the ‘Type’ as ‘Ordinal’ even though the classes don’t represent an order.
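
The difference between the two types is easier to see in code. The Python sketch below maps ordered values to integers and turns unordered categories into one-hot columns. The post does not say how xpanseML encodes Nominal fields internally, so one-hot encoding and the helper names `encode_ordinal` and `encode_nominal` are assumptions used only to illustrate the idea.

```python
def encode_ordinal(values, order):
    """Map ordered categories to integers; `order` runs from lowest to highest."""
    rank = {category: i + 1 for i, category in enumerate(order)}
    return [rank[v] for v in values]

def encode_nominal(values):
    """One-hot encode unordered categories; columns follow the sorted category names."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

priority = ["high", "low", "medium", "high"]
print(encode_ordinal(priority, order=["low", "medium", "high"]))
# [3, 1, 2, 3]

gender = ["male", "female", "female"]
print(encode_nominal(gender))
# (['female', 'male'], [[0, 1], [1, 0], [1, 0]])
```

An ordinal field stays a single number, while a one-hot encoded field needs one column per category.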

If the data is already numerical, choose the ‘Type’ as ‘Continuous’. Training a machine learning model works best when the field values are normalized or scaled to a similar range. For example, if you have a field age and a field salary, age might lie between 18 and 70 while salary might lie between 50000 and 500000. Scaling brings both fields into a comparable range, roughly -1 to 1, which works well for training predictive models. Scaling options include Z-Score, Scale To Range and Log Scaling. If you are unsure which to pick, select Z-Score, which usually works well.
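
For readers who want to see what the three scaling options do, here is a small Python sketch using their textbook definitions. The function names and the default target range of -1 to 1 are assumptions, and xpanseML may implement the options differently (for example, a different log base or a different target range).

```python
import math
from statistics import mean, pstdev

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant field carries no information
    return [(v - mu) / sigma for v in values]

def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly map the smallest value to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]

def log_scale(values):
    """Natural logarithm; only defined for strictly positive values."""
    return [math.log(v) for v in values]

ages = [18, 44, 70]
print(scale_to_range(ages))                # [-1.0, 0.0, 1.0]
print([round(z, 2) for z in z_score(ages)])  # [-1.22, 0.0, 1.22]
print(log_scale([50000, 500000]))
```

Note that Z-Score does not cap values at -1 and 1; it centres the field on 0, and values far from the mean can fall outside that range.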

Data Analysis

Click on the Analyze button to peek into the data distribution of the field. If the selected label type is Ordinal or Nominal, you can also see how the feature overlaps with the label.

We don’t expect the uploaded data to be free from errors. For example, numeric fields may accidentally contain non-numeric characters or empty values. Such values are automatically replaced by the mean of the field’s numerical data before training begins.
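
The mean replacement described above can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions (values arrive as strings, anything that does not parse as a finite number counts as invalid); the helper name `impute_with_mean` is hypothetical and the platform's own parsing rules may differ.

```python
import math

def impute_with_mean(raw_values):
    """Replace unparseable or empty entries with the mean of the valid ones."""
    parsed = []
    for raw in raw_values:
        try:
            number = float(raw)
        except (TypeError, ValueError):
            number = None  # empty strings, text, missing values
        if number is not None and not math.isfinite(number):
            number = None  # treat 'nan' and 'inf' as invalid too
        parsed.append(number)

    valid = [v for v in parsed if v is not None]
    if not valid:
        raise ValueError("field contains no numeric values")
    fill = sum(valid) / len(valid)
    return [fill if v is None else v for v in parsed]

print(impute_with_mean(["10", "", "abc", "20"]))
# [10.0, 15.0, 15.0, 20.0]
```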

Data may also contain duplicates. You can view duplicates by clicking the ‘Check For Duplicates’ button. This check is based on the selected features. If the duplicates do not represent your data distribution accurately, remove them and re-upload the data.
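
Because the duplicate check looks only at the selected features, two rows can count as duplicates even if they differ in a column you did not select. The Python sketch below shows that behaviour on a toy data-set; the function `find_duplicates` and the sample rows are made up for illustration and are not part of xpanseML.

```python
def find_duplicates(rows, features):
    """Return the indexes of rows that repeat an earlier row on the given features."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row[f] for f in features)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

rows = [
    {"id": 1, "age": 30, "salary": 50000},
    {"id": 2, "age": 30, "salary": 50000},
    {"id": 3, "age": 41, "salary": 72000},
]
print(find_duplicates(rows, ["age", "salary"]))        # [1]
print(find_duplicates(rows, ["id", "age", "salary"]))  # []
```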

Split Data

In this step, select the percentage of data you want to use for training, validation and testing.

Training data is the actual data-set that is used to train the model.

Validation data is the sample of the data-set used to provide an unbiased evaluation of a model fit on the training data-set while tuning model hyper-parameters.

Test data is the sample of the data-set used to provide an unbiased evaluation of the final model fit on the training data-set.
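
As a rough illustration of the three subsets described above, the Python sketch below shuffles the rows and cuts them by percentage. The 70/15/15 split, the fixed seed and the name `split_rows` are assumptions chosen for the example, not the platform's defaults.

```python
import random

def split_rows(rows, train_pct=70, validation_pct=15, seed=42):
    """Shuffle rows and split them into training, validation and test subsets."""
    if train_pct < 0 or validation_pct < 0 or train_pct + validation_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly

    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

train, validation, test = split_rows(list(range(100)))
print(len(train), len(validation), len(test))  # 70 15 15
```

Shuffling before splitting helps when the file is sorted, for example by date or by label, so that each subset sees a similar mix of rows.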

Configure Model

In this step, configure the neural network. If you are unsure, leave the parameters at their default values. A neural network has 3 types of layers. The feature layer represents the inputs/features. The label layer represents the predicted output/label. Between them are the hidden layers interconnecting the features and the label. The following parameters of the neural network can be modified:

Optimizers are algorithms which make the neural network learn.

Loss Function is used to determine the error between the predictions and the given target values.

Learning Rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of the loss function.

Batch Size defines the number of samples to work through before updating the internal model parameters.

Epochs defines the number of times that the learning algorithm will work through the entire training data-set.

Activation Function is a mathematical “gate” between two layers that determines the output each node passes on.

Units are the number of nodes in each layer. The feature/input layer will contain nodes equal to the number of selected features. The label/output layer will contain one node for a Continuous label, or several based on the number of Nominal/Ordinal category classes.
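
To show how the parameters above interact, here is a deliberately tiny Python sketch. It is not a neural network and not xpanseML's training code: it fits a single linear unit (no hidden layers, no activation function) with plain mini-batch gradient descent as the optimizer and mean squared error as the loss. The function name and default values are assumptions picked for the example.

```python
def train_linear(xs, ys, learning_rate=0.1, batch_size=2, epochs=200):
    """Fit y = w * x + b with mini-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    history = []  # loss over the whole data-set after each epoch
    for _ in range(epochs):                       # Epochs: full passes over the data
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]  # Batch Size: samples per update
            batch_y = ys[start:start + batch_size]
            errors = [(w * x + b) - y for x, y in zip(batch_x, batch_y)]
            # Gradients of the mean squared error with respect to w and b.
            grad_w = 2 * sum(e * x for e, x in zip(errors, batch_x)) / len(batch_x)
            grad_b = 2 * sum(errors) / len(batch_x)
            # Learning Rate: how far to step against the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        loss = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)
    return w, b, history

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # already scaled feature values
ys = [2 * x + 1 for x in xs]       # target follows y = 2x + 1
w, b, history = train_linear(xs, ys)
print(round(w, 3), round(b, 3))    # 2.0 1.0
print(history[0] > history[-1])    # True: the loss went down
```

A larger learning rate takes bigger steps and can overshoot, a smaller batch size means more updates per epoch, and more epochs give the optimizer more passes to reduce the loss.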

Train & Test Model

Click the Start button to begin training the model. Training will repeat for the configured number of epochs. Each epoch will try to bring down the loss. The lower the loss (closer to 0), the better. For Ordinal and Nominal labels, the higher the accuracy (closer to 1), the better. For a Continuous label, the lower the Mean Absolute Error (closer to 0), the better.
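
The two metrics mentioned above have simple standard definitions, sketched here in Python. The function names are illustrative, and the numbers are toy values rather than results from the platform.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that exactly match the actual class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))  # 0.75
print(mean_absolute_error([2.0, 4.0], [1.0, 6.0]))           # 1.5
```

Keep in mind that MAE is expressed in the units of the label, so what counts as "close to 0" depends on the scale of the values being predicted.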

Once you click Start, you can stop the training only once the training step has begun, after the download step. Stopping part-way will still create the model as trained up to that point. This is a quick way to evaluate your model. If you see that the loss, accuracy or MAE is moving away from your expected figures, you can stop the training, go back to Configure Model, tweak the parameters and retrain.

Limitations

Training happens in the browser while the tab is active. In the future, training will be moved to the server.

Publish Model

In this step, you can publish the model, which will generate a user interface URL. The user interface will allow you to enter or upload the selected input fields to predict the selected output. You may choose to make the model public through the Publish Options tab and then republishing. Note that if you have a newly trained model, republishing will overwrite the current live model.

Below is the final result of a published model.

[Figure: The final result of a published model]
