Easy machine learning with MindsDB
A tutorial for working with MindsDB and using it to understand machine learning concepts.
Machine learning is pretty much everywhere now. Almost every company accumulates huge amounts of data and wants to extract as much information from it as possible to improve its business and customer satisfaction. Facebook and Google try to match the perfect ads to you, Netflix tries to predict the next movie you would like to watch, and Spotify tries to find music you will love.
Machine learning is not easy. It takes some knowledge to extract data, fill in missing information, find which pieces of data correlate with what you want to predict, and create a prediction algorithm based on different variables. That usually means hiring machine learning experts and running extra infrastructure. Some tools make this easier, like Firebase’s AutoML, but as far as I know MindsDB is the only one that is open source and connects directly to your database, which makes predictions easy and efficient.
In this example, MindsDB works as a middleman between your server and the database. You can run it using Docker, and it provides a GUI and a REST API for communicating with the database and making predictions. It keeps the trained models and the metadata it needs in your database, in a separate table.
In this tutorial we will be completely framework agnostic and work only with the GUI and the database. I will demonstrate working with the API in future posts.
MindsDB can make predictions on more complicated data, but in this post we will use the Iris dataset. If you have ever studied ML, you have probably read a tutorial that uses it; it’s very common because it is very simple to understand.
The Iris dataset describes 3 species of the Iris flower, with information about their petal and sepal sizes. The measurements were taken from 150 different flowers, so we have 5 columns (four measurements plus the species) and 150 rows of data. We are going to use this information to classify the species of a flower given its petal and sepal measurements.
For example, if an Iris flower has a sepal length of 5.1, a sepal width of 3.5, a petal length of 1.4, and a petal width of 0.2, we want to say that it is Iris-setosa.
Let’s get started. We need to follow a few steps: create a database, fill it with the Iris dataset, run MindsDB, connect it to the database, and then make predictions. It’s pretty simple, so let’s start.
1. Run a database and fill it with data
MindsDB supports different databases, including MongoDB, MariaDB, PostgreSQL, MySQL, and more. You can choose the database you’re comfortable with, but for the sake of this post I will use PostgreSQL.
I’m running PostgreSQL locally and filling it with the Iris dataset, using the pre-made SQL command from this gist to populate the database quickly. Your database should look somewhat like this.
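If you would rather see what such a table looks like before running the gist, here is a minimal sketch in Python. It uses the standard library’s sqlite3 module with an in-memory database as a stand-in for PostgreSQL, purely so the snippet runs without a server. The table name iris matches the query used later in this post; the column names are my assumption and may differ from the ones in the gist, and only three rows of the dataset are inserted.

```python
import sqlite3

# Column names are assumptions for illustration; the gist may name them differently.
CREATE_TABLE = """
CREATE TABLE iris (
    sepal_length REAL,
    sepal_width  REAL,
    petal_length REAL,
    petal_width  REAL,
    species      TEXT
)
"""

# Three rows from the Iris dataset, one per species.
SAMPLE_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]


def build_sample_db():
    """Create an in-memory database holding a tiny iris table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_TABLE)
    conn.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    conn.commit()
    return conn


conn = build_sample_db()
rows = conn.execute("SELECT * FROM iris").fetchall()
for row in rows:
    print(row)
```

The real table has 150 rows, but the shape is the same: four numeric measurements and one text column holding the species.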
- If you just want to play with MindsDB, you can skip this step and use the Iris dataset CSV file directly: click the upload button in the GUI, and this CSV will be your dataset.
2. Run MindsDB
I’m running MindsDB as a Docker container, but of course you can decide to run it locally; the installation guide is here.
I’m assuming you already have Docker running on your machine; if not, install and start it first. Once it’s running, run these shell commands:
# pull the image
$ docker pull mindsdb/mindsdb
# run the image and expose port 47334
$ docker run -p 47334:47334 mindsdb/mindsdb
After you run the image, you can go to http://localhost:47334 to access the MindsDB GUI. It should look something like this:
3. Create a dataset
I will connect my PostgreSQL database to MindsDB by clicking the database icon and then “add database”:
Now that our database is connected, we can click “New Dataset” and extract the dataset using an SQL query. In my case I want all the columns and all the rows, so a simple
SELECT * FROM iris is enough for us.
Then we can go to the dataset and see two buttons: Overview, where you can see your dataset in a table, and Quality, where you can assess the data quality. Let’s take a look at the quality view:
Here we can see that the data is really good, with no missing values. In many cases you might be missing some information, and MindsDB will then fill the missing values with the average values for you. The more data is missing or inconsistent, the worse your dataset is and the less accurate the machine learning algorithm will be.
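To make “filling missing values with the average” concrete, here is a small sketch of mean imputation in plain Python. It only illustrates the general idea and is not MindsDB’s actual implementation; the function name and the sample values are made up for this example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no known values")
    average = mean(present)
    return [average if v is None else v for v in values]


# Hypothetical petal lengths with one missing measurement.
petal_lengths = [1.0, None, 2.0, 3.0]
filled = fill_missing_with_mean(petal_lengths)
print(filled)
```

The filled-in value is only a guess based on the other rows, which is why a dataset with many gaps leads to a less accurate model.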
4. Train a predictor
Now we can create a predictor: click “predictors” and then “train”, decide what you would like to predict, and click “train” again.
* Advanced Mode: you can use the advanced mode to make some changes to the predictor. For example, you can use the time-series mode, group items, and change the sample margin. We currently have a pretty simple problem and this is a simple post, so we won’t use it here; I will cover it in a future post.
We can click “Preview” on the trained predictor to see some information about it, and I’ll use it to explain some basic machine learning concepts. As you can see in “dataset splitting and usage”, our data is divided into training data and test data. MindsDB randomly selects around 120 rows to train the machine learning model and around 30 rows to test it, which helps it determine the accuracy of the predictor. It’s common to use 80% of the data for training and 20% for testing ML models.
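The 80/20 split is easy to sketch in a few lines of Python. This is a generic illustration of a random train/test split, not the code MindsDB runs; the function name, the fixed seed, and the use of row numbers in place of real flower data are my own assumptions.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in for the 150 Iris rows: just the row numbers 0..149.
rows = list(range(150))
train, test = train_test_split(rows)
print(len(train), len(test))  # 120 30
```

The important part is that the test rows are never shown to the model during training, so the accuracy measured on them says something about flowers the model has not seen.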
The second thing to watch for is column importance. We might want to throw every piece of information we have into a machine learning model, but not all of it is useful for every prediction. Let’s say we also had the height of the flower or its color; it might be completely useless for determining its species. In that case we can make changes to the data or decide not to use all the columns next time, which can speed up training and may improve accuracy.
The next piece of information we get is the confusion matrix. It compares the species the model predicted with the actual species for each row of the test data, so we can see how accurate the predictor is for each class (each Iris species in our case). That is the “100% accuracy” we saw in the last picture: when the test data was run against the model, it predicted 100% of the flowers correctly. Do notice that 100% accuracy is very rare, and it depends mostly on the quality and size of the data we have.
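If the confusion matrix still feels abstract, here is a short Python sketch that builds one by counting pairs of actual and predicted labels. The labels and predictions below are invented for illustration (with one deliberate mistake, so the result is not 100%), and the helper names are mine, not part of MindsDB.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test labels and model outputs, with one mistake.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual))
for a in labels:
    # One row per actual species, one column per predicted species.
    print(a, [matrix[(a, p)] for p in labels])
print("accuracy:", accuracy(actual, predicted))
```

Counts on the diagonal are correct predictions; anything off the diagonal tells you which species the model confuses with which.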
5. Make a prediction
This is the last step: we simply use our predictor to classify a new Iris flower we just found. Go to “query”, click “new query”, fill in the measurements of the new flower, and let the predictor determine which species it is.
Now we can see the prediction: a flower with these measurements is predicted to be Iris-setosa, and the predictor is 81% sure about it. In general, the bigger our dataset is, the more confident we can be in the predictions we make (and I pretty much used random numbers for this prediction).
We have created and trained a machine learning model (a predictor) and used it to make a prediction. All of this can take just a minute or two with MindsDB. As you can see, MindsDB is really easy and simple to use.
The next step will be to make API calls to MindsDB to automate extracting the dataset, training the model, and making predictions. We will do this in different languages in a future post.