How to Start Machine Learning Now & Learn Data Preprocessing Later
Almost everyone wants to jump head first into machine learning without learning data preprocessing, which is the non-sexy part.
Imagine if you could download knowledge to do something without the slow painful process of learning.
Now you can.
Well, sort of.
In an Iron Man/Tony Stark kind of way, that is.
By asking a bot — fastowl.
Although fastowl helps you to dive into machine learning without knowing data preprocessing, its more beautiful purpose is to help you learn data preparation in a painless and efficient way as a mid-term goal. More on this in the ‘Conclusion’ section.
If you are already bursting with either skepticism or excitement, go ahead and play with her — http://fastowl.xyz.
NOTE: fastowl goes to sleep when inactive. Please be patient if you are the person waking her up. If a lot of people show interest in learning with fastowl (‘claps’ for her on medium.com), I will pay for her to never sleep.
Try asking fastowl to find a solution to Kaggle’s Titanic challenge, which requires you to perform binary classification to predict survival given a passenger’s attributes.
You can start with a question like ‘What are the steps for data preprocessing?’ From there on, just ask fastowl for a ‘hint’ to get advice on the next step. You can also ask for ‘help’, which brings up a list of commands.
In Part 1 of this data preprocessing tutorial series, fastowl will take you from zero to data preparation hero in 15 minutes; from knowing nothing about data preprocessing to being able to determine which are good features for your machine learning model.
In case fastowl proves to be not user-friendly enough yet, I introduce 11 questions below that you can use as guidance on how to interact with her.
Alternatively you can watch a video of how I interact with fastowl here (< 18 mins).
The end result is a Jupyter notebook (https://github.com/nethsix/fastowl-kaggle-titanic-1) with all the code (load data from file, process the data, & plot charts) to select good features, and output charts.
If fastowl is not working for some reason, the video above shows how she works. Please also contact me at ‘k h o r @fastowl.xyz’ (remove the spaces in the email address).
“A prudent question is one-half of wisdom.” — Francis Bacon
Let us ask fastowl to help us find a solution to Kaggle’s Titanic challenge, which requires you to perform binary classification to predict survival given a passenger’s attributes.
The list of questions follows. Each question is followed by a ‘Learn’ note, which summarizes what you will learn from the question.
Q1: How to prepare data for machine learning?
Learn: Steps for data preparation
Q2: How to start data preparation?
Learn: The best way to start learning data preparation is to use Jupyter, the most popular data science tool. The fastest way to get started with Jupyter is to use fastowl’s complete machine learning environment, which does not require you to install anything.
Q3: How to use playground?
Learn: Use Jupyter, an environment that records both your data preparation code and its results, so that your work is replicable. This is crucial in machine learning: you need to be able to re-run your experiments and tweak them in a consistent manner. Here is an example of a Jupyter notebook that contains all the code snippets and output results for finding good features in the Kaggle Titanic challenge by using fastowl: https://github.com/nethsix/fastowl-kaggle-titanic-1/blob/master/fastowl-kaggle-titanic-1.ipynb
Q4: How to load data?
Learn: Load data from a file to start manipulating the data
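As a rough idea of what loading data can look like in pandas, here is a minimal sketch. It is not necessarily the code fastowl gives you. The inline CSV text and its rows are made up for illustration; with the real Kaggle file you would pass the file's path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# A few made-up rows shaped like the Titanic data (illustration only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,0,2,male,
"""

# pd.read_csv accepts a file path or any file-like object.
# Here a StringIO object stands in for the uploaded file.
df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows to confirm the data loaded as expected.
print(df.head())
```

Note that the empty Age in the last row is loaded as a missing value (NaN), which is something later preprocessing steps usually have to deal with.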
Q5: How to upload data file?
Learn: Upload your data file into Jupyter
Q6: How to show data headers?
Learn: See the data column names in the file you loaded
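In pandas, the column names live in the `columns` attribute of the dataframe. A minimal sketch, using a tiny made-up dataframe in place of the loaded Titanic file:

```python
import pandas as pd

# Made-up stand-in for the loaded data (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

# df.columns holds the header names; list() makes them easy to read.
headers = list(df.columns)
print(headers)
```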
Q7: How to select good features?
Learn: Select good features to train your model. This is the most important step in machine learning.
Q8: How to determine feature uniqueness?
Learn: One indicator of a good feature is that its values are not overly unique. Understand how to look for features that are not overly unique.
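One way to check uniqueness in pandas is to count the distinct values in each column with `nunique`. The sketch below uses made-up rows; the `Name` column stands for a feature where every value differs, which makes it a poor candidate.

```python
import pandas as pd

# Made-up rows (illustration only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],  # every value is different
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Number of distinct values per column.
unique_counts = df.nunique()
print(unique_counts)
```

Here `Name` has as many distinct values as there are rows, while `Sex` has only two, so `Sex` is the more promising feature of the two.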
Q9: How to get dataframe size?
Learn: To judge whether a feature is overly unique, you need to know how much data you have, so that you can compare the number of distinct values with the data size.
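A minimal sketch of that comparison, again on made-up rows: `df.shape` gives the number of rows and columns, and dividing the distinct-value count by the row count gives a rough uniqueness ratio. The ratio is just one convenient way to express the comparison, not a rule taken from fastowl.

```python
import pandas as pd

# Made-up rows (illustration only).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# shape is a (rows, columns) tuple.
rows, cols = df.shape

# Share of rows that hold a distinct value, per column.
# A ratio close to 1.0 means almost every row has its own value.
unique_ratio = df.nunique() / rows
print(rows, cols)
print(unique_ratio)
```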
Q10: How to show dataframe column types?
Learn: Figure out whether features are discrete or continuous
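In pandas, `df.dtypes` lists the type of each column. A minimal sketch with made-up rows follows. The type is only a hint: a text column is discrete, a float column is often continuous, but an integer column with few distinct values (such as a class label) is usually best treated as discrete too.

```python
import pandas as pd

# Made-up rows (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# One dtype per column.
types = df.dtypes
print(types)

# Helper checks from pandas for a rough discrete/continuous split.
age_is_float = pd.api.types.is_float_dtype(df["Age"])
sex_is_numeric = pd.api.types.is_numeric_dtype(df["Sex"])
```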
Q11: Does <feature column name> affect <outcome column name>?
Learn:
- Plot discrete outcome versus discrete features
- Plot discrete outcome versus continuous binned features
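Both kinds of plot can be sketched with pandas and matplotlib as below. This is an illustration on made-up rows, not the notebook's actual code. The bin edges for Age are arbitrary assumptions, and `pd.crosstab` and `pd.cut` are just one reasonable way to count outcomes per category and to bin a continuous feature.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 1],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "male", "female"],
    "Age": [22, 38, 26, 35, 4, 54, 61, 14],
})

# Discrete outcome versus discrete feature:
# count each outcome value per category.
by_sex = pd.crosstab(df["Sex"], df["Survived"])

# Discrete outcome versus continuous feature:
# bin the feature first (arbitrary edges), then count per bin.
age_bins = pd.cut(df["Age"], bins=[0, 18, 40, 80])
by_age = pd.crosstab(age_bins, df["Survived"])

# Stacked bars make the outcome split within each group visible.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
by_sex.plot(kind="bar", stacked=True, ax=axes[0], title="Survived by Sex")
by_age.plot(kind="bar", stacked=True, ax=axes[1], title="Survived by Age bin")
fig.tight_layout()
```

If the outcome split differs clearly between the bars, the feature probably affects the outcome and is worth keeping.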
From zero, you are now a data preprocessing hero!
Subconsciously, you have learned how to use:
- Jupyter (tool for holding together programming code, data, and results)
- Python (programming language) with the pandas (data processing) and matplotlib (plotting) libraries
fastowl believes in the spaced-repetition system (SRS) for learning, so once you have signed in to fastowl, she will occasionally send emails to your GitHub email address to reinforce the data preprocessing techniques and code snippets that you have encountered.
If you’re happy and you know it (data preparation), clap your hands!
Please ‘follow’ this publication to get notifications when I release the next installment of this data preprocessing series.
fastowl was built through data preprocessing knowledge acquired from: