Machine Learning and Data Analysis with Python, Titanic Dataset: Part 0, Setting Up

Published in Analytics Vidhya · 3 min read · Feb 14, 2020

This is part 0 of the series Machine Learning and Data Analysis with Python, which works through a real-world example: the Titanic disaster dataset from Kaggle. This is going to be a series of videos where I show you how to use Python, Pandas, and scikit-learn for machine learning and data analysis on a real-world problem. In this series I will walk step by step through how to get started on a problem like this: first data exploration and visualization, then feature engineering, and finally building a model to make predictions.

Part 0 will take you through how to get started, including setting up an environment and downloading the necessary data.

If you learn better with videos, I also made a video showing all the steps, so you can follow that instead.

Download the dataset from Kaggle:

Titanic: Machine Learning from Disaster

Start here! Predict survival on the Titanic and get familiar with ML basics

www.kaggle.com

On the competition page, go to the Data tab and click "Download All". Place the downloaded folder somewhere easy to find, then install the dependencies. We are going to use Jupyter Notebook, and I'll show you how to set up that environment.

The easiest way is to download Anaconda:

Anaconda Python/R Distribution - Free Download

The open-source Anaconda Distribution is the easiest way to perform Python/R data science and machine learning on…

www.anaconda.com

Make sure to download the Python 3.7 version. (I'm assuming you have Python installed and are familiar with some of its basics; if not, you can download it here: https://www.python.org/downloads/)

Launch Anaconda Navigator from your applications folder, and you will see an install button next to Jupyter Notebook. Hit install, and then launch.

In Jupyter Notebook you will see all the files and folders from your root directory. Go to wherever you stored the downloaded folder. At first, you should only be able to see 3 files:

gender_submission.csv
train.csv
test.csv

These are the files we are going to work with. The train.csv file contains what we will later use as training data. The test.csv file is almost identical to train.csv, except that it is missing one column: the ground truth value, which is what we are trying to predict. The gender_submission.csv file is a sample submission file in which the survived column is hard-coded to 0 if the passenger is male and 1 if the passenger is female. When we want to submit our predictions, we will have to put them into a csv file with the same format. I'll talk about how to do that later (or, if you check out my YouTube playlist, the video should be ready by now).
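If you want to confirm this description for yourself once the environment is set up, here is a minimal sketch of how it could be checked with Pandas. The folder name `titanic` and the helper names `missing_columns` and `follows_gender_rule` are my own placeholders, not part of the dataset, and the column names `PassengerId`, `Sex`, and `Survived` are assumed to match the files you downloaded from Kaggle.

```python
from pathlib import Path

import pandas as pd


def missing_columns(train: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return the columns that are present in train but absent from test."""
    return [col for col in train.columns if col not in test.columns]


def follows_gender_rule(test: pd.DataFrame, submission: pd.DataFrame) -> bool:
    """Check that the submission predicts 1 for every female and 0 for every male."""
    # Line up each prediction with the passenger it belongs to.
    merged = test.merge(submission, on="PassengerId")
    expected = (merged["Sex"] == "female").astype(int)
    return bool((merged["Survived"] == expected).all())


# Hypothetical location of the downloaded files; change it to your own folder.
data_dir = Path("titanic")

# Only run the checks if the files are actually there.
if (data_dir / "train.csv").exists():
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    submission = pd.read_csv(data_dir / "gender_submission.csv")

    print("Columns missing from test.csv:", missing_columns(train, test))
    print("Sample submission follows the gender rule:",
          follows_gender_rule(test, submission))
```

The first helper lists whatever train.csv has that test.csv does not, and the second checks whether the sample submission really is the "female survives, male does not" baseline described above.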

Now create a new notebook: click the "New" button in the upper right corner, choose Python 3, and give the notebook a name.

Make sure you have the libraries we'll be using installed by running:

import pandas as pd
import sklearn

If any of these imports fails, you can install the missing library using pip. Note that the sklearn module is distributed under the package name scikit-learn.
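As an alternative to reading import errors, here is a small sketch that reports which libraries are missing without actually importing them. The helper name `find_missing` is my own; it relies only on `importlib.util.find_spec` from the standard library, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def find_missing(modules):
    """Return the names from `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Import names used in this series; scikit-learn is imported as "sklearn".
missing = find_missing(["pandas", "sklearn"])

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

You can paste this into the first cell of the notebook; anything it lists as missing is what you would then install with pip.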

In part 1 of this series I will show you how to understand the dataset with some visualization:

Machine Learning and Data Analysis with Python, Titanic Dataset: Part 1, Visualization

Every great machine learning and data science project starts with defining the problem: What data do you have to work…

medium.com