AI Blueprint Engine in Action: Purpose-Built Deep Learning for Predicting Survival of the Titanic Disaster
In our previous articles we motivated GUI-driven design and subsequent code generation for deep-learning-based machine learning and introduced our AI Blueprint Engine, a purpose-built tool we are developing to address this task. In this article, we will use the first public beta release of the AI Blueprint Engine to exemplify its utility in the “Titanic: Machine Learning from Disaster” Kaggle competition.

In this context, we will design a suitable neural network using the graphical user interface (GUI) that predicts the survival probability of a passenger on the Titanic conditioned on various features of heterogeneous data types. Subsequently, we will set up an execution environment from the generated project, train the model using the competition’s training set, and predict survival probabilities of passengers using the competition’s test set.

Developing purpose-built solutions to complex real-world machine learning problems often requires customization whose diversity is difficult to adequately cover by a graphical higher level of abstraction. For this reason, the generated source code is modular, fully documented, and style-compliant, and we will show that customization not explicitly covered by the GUI is easy and merely involves editing a small portion of the code.
Starting the project
Let’s create a new folder, which will be our project root for the remainder of this article, and enter it.
mkdir -p ~/projects/titanic_ml_from_disaster && cd "$_"
Retrieving and preparing data
Next, we will retrieve, inspect, and preprocess the data set. Let’s create a new folder in our project root that will contain the data set.
We then download the data set from the competition website (registration for the competition is required) and store the files train.csv and test.csv in the folder data. Before the data set is suitable for training, we need to perform some data cleansing and preprocessing in order to make it consumable by our machine learning algorithm. This step typically requires at least some amount of domain knowledge and is inherently difficult to automate. As can be observed from the numerous topics in the competition discussion, participants are quite creative in designing effective preprocessing steps to squeeze out every bit of information from the data set to improve their models’ predictive performance. We’re not seeking top performance in the competition in this article but rather intend to illustrate the benefits of our tool. Hence, we will only perform a couple of basic preprocessing steps including imputation and feature selection, and leave advanced feature engineering for you to explore on your own.
Let’s create a Python script for preparing the data and open it in a code editor:
First, we would like to get an overview of the data set.
inspect_data is a convenience function that prints some insights into the various features contained in the data set:
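The generated inspect_data function itself is not reproduced here; the following is a minimal sketch of what such a helper might look like, assuming a pandas-based implementation (the helper body and the toy data are our assumptions):

```python
import pandas as pd

def inspect_data(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column of a data set: its dtype, the number of
    missing values, and the number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

# Toy example standing in for data/train.csv:
toy = pd.DataFrame({"Age": [22.0, None, 35.0],
                    "Sex": ["male", "female", "male"]})
print(inspect_data(toy))
```

Run on the actual Titanic training set, a summary like this reveals, for instance, that Age, Cabin, and Embarked contain missing values.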
Based on this overview, let’s discard the features Name, Ticket, and Cabin because they either contain many missing values or do not seem relevant for predicting survival probabilities. We impute missing values of the continuous features Age and Fare with their respective mean values computed from the training set.
The missing values of the categorical feature Embarked are imputed by sampling from all available categories of this feature uniformly at random.
The string features Sex and Embarked and the Pclass categories are encoded as categorical indices.
We write the preprocessed data to the files data/train_preprocessed.csv and data/test_preprocessed.csv.
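The preprocessing steps described above can be sketched with pandas roughly as follows. This is a minimal illustration rather than the actual script; the preprocess helper and the fixed random seed are our assumptions, while the column names and file names follow the text:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for reproducibility

def preprocess(df: pd.DataFrame, train: pd.DataFrame) -> pd.DataFrame:
    # Discard features with many missing values or little apparent relevance.
    df = df.drop(columns=["Name", "Ticket", "Cabin"])
    # Impute continuous features with the training-set mean.
    for col in ["Age", "Fare"]:
        df[col] = df[col].fillna(train[col].mean())
    # Impute Embarked by sampling uniformly at random from its categories.
    categories = train["Embarked"].dropna().unique()
    missing = df["Embarked"].isna()
    df.loc[missing, "Embarked"] = rng.choice(categories, size=missing.sum())
    # Encode Sex, Embarked, and Pclass as categorical indices.
    for col in ["Sex", "Embarked", "Pclass"]:
        df[col] = df[col].astype("category").cat.codes
    return df

# Applied to the competition files (file names as used in this article):
# train = pd.read_csv("data/train.csv")
# test = pd.read_csv("data/test.csv")
# preprocess(train, train).to_csv("data/train_preprocessed.csv", index=False)
# preprocess(test, train).to_csv("data/test_preprocessed.csv", index=False)
```

Note that the test set is imputed with statistics computed from the training set, so no information leaks from the test set into the preprocessing.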
Building the model with the AI Blueprint Engine
We can now design our model to predict survival of the Titanic disaster. In the following video, we use the AI Blueprint Engine to graphically design a model architecture that
- loads the continuous features Age and Fare from train.csv and normalizes them to have zero mean and unit variance,
- loads the count features SibSp and Parch from train.csv and log-transforms them,
- loads the categorical features Pclass and Embarked from train.csv and embeds them into learned distributional representations,
- loads the binary feature Sex from train.csv, and
- concatenates all features (or the corresponding distributional representations where applicable) and passes them through 3 alternating batch normalization and dense layers before learning to predict the target Survived by minimizing the average binary cross-entropy between prediction and target over all training examples.
The constructed model can be found here.
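To give a concrete impression of the architecture described above, here is a hedged Keras sketch. The layer widths, activations, L2 coefficient, and the use of TensorFlow/Keras at all are our assumptions; the code generated by the AI Blueprint Engine may differ substantially:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)  # assumed small L2 coefficient

# Continuous inputs (normalized to zero mean, unit variance upstream).
cont_in = layers.Input(shape=(2,), name="age_fare")
# Count inputs, log-transformed.
count_in = layers.Input(shape=(2,), name="sibsp_parch")
count = layers.Lambda(lambda x: tf.math.log1p(x))(count_in)
# Categorical inputs, embedded into learned distributional representations.
pclass_in = layers.Input(shape=(1,), name="pclass")
embarked_in = layers.Input(shape=(1,), name="embarked")
pclass = layers.Flatten()(layers.Embedding(3, 2)(pclass_in))
embarked = layers.Flatten()(layers.Embedding(3, 2)(embarked_in))
# Binary input.
sex_in = layers.Input(shape=(1,), name="sex")

# Concatenate and pass through 3 alternating batch-norm/dense blocks.
x = layers.Concatenate()([cont_in, count, pclass, embarked, sex_in])
for units in (32, 16, 8):  # assumed layer widths
    x = layers.BatchNormalization()(x)
    x = layers.Dense(units, activation="relu", kernel_regularizer=l2)(x)
out = layers.Dense(1, activation="sigmoid", name="survived")(x)

model = tf.keras.Model(
    [cont_in, count_in, pclass_in, embarked_in, sex_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

In such a setup, early stopping would map to the tf.keras.callbacks.EarlyStopping callback passed to model.fit.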
In addition, we regularize the model weights using L2 regularization with a small coefficient and employ early stopping to reduce the risk of overfitting since the model is slightly overparameterized for this small data set. After completing the design phase, we generate the project, download the resulting ZIP archive, and extract its content into our project root:
unzip ~/Downloads/titanic_ml_from_disaster.zip -d ./
Let’s take a look at the file tree of our project root:
- requirements.txt and requirements-gpu.txt contain lists of Python packages, installable using pip, that are needed to execute the generated source code on a CPU or using an NVIDIA CUDA-capable GPU;
- README.md contains comprehensive project documentation and instructions for executing the generated source code, including how to build Docker images for CPU- and GPU-accelerated execution;
- titanic_ml_from_disaster/training.py contains the source code of the graphically designed model, data loading and preprocessing functions, training code, and a command-line interface for configuring the training process;
- titanic_ml_from_disaster/inference.py contains source code for data and model loading, making predictions using the loaded model, and a command-line interface for configuring the inference process.
Setting up the environment
Before our model can be trained, we need to set up the runtime environment. We choose the Conda package manager for managing virtual environments and installing packages. It is good practice to create isolated environments for different Python projects in order to avoid polluting the global environment and risking package conflicts, so let’s create a new environment, activate it, and install required packages:
# Create a new environment "venv" and activate it.
$ conda create --name venv python=2.7 && source activate venv

# Install the required packages.
$ pip install -r requirements.txt
Training the model
To train the model using the training data, we simply execute training.py and monitor the training progress printed to STDOUT.
When we ran this code, early stopping detected no further improvement of the validation error and terminated the training process after 39 epochs. The best model was found in epoch 29 with a binary cross-entropy of 0.4043 on the validation set (5% of the competition training set).
Predicting survival of passengers from the test set
After the model has been trained, we would like to predict survival probabilities of the passengers in the test set data/test_preprocessed.csv and submit our predictions to the Kaggle server for evaluation. For this purpose, we use the generated inference script titanic_ml_from_disaster/inference.py. It requires a few modifications, though, because
- the column order in data/test_preprocessed.csv is not identical to that of data/train_preprocessed.csv since the test set does not contain the column Survived, and
- the output of the inference script does not match the competition’s submission format.
Apart from these two modifications, we can keep most of the data loading and all data preprocessing, model loading, and prediction code as well as the command-line interface. Before we start editing the inference script, let’s create a copy:
$ cp titanic_ml_from_disaster/inference.py \
     titanic_ml_from_disaster/inference_kaggle.py
Addressing the first modification is trivial. The column Survived is the left-most column in the training set, so we simply need to decrement the column indices by one in the data loading function load_source_file_1 for loading the test set.
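Illustratively, with toy stand-in columns (the generated load_source_file_1 itself is not reproduced here):

```python
# Survived is the left-most column of the training file, so a feature at
# column index i there sits at column index i - 1 in the test file.
train_columns = ["Survived", "Pclass", "Sex", "Age"]  # toy stand-ins
test_columns = ["Pclass", "Sex", "Age"]

train_index = train_columns.index("Sex")  # 2 in the training layout
test_index = test_columns.index("Sex")    # 1 in the test layout
assert test_index == train_index - 1
```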
The submission format requires the predictions to be submitted in a two-column CSV format where the first column represents the passenger IDs and the second column contains binary values that indicate the survival of passengers.
PassengerId,Survived
892,0
893,1
...
For making predictions in the submission format, we need to change the code block that prints or stores the model’s predictions.
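A minimal sketch of what the replacement output code could look like; the write_submission helper is hypothetical, assuming the predictions are available as an array of survival probabilities alongside the passenger IDs:

```python
import numpy as np
import pandas as pd

def write_submission(passenger_ids, probabilities, path="submission.csv"):
    """Map survival probabilities to binary indicators using a 0.5
    threshold and write them in the competition's submission format."""
    survived = (np.asarray(probabilities) > 0.5).astype(int)
    pd.DataFrame({"PassengerId": passenger_ids,
                  "Survived": survived}).to_csv(path, index=False)
```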
We also need to take care of mapping survival probabilities to binary decision indicators using a probability threshold of 0.5, i.e. a survival probability above 0.5 means “survived”. We call the modified script with the test set using the generated command-line interface and see the predictions printed to STDOUT:
$ python titanic_ml_from_disaster/inference_kaggle.py \
    --source_file_1 ./data/test_preprocessed.csv
PassengerId,Survived
892,0
893,0
894,0
...
With these results, the Kaggle server reports a score of 0.78947 (rank 3472/11192). While this score is not cutting-edge, it is not too bad considering the minimal effort we put into data preprocessing and neural architecture design. Of course, there are countless ways to further refine the preprocessing pipeline and improve the model by experimenting with various model architectures. Fortunately, conducting these experiments becomes a breeze with our AI Blueprint Engine as making architecture modifications is only a matter of a few clicks.
We very much appreciate feedback, suggestions, and feature requests, which you may submit via our GitHub issue tracker. You may also leave a comment below if you like (or dislike — constructive criticism is appreciated as well) what we are doing.