Teaching data science students to tackle the “hard” technical problems

David Kofoed Wind
Published in Peergrade
Mar 27, 2016
Image from https://pixabay.com/

During my time as a student, and now as a PhD student and lecturer in data science, I have noticed that students are surprisingly bad at dealing with real technical problems. What I am referring to is that students who want to specialise in machine learning and data science know their way around the methods and theory of, say, a neural network, but are unable to implement one efficiently themselves and execute it on a cluster, or even to pull a working implementation from GitHub and install it.

What normally happens in a course at my university when teaching students the K-means clustering algorithm is that they are taught the theory of the algorithm and then given a pre-made MATLAB implementation, where they are asked to vary the parameter K and observe what happens. With this approach, the students might get a thorough understanding of the algorithm, and maybe even some deeper theoretical insights, without getting bogged down in implementation details.
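The MATLAB implementation handed to the students is not shown here, but the algorithm itself fits in a few lines. As an illustration only (not the course's code), here is a minimal K-means sketch in Python/NumPy where K is the parameter the students would vary:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: returns (centroids, labels) for data X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs; with k=2 the centroids should
# settle near the blob centres at (0, 0) and (10, 10).
X = np.concatenate([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
                    np.random.default_rng(2).normal(10, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Rerunning this with different values of k on the same data is exactly the kind of experiment the MATLAB exercise asks for, just with the implementation visible instead of hidden.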

Image from http://www.vlfeat.org/overview/kmeans.html

The problems usually first surface when students are writing their master's thesis, or when they get their first job after university. As a data scientist, and even as a mathematician, your primary task will often involve a lot of programming (mathematicians in particular tend to realise this too late). At this point, some students are confronted for the first time with a really large dataset, with the task of running their algorithms on a cluster, or with the problem of installing and modifying a library hosted on GitHub.

When teaching my own course, Computational Tools for Big Data, the agenda for the first week is UNIX, Git and Amazon EC2. During this week I ask the students to learn a range of UNIX commands for working with files, how to use Git for version control, and how to set up and run things on a virtual machine at Amazon. I ask them to solve actual problems with real files in their terminals (and if they use Windows, they need to start by learning SSH to get a terminal on the cluster), to create and use their own repositories on GitHub, and to launch a free instance on Amazon EC2 (including setting up RSA keys and the rest) and get things running on it.

By letting the students get a true sense of what it takes to launch a virtual machine on Amazon and clone a repository from GitHub onto it, I hope they will be more comfortable doing this on their own in the future. In my own experience, a large fraction of a data scientist's time is not spent on model selection, but on grunt engineering tasks such as merging branches in Git, installing libraries, optimising database queries and processing files in the terminal.

In later weeks they are asked to install and work with PostgreSQL, MongoDB, Vagrant and various Python libraries. They are also asked to use data files from previous competitions on Kaggle.com, stored in JSON format, to make sure they have seen more than just CSV files by the time they graduate as engineers.
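The jump from CSV to JSON is mostly about nesting: a CSV row is flat strings, while a JSON record can carry typed values, lists and nested objects that need unpacking before they fit a tabular tool. A small sketch using only the Python standard library (the data and field names are made up for illustration):

```python
import csv
import io
import json

# A flat CSV row and a nested JSON record describing the same entity.
csv_data = "id,name\n1,alice\n"
json_data = '{"id": 1, "name": "alice", "tags": ["ml", "unix"], "address": {"city": "Copenhagen"}}'

# CSV parsing yields strings only, one flat dict per row.
row = next(csv.DictReader(io.StringIO(csv_data)))

# JSON parsing preserves numbers, lists and nested objects.
record = json.loads(json_data)
city = record["address"]["city"]
n_tags = len(record["tags"])
```

Note that `row["id"]` comes back as the string `"1"` while `record["id"]` is the integer `1`; spotting that kind of difference is precisely the experience the Kaggle files are meant to give.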

Some of the subjects in my course are unfortunately still too hard for every student to handle. This includes setting up an Apache Spark cluster and getting Caffe to run for training convolutional neural networks. For these we currently provide a Vagrantfile with the libraries pre-installed. I would love to get away from this, but the teaching assistants should not spend 99% of their time debugging weird system errors on student laptops. Hopefully we will find a better solution in the future.
