A Data Scientist's workstation

Published in bigdatarepublic, May 18, 2018

By Erik Jan de Vries for BigData Republic

As a Data Scientist, I prefer Linux as my operating system for machine learning, especially for GPU-accelerated algorithms. On the other hand, I like to use Microsoft Office for word processing, spreadsheets and presentations. This post introduces a series in which I discuss my experiences with setting up my laptop for Data Science.

My job as a Data Scientist has several very different aspects. The most obvious one is probably analysing data and making predictions. The most important one, however, may well be interacting with the business: if I don’t fully understand how the business works, I cannot create the best solutions for it. Unfortunately, these two worlds, analysing data and communicating with the business, have very different technological requirements.

Most modern data science tools and technologies are developed in the open source and Linux world. While support for the Windows platform has improved a lot over the years, some solutions still require a Linux machine. Docker is a great tool for developing independent experiments that can run simultaneously without interfering with each other: you can easily create isolated containers in which you run your analyses and algorithms. Some of these algorithms run dramatically faster on your GPU than on your CPU. Nvidia-Docker is a great solution for running your Docker containers on your Nvidia GPU, but it only works on a Linux host machine.
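To make this a bit more concrete, here is a minimal sketch of how two independent experiments, one on the CPU and one on the GPU, could be started as separate containers. It only builds the `docker run` commands and does not execute them. The helper `docker_run_command`, the container names and the image names are illustrative assumptions, and the `--runtime=nvidia` option assumes that version 2 of Nvidia-Docker is installed on a Linux host.

```python
import shlex


def docker_run_command(image, command, name, use_gpu=False):
    """Build (but do not execute) a `docker run` invocation for one experiment.

    Hypothetical helper for illustration only.
    """
    # --rm removes the container after it exits; --name keeps experiments apart.
    args = ["docker", "run", "--rm", "--name", name]
    if use_gpu:
        # Assumption: nvidia-docker 2 is installed on a Linux host, which
        # registers the "nvidia" runtime with Docker.
        args.append("--runtime=nvidia")
    args.append(image)
    args.extend(shlex.split(command))
    return args


# Two experiments that could run side by side in separate containers.
# The image names are examples; use whatever images your experiments need.
cpu_experiment = docker_run_command(
    "python:3.6", "python -c 'print(1 + 1)'", name="experiment-cpu"
)
gpu_experiment = docker_run_command(
    "nvidia/cuda", "nvidia-smi", name="experiment-gpu", use_gpu=True
)

for experiment in (cpu_experiment, gpu_experiment):
    print(" ".join(shlex.quote(arg) for arg in experiment))
```

On a machine with Docker installed, each list could be passed to `subprocess.run` to actually start the container. Because each experiment lives in its own container, the two do not share libraries or Python environments.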

A business consultant has very different technological requirements. When I attend a meeting, I like to take notes, or share information using presentations and spreadsheets. This is the domain of Microsoft Office. While there are some open source alternatives, they simply are no match for Microsoft Office. And even if there were good alternatives, most businesses run on Microsoft Office, making it almost a hard requirement. By extension, Microsoft Windows is almost a hard requirement as well. I know there are online versions of Microsoft’s office applications, but they are limited in functionality.

While I would prefer a single device with a single operating system, given the circumstances I think a dual boot system is an adequate solution for my requirements. Without a doubt, other people will prefer other solutions. Still, this setup allows me to do everything I want to, even though switching between Linux and Windows is a little cumbersome because it requires a reboot. Perhaps in the future I will look into running Microsoft Windows and Office in a virtual machine (according to this blog, virtual machines and Docker containers should go very well together).

In the next few blog posts, I will take you through the process of setting up a Data Scientist's workstation.

I hope my experiences help you avoid some of the pitfalls I fell into!

About the author

Erik Jan de Vries is a data scientist at BigData Republic, a data science consultancy company in the Netherlands. We hire the best of the best in BigData Science and BigData Engineering. If you are interested in deploying machine learning and deep learning solutions in production using Docker and Kubernetes, feel free to contact us at info@bigdatarepublic.nl.
