Establishing a proper development setup is essential to an efficient workflow. This is as important to Data Scientists as it is to Web or App developers. The preferred environment usually evolves over time and there are many different ways to approach this process. This is mine.
First, My Development Languages
I primarily work with Python and R, with a side of light HTML, JS and other web-based tinkering. I’m a fan of open source software and I’m appreciative of the work that goes into establishing a robust development community that supports the packages I use daily. I believe software development tools should be made freely and easily available to as many people as possible.
The Operating System
While I have used every common OS from Linux to ChromeOS to Mac OS, my current preference is Windows 10 on my laptop. When it comes to the breadth of features the UI needs for business and other tasks, it suits my needs best at the moment. While I have tried various Linux distros and Chromebooks, none of them quite stack up against the full suite of offerings provided by Windows or Mac.
Having committed to Windows as my primary OS after having used Mac for several years, the one thing I missed was the built-in Unix-like terminal. This is extremely useful for basic needs like using ssh, git, and pip/npm package management. Yes, I know the Windows terminal can do all of the same basic tasks, but I was more familiar with the Linux commands. So I would sometimes resort to running a full virtual machine or Docker image inside Windows for development-related tasks. This, however, can be cumbersome. Enter “bash on Windows”! This beta feature was later more formally renamed the Windows Subsystem for Linux (WSL).
Windows Subsystem for Linux
This wonderful little addition by Microsoft is one of the most underrated features to be developed by the Windows team in recent years. As of the Windows 10 Fall Creators Update, WSL moved from beta to a fully supported feature. Microsoft has also made it available directly from the Microsoft Store, which makes getting started much easier than it was with the beta versions.
So, why do this? One major benefit is that WSL runs a lightweight version of a standard Linux distro without requiring a full virtualization environment like Hyper-V, VMware, or VirtualBox. This greatly simplifies setup and minimizes the resources required from the host computer. There is no GUI, and that’s by design. For users like me who simply want a Linux command shell within Windows, it works very well.
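For anyone who switches between WSL, a regular Linux machine, and plain Windows, it can be handy to check from a script where the code is actually running and whether the command-line tools mentioned earlier are on the PATH. The following is only a rough sketch: the function names are my own, and the check relies on the assumption that WSL kernels report "microsoft" somewhere in their release string, which is a common heuristic rather than an official API.

```python
import platform
import shutil


def looks_like_wsl(kernel_release: str) -> bool:
    """Heuristic (an assumption, not an official check): WSL kernels
    usually include 'microsoft' in the kernel release string."""
    return "microsoft" in kernel_release.lower()


def available_tools(names):
    """Map each tool name to its location on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    release = platform.uname().release
    print("Kernel release:", release)
    print("Looks like WSL:", looks_like_wsl(release))
    # The tool list mirrors the everyday commands mentioned above.
    for tool, path in available_tools(["ssh", "git", "pip", "npm"]).items():
        print(f"{tool:>4}: {path or 'not found'}")
```

Run inside the WSL shell, this should report the kernel string and show which of the tools are installed. Run from a regular Windows Python, it should report that the environment does not look like WSL.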
Visual Studio Code might be the best all-around modern IDE. It is built on the Electron framework, which was originally developed for GitHub's Atom editor. It is currently by far my favorite editor. It offers a rich library of community-supported extensions and integrated Git and debugging support, and it’s a multi-platform open source project.
Cool, so what about R development? Well, I have done some light work on .R and .Rmd files within VS Code. However, I find that I still go back to RStudio some of the time. This may change as I continue to learn R, but for now RStudio still seems easiest to work with.
Notebook Editors for Data Science
Tools like Jupyter Notebook or R Markdown are increasingly important for Data Science work. They make it easy to control and document the execution of code for clear, reproducible work. They also provide a simple way to visualize results and present analysis in a shareable format.
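One small habit that supports reproducibility is recording the environment in the first cell of a notebook, so a reader knows which versions produced the results. Here is a minimal sketch of that idea using only the Python standard library. The function name and the package list are placeholders of my own, not part of Jupyter or any particular workflow.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep going so one missing package does not hide the rest.
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder list: replace with the packages the analysis imports.
    snapshot = environment_snapshot(["numpy", "pandas", "matplotlib"])
    for key, value in snapshot.items():
        print(f"{key}: {value}")
```

Pasting the printed output into the notebook or the project README gives future readers a starting point when results do not match.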
Workflow Best Practices
It is important to produce well-documented and reproducible work as a data scientist. Spending time on your project objectives and documentation will always pay future dividends regardless of your level of expertise. Also, make README files, web reference links and screenshots a priority. Try different projects and get tips from code posted by others. Share your code publicly on GitHub, write blog articles or Gist pages, and contribute to forum posts. While your contribution may not seem significant in the sea of available resources out there, it might be just what somebody needs to fix an issue they encounter or to learn something new when starting out.
Whether you are a student just getting started in simple statistical analytics or a postdoc working on a large-scale machine learning project, your development workflow might be the key to maintaining your sanity as you go along.