Getting Started with Data Science

From zero to one to hero

Nicholas Teague
From the Diaries of John Henry
9 min read · Jul 4, 2020


A friend recently shared that they were thinking about getting into coding, so I thought I’d assemble a few pointers and helpful hints for getting started. I was in a similar position a few years ago myself, so you can think of this as a kind of letter to my former self, covering what would have helped me get started way back when. I hope anyone not yet acclimated to the data science ecosystem may find it helpful. So, without further ado.

Software Engineering vs Data Science

So there’s kind of a fundamental distinction between the kinds of coding workflows that go into software engineering versus mainstream data science. Software engineering is the act of creating self-contained, packaged systems with defined inputs and outputs, and engineering is an appropriate term because, done properly, it involves creating specifications, documentation, pseudo code, architectures, and implementations. Data science, on the other hand, is a little bit of a looser term; “science” here is kind of a generous moniker, as it’s not exactly a science as practiced in mainstream use. Data science is all about extracting insights from some data corpus. The data could come from any range of applications, from financial data, business data, web data, or, in advanced practice, even stuff that you might not think of as “data”, like images, video, language, speech, music, etc. Basically anything that we can represent in digital form can be a target for data science analysis by machine learning. That being said, a whole lot of the analysis performed in a business setting isn’t quite as exotic; a lot of it involves just getting numeric and categoric sets into properly formatted tables that may be passed to machine learning algorithms. But I’m getting ahead of myself.
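To preview what “properly formatted” means in practice: machine learning algorithms generally want every entry to be numeric, so a categoric column of strings gets encoded to numbers first. Here’s a minimal sketch with made-up data, using the pandas library (which we’ll properly introduce later):

    import pandas as pd

    # a tiny made-up table mixing a numeric and a categoric column
    df = pd.DataFrame({
        "age":   [34, 28, 45],
        "color": ["red", "blue", "red"],
    })

    # one-hot encode the categoric column so every entry is numeric
    encoded = pd.get_dummies(df, columns=["color"])
    print(encoded)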

Stack Overflow

This is such a great resource that I am going to go ahead and share it right off the bat. Stack Overflow is a crowd-sourced question-answering platform in which people share questions about working with code and other people share answers. The system works by way of voting and point collecting: if you participate in the ecosystem by posting questions and responses, you gain reputation points, which I believe help your questions get higher visibility and the like. So as you use Stack Overflow, try to get into the habit of upvoting the responses you found helpful, responding to questions where you may have some perspective to contribute, and, if you think you have an original question to ask, asking it. This is how crowd-sourced documentation comes about. I can’t recommend Stack Overflow enough as a helpful resource for working with code.

Programming languages

Of course any coding project is first and foremost going to need a programming language as its basis. There are a lot to choose from; computer languages are a constantly evolving landscape. Each language in general has different strengths and weaknesses, such as speed, ease of use, and integration with external libraries for various use cases, as well as more advanced software engineering considerations which I won’t try to go into since this essay is intended for beginners (yes, that is a cop-out, I know). Just to give you a flavor: some of the most durable environments over the years have proven to be languages like C and its variants (C++, Objective-C, etc.), which are known for raw speed (at a cost of complexity of implementation), making them the languages of choice for operating systems and the like. If you’re going to develop software for applications like mobile apps, you’ll probably do some work with languages like Java for Android or Swift for iOS, and to make these easier in practice there are even code editing environments available that incorporate a user interface for building functionality (check out Xcode for Apple development, for instance), but I digress. In the field of data science, mainstream practice has coalesced around a few notable frameworks. You’ll probably come across R at some point; I’ve heard it is particularly strong for data exploration. That being said, the by far dominant language for data science applications is Python. In fact it has kind of become the most dominant language in general, even outside of data science. It’s not a perfect language (it’s not the fastest, for example), but it more than makes up for that with an intuitive development structure. In other words, it’s easy to write Python code and, perhaps just as importantly, it’s easy to read code written in Python. That’s not the only reason it’s so dominant; it’s also the case that most of the major software libraries associated with data science have implementations in Python, which we’ll get into soon.
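To give a flavor of that readability, here’s a toy example (entirely made up) of the kind of thing a beginner might write on day one. Even without knowing Python, you can probably follow what it does:

    # compute the average of a list of temperatures
    temperatures = [68, 71, 73, 70, 69]

    def average(values):
        """Return the arithmetic mean of a list of numbers."""
        return sum(values) / len(values)

    print(average(temperatures))  # prints 70.2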

Code editors

If you’re reading between the lines, you’re probably seeing this as a recommendation for beginners to try out Python. Which it is. That being said, it’s probably a good rule of thumb to let the use case serve as the basis for the language rather than the other way around; in other words, Python isn’t a good fit for everything, but if you’re looking to get acclimated to the data science ecosystem there’s no better place to start. So how does one start experimenting with Python? Well, for that you’re going to need a code editor, my friend. Probably the simplest starting point for running Python statements is an open source project known as Jupyter notebooks, in which a user can enter Python statements and view their output, as well as collect corresponding natural language documentation, all in a simple file. Note that Jupyter is available in a few flavors: the original Jupyter notebook is kind of the vanilla flavor, and then for more code editor feature integration there’s a variant called JupyterLab. Unfortunately, from a beginner standpoint, the installation and initiation of Jupyter notebooks isn’t as straightforward to describe. A few choices if you want to get up and running quickly: check out online resources like Google’s Colaboratory, which is a pretty neat online solution but comes at a cost of complexity for dealing with data sets not available in the cloud; an alternative that is simple to install and operate is Anaconda, which runs on your local machine and, as a bonus, comes with many mainstream data science libraries pre-installed, which is neat.
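Just to make the notebook workflow concrete, a first cell might look something like this (a minimal sketch; you run a cell with shift+enter and the output appears right below it):

    # each cell holds python statements
    message = "hello, data science"
    print(message)

    # as a notebook convenience, the last expression in a cell
    # is displayed automatically, no print required
    2 + 2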

The Terminal

So just a heads up: Jupyter notebooks are great for running experiments and trying out Python code, but in practice sooner or later you’re going to need to jump up to a higher layer, so I’ll go ahead and try to provide some context here. The terminal is a kind of command line interface for interacting with your operating system. In macOS the terminal is helpfully available as a preinstalled app called Terminal, while on a Windows PC it’s called the Command Prompt, and yeah, there are variations in conventions and commands for working with each. When you open a terminal command line interface, you’re not directly working with Python; it’s more like an environment from which you can access Python, initiate Jupyter notebooks, install libraries, and do pretty much anything you can do in the graphical interface, such as navigating your file system, opening applications, all kinds of stuff. For example, there’s a really, really weird text editor you can run within the terminal called Vim, which everyone says they like, but I have a strong suspicion that they’re all pretending. One bit of advice: I don’t recommend just installing libraries to your base system willy-nilly, partly because it’s easy to lose track of all of the various dependencies and whatnot, so best practice if you want to experiment with new libraries is to do so in a compartmentalized virtual environment, which is available with the conda package manager (bundled with Anaconda) for instance. Just something to keep in mind. If you’re just playing around with Jupyter notebooks this isn’t going to come up right away; really just a heads up that sooner or later it will, so be ready.
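For instance, here’s roughly what setting up one of those compartmentalized environments from the terminal might look like with Anaconda installed (a sketch, not gospel; the environment name “sandbox” is just a placeholder):

    # create a new virtual environment named "sandbox"
    conda create --name sandbox python

    # activate it so subsequent installs land in the environment,
    # not your base system
    conda activate sandbox

    # install libraries into the environment, then launch jupyter
    conda install jupyter pandas
    jupyter notebook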

Cloud Services

As you get into advanced practice, you might find benefit from integrating your coding sessions with various cloud offerings. For example, in advanced data science projects, the iterative nature of tuning machine learning models can often benefit from accelerator hardware to speed up computations on large data sets. These types of accelerator hardware, such as GPUs (graphics processing units, repurposed from graphics for parallelized number crunching in data science), are available as hardware for your own computer (Nvidia graphics cards in particular have strong support for accelerating various machine learning libraries), but you’ll likely find that often a better option is to access a GPU from a cloud provider. For example, the Colaboratory service allows access to GPUs, capped by number of hours, even on its free tier, and Kaggle, a data science competition platform, also has GPU-accelerated notebooks available. (Side note: Kaggle is a great resource for beginner tutorials; seriously, check out their Titanic competition for example, that’s a great starting point. Sorry, I digress.) For advanced users, cloud services like AWS, Google Cloud, or smaller services like Paperspace offer virtual cloud sessions, such as for accessing Jupyter notebooks, which allow you to scale your accelerator hardware up or down to meet the needs of an application, rented by the hour for example. But this is just something you might get into down the road; when you’re just getting started in data science you probably won’t need to think about this stuff just yet.
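As a small taste of what that looks like in practice, once you’ve got a session with a GPU attached (in Colaboratory, for instance), a library like PyTorch can report whether it sees the hardware. A minimal sketch, assuming PyTorch is installed:

    import torch

    # returns True when a CUDA-capable GPU is visible to pytorch
    print(torch.cuda.is_available())

    # if so, tensors can be moved to the gpu for accelerated math
    if torch.cuda.is_available():
        x = torch.ones(3, 3).to("cuda")
        print(x.device)  # cuda:0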

Libraries

Again, part of the reason why Python is such a dominant language for the data science ecosystem is the robust selection of libraries available for various use cases. For example, NumPy is a library for working with arrays of data, which could be tables of rows and columns for tabular data, or aggregations in higher dimensions for what are called tensors. Pandas is a library built on top of NumPy which adds features for working with tabular data “dataframes”, including the integration of column headers and index columns, which turns out to be pretty useful for various data manipulation techniques. Matplotlib is a library for data visualizations, such as for generating charts and the like, which are often helpful for data exploration and presentation. Then, as we get into applying predictive algorithms, scikit-learn is a great starting point for machine learning. And as you get into working with neural networks, libraries like TensorFlow and PyTorch are also built on top of Python; as a bonus, they automatically interface with accelerator hardware like GPUs under the hood, so you get built-in support, which is nice. This is just a small selection of the types of libraries available; there are a bunch.
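To tie a few of these together, here’s a minimal sketch of a first machine learning pass; the data is made up for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # a tiny made-up tabular data set: square footage vs sale price
    df = pd.DataFrame({
        "sqft":  [850, 900, 1200, 1500, 1800],
        "price": [100, 105, 150, 190, 220],
    })

    # scikit-learn expects a 2D feature array and a 1D target
    X = df[["sqft"]].to_numpy()
    y = df["price"].to_numpy()

    # fit a simple linear regression and make a prediction
    model = LinearRegression()
    model.fit(X, y)
    print(model.predict(np.array([[1000]])))  # estimated price at 1000 sqft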

One more thing

Oh yeah, one more thing. On the subject of libraries, there is one that’s really useful that I want to highlight. Automunge (pronounced “auto” and then “munge”, just like it’s spelled) is a data science library that automates the preparation of tabular data for machine learning. It’s kind of a data science application built by software engineering (see, there are intersections): it takes tabular data as input and automatically numerically encodes and normalizes it, such as to provide a push-button means to feed raw tabular data directly to machine learning. What’s more, a user doesn’t have to defer to automation; the library also serves as a platform for feature engineering, and the various encodings that may be applied to distinct columns in a tabular data set are available for push-button assignment. One of the core challenges of data science projects is cleaning the data to make it suitable for machine learning. That’s what Automunge does, and it’s open source and it’s easy. Seriously, check us out on GitHub or something. Cheers.

Chopin’s Etude in C minor, featuring “Searchlight” by Marc Handelman on display at Orlando Museum of Art

For further reading, please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
