Getting to know your data-oriented toolbox

Matthew Oliver
Nov 6 · 5 min read

In data science, we use many different tools to perform our statistical testing and machine learning. The tools we use today for analyzing data have come a long way from the accountants of antiquity, who used the abacus as their primary IDE (integrated development environment). A strong understanding of these tools is essential for making efficient products and providing comprehensive data analyses. Each of the tools we use has been optimized for some specific part of the data science process, from storing data to building neural networks. The growth of data science over the last few years has been largely due to the rapid increase in computational processing power, which in turn comes from advancements in both the hardware and the software on which our programs run.

The Hardware

While machine learning may be built on advanced statistical models, even the most advanced models (especially the most advanced models) would be impractical to run on computers from just the early 2000s. While all computer processing is just changing 0s and 1s, the way in which the computer changes the 0s into 1s is very important to how a program runs. Hardware advances have gone hand in hand with every breakthrough in computing power.

Nvidia GPU core

GPUs and Image Processing

While we all think of Pixar as a great film company that pioneered the animated storytelling genre with hits like Toy Story and A Bug’s Life, their first commercial product was computer hardware. Ed Catmull (who holds a PhD in computer science) and Steve Jobs began Pixar by selling high-end computers which specialized in image processing, with little success. They soon switched industries, applying their technology to the art of filmmaking. This allowed them to grow massively as a company, leveraging the powerful computers they had at their disposal for creative visual storytelling. Projecting a 3D graphical object, such as an animated Buzz Lightyear character, onto a 2D space involves a lot of matrix operations, which can be extremely computationally intensive, so a specialized processor called a GPU (Graphics Processing Unit) is used to handle these calculations.

“Embarrassingly Parallel”

Graphics cards are designed from the ground up to be really good at performing long runs of easy calculations (such as multiplying a whole matrix). A CPU (Central Processing Unit) core runs very quickly, but a CPU can only work on as many operations at once as it has cores, which is typically fewer than ten (an iPhone has four). GPUs have hundreds of cores, each of them much smaller. For operations like image processing and building neural networks, the GPU can perform the simple task of iterating through these matrices and applying simple operations much better than a CPU can, because the CPU, while quicker at performing each calculation, can only handle a few of them at a time. Matrix operations which can be split up into many smaller, similar, independent tasks are called “Embarrassingly Parallel”, and they are what make GPUs much more powerful at performing many data science operations. Data scientists are leveraging these specialized devices for their complicated operations in the same way Pixar did for digital filmmaking at the turn of the century.

The matrix used in calculating a 2D camera view from a 3D virtual object
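As a rough illustration of the kind of matrix operation described here, the sketch below projects a handful of 3D points onto a 2D image plane using a simple pinhole camera model. The function name `project_points`, the focal length, and the sample points are all made up for this example, and a real rendering pipeline involves many more steps. The point to notice is that each 3D point is projected independently of the others, which is what makes the work easy to spread across many cores.

```python
import numpy as np

def project_points(points_3d, focal_length=1.0):
    """Project (N, 3) camera-space points onto a 2D image plane.

    Assumes a simple pinhole camera looking down the z axis, with every
    point in front of the camera (z > 0).
    """
    # Camera matrix: scales x and y by the focal length, keeps z as-is.
    camera_matrix = np.array([
        [focal_length, 0.0, 0.0],
        [0.0, focal_length, 0.0],
        [0.0, 0.0, 1.0],
    ])
    # One matrix multiplication transforms every point at once.
    homogeneous = points_3d @ camera_matrix.T
    # Perspective divide: farther points (larger z) land closer to the center.
    return homogeneous[:, :2] / homogeneous[:, 2:3]

# Three hypothetical points in front of the camera.
points = np.array([
    [1.0, 2.0, 2.0],
    [0.0, 1.0, 4.0],
    [-2.0, 0.5, 1.0],
])
pixels = project_points(points, focal_length=2.0)
print(pixels)
```

No row of the result depends on any other row, so the same calculation could be handed to thousands of small workers at once.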

The Software

https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks

So now that you know a GPU is better at performing the matrix operations required for data science, how do we utilize it? Most Python libraries don’t run matrix operations on the GPU, but there are packages which use CUDA, NVIDIA’s platform for programming its GPUs, to leverage the GPU’s specialized processing methods.
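One such package is CuPy, the library benchmarked in the post linked above, which mirrors much of the NumPy interface. The sketch below is only an assumption about how you might try it: it uses CuPy when the package and a CUDA-capable GPU are available, and otherwise falls back to NumPy on the CPU. The function name `gram_matrix` and the matrix size are invented for the example.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # raises if no usable CUDA device is present
    xp = cp      # array operations will run on the GPU
except Exception:
    xp = np      # fall back to NumPy on the CPU

def gram_matrix(n=256, seed=0):
    """Multiply a random n x n matrix by its transpose on the chosen device."""
    rng = np.random.default_rng(seed)
    # Build the data on the CPU, then move it to the GPU if one is in use.
    a = xp.asarray(rng.standard_normal((n, n)))
    product = a @ a.T
    # Copy the result back to a regular NumPy array when it lives on the GPU.
    return cp.asnumpy(product) if xp is not np else product

result = gram_matrix(8)
print(type(result), result.shape)
```

Because the two libraries share so much of their interface, the matrix multiplication line is identical in both cases; only the module behind `xp` changes.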

Why Python?

There are many programming languages a data scientist can use, the most common being R and Python. R is written explicitly for data science, and RStudio (the IDE) keeps many handy tools for data scientists easily accessible. Python, named after the comedy group Monty Python, is wider reaching, and is able to handle operations such as automated web scraping, API calls and machine learning. Python is open source and has many libraries which are handy for someone interested in working with data; specific examples are NumPy and pandas. Python may be more “readable” than other, lower-level languages, but it is important to note that making something more readable to the user can sometimes make it harder for the computer to run efficiently.

Vectorization

This is where libraries like NumPy come in handy, because they use something called vectorization. Libraries like NumPy and pandas store data in vectors and array-like objects that make it easier for the computer to perform the same operation on many elements at once (R does this automatically). This may not make a difference when multiplying a few elements together, but when working with larger data sets, vectorized processing allows the CPU to handle these operations much more efficiently. A CPU has fewer cores than a GPU, which means it is able to do fewer things at the same time, but by breaking the computation down into similar calculations across whole vectors, which the library runs in optimized compiled code rather than one Python step at a time, it can operate at a much greater speed despite having fewer cores. This is why for-loops may be easy to write but shouldn’t be used when working with large pandas datasets: they fail to utilize the efficiencies built into these libraries.
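To make the contrast concrete, here is a small sketch that computes the same column two ways. The DataFrame, its column names, and the two function names are made up for the example. On three rows you will not notice a difference, but the loop version does its work one row at a time in Python, while the vectorized version hands the whole column to pandas in a single operation.

```python
import pandas as pd

# A tiny, hypothetical order table.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
})

def total_with_loop(frame):
    """Row-by-row version: easy to write, slow on large data."""
    totals = []
    for _, row in frame.iterrows():
        totals.append(row["price"] * row["quantity"])
    return totals

def total_vectorized(frame):
    """Vectorized version: multiplies the two whole columns at once."""
    return frame["price"] * frame["quantity"]

loop_totals = total_with_loop(df)
vector_totals = total_vectorized(df)
print(loop_totals)
print(vector_totals.tolist())
```

Both versions give the same totals; timing each one on a DataFrame with a few hundred thousand rows is a quick way to see how much the loop costs.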

Data science is just like every other technical profession: effective work begins with understanding the tools we have at our disposal. While we may have an unprecedented amount of processing power at our fingertips, the programs we run have also become more complex. Understanding how to use the hardware and software in our toolkit most efficiently is essential to developing programs and running analyses. There are many ways to define efficient code, but before methods like Big O notation can be applied, the right tool must be used to build the program.
