Python vs R for Data Science

Kasthuri P
Women Data Greenhorns
11 min read · Jul 16, 2018

As part of the Bertelsmann Data Science Scholarship, Udacity and Bertelsmann challenged 15,000 students to learn about Data Science, complete a course about Descriptive Statistics and Advanced Concepts with Python and SQL and interact with other data enthusiasts from all over the Globe.

#UdacityDataScholars #PoweredByBertelsmann

When it comes to choosing a preferred language for your data science learning path, I guess most of you will agree with me that both R and Python come to mind at once, and it is difficult to pick one over the other.

Both languages are open source and free to use, and both emerged in the early 1990s: R for statistical analysis and Python as a general-purpose programming language. For anyone interested in machine learning, working with large datasets, or creating complex data visualizations, learning one of these languages is essential.

Stack Overflow Trends

Image Source: https://dzone.com/articles/r-or-python-data-scientists-delight

The above graph visualizes how these two languages have grown in popularity over time, based on the use of their tags since 2008, when Stack Overflow was founded!

While R and Python compete to be the data scientist’s language of choice, let’s look at their platform share and compare 2016 with 2017.

Image Source: https://dzone.com/articles/r-or-python-data-scientists-delight

Now, it is time to look at these two languages from the perspective of their usage in a data pipeline, including:

1. Data Collection

2. Data Exploration

3. Data Modeling

4. Data Visualization

Data Collection

Data collection is the process of gathering and measuring information on targeted variables in an established systematic fashion, which then enables one to answer relevant questions and evaluate outcomes.

Python

Python supports all kinds of different data formats. You can play with comma-separated value documents (known as CSVs) or you can play with JSON sourced from the web. You can import SQL tables directly into your code.
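As a quick sketch of what this looks like with Pandas (the data and table names below are made up for illustration):

```python
import sqlite3
from io import StringIO

import pandas as pd

# CSV: read from a file path or any file-like object
csv_df = pd.read_csv(StringIO("name,score\nada,90\ngrace,95"))

# JSON: read a list of records straight into a DataFrame
json_df = pd.read_json(StringIO('[{"name": "ada", "score": 90}]'))

# SQL: pull a query result directly into a DataFrame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.execute("INSERT INTO scores VALUES ('ada', 90)")
sql_df = pd.read_sql("SELECT * FROM scores", conn)
```

The same `pd.read_*` pattern covers Excel, Parquet, and many other formats as well.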

You can also create datasets. The Python requests library is a beautiful piece of work that lets you pull data from websites, reducing an HTTP request to a single line of code. You’ll be able to take data from Wikipedia tables, and once you’ve organized the scraped HTML with Beautiful Soup, you’ll be able to analyze it in depth.
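A minimal sketch of the scraping half of that workflow. The live fetch is commented out (so this runs offline); an inline HTML snippet with made-up numbers stands in for a downloaded Wikipedia-style table:

```python
from bs4 import BeautifulSoup

# The fetch itself is one line with requests:
# import requests
# html = requests.get(url).text

# Inline snippet standing in for a downloaded page (illustrative values)
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>India</td><td>1380000000</td></tr>
  <tr><td>Brazil</td><td>213000000</td></tr>
</table>
"""

# Parse the table into a list of rows
soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
```

From here, `rows` can be fed straight into a Pandas DataFrame for analysis.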

Modern File formats & features for collecting data in Python:

Feather (Fast reading and writing of data to disk)

  • Fast, lightweight, easy-to-use binary file format
  • Makes pushing data frames in and out of memory as simple as possible
  • Language agnostic (works across Python and R)
  • High read and write performance (600 MB/s vs 70 MB/s of CSVs)
  • Great for passing data from one language to another in your pipeline

Ibis (Pythonic way of accessing datasets)

  • Bridges the gap between local Python environments and remote storages like Hadoop or SQL
  • Integrates with the rest of the Python ecosystem

ParaText (Fastest way to get fixed records and delimited data off of disk and into RAM)

  • C++ library for reading text files in parallel on multi-core machines
  • Integrates with Pandas: paratext.load_csv_to_pandas("data.csv")
  • Enables CSV reading of up to 2.5GB a second
  • A bit difficult to install

bcolz (Helps you deal with data that’s larger than your RAM)

  • Compressed columnar storage
  • You have the ability to define a Pandas-like data structure, compress it, and store it in memory
  • Helps get around the performance bottleneck of querying from slower memory

R

You can import data from Excel, CSV, and text files into R. Files built in Minitab or in SPSS format can be turned into R data frames as well. While R might not be as versatile at grabbing information from the web as Python is, it can handle data from your most common sources.

Many modern R packages have been built to address this gap. rvest lets you perform basic web scraping, and it pairs naturally with magrittr pipes for cleaning up and parsing the scraped information. Together they play a role analogous to the requests and Beautiful Soup libraries in Python.

Modern File formats & features for collecting data in R:

Feather (Fast reading and writing of data to disk)

  • Same as for Python

Haven (Interacts with SAS, Stata, SPSS data)

  • Reads SAS, Stata, and SPSS files into R data frames

Readr (Reimplements read.csv into something better)

  • read.csv sucks because it turns strings into factors by default, it’s slow, etc.
  • Creates a contract for what the data features should be, making it more robust to use in production
  • Much faster than read.csv

JsonLite (Handles JSON data)

  • Intelligently turns JSON into matrices or dataframes

Data Exploration

Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a dataset.

Python

To unearth insights from the data, you’ll have to use Pandas, the data analysis library for Python. It can hold large amounts of data without any of the lag that comes from Excel. You’ll be able to filter, sort and display data in a matter of seconds.

Pandas is organized into data frames, which can be defined and redefined several times throughout a project. You can clean data by filling in non-valid values such as NaN (not a number) with a value that makes sense for numerical analysis such as 0. You’ll be able to easily scan through the data you have with Pandas and clean up data that makes no empirical sense.
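A small sketch of that cleaning workflow, on a made-up frame with missing and empirically nonsensical values:

```python
import numpy as np
import pandas as pd

# Made-up data: a missing income, a missing age, and an impossible age
df = pd.DataFrame({"age": [25, np.nan, 31, -1],
                   "income": [50000, 62000, np.nan, 58000]})

# Fill NaNs with values that make sense for numerical analysis
df["income"] = df["income"].fillna(0)
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows that make no empirical sense (negative ages)
df = df[df["age"] >= 0]
```

Chaining a few operations like this is typically all the "cleaning" a small dataset needs before analysis.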

Modern Features for cleaning & transforming data in Python:

Blaze (NumPy for big data)

  • Translates a NumPy / Pandas-like syntax to data computing systems.
  • The same Python code can query data across a variety of data storage systems.
  • Good way to future-proof your data transformations and manipulations.

xarray (Handles n-dimensional data)

  • N-dimensional arrays of core pandas data structures (e.g. if the data has a time component as well).
  • Multi-dimensional Pandas dataframes.

Dask (Parallel computing)

  • Dynamic task scheduling system.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.

R

R was built to do statistical and numerical analysis of large data sets, so it’s no surprise that you’ll have many options while exploring data with R. You’ll be able to build probability distributions, apply a variety of statistical tests to your data, and use standard machine learning and data mining techniques.

Basic R functionality encompasses the basics of analytics: optimization, statistical processing, random number generation, signal processing, and machine learning. For some of the heavier work, you’ll have to rely on third-party libraries.

Modern Features for cleaning & transforming data in R:

Dplyr (Swiss army chainsaw)

  • The way R should’ve been in the first place
  • Has a bunch of amazing joins.
  • Makes data wrangling much more humane

Broom (Tidy your models)

  • Fixes model outputs (gets around the weird incantations needed to see model coefficients)
  • tidy, augment, glance

tidytext (Text as tidy data)

  • Text mining using dplyr, ggplot2, and other tidy tools
  • Makes natural language processing in R much easier

Data Modeling

Data modeling is a set of tools and techniques used to understand and analyze the data for data-driven decision making.

Python

You can do numerical modeling analysis with Numpy. You can do scientific computing and calculation with SciPy. You can access a lot of powerful machine learning algorithms with the scikit-learn code library. scikit-learn offers an intuitive interface that allows you to tap all of the power of machine learning without its many complexities.
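A taste of that intuitive scikit-learn interface — the fit/predict pattern below works the same across nearly all of its algorithms (the tiny dataset here is synthetic, for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset: one feature, two well-separated classes
X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Every scikit-learn estimator follows the same fit/predict pattern
model = LogisticRegression()
model.fit(X, y)
predictions = model.predict([[0.2], [4.8]])
```

Swapping in a random forest or an SVM is a one-line change to the estimator, which is exactly the simplicity the library is loved for.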

Modern Features for data modeling in Python:

Keras (Simple deep learning)

PyMC3 (Probabilistic programming)

  • Implements cutting-edge methods from academic research labs
  • Powerful Bayesian statistical modeling

R

In order to do specific modeling analyses, you’ll sometimes have to rely on packages outside of R’s core functionality. There are plenty of packages out there for specific analyses such as the Poisson distribution and mixtures of probability laws.

Modern Features for data modeling in R:

MXNet (Simple deep learning)

  • Intuitive interface for building deep neural networks in R
  • Not quite as nice as Keras

TensorFlow

  • Now has an interface in R

Data Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

Python

The Jupyter Notebook (formerly IPython Notebook) that comes with Anaconda has a lot of powerful options to visualize data. You can use the Matplotlib library to generate basic graphs and charts from the data in your Python code. If you want more advanced graphs or better design, you could try Plot.ly. This handy data visualization solution takes your data through its intuitive Python API and spits out beautiful graphs and dashboards that can help you express your point with force and beauty.
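A minimal Matplotlib example, rendered headlessly to an image file (the bar values are made-up, purely illustrative):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render without a display (e.g. on a server)
import matplotlib.pyplot as plt

# A basic chart from in-memory data (illustrative numbers only)
fig, ax = plt.subplots()
ax.bar(["Python", "R"], [61.1, 57.8])
ax.set_ylabel("Share (%)")
ax.set_title("Platform share (made-up values)")

# Save to an image format such as PNG
out = os.path.join(tempfile.mkdtemp(), "chart.png")
fig.savefig(out)
```

In a notebook you would skip the `Agg` backend and the chart would render inline instead.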

You can also use the nbconvert tool to turn your Python notebooks into HTML documents. This can help you embed snippets of nicely formatted code into interactive websites or your online portfolio. Many people have used it to create online tutorials and interactive books for learning Python.

Modern features for advanced data visualization in Python:

Altair (Like a Matplotlib 2.0 that’s much more user friendly)

  • You can spend more time understanding your data and its meaning.
  • Altair’s API is simple, friendly and consistent.
  • Create beautiful and effective visualizations with a minimal amount of code.
  • Takes a tidy DataFrame as the data source.
  • Data is mapped to visual properties using the group-by operation of Pandas and SQL.
  • Primarily for creating static plots.

Bokeh (Reusable components for the web)

  • Interactive visualization library that targets modern web browsers for presentation.
  • Able to embed interactive visualizations.
  • D3.js for Python, except better.
  • Already has a big gallery that you can borrow (or steal) from.

Geoplotlib (Interactive maps)

  • Extremely clean and simple way to create maps.
  • Can take a simple list of names, latitudes, and longitudes as input.

R

R was built to do statistical analysis and demonstrate the results. It’s a powerful environment suited to scientific visualization, with many packages that specialize in graphical display of results. The base graphics module allows you to make all of the basic charts and plots you’d like from data matrices. You can then save these files into image formats such as JPG, or you can save them as separate PDFs. You can use ggplot2 for more advanced plots such as complex scatter plots with regression lines.

Modern features for advanced data visualization in R:

ggplot2 (The grammar of graphics for R)

  • Recently had a very significant upgrade (to the point where old code will break)
  • You can do faceting and zoom into facets

htmlwidgets (Reusable components)

  • Brings the best of JavaScript visualization to R
  • Has a fantastic gallery you can borrow (or steal) from

Leaflet (Interactive maps for the web)

  • Nice Javascript maps that you can embed in web applications

tilegramsR (Proportional maps)

  • Create maps that are proportional to the population
  • Makes it possible to create more interesting maps than those that only highlight major cities due to population density

Summary:

Questions to Ask Before Choosing One of the Languages

1 — Do you have prior programming experience in other languages?

· If you have some programming experience, Python might be the language for you. Its syntax is more similar to other languages than R’s syntax.

· Python reads much like a verbal language. This readability emphasizes development productivity, while R’s unstandardized code style can be a hurdle in the programming process.

2 — Do you want to go into academia or industry?

· The real difference between Python and R comes down to being production ready.

· Python is a full-fledged programming language and many organizations use it in their production systems.

· On the other hand, R is a statistical programming environment favored by many in academia.

· Only recently, with the availability of open-source R libraries, has industry started using R.

3 — Do you want to learn “machine learning” or “statistical learning”?

· Machine learning is a subfield of Artificial Intelligence, while Statistical Learning is a subfield of Statistics.

· Machine learning has a greater emphasis on large-scale applications and prediction accuracy; while statistical learning emphasizes models and their interpretability, and precision and uncertainty.

Since R was built as a statistical language, it is much better suited to statistical learning. It represents the way statisticians think quite well, so anyone with a formal statistics background can pick up R easily. Python, on the other hand, is a better choice for machine learning, given its flexibility for production use, especially when data analysis tasks need to be integrated with web applications.

4 — Do you want to do a lot of software engineering?

· Python is for you. It integrates much better than R in the larger scheme of things in an engineering environment.

5 — Do you want to visualize your data in beautiful graphics?

· For rapid prototyping and working with datasets to build machine learning models, R inches ahead.

· Python has caught up somewhat with advances in Matplotlib, but R still seems much better at data visualization (ggplot2, htmlwidgets, Leaflet).

Conclusion

Python is a powerful, versatile language that programmers can use for a variety of tasks in computer science. Learning Python will help you develop a well-rounded data science toolkit, and it is a language you can pick up fairly easily even as a non-programmer.

On the other hand, R is a programming environment specifically designed for data analysis that is very popular in the data science community. You’ll need to understand R if you want to make it far in your data science career.

The reality is that learning both tools and using them for their respective strengths can only improve you as a data scientist. Versatility and flexibility are traits of any data scientist at the top of their field. The Python vs R debate confines you to one programming language; look beyond it and embrace both tools for their respective strengths.

Bottom Line: Both languages are winners…

