Which Languages Should You Learn For Data Science?

Peter Gleeson
Aug 31, 2017 · 12 min read

R

What you need to know

License

Free!

Pros

  • Excellent range of high-quality, domain-specific, open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
  • The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.
  • Data visualization is a key strength with the use of libraries such as ggplot2.

Cons

  • Performance. There are no two ways about it: R is not a quick language.
  • Domain specificity. R is fantastic for statistics and data science purposes, but less so for general purpose programming.
  • Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.

Verdict — “brilliant at what it’s designed for”

R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.

Python

What you need to know

License

Free!

Pros

  • Python is a very popular, mainstream general purpose programming language. It has an extensive range of purpose-built modules and community support. Many online services provide a Python API.
  • Python is an easy language to learn. The low barrier to entry makes it an ideal first language for those new to programming.
  • Packages such as pandas, scikit-learn and TensorFlow make Python a solid option for advanced machine learning applications.

Cons

  • Type safety: Python is a dynamically typed language, which means you must take due care. Type errors (such as passing a String as an argument to a method which expects an Integer) are to be expected from time to time.
  • For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. Among general purpose languages, there are faster and safer alternatives to Python.
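
To make the type safety point concrete, here is a minimal sketch. The function name mean_age and the sample data are hypothetical, chosen only for illustration. It shows that Python accepts the faulty call without complaint and only raises a TypeError once the offending line actually runs.

```python
def mean_age(ages):
    """Return the arithmetic mean of a list of ages."""
    return sum(ages) / len(ages)


# Works as intended: every value is an integer.
print(mean_age([31, 45, 27]))

try:
    # A string has slipped in, for example from an unparsed text column.
    mean_age([31, "45", 27])
except TypeError as err:
    # Nothing flags the mistake ahead of time; it surfaces only at runtime.
    print(f"TypeError: {err}")
```

Optional type hints combined with an external type checker can catch some of these mistakes earlier, but the interpreter itself does not enforce them.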

Verdict — “excellent all-rounder”

Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading), and Python’s generality makes it ideally suited to this. Libraries such as Google’s TensorFlow make Python a very exciting language to work in for machine learning.
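
As a rough illustration of what a small ETL step might look like, the sketch below uses only Python’s standard library. The sample data, the table name people and the function names extract, transform and load are all assumptions made for this example; a real pipeline would read from files, APIs or databases instead.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,age
Ada,36
Grace,
Linus,28
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing age and convert age to int."""
    return [(row["name"], int(row["age"])) for row in rows if row["age"]]


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT name, age FROM people").fetchall())
```

The row with the missing age is dropped during the transform step, so only two records reach the database.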

SQL

What you need to know

License

Varies — some implementations are free, others proprietary

Pros

  • Very efficient at querying, updating and manipulating relational databases.
  • Declarative syntax often makes SQL a very readable language. There’s no ambiguity about what SELECT name FROM users WHERE age > 18 is supposed to do!
  • SQL is very widely used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.
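
The query quoted above can be tried out from Python in a few lines. This sketch uses the standard library’s sqlite3 module rather than SQLAlchemy so that it runs without extra installs; the users table and its rows are made up for the example.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ada", 36), ("Tim", 17), ("Grace", 45)],
)

# The declarative query from the text: say what you want, not how to get it.
rows = conn.execute("SELECT name FROM users WHERE age > 18").fetchall()
adults = sorted(name for (name,) in rows)
print(adults)
```

The result is sorted in Python because the query itself has no ORDER BY clause and so makes no promise about row order.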

Cons

  • SQL’s analytical capabilities are rather limited: beyond aggregating, summing, counting and averaging data, you have few options.
  • For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.
  • There are many different implementations of SQL, such as PostgreSQL, SQLite and MariaDB. They are all different enough to make inter-operability something of a headache.
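
For a sense of the aggregation that SQL does handle well, here is a small sketch, again using Python’s built-in sqlite3 module. The orders table and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0)],
)

# Counting, summing and averaging per customer in a single statement.
summary = conn.execute(
    """
    SELECT customer, COUNT(*), SUM(amount), AVG(amount)
    FROM orders
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

for customer, n_orders, total, average in summary:
    print(customer, n_orders, total, average)
```

Anything more involved, such as fitting a model, is typically handed over to a language like R or Python once the data has been queried.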

Verdict — “timeless and efficient”

SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.

Java

What you need to know

License

Version 8 — Free! Legacy versions, proprietary.

Pros

  • Ubiquity. Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.
  • Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.
  • Java is a high-performance, general purpose, compiled language. This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.

Cons

  • For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.
  • Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.

Verdict — “a serious contender for data science”

There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and type safety are real advantages. However, you’ll be without the range of stats-specific packages available to other languages. That said, it is definitely one to consider, especially if you already know R and/or Python.

Scala

What you need to know

License

Free!

Pros

  • Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.
  • Multi-paradigmatic: Scala programmers can have the best of both worlds, with both object-oriented and functional programming paradigms available to them.
  • Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.

Cons

  • Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
  • The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.

Verdict — “perfect, for suitably big data”

When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too. Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.

Julia

What you need to know

License

Free!

Pros

  • Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.
  • Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.
  • Readability. Many users of the language cite this as a key advantage.

Cons

  • Maturity. Because Julia is a new language, some users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.
  • Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).

Verdict — “one for the future”

The main issue with Julia is one that it cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives, Python and R. But if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.

MATLAB

What you need to know

License

Proprietary — pricing varies depending on your use case

Pros

  • Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.
  • Data Visualization. MATLAB has some great inbuilt plotting capabilities.
  • MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.

Cons

  • Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. This is something you should give real consideration to. There are free alternatives available, such as Octave.
  • MATLAB isn’t an obvious choice for general-purpose programming.

Verdict — “best for mathematically intensive applications”

MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science. The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality; indeed, MATLAB was specifically designed for this.

Other Languages

There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!

C++

C++ is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.

JavaScript

With the rise of Node.js in recent years, JavaScript has become more and more a serious server-side language. However, its use in data science and machine learning domains has been limited to date (although check out brain.js and synaptic.js!). It suffers from the following disadvantages:

  • Few relevant data science libraries and modules are available. This means there is no real mainstream interest or momentum.
  • Performance-wise, Node.js is quick. But JavaScript as a language is not without its critics.

Perl

Perl is known as a ‘Swiss-army knife of programming languages’, due to its versatility as a general-purpose scripting language. It shares a lot in common with Python, being a dynamically typed scripting language. But it has not seen anything like the popularity Python has in the field of data science.

Ruby

Ruby is another general purpose, dynamically typed interpreted language. Yet it also hasn’t seen the same adoption for data science as has Python.

Conclusion

Well, there you have it — a quickfire guide to which languages to consider for data science. The key here is to understand your usage requirements in terms of generality vs specificity, as well as your personal preferred development style of performance vs productivity.

freeCodeCamp.org

This is no longer updated. Go to https://freecodecamp.org/news instead

Peter Gleeson

Written by

Founder Associate, Revolut