Announcing Terality — Pandas fully managed, 100x faster

Guillaume Duvaux · Published in Terality · Apr 6, 2021

Terality is a distributed data processing engine for Data Scientists to execute all their Pandas code 100 times faster, even on terabytes of data, by only changing one line of code. Terality is hosted, so there is no infrastructure to manage, and memory is virtually unlimited. Data Scientists can be up and running within minutes, even on an existing Pandas codebase.

Pandas does not scale well

When I met Adrien last year, he was working as a Data Scientist; he is now my co-founder. One thing frustrated him above all: executing Pandas code at scale.

He had two main issues:

  1. Managing memory: Pandas runs in memory and sometimes needs up to 10x the dataset’s size in memory. He constantly had to make sure he had enough memory available on his machine (see the snippet after this list).
  2. Delays in code execution: Pandas runs on a single core by default. Even with eight cores available, Pandas would use only one. Adrien experienced delays ranging from a few minutes to hours for the most complex Pandas functions. Even if a few minutes sounds acceptable, waiting two minutes, twenty times a day, is very frustrating.
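To make the first issue concrete, here is a small, purely illustrative Pandas snippet (the file and column names are made up) that checks how much memory a loaded DataFrame actually occupies:

```python
import pandas as pd

# Load a dataset and measure its in-memory footprint. With object (string)
# columns, this can be several times the on-disk file size.
df = pd.read_csv("transactions.csv")  # hypothetical file
in_memory_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"In-memory size: {in_memory_mb:.0f} MB")

# Many operations also create intermediate copies, so peak memory during a
# sort or merge can be far higher than the DataFrame itself, and it all runs
# on a single core by default.
sorted_df = df.sort_values("amount")  # hypothetical column
```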

As a Data Scientist and a developer, Adrien tried to find solutions. Among them:

  • Provisioning a bigger machine: this solved only the memory issue by giving him more RAM. However, his code still ran slowly, especially complex methods such as “sort_values” and “merge”. What’s more, Adrien still relied on another team to set up, provision, and maintain his infrastructure. He wasn’t autonomous, and that caused a lot of friction and lost time.
  • Using sub-sampling: working with smaller datasets solved both his memory and speed issues, but the results weren’t good enough since he was obviously missing data. And figuring out how to sub-sample properly took time.
  • Using the available open source solutions: great tools have been created to parallelize Pandas, such as Dask, Pandarallel, and Modin. These solutions are powerful but often incomplete (only a subset of Pandas functions is implemented), don’t manage infrastructure for you, don’t offer the same UX as Pandas, and are hard to understand.
  • Using Spark: Spark is well known nowadays, and it’s a great engine. But you still need to anticipate, plan for, and manage the infrastructure it runs on (even with Databricks, EMR, or Dataproc). What’s more, even with the PySpark API, Adrien had to learn a new syntax and change his existing code, and things were not as efficient as with Pandas.

From there, we talked to 50+ data scientists working in many industries, such as software, financial services, logistics, and gaming. Adrien and I quickly validated that these issues were widely shared. Starting from a few GB of data, Data Scientists struggle with Pandas. What’s more, most of the Data Scientists we interviewed were not software engineers, and struggled even more with the existing open source solutions or Spark. In short, Data Scientists love Pandas and want to stick with the Pandas syntax.

Building a fully managed and 100x faster solution: Hello Terality!

We decided to build the ultimate data processing engine, capable of executing all Pandas functions 100x faster, even on terabytes of data, without having to handle any infrastructure, and by only changing one line of code. In the short term, our ambition is to solve all the pain points Data Scientists have with executing Pandas at scale and with its alternatives.

From the very beginning, we wanted to focus strongly on the data scientists’ experience. Our mission is to improve data scientists’ lives.
To achieve that, we need to build the best experience possible for them, including:

  • Managing the whole infrastructure on our side, so that data scientists can work autonomously, without having to maintain infrastructure or worry about memory.
  • Accelerating Pandas’ code execution by 100x, so that data scientists can perform more experiments, execute more code, and stop being frustrated by waiting times.
  • Parallelizing 100% of the Pandas API, so that Data Scientists don’t have to go back and forth between Pandas and Terality. Everything in Pandas would work the same way in Terality.
  • Offering the exact same syntax as Pandas, so that data scientists don’t have to learn a new one and can run Terality on existing code without any modification except the import line (see the sketch below).
[Image] Pandas versus Terality: only change the import line and experience Pandas at scale.
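As an illustration, here is a minimal sketch of what that one-line change looks like in practice. The “import terality as pd” alias, the file path, and the column names are assumptions made for this example; the point is simply that everything after the import stays plain Pandas.

```python
# Before: import pandas as pd
import terality as pd  # the only line that changes (alias assumed for illustration)

# Everything below is unchanged Pandas syntax.
df = pd.read_csv("s3://my-bucket/transactions.csv")  # hypothetical dataset path

totals = (
    df.groupby("customer_id", as_index=False)["amount"]
      .sum()
      .rename(columns={"amount": "total_amount"})
)
result = df.merge(totals, on="customer_id").sort_values("total_amount", ascending=False)
print(result.head())
```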

Amazing first results from our alpha testers

We’ve pushed hard over the last few months to release a first version of Terality that already provides value to Data Scientists. Here are the first benefits we observed with our alpha testers:

  • With Terality, memory is virtually unlimited ⇒ no more memory management. Data Scientists are completely independent, as the infrastructure is fully managed by Terality; they don’t have to think about memory anymore.
  • With Terality, code runs (way) faster ⇒ no more waiting. Less time is spent on data preparation, leaving more time to craft better models; more predictions run in the same timeframe, and projects are released to production faster.

What’s next for Terality? Private beta.

Today, we are starting to accept teams and companies into our private beta. If you are interested in joining early and influencing the direction of Terality, sign up here and follow us on Twitter or LinkedIn.
