How to speed up pandas 10x with datatable

Marek Głowacki
Feb 11, 2019

Pandas is a sweet and powerful library for data analysis, but sometimes quite slow…

There is a lot of information on how to speed it up once the data is already loaded into a dataframe.

But there is not so much about loading/saving dataframes, unless you are into C++.

There is a simple trick… Don’t use pandas for loading/saving dataframes.

Let’s try to load and save an old CSV from Kaggle, e.g. Santander (117MB), with pandas:
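A minimal sketch of such a pandas baseline is shown here. The helper name `time_pandas_roundtrip` and the file names in the usage comment are hypothetical, and this is not necessarily the exact code behind the numbers in this post.

```python
import time

import pandas as pd


def time_pandas_roundtrip(src_path, dst_path):
    """Load a CSV with pandas, write it back, and return (df, load_s, save_s)."""
    start = time.perf_counter()
    df = pd.read_csv(src_path)
    load_s = time.perf_counter() - start

    start = time.perf_counter()
    df.to_csv(dst_path, index=False)
    save_s = time.perf_counter() - start

    return df, load_s, save_s


# Hypothetical usage; "train.csv" stands in for the Kaggle file:
# df, load_s, save_s = time_pandas_roundtrip("train.csv", "train_copy.csv")
# print(f"load: {load_s:.2f}s, save: {save_s:.2f}s")
```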

The same task with datatable:

50x faster.

Moreover, after loading the csv, I converted the datatable dataframe to a pandas dataframe… so there is no need to learn new syntax. All you need is to import datatable and change two lines.

If you want to know more about datatable:

Caveats:

  • more testing is required, but with only two lines of code changed the cost is pretty low
  • the results of loading with datatable and converting may differ from what pandas loads directly
  • both libraries are under active development but have different goals, so in the foreseeable future datatable will be faster while pandas will have greater coverage. With the trick above you can have your cake and eat it too :)
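One cheap way to check the second caveat is to compare column dtypes between the two loaders. This is only a sketch: `dtype_differences` is a hypothetical helper, and it looks at dtypes only, not at the values.

```python
import pandas as pd


def dtype_differences(left, right):
    """Return {column: (left_dtype, right_dtype)} for shared columns whose dtypes differ."""
    shared = [col for col in left.columns if col in right.columns]
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    }


# Hypothetical usage, with df_pandas from pd.read_csv and
# df_datatable from dt.fread(...).to_pandas():
# print(dtype_differences(df_pandas, df_datatable))
```

An empty dict means the shared columns have matching dtypes. For a stricter check, `pd.testing.assert_frame_equal` also compares the values.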

Note: the pandas gist also includes modin, a drop-in multicore replacement for pandas. Unfortunately, for the datasets I’ve tested so far there is no improvement over regular pandas.
