Stunning Python Libraries you should know to work on Data Science projects.

Amsavalli Mylasalam
Published in Variablz Academy
4 min read · Aug 16, 2022

As a data scientist, I do data analysis and manipulation every day. Knowing the right library for the right task greatly reduces working hours. So I thought I would share the 3 primary libraries I use for my regular tasks.

Credits to Aatomz

1. Pandas Library

pandas was released on 11 January 2008

The pandas library 📚 is one of the most widely used libraries 📚 in Python. It is essential for the data manipulation needed in data analysis 🧐 or machine learning.

The pandas library helps us work with structured data efficiently through its data structures and functions. The name pandas does not refer to the animal 🐼; it comes from "panel data", which means a structured dataset. DataFrame and Series are the two main classes in pandas. Python with pandas is used in a wide range of fields.

Why should you use Pandas Library?

  • The Pandas library provides a systematic method to manage and explore data.
  • Data alignment and indexing are among the strongest features of the pandas library.
  • Pandas library provides tools for loading data into in-memory data
    objects from different file formats.
  • The file formats we can load into pandas include comma-separated values (CSV), Excel (XLSX), plain text (TXT), JSON, XML, HTML tables, Hierarchical Data Format (HDF5), and SQL query results. Compressed files such as ZIP can be read directly.
  • Handling of missing data is integrated within pandas libraries.
  • Using pandas features, we can easily clean 🧼 up our data.
  • Reshaping and pivoting of data sets.
  • Label-based slicing, indexing, and subsetting of large data sets.
  • Filter, Sort, and Transpose
  • Function application such as lambda, aggregate, group by, map, transform,
    and pipe.
  • Pandas can help to combine, concatenate, join and merge data.
  • Pandas plays an essential role in descriptive statistics and random sampling.
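
A few of the features listed above can be sketched in a handful of lines. This is only a minimal illustration, not a complete workflow; the `sales` and `regions` tables, their column names, and their values are made up for the example.

```python
import pandas as pd

# Hypothetical sales records; one amount is missing on purpose.
sales = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Madurai", "Madurai"],
    "product": ["pen", "book", "pen", "book"],
    "amount": [10.0, None, 30.0, 40.0],
})

# Handling missing data: replace the missing amount with 0.
sales["amount"] = sales["amount"].fillna(0.0)

# Group by and aggregate: total amount per city.
totals = sales.groupby("city", as_index=False)["amount"].sum()

# Merge: attach a region label from a second (hypothetical) table.
regions = pd.DataFrame({
    "city": ["Chennai", "Madurai"],
    "region": ["North TN", "South TN"],
})
report = totals.merge(regions, on="city", how="left")

# Reshape and pivot: one row per city, one column per product.
wide = sales.pivot(index="city", columns="product", values="amount")

print(report)
print(wide)
```

Filling the missing value with 0 is just one choice; depending on the data, dropping the row with `dropna()` or filling with a mean may be more appropriate.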

The downside of the Pandas library

  • Poor support for 3-dimensional arrays. We cannot efficiently process image data using the pandas library.
  • Pandas has a steep learning curve. It offers so many functions that learning them all is a time-consuming process.
  • Processing big datasets is limited because pandas loads data into memory, which can lead to out-of-memory errors.
  • Slow on large datasets, with limited multicore support.
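
One common way to soften the memory limit, sketched here as an assumption rather than something from my own projects, is to read a file in chunks with the `chunksize` parameter of `pd.read_csv` and aggregate as you go. The tiny in-memory CSV below is a stand-in for a large file on disk.

```python
import io

import pandas as pd

# Stand-in for a large CSV file; in practice this would be a file path.
csv_text = "value\n1\n2\n3\n4\n5\n"

total = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most 2 rows each
# instead of loading the whole file into memory at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["value"].sum())
    rows += len(chunk)

print(rows, total)
```

This only helps when the computation can be done piece by piece, such as sums or counts.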

2. Dask Library

The Dask library was released on 8 January 2015

For dealing with extensive data sets and parallel computing, Dask is the best ☝️ choice.

Dask is an extensible open-source Python 🐍 library for parallel computing.

Why should you use the Dask Library?

  • Dask feels familiar because it parallelizes NumPy arrays and pandas DataFrames.
  • Dask runs resiliently on clusters with 1000s of cores.
  • Dask is suitable for fast numerical algorithms.
  • Dask supports a real-time task framework that extends Python's concurrent.futures interface.
  • The higher-level Dask APIs are Dask Array, Dask DataFrame, Dask Bag, Dask Delayed, and Dask-ML.
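
As a rough sketch of how Dask Delayed builds and runs a task graph, the example below splits a computation into four independent pieces. The `load` and `partial_sum` functions are hypothetical stand-ins for reading and processing one partition of a larger dataset.

```python
from dask import delayed


def load(part):
    # Hypothetical stand-in for reading one partition of a big dataset.
    return list(range(part * 3, part * 3 + 3))


def partial_sum(values):
    # Work done independently on each partition.
    return sum(values)


# Nothing runs yet: each delayed(...) call only records a task in a graph.
lazy_parts = [delayed(partial_sum)(delayed(load)(p)) for p in range(4)]
lazy_total = delayed(sum)(lazy_parts)

# compute() executes the graph; independent tasks may run in parallel.
total = lazy_total.compute()
print(total)
```

The same lazy-then-compute pattern applies to Dask DataFrame and Dask Array, which follow the pandas and NumPy interfaces.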

The downside of Dask

  • Dask is not good at optimizing complex SQL queries.
  • Index, sort, and shuffle operations do not perform well in Dask parallel computing.

3. Polars Library

Credits to: www.pola.rs

The Polars project was started in March 2020 by Ritchie Vink.

Polars is a DataFrame library written in the Rust programming language that uses Apache Arrow as its foundation.

Polars is a warp-speed DataFrame library for Python and Rust.

Polars does not use an index for the data frame. It builds on Apache Arrow because Arrow is efficient in areas like load time, memory usage, and computation.

Why should you use Polars Library?

  • The Polars library provides a fast and easy way to work with large datasets.
  • Polars is a data manipulation and analysis library written in Rust with APIs in Python.
  • The Polars library gives full support for numerical calculations.
  • String manipulation and data frame operations like filtering, joining, intersection, and aggregations such as groupby are easy to do with the Polars library.
  • Parallelization, CPU optimization, and the Arrow2 framework are what make Polars so fast.
  • When building data pipelines, Polars is the best tool.

The downside of Polars

  • Compatibility is a weak point of Polars.

I have written about the 3 primary libraries I use in my day-to-day job, but that doesn't mean I won't use other libraries. As data scientists, we should always stay up to date with the tech. There are dozens of libraries available now. I have tried many, but these 3 are the ones that appeal to me most for my data science work. What is your favorite library for data analysis tasks?

Thanks & Regards

Amsavalli

Connect with me on LinkedIn for more data science insights!

https://www.linkedin.com/in/amsavalli-datascientist/
