One- or multi-dimensional data interpolation made easy with the Python Scipy package.

Image source: Created by the Author

What are we trying to achieve?

Interpolation may sound like a fancy mathematical exercise, but in many ways, it is much like what machine learning does.

  • Start with a limited set of data points relating multiple variables
  • Interpolate (basically, create a model; a quick sketch follows this list)
  • Construct a new function that can be used to predict any future or new…
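The excerpt names Scipy but does not show the exact call the article uses; as a minimal sketch, assuming scipy.interpolate.interp1d is the workhorse, the bullet points above translate into a few lines of Python:

import numpy as np
from scipy.interpolate import interp1d

# A limited set of known data points (the "measurements")
x = np.linspace(0, 10, 11)
y = np.sin(x)

# "Interpolate": build a model, i.e., a callable function, from the samples
f_cubic = interp1d(x, y, kind="cubic")

# The new function predicts values at points we never measured
x_new = 2.5
print(f_cubic(x_new), np.sin(x_new))  # interpolated vs. true value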


How to use the Numpy text reading utility methods for more efficient data analytics tasks.

Image source: Pixabay (free to use)

Where did that Numpy array come from?

When we do sophisticated statistical analyses or build complex machine learning models, we often forget that the most likely source of the data was a plain text file, read from a disk drive or over an internet connection (e.g., parsed from an HTML page).

This is a fact. Numeric data, used in…
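A short illustration of the idea, assuming np.genfromtxt is one of the text-reading utilities the article covers (the excerpt does not confirm which ones it uses):

import io
import numpy as np

# Pretend this string arrived from a disk file or an HTTP response
raw = io.StringIO("1.0, 2.0, 3.5\n4.2, 5.1, 6.9\n7.0, , 9.3")

# genfromtxt parses delimited text and turns missing fields into nan
arr = np.genfromtxt(raw, delimiter=",")
print(arr.shape)  # (3, 3)
print(arr)        # the empty field shows up as nan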


Why you should use Parquet files with Pandas to make your analytics pipeline faster and more efficient.

Image source: Pixabay (Free to use)

Why Parquet?

Comma-separated values (CSV) is the most widely used flat-file format in data analytics. It is simple to understand and work with. CSV files perform decently in small to medium data regimes. However, as we progress towards working with larger datasets, there are some excellent reasons to move towards file formats…
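As a minimal sketch of the switch, assuming a Parquet engine such as pyarrow is installed (the column names below are made up for illustration):

import numpy as np
import pandas as pd

# A toy DataFrame standing in for a large analytics table
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.random(1_000_000),
    "group": np.random.choice(list("ABCD"), 1_000_000),
})

df.to_csv("data.csv", index=False)          # plain-text, row-oriented
df.to_parquet("data.parquet", index=False)  # compressed, columnar

# Column pruning: read only what you need, one big reason Parquet is faster
subset = pd.read_parquet("data.parquet", columns=["group", "value"])
print(subset.head())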


Producing high-quality documentation can be made easy and intuitive with the help of a little Python package. Include Markdown and LaTeX too.

Image source: Pixabay (Free to use)

Documentation is important

We all love good and comprehensive documentation when we use a new library (or re-use our favorite one for the millionth time), don’t we?

Imagine how you would feel if all the docs were taken away from the Scikit-learn or TensorFlow website. You would feel pretty powerless, wouldn’t you?

Documentation is…
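The excerpt does not name the package the article is about, so the following is only a generic illustration of the raw material such tools work from: Python docstrings, which several documentation generators can render with Markdown and LaTeX intact.

def rms(values):
    """Compute the root-mean-square of a sequence of numbers.

    Many documentation generators render **Markdown** emphasis and LaTeX like

    $$\\mathrm{RMS}(x) = \\sqrt{\\tfrac{1}{n}\\sum_{i=1}^{n} x_i^2}$$

    straight from this docstring into HTML pages.
    """
    n = len(values)
    return (sum(v * v for v in values) / n) ** 0.5

print(rms([3, 4]))  # about 3.54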


How to perform arbitrary-precision computation and much more math (fast, too) than is possible with the built-in math library in Python.

Image source: Pixabay (Free to use)

Limitless math?

Sounds like a catchy title? Well, what we really mean by that term is arbitrary-precision computation, i.e., breaking away from the restriction of the 32-bit or 64-bit arithmetic that we are normally familiar with.

Here is a quick example.
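The article’s own quick example is not shown in this excerpt; here is an illustrative one, assuming mpmath (a popular arbitrary-precision library, though the excerpt does not say which library the article uses):

import math
from mpmath import mp, sqrt

mp.dps = 50             # work with 50 significant decimal digits
print(mp.pi)            # pi to 50 digits
print(sqrt(2))          # square root of 2 at the same precision

# Ordinary 64-bit floats top out around 15-17 significant digits
print(math.sqrt(2))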


How to find the best-matching statistical distributions for your data points — in an automated and easy way. And, then how to extend the utility further.

What’s our goal?

Image source: Prepared by the author with Pixabay image (Free to use)

You have some data points. Numeric, preferably.

And you want to find out which statistical distribution they might have come from. Classic statistical inference problem.

There are, of course, rigorous statistical methods to accomplish this goal. But, maybe you are a busy data scientist. Or, a busier software engineer who…
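The excerpt does not name the tool the article builds on, so here is only a generic sketch of the underlying idea with scipy.stats: fit a few candidate distributions to the data and rank them by the Kolmogorov-Smirnov statistic.

import numpy as np
from scipy import stats

# "Unknown" data: secretly drawn from a gamma distribution
data = np.random.gamma(shape=2.0, scale=3.0, size=2000)

candidates = ["norm", "gamma", "lognorm", "expon"]
results = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                          # maximum-likelihood fit
    ks_stat, p_value = stats.kstest(data, name, args=params)
    results.append((name, ks_stat))

# A smaller KS statistic means closer agreement with the data
for name, ks in sorted(results, key=lambda t: t[1]):
    print(f"{name:8s} KS = {ks:.4f}")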


How to build and train an AI model to identify various common anomaly patterns in time-series data

What do we want to achieve?

We want to train an AI agent or model that can do something like this,

Image source: Prepared by the author using this Pixabay image (Free to use)

Variances, anomalies, shifts

A little more specifically, we want to train an AI agent (or model) to identify/classify time-series data for the following (a rough code sketch appears after this list):

  • low/medium/high variance
  • anomaly frequencies (low or high fraction of anomalies)
  • anomaly scales (are the anomalies too far from…
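The excerpt does not show the article’s training data, features, or model, so what follows is only a rough sketch of the general recipe under assumed choices: generate labelled synthetic series, compute a few hand-crafted features (variance, anomaly fraction, anomaly magnitude), and fit a scikit-learn classifier. The article may well use a different featurization or a neural model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

def make_series(variance_level):
    """One synthetic series labelled by its variance level (0=low, 1=medium, 2=high)."""
    sigma = [0.1, 1.0, 3.0][variance_level]
    return rng.normal(0.0, sigma, size=200)

def featurize(series):
    """Hand-crafted features; purely illustrative, not the article's."""
    z = (series - series.mean()) / (series.std() + 1e-9)
    anomalies = np.abs(z) > 3.0
    scale = np.abs(z[anomalies]).mean() if anomalies.any() else 0.0
    return [series.var(), anomalies.mean(), scale]

# Build a small labelled dataset and train the classifier
X, y = [], []
for label in (0, 1, 2):
    for _ in range(200):
        X.append(featurize(make_series(label)))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([featurize(make_series(2))]))  # expect class 2 (high variance)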


Notes from Industry

A simple and intuitive way to create synthetic (artificial) time-series data with customized anomalies — particularly suited to industrial applications.

Image source: Author created with Pixabay (Free to use) image

Why synthetic time-series?

As I wrote in my highly-cited article, “a synthetic dataset is a repository of data that is generated programmatically. So, it is not collected by any real-life survey or experiment. …
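As a minimal sketch of the basic recipe (the article’s generator and its industrial-flavoured options are not shown in this excerpt): build a series programmatically from trend, seasonality, and noise, then inject anomalies at a chosen fraction and scale.

import numpy as np

rng = np.random.default_rng(0)

n = 500
t = np.arange(n)
baseline = 10 + 0.01 * t                      # slow drift
seasonal = 2.0 * np.sin(2 * np.pi * t / 50)   # periodic component
noise = rng.normal(0, 0.3, n)                 # measurement noise
series = baseline + seasonal + noise

# Inject customized anomalies: pick a fraction of points and shift them
anomaly_fraction, anomaly_scale = 0.02, 8.0
idx = rng.choice(n, size=int(anomaly_fraction * n), replace=False)
series[idx] += anomaly_scale * rng.choice([-1.0, 1.0], size=idx.size)

print(f"{idx.size} anomalies injected at indices {np.sort(idx)}")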


Running an ML algorithm on a multi-GB dataset with Dask. This would have been difficult with standard Pandas or Scikit-learn.

Image source: Compiled by the author based on Pixabay-image-1 and Pixabay-image-2

What is Dask and where is it used?

Dask is a feature-rich, easy-to-use, flexible library for parallelized computing in Python. It is specifically optimized and designed for data science and analytics workloads.

In most common scenarios, Dask comes to the rescue when you are dealing with large datasets that would have been tricky (if not downright impossible) to…
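A minimal sketch of the pattern, not the article’s exact pipeline; the file glob and column names below are made up for illustration:

import dask.dataframe as dd

# Lazily point at a multi-GB collection of CSV files (nothing is loaded yet)
ddf = dd.read_csv("data/part-*.csv")

# Familiar Pandas-style operations; work happens in parallel chunks on .compute()
summary = ddf.groupby("category")["value"].mean().compute()
print(summary)

# For model training on data this size, the companion dask-ml package offers
# scikit-learn-style estimators that accept Dask collections.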


Whenever possible, use df.values (the underlying NumPy array) instead of the iterrows method for faster iteration. Even better, vectorize where possible.

Image source: Pixabay (Free to use)

Iteration with Pandas DataFrame

As data scientists, all of us have been there.

We are given a large Pandas DataFrame and asked to check some relationships between various fields in the columns — in a row-by-row fashion. It could be some logical operation or some sophisticated mathematical transformation on the raw data.

Essentially, it…
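A small illustration of the three approaches on a made-up DataFrame; exact timings vary, but vectorized code is typically the fastest, a loop over .values next, and iterrows the slowest:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.random(100_000),
                   "b": np.random.random(100_000)})

# 1. iterrows: convenient but slow (a Series object is built for every row)
total = 0.0
for _, row in df.iterrows():
    total += row["a"] * row["b"]

# 2. .values: iterate over the raw NumPy array instead of Series objects
total = 0.0
for a, b in df[["a", "b"]].values:
    total += a * b

# 3. Vectorized: let NumPy/Pandas run the loop in compiled code
total = (df["a"] * df["b"]).sum()
print(total)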

Tirthajyoti Sarkar

Stories on Artificial Intelligence, Data Science, and ML | Speaker | Open-source contributor | Mentor | Author of multiple DS books
