Jupyter notebook Trick — 2==>10 Data Manipulation and Wrangling Tricks

Aravind GJ
5 min read · Sep 16, 2020

In this article, I will walk you through some data manipulation and data wrangling hacks, tips, and tricks that have served me well. I hope they help you in your journey and role as well!

Contents

We’ll cover these data manipulation and data wrangling hacks, tips, and tricks:

  1. Data Science Hack #1 — Select Data Type using Pandas
  2. Data Science Hack #2 — Extract E-mail from text
  3. Data Science Hack #3 — Remove Emojis from Text
  4. Data Science Hack #4 — Image Augmentation
  5. Data Science Hack #5 — Resizing Images
  6. Data Science Hack #6 — Apply Pandas Operations in Parallel
  7. Data Science Hack #7 — Pandas Melt
  8. Data Science Hack #8 — Divide equal proportion of classes (Classification)
  9. Data Science Hack #9 — Reading Data from multiple files
  10. Data Science Hack #10 — Splitting Dataframe using str.split()

Trick #1 — Select Data Type using Pandas

At the start of my data science journey, I used to write an ‘if’ condition to separate out continuous and categorical variables for data analysis. This was a taxing task as it consumed a lot of unnecessary time and energy. Then I came across a simple Pandas hack, the select_dtypes method, which made my life so much simpler!
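
Here is a minimal sketch of the idea using `DataFrame.select_dtypes`. The toy dataframe and its column names are made up for illustration; the second call uses `exclude` rather than `include="object"` so that text columns are picked up regardless of how your Pandas version stores strings.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [52000.0, 61000.5, 58000.0],
    "city": ["Pune", "Delhi", "Chennai"],
    "is_member": [True, False, True],
})

# Continuous variables: every integer and float column.
numeric_df = df.select_dtypes(include="number")

# Categorical (text) variables: whatever is neither numeric nor boolean.
categorical_df = df.select_dtypes(exclude=["number", "bool"])

print(list(numeric_df.columns))
print(list(categorical_df.columns))
```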

Trick #2 — Extract E-mail from text

One of the most important parts of digital marketing is getting the e-mail IDs of your customers. Is there any way to extract these IDs from text? Of course there is: RegEx to the rescue!

This hack provides the regular expression you may use to extract E-mail ids from the text!
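
As a rough sketch, a pattern like the one below catches most everyday addresses. It is a deliberately simple regular expression of my own choosing, not a full validator of the e-mail address standard, and the sample text and addresses are invented.

```python
import re

# Simple pattern: local part, "@", domain, and a top-level domain of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Invented sample text.
text = "Contact sales@example.com or support.team@example.org for details."

emails = EMAIL_PATTERN.findall(text)
print(emails)
```

Unusual but valid addresses (quoted local parts, for instance) will slip through, so treat the result as a first pass.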

Trick #3 — Remove Emojis from Text

Preprocessing is one of the key steps for improving the performance of any machine learning model. One of the main reasons for text preprocessing is to remove unwanted characters from text like punctuation, emojis, links and so on which are not required for our problem statement.

This hack will help you get rid of these unnecessary emojis!
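
One common way to do this is a regular expression over the Unicode blocks where emojis live. The sketch below assumes that the listed ranges are enough for your data; they cover the most frequent emojis but not every one, and `remove_emojis` is just an illustrative helper name.

```python
import re

# Character class built from a few Unicode ranges that hold most emojis.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text):
    """Return text with every character in the ranges above removed."""
    return EMOJI_PATTERN.sub("", text)

# \U0001F600 is a grinning face and \U0001F680 is a rocket.
cleaned = remove_emojis("Loved it\U0001F600\U0001F680!")
print(cleaned)
```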

Trick #4 — Image Augmentation

Deep Learning models usually require a lot of data for training. But acquiring massive amounts of data comes with its own challenges. Instead of spending days manually collecting data, you can make use of Image Augmentation techniques.

Image augmentation is the process of generating new images from the existing training images, for example by flipping, rotating, or adding noise to them, so we don’t have to collect the extra images manually.
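
To make the idea concrete, here is a small sketch that uses only NumPy. Dedicated augmentation libraries offer many more transformations; the random array standing in for an image and the `augment` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a real training image: height 4, width 6, 3 colour channels.
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

def augment(img, rng):
    """Return a few simple variants of one image."""
    flipped_lr = np.fliplr(img)  # mirror left to right
    flipped_ud = np.flipud(img)  # mirror top to bottom
    rotated = np.rot90(img)      # rotate 90 degrees counter-clockwise
    # Add Gaussian noise, then clip back into the valid 0-255 pixel range.
    noise = rng.normal(0, 10, size=img.shape)
    noisy = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return [flipped_lr, flipped_ud, rotated, noisy]

augmented = augment(image, rng)
print([a.shape for a in augmented])
```

Each original image yields four extra training images here; the labels stay the same as long as the transformation does not change what the image shows.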

Trick #5 — Resizing Images

While building an image classification model using deep learning, all the images need to be the same size. However, as the data comes from different sources, images may have different shapes.

So, to convert them to the same shape, we can use the resize function from the OpenCV library. This hack will help you convert the images of any shape to a specified shape:

[Figures: original images and resized images]

Trick #6 — Apply Pandas Operations in Parallel

Pandas runs most operations on a single CPU core, so it can be slow on a large dataset. Pandarallel is a simple tool that parallelizes Pandas operations across all your available CPUs. This trick can save a lot of time when the function applied to each row or group is expensive.
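
The sketch below shows the usual pattern as I understand the Pandarallel API: call `pandarallel.initialize()` once, then swap `apply` for `parallel_apply`. The helper names and the toy dataframe are hypothetical, and the Pandarallel import sits inside the function so that the serial baseline still runs when the package is not installed.

```python
import pandas as pd

def row_score(row):
    """Toy per-row computation standing in for an expensive function."""
    return row["a"] ** 2 + row["b"] ** 2

def serial_scores(df):
    # Plain Pandas: rows are processed one after another on a single core.
    return df.apply(row_score, axis=1)

def parallel_scores(df):
    # Third-party package: pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False)  # sets up the worker processes
    # Same call as apply, but the rows are spread across the workers.
    return df.parallel_apply(row_score, axis=1)

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})
baseline = serial_scores(df)
print(baseline.tolist())
```

Calling `parallel_scores(df)` should return the same values as `serial_scores(df)`. On a dataframe this small the cost of starting the workers outweighs any gain, so the speed-up only shows on large data.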

[Figures: the dataset, and timings without and with parallelization]

Trick #7 — Pandas Melt

Pandas’ melt function helps you bring your dataframe into a tidy form by unpivoting it from wide to long format. In pd.melt(), one or more columns are used as identifiers (id_vars), and the remaining columns are stacked into a variable column and a value column. You can “unmelt” the data using the pivot() function:
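
A small sketch of the round trip, with invented names and scores:

```python
import pandas as pd

# Wide format: one row per student, one column per subject (invented data).
wide = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide to long: "name" identifies the row, the other columns are stacked.
long = pd.melt(wide, id_vars="name", var_name="subject", value_name="score")

# Long back to wide: one column per distinct value of "subject".
restored = long.pivot(index="name", columns="subject", values="score").reset_index()
restored.columns.name = None  # drop the leftover "subject" label on the columns

print(long)
print(restored)
```

The long frame has one row per (name, subject) pair, which is the shape most plotting and group-by code expects.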

Trick #8 — Divide equal proportion of classes (Classification)

A common beginner mistake in classification problems is to split the data without preserving the class proportions in the train and test sets, which can lead to misleading results. Sklearn provides an easy way to keep the proportions the same in both sets: the “stratify” parameter of the train_test_split function.

In this example, we pass stratify = y, and you can observe the difference of proportion in both cases — with stratify and without stratify.
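
Here is a minimal version of that comparison on made-up data, assuming an imbalanced label vector with 80 samples of class 0 and 20 of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 80 samples of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# Without stratify, the class ratio in each split depends on the shuffle.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# With stratify=y, both splits keep the 80/20 ratio of the full data.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("share of class 1 in plain test set:", y_test_plain.mean())
print("share of class 1 in stratified test set:", y_test_strat.mean())
print("share of class 1 in stratified train set:", y_train_strat.mean())
```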

Trick #9 Reading Data from multiple files

You will often need to read data from multiple files. For example, a retailer may keep sales data in separate files for each year. In this case, you can use glob, a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, and then read each matching file. Let’s see it in this example:
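
The sketch below first writes three small yearly files into a temporary folder so that it runs on its own; the file names, columns, and figures are invented. With real data you would only need the `glob.glob` and `pd.concat` lines.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as data_dir:
    # Create invented yearly sales files so the example is self-contained.
    for year, amount in [(2018, 100), (2019, 150), (2020, 175)]:
        pd.DataFrame({"year": [year], "sales": [amount]}).to_csv(
            os.path.join(data_dir, f"sales_{year}.csv"), index=False
        )

    # Find every file matching the pattern; sort for a predictable order.
    paths = sorted(glob.glob(os.path.join(data_dir, "sales_*.csv")))

    # Read each file and stack the results into a single dataframe.
    sales = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(sales)
```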

Trick #10 Splitting Column using str.split()

The pandas.Series.str accessor applies vectorized string functions to a Pandas dataframe column, and str.split() is one of them. Let’s say you want to split the names in a dataframe column into first name and last name. Calling split() through pandas.Series.str performs this task.
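
For instance, with a few invented names and assuming that the first space separates first name from last name:

```python
import pandas as pd

# Invented names for illustration.
df = pd.DataFrame({"name": ["Elise Mccann", "Aiden Berger", "Elle Kelley"]})

# Split on the first space only (n=1); expand=True returns separate columns.
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

print(df)
```

Using `n=1` keeps multi-word last names together in the second column instead of spilling them into a third.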

Conclusion

In this article, we covered 10 data manipulation and data wrangling hacks, tips, and tricks across various tools and techniques to help you become a better and more efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.
