Rahul Agarwal
Bridging the gap between Data Science and Intuition. Data Scientist @WalmartLabs. Data science communicator at mlwhiz and TDS. Connect on Twitter @mlwhiz
Image by Денис Марчук from Pixabay

All the PyTorch functionality you will ever need while doing Deep Learning. From an Experimentation/Research Perspective.

PyTorch has become something of a de facto standard for creating neural networks now, and I love its interface. Yet it is somewhat difficult for beginners to get a hold of.

I remember picking PyTorch up only after some extensive experimentation a couple of years back. …


Image by lisa runnels from Pixabay

Think ahead to production so that you don’t let your machine learning project collapse before it even gets started

As data scientists, one of our jobs is to create the whole design for any given machine learning project. Whether we’re working on a classification, regression, or deep learning project, it falls to us to decide on the data preprocessing steps, feature engineering, feature selection, evaluation metric, and the algorithm, as well as hyperparameter tuning for said algorithm. And we spend a lot of time worrying about these issues.

All of that is well and good. But there are a lot of other important things to consider when building a great machine learning system. …


Awesome Atom Functionalities that make it a worthy editor for Python

Github + Markdown + Stack Overflow + Autocomplete + Jupyter

Before I even begin this article, let me just say that I love IPython notebooks, and Atom is not an alternative to Jupyter in any way. Notebooks provide me with an interface where I have to think of “coding one code block at a time,” as I like to call it, and it helps me think more clearly while making my code more modular.

Yet, Jupyter is not suitable for some tasks in its present form, the most prominent being when I have to work with .py files. And you will need to work with .py files whenever you want to push code to production or change other people’s code. So, until now, I used Sublime Text to edit Python files, and I found it excellent. …


Photo by Kunal Kalra on Unsplash

Python Shorts

Non-Equi Joins with Pandas and PandaSQL

Pandas is one of the best data manipulation libraries in recent times. It lets you slice and dice, group by, join, and do any arbitrary data transformation. You can take a look at this post, which talks about handling most data manipulation cases in a straightforward, matter-of-fact way using Pandas.

But even with how awesome Pandas generally is, there are moments when you would like to have just a bit more. Say you come from a SQL background, where the same operation would be too easy. …
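Non-equi joins are exactly one of those moments. As a taste, here is one plain-pandas workaround using made-up tables (PandaSQL lets you express the same thing as a SQL join with a `BETWEEN` condition):

```python
import pandas as pd

# Hypothetical example: assign each transaction to the price band it falls in.
# pandas merge() only joins on equality keys, so one workaround for a
# non-equi join is a cross join followed by a range filter (fine for small tables).
trades = pd.DataFrame({"trade_id": [1, 2, 3], "amount": [50, 250, 900]})
bands = pd.DataFrame({"band": ["low", "mid", "high"],
                      "lo": [0, 100, 500],
                      "hi": [100, 500, 1000]})

cross = trades.merge(bands, how="cross")  # every trade paired with every band
result = cross[(cross["amount"] >= cross["lo"]) & (cross["amount"] < cross["hi"])]
print(result[["trade_id", "band"]])
```

Note that the cross join materializes every pair of rows, so this approach only scales to modest table sizes; that is where PandaSQL earns its keep.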


Image by Omni Matryx from Pixabay

Some course recommendations for the much-sought-after machine learning engineering positions

With ML engineer job roles very much in vogue and a lot of people preparing for them, my readers often ask me to recommend courses specifically for ML engineer roles rather than for data science roles.

Now, while ML engineering and data science have a high degree of overlap, and I could very well argue that ML engineers need to know many of the data science skills, there is a special place in hell reserved for ML engineers: the production and deployment part of the data science modeling process. …


A Layman’s Introduction to GANs for Data Scientists using PyTorch


Most of us in data science have seen a lot of AI-generated people in recent times, whether in papers, blogs, or videos. We’ve reached a stage where it’s becoming increasingly difficult to distinguish between actual human faces and faces generated by artificial intelligence. However, with the currently available machine learning toolkits, creating these images yourself is not as difficult as you might think.

In my view, GANs will change the way we generate video games and special effects. Using this approach, we could create realistic textures or characters on demand.

So in this post, we’re going to look at the generative adversarial networks behind AI-generated images, and help you understand how to build your own similar application with PyTorch. We’ll try to keep the post as intuitive as possible for those of you just starting out, but we’ll try not to dumb it down too much. …
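To make the idea concrete, here is a made-up minimal generator/discriminator pair in PyTorch; the layer sizes are purely illustrative (28×28 grayscale images) and the networks are untrained:

```python
import torch
import torch.nn as nn

latent_dim = 64  # size of the random noise vector fed to the generator

# Generator: noise vector -> flattened fake image
generator = nn.Sequential(
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 28 * 28),
    nn.Tanh(),                  # pixel values in [-1, 1]
)

# Discriminator: flattened image -> probability it is real
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 128),
    nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
    nn.Sigmoid(),
)

# One forward pass of the adversarial game: D judges G's fakes.
z = torch.randn(16, latent_dim)
fake_images = generator(z)
scores = discriminator(fake_images)
print(fake_images.shape, scores.shape)
```

Training then alternates between improving the discriminator on real versus fake batches and improving the generator to fool it.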


Image by Paul Barlow from Pixabay

Data science requires a range of sophisticated technical skills—but don’t let that expertise get in the way of critical thinking.

As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.

When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.

Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist. …


Photo by Genessa Panainte on Unsplash

Spark 3.0 + GPU is here. And it is a game-changer

Data exploration is a key part of data science. And does it take long? Ahh. Don’t even ask. Preparing a data set for ML not only requires understanding the data set, cleaning it, and creating new features; it also involves repeating these steps until we have a fine-tuned system.

As we moved towards bigger datasets, Apache Spark came as a ray of hope, giving us a scalable, distributed in-memory system to work with Big Data. In parallel, we also saw frameworks like PyTorch and TensorFlow that inherently parallelize matrix computations across thousands of GPU cores.

But never did we see these two systems working in tandem in the past. We continued to use Spark for Big Data ETL tasks and GPUs for matrix intensive problems in Deep Learning. …



Yes, YOLOv5 is here

Ultralytics recently launched YOLOv5 amid controversy surrounding its name. For context, the first three versions of YOLO (You Only Look Once) were created by Joseph Redmon. Following this, Alexey Bochkovskiy created YOLOv4 on darknet, which boasted higher Average Precision (AP) and faster results than previous iterations.

Now, Ultralytics has released YOLOv5, with comparable AP and faster inference times than YOLOv4. This has left many asking: is a new version warranted given similar accuracy to YOLOv4? Whatever the answer may be, it’s definitely a sign of how quickly the detection community is evolving.

Source: Ultralytics Yolov5

Since they first ported YOLOv3, Ultralytics has made it very simple to create and deploy models using Pytorch, so I was eager to try out YOLOv5. As it turns out, Ultralytics has further simplified the process, and the results speak for themselves. …


Photo by Cody Black on Unsplash

Using Amazon EC2+Pytorch+Fastapi and Docker

Just recently, I wrote a simple tutorial on FastAPI about understanding how APIs work and creating a simple API using the framework.

That post got quite a good response, but the most-asked questions were how to deploy the FastAPI API on EC2, and how to use image data, rather than simple strings, integers, and floats, as input to the API.

I scoured the net for this, but all I could find was some undercooked documentation and a lot of different approaches people were taking to deploy with NGINX or ECS. …


Image by Louise Dav from Pixabay

Using PyTorch And Transfer Learning

Have you ever wondered how Facebook takes care of the abusive and inappropriate images shared by some of its users? Or how Facebook’s tagging feature works? Or how Google Lens recognizes products through images?

All of the above are examples of image classification in different settings. Multiclass image classification is a common task in computer vision, where we categorize an image into three or more classes.

In the past, I always used Keras for computer vision projects. However, recently when the opportunity to work on multiclass image classification presented itself, I decided to use PyTorch. …


Image by intographics from Pixabay

API creation just became simple

Have you ever been in a situation where you want to provide your model predictions to a frontend developer without them having access to model related code? Or has a developer ever asked you to create an API that they can use? I have faced this a lot.

As data scientists and web developers try to collaborate, APIs become an essential piece of the puzzle for making both code and skills more modular. …


Image by Stefan Keller from Pixabay

With just 2 lines of code using Hummingbird

With the advent of so many computing and serving frameworks, it is getting more stressful by the day for developers to put a model into production. If the question of which model performs best on my data was not enough, now the question is which framework to choose for serving a model trained with Sklearn, LightGBM, or PyTorch. And new frameworks are added every day.

So must a data scientist learn a different framework because a data engineer is comfortable with it, or, conversely, must a data engineer learn a new platform that the data scientist favors? …


Image by Jill Wellington from Pixabay

Opinion

How will the field change in response to the new normal?

With the ongoing COVID-19 situation and the likelihood of a new normal, once the worst of it abates, businesses will be looking to change their processes. One of the biggest changes we can expect in the near future is the proliferation of work-from-home options, even after the lockdown eases. A large proportion of the workforce is likely to see an increase in the frequency with which they are working from home.

So, what does this mean for the field of data science and its practitioners?

I have now been working from home for the past month and a half. I can say for myself that I’m having difficulties with time management, not connecting with others face to face, and facing the continual stream of really horrid news. During all of this, I’m also trying to make some sense of what the future of our work will look like when it comes to data science. …


Source: Pixabay

A bookmarkable cheatsheet containing all the Dataframe Functionality you might need

Big Data has become synonymous with data engineering. But the line between data engineering and data science is blurring day by day. At this point, I think that Big Data must be in the repertoire of all data scientists.

Reason: Too much data is getting generated day by day

And that brings us to Spark which is one of the most used tools when it comes to working with Big Data.

While once upon a time Spark used to be heavily reliant on RDD manipulations, Spark has now provided a DataFrame API for us Data Scientists to work with. Here is the documentation for the adventurous folks. But while the documentation is good, it does not explain it from the perspective of a Data Scientist. …



DL Rig

CUDA, CuDNN, Python, Pytorch, Tensorflow, RAPIDS

Creating my own workstation has long been a dream of mine. I knew the process involved, yet I somehow never got around to it.

But this time I just had to do it. So I found some free time to create a deep learning rig, with a lot of assistance from the NVIDIA folks, who were pretty helpful. On that note, special thanks to Josh Patterson and Michael Cooper.

Now, every time I create the whole deep learning setup from an installation viewpoint, I end up facing similar challenges. It’s like running around in circles with all these various dependencies and errors. …


Source: Pixabay

Creating Rank, Lags and Rolling Features with pySpark

In my last few posts on Spark, I explained how to work with PySpark RDDs and Dataframes.

Although those posts explain a lot about how to work with RDDs and DataFrame operations (and I would ask readers to go through them if they want to learn Spark basics), I still hadn’t covered quite a lot when it comes to working with PySpark DataFrames.

One such thing is the Spark window functions.

Only recently, while working on one of my projects, did I realize their power, so this post is about some of the most important window functions available in Spark.
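For intuition before diving into Spark itself, here is what rank, lag, and rolling features mean, illustrated on a toy pandas DataFrame (the data and column names are made up; Spark’s Window API expresses the same three ideas at scale):

```python
import pandas as pd

# Toy weekly sales data, partitioned by shop.
df = pd.DataFrame({
    "shop":  ["A", "A", "A", "B", "B", "B"],
    "week":  [1, 2, 3, 1, 2, 3],
    "sales": [10, 30, 20, 5, 5, 25],
})

df["rank"] = df.groupby("shop")["sales"].rank(ascending=False)  # rank within each shop
df["prev_week"] = df.groupby("shop")["sales"].shift(1)          # lag feature
df["rolling_2w"] = (df.groupby("shop")["sales"]
                      .rolling(2, min_periods=1).mean()
                      .reset_index(level=0, drop=True))         # rolling mean over 2 weeks
print(df)
```

In Spark, the same partition-by-shop, order-by-week logic is written once as a `Window` spec and reused across `rank()`, `lag()`, and range-based aggregations.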


Image by Artsy Solomon from Pixabay

Opinion

A plethora of online courses and tools promise to democratize the field, but just learning a few basic skills does not a true data scientist make

Every few years, some academic and professional field gets a lot of cachet in the popular imagination. Right now, that field is data science. As a result, a lot of people are looking to get into it. Add to that the news outlets calling data science sexy and various academic institutes promising to make a data scientist out of you in just a few months, and you’ve got the perfect recipe for disaster.

Of course, as a data scientist myself, I don’t think the problem lies in people choosing data science as a profession. If you’re interested in working with data, understanding business problems, grappling with math, and you love coding, you’re probably going to thrive in data science. You’ll get a lot of opportunities to use math and coding to develop innovative solutions to problems and will likely find the work rewarding. …


With all the installations and environments

Image by Bessi from Pixabay

I have found myself creating a Deep Learning Machine time and time again whenever I start a new project.

You start with installing Anaconda and end up creating different environments for Pytorch and Tensorflow, so they don’t interfere. And in the middle of it, you inevitably end up messing up and starting from scratch. And this often happens multiple times.

It is not just a massive waste of time; it is also mighty (trying to avoid profanity here) irritating. Going through all those Stack Overflow threads, often wondering what has gone wrong.

So is there a way to do this more efficiently? …


Retrain your mind. Image by John Hain from Pixabay

Everyone is prey to cognitive biases that skew thinking, but data scientists must prevent them from spoiling their work.

Recently, I was reading Rolf Dobelli’s The Art of Thinking Clearly, which made me think about cognitive biases in a way I never had before. I realized how deep-seated some cognitive biases are. In fact, we often don’t even consciously realize when our thinking is being affected by one. For data scientists, these biases can really change the way we work with data and make our day-to-day decisions, and generally not for the better.

Data science is, despite the seeming objectivity of all the facts we work with, surprisingly subjective in its processes.

As data scientists, our job is to make sense of the facts. In carrying out this analysis, we have to make subjective decisions though. So even though we work with hard facts and data, there’s a strong interpretive component to data science. …


Image by Anrita1705 from Pixabay

Have you ever thought about how toxic comments get flagged automatically on platforms like Quora or Reddit? Or how mail gets marked as spam? Or what decides which online ads are shown to you?

All of the above are examples of how text classification is used in different areas. Text classification is a common task in natural language processing (NLP) that assigns a sequence of text of indefinite length to a single category.

One theme that emerges from the above examples is that all have a binary target class. For example, either the comment is toxic or not toxic, or the review is fake or not fake. …
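To make the binary setup concrete, here is a deliberately tiny, made-up toxic-comment classifier in scikit-learn; real systems use far larger corpora and typically deeper models:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = toxic, 0 = not toxic.
texts = ["you are awful", "terrible horrible person", "what a lovely day",
         "great answer thanks", "awful horrible comment", "lovely great post"]
labels = [1, 1, 0, 0, 1, 0]

# Bag-of-words features + a linear classifier: the classic binary baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["what an awful horrible thing"]))
```

The same pipeline shape carries over to spam detection or fake-review detection; only the labels change.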


Image by S. Hermann & F. Richter from Pixabay

Python Shorts

A simple guide to use the new functionality in Python 3

Python provides us with many styles of coding.

And over time, Python has regularly come up with new features and tools that adhere even more closely to the ideals in the Zen of Python.

Beautiful is better than ugly.

In this series of posts named Python Shorts, I will explain some simple but very useful constructs provided by Python, some essential tips, and some use cases I come up with regularly in my Data Science work.

This post is specifically about f-strings, which were introduced in Python 3.6.


3 Common Ways of Printing:

Let me explain this with a simple example. Suppose you have some variables, and you want to print them within a statement. …
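As a preview, here are the three older printing styles next to the f-string equivalent (the variable names and values are made up):

```python
name, accuracy = "mlwhiz", 0.91734

# 1. String concatenation: verbose, and needs explicit str() conversions.
print("Model by " + name + " scored " + str(round(accuracy, 2)))
# 2. %-formatting: the old C-style way.
print("Model by %s scored %.2f" % (name, accuracy))
# 3. str.format(): more readable, but the variables sit far from the braces.
print("Model by {} scored {:.2f}".format(name, accuracy))
# f-string (Python 3.6+): the expression lives right inside the string.
print(f"Model by {name} scored {accuracy:.2f}")
```

All four lines print the same text; the f-string is both the shortest and, in CPython, generally the fastest.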



A Data Science Based News App

The way I consume information seems to have changed a lot. I have become quite a news junkie recently. In particular, I have been reading a lot of international news to gauge the stages of Covid-19 in my country.

To do this, I generally visit a lot of news media sites in various countries to read up on the news. This gave me an idea. Why not create an international news dashboard for Corona? And here it is.

This post is about how I created the news dashboard using Streamlit and data from NewsApi and European CDC. …


Image by Sasin Tipchai from Pixabay

Caching = Better User Experience

It is now straightforward to create a web app using Streamlit, but there are still a lot of things it doesn’t let you do. One of the major issues I faced recently was with caching, when I was trying to use a News API to create an analytical news dashboard in Streamlit.

The problem was that I was hitting the News API regularly and was reaching the free API limits. Also, running the news API every time the user refreshed the app was getting pretty slow.

A solution would have been to cache the API data. But when I used the @st.cache decorator, the page never refreshed, as the parameters to the API call remained the same. …
