Towards Data Science
A Medium publication sharing concepts, ideas, and codes.

Background photo by Johannes Plenio from Pexels

Find our latest picks here, plus info about our new design and how to contribute to our publication.

Our latest picks:

The new design

We just opted in to Medium’s new design!

It offers new features that we think will be valuable to both our readers and our writers. We did lose our menu (it will be back soon), but we believe that the recommendations at the bottom of our articles and the presentation of our authors on the sidebar are immensely valuable. …


My forecast relies on historical popular-vote data in every state

Acknowledgment

Thanks to Asma Barakat for helping me in gathering the needed data for this research!

Introduction

Many factors affect the election results, such as COVID-19, the impeachment, the economy, the unemployment rate, responses to natural disasters, climate change, foreign policy, people’s loyalty to their party, the debates, the presidents’ heights, and plenty of others. In this article, I will focus solely on historical data of the popular vote in every state.

Model

The modeling and analysis were done in SAS 9.4, using the PROC UNIVARIATE procedure.

Results

To test my model, I predicted the 2016 elections and compared the results to the actual values. Collectively, the algorithm shows that Trump would win in 2016, which actually happened. It also forecasts that he will win the 2020 elections. However, by inspecting the state-level results, I found that the code predicted 11 states incorrectly; further details below. …
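The SAS code itself is not shown in the excerpt, but the core idea, forecasting each state's winner from its historical popular-vote shares, can be sketched in Python. The states and vote shares below are hypothetical toy numbers, not real data:

```python
# Toy sketch: predict each state's winner from the mean of its
# historical two-party popular-vote shares (illustrative values).

# Fraction of the two-party vote won by the Republican candidate
# in past elections (made-up numbers for illustration).
history = {
    "Texas":      [0.57, 0.55, 0.52],
    "California": [0.37, 0.33, 0.32],
    "Florida":    [0.49, 0.51, 0.52],
}

def forecast(shares):
    """Predict the winner from the average historical share."""
    mean = sum(shares) / len(shares)
    return "R" if mean > 0.5 else "D"

predictions = {state: forecast(s) for state, s in history.items()}
print(predictions)  # {'Texas': 'R', 'California': 'D', 'Florida': 'R'}
```

A real forecast would of course weight recent elections more heavily and model uncertainty per state; this only shows the shape of the computation.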


Peeking inside Deep Neural Networks with Integrated Gradients, Implemented in PyTorch.

Photo by img.ly on Unsplash

Neural networks are known to be black box predictors where the data scientist does not usually know which particular input feature influenced the prediction the most. This can be rather limiting if we want to get some understanding of what the model actually learned. Having this kind of understanding may allow us to find bugs or weaknesses in our learning algorithm or in our data processing pipeline and thus be able to improve them.

The approach that we will implement in this project is called integrated gradients and it was introduced in the following paper:

In this paper, the authors list some desirable axioms that a good attribution method should follow and prove that their method, Integrated Gradients, satisfies those axioms. …
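The article implements the method in PyTorch, but the core computation is simple enough to sketch in NumPy: a Riemann-sum approximation of the path integral of gradients from a baseline to the input, scaled by the input difference. Here a toy function with an analytic gradient stands in for the neural network (everything below is illustrative, not the article's code):

```python
import numpy as np

def F(x):
    # A toy differentiable "model": F(x) = x0**2 + 3*x1
    return x[0] ** 2 + 3 * x[1]

def grad_F(x):
    # Analytic gradient of F (a real network would use autograd).
    return np.array([2 * x[0], 3.0])

def integrated_gradients(x, baseline, steps=100):
    """Approximate IG_i = (x_i - x'_i) * integral of dF/dx_i along
    the straight line from the baseline x' to the input x."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    grads = np.array([grad_F(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([2.0, 1.0])
baseline = np.zeros_like(x)
attrib = integrated_gradients(x, baseline)

# Completeness axiom: the attributions sum to F(x) - F(baseline).
print(attrib, attrib.sum(), F(x) - F(baseline))
```

The printed sums agree, which is exactly the "completeness" axiom the paper proves: the per-feature attributions account for the full change in the model's output between baseline and input.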


Automate Boring Stuff with Python and Bash For Loop

Photo by Sincerely Media on Unsplash

Motivation

When putting your code into production, you will most likely need to deal with organizing your code’s files. It can be really time-consuming to read, create, and run many data files by hand. This article will show you how to automatically:

  • Loop through files in a directory
  • Create nested files if they do not exist
  • Run one file with different inputs using bash for loop

These tricks have saved me a lot of time while working on my data science projects. I hope you will find them useful as well!

Loop through Files in a Directory

If we have multiple data to read and process like…
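The excerpt cuts off here, but the first two tricks from the list above can be sketched with `pathlib`. The directory layout and file names below are hypothetical, and the sketch works in a temporary sandbox so it is self-contained:

```python
import tempfile
from pathlib import Path

# Work in a temporary sandbox so the sketch is self-contained.
root = Path(tempfile.mkdtemp())
data_dir = root / "data"
out_dir = root / "output" / "processed"   # a nested output path

# Set up some example input files.
data_dir.mkdir(parents=True)
(data_dir / "a.csv").write_text("x,y\n1,2\n")
(data_dir / "b.csv").write_text("x,y\n3,4\n")

# 1. Create nested directories if they do not exist.
out_dir.mkdir(parents=True, exist_ok=True)

# 2. Loop through every CSV file in the data directory.
for csv_file in sorted(data_dir.glob("*.csv")):
    # ... read and process each file, then write the result ...
    (out_dir / csv_file.name).write_text(csv_file.read_text())

print(sorted(p.name for p in out_dir.iterdir()))  # ['a.csv', 'b.csv']
```

The third trick, running one file with different inputs, is a plain bash `for` loop in the article; the same pattern in Python would iterate over a list of arguments and invoke the script with each.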


A quick walk-through on creating choropleth using the Plot.ly module for Python.

(Image by author)

Introduction

As COVID-19 cases rise again in the United States, an interesting and exciting development that might have been somewhat overlooked is the actual level of threat that each state faces from a virus of this nature. …


Spatial Visualisation — Using the combined power of Turf and Hextile JavaScript libraries to generate hex maps

Having come across many articles, I notice that many individuals seek to generate hex maps for their spatial datasets, whether for the sake of aesthetics (E.g. …


Easily animate beautiful maps in R using leaflet and shiny.

Photo by Timo Wielink on Unsplash

Every day, we as data scientists and data analysts have to work with different kinds of data. And we all know that visualization of the data and our findings is key, especially when presenting it to co-workers or clients. After all, it is far easier to tell a story with a chart than it is with plain numbers or text. When you’re presenting data via a dashboard, not only does static visualization become important; you will also want to display changes in your data dynamically.

In this post, I will demonstrate how you can easily animate charts based on geospatial data using the leaflet and shiny libraries in R. …


Adding spaCy Word Vectors to a Keras Model

A close-up photo of a fountain pen writing in cursive on lined paper with black ink.
Photo by Aaron Burden on Unsplash
  1. The story so far
  2. Exploratory data analysis
  3. Imputing missing values
  4. Optimizing data types
  5. Creating document vectors
  6. Building the pipeline
  7. Evaluating the model
  8. Next steps

The story so far

A few months ago, I built a neural network regression model to predict loan risk, training it with a public dataset from LendingClub. Then I built a public API with Flask to serve the model’s predictions.

Then last month, I decided to put my model to the test and found out that my model can pick grade A loans better than LendingClub!

But I’m not done. Now that I’ve learned the fundamentals of natural language processing (I highly recommend Kaggle’s course on the subject), I’m going to see if I can eke out a bit more predictive power using a couple of freeform text fields in the dataset: title and desc (description). …
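spaCy’s `doc.vector` is, by default, the average of the document’s token vectors, which turns a variable-length text field like `title` into a fixed-length feature row. A minimal NumPy sketch of that averaging scheme, with made-up 4-dimensional vectors standing in for spaCy’s real ones:

```python
import numpy as np

# Toy 4-d "word vectors" standing in for spaCy's 300-d vectors
# (hypothetical values, for illustration only).
word_vectors = {
    "debt":          np.array([0.9, 0.1, 0.0, 0.2]),
    "consolidation": np.array([0.8, 0.3, 0.1, 0.0]),
    "vacation":      np.array([0.1, 0.9, 0.7, 0.4]),
}

def doc_vector(text):
    """Average the word vectors of the tokens in a text, the same
    scheme spaCy uses for `doc.vector` by default."""
    vecs = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

# Each loan title becomes one fixed-length feature row, which can
# then be concatenated with the tabular features and fed to a
# dense Keras layer.
X_text = np.vstack([doc_vector("Debt consolidation"), doc_vector("Vacation")])
print(X_text.shape)  # (2, 4)
```

With real spaCy this whole function collapses to `nlp(text).vector`; the sketch just makes the averaging explicit.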


Analyze headlines and story text with Streamlit, Transformers and FastAPI


Vast amounts of media, news and commentary are generated on a daily basis. Headlines and other attention-grabbing content are constantly put on our screens to try to get us to click through. Putting together a good headline is almost as important as the content within an article, and there are teams of people dedicated to it.

Natural Language Processing (NLP) is a large and growing field focused on the application of machine learning to attain human-level understanding of textual data. …


(Image by Author)

Rethinking the Value of Labels for Improving Class-Imbalanced Learning (NeurIPS 2020)

Let me introduce our latest work, which has been accepted by NeurIPS 2020: Rethinking the Value of Labels for Improving Class-Imbalanced Learning. This work studies a classic but very practical and common problem: classification under imbalanced data categories (also referred to as long-tailed data distributions). Through theoretical modeling and extensive experiments, we found that both semi-supervised and self-supervised learning can significantly improve learning performance under imbalanced data.

The source code (and relevant data, >30 pre-trained models) can be found via this GitHub link: https://github.com/YyzHarry/imbalanced-semi-self.

To begin with, I would like to first summarize the main contribution of this article in one sentence: We have verified both theoretically and empirically that, for learning problems with imbalanced data (categories)…
