Some Data Science Projects Every Data Scientist Must Know

Andre Vianna
My Data Science Journey
5 min readNov 6, 2021
Data Science

Open source data science projects to enhance your portfolio
Let’s divide the projects into categories:

  1. Open Sourcer Computer Vision
  • FaceX-Zoo
  • Bottleneck Transformer — Pytorch
  • StyleGAN2-ADA — Official PyTorch implementation

2. Open Source Natural Language Processing

  • Trankit
  • EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

3. Open Source Machine Learning

  • SeaLion

1. Open Sourcer Computer Vision

FaceX-Zoo

FaceX-Zoo has to be one of the most impressive projects of the month. With face recognition becoming more and more relevant in the realm of computer vision FaceX-Zoo is an open-source data science project you do not want to miss.

FaceX-Zoo is a face recognition PyTorch toolbox. It comes with a training module having different supervisory heads and backbones towards state-of-the-art face recognition. It has a standardized evaluation module, enabling the evaluation of models in most of the popular benchmarks just by editing a simple configuration.

Also, a simple yet fully functional face SDK is provided for the validation and primary application of the trained models. Also, FaceX-Zoo easily upgrades and extends along with the development of face-related domains.

Bottleneck Transformer — Pytorch

Another mind-blowing project in computer vision, Bottleneck Transformer looks like a very good project to add to your data science portfolio.

The paper says-

“It is simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation”

Baseline models see significant improvement by simply replacing the last 3 bottleneck blocks of a ResNet and no other changes. Sounds promising, doesn’t it?

The Bottleneck transformer has all the potential to serve as a strong baseline for future research in self-attention models for vision.

StyleGAN2-ADA — Official PyTorch implementation

When generative adversarial networks are trained using too small data, it may end up in discriminator overfitting, causing training to diverge. This project comes with a solution by including an adaptive discriminator augmentation mechanism that can stabilize training in limited data regimes.

The project come with a lot of promises including-

  • Full support for all primary training configurations.
  • Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
  • Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

With increased speed and efficiency as compared to other projects, StyleGAN2-ADA is a nice open-sourced project to add to your portfolio.

2. Open Source Natural Language Processing

Trankit

The fascinating world of NLP is not far behind when it comes to impressive open-sourced data science projects. Trankit is another popular project released last month.

Trankit is a light-weight transformer-based python toolkit for multilingual Natural Language Processing. Its 2 main constituents include-

Another impressive thing about Trankit is that it beats the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) in many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages without losing efficiency in memory usage and speed, making it usable amongst a larger audience.

EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

Neural Machine Tranlation

With Easy installation, usage, and Automatic download of pre-trained machine translation models, EasyMNT will easily make your NLP portfolio stand out.

It has translation between 150+ languages and automatic language detection for 170+ languages along with sentence and document translation.

At present, the project provides the following models-

3. Open Source Machine Learning

SeaLion

SeaLion is a brilliant Machine Learning Project created to teach the concepts in a more easy manner using concise algorithms capable of doing the tasks efficiently.

SeaLion

SeaLion is designed to teach today’s aspiring ml-engineers the popular machine learning concepts of today in a way that gives both intuition and ways of application.

It is beginner-friendly when it comes to solving the standard libraries like iris, breast cancer, swiss roll, the moons dataset, MNIST, etc. The algorithms in SeaLion include:

  1. Deep Neural Networks
  2. Regression
  3. Dimensionality Reduction
  4. Unsupervised Clustering
  5. Naive Bayes
  6. Trees
  7. Ensemble Learning
  8. Nearest Neighbors
  9. Utils

--

--

Andre Vianna
My Data Science Journey

Software Engineer & Data Scientist #ESG #Vision2030 #Blockchain #DataScience #iot #bigdata #analytics #machinelearning #deeplearning #dataviz