Some Data Science Projects Every Data Scientist Must Know

Andre Vianna
My Data Science Journey
5 min read · Nov 6, 2021
Data Science

Open source data science projects to enhance your portfolio
Let’s divide the projects into categories:

1. Open Source Computer Vision

  • FaceX-Zoo
  • Bottleneck Transformer — Pytorch
  • StyleGAN2-ADA — Official PyTorch implementation

2. Open Source Natural Language Processing

  • Trankit
  • EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

3. Open Source Machine Learning

  • SeaLion

1. Open Source Computer Vision


FaceX-Zoo

FaceX-Zoo has to be one of the most impressive projects on this list. With face recognition becoming more and more relevant in computer vision, it is an open-source data science project you do not want to miss.

FaceX-Zoo is a PyTorch toolbox for face recognition. Its training module offers a range of supervisory heads and backbones aimed at state-of-the-art face recognition. Its standardized evaluation module lets you evaluate models on most of the popular benchmarks just by editing a simple configuration file.
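To make the idea of a "supervisory head" more concrete, here is a minimal sketch of a margin-based head in the style of ArcFace, a widely used example of this kind of head. It is plain Python written for illustration, not code from FaceX-Zoo: the function names, the `scale` and `margin` values, and the list-based inputs are all assumptions, and the edge case where the angle plus the margin exceeds pi is left out.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def arcface_logits(embedding, class_weights, label, scale=64.0, margin=0.5):
    """Scaled cosine logits with an additive angular margin on the true class.

    Hypothetical helper for illustration only, not FaceX-Zoo's API.
    """
    emb = l2_normalize(embedding)
    logits = []
    for idx, weight in enumerate(class_weights):
        w = l2_normalize(weight)
        # Cosine of the angle between the embedding and the class centre.
        cos_theta = sum(e * x for e, x in zip(emb, w))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        if idx == label:
            # Penalise the true class by widening its angle by `margin`.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], label=0)
print(logits)
```

The logits would then go into an ordinary softmax cross-entropy loss. Because the margin makes the true class harder to score well on, the backbone is pushed towards embeddings that are tightly grouped per identity.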

A simple yet fully functional face SDK is also provided for validating the trained models and building first applications with them. In addition, FaceX-Zoo is easy to upgrade and extend as face-related research develops.

Bottleneck Transformer — Pytorch

Another mind-blowing project in computer vision, Bottleneck Transformer looks like a very good project to add to your data science portfolio.

The paper says:

“It is simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation”

Baseline models improve significantly when the spatial convolutions in the last 3 bottleneck blocks of a ResNet are replaced with global self-attention, with no other changes. Sounds promising, doesn’t it?

The Bottleneck Transformer has the potential to serve as a strong baseline for future research in self-attention models for vision.
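To see what "self-attention over a feature map" means in practice, here is a minimal sketch in plain Python. It is a simplification under stated assumptions rather than the BoTNet layer: queries, keys, and values are the raw input pixels (no learned projections), there is a single head, and the relative position encodings used in the paper are omitted. The function names are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(feature_map):
    """Apply single-head self-attention to an H x W x C nested list.

    Every spatial position attends to every other position, which is the
    "global" part. Illustrative only: no learned weights, no position terms.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the grid into H*W tokens of C channels each.
    tokens = [pixel for row in feature_map for pixel in row]
    channels = len(tokens[0])
    scale = 1.0 / math.sqrt(channels)

    attended = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[c] for w, value in zip(weights, tokens)) for c in range(channels)]
        )
    # Restore the H x W layout.
    return [attended[r * width:(r + 1) * width] for r in range(height)]


feature_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
print(global_self_attention(feature_map))
```

Each output pixel is a weighted average of all input pixels. That is why such a layer can capture long-range context that a 3x3 convolution only sees after many stacked layers.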

StyleGAN2-ADA — Official PyTorch implementation

When generative adversarial networks are trained on too little data, the discriminator tends to overfit, which causes training to diverge. This project addresses the problem with an adaptive discriminator augmentation mechanism that stabilizes training in limited-data regimes.
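The "adaptive" part can be sketched in a few lines. The idea is to watch how confidently the discriminator scores real images and to nudge the augmentation probability up when it looks overconfident, and down otherwise. The snippet below is an illustrative sketch, not the project's code: the function name, the `step` size, the `target` default, and the clamping to [0, 1] are assumptions made for this example.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def update_augment_probability(p, real_logits, target=0.6, step=0.01):
    """One adjustment step for the augmentation probability p.

    real_logits: discriminator outputs on a batch of real training images.
    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Overfitting indicator: average sign of the outputs on real images.
    # Values near 1 mean the discriminator is very sure about the real data.
    overfit = sum(sign(v) for v in real_logits) / len(real_logits)
    # Augment more when the indicator is above the target, less when below.
    p = p + step * sign(overfit - target)
    # Keep p a valid probability.
    return min(1.0, max(0.0, p))


p = 0.5
p = update_augment_probability(p, [1.2, 0.8, 2.5, 0.3])  # confident: p goes up
p = update_augment_probability(p, [0.4, -0.9, 0.1, -0.2])  # unsure: p goes down
print(p)
```

Called once every few training steps, a rule like this lets the augmentation strength follow the amount of overfitting instead of being fixed up front.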

The project comes with a lot of promises, including:

  • Full support for all primary training configurations.
  • Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
  • Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

With its claimed gains in speed and efficiency, StyleGAN2-ADA is a nice open-source project to add to your portfolio.

2. Open Source Natural Language Processing


Trankit

The fascinating world of NLP is not far behind when it comes to impressive open-source data science projects. Trankit is another popular recent release.

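Trankit is a lightweight, transformer-based Python toolkit for multilingual Natural Language Processing. Its 2 main constituents are a trainable pipeline for fundamental NLP tasks in over 100 languages and 90 downloadable pretrained pipelines for 56 languages.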

Another impressive thing about Trankit is that it beats Stanza (StanfordNLP), the current state-of-the-art multilingual toolkit, in many tasks over 90 Universal Dependencies v2.5 treebanks covering 56 languages. It does so without losing efficiency in memory usage or speed, which makes it usable by a wider audience.

EasyNMT — Easy to use, state-of-the-art Neural Machine Translation


With easy installation, simple usage, and automatic download of pre-trained machine translation models, EasyNMT will easily make your NLP portfolio stand out.

It supports translation between 150+ languages and automatic language detection for 170+ languages, and it can translate both sentences and whole documents.

At present, the project provides the following models: Opus-MT from Helsinki-NLP, and mBART50 and M2M-100 from Facebook Research.

3. Open Source Machine Learning


SeaLion

SeaLion is a brilliant machine learning project created to teach the core concepts in an easier way, using concise algorithms that still do their tasks efficiently.


SeaLion is designed to teach aspiring ML engineers today’s popular machine learning concepts in a way that builds intuition and shows how to apply them.

It is beginner-friendly when it comes to working with standard datasets such as iris, breast cancer, swiss roll, the moons dataset, MNIST, etc. The algorithms in SeaLion include:

  1. Deep Neural Networks
  2. Regression
  3. Dimensionality Reduction
  4. Unsupervised Clustering
  5. Naive Bayes
  6. Trees
  7. Ensemble Learning
  8. Nearest Neighbors
  9. Utils
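To give a flavour of the "concise algorithm" style of learning, here is a from-scratch sketch of one family from the list, nearest neighbors, in plain Python. It does not use SeaLion and does not mirror its API. The function name, the toy data, and the default `k` are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k closest training points.

    Illustrative helper only; not part of any library.
    """
    # Pair every training point's Euclidean distance with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k nearest neighbours and return the winner.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.2, 0.1)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Writing an algorithm out this way, and then comparing it with a library version on a dataset like iris, is a quick way to build the kind of intuition the project aims for.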


