Some Data Science Projects Every Data Scientist Must Know

Andre Vianna

Published in My Data Science Journey

5 min read · Nov 6, 2021

Open source data science projects to enhance your portfolio
Let’s divide the projects into categories:

1. Open Source Computer Vision

FaceX-Zoo
Bottleneck Transformer — Pytorch
StyleGAN2-ADA — Official PyTorch implementation

2. Open Source Natural Language Processing

Trankit
EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

3. Open Source Machine Learning

SeaLion

1. Open Source Computer Vision

FaceX-Zoo

FaceX-Zoo has to be one of the most impressive projects of the month. With face recognition becoming more and more relevant in computer vision, FaceX-Zoo is an open-source data science project you do not want to miss.

FaceX-Zoo is a PyTorch toolbox for face recognition. Its training module offers a range of supervisory heads and backbones for state-of-the-art face recognition, and its standardized evaluation module lets you evaluate models on most of the popular benchmarks just by editing a simple configuration.

A simple yet fully functional face SDK is also provided for the validation and primary application of the trained models. The toolbox is designed to be easy to upgrade and extend as face-related domains develop.

GitHub - Medium-Posts/FaceX-Zoo: A PyTorch Toolbox for Face Recognition

FaceX-Zoo is a PyTorch toolbox for face recognition. It provides a training module with various supervisory heads and…

Bottleneck Transformer — Pytorch

Another mind-blowing project in computer vision, Bottleneck Transformer looks like a very good project to add to your data science portfolio.

The paper says:

“It is simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation”

Baseline models improve significantly when the spatial convolutions in the last three bottleneck blocks of a ResNet are replaced with global self-attention, with no other changes. Sounds promising, doesn’t it?
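To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention over the positions of a small feature map. It is not the repository's implementation: it leaves out the learned query/key/value projections, the multiple heads, and the relative position encodings used in the paper, and the helper names (`flatten_feature_map`, `self_attention`) are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_feature_map(fmap):
    """Turn an H x W x C nested list into a list of H*W channel vectors."""
    return [pixel for row in fmap for pixel in row]


def self_attention(features):
    """Every position attends to every other position (global attention).

    `features` is a list of N vectors of dimension d. Queries, keys and
    values are the raw features here (an assumption made for brevity).
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in features:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in features]
        attn = softmax(scores)
        mixed = [sum(attn[j] * features[j][c] for j in range(len(features)))
                 for c in range(dim)]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights
```

For a 2 x 2 feature map with 2 channels, `self_attention(flatten_feature_map(fmap))` returns four output vectors, each a weighted mix of all four positions, which is what gives the block its global receptive field.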

The Bottleneck Transformer has the potential to serve as a strong baseline for future research in self-attention models for vision.

GitHub - Medium-Posts/bottleneck-transformer-pytorch: Implementation of Bottleneck Transformer in…

Implementation of Bottleneck Transformer, SotA visual recognition model with convolution + attention that outperforms…

StyleGAN2-ADA — Official PyTorch implementation

When generative adversarial networks are trained on too little data, the discriminator tends to overfit, causing training to diverge. This project addresses the problem with an adaptive discriminator augmentation (ADA) mechanism that stabilizes training in limited-data regimes.

The project comes with a number of promises, including:
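The adaptive part of the mechanism can be sketched in a few lines. The function below is an illustration written for this article, not code from the repository: it follows the heuristic described in the paper, where the augmentation probability `p` is nudged up when the discriminator looks too confident on real images and nudged down otherwise. The function name, the default values, and the clamp to the range 0 to 1 are assumptions.

```python
def update_augment_probability(p, real_logits, target=0.6,
                               batch_size=64, interval=4, ada_kimg=500):
    """Return the augmentation probability after one ADA-style update.

    real_logits: discriminator outputs on real training images collected
    since the last update (positive means "judged real").
    """
    # Overfitting indicator: mean sign of the discriminator outputs on reals.
    # Values near 1 mean the discriminator is very sure about the training set.
    signs = [1.0 if x > 0 else -1.0 if x < 0 else 0.0 for x in real_logits]
    r_t = sum(signs) / len(signs)

    # Fixed step, sized so that p could travel from 0 to 1 over
    # ada_kimg thousand images.
    step = (batch_size * interval) / (ada_kimg * 1000)

    if r_t > target:
        p += step   # too confident: augment more
    elif r_t < target:
        p -= step   # not confident enough: augment less
    return min(max(p, 0.0), 1.0)
```

Calling this every few minibatches keeps the strength of the augmentation tied to how much the discriminator is overfitting, instead of fixing it in advance.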

Full support for all primary training configurations.
Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

With increased speed and efficiency as compared to other projects, StyleGAN2-ADA is a nice open-sourced project to add to your portfolio.

GitHub - Medium-Posts/stylegan2-ada-pytorch: StyleGAN2-ADA - Official PyTorch implementation

Training Generative Adversarial Networks with Limited Data Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine…

2. Open Source Natural Language Processing

Trankit

The fascinating world of NLP is not far behind when it comes to impressive open-sourced data science projects. Trankit is another popular project released last month.

Trankit is a lightweight, transformer-based Python toolkit for multilingual natural language processing. Its two main components are:

A trainable pipeline for fundamental NLP tasks in over 100 languages
90 downloadable pretrained pipelines for 56 languages

Another impressive thing about Trankit is that it outperforms Stanza (StanfordNLP), the current state-of-the-art multilingual toolkit, on many tasks across 90 Universal Dependencies v2.5 treebanks covering 56 languages, while remaining efficient in memory usage and speed. That makes it usable by a wider audience.

GitHub - Medium-Posts/trankit: Trankit is a Light-Weight Transformer-based Python Toolkit for…

Our technical paper for Trankit won the Outstanding Demo Paper Award at EACL 2021. Please cite the paper if you use…

EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

With easy installation, simple usage, and automatic download of pre-trained machine translation models, EasyNMT will easily make your NLP portfolio stand out.

It supports translation between 150+ languages and automatic language detection for 170+ languages, and it can translate both sentences and whole documents.
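Basic usage can be sketched as follows. The `EasyNMT` class and its `translate` method come from the project's README; the model name `opus-mt` is one of the names the README uses, and the helper names `load_model` and `translate_to` are made up for this example. The import is done lazily inside `load_model`, so the second helper works with any object that exposes a compatible `translate` method.

```python
from typing import List


def load_model(model_name: str = "opus-mt"):
    """Load a pre-trained EasyNMT model (weights are downloaded on first use)."""
    # Imported lazily; requires `pip install easynmt`.
    from easynmt import EasyNMT
    return EasyNMT(model_name)


def translate_to(model, sentences: List[str], target_lang: str = "en") -> List[str]:
    """Translate a list of sentences into target_lang.

    No source language is passed, so the library's automatic
    language detection is relied on.
    """
    return model.translate(sentences, target_lang=target_lang)
```

With the package installed, `translate_to(load_model(), ["Dies ist ein Satz."], target_lang="en")` should return a one-element list containing the English translation.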

The models the project currently provides are listed in the repository:

GitHub - Medium-Posts/EasyNMT: Easy to use, state-of-the-art Neural Machine Translation for 100+…

This package provides easy to use, state-of-the-art machine translation for more than 100+ languages.

3. Open Source Machine Learning

SeaLion

SeaLion is a brilliant machine learning project created to teach the core concepts in a simpler way, using concise algorithms that still do their jobs efficiently.

SeaLion is designed to teach today’s aspiring ML engineers the popular machine learning concepts of today in a way that gives both intuition and ways of application.

It is beginner-friendly when it comes to working with standard datasets such as iris, breast cancer, swiss roll, the moons dataset, and MNIST. The algorithms in SeaLion include:

Deep Neural Networks
Regression
Dimensionality Reduction
Unsupervised Clustering
Naive Bayes
Trees
Ensemble Learning
Nearest Neighbors
Utils

GitHub - Medium-Posts/SeaLion: The first machine learning framework that encourages learning ML…

SeaLion is designed to teach today's aspiring ml-engineers the popular machine learning concepts of today in a way that…

Written by Andre Vianna