Some Data Science Projects Every Data Scientist Must Know
Open source data science projects to enhance your portfolio
Let’s divide the projects into categories:
1. Open Source Computer Vision
- FaceX-Zoo
- Bottleneck Transformer — Pytorch
- StyleGAN2-ADA — Official PyTorch implementation
2. Open Source Natural Language Processing
- Trankit
- EasyNMT — Easy to use, state-of-the-art Neural Machine Translation
3. Open Source Machine Learning
- SeaLion
1. Open Source Computer Vision
FaceX-Zoo
FaceX-Zoo has to be one of the most impressive projects of the month. With face recognition becoming ever more relevant in computer vision, FaceX-Zoo is an open-source data science project you do not want to miss.
FaceX-Zoo is a face recognition toolbox built on PyTorch. Its training module offers a range of supervisory heads and backbones aimed at state-of-the-art face recognition. Its standardized evaluation module lets you evaluate models on most of the popular benchmarks just by editing a simple configuration file.
A simple yet fully functional face SDK is also provided for validating trained models and building first applications with them. The toolbox is designed to be easy to upgrade and extend as face-related domains develop.
Bottleneck Transformer — Pytorch
Another mind-blowing project in computer vision, Bottleneck Transformer looks like a very good project to add to your data science portfolio.
The paper describes it as a “conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation”.
Baseline models see significant improvement simply by replacing the spatial convolutions with global self-attention in the last 3 bottleneck blocks of a ResNet, with no other changes. Sounds promising, doesn’t it?
The Bottleneck Transformer has the potential to serve as a strong baseline for future research in self-attention models for vision.
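To make "global self-attention" concrete, here is a minimal pure-Python sketch of single-head attention over every position of a small feature map. It is an illustration under simplifying assumptions, not the project's implementation: the function and parameter names (`global_self_attention`, `w_q`, `w_k`, `w_v`) are hypothetical, there is only one head, and the relative position encodings used in the paper are left out. A real model would do this with PyTorch tensors.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(feature_map, w_q, w_k, w_v):
    """Single-head self-attention over all spatial positions.

    feature_map: H x W grid of C-dimensional vectors (nested lists).
    w_q, w_k, w_v: D x C projection matrices (hypothetical names).
    Returns an H x W grid of D-dimensional vectors plus the attention weights.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the grid so that every pixel becomes one token.
    tokens = [feature_map[i][j] for i in range(height) for j in range(width)]
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    dim = len(values[0])
    scale = 1.0 / math.sqrt(len(queries[0]))

    outputs, weights = [], []
    for q in queries:
        # Each position attends to all H*W positions, not only a 3x3 window.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * v[d] for a, v in zip(attn, values))
                        for d in range(dim)])

    # Restore the H x W layout.
    grid = [outputs[i * width:(i + 1) * width] for i in range(height)]
    return grid, weights


def random_matrix(rows, cols, rng):
    """Small random projection matrix, for demonstration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, D, H, W = 4, 3, 2, 3  # toy sizes chosen for readability
fmap = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(W)]
        for _ in range(H)]
out, attn = global_self_attention(
    fmap, random_matrix(D, C, rng), random_matrix(D, C, rng),
    random_matrix(D, C, rng))
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 3
```

The point to notice is that the cost grows with the square of the number of positions, which is one plausible reason the replacement is applied only to the last, lowest-resolution blocks.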
StyleGAN2-ADA — Official PyTorch implementation
When generative adversarial networks are trained on too little data, the discriminator tends to overfit, causing training to diverge. This project addresses the problem with an adaptive discriminator augmentation mechanism that can stabilize training in limited-data regimes.
The project makes several promises, including:
- Full support for all primary training configurations.
- Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
- Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.
With training that the authors report to be typically faster than the TensorFlow version, StyleGAN2-ADA is a nice open-source project to add to your portfolio.
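The "adaptive" part of adaptive discriminator augmentation can be sketched in a few lines. The idea, as described in the ADA paper, is to watch how confidently the discriminator scores real images and to nudge the augmentation probability up when it looks overfit and down otherwise. The function name and signature below are hypothetical, and the default values (target 0.6, an update every 4 minibatches, 500k images to move the probability from 0 to 1) are assumptions modeled on the settings reported in the paper, so treat this as an illustration rather than the project's code.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def update_augment_probability(p, real_logits, target=0.6, batch_size=64,
                               interval=4, ada_kimg=500):
    """Return the augmentation probability after one adjustment step.

    p: current augmentation probability in [0, 1].
    real_logits: discriminator outputs on real images since the last update.
    target, batch_size, interval, ada_kimg: assumed defaults, see lead-in.
    """
    if not real_logits:
        raise ValueError("real_logits must not be empty")
    # Overfitting heuristic: mean sign of the discriminator output on reals.
    # Values near 1 mean the discriminator is very sure about the training set.
    r_t = sum(sign(x) for x in real_logits) / len(real_logits)
    # Fixed step size: p can travel from 0 to 1 over ada_kimg * 1000 images.
    step = (batch_size * interval) / (ada_kimg * 1000)
    p = p + sign(r_t - target) * step
    # Clamping to [0, 1] is a simplification made for this sketch.
    return min(max(p, 0.0), 1.0)


p = 0.0
# Discriminator is confident on every real image: augment a little more.
p = update_augment_probability(p, [2.1, 0.7, 1.3, 0.2])
print(p)
```

Because the step is tiny, the probability drifts slowly and follows the overfitting signal over many updates instead of reacting to a single noisy batch.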
2. Open Source Natural Language Processing
Trankit
The fascinating world of NLP is not far behind when it comes to impressive open-sourced data science projects. Trankit is another popular project released last month.
Trankit is a lightweight, transformer-based Python toolkit for multilingual Natural Language Processing. Its two main components are:
- A trainable pipeline for fundamental NLP tasks over 100 languages
- 90 downloadable pretrained pipelines for 56 languages
Another impressive thing about Trankit is that it outperforms Stanza (StanfordNLP), the current state-of-the-art multilingual toolkit, on many tasks across 90 Universal Dependencies v2.5 treebanks covering 56 languages. It does so without sacrificing efficiency in memory usage or speed, which makes it usable by a wider audience.
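A pretrained pipeline can be tried out with very little code. The sketch below assumes Trankit is installed (`pip install trankit`) and that the pipeline returns a dictionary with a `sentences` list whose entries hold `tokens` with `text` and `upos` fields. The helper names `extract_tags` and `analyze` are hypothetical, and the import is deferred so the snippet loads even where Trankit is not available.

```python
def extract_tags(doc):
    """Collect (token text, universal POS tag) pairs for each sentence.

    `doc` is assumed to have the shape described in the lead-in. Tokens
    without a top-level `upos` field (for example multi-word tokens) get None.
    """
    return [[(tok["text"], tok.get("upos")) for tok in sent["tokens"]]
            for sent in doc["sentences"]]


def analyze(text, language="english"):
    """Run a pretrained Trankit pipeline on `text` and return its POS tags."""
    # Imported lazily; the pretrained pipeline is downloaded on first use.
    from trankit import Pipeline

    pipeline = Pipeline(language)
    return extract_tags(pipeline(text))
```

Calling `analyze("Hello! This is Trankit.")` would then return one list of pairs per sentence.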
EasyNMT — Easy to use, state-of-the-art Neural Machine Translation
With easy installation and usage, plus automatic download of pre-trained machine translation models, EasyNMT will easily make your NLP portfolio stand out.
It supports translation between 150+ languages, automatic language detection for 170+ languages, and both sentence and document translation.
At present, the project provides the following models:
- Opus-MT from Helsinki-NLP
- mBART50_m2m from Facebook Research
- M2M_100 from Facebook Research
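Usage is about as short as the name suggests. The sketch below assumes EasyNMT is installed (`pip install -U easynmt`). The wrapper `translate_sentences` and the `KNOWN_MODELS` tuple are hypothetical conveniences, and the model identifiers in the tuple are assumptions based on the project's README at the time of writing, so check them against the current documentation. The import is deferred so the snippet loads even without the library.

```python
# Model identifiers assumed from the EasyNMT README; verify before relying on them.
KNOWN_MODELS = ("opus-mt", "mbart50_m2m", "m2m_100_418M", "m2m_100_1.2B")


def translate_sentences(sentences, target_lang="en", model_name="opus-mt"):
    """Translate a list of sentences into `target_lang` with EasyNMT.

    The source language is left for the library to detect automatically.
    """
    if model_name not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model_name!r}")
    # Imported lazily; the chosen model is downloaded on first use.
    from easynmt import EasyNMT

    model = EasyNMT(model_name)
    return model.translate(sentences, target_lang=target_lang)
```

For example, `translate_sentences(["Dies ist ein Satz."], target_lang="en")` would download the Opus-MT model on first use and return the English translations as a list.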
3. Open Source Machine Learning
SeaLion
SeaLion is a brilliant machine learning project created to teach the concepts in an easier way, using concise algorithms that still do their tasks efficiently.
It is designed to give today’s aspiring ML engineers both the intuition behind popular machine learning concepts and practical ways to apply them.
It is beginner-friendly when it comes to working with standard datasets such as iris, breast cancer, swiss roll, moons, and MNIST. The algorithms in SeaLion include:
- Deep Neural Networks
- Regression
- Dimensionality Reduction
- Unsupervised Clustering
- Naive Bayes
- Trees
- Ensemble Learning
- Nearest Neighbors
- Utils