Python Data Engineering & Machine Learning Skill Matrix
Background
As part of my ongoing job hunt, I talk to many recruiters, and for one role I was asked to fill in a skill matrix. Job descriptions are usually quite vague, and you often learn what a role really involves only when you talk to the hiring manager or the team.
I really liked how exhaustive the skill matrix shared by the recruiter was. I am not sure whether the matrix is confidential, but since the skills themselves are public knowledge, I am sharing it so you can check how much of it you know and rate yourself. I will keep adding topics and making changes to make this skill matrix more exhaustive.
Motivation
This is a competency matrix (Data Engineer + Machine Learning skills) that will make it easier for you to match your competencies with job requirements. It also helps you assess where you need to up-skill. Of course, the skills below may change over time, so the list needs to be refreshed to stay current.
The Skill Matrix
How to Rate Yourself
Rate yourself on each technology you use for Data Engineering and/or Machine Learning.
Set your level of experience using these rules:
1: never used it, new to me
2: I use it from time to time
3: everyday tool
4: I'm an expert in this tool
5: I have 3+ years of experience *
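If you prefer to keep your ratings in code rather than in a form, the scale above can be sketched roughly as follows. This is only an illustration: the function names (`validate`, `category_averages`, `gaps`), the sample ratings, and the `required` job profile are all made up for the example, and treating an unrated skill as level 1 is my assumption, not part of the original matrix.

```python
from statistics import mean

# Rating scale taken from the rules above.
LEVELS = {
    1: "never used it, new to me",
    2: "use it from time to time",
    3: "everyday tool",
    4: "expert in this tool",
    5: "3+ years of experience",
}


def validate(matrix):
    """Raise ValueError if any rating falls outside the 1-5 scale."""
    for category, skills in matrix.items():
        for skill, level in skills.items():
            if level not in LEVELS:
                raise ValueError(f"{category}/{skill}: invalid level {level!r}")


def category_averages(matrix):
    """Average self-rating per category, rounded to two decimals."""
    return {cat: round(mean(skills.values()), 2) for cat, skills in matrix.items()}


def gaps(matrix, required):
    """Return {skill: (my_level, required_level)} for skills rated below the requirement.

    Assumption: a skill missing from the matrix counts as level 1.
    """
    flat = {s: lvl for skills in matrix.values() for s, lvl in skills.items()}
    return {
        skill: (flat.get(skill, 1), need)
        for skill, need in required.items()
        if flat.get(skill, 1) < need
    }


# Hypothetical self-ratings for two of the categories listed below.
my_matrix = {
    "Python": {"Python programming with OOP": 4, "CPython programming": 1},
    "AI & ML Frameworks & Libraries": {"PyTorch": 3, "pandas": 4, "numpy": 3},
}
# Hypothetical requirements for some job role.
required = {"PyTorch": 4, "pandas": 3, "Apache Airflow": 2}

validate(my_matrix)
averages = category_averages(my_matrix)
to_improve = gaps(my_matrix, required)
print(averages)
print(to_improve)
```

The per-category average gives a quick overview, while `gaps` lists only the skills where your rating is below what a role asks for, which is the part that tells you where to up-skill.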
The subsections below group the skills expected of a Data Engineer into major categories.
Python
- Python programming with OOP
- CPython programming — Python + C/C++ APIs
Machine Learning
- Linear Algebra
- Calculus
- Numerical Analysis
- Statistics (Frequentist & Bayesian)
- Linear regression & Logistic Regression
- Mixture Models, EM
- Latent Linear Models (PCA, SVD)
- Kernels (SVMs, Kernel Machines, etc.)
- Markov Models, Hidden Markov Models
- Ensembles
- MCMC inference
- Clustering
- Unsupervised dimensionality reduction (t-SNE, UMAP)
- Latent Models for Discrete Data (LSA, LDA)
Deep Learning
- Feed-forward Neural Nets
- Convolutional Neural Networks
- Recurrent Neural Networks
- Siamese Networks
- Few-Shot Learning
- Attention modules (including Transformers)
- Generative Adversarial Networks
- Generative Models (DBNs, RBMs)
Deep Learning & Image Processing
- Image classifiers (ResNets, EfficientNet, VGG, Inception)
- Object Detectors (SSD, YOLO, Faster-RCNN)
- Semantic/Instance Segmentation (CRF-RNN, Deeplab, Mask-RCNN)
- Learning similarity (Siamese Networks, DeepRank)
- Image Generation/Image translation (CycleGAN, StarGAN, StyleTransfer, VAE)
- Landmarks detection (Keypoints on face like eyes/nose or pose detection)
- Super Resolution
- Object tracking (video)
- Action recognition
- Video classification
- Facial recognition and modeling
- Image retrieval
- Image captioning
- StyleTransfer
Deep Learning & NLP
- Optical Character Recognition
- Text representations (bag of words, tf-idf, n-gram)
- Word Embeddings (word2vec, doc2vec, GloVe, fasttext)
- Language Models (BERT, GPT-2, ELMo, ERNIE, etc.)
- Named Entity Recognition
- Sentiment Analysis
- Question Answering
- Machine Translation
- Summarization (Abstractive & Extractive)
- Topic modeling
- Language generation
- Speech Recognition
- Speech Synthesis
- Speaker Separation
- Emotion Recognition
- Speaker Verification
- Speech Enhancement
AI & ML Additional Topics
- Reinforcement Learning (Q-learning, Deep Q-Learning, AlphaZero)
- Bayesian optimization
- Adversarial validation
- Survival analysis
- Non-gradient optimization
- Adversarial attacks & defenses
- Recommenders (Collaborative Filtering, Matrix Factorization)
- Time Series Analysis
AI & ML Frameworks & Libraries
- PyTorch
- TensorFlow
- fast.ai
- cleverhans
- DALI
- gin-config
- imbalanced-learn
- mlxtend
- numpy
- RAPIDS
- scikit-learn
- scipy
- transformers (From HuggingFace)
- nltk
- spacy
- flair
- farm
- pytext
- nevergrad
- pyro
- pgmpy
- surprise
- pykaldi/pytorchkaldi/kaldi
- pandas
- matplotlib
- seaborn
- plotly
- OpenCV
ML Ops & Data Ops Platforms
- Polyaxon
- Quilt Data
- MLflow
- DVC
- Seldon
- Kubeflow
- TensorFlow Serving
Data Management
- SQL (e.g. PostgreSQL, Oracle, MySQL)
- document NoSQL (e.g. MongoDB, CouchDB)
- graph dbs NoSQL (e.g. Neo4j, OrientDB, AWS Neptune)
- column dbs NoSQL (e.g. Apache Cassandra, Apache Druid, GCP BigQuery, ClickHouse)
Open Source Data Platforms
- Apache Airflow
- Apache Beam
- Apache Nifi
- Apache Arrow
- Apache Parquet
- ORC format
- Avro format
- Apache Kafka
- Apache Flink
- Kafka Streams
- Spark Streaming
- Apache Spark
Databases, Streaming & Queues
- PostgreSQL
- MySQL
- InfluxDB
- Oracle Database
- Redis
- HBase
- Neo4j
- RabbitMQ
- ELK stack — Elasticsearch, Logstash, Kibana
- Snowflake Data Cloud
Data Visualisation & Notebooks
- Apache Superset
- Metabase
- Zeppelin
- databricks.com
Data Governance & Metadata
- Apache Atlas
Data Engineering & Architecture Basics
- Scala programming language
- DAGs building for ETL jobs
- data catalogs
- data schemas with registry and versioning (e.g. Avro)
- data warehouses
- JSON format
- XML format
- logs, monitoring (e.g. Kibana, Graylog, Datadog, Prometheus)
- infrastructure horizontal scaling
- infrastructure vertical scaling
- orchestrators in general (e.g. Mesos, Marathon, Kubernetes)
- software performance testing
- software unit testing
- software functional/behavioral testing
- REST APIs + HTTP protocol
- Security & Data Masking
- software debugging skills
AWS Data, AI & ML Offerings
- AWS Redshift
- AWS Batch
- AWS Lambda
- AWS
- AWS S3
- AWS EC2
- AWS Athena
- AWS EMR
- AWS Kinesis
- AWS SageMaker
- AWS Quicksight
- AWS Glue
- AWS Lake Formation
- AWS Data Exchange
- AWS DynamoDB
GCP Data, AI & ML Offerings
- GCP BigQuery
- GCP Looker
- GCP Dataproc
- GCP Dataflow
- GCP Pub/Sub
- GCP Cloud Data Fusion
- GCP Data Catalog
- GCP Cloud Composer
- GCP Google Data Studio
- GCP Marketing Platform
- GCP Cloud Life Sciences
- GCP Dataprep
Azure Data, AI & ML Offerings
- Azure Analysis Services
- Azure Data Explorer
- Azure Data Lake Storage
- Azure Data Share
- Azure Databricks
- Azure Stream Analytics
- Azure Synapse Analytics
- Azure Data Catalog
- Azure Data Factory
- Azure Data Lake Analytics
- Azure Event Hubs
- Azure HDInsight
- Azure Log Analytics
- Azure Power BI Embedded
- Azure R Server for HDInsight
- Azure Purview
- Azure Cosmos DB
- Azure Table Storage
- Azure SQL Database
- Azure AI + ML — text services
- Azure AI + ML — Computer Vision
- Azure AI + ML — Data Science Virtual Machines
- Azure Open Datasets
- Azure Cognitive Search
- Azure Machine Learning
Additional Tech Platforms
- IoT services in public clouds
Finally
While this is a long list, I know there are many existing and emerging technologies that are missing from it. Still, it can easily serve as a starting point for building a skill matrix for yourself or your team, which you can extend with more relevant skills to guide up-skilling or to map to a given business problem, project or business requirement.