Python Data Engineering & Machine Learning Skill Matrix

DailyPriyab
DAM DATA-AI-ML Learning
4 min read · Jan 19, 2022
Stock Image from Pixabay https://bit.ly/3qHMggY

Background

My job hunt is underway, and I talk to many recruiters. For one such role, I was asked to fill in a skill matrix. Job roles and profiles are usually quite vague, and you often get to know the role only when you talk with the hiring manager or the team.

I really liked how exhaustive the skill matrix shared by the recruiter was. I am not sure about its confidentiality, but since the skills themselves are public knowledge, I am sharing it so you can check how much of it you know and rate yourself. I will keep adding topics and making changes in the future to make this skill matrix more exhaustive.

Motivation

This is a competency matrix (Data Engineer + Machine Learning skills) that will make it easier for you to match your competencies with job requirements. It also helps you assess where you need to up-skill. Of course, the skills below will change over time, so treat this list as a snapshot of what is relevant today.

The Skill Matrix

How to Rate Yourself

Rate yourself on each technology you use for Data Engineering and/or Machine Learning.

Set your level of experience using these rules:
1 — never, new for me
2 — I use it from time to time
3 — everyday tool
4 — I’m an expert in this tool
5 — I have 3+ years of experience *
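
The rating rules above can be recorded in a small script so that your self-assessment is easy to compare with a job description. The following is only a minimal sketch: the `skill_gaps` helper, the sample ratings, and the required levels are all hypothetical and are not part of the recruiter's form. It assumes that any skill you have not rated counts as level 1.

```python
from typing import Dict, List, Tuple

# The 1-5 scale from the rules above.
LEVELS: Dict[int, str] = {
    1: "never, new for me",
    2: "use it from time to time",
    3: "everyday tool",
    4: "expert in this tool",
    5: "3+ years of experience",
}


def skill_gaps(
    ratings: Dict[str, int], required: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (skill, current level, required level) for every skill
    rated below the required level, largest gap first."""
    gaps = []
    for skill, needed in required.items():
        current = ratings.get(skill, 1)  # assumption: unrated means level 1
        if current not in LEVELS or needed not in LEVELS:
            raise ValueError(f"levels for {skill!r} must be between 1 and 5")
        if current < needed:
            gaps.append((skill, current, needed))
    # Sort by gap size (descending), then by skill name for stable output.
    gaps.sort(key=lambda g: (-(g[2] - g[1]), g[0]))
    return gaps


# Hypothetical self-assessment and job requirements, for illustration only.
my_ratings = {"PyTorch": 2, "Apache Airflow": 3, "Apache Kafka": 1, "pandas": 4}
job_requirements = {"PyTorch": 3, "Apache Airflow": 3, "Apache Kafka": 3, "AWS Glue": 2}

for skill, current, needed in skill_gaps(my_ratings, job_requirements):
    print(f"{skill}: at level {current} ({LEVELS[current]}), role needs level {needed}")
```

Keeping the ratings in a plain dictionary makes it simple to extend the matrix with new skills later or to load it from a spreadsheet export.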

The subsections below group the skills expected of a Data Engineer into major categories.

Python

  • Python programming with OOP
  • CPython programming — Python + C/C++ APIs

Machine Learning

  • Linear Algebra
  • Calculus
  • Numerical Analysis
  • Statistics (Frequentist & Bayesian)
  • Linear regression & Logistic Regression
  • Mixture Models, EM
  • Latent Linear Models (PCA, SVD)
  • Kernels (SVMs, Kernel Machines etc.)
  • Markov Models, Hidden Markov Models
  • Ensembles
  • MCMC inference
  • Clustering
  • Unsupervised dimensionality reduction (t-SNE, UMAP)
  • Latent Models for Discrete Data (LSA, LDA)

Deep Learning

  • Feed-forward Neural Nets
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • Siamese Networks
  • Few-Shot Learning
  • Attention modules (including Transformers)
  • Generative Adversarial Networks
  • Generative Models (DBNs, RBMs)

Deep Learning & Image Processing

  • Image classifiers (ResNets, EfficientNet, VGG, Inception)
  • Object Detectors (SSD, YOLO, Faster-RCNN)
  • Semantic/Instance Segmentation (CRF-RNN, Deeplab, Mask-RCNN)
  • Learning similarity (Siamese Networks, DeepRank)
  • Image Generation/Image translation (CycleGAN, StarGAN, StyleTransfer, VAE)
  • Landmarks detection (Keypoints on face like eyes/nose or pose detection)
  • Super Resolution
  • Object tracking (video)
  • Action recognition
  • Video classification
  • Facial recognition and modeling
  • Image retrieval
  • Image captioning
  • StyleTransfer

Deep Learning & NLP

  • Optical Character Recognition
  • Text representations (bag of words, tf-idf, n-gram)
  • Word Embeddings (word2vec, doc2vec, GloVe, fasttext)
  • Language Models (BERT, GPT-2, ELMo, ERNIE etc.)
  • Named Entity Recognition
  • Sentiment Analysis
  • Question Answering
  • Machine Translation
  • Summarization (Abstractive & Extractive)
  • Topic modeling
  • Language generation
  • Speech Recognition
  • Speech Synthesis
  • Speaker Separation
  • Emotion Recognition
  • Speech Verification
  • Speech Enhancement

AI & ML Additional Topics

  • Reinforcement Learning (Q-learning, Deep Q-Learning, AlphaZero)
  • Bayesian optimization
  • Adversarial validation
  • Survival analysis
  • Non-gradient optimization
  • Adversarial attacks & defenses
  • Recommenders (Collaborative Filtering, Matrix Factorization)
  • Time Series Analysis

AI & ML Frameworks & Libraries

  • PyTorch
  • TensorFlow
  • fast.ai
  • cleverhans
  • DALI
  • gin-config
  • imbalanced-learn
  • mlxtend
  • numpy
  • RAPIDS
  • scikit-learn
  • scipy
  • transformers (From HuggingFace)
  • nltk
  • spacy
  • flair
  • farm
  • pytext
  • nevergrad
  • pyro
  • pgmpy
  • surprise
  • pykaldi/pytorchkaldi/kaldi
  • pandas
  • matplotlib
  • seaborn
  • plotly
  • OpenCV

ML Ops & Data Ops Platforms

  • Polyaxon
  • Quilt Data
  • MLflow
  • DVC
  • Seldon
  • Kubeflow
  • TensorFlow Serving

Data Management

  • SQL (e.g. PostgreSQL, Oracle, MySQL)
  • document NoSQL (e.g. MongoDB, CouchDB)
  • graph dbs NoSQL (e.g. Neo4j, OrientDB, AWS Neptune)
  • column dbs NoSQL (e.g. Apache Cassandra, Apache Druid, GCP BigQuery, ClickHouse)

Open Source Data Platforms

  • Apache Airflow
  • Apache Beam
  • Apache Nifi
  • Apache Arrow
  • Apache Parquet
  • ORC format
  • Avro format
  • Apache Kafka
  • Apache Flink
  • Kafka Streams
  • Spark Streaming
  • Apache Spark

Databases, Streaming & Queues

  • PostgreSQL
  • MySQL
  • InfluxDB
  • Oracle Database
  • Redis
  • HBase
  • Neo4j
  • RabbitMQ
  • ELK stack — Elasticsearch, Logstash, Kibana
  • Snowflake Data Cloud

Data Visualisation & Notebooks

Data Governance & Metadata

  • Apache Atlas

Data Engineering & Architecture Basic

  • Scala programming language
  • DAGs building for ETL jobs
  • data catalogs
  • data schemas with registry and versioning (e.g. Avro)
  • data warehouses
  • JSON format
  • XML format
  • logs, monitoring (e.g. Kibana, Graylog, Datadog, Prometheus)
  • infrastructure horizontal scaling
  • infrastructure vertical scaling
  • orchestrators in general (e.g. Mesos, Marathon, Kubernetes)
  • software performance testing
  • software unit testing
  • software functional/behavioral testing
  • REST APIs + HTTP protocol
  • Security & Data Masking
  • software debugging skills

AWS Data, AI & ML Offerings

  • AWS Dynamo
  • AWS Redshift
  • AWS Batch
  • AWS Lambda
  • AWS
  • AWS S3
  • AWS EC2
  • AWS Athena
  • AWS EMR
  • AWS Kinesis
  • AWS SageMaker
  • AWS QuickSight
  • AWS Glue
  • AWS Lake Formation
  • AWS Data Exchange
  • AWS DynamoDB

GCP Data, AI & ML Offerings

  • GCP BigQuery
  • GCP Looker
  • GCP Dataproc
  • GCP Dataflow
  • GCP Pub/Sub
  • GCP Cloud Data Fusion
  • GCP Data Catalog
  • GCP Cloud Composer
  • GCP Google Data Studio
  • GCP Marketing Platform
  • GCP Cloud Life Sciences
  • GCP Dataprep

Azure Data, AI & ML Offerings

  • Azure Analysis Services
  • Azure Data Explorer
  • Azure Data Lake Storage
  • Azure Data Share
  • Azure Databricks
  • Azure Stream Analytics
  • Azure Synapse Analytics
  • Azure Data Catalog
  • Azure Data Factory
  • Azure Data Lake Analytics
  • Azure Event Hubs
  • Azure HDInsight
  • Azure Log Analytics
  • Azure Power BI Embedded
  • Azure R Server for HDInsight
  • Azure Purview
  • Azure Cosmos DB
  • Azure Table Storage
  • Azure SQL Database
  • Azure AI + ML — text services
  • Azure AI + ML — Computer Vision
  • Azure AI + ML — Data Science Virtual Machines
  • Azure Open Datasets
  • Azure Cognitive Search
  • Azure Machine Learning

Additional Tech Platforms

  • IoT services in public clouds

Finally

This is a long list, and I know many existing and emerging technologies are missing from it. Still, it can easily be a starting point for building a skill matrix for yourself or your team, which you can extend with more relevant skills to up-skill or to map to a given business problem, project, or business requirement.
