CODEX

Beginner Resources for Aspiring Python Data Analysts/Scientists

Ina Hanninger
Published in CodeX · 5 min read · Mar 23, 2021


Towards the end of my time teaching an Introduction to Python course for CodeFirstGirls, we got several enthusiastic questions about how to get started with Python libraries for data analytics, visualization and data science. Having learned all this fairly recently myself, I thought I’d share some resources I personally found useful!

Before I delve in, I think it’s important to first clarify what data analysis and data science actually mean and how they differ (as they are often confused).

Data analysis = the act of identifying trends in datasets, computing descriptive statistics, and developing charts and visualizations of the data to help individuals or companies make specific strategic decisions (usually business decisions).

Data science = the broader exploratory pursuit of extracting real-world knowledge from data (data analysis can be thought of as a subset of data science). On top of developing visualizations, data science tends to involve building more complex statistical models to represent the data (e.g. linear regression), or training machine learning models to make predictions. Often this involves a lot more coding and computer science expertise than data analysis.

Source: https://www.ironhack.com/en/data-analytics/data-science-data-analytics

Sources:
https://hackr.io/blog/data-science-vs-data-analytics
https://www.northeastern.edu/graduate/blog/data-analytics-vs-data-science/
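
To make the distinction a little more concrete, here is a minimal sketch on a tiny made-up dataset (the column names and numbers are invented purely for illustration). The first half is the kind of descriptive summary typical of data analysis; the second half fits a simple linear regression and uses it to predict an unseen value, which leans towards the data science side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ad spend and sales (made-up numbers).
df = pd.DataFrame({
    "ad_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales": [25.0, 45.0, 65.0, 85.0, 105.0],
})

# Data analysis: describe what is already in the data.
mean_sales = df["sales"].mean()
correlation = df["ad_spend"].corr(df["sales"])

# Data science: fit a model (here, a straight line) and predict an unseen value.
slope, intercept = np.polyfit(df["ad_spend"], df["sales"], deg=1)
predicted_sales = slope * 60.0 + intercept

print(f"mean sales: {mean_sales:.1f}")
print(f"correlation: {correlation:.2f}")
print(f"predicted sales at ad_spend=60: {predicted_sales:.1f}")
```

In practice the line between the two is blurry, and real datasets are rarely this tidy, but the split between "summarise" and "model and predict" is a reasonable first mental picture.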

Resources for Data Analysis and Visualisation

First off, here are some widely used libraries you should be aware of. The following links to the documentation also include tutorials you can take a look at:

  • Pandas: the “Python Data Analysis Library”. It allows fast and easy manipulation of both tabular and time series data, including merging and joining of rows and columns, aggregations such as groupbys, and effective handling of missing data points. It is built on top of NumPy, and introduces two fundamental data structures: the Series (for 1-dimensional data) and the DataFrame (for 2-dimensional data).
  • NumPy: supports the creation and manipulation of multi-dimensional arrays and matrices. In the context of data analysis, data is often stored in NumPy arrays, which can then be used to create Pandas DataFrames.
  • Matplotlib: used for plotting a wide range of graphs and charts for statistical visualization.
  • Seaborn: a higher-level plotting library built on top of Matplotlib (which people tend to find a bit more aesthetic).
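
As a rough illustration of the Pandas and NumPy features described in the list above, here is a short sketch that builds a Series and a DataFrame, fills in a missing value, aggregates with a groupby and merges two tables. The data, column names and teacher names are all made up for the example.

```python
import numpy as np
import pandas as pd

# A Series is a labelled 1-dimensional array (one value is missing here).
temperatures = pd.Series([21.5, 19.0, np.nan], index=["Mon", "Tue", "Wed"])
print("missing temperatures:", temperatures.isna().sum())

# A DataFrame is a 2-dimensional table; here it is built from a NumPy array.
scores = np.array([[1, 80.0], [1, 90.0], [2, 70.0], [2, np.nan]])
df = pd.DataFrame(scores, columns=["class_id", "score"])
df["class_id"] = df["class_id"].astype(int)

# Handle missing data: fill the missing score with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Aggregate with groupby: mean score per class.
mean_by_class = df.groupby("class_id")["score"].mean()

# Merge with a second table on a shared key column.
teachers = pd.DataFrame({"class_id": [1, 2], "teacher": ["Ada", "Grace"]})
merged = df.merge(teachers, on="class_id", how="left")

print(mean_by_class)
print(merged)
```

Filling with the mean is just one simple choice for missing values; depending on the data, dropping the rows or using a different fill strategy may make more sense.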

I found that the best way to learn these libraries was to follow practical examples, while having the official documentation open to make sure I understood what each function does. I recommend following blog posts on Towards Data Science, and finding notebooks on Google Colab you can run through live.

Here are a few starting points:

  1. A Beginners Guide to Data Analysis in Python (Towards Data Science)
  2. Data Visualisation in Python (Medium Post)
  3. Intro to Pandas (Google Colab)
  4. Visualisation with Seaborn (Google Colab)
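
Before opening any of these, it may help to see roughly what a first plotting script looks like. The following is a minimal Matplotlib sketch on randomly generated data (the data, the output filename and the choice of the non-interactive "Agg" backend are assumptions made so that it runs as a plain script; in a notebook you would typically drop the backend line and the plot would display inline).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a noisy straight line.
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=2.0, size=x.size)

# One figure with two side-by-side plots (axes).
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_scatter.scatter(x, y, s=12)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

ax_hist.hist(y, bins=10)
ax_hist.set_xlabel("y")
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("example_plot.png")  # hypothetical output filename
```

Seaborn works with the same figure and axes objects, so what you learn here carries over when you move on to the Seaborn notebook.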

Resources for Machine Learning and Data Science

When it comes to learning more advanced forms of data science (i.e. training predictive models, building and optimizing machine learning algorithms), understanding the mathematical and statistical theory behind these methods becomes quite important. Although many modern Python libraries abstract away the implementation of these algorithms into single lines of code, understanding how they work is critical to ensuring that the results you get are accurate and appropriate. This includes knowing which classifier or regression method makes sense for your particular dataset, preventing data leakage (where you unintentionally feed test data into the training process), and knowing how to optimize the model hyperparameters. That being said, don’t worry too much at the beginning about fully grasping all the mathematical details (which can be quite intimidating at first!). I find it’s best to learn the theory and the practical coding side by side.
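
To give a flavour of what avoiding data leakage and tuning hyperparameters can look like in code, here is a small scikit-learn sketch. It is only one possible setup: the dataset (scikit-learn's bundled breast cancer data), the choice of logistic regression and the grid of `C` values are all assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set first, so nothing fitted later ever sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Putting the scaler inside the pipeline means it is re-fitted on the training
# part of each cross-validation fold, so validation data never leaks into it.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hyperparameter search over the regularisation strength C (arbitrary values).
search = GridSearchCV(pipeline, param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The test set is only used once, at the very end.
test_accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["model__C"])
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

A common way to leak data by accident is to scale (or otherwise preprocess) the full dataset before splitting it; keeping preprocessing inside the pipeline is one simple guard against that.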

For understanding the theory:

  • Andrew Ng’s Machine Learning course on Coursera, for a more general overview of machine learning concepts. It gets quite mathematical and may lack the practical side of things (on top of being taught in MATLAB instead of Python), but it is otherwise ideal for nailing the theoretical and conceptual fundamentals.
  • DeepLearning.AI on Coursera, which focuses more on deep learning models such as neural networks, and includes Jupyter notebooks you can work through as well. I’d only move on to this once you’ve covered at least some of the general ML concepts, though.

For learning the practical side:

Here are some widely used machine learning libraries:

  • Scikit-learn: implements a wide range of classification, regression, clustering, dimensionality reduction, model selection and preprocessing algorithms (built on NumPy, SciPy, and Matplotlib)
  • TensorFlow: a library developed by Google focusing on training and inference of deep neural networks
  • PyTorch: a deep learning library similar to TensorFlow, originally developed by Facebook’s AI research lab and widely used in computer vision and natural language processing
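
As a small taste of the scikit-learn side of that list, the sketch below chains three of the categories mentioned (preprocessing, dimensionality reduction and clustering) on the bundled iris dataset. The dataset and the choice of 3 clusters are assumptions for illustration only; TensorFlow and PyTorch have quite different APIs and are not shown here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # ignore the labels: this is unsupervised

# Preprocessing: standardise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Clustering: group the samples into 3 clusters (3 is an assumption, not tuned).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)

print(X_2d.shape)
print(sorted(set(cluster_labels.tolist())))
```

Most scikit-learn objects follow this same pattern of `fit`, `transform` and `predict` methods, which is a large part of why the library is comfortable to learn.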

Just like with the data analysis libraries mentioned above, I found the best way for me to learn was by working through examples of other people’s code. In addition to Towards Data Science blog posts and Google Colab notebooks, I also found the following resources very helpful:

  • Kaggle Code:
    Kaggle is an online platform where companies and organizations share their datasets with the public, and users compete to build the most accurate machine learning models for their use cases. Alongside this, there is a rich and broad archive of Jupyter notebooks aimed at beginners, which explain each step of their code. From the link above, you can search for any type of dataset or machine learning method you would like to learn about (make sure you filter by Python though). As an example, this is one of the first notebooks I followed along with: What Causes Heart Disease? Explaining the Model
  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition:
    If you’re someone who finds it easier to grasp concepts when they’re written on paper, I recommend this textbook as a practical guide to using the most popular machine learning libraries in Python. It’s particularly great at explaining only the most relevant bits of theory, and then giving you code examples you can try yourself.
  • DataQuest.io:
    A website with a range of interactive courses for data science and machine learning. The full version is paid, but you can still access a range of free lessons to get you started.

Hope that helps!

Ina Hanninger is a London-based software/data engineer and recent MEng graduate from Oxford with research experience in machine learning and biomedical engineering.