Data Science Series with IEEE Part 2

Feb 24, 2019

On 17th February 2019, IBM conducted a workshop on Data Analysis with Python at the American University of Sharjah (AUS), UAE. Mahitab Hassan and I delivered it. The workshop was the second part of the IEEE series.

The workshop started with Mahitab Hassan recapping the first workshop, which covered an introduction to data science, pandas, NumPy, Matplotlib, Jupyter Notebooks on IBM Watson Studio, and the data science pipeline.

I then began the Data Analysis with Python session. I explained how data analysis can be used to solve various problems and walked through its steps, starting with data requirements: understanding what kind of data a particular problem requires. This usually involves brainstorming sessions with domain experts.

After data requirements, the next step is to collect the data from various distributed sources. The collected data can be structured or unstructured, and it can contain problems such as missing values, unbalanced data, outliers, and non-standardized values.

Hence, the next step in the flow is data preparation. I explained the various ways to deal with missing values, how to standardize data, and why and when to normalize it. I also demonstrated how to do all of this with Python and pandas in Jupyter Notebooks on IBM Watson Studio.
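To make the data preparation step concrete, here is a minimal sketch in pandas. It is not the notebook used in the workshop: the toy dataset, the column names, and the choice of fill strategies are all assumptions made for illustration. It also assumes that "standardizing" means bringing values to a common unit (here, miles per gallon to litres per 100 km) and that "normalizing" means rescaling a column, shown with both min-max scaling and z-scores.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "price": [13495.0, 16500.0, np.nan, 13950.0, 17450.0],
    "horsepower": [111.0, 111.0, 154.0, np.nan, 115.0],
    "city_mpg": [21, 21, 19, 24, 18],
    "fuel": ["gas", None, "gas", "diesel", "gas"],
})

# Missing values, option 1: drop rows where the target column is missing.
clean = df.dropna(subset=["price"]).copy()

# Missing values, option 2: fill a numeric column with its mean.
clean["horsepower"] = clean["horsepower"].fillna(clean["horsepower"].mean())

# Missing values, option 3: fill a categorical column with its most frequent value.
clean["fuel"] = clean["fuel"].fillna(clean["fuel"].mode()[0])

# Standardization (assumed meaning: a common unit): mpg -> L/100 km.
clean["city_l_per_100km"] = 235.215 / clean["city_mpg"]

# Normalization, min-max scaling: values end up in the range [0, 1].
hp = clean["horsepower"]
clean["horsepower_minmax"] = (hp - hp.min()) / (hp.max() - hp.min())

# Normalization, z-score: values end up with mean 0 and unit standard deviation.
clean["horsepower_z"] = (hp - hp.mean()) / hp.std()

print(clean)
```

Min-max scaling is handy when features need to share a bounded range, while z-scores are a common choice when the spread of a feature matters more than its limits. Which one fits depends on the model and the data.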

After the data has been prepared, the next step is basic data understanding using statistical measures such as the mean, percentiles, and standard deviation; a single pandas function, df.describe(), returns all of these at once. The next step is to decide which features of the dataset to keep, based on their importance or their correlation with the target value. For this, we went over what correlation is and how to use it, how to use the groupby function in pandas to perform calculations on groups of data, and what ANOVA is and why it is useful for finding out whether the groups in a given column differ in their effect on the target column.
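The steps in this part of the session can be sketched roughly as follows. The dataset and column names are made up for illustration, and using scipy.stats.f_oneway for the one-way ANOVA is an assumption on my part; it is one common way to run the test, not necessarily the exact call shown in the workshop.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy dataset; names and values are illustrative only.
df = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "engine_size": [90, 98, 108, 130, 152, 164, 97, 108, 110],
    "price": [6500, 7200, 8100, 16500, 18900, 21000, 7600, 9200, 9900],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()
print(summary)

# Pearson correlation between one feature and the target column.
corr = df["engine_size"].corr(df["price"])
print("correlation:", corr)

# Group by a categorical column and compute the mean target value per group.
group_means = df.groupby("drive_wheels")["price"].mean()
print(group_means)

# One-way ANOVA: do the groups differ in their mean target value?
samples = [group["price"].to_numpy() for _, group in df.groupby("drive_wheels")]
f_stat, p_value = stats.f_oneway(*samples)
print("F:", f_stat, "p:", p_value)
```

A large F statistic with a small p-value suggests that at least one group mean differs from the others, which is a hint that the grouping column is worth keeping as a feature.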

The session concluded with a Q&A, after which we shared resources and links, including a quiz based on the workshop. Passing the quiz earns students a badge from IBM, which they can show as proof of their understanding of data analysis with Python.

Resources:

Jupyter Project (http://jupyter.org/)
NumPy (http://www.numpy.org/)
Matplotlib (https://matplotlib.org)
Seaborn (https://seaborn.pydata.org)
Cognitive Class (https://cognitiveclass.ai/courses/data-analysis-python/)
IBM Watson Studio (https://datascience.ibm.com/)
IBM Developer (developer.ibm.com/code)
Udacity (https://eu.udacity.com/school-of-ai)
Kaggle (https://www.kaggle.com/)

Written by Chandni Vaya