I just completed my first Exploratory Data Analysis (EDA) project in week 3 of General Assembly’s Data Science Bootcamp. Here’s what I’ve learnt so far:

Gloria Neo
4 min readJul 16, 2023


If there is one thing anyone considering a data science bootcamp needs to know, it’s that every day passes like lightning. By mid-week 2, we had already covered the essentials required for data analysis (Python, pandas, Matplotlib, seaborn, etc.), and then we had around 10 days (amidst lessons and other labs) to complete the project.

For this article, I have decided not to go in depth into the project itself, because there are already tons of EDA projects on Medium. Instead, I would like to share my takeaways for the benefit of others who are doing a project for the first time.

If you would like to read more about my project, you can find details of the code and PowerPoint slide deck here.

Photo by Luke Chesser on Unsplash

The Project

For context, we were supposed to do an exploratory data analysis (EDA) of Singapore rainfall. We were given the freedom to decide on the problem statement, and my problem statement was:

You are part of the National Environment Agency (NEA) taskforce set up to develop a set of air quality early warning systems to predict potential periods of air quality deterioration, allowing the government to take necessary precautionary action. As such, your EDA should provide insights into the impact (immediacy, magnitude) that the various weather factors have on air quality.

While it looked straightforward, this problem statement shaped the direction and purpose of the entire EDA, and could make or break the project. But let’s move on to the overall lessons learnt.

1. Data is Everything.

Look for your data sources first.

For this first project, we had 10 days, and we had not yet been taught (and were not expected to do) web scraping or web crawling, given the short time frame. In a scenario like this, it is important to look for data sources in conjunction with thinking up your problem statement. I made the mistake of spending a whole night crafting my problem statement and researching another topic (flood control), only to realise there were insufficient datasets to work with.

Data cleaning is the most important part of the process.

In my first few labs and my first project, I had the misconception that data import and cleaning were just pre-work to be done before the actual work. I realised soon enough how important data cleaning is. As they always say, “garbage in, garbage out”: the initial data cleaning sets the foundation for every data process that follows.
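To make that concrete, here is a minimal sketch of the kind of first-pass cleaning I mean. The column names and values are illustrative, not the actual rainfall dataset from the project:

```python
import pandas as pd

# Hypothetical monthly rainfall data -- illustrative only, not the
# actual dataset used in the project.
df = pd.DataFrame({
    "month": ["2022-01", "2022-02", "2022-03", "2022-03"],
    "total_rainfall": ["246.3", "13.9", None, "178.4"],
})

# Typical first-pass cleaning steps:
df = df.drop_duplicates(subset="month", keep="last")   # remove duplicated months
df["month"] = pd.to_datetime(df["month"])              # parse strings into dates
df["total_rainfall"] = pd.to_numeric(df["total_rainfall"], errors="coerce")
df = df.dropna(subset=["total_rainfall"])              # drop rows we cannot use

print(df.dtypes)
```

Skipping any of these steps quietly poisons everything downstream: a string column will not aggregate, and duplicate months will double-count in any monthly summary.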

2. Automate where possible.

The time spent writing some functions is absolutely worth it. Further down the notebook, when you realise you have cut out so much potentially repetitive code and made your notebook look so much neater, future you will thank you.

Besides, what’s the point of pivoting into tech if you don’t automate things?
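A small helper of the kind I mean (the names here are illustrative): instead of copy-pasting the same resample-and-aggregate code for every weather variable, wrap it once and reuse it.

```python
import pandas as pd

def monthly_mean(df: pd.DataFrame, column: str) -> pd.Series:
    """Resample a daily time series to monthly means."""
    return df.set_index("date")[column].resample("MS").mean()

# Illustrative daily data -- the same one-liner now works for rainfall,
# temperature, wind speed, or any other column.
daily = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=60, freq="D"),
    "rainfall": range(60),
})
print(monthly_mean(daily, "rainfall"))
```

One function call per variable instead of one copy-pasted block per variable, and if the aggregation logic ever changes, you fix it in exactly one place.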

3. Share your thought process.

Another misconception I had earlier was that shorter notebooks and chained code meant neater, more efficient, and more effective code.

However, this is not necessarily true. For a start, especially if you are new to coding, it is good practice to break your workflow into separate steps. This lets you trace errors more easily and check your intermediate results along the way.

Whether in a school or work setting, there will likely be people reading your code notebooks. It is always good to include markdown cells that explain your thought process and findings between the various steps.
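Here is an illustrative comparison (the data is made up) of one long method chain versus the same work split into named steps you can inspect mid-pipeline:

```python
import pandas as pd

# Illustrative data only.
raw = pd.DataFrame({
    "psi": [55, None, 210, 48],
    "region": ["east", "east", "west", "west"],
})

# Chained version -- compact, but if a number looks off, you cannot
# easily see which step went wrong:
chained = raw.dropna().groupby("region")["psi"].mean()

# Step-by-step version -- each intermediate result has a name and can
# be sanity-checked before moving on:
no_missing = raw.dropna()
assert no_missing["psi"].isna().sum() == 0   # check the step did what we think
by_region = no_missing.groupby("region")["psi"].mean()

print(by_region)
```

Both produce identical results; the second version just gives you places to stop, look, and explain yourself in an adjacent markdown cell.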

4. Be a storyteller.

For this project, I tried out various data visualisation methods to effectively convey and explore time series data, and ended up exploring the subject in terms of breadth rather than depth.

A screenshot of some of my project 1 visualizations

There is nothing wrong with exploring breadth-wise in EDA. But going in depth would perhaps allow you to craft a narrative more effectively through your visualisations, and the flow between them would feel more natural.
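As one sketch of what “depth” could look like (with made-up numbers, not my project data): rather than many unrelated chart types, iterate on a single focused view, such as rainfall and an air-quality index on a shared time axis, so any lag between them becomes visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative monthly values -- not real NEA data.
dates = pd.date_range("2022-01-01", periods=12, freq="MS")
rainfall = [240, 30, 180, 160, 120, 130, 150, 140, 110, 160, 250, 300]
psi = [60, 95, 70, 72, 80, 78, 75, 76, 90, 74, 58, 55]

fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.plot(dates, rainfall, color="tab:blue")
ax1.set_ylabel("Rainfall (mm)")

ax2 = ax1.twinx()  # second y-axis so both series share one time axis
ax2.plot(dates, psi, color="tab:red")
ax2.set_ylabel("PSI")

ax1.set_title("Monthly rainfall vs. PSI (illustrative data)")
fig.savefig("rainfall_vs_psi.png")
```

One chart like this, refined over several iterations, often tells the story better than a gallery of different plot types shown once each.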

Conclusion

There is no right or wrong in EDA, but having a clear problem statement and being able to link your findings back to it is key. After all, the aim of an EDA is to extract valuable insights, identify data issues, and inform the data analysis process, laying the groundwork for more advanced modelling and decision-making.


Gloria Neo

Junior AI Engineer with a focus on computer vision. Learning is my way of experiencing the world, one new discovery at a time. https://glorianeo.carrd.co/