Data analysis case study of cancer cases in Nigeria

Omolara Akanni
7 min read · May 19, 2024

Recently, I came across an interesting statistic in Punch Newspapers: Nigeria records over 120,000 new cancer cases yearly. You can read more here. This made me wonder about the level of awareness of the disease. How many Nigerians are familiar with the symptoms associated with cancer? How many are aware that there are vaccines that can help prevent some cancers, as well as medical screening tests capable of detecting early-stage cancer or benign tumors? How many Nigerians know that a cancer diagnosis is not necessarily a death sentence? This need to increase awareness is what made me decide to work on this data visualization project.

The following headings explain the data analysis process:

Data Research & Planning

While data resources are everywhere on the internet, sourcing valid and reliable data in Nigeria is a serious hassle. A data analyst working in a company can usually connect to the company's database and extract the data needed with SQL or a NoSQL query language, depending on the database. For this project, there was no database to connect to and no specific website to retrieve data from. I explored health sites in Nigeria, such as https://www.health.gov.ng/, but my search led to dead ends. Finally, after persistent research, I found a resource: Cancer in Nigeria Vol II. This document contains data on cancer cases diagnosed across different hospitals in selected states in Nigeria between 2009 and 2016. A link to this resource is provided at the end of the article for reference. What I learnt during this process is that, for a data analyst, the importance of good research and critical thinking skills cannot be overemphasized.

Planning

Before I start working on any data project, I draft the content of the visualization and break down the tasks for each section of the project. This gives me a clear overview of the entire project from start to finish. For this project, I planned the sections of the dashboard and documented them in Notion. You can view the project plan in this Notion document.

A screenshot showing a part of the Notion page where I documented my plan for the cancer dataset.
A screenshot showing a part of the Notion page where I created project tasks and timeline for the dataset

Data Processing & Cleaning

In the real world, data hardly ever comes in a clean, ready-to-use format, so this dataset had to be cleaned before it could answer specific questions. Most times, I start with predefined questions. Other times, I draft questions as I scan through the available dataset and resources. For this project, I used both methods: I started with initial questions, then refined them into final questions as I analyzed the dataset. The dataset had some limitations that became clear during cleaning, so I had to do separate research on survival rates for the most common types of cancer highlighted.

Initial Questions:

  • What is the survival rate for cancer in Nigeria?
  • How can I make more people aware of cancer and its threats?
  • How does early detection contribute to the survival of cancer in Nigeria?
  • How can I encourage more people to get vaccines, undergo regular check-ups and understand the importance of a healthy lifestyle to reduce the risk of cancer?
  • What types of cancer pose the greatest threat for both males and females?

Since the dataset retrieved from the internet was a book in PDF format, I used smallpdf.com to convert it into an XLS file. Then I used Google Sheets to clean the dataset, categorising cancer cases by gender and state, with each group in its own worksheet. A link to the cleaned dataset is provided in the GitHub repository.

A screenshot showing the dataset for female cancer cases reported in hospitals in Anambra, Nigeria.

During the cleaning process, I joined some tables and removed states with insufficient data so that the entire dataset would be consistent. I also ensured that the row and column headers were the same across all states and arranged in the same order.
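The cleaning itself was done by hand in Google Sheets. As a rough illustration of the same two ideas (consistent headers and dropping states with too little data), here is a minimal Python sketch. The state names, header spellings, counts, and the `min_rows` threshold are all made-up assumptions for the example, not values from the dataset.

```python
# Illustrative only: state names, headers, and counts below are made up.

def normalize_header(name: str) -> str:
    """Lowercase a header and turn spaces/punctuation into single underscores."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.strip().lower())
    return "_".join(cleaned.split())


def clean_sheets(sheets, min_rows=2):
    """Normalize headers in every sheet and drop states with too few rows.

    `sheets` maps a state name to a list of row dicts.
    `min_rows` is a hypothetical threshold for "insufficient data".
    """
    cleaned = {}
    for state, rows in sheets.items():
        if len(rows) < min_rows:
            continue  # skip states with insufficient data
        cleaned[state] = [
            {normalize_header(key): value for key, value in row.items()}
            for row in rows
        ]
    return cleaned


sheets = {
    "State A": [
        {"Cancer Site": "Breast", "Total Cases": 10},
        {"Cancer Site": "Cervix", "Total Cases": 4},
    ],
    "State B": [
        {"cancer site ": "Breast", "TOTAL  CASES": 7},
        {"cancer site ": "Cervix", "TOTAL  CASES": 3},
    ],
    "State C": [{"Cancer Site": "Breast", "Total Cases": 1}],
}

cleaned = clean_sheets(sheets)
print(sorted(cleaned))              # states that were kept
print(list(cleaned["State B"][0]))  # headers after normalization
```

With headers normalized this way, the per-state sheets can be stacked or joined without column mismatches.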

Final Questions:

  • What type of cancer is most commonly diagnosed?
  • What is the survival rate of the most common cancer types when detected early?
  • How are reported cancer cases distributed across different states in the country?
  • What types of cancer are mostly associated with females and males?
  • Which age groups are mostly affected by certain types of cancer?

Data Analysis

After cleaning the data, the next step was to analyse the data based on the final questions before visualization. For this task, I made use of Google Sheets and SQL for simple calculations and queries.

  • Top Female and Male Cancer cases diagnosed across all the states

Here, I aggregated the cases of each cancer type for each gender across all states. I imported the dataset into DBeaver, a SQL client, and made sure the tables and columns were labelled accurately.

Below is a snippet of the SQL query used in aggregating all male cancer cases from the respective Nigerian states. This query was replicated for female cancer cases.

A screenshot showing the SQL query used in aggregating all male cancer sites across the states analyzed

The exported response in table format:

The table above clearly indicates that “Prostate cancer” was the most frequently diagnosed cancer case among males in Nigeria.
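Since the query itself only appears as a screenshot, here is a minimal sketch of what this kind of aggregation could look like, run through Python's built-in `sqlite3` module. It assumes one table per state with a cancer site column and a case count column. The table names, column names, and counts are hypothetical and are not the actual query or data from the project.

```python
import sqlite3

# Illustrative only: table names, column names, and counts are made up
# and do not come from the Cancer in Nigeria dataset.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE male_state_a (cancer_site TEXT, total_cases INTEGER);
    CREATE TABLE male_state_b (cancer_site TEXT, total_cases INTEGER);
    INSERT INTO male_state_a VALUES ('Prostate', 12), ('Liver', 5), ('Colon', 3);
    INSERT INTO male_state_b VALUES ('Prostate', 8), ('Liver', 6), ('Colon', 2);
    """
)

# Stack the per-state tables, then sum the cases for each cancer site.
query = """
    SELECT cancer_site, SUM(total_cases) AS cases
    FROM (
        SELECT cancer_site, total_cases FROM male_state_a
        UNION ALL
        SELECT cancer_site, total_cases FROM male_state_b
    )
    GROUP BY cancer_site
    ORDER BY cases DESC
"""
male_totals = conn.execute(query).fetchall()
for site, cases in male_totals:
    print(site, cases)
```

`UNION ALL` is used rather than `UNION` so that identical rows from different states are kept and counted instead of being merged into one.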

  • Top Female and Male cancer cases and distribution across age groups

The age groups were already arranged in columns, so it was easy to aggregate a specific cancer type for females across all states.

Below is an SQL query snippet showing how this was achieved. This was replicated for the other cancer sites to be visualized.

Screenshots showing an SQL query selecting breast cancer cases for females across age groups in different states.
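As a hedged sketch of the idea in those screenshots, the example below sums age-group columns for one cancer site across per-state tables, again using Python's built-in `sqlite3` module. The table names, age-group column names, and counts are assumptions made for illustration. The real dataset's age bands and labels may differ.

```python
import sqlite3

# Illustrative only: tables, age-group columns, and counts are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE female_state_a (
        cancer_site TEXT, age_25_34 INTEGER, age_35_44 INTEGER, age_45_54 INTEGER
    );
    CREATE TABLE female_state_b (
        cancer_site TEXT, age_25_34 INTEGER, age_35_44 INTEGER, age_45_54 INTEGER
    );
    INSERT INTO female_state_a VALUES ('Breast', 4, 9, 6), ('Cervix', 1, 3, 5);
    INSERT INTO female_state_b VALUES ('Breast', 3, 7, 5), ('Cervix', 2, 2, 4);
    """
)

# Keep only breast cancer rows from every state, then total each age column.
query = """
    SELECT SUM(age_25_34), SUM(age_35_44), SUM(age_45_54)
    FROM (
        SELECT cancer_site, age_25_34, age_35_44, age_45_54 FROM female_state_a
        UNION ALL
        SELECT cancer_site, age_25_34, age_35_44, age_45_54 FROM female_state_b
    )
    WHERE cancer_site = 'Breast'
"""
row = conn.execute(query).fetchone()
breast_by_age = dict(zip(["25-34", "35-44", "45-54"], row))
print(breast_by_age)
```

Changing the value in the `WHERE` clause gives the same breakdown for another cancer site.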

Data Visualization

After gathering all the analysed datasets, the last step was to visualize my findings in a dashboard so that they could be communicated effectively and used to draw conclusions. I began by sketching out my ideas for visualizing the dataset. These initial sketches gave me a direction for designing the dashboard and helped ensure that the analyzed data was presented clearly and insightfully.

Then I used Tableau to bring this dataset to life. While building the dashboards, I used Tableau features such as filters, calculated fields and parameters to make them interactive. The charts used in this project are bar charts, bubble charts and BANs (big aggregate numbers). All definitions in this viz were obtained from ChatGPT.

A screenshot showing the cancer data analysis visualization on Tableau Public
A screenshot showing the top male and female cancer cases diagnosed

Here are some observations and conclusions from this project:

  • Breast and cervical cancers were the most diagnosed cancer cases among females in Nigeria.
  • Prostate and liver cancers were predominant among males in Nigeria.
  • Women between the ages of 35 and 45 are more likely to be diagnosed with breast cancer.
  • Women between the ages of 45 and 60 are more likely to be diagnosed with cervical cancer.
  • Men between the ages of 65 and 75 are more likely to be diagnosed with prostate cancer.
  • Men between the ages of 30 and 55 are more likely to be diagnosed with liver cancer.
  • There is an 86% chance (5-year survival rate) of surviving breast cancer if detected early, i.e. when the cancer has spread outside the breast to nearby structures or lymph nodes.
  • There is a 60% chance (5-year survival rate) of surviving cervical cancer if detected early, i.e. when the cancer has spread beyond the cervix and uterus to nearby structures or lymph nodes.
  • There is a 99% chance (5-year survival rate) of surviving prostate cancer if detected early, i.e. when the cancer has spread outside the prostate to nearby structures or lymph nodes.
  • There is a 14% chance (5-year survival rate) of surviving liver cancer if detected early, i.e. when the cancer has spread outside the liver to nearby structures or lymph nodes.

For more insights, you can interact with the data visualization on my Tableau Public profile here! Here is also the link to the GitHub repository that contains the cleaned dataset and SQL queries for the project.

Data Source:
Cancer in Nigeria Volume II, Nigerian National System of Cancer Registries, Federal Ministry of Health of Nigeria

Contact:
Twitter: https://twitter.com/_molara
LinkedIn: https://www.linkedin.com/in/omolara-akanni-a68363173/
Tableau Public profile: https://public.tableau.com/app/profile/omolara.akanni/vizzes
Portfolio: https://read.cv/_molara
