HDSC ’22 Capstone Project: Real Life Machine Learning Topics

Published in Hamoye Blog · 8 min read · Mar 12, 2022
Octavian Dan from Unsplash

Very often, we hear the words “learn by doing”, but how often do you ponder the advantage it has over other learning models? Hamoye has developed an approach to learning that incorporates skill-based, context-based and project-based learning. This is why we can be very confident about our interns in the job market, especially those who are devoted to the Hamoye Data Science Internship. In the words of Confucius, “I hear and I forget. I see and I remember. I do and I understand.”

The HDSC Winter 2022 has been planned in such a way that one both remembers and understands, because this is how we want to build the future of work and raise the next generation of problem solvers that the world really needs. The capstone project will be graded by experts who apply machine learning to day-to-day activities. Apart from the advantage of expert review that comes with completing the capstone project, one gets a chance to solve real problems. The problems have been culled from 10 different sectors of the world’s economy, and the solutions interns provide will help make the world a better place.

These 10 scopes/sectors and their problems are explained below:


Femicide In Turkey 2008–2020 Turkish Resource (Dataset)

Femicide is the killing of a female (woman or girl) by a man on account of her gender. This is a dataset we hope will cease to exist. However, femicide has persisted in Turkey for so long that data spanning about 12 years, from January 2008 to August 2020, has been collated, with which facts can be exposed, questions raised, or solutions proffered. This dataset can be found here.

Global Terrorism (Dataset)

Containing data on about 180,000 attacks from 1970 through 2017, this is a record of violence amidst and against humanity. Check the dataset to begin an exploratory analysis of the world’s situation over these years, and do even more with it.

Seattle Police Department 911 incident response (Dataset)

Inspired by police violence in Milwaukee, USA, this dataset records responses to 911 calls. The claim is that 911 calls for help have declined as incidents of police violence have increased. Find out how much you can do with this dataset.

Cyber security: Common Vulnerabilities and Exposures (CVE) (Dataset)

Something no one wants is to be exposed or to feel insecure about personal information while surfing the web. This dataset contains common information-security vulnerabilities and exposures that we ought to be aware of.


Worldwide Governance Indicator (Dataset)

A report on six broad dimensions of governance by the Worldwide Governance Indicators (WGI) for over 215 countries and territories over the period 1996–2018 can be seen in the provided dataset. It includes the process by which governments are selected, monitored and replaced, and so on. A lot of political information and ideas can be generated from this dataset.

Supreme Court Judgment Prediction (Dataset)

Want to predict the outcome of court cases from their facts, using artificial intelligence? With this rich dataset of 3,304 cases, a model can be trained to predict final verdicts, emulating the human jury.

ICPC World ranking (Dataset)

The International Collegiate Programming Contest is an algorithmic programming contest for college students. The dataset is an interesting collation of contest data from the past ten years, which can be used to predict the next winner, the top teams, and more.

Election, COVID, and Demographic Data by County: What Factors Influenced the USA 2020 Election? (Dataset)

An awesome compilation of data on “How Voting Was” in 2016 and 2020 (the COVID-19 year). There were clearly factors that shifted voters’ opinions between 2016 and 2020. With this dataset, these factors can be identified, along with other things responsible for the poll results in both years across the states.


Marriage Proposals (Dataset)

One of the beautiful things about life is marriage, and even those averse to marriage are entertained by a good proposal. Here is a dataset of marriage proposals in Sri Lanka that can be used to predict how many proposals ended in a marriage, how many were accepted, etc.

Predicting Divorce (Dataset)

Here is a series of questions put to couples to predict whether their marriage will survive the threat of divorce. The responses can be used to run this suggested prediction or any other prediction built around this dataset.

Gender Statistics (Dataset)

A gender statistics dataset rich in factors has a lot of potential for analysis, prediction, and more, especially one that has just been updated, like this dataset.


Job Fraud Detection (Dataset)

Just as legitimate jobs are advertised, especially online, so are fraudulent ones. This dataset displays the different types of jobs, their location, salary range, job descriptions and roles. By applying machine learning techniques, we can help job seekers detect fraudulent job ads and deploy a job recommendation system.

HR Analytics: Job Change of Data Scientists- Predict who will move to a new job (Dataset)

After a company invests considerably in training a set of people for a particular job, it is not within its power to determine whether the trainees will stay after the training. So how can we predict who will leave a job? What factors influence their decision to leave or stay? This dataset includes the demographics, experience and education of candidates for analysis.

Public Perception of Artificial Intelligence AI (Dataset)

AI came to the fore in 1956 and has since gained traction across the world, accompanied by varying beliefs, interest, and sentiment. This dataset captures the levels of engagement, pessimism and optimism, and the prevalence of specific hopes (in healthcare and education) and worries (ethical concerns, impact on jobs) about AI over the past 30 years in New York City. By applying machine learning techniques, one can clearly understand public perception of the topic.

Recruitment scam (Dataset)

In our time, the job search happens almost entirely online, which has exposed the process to suspicious activities. The good news is that machine learning, particularly NLP, can help predict a potential recruitment scam. The Employment Scam Aegean Dataset provides over 17,000 real job ads and 866 fraudulent job ads published between 2012 and 2014 to help distinguish a scam from a real job.
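The class balance quoted above matters when judging a scam detector. Here is a minimal sketch, assuming exactly 17,000 real ads (the text only says "over 17,000") alongside the 866 fraudulent ones. A hypothetical baseline that labels every ad as real scores a high accuracy yet catches no scams, which is why recall on the fraudulent class is worth tracking. The function and variable names are illustrative only.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Assumed counts: 17,000 is a lower bound taken from "over 17,000 real job ads".
N_REAL, N_FRAUD = 17_000, 866
y_true = [0] * N_REAL + [1] * N_FRAUD   # 1 = fraudulent ad, 0 = real ad
y_pred = [0] * len(y_true)              # baseline: call every ad "real"

accuracy, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, fraud recall={recall:.3f}")
```

Any model trained on this data should be compared against such a baseline rather than against 50% accuracy.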


Eating & Health (Dataset)

Our eating habits have a direct relationship with our health. This dataset contains information about eating, meal preparation and our health. Hence, with the aid of machine learning tools, one can investigate whether grocery shopping patterns are influenced by income, and explore the connection between Body Mass Index (BMI) and meal preparation patterns, as well as consumption of fresh or fast food.

Breast Cancer Prediction (Dataset)

Breast cancer is one of the leading causes of cancer death and the most common type of cancer amongst women. The dataset from the University of Wisconsin Hospitals contains diverse cases of breast cancer to help predict breast cancer in women.

Body fat prediction (Dataset)

This dataset provides detailed body circumference measurements of 252 men, including their chest circumference, percent body fat from Siri’s (1956) equation, knee circumference, age, biceps (extended) circumference, etc. Since health practitioners recommend that individuals check their health by estimating their body fat, this data, together with machine learning tools, can be used to estimate an individual’s body fat from simple measurements.
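For reference, Siri's equation converts whole-body density D (in g/cm³) into percent body fat as 495/D − 450. The sketch below shows that conversion; the function name and the example density are illustrative and not taken from the dataset.

```python
def siri_body_fat_percent(density_g_per_cm3: float) -> float:
    """Percent body fat from whole-body density via Siri's equation: 495/D - 450."""
    if density_g_per_cm3 <= 0:
        raise ValueError("density must be positive")
    return 495.0 / density_g_per_cm3 - 450.0


example_density = 1.07  # hypothetical value in g/cm^3, not a row from the dataset
print(round(siri_body_fat_percent(example_density), 1))
```

Because this target is derived from density, a regression model built on the circumference measurements is effectively learning to approximate a density-based estimate without underwater weighing.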


China Scholarship (Dataset)

There are numerous scholarships offered by the Chinese government. This dataset contains details about different schools in China and the type of scholarships they offer.

Global literacy: Education spending and access (Dataset)

The World Bank is an intergovernmental organization aimed at alleviating poverty by providing loans and support to countries for capital projects such as education. This particular dataset provides information on how education funds are spent, the beneficiaries and an overview of the global literacy rate.

Fortune 500 companies 1955–2021 (Dataset)

Different companies make up the Fortune 500 list yearly. This dataset contains the thousands of companies that have made the list from 1955 to date, including their ranks and annual revenue.

Ridge & Lasso- SPORTS

Women international Football Results (Dataset)

A lot has been said about men’s international football but very little about women’s. The provided dataset includes 4,169 women’s international football results. Some international friendlies, particularly tournaments, are included.

Home advantage in soccer and basketball (Dataset)

The data set includes information about different leagues in different sports (Basketball and Soccer) all around the world, as well as some basic facts about each country, regarding the home advantage phenomenon in sports.

Injury prediction for competitive runners (Dataset)

The data set consists of a detailed training log from a Dutch high-level running team over a period of seven years (2012–2019). We included the middle and long-distance runners of the team, that is, those competing on distances between the 800 meters and the marathon. This design decision is motivated by the fact that these groups have strong endurance-based components in their training, making their training regimes comparable.


Person of the year, 1927 — Present (Dataset)

The provided dataset includes a record for every Time Magazine cover which has honored an individual or group as “Men of the Year”, “Women of the Year”, or (as of 1999) “Person of the Year”, and is inspired by questions such as:

  • Who has been featured on the magazine cover the most times?
  • Did any American presidents receive the honor for their election victory?
  • How have the selection criteria for the Person of the Year changed over time?
  • Have the magazine’s choices become more or less controversial?

Plane crash database (1929-Date) (Dataset)

The provided dataset includes all civil and commercial aviation accidents of scheduled and non-scheduled passenger airliners worldwide, which resulted in a fatality, all cargo, positioning, ferry and test flight fatal accidents, all military transport accidents with 10 or more fatalities etc.

African wildlife (Dataset)

The provided dataset was collected with the original goal of training an embedded device to perform real-time animal detection in nature reserves in South Africa. Four animal classes commonly found in nature reserves in South Africa are represented in this data set: buffalo, elephant, rhino and zebra.

StandardScaler- PEOPLE

Personality Prediction (Dataset)

The provided dataset was collected through the Personality Cafe forum, as it provides a large selection of people and their MBTI personality type, as well as what they have written.

Handwritten character recognition (Dataset)

This project aims to recognize handwritten characters, i.e. letters of the English alphabet from A to Z. We will achieve this by training a neural network on a dataset containing images of the letters.

Sign Language Recognition (Dataset)

In this sign language recognition project, we create a sign detector that detects numbers from 1 to 10 and can very easily be extended to cover a vast multitude of other signs and hand gestures, including the alphabet.

Speech emotion recognition (Dataset)

The provided dataset includes speeches in .mkv file format. The aim of the project is to develop a machine learning model that can recognize the emotion expressed in each speech.


Traffic Prediction (Dataset)

Traffic congestion is rising in cities around the world. Contributing factors include expanding urban populations, aging infrastructure, inefficient and uncoordinated traffic signal timing and a lack of real-time data. The provided dataset includes hourly traffic data on four different junctions.
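With hourly counts, a useful first step is a simple baseline that any learned model has to beat. The sketch below is one such baseline under stated assumptions: it forecasts each hour as the count seen at the same hour on the previous day (a "seasonal naive" forecast). The counts are made up for a single junction and the function names are illustrative.

```python
def seasonal_naive_forecast(hourly_counts, season=24):
    """Forecast the next value as the value observed one season (24 hours) earlier."""
    if len(hourly_counts) < season:
        raise ValueError("need at least one full season of history")
    return hourly_counts[-season]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up vehicle counts for one junction: 24 hourly values for day one.
day = [5, 4, 3, 3, 4, 8, 15, 25, 30, 22, 18, 17,
       19, 18, 17, 20, 26, 32, 28, 20, 14, 10, 8, 6]
history = day + [c + 1 for c in day]  # day two is slightly busier

# Backtest over day two: predict each hour from the same hour of day one.
predictions = [seasonal_naive_forecast(history[:t]) for t in range(24, 48)]
print(mean_absolute_error(history[24:], predictions))
```

The same backtest can be repeated per junction, and a trained model is only worth keeping if its error is lower than this baseline's.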

Accidents in France (Dataset)

Every year, road accidents cause thousands of deaths. The dataset is provided to help analyze the occurrence and the possibility of accidents in France.

Airplane crash since 1908 (Dataset)

The provided dataset is scraped from planecrashinfo.com and includes date, time, airline, flight, route, etc. Using the dataset and your machine learning techniques, find insights such as:

  • Which operators are the worst?
  • Which aircraft are the worst?
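A first pass at questions like these can be as simple as counting records per category. The sketch below assumes each crash is a dictionary with hypothetical "operator" and "aircraft_type" fields; the real column names and the rows shown here are not taken from the dataset.

```python
from collections import Counter

# Made-up rows; the field names "operator" and "aircraft_type" are assumptions.
crashes = [
    {"operator": "Operator A", "aircraft_type": "Type X"},
    {"operator": "Operator B", "aircraft_type": "Type X"},
    {"operator": "Operator A", "aircraft_type": "Type Y"},
    {"operator": "Operator A", "aircraft_type": "Type X"},
    {"operator": "Operator C", "aircraft_type": ""},  # missing aircraft type
]


def most_frequent(rows, field, top_n=3):
    """Count how often each non-empty value of `field` appears; return the top entries."""
    counts = Counter(row[field] for row in rows if row.get(field))
    return counts.most_common(top_n)


print(most_frequent(crashes, "operator", top_n=1))
print(most_frequent(crashes, "aircraft_type", top_n=1))
```

Raw counts favor operators and aircraft that simply fly more, so "worst" may be better defined by fatalities or by crashes relative to flights flown, if that information is available.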


