Stories by Melat Abera on Medium

Does University Prestige Predict Higher Post-Graduate Earnings? (Continued)

Melat Abera — Sat, 16 May 2026 02:35:11 GMT

Introduction

For high school seniors, applying to colleges is most likely one of the most stressful times of their lives. The endless list of deadlines and essays would be enough to stress anyone out, but on top of that, they have to decide which universities to actually apply to. Colleges have become increasingly competitive over time, and with fees for every application, applying everywhere isn’t financially or strategically feasible. So, students are forced to be selective, deciding early on which types of schools are actually worth pursuing.

Beyond this, a new factor has come to light: post-graduate earnings. With the current state of the economy, many want to focus on optimizing their chances at a higher income. For some, that might mean applying to big-name universities in hopes of better connections and resources. But others find it more beneficial to pursue “the right” degree at a state school, regardless of prestige. In this new context, which one is more important?

My Question: Does attending a top university result in higher post-grad salaries than attending an in-state university?

New Methods + Stakeholders

In my previous examination, I conducted an exploratory data analysis on two datasets to answer this question. I compared the post-graduate earnings of highly ranked top universities and state universities to see if there were any pronounced differences between the two groups. The result suggested that top universities did have a slight edge when it came to earnings after graduation, but it also left an important question unresolved: is the line between top and state schools actually clearly defined? Is it true in all scenarios that top schools provide a better outcome, or are there state schools that perform comparably or better?

This analysis aims to address these questions, extending my initial findings through k-means clustering and linear regression. The motivating question stays the same, but the goal is to arrive at a more precise, quantifiable answer that considers a broad range of outcomes.

Just as in the previous analysis, high school seniors are the primary stakeholders. But the addition of machine learning techniques aims to provide a fuller picture of post-graduate earnings. While college applications are often framed as black and white, they rarely are. With a more encompassing view of schools, students can better understand the range of outcomes associated with different institutions. In addition, they can use this defined range to make informed decisions about their financial and educational futures.

The Data

For this in-depth analysis, I will continue to use the same two datasets, Where it Pays to Attend College and US University & College Ranking. Both were sourced from Kaggle and, together, provide a strong foundation for comparing earnings across different universities and tiers.

From the Where it Pays to Attend College dataset, I collected university data, including school name, school type (Ivy League, State, etc.), starting median salary, and mid-career salary. It had a total of 249 unique values, which I cross-referenced with the US University & College Ranking dataset. This dataset contained a list of the highest ranked universities (167 unique values), including university name, state, and rank from 1984 to 2023. For simplicity, I only collected the 2023 rankings (the most recent).

Since each of the datasets used its own naming conventions, it was difficult to merge them directly. So, I manually compiled a list of the universities from the US University & College Ranking dataset and merged the datasets based on a specific naming convention. After all the data was merged, there was a total of 200 universities for the primary analysis.
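As a rough illustration of what a shared naming convention can look like, the sketch below normalizes school names before matching them. The normalization rules, the helper normalize_name, and the sample rows are all made up for this example; they are not the convention or the data used in the actual merge.

```python
import re

def normalize_name(name: str) -> str:
    """Reduce a school name to a comparable key (hypothetical convention)."""
    key = name.lower()
    key = re.sub(r"\(.*?\)", " ", key)      # drop parenthetical abbreviations
    key = key.replace("&", " and ")
    key = re.sub(r"[^a-z0-9 ]", " ", key)   # punctuation becomes a space
    key = re.sub(r"\bthe\b", " ", key)      # ignore the word "the"
    return " ".join(key.split())            # collapse repeated whitespace

# Made-up rows standing in for the two Kaggle files.
salaries = [
    {"school": "University of California, Los Angeles (UCLA)", "start": 50000},
    {"school": "Example State University", "start": 40000},
]
rankings = [
    {"school": "University of California-Los Angeles", "rank": 20},
]

# Index the ranking rows by normalized name, then look each salary row up.
rank_by_key = {normalize_name(r["school"]): r["rank"] for r in rankings}

merged = []
for row in salaries:
    key = normalize_name(row["school"])
    merged.append({**row, "rank": rank_by_key.get(key)})  # None means unranked
```

Schools that end up with a rank of None are the ones a rule-based key could not match, which is where manual review is still needed.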

K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm that groups unlabeled data based on similarity. It works by assigning each point to the nearest cluster center, using Euclidean distance to measure the straight-line distance between the point and each center. Selecting the right number of clusters is crucial for the accuracy of this method. The Elbow Method is a technique that helps find this number by identifying the point at which adding more clusters no longer meaningfully improves the model.

After running the Elbow method on the merged list of schools (top and state), we can see that the “elbow” is at 3 clusters. This supports the idea that the data doesn’t cleanly split into just two groups (top vs. state); there is an additional group of schools that fits in between. These could be flagship universities that, despite not being Ivies or internationally recognized, bring outcomes that are slightly stronger than those of lesser known state schools.
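The elbow search described above can be sketched with scikit-learn as follows. This is a minimal example on synthetic salary data, not the merged school list: the three salary tiers, the noise level, and the range of k values are assumptions chosen only to make the curve easy to read.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three tiers of (starting, mid-career) salary in dollars.
centers = np.array([[38_000, 70_000], [45_000, 85_000], [60_000, 115_000]])
X = np.vstack([c + rng.normal(0, 1_500, size=(40, 2)) for c in centers])

# Scale first so both salary columns contribute equally to the distances.
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting inertias against k gives the elbow curve. On this synthetic data the drop flattens after k=3 because three tiers were built in; on real data the bend is usually less sharp and takes some judgment to read.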

Cluster 1 — Mid Earnings State School

Cluster 1 is composed entirely of state schools and represents institutions whose salary outcomes are relatively modest compared to the broader dataset. Schools in this cluster have a starting salary range of low to mid $40k and a mid-career range of high $70k to high $80k. These institutions tend to be regional schools and campuses of less selective universities that are part of larger systems. While they still provide reasonable salary outcomes, graduates enter the workforce at lower starting points and have less long-term earning potential than those in Cluster 2.

Cluster 2 — High Earning State and Ivy League Universities

Cluster 2 is a particularly significant group because it directly addresses the weakness of my previous descriptive analysis. Previously, the analysis was conducted under the assumption of two clusters: top schools and state schools. But, through K-means clustering, it becomes evident that there is no clear-cut line between the two. Unprompted and without prior assumption, the algorithm produces a group containing both state and top universities. Rather than separating into two camps, a subset of state schools (including Cal Poly San Luis Obispo, University of Missouri-Rolla, SJSU, etc.) clusters alongside Ivy League institutions. This suggests that the earnings gap between these two types of universities isn’t as distinct as previously assumed. For students admitted to these high-performing schools, the data suggests that the financial return is comparable to that of an Ivy League or top state school (like UCLA) education.

Cluster 3 — Low Earnings State Schools

Just like Cluster 1, this group is composed entirely of state schools, but it represents the lowest salary outcomes across the dataset. Schools in this cluster have a starting median salary in the low to high $30k range and a mid-career salary in the low to high $70k range. The institutions in this group are largely regional schools of low recognition and, unlike Cluster 1, have noticeably lower long-term earning potential than the rest of the dataset. This cluster reinforces the idea that lower-tier state universities consistently produce lower salary outcomes than Ivy League or reputable state schools.

Linear Regression

To further test the line between state school and top university earnings, I also applied a linear regression model. This time, I test whether starting median income predicts a student’s long-term earning potential. For this model, I used Starting Median Salary and school type as features, with Mid-Career Median Salary as the target. These features test whether a school’s starting salary and school type are reliable predictors of mid-career median salary. This directly aims to answer whether the type of institution a student attends has a measurable impact on their financial future. If school type carries significant predictive weight, it would suggest that the type of institution is genuinely influential on the long-term earnings of students.

The feature importance plot reveals that, no, school type does not carry significant weight in predicting long-term earnings. In fact, starting median salary, which carries 95% of the predictive power, is far more relevant than the school that a high school senior chooses. This means that what a graduate earns right out of school is a defining factor for what they earn 10 or 20 years from now, not necessarily the school on the certificate.

The R² for the linear regression model was 0.759, which indicates that about 76% of the variance in mid-career salary can be explained by school type and (most notably) starting median salary. In addition, the RMSE of $6,801 indicates a relatively low margin of error. Taken together, these metrics suggest a reliable predictor, one that is strong enough to draw meaningful conclusions from.

The linear regression and the clustering models combined tell a consistent story: the type of school a university student attends is not an accurate predictor of post-graduate earnings. The clustering model shows that some state schools reach the same level of earnings as Ivy League schools. The regression model shows that starting income, not the type of institution, is the stronger indicator of long-term earning potential. All in all, the name on the degree is less relevant; students should pick institutions not by name but by which ones provide the best environment for their learning.
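For readers who want to see how a model of this shape is typically fit and scored, here is a minimal sketch with scikit-learn. Everything in it is an assumption for illustration: the data is synthetic, the column names (school_type, starting_salary, mid_career_salary) and category labels are hypothetical, and the 80/20 split is arbitrary. Its scores will not match the figures reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 200

# Synthetic schools; the type labels are illustrative only.
df = pd.DataFrame({
    "school_type": rng.choice(["State", "Ivy League", "Engineering"], size=n),
    "starting_salary": rng.normal(46_000, 6_000, size=n),
})
# Synthetic target: mid-career pay driven by starting pay plus noise.
df["mid_career_salary"] = 1.8 * df["starting_salary"] + rng.normal(0, 6_000, size=n)

# One-hot encode school type so the linear model can use it as a feature.
X = pd.get_dummies(
    df[["school_type", "starting_salary"]],
    columns=["school_type"],
    drop_first=True,
    dtype=float,
)
y = df["mid_career_salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)                       # share of variance explained
rmse = np.sqrt(mean_squared_error(y_test, pred))  # typical error, in dollars
print(f"R^2 = {r2:.3f}, RMSE = ${rmse:,.0f}")
```

Scoring on held-out rows rather than the training rows is what makes R² and RMSE a fair estimate of how the model would do on schools it has not seen.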

Conclusion

Tools

For this analysis, I used a combination of AI assistance and documentation. Since AI is not always reliable, I drew on my previous experience coding linear regression and clustering models (in-class assignments, lecture slides, etc.) to ensure that I applied each method correctly. I also used documentation to confirm that the Python methods I used did what my code needed them to do. While confirming the accuracy of my code, the clustering model was something I had to fix constantly. It was definitely a struggle to ensure the categories of universities were correct so that the clusters would be an accurate reflection of the data.

Limitations

The biggest limitation of this particular analysis was the range of universities I used. Because of data cleaning complications and time constraints, I couldn’t find a bigger dataset that included a larger set of universities. The small sample size could also have promoted overfitting and produced an inaccurate result. This is largely why a bigger sample is important in an analysis like this. In the future, I would like to see what the outcomes of this analysis would be if I included 100, 200, or 300 more universities in my dataset.

#inst414spr26final

Github: https://github.com/melatabera/Final-Module-Extension

Does University Prestige Predict Higher Post-Graduate Earnings? (Continued) was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Endometriosis Symptoms: Clustering Patient Profiles to Unveil Symptom-based Subgroups

Melat Abera — Fri, 08 May 2026 01:28:19 GMT

Introduction

Endometriosis is a chronic disease in which tissue similar to the lining of the uterus (endometrium) grows outside the uterus, causing inflammation, scar tissue, and severe pain. It affects approximately 10% or 190 million women worldwide, and manifests across a wide spectrum of symptoms: severe menstrual bleeding, debilitating cramps, chronic pelvic pain and infertility. Yet, despite its scale, endometriosis remains one of the most underdiagnosed conditions in women’s health.

There is a whole host of reasons why endometriosis is often missed, even as medical advancements continue to take place. One of the most prominent is how differently the condition can present in individuals. In some cases, both medical professionals and patients may dismiss symptoms as typical menstrual pain. Other patients have masked their symptoms with birth control and don’t experience discomfort until they go off the medication. There are even some who don’t experience any pain at all and may only find out through their struggles with infertility. These challenges highlight a critical need for better approaches to identifying at-risk individuals.

My Question: Can clustering symptom profiles reveal hidden endometriosis phenotypes, like hormone abnormalities or fertility issues, that pain-centric diagnoses overlook?

Stakeholders

Researchers are the main stakeholders in this analysis. The subgroups identified here can directly inform how diagnostic tools are improved for future patients. Since not all women present with the condition in the same way, tools can be refined to find these outliers and provide better treatment. Furthermore, this analysis can also inform how providers are trained to detect endometriosis. The current research on endometriosis limits what types of women are included. Through this analysis, providers can look beyond the current textbook definition and study patterns that arise from individuals who may otherwise spend their lives under the radar.

The Data

To explore potential endometriosis subgroups, we need medical records of patients who reported experiencing symptoms associated with the condition. Medical records are often protected under HIPAA and, for an analysis like this, can raise ethical questions if not used responsibly. So, I looked into Kaggle’s Endometriosis Dataset, which features 10,000 instances of synthetic data intended for research and development in predicting endometriosis. It contains a total of 7 fields:

Age: Age of the individual, ranging from 18–50 years old
Menstrual_Irregularity: Indicates if individual experiences irregular menstruation (0 = No, 1 = Yes)
Chronic_Pain_Level: Pain severity reported by individual on a scale of 1 to 10
Hormone_Level_Abnormality: Indicates if individual has experienced abnormalities in their hormone levels (0 = Normal, 1 = Abnormal)
Infertility: Indicates if individual experiences infertility (0 = No, 1 = Yes)
BMI: Individual’s Body Mass Index, ranging from 15 to 40
Diagnosis: Indicates if there is endometriosis present (0 = No, 1 = Yes)

These variables combined provide a comprehensive picture of symptoms reported in individuals who might be experiencing endometriosis. Using these fields, we can analyze the relationship between patients and identify those who have abnormal presentations and who may be missed.

Collection

Before applying any clustering and analysis on the data, I applied basic preprocessing to ensure cluster integrity. First, using the Pandas library, I dropped the “Diagnosis” column from the initial dataset. Because our goal is to find symptom-based patterns with unsupervised learning, including the diagnosis column could introduce bias. This could prevent the model from finding hidden subgroups. The variable will be reintroduced later to see where the clusters and actual diagnoses differ.

To prepare the data for KMeans and Euclidean distance calculations, I used a Standard Scaler. Standard Scaler puts features with different ranges onto a common scale, ensuring that all variables contribute equally to the analysis. Without this, we risk certain variables dominating the calculations and skewing the cluster results. For this data, Age, Chronic Pain Level, and BMI all need to be considered because they have different ranges. For example, Age has a range of 18–50, which could carry significantly more weight than the Infertility variable (0 or 1).
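The two preprocessing steps above, dropping the label and standardizing the features, can be sketched like this. The column names follow the field list given earlier, but the five rows are invented for the example and the variable names are arbitrary.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample using the dataset's column names.
df = pd.DataFrame({
    "Age": [18, 25, 34, 42, 50],
    "Menstrual_Irregularity": [0, 1, 1, 0, 1],
    "Chronic_Pain_Level": [2, 9, 5, 7, 3],
    "Hormone_Level_Abnormality": [0, 1, 0, 1, 1],
    "Infertility": [0, 0, 1, 0, 1],
    "BMI": [19.5, 23.1, 31.0, 27.4, 22.2],
    "Diagnosis": [0, 1, 0, 1, 0],
})

# Hold the label out so it cannot influence the clusters.
features = df.drop(columns=["Diagnosis"])

# Rescale every column to mean 0 and standard deviation 1.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
)

print(features.std(ddof=0).round(2))  # raw spreads differ widely
print(scaled.std(ddof=0).round(2))    # every column now has unit spread
```

Keeping df intact means the Diagnosis column is still available later, when cluster labels are compared against actual diagnoses.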

Measuring Similarity and Choosing K

To measure similarity between the data points, I used Euclidean distance. Euclidean distance is the measure that KMeans clustering relies on, making it a natural choice for measuring similarity. The features included in the calculations were Age, Menstrual Irregularity, Chronic Pain Level, Hormone Level Abnormality, Infertility, and BMI.

To choose a value for k, I applied the Elbow Method. In K-means clustering, the data is partitioned into clusters by minimizing the distance between the points and their cluster centroids. The Elbow Method helps determine the number of clusters by plotting the total distance between data points and their cluster centroids against different values of k. The point at which improvement slows down is called the “elbow”.

After running the Elbow method, we can see that the point at which improvement slows down is k=3. This indicates that we should explore three distinct subgroups whose symptom profiles overlap meaningfully. Later on, I will reintroduce the diagnosis variable to analyze which symptom group is diagnosed and which is often overlooked in the medical community.
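To make the role of scaling concrete, the sketch below computes the Euclidean distance between two patients and checks how much of it comes from Age alone. Both patients and the per-feature spreads are invented for illustration; real spreads would come from the fitted scaler.

```python
import math

# Two made-up patients: (Age, Menstrual_Irregularity, Chronic_Pain_Level,
# Hormone_Level_Abnormality, Infertility, BMI)
a = (28, 1, 9, 1, 0, 24.0)
b = (35, 0, 9, 1, 1, 24.0)

def age_share(p, q, spreads):
    """Fraction of the squared distance contributed by the first feature (Age)."""
    sq = [((x - y) / s) ** 2 for x, y, s in zip(p, q, spreads)]
    return sq[0] / sum(sq)

raw_distance = math.dist(a, b)  # straight-line distance on the raw values

no_scaling = (1, 1, 1, 1, 1, 1)
# Assumed standard deviations per feature, for illustration only.
assumed_spreads = (9.24, 0.5, 2.9, 0.5, 0.5, 7.2)

print(round(raw_distance, 2))
print(round(age_share(a, b, no_scaling), 2))       # Age dominates
print(round(age_share(a, b, assumed_spreads), 2))  # Age no longer dominates
```

On the raw values, a seven-year age gap outweighs two patients disagreeing on both menstrual irregularity and infertility. After dividing by each feature’s spread, the binary symptoms account for most of the distance instead.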

Clusters

By examining our three clusters, we can see clear patterns of patients that emerge in each.

Cluster 1 — High Pain & Irregularity Profile

This cluster represents a more classic endometriosis profile. The individuals here mostly suffer from severe pain and abnormalities in their hormone levels or menstrual cycles. But these individuals have not reached a point where infertility is an issue. This group likely fits the category of women who thought their symptoms were typical and not a sign of something more serious.

Element A: Patient 12 (28 years old) experiences menstrual irregularity and hormone level abnormality. Their pain level is about a 9, but they don’t report having infertility issues.

Element B: Patient 65 (35 years old) experiences no menstrual irregularity but does have hormone level abnormality. They report a pain level of 9 but experience no infertility.

Reintroduced Diagnosis Variable: 12 / 20 patients in the cluster had confirmed endometriosis.

Cluster 2 — Some Pain & High Irregularity Profile

This cluster still fits some of the classic endometriosis symptoms like the first, but is missing a few key factors. There is no hormone level abnormality among this group and the pain levels are a lot more moderate. In addition, unlike the first, there are more experiences with infertility. This mostly fits the profile of those who regulate their periods with birth control, which explains the lack of hormone level abnormality and lower chronic pain levels.

Element A: Patient 5 (25 years old) experiences no menstrual irregularity and a chronic pain level of 4. Hormone level abnormality and infertility are not present.

Element B: Patient 19 (20 years old) does experience menstrual irregularity and has a chronic pain level of 5. Hormone level abnormality and infertility are not present.

Reintroduced Diagnosis Variable: 5 / 20 patients in the cluster had confirmed endometriosis.

Cluster 3— Low Pain & High Infertility Profile

This cluster represents a lesser recognized type of endometriosis. Individuals in this cluster report significantly lower pain levels, but still have consistent hormone abnormality and menstrual irregularity. Interestingly, this subgroup also has higher rates of infertility. This closely matches the last group of women described earlier, who are unlikely to notice an issue until they want children.

Element A: Patient 23 (41 years old) experiences menstrual irregularity and has a chronic pain level at 5. They also experience hormone level abnormality and infertility.

Element B: Patient 66 (11 years old) has menstrual irregularity and a chronic pain level at 3. They also experience hormone level abnormality and infertility.

Reintroduced Diagnosis Variable: 6 / 20 patients in the cluster did have confirmed endometriosis.

Analysis

In each of the clusters, a different patient type was explored. Cluster 1 explored a high pain and irregularity profile, which is the most consistent with how endometriosis is usually diagnosed. However, the true value of unsupervised clustering comes from Clusters 2 and 3.

In both of these clusters, we explore atypical profiles of endometriosis. In Cluster 2, we explored the group with moderate pain levels that experiences no hormone level abnormality but does have some menstrual irregularity. Cluster 3, on the other hand, has very low pain levels and both hormone and menstrual irregularity. These are vastly different profiles from the first, and this was especially evident in the diagnoses.

The first cluster had a confirmed endometriosis diagnosis rate of 60%, while Clusters 2 and 3 had rates of 25% and 30%, respectively. This reveals a huge inconsistency in diagnostic practices. The only variable that changed significantly between these three groups was the pain level, and the individuals who were identified as having the condition were disproportionately those with high pain levels.

This identifies a clear issue with traditional pain-centric models. By clustering these profiles, it is clear that the pain-centric model doesn’t take into account all the facets of endometriosis as a whole. Clusters 2 and 3 highlight groups of individuals who don’t have the typical high-pain profile, but still exhibit multiple meaningful indicators of the condition. This reinforces a core problem with the medical system and why this condition tends to have a very low identification rate. With KMeans clustering, we can move past these limitations and implement a more holistic way to approach endometriosis diagnosis.
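The diagnosis rates follow directly from the counts reported for each cluster. A short calculation makes the arithmetic explicit; only the dictionary layout and variable names are introduced here.

```python
# Counts quoted in the cluster summaries: (diagnosed, sampled) per cluster.
counts = {
    "Cluster 1": (12, 20),
    "Cluster 2": (5, 20),
    "Cluster 3": (6, 20),
}

rates = {name: diagnosed / total for name, (diagnosed, total) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
# Cluster 1: 60%
# Cluster 2: 25%
# Cluster 3: 30%
```

With the full table in pandas, the same figures could come from grouping on a cluster label column (hypothetically named cluster) and taking the mean of Diagnosis.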

Verification & Limitations

Verification:

To ensure all my information regarding endometriosis was correct, I used academic and popular sources on the condition. This helped extensively with the analysis because I knew what to expect with each subgroup. I was able to label these hidden profiles and back them up with actual medical literature. For the coding portion, I ran the code on multiple samples to ensure that the results were roughly similar regardless of the size of the analysis. I also used Claude to help write some of the clustering code, which I verified by reading documentation (and our discussion assignments) on the methods it suggested.

Limitations:

Probably the biggest limitation of my analysis was the synthetic data. Since the topic I picked dealt with sensitive (and protected) data, I opted to use synthetic data for the final analysis. Kaggle did provide me with many rows and plenty of variety, but in real-life applications, using actual medical data could get me more accurate results. Something that was also missing was a wider set of symptoms for the analysis. The data included general symptoms, but it would be interesting to include some lesser known symptoms of endometriosis. This would further my research question as well, as I would get to compare the pain-centric profiles with other factors that are less considered. Finally, as with all datasets, there is some bias. The most obvious one is that the data is synthetic and may include more “perfect” patterns than real-world data. The dataset also opted to include patients who were already diagnosed with the condition or suspected to have it. This excludes the general population who report these symptoms without direct links to being diagnosed with endometriosis.

Github Link: https://github.com/melatabera/module-4-assignment/blob/main/module-4.py

Endometriosis Symptoms: Clustering Patient Profiles to Unveil Symptom-based Subgroups was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Next Big Hit: Predicting Box Office Revenue Earnings

Melat Abera — Fri, 08 May 2026 01:25:47 GMT

Introduction

Every year, Hollywood plays a high-stakes game as it tries to produce the next box-office hit. Sometimes, that risk comes with a huge reward, like the 2009 release of Avatar or the 2019 release of Avengers: Endgame, both of which went on to make over 2 billion dollars. But there are times when Hollywood gets it wrong and a film makes it onto the list of biggest box office fails. Just last year (2025), it was Disney’s live-action Snow White, which lost $170 million.

For studios, greenlighting the production of a movie means accepting uncertainty. There’s no promise that your movie will resonate with your intended audience or that you will make back the money you invested into your project. Factors like actors and actresses, current market interest, timing, and competition can play a huge role in determining a film’s success or failure.

This raises an important question: Can the box office success (revenue) of a movie be predicted before its official release?

For film studios, this question provides critical insight into how to maximize their company’s financial performance. Finding success in the box office isn’t just about how you produce the movie, but how effectively the movie is marketed towards a desired audience. Accurate predictions of box office success rate can help to determine which projects to invest in, what types of marketing tactics they should develop, and how to allocate their budget in a way that minimizes financial risk.

Data

To answer this question, I utilized the Full TMDB Movies Dataset, which is a comprehensive and maintained archive of over 1 million movies. While it contains a vast number of fields, I filtered the dataset to focus on 7 key fields that hold the most predictive power: Revenue, Budget, Popularity, Vote Average, Release Date, Genre, and Run Time. The features I decided to include exist before a film is released, so a trained model could be used to forecast pre-release earnings. This directly supports prediction before the initial release, which is what my research question aims to answer.

The ground truth labels for this dataset will be the Revenue. These numbers represent the actual box office earnings for each film in the dataset, and are generated from pre-existing publications from the official TMDB database.

Before conducting the full analysis, I utilized preprocessing techniques to ensure that the dataset was cleaned and ready to use. With the help of Claude, I removed movies with no budget or revenue, so that the model wouldn’t contain any invalid values. In addition, I converted the release date month to a numeric value, extracted only the first (primary) genre out of the list, and turned those genres into numeric labels. These steps were to ensure that no issues arise when running the regression model.
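The cleaning steps above can be sketched in pandas as follows. The rows are made up, and the column names (budget, revenue, release_date, genres) and the comma-separated genre format are assumptions about the file layout rather than details confirmed here.

```python
import pandas as pd

# Made-up rows; column names and genre format are assumed.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100_000_000, 0, 5_000_000, 20_000_000],
    "revenue": [350_000_000, 1_000_000, 0, 45_000_000],
    "release_date": ["2019-04-26", "2010-01-15", "2015-07-03", "2021-11-19"],
    "genres": ["Action, Adventure", "Drama", "Comedy, Romance", "Drama, Thriller"],
})

# 1. Drop films with a missing or zero budget or revenue.
clean = movies[(movies["budget"] > 0) & (movies["revenue"] > 0)].copy()

# 2. Release month as a number from 1 to 12.
clean["release_month"] = pd.to_datetime(clean["release_date"]).dt.month

# 3. Keep only the first listed genre, then map it to an integer code.
clean["primary_genre"] = clean["genres"].str.split(",").str[0].str.strip()
clean["genre_code"] = clean["primary_genre"].astype("category").cat.codes

print(clean[["title", "release_month", "primary_genre", "genre_code"]])
```

One design note: integer genre codes imply an ordering (here alphabetical) that a linear model will treat as meaningful. One-hot encoding the genre is a common alternative when that ordering is not intended.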

Regression models are used when predicting continuous numerical quantities. For this analysis, I want to predict revenue as a measure of success, so therefore the regression model would be the best fit.

Testing Results

In my supervised model, I used the following features: budget, popularity, runtime, release month, vote average, and genre.

After conducting the test, the R² score was 0.476, which means that the model explains about 48% of the variation in movie revenue. This indicates a moderate fit, but the outcome variable (revenue) isn’t explained well by the selected features. In other words, the model isn’t as predictive as we want it to be with the features we chose. The RMSE was $110,816,449, which indicates the model’s predictions are typically off by roughly $110 million. This gives evidence of the difficulty of predicting box office success.

To further evaluate the performance of the model, I identified 5 samples where the model was inaccurate.

There are a few reasons why this model was inaccurate. The first has to do with dataset accuracy. There could be some entries where the actual revenue was not entered correctly, lowering the model’s overall accuracy. Data quality issues are common, especially with a dataset this big, and could be a big contributing factor to “bad models”. Another reason is that important factors were not included. For this analysis, I only included the features I thought would be most relevant. But there are many other factors that play a big role in a film’s box office success. These could include marketing, competition, word of mouth, etc., none of which were included in this dataset. Finally, the model could be inaccurate because of some of the data cleaning techniques used to prep the data for analysis. For example, I only included one genre for simplicity, but that could have led to the overrepresentation or underrepresentation of others.
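To show what these two scores measure, here is a small from-scratch calculation on invented numbers. The helper r2_and_rmse and the five revenue pairs are hypothetical; the point is only how a single badly missed blockbuster inflates RMSE.

```python
import math

def r2_and_rmse(actual, predicted):
    """Return (R^2, RMSE) for two equal-length sequences of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Made-up revenues in millions of dollars; one blockbuster is badly missed.
actual = [10, 20, 30, 40, 900]
predicted = [15, 25, 25, 45, 500]

r2, rmse = r2_and_rmse(actual, predicted)
print(round(r2, 3), round(rmse, 1))
```

Four of the five predictions are within 5 million of the truth, yet the RMSE comes out near 179 million because squaring the errors lets the one large miss dominate. A revenue distribution with a few very large hits can produce a large RMSE for the same reason.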

Findings

To visualize how the model performed, there are two main graphs that help explain why the model behaved as it did.

This bar graph summarizes feature importance, which is how much influence each feature had on the model’s prediction. Budget seems to be by far the most influential feature, which makes sense intuitively. When a film production studio is willing to invest more, that means wider marketing and larger campaigns to attract audiences to the film. Popularity was the second most influential feature, which suggests that audience interaction with the campaigns and film is also important, but not as important as the budget. The genre variable, on the other hand, was the least important feature, which indicates that the category of the film doesn’t have a significant impact on whether or not audiences support a film.

The linear regression scatter plot also reveals the actual revenue vs the predicted revenue that the model output. Most of the points trend upwards, which means that the model is predicting higher revenue for higher budget/popular movies. In addition, most of the points are clustered in the bottom left of the plot. This confirms the prediction model is failing to keep up, consistently underestimating the value of big blockbusters.

Conclusion

For constructing the analysis and model, I used a few key resources. Claude helped me with cleaning the data and shaping the features to fit the overall analysis. In addition, Claude also helped me with constructing the model and verifying that it was working correctly. To confirm that everything was accurate without an AI agent, I referenced examples from in-class exercises and documentation about what output I should expect.

I think the dataset was a big limitation in this analysis. There were so many movies included, which gave me a lot to work with. But without more sophisticated data cleaning, it’s hard not to miss some type of faulty data that could skew the output. If I worked with a smaller dataset, it would be easier to catch these flaws, but it wouldn’t have been as inclusive. Another thing to consider is bias. The dataset included many blockbuster hits, and there is a chance that it overrepresented them. This makes it more likely that the model is learning patterns associated with high-grossing films while underrepresenting the smaller ones.

Github: https://github.com/melatabera/module-6-assignment

#inst414spr26a06

The Next Big Hit: Predicting Box Office Revenue Earnings was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Spotify Recommended Artists: Are Those Musicians Really Similar?

Melat Abera — Sat, 28 Mar 2026 03:51:33 GMT

Introduction

Music has always been a profound social connector for so many societies. All around the world, cultures have been strengthened through the passing on and creation of different types of music. In the modern age, it is easier than ever to find your favorite musical artists, especially through applications like Spotify, Apple Music, and Youtube Music. Just log in, search your favorite songs or artists, and immerse yourself in a cultural phenomenon.

In these apps, users don’t just listen to their favorite artists but discover news one through the apps’ algorithms. By clicking on their listened artists, users can see a list of “similar artists” which the app believes has a similar style or genre of music. But just how close in similarity are these artists to one another? Through this analysis, I aim to discover which artists are related to one another based on a variety of different variables and determine which artists may share the same group of listeners.

For this analysis, music listeners will be the biggest stakeholders. Their ability to find new artists depends on how accurately the platform judges the similarity between new artists and the artists they already listen to. Without that accuracy, it can be harder for users to find new music they enjoy. The artists themselves are also very relevant in this discussion, as their discoverability depends on the accuracy of music recommendation systems. If their music is being recommended to users who don’t usually listen to their type of music, they might see less engagement with their music. Finally, music platforms are also a huge stakeholder. Users will be more likely to use a music platform if the recommendations they get match what they like. Especially with the ongoing competition between music platforms, having a good recommendation system is crucial in keeping a consistent and loyal userbase.

Data

To look into the similarity between artists on music platforms, I used the “Featured Spotify artists/tracks with metadata” dataset. It includes data on the artists, their monthly listeners, popularity and followers, and their releases. Using this dataset, I will measure how similar artists are based on their shared genres, popularity rankings, and monthly listeners.

Genre: Genre describes the musical style of each artist. Using this metric, I will determine if artists within the same or similar genres tend to share audiences. For example, would artists with the genres “pop” and “k-pop” have the same audience group, or are their audiences largely distinct?
Popularity rankings: I will analyze the popularity scores of artists to discover if popularity causes artists to be recommended in the same groups. For example, Taylor Swift and The Weeknd are both prominent music artists but have completely different sounds. Would they be recommended to the same audiences?
Monthly listeners: Similar to the popularity rankings, I will analyze whether artists with a similar number of monthly listeners are recommended to the same audiences.

To find which artists are recommended in the “users also like” section, I searched for each artist on Spotify. While this dataset contains a wide variety of artists, I decided to focus on 5 artists and 5 of the similar artists recommended for each.

Measuring similarity

For genre similarity, I will use Jaccard Similarity, which measures the overlap between two sets as the size of their intersection divided by the size of their union. Since artists usually have different numbers of genres associated with them, Jaccard Similarity works well for comparing the genres that two artists share. For popularity rankings and monthly listeners, Jaccard Similarity wouldn’t be the most effective measure, as these fields contain numeric, not categorical, data. So, I will instead use Cosine Similarity, which calculates the cosine of the angle between two vectors to see how aligned they are. Instead of focusing on the magnitude of the values, cosine similarity focuses on the direction of the vectors, which means artists with different levels of popularity or monthly listeners can still be related to each other.
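
As a rough illustration of the two measures described above, here is a minimal sketch in plain Python. It is not the code from the repository: the function names, the genre sets, and the numeric values are all made up for the example, and the two-element vectors (popularity score, monthly listeners in millions) are an assumed layout.

```python
import math


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; 0.0 is a convention
    return dot / (norm_u * norm_v)


# Hypothetical genre sets (not taken from the dataset)
genres_a = {"pop", "dance pop", "electropop"}
genres_b = {"pop", "k-pop", "dance pop"}
genre_score = jaccard_similarity(genres_a, genres_b)  # 2 shared / 4 total

# Hypothetical vectors: [popularity score, monthly listeners in millions]
artist_a = [90, 80.0]
artist_b = [45, 40.0]  # half the size of artist_a, same direction
numeric_score = cosine_similarity(artist_a, artist_b)

print(genre_score, numeric_score)
```

The second pair shows why cosine similarity ignores scale: one artist has half the popularity and half the listeners of the other, yet the vectors point in the same direction, so the score is at its maximum.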

Github: https://github.com/melatabera/inst-414-module3

Spotify Recommended Artists: Are Those Musicians are Really Similar? was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

Goodreads Recommendation System: Echochamber or Gateway to Discovery?

Melat Abera — Sat, 28 Mar 2026 01:35:33 GMT

Introduction

When a reader wants to share their reading goals, post reviews, or find new books, there’s a good chance they are going to Goodreads. With its massive community of users, huge catalog of books, and robust recommendation system, Goodreads has curated a “perfect” social media platform for readers. One of its particularly popular functionalities is its book discovery feature. When a user clicks on a book, they are met not only with reviews and information about the book, but also with a “Readers also enjoy” section. In this section, the platform analyzes user data to find books that are commonly shelved together and uses it to suggest similar titles.

But just how good is this recommendation system for discovering new books? Does it keep users in an echo chamber of the same, most popular books or does it expose them to new books, authors, and genres? Through this analysis, I aim to discover the reliability of Goodreads as a recommendation service. As readers want to expand their taste, it can be important to have a diverse recommendation system to guide them. By analyzing the “Readers also enjoy” section as a network, readers can find out if they are stuck in the same group of books or if they are genuinely finding new material.

Authors are also affected by these types of recommendation algorithms, especially newly published ones. If Goodreads’ network is highly structured towards already-popular titles, new books struggle to make it into the space, making discoverability challenging for lesser-known writers. The company itself also has a stake in this system. As a business, Goodreads is interested in maintaining its userbase and popularity over its competitors. Understanding whether or not its current algorithms invite a diversity of authors and books on the platform is crucial in improving the platform’s overall engagement.

Data

To analyze the structure of Goodreads’ recommendation platform, I used the goodreads_young_adult dataset from UCSD’s Goodreads Book Graph Datasets. The full book collection dataset contained 2.3 million books, which would’ve overcomplicated the project. So, I narrowed down the dataset to a specific genre (young adult), and further picked only 100 titles for the final analysis. From the fields, I used the following:

isbn
title
similar_books
publication_year
ratings_count
publisher

The isbn and title fields are used as identifying information for each of the final 100 books. The similar_books field, which is the most crucial variable, contains a list of book_ids that are recommended when a user views a given title. In other words, similar_books gives information on which books are getting recommended alongside each other, which will map the “edges” of the network. From this, we can determine if the recommendation algorithms are repeatedly surfacing the same groups of books or if there is a reasonable variety in their recommendations. The other fields, publication_year, ratings_count, and publisher, all serve as attributes that can help contextualize patterns that arise from the network. For example, the publication_year variable reveals if books from certain time periods are clustered together. The ratings_count and publisher variables serve a similar purpose, revealing if the number of ratings causes a book to be recommended more or if certain publishers are disproportionately recommended more or less.

Collection

For the collection of the data, I used three libraries:

Pandas: Pandas was helpful in processing the original JSON file into a DataFrame that I could use for analysis. After turning the original dataset into a DataFrame, I narrowed it down to the specific columns (fields) that were most relevant to the type of analysis I conducted.

NetworkX: NetworkX was the main library I used throughout the project, mostly for creating the graph and applying the centrality metrics. With it, I was able to identify the number of nodes and edges as well as apply degree centrality, betweenness centrality, and PageRank to determine node importance.

Matplotlib: Matplotlib was my visualization library to create the charts and tables for my final results.

Centrality Metrics

In a graph, a node refers to a specific object or entity and an edge refers to a connection between two nodes. In the Goodreads dataset, the nodes are the books (book_id, book_title) and the edges represent the relationship between books in the recommendation section. For example, if Book A has Book B in its “Readers also enjoy” section, the connection between Book A and Book B would be an edge.
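
To make the node and edge description above concrete, here is a small sketch of how such a graph could be built with NetworkX. The records are made up and only mimic the book_id, title, and similar_books fields. Using a directed graph (Book A points to Book B) and dropping recommendations that fall outside the sample are assumptions for this example, not necessarily what the repository does.

```python
import networkx as nx

# Hypothetical records shaped like the dataset's fields (values are made up)
books = [
    {"book_id": "1", "title": "Book A", "similar_books": ["2", "3"]},
    {"book_id": "2", "title": "Book B", "similar_books": ["3"]},
    {"book_id": "3", "title": "Book C", "similar_books": ["1"]},
    {"book_id": "4", "title": "Book D", "similar_books": ["1", "99"]},  # "99" is not in the sample
]

sample_ids = {book["book_id"] for book in books}

G = nx.DiGraph()

# Each book becomes a node, with its title stored as a node attribute
for book in books:
    G.add_node(book["book_id"], title=book["title"])

# Each recommendation becomes an edge from the viewed book to the suggested book
for book in books:
    for other_id in book["similar_books"]:
        if other_id in sample_ids:  # assumption: keep only edges inside the sample
            G.add_edge(book["book_id"], other_id)

print(G.number_of_nodes(), G.number_of_edges())
```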

In network analysis, “node importance” quantifies a node’s influence and impact within a graph. In the case of the Goodreads book system, it identifies which book titles are most central in the platform’s recommendation algorithm. To measure node importance, I used 3 centrality metrics in my analysis:

Degree Centrality: Degree centrality counts the number of direct connections (edges) a node has. From the sample of 1,000 books, “Quintana of Charyn” is the biggest bridge between young adult books. This indicates that

Betweenness Centrality: Betweenness centrality measures how often a node appears on the shortest paths between other nodes.

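PageRank: PageRank scores a node based on how many other nodes link to it and how important those linking nodes are.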
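
The three metrics can be sketched on a toy graph with NetworkX's built-in functions. The graph below is invented for illustration (the letters stand in for book IDs) and is not the Goodreads sample, so the scores say nothing about the actual results.

```python
import networkx as nx

# Toy directed graph: an edge X -> Y means "Y is recommended on X's page"
G = nx.DiGraph()
G.add_edges_from([
    ("A", "B"),
    ("C", "B"),
    ("D", "B"),
    ("B", "E"),
    ("E", "A"),
])

# Degree centrality: share of the other nodes a node is directly connected to
degree = nx.degree_centrality(G)

# Betweenness centrality: how often a node lies on shortest paths between others
betweenness = nx.betweenness_centrality(G)

# PageRank: importance based on incoming links and the importance of their sources
pagerank = nx.pagerank(G)

top_degree = max(degree, key=degree.get)
top_betweenness = max(betweenness, key=betweenness.get)

print(top_degree, top_betweenness)
for node, score in sorted(pagerank.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 3))
```

In this toy graph, node B receives the most recommendations and also sits on most of the shortest paths, so it leads on both degree and betweenness centrality.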

Github Link: https://github.com/melatabera/inst-414-module2#

Dataset (Young Adult): https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html#datasets

Goodreads Recommendation System: Echochamber or Gateway to Discovery? was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Link between University Ranking and Post-Graduate Earnings

Melat Abera — Mon, 23 Mar 2026 07:43:36 GMT

Introduction

As high school seniors prepare to apply to university, one important question is on their minds: which schools should I pick? Some students may set their sights on prestigious universities, hoping that the name can give them a leg up when looking for jobs. Others may choose in-state colleges to save money or to more easily transfer college credits. But beyond these, there is another huge factor that everyone, especially now, pays attention to: post-graduate earnings.

As the cost of living goes up each year, many want to set themselves up for a high-earning future. The choice to go to university alone can result in better earnings post-grad, but does the school you pick really matter? Many argue that big-name universities get you resources or connections, which translate into higher earnings. But others reiterate the importance of choosing the right college major for higher earnings, even at an in-state school, over breaking the bank at a top university. In the context of our current economy, which one is more important?

My Question: Does attending a top university result in higher post-grad salaries than attending an in-state university?

In this analysis, rising college freshmen will be the primary stakeholders. As the requirements needed to enter college become more competitive, students need to be more selective about which applications they prioritize. The comparison of post-grad earnings between top university and state university graduates will directly inform students on which schools to add to their list. In addition, students will be able to better decide which universities align with their goals and give a better return on investment.

The Data

For this analysis, I utilized 2 primary datasets from Kaggle. The first dataset, Where it Pays to Attend College, included a total of 8 fields:

School Name: The name of the university (249 unique values)
School Type: Classifies each university into one of five categories (State, Liberal Arts, Party, Engineering, Ivy League)
Starting Median Salary: Represents the median salary for recent graduates
Mid-Career Median Salary: Represents the median salary for graduates with approximately 10 years of professional experience
Mid-Career 10th Percentile Salary: The salary below which 10% of mid-career graduates earn (and 90% earn more), representing a lower bound salary benchmark
Mid-Career 25th Percentile Salary: The salary below which 25% of mid-career graduates earn (and 75% earn more)
Mid-Career 75th Percentile Salary: The salary below which 75% of mid-career graduates earn (and 25% earn more)
Mid-Career 90th Percentile Salary: The salary below which 90% of mid-career graduates earn (and 10% earn more), representing the highest earning mid-career graduates

From this dataset, I collected the university names, university types, starting median salary, and mid-career median salary. For the university types field, I only included State and Ivy League schools, since my research question compares top universities with state schools. Starting median salary and mid-career median salary serve as measures of immediate and long-term earning potential across the universities.

I also used Kaggle’s US University & College Ranking dataset to filter for the top universities and state schools. This dataset contains a total of 41 fields, most of which are yearly college rankings from 1984 to 2023.

University Name: The name of the university (167 unique values)
IPED ID: Unique six-digit code for universities that participate in financial aid programs
State Code: Abbreviation of the state in which each university is located
2023–1984: Columns corresponding to the rankings for each university based on year

For simplicity, I used only the 2023 rankings and the university names, which I cross-referenced with the primary dataset to create a comprehensive list of top and state universities.

Collection

For cleaning and visualization, I used three Python libraries: Pandas, NumPy, and Matplotlib. The Pandas and NumPy libraries were used for creating data frames and filtering by certain categories (school type, starting median salary, and mid-career median salary). Using both, I extracted the columns I needed for the visualizations and removed any unnecessary symbols from the numbers ($, commas).
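
The symbol cleanup described above could look something like the sketch below. The column names follow the field list given earlier, but the rows and the exact string format ("$42,000.00") are assumptions made for the example rather than values copied from the Kaggle file.

```python
import pandas as pd

# Hypothetical rows in a raw format with "$" and "," in the salary strings
raw = pd.DataFrame({
    "School Name": ["Example State University", "Example Ivy College"],
    "School Type": ["State", "Ivy League"],
    "Starting Median Salary": ["$42,000.00", "$58,500.00"],
    "Mid-Career Median Salary": ["$74,000.00", "$110,000.00"],
})

salary_cols = ["Starting Median Salary", "Mid-Career Median Salary"]

clean = raw.copy()
for col in salary_cols:
    # Strip "$" and "," and convert the remaining text to a number
    clean[col] = clean[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(clean.dtypes)
print(clean)
```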

After cleaning the data itself, I manually curated a list of top universities by cross-referencing the two datasets. Because the naming conventions between the two were different, manually matching and compiling the list of colleges proved more effective than an automated match. The final list included all eight Ivy League schools along with a selection of highly ranked and lower ranked state schools. After creating the list, I filtered the salary dataset to only retain schools that appeared in the curated list. This will be the basis for my analysis.
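
One way to apply a hand-curated list like this is to keep it as a mapping from school name to group and filter on it. The sketch below uses made-up school names and salaries, and the "Group" column and "Top"/"State" labels are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical cleaned salary data (values are made up)
salaries = pd.DataFrame({
    "School Name": [
        "Example Ivy College",
        "Example State University",
        "Example Tech Institute",
    ],
    "Starting Median Salary": [58500.0, 42000.0, 61000.0],
})

# Hypothetical curated list, written by hand after cross-referencing the rankings
curated = {
    "Example Ivy College": "Top",
    "Example State University": "State",
}

# Keep only schools on the curated list, then label each with its group
filtered = salaries[salaries["School Name"].isin(list(curated))].copy()
filtered["Group"] = filtered["School Name"].map(curated)

print(filtered)
```

Keeping the list as a mapping means the exact spelling used in the salary dataset is written down once, which helps when the two sources name the same school differently.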

Analysis

To answer my research question, I conducted an analysis focused on comparing starting median salary and mid-career median salary across both school types. Then, I used a side-by-side bar chart and box plot to visualize the final results.

Side-by-Side Bar Chart

For my first visualization, I created a side-by-side bar chart to compare top universities and state schools across two categories: starting salary and mid-career salary. This chart was ideal because it allows for both school types to be compared over both categories at once. Rather than looking at separate charts, the side-by-side format allows for immediate comparison between the groups.

Upon exploration of the bar chart, we can reasonably deduce two findings. First, the difference in starting salary between top universities (approx. $47,000) and state schools (approx. $42,000) is minimal, at roughly $5,000. While top universities do have a slightly higher median starting salary, the difference is not large enough to conclude that school type alone drives better early career earnings. However, the difference in mid-career salary is more interesting. The gap between top universities (approx. $88,000) and state schools (approx. $74,000) is larger, at roughly $14,000. This indicates that the difference in post-grad earnings is less apparent immediately after graduation but becomes more pronounced over time.

Box Plot

My second visualization is a box plot that, again, compares the two types of universities across two categories. The box plot adds more context because it includes measures such as outliers, upper/lower bounds, ranges, and medians. This creates an interesting picture of how each group’s salary outcomes are distributed, rather than showing a single average figure.

With the added distribution, we get a new perspective that the single means from the bar chart didn’t show us. When looking at all four categories, mid-career graduates from top universities see the greatest variability in salary outcomes. Previously, it was concluded that the mid-career salaries at top universities had a more pronounced advantage over state universities. However, this box plot reveals that this advantage could be due to outliers that pull the average upward. It also could mean that those who attend these top universities have a larger range of earning potential mid-career. In contrast, the in-state mid-career graduates seem to have a much tighter range, indicating much more predictable and consistent earnings. This group also contains outliers in the lower salary ranges, indicating that some in-state schools have graduates earning well below the group’s average.
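
For context on how a box plot decides what counts as an outlier: Matplotlib's default is to flag points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule with NumPy to made-up salaries. The function name is hypothetical, and NumPy's default percentile interpolation may differ slightly from the plotting library's internals.

```python
import numpy as np


def box_plot_summary(values, whis=1.5):
    """Quartiles, IQR, and outliers using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "outliers": outliers.tolist(),
    }


# Hypothetical mid-career salaries for one group (values are made up)
salaries = [70000, 72000, 74000, 76000, 78000, 120000]
summary = box_plot_summary(salaries)
print(summary)
```

Here the single very high salary lies above the upper fence, so it would be drawn as an outlier point. It also pulls the mean well above the median, which is the effect discussed above.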

The entry-level salaries from top universities and in-state universities are both consistent in their ranges. Top universities, however, have more outliers, which indicates that these universities, again, come with a better chance of earning above the average.

Validation + Limitations

Validation

I used Kaggle as the main source for this analysis. Though it is widely regarded as a reputable platform, it can sometimes include messy data and not always give the full picture. When searching for datasets, I made sure to use those which linked to original, credible sources, such as US News and the Wall Street Journal. In addition, I ensured that both datasets were cleaned and contained appropriate values, removing extra symbols and filtering through/dropping any null rows or unnecessary columns.

In addition, I used Claude to help with some of the data analysis methods I used. But, as Claude can be flawed at times, I verified the inputs and outputs of the data through different Python methods. I used .unique() to print the lists of top universities and state universities, ensuring all the data was divided into the right categories. I also checked how many observations were in each group to ensure that unequal group sizes weren’t skewing the results. Finally, I used the .describe() method to check if the means I was seeing in the data visualizations were accurate.
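
The checks described above might be sketched as follows. The data frame, the school names, and the "Group" column are invented for the example. value_counts() is one way to count observations per group, but it is an assumption here, since the text does not say which call was used for that step.

```python
import pandas as pd

# Hypothetical filtered data (values are made up)
df = pd.DataFrame({
    "School Name": [
        "Example Ivy College",
        "Example Ivy University",
        "Example State University",
        "Example State College",
        "Example State Tech",
    ],
    "Group": ["Top", "Top", "State", "State", "State"],
    "Starting Median Salary": [60000.0, 56000.0, 42000.0, 44000.0, 40000.0],
})

# 1. Confirm which schools ended up in each group
top_schools = df.loc[df["Group"] == "Top", "School Name"].unique()
state_schools = df.loc[df["Group"] == "State", "School Name"].unique()

# 2. Check how many observations each group has
group_sizes = df["Group"].value_counts()

# 3. Compare summary statistics against the numbers shown in the charts
summary = df.groupby("Group")["Starting Median Salary"].describe()

print(top_schools)
print(state_schools)
print(group_sizes)
print(summary[["count", "mean", "50%"]])
```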

Limitations

One of the biggest limitations of this analysis was the datasets I used. While they contained a fair number of universities, they didn’t have all of them, which means that the results of the data could look different on a larger scale. In the future, I would aim to find a more encompassing dataset.

As with any dataset, there are also biases. One obvious one was the manually curated list of universities from the US News Rankings dataset. The decision of which universities were considered “top” was solely reliant on the opinion of the source. But, if I were to find other datasets, the list could look slightly different (selection bias). In addition, the methodology does not compare individual state universities with individual top universities. Instead, they are all grouped together as a monolith and averaged out. If the comparisons involved subsets of each group (e.g., 20 schools), they could perhaps have revealed more in-depth insights into post-graduate earnings.

Github Repository: https://github.com/melatabera/module-1-assignment

The Link between University Ranking and Post-Graduate Earnings was originally published in INST414: Data Science Techniques on Medium, where people are continuing the conversation by highlighting and responding to this story.