Module III: The Strategies in Planning for a High School Decathlon

Published in INST414: Data Science Techniques, May 2, 2024

When I was in high school, one of the most transformative experiences of my life was the D.C. Muslim Interscholastic Tournament (MIST), a 3-day weekend of academic decathlon (plus sports!) that brought together more than 60 high schools from D.C., Maryland, and Virginia, with more than 750 competitors across the region. It was an experience that helped me grow as an individual and realize many strengths that I’ve since put to use in my college career.

Since starting college, I’ve given back to the competition by serving on the organizing team. I currently serve as the Registration and Technology chair, leading a team that coordinates the registration of hundreds of students across the D.C. area.

One of the main problems that comes with organizing a tournament of this scale is the huge amount of data needed to create a schedule that fits every competitor’s needs. On the tournament weekend, each competitor can enter a maximum of 6 competitions, which means there are a large number of combinations the logistics team has to coordinate so that every competitor can take part in all of their competitions.

Taking this into account as the Registration Chair, I built a similarity calculator that uses the Jaccard Index to measure how similar different competitors are to each other in terms of the competitions they selected.

Data Collection and Composition

My data comes from my.getmistified.com, the website where students register for the competition. As an organizer, I have access to a .csv that contains all competitor information, including names, emails, phone numbers, and the competitions they selected. For security reasons, I anonymized the data by removing any personally identifiable information (PII) from the CSV before using it in the calculator.

The .csv contains the information about the individual competitor and the competitions that they’re competing in.
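As a rough illustration, an anonymized export like this could be read into a dictionary that maps each competitor ID to the competitions they selected. This is a minimal sketch using Python's standard csv module; the column names (`mist_id`, `competition_1`, and so on), the IDs, and the competition names are all hypothetical and do not reflect the actual export layout.

```python
import csv
import io

# Hypothetical anonymized export: one row per competitor, blank cells for unused slots.
SAMPLE_CSV = """mist_id,competition_1,competition_2,competition_3
AAAAA-00001,Debate,Basketball,
AAAAA-00002,Debate,,
AAAAA-00003,Poetry,Basketball,Debate
"""

def load_selections(csv_file):
    """Map each competitor ID to the list of competitions they selected."""
    selections = {}
    for row in csv.DictReader(csv_file):
        competitor_id = row.pop("mist_id")
        # Keep only non-empty cells so blank slots do not count as a competition.
        selections[competitor_id] = [c for c in row.values() if c]
    return selections

selections = load_selections(io.StringIO(SAMPLE_CSV))
print(selections["AAAAA-00001"])
```

Dropping the blank cells at load time plays the same role as the NaN cleanup described later in the post.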

Data Analysis: How Do We Define Similarity?

In creating the logistics of MIST, one of the key things to know is how many competitors will be competing in each competition. With more than 40 competitions to fit into rooms at the venue in fixed time slots, we have a lot of planning to do to make sure we’re not signing competitors up for conflicting periods or times. The Jaccard Index, in this case, measures the similarity of a chosen competitor to each of the more than 700 other competitors, to see which of them are most similar.

By defining similarity this way, we as an organizing team can calculate how similar competitors are to each other in terms of their competitions. Based on that index, we can factor the overlap into the schedule to make sure we’re not scheduling competitions that share many competitors at or around the same time.
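One way the overlap could feed into scheduling is to count how many competitors are registered for each pair of competitions. The sketch below is only an illustration of that idea, not the method used in this analysis; the IDs, the competition names, and the helper `shared_competitor_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical selections: competitor ID -> competitions chosen.
selections = {
    "AAAAA-00001": ["Debate", "Basketball"],
    "AAAAA-00002": ["Debate", "Poetry"],
    "AAAAA-00003": ["Debate", "Basketball", "Poetry"],
}

def shared_competitor_counts(selections):
    """Count how many competitors are registered for each pair of competitions."""
    pair_counts = Counter()
    for competitions in selections.values():
        # Sort so ("Debate", "Poetry") and ("Poetry", "Debate") count as one pair.
        for pair in combinations(sorted(set(competitions)), 2):
            pair_counts[pair] += 1
    return pair_counts

pair_counts = shared_competitor_counts(selections)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs with high counts would be the ones to keep apart in the timetable.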

The Process

import pandas as pd

# Drop NaN placeholders so each ID maps only to the competitions actually selected
mistDict = {key: [value for value in values if pd.notna(value)]
            for key, values in mistDict.items()}

def jaccardSimilarity(mistID):
    # sampleID (the reference competitor's ID) must be defined before this is called
    refSet = set(mistDict[sampleID])
    testSet = set(mistDict[mistID])

    intersection = refSet.intersection(testSet)
    union = refSet.union(testSet)
    # An empty union (neither competitor selected anything) is scored as 0
    similarity = float(len(intersection)) / float(len(union)) if union else 0

    return [mistID, similarity]

The top part of the code cleans up the data loaded from the .csv file, removing all NaN values so that they do not interfere with the Jaccard results.

The second piece of code is a Jaccard similarity calculator that compares two sets: the reference set, built from a competitor I selected from the dataset (I chose him specifically because he was registered for a competition in almost every category), and the test set, built from another ID in the dataset. I added each student’s competitions to their set, and because sets have built-in intersection and union methods, I could compute both values directly. I then applied the Jaccard formula, J(A, B) = |A ∩ B| / |A ∪ B|, where ∩ represents the intersection and ∪ represents the union. Floats were used to make sure the division returned a decimal value.
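To make the formula concrete, here is a small worked example with two made-up competitors. The competition names are hypothetical and chosen only for illustration.

```python
# Hypothetical competition selections for two competitors.
a = {"Debate", "Poetry", "Basketball"}
b = {"Debate", "Basketball", "Math Olympics", "Short Film"}

shared = a & b      # competitions both selected: Debate and Basketball
combined = a | b    # every competition either one selected: 5 in total

jaccard = len(shared) / len(combined)
print(jaccard)  # 2 / 5 = 0.4
```

Two shared competitions out of five distinct ones gives an index of 0.4.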

jacSims = sorted(jacSims, key=lambda tup: tup[1], reverse=True)

Afterwards, I returned that value alongside its respective ID so I could build a top 10 of the competitors most similar to the reference ID. This code sorts the list by the second value in each pair (the similarity score) and reverses the order so the greatest value comes first.
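The step that builds the list before sorting is not shown above. A minimal sketch of how such a list might be assembled and trimmed to the top 10 follows; the data, the `reference_id` name, and the standalone `jaccard` helper are assumptions made for this example, not the code from the repository.

```python
# Hypothetical data: competitor ID -> competitions chosen.
selections = {
    "REF": ["Debate", "Poetry", "Basketball"],
    "A": ["Debate", "Poetry"],
    "B": ["Basketball"],
    "C": ["Short Film"],
}
reference_id = "REF"

def jaccard(first, second):
    """Jaccard index of two collections; 0.0 when both are empty."""
    first, second = set(first), set(second)
    union = first | second
    return len(first & second) / len(union) if union else 0.0

# Score every competitor except the reference against the reference.
scores = [
    (other_id, jaccard(selections[reference_id], chosen))
    for other_id, chosen in selections.items()
    if other_id != reference_id
]

# Highest similarity first, keeping at most ten entries.
top_matches = sorted(scores, key=lambda pair: pair[1], reverse=True)[:10]
print(top_matches)
```

Excluding the reference ID from the loop keeps its trivial self-match of 1.0 out of the ranking.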

Results

From conducting this analysis, I found out that the 10 most similar IDs to the reference ID were:

‘SM16O-E7PMV’, 0.6666666666666666
‘EXIMZ-0939L’, 0.4
‘2IUSV-DFOXR’, 0.375
‘EXIMZ-VXQF3’, 0.3333333333333333
‘EXIMZ-BJDQK’, 0.3333333333333333
‘VJ2GK-6A7VC’, 0.3333333333333333
‘51M64-SPWLQ’, 0.3333333333333333
‘A85Q9-EWQDX’, 0.3333333333333333
‘Y0J9Q-MNXOI’, 0.3333333333333333
‘JW7VO-2P34A’, 0.3333333333333333

The closer the number is to 1, the more competitions that ID shares with the reference ID. For me, this showed that this competitor doesn’t have many competitions in common with other competitors, so we can ignore this part of scheduling. If more than 5 competitors had a Jaccard Index of 0.5 or above, we would need to review those IDs and the competitions they share, and plan our schedule with those students in mind.
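The review rule can be expressed as a short check against the ten scores listed in the results. This is a sketch under the thresholds stated in the text; the constant names are my own.

```python
# The ten scores reported in the results for the reference ID.
top_ten = [
    ("SM16O-E7PMV", 0.6666666666666666),
    ("EXIMZ-0939L", 0.4),
    ("2IUSV-DFOXR", 0.375),
    ("EXIMZ-VXQF3", 0.3333333333333333),
    ("EXIMZ-BJDQK", 0.3333333333333333),
    ("VJ2GK-6A7VC", 0.3333333333333333),
    ("51M64-SPWLQ", 0.3333333333333333),
    ("A85Q9-EWQDX", 0.3333333333333333),
    ("Y0J9Q-MNXOI", 0.3333333333333333),
    ("JW7VO-2P34A", 0.3333333333333333),
]

SIMILARITY_THRESHOLD = 0.5  # cutoff stated in the text
REVIEW_COUNT = 5            # review only when more than this many IDs meet the cutoff

high_overlap = [cid for cid, score in top_ten if score >= SIMILARITY_THRESHOLD]
needs_review = len(high_overlap) > REVIEW_COUNT
print(high_overlap, needs_review)
```

Only one ID clears the cutoff here, so the rule does not trigger a review for this reference competitor.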

Limitations

While this analysis gives me a useful statistical look at the overlap in competition selections among participants, several limitations and potential biases exist:

Limited to Specific Data

The analysis relies solely on the IDs of participants and the competitions they select. It does not consider other factors such as individual preferences. Due to the dataset, this was the only “measurable” factor that could be calculated.

Oversimplification

Using Jaccard similarity coefficients between competitors treats each competition choice as equally significant. While that may be the case for the organizer, some competitions may carry more weight for participants than others. Without weighting competitions by significance, the analysis may oversimplify the data.
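If competitions were given weights, one possible adjustment is a weighted variant of the Jaccard index, where each competition contributes its weight instead of counting as 1. The sketch below is an assumption about how that could look, not something used in this analysis; the weights, the competition names, and the `weighted_jaccard` helper are hypothetical.

```python
def weighted_jaccard(first, second, weights, default_weight=1.0):
    """Weighted overlap: shared weight divided by total weight across both sets."""
    first, second = set(first), set(second)

    def total(items):
        # Competitions without an explicit weight fall back to default_weight.
        return sum(weights.get(item, default_weight) for item in items)

    union_weight = total(first | second)
    return total(first & second) / union_weight if union_weight else 0.0

# Hypothetical weights: Debate is treated as three times as significant.
weights = {"Debate": 3.0, "Poetry": 1.0, "Basketball": 1.0}
a = ["Debate", "Poetry"]
b = ["Debate", "Basketball"]

weighted = weighted_jaccard(a, b, weights)
unweighted = weighted_jaccard(a, b, {})
print(weighted, unweighted)
```

With these weights the shared Debate entry counts for 3 of 5 total, so the score rises from one third to 0.6. With an empty weights dictionary the function reduces to the ordinary Jaccard index.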

Conclusions

Stepping into a larger leadership role for DC MIST has shown me the power of data to create a more meaningful registration process. Using a data-driven strategy informed by previous years, along with massive PR and outreach, I was able to increase registration by more than 400 compared with the previous year. The competition is always a work in progress, and using concepts I learned in class to benefit it has shown me the power of data analysis and its effectiveness at creating tangible results that can be seen in the real world.

In the case of this similarity index, showing these results to the competition’s logistics coordinator has helped her make more informed decisions about which time slots competitions should be placed in and which days each should take place on. Seeing the real-world effects of similarity measures like the Jaccard Index has pushed me to look for ways to apply them in other areas of my work.

GitHub link to code: https://github.com/CaptFalc/Assignment-3

Written by Wadi Ahmed