Wedding Insights with Joy
Weddings are one of the most cherished days in a person’s life. They can also be one of the most difficult and stressful events to plan. The difficulty of wedding planning has spawned movies, TV shows, and an entire industry of professional wedding planners. In the internet age, wedding planning has also moved online. Joy is a fast-growing website (and accompanying mobile app) that provides a platform for couples to plan and organize their big day, from handling guest lists and RSVPs to hosting pictures and creating both digital and physical wedding invitations. In addition to making the entire wedding process easier and more enjoyable, the digitalization of wedding planning also means that we now have a treasure trove of data about weddings. This data can be used to derive insights and hopefully make life easier for future love-struck couples.
During my time as an Insight Data Science Fellow, I worked on a consulting project with Joy. My ultimate goal was to derive data-driven insights that Joy could publish to help couples better plan their weddings, and to aid the entire wedding industry in general. The full set of insights that I derived will be published on Joy’s website, but in this article I will dive into one of my favorite insights, and the technical steps required to get there.
“What fraction of my invited guests will actually come to my wedding?”
One of the most difficult, but necessary, steps in wedding planning is estimating what fraction of the invited guests will actually come to your wedding. This information is vital for making informed decisions about every stage of a wedding, such as how many people to invite in the first place, which venue to choose, making catering reservations, etc.
A simple Google search for “how many people should I invite to my wedding” returns many generic, ad hoc answers such as “Expect 10–20% of your guests to decline” or “Invite 10% more people than you want to actually show up”. I wanted to address this question with data and see if we could derive more meaningful and personalized answers.
Data Cleaning & Enhancement
The first step of my analysis was to make sure that I was working with a clean dataset, which meant removing data related to any test weddings or sample weddings that Joy had in its database. I implemented several filters to produce a sizable and clean data set on which I could do further analysis.
Once I had a clean dataset, I could start doing some initial analysis. However, I quickly realized that the majority of the data came from unstructured, free-form text boxes, meaning that I only had whatever the user decided to input. This meant that I would have to do some data processing and enhancement to get the information I needed. One area where this was particularly necessary was location data (of both the actual wedding venues and the hometowns of couples and guests). The information provided by the users showed huge variation, from full street addresses (“2259 Kalakaua Ave, Honolulu, HI 96815”) to single words (“Hawaii”) to multi-sentence descriptions (“The big pink hotel on Waikiki beach”).
I employed a variety of techniques to obtain the geographic location (specifically, latitude and longitude coordinates) of both venues and hometowns from these text inputs. The hometowns of the couples were fairly easy to identify, as I was able to find their IP addresses. I then performed an IP lookup using the free GeoLite data from MaxMind, which provided approximate latitudes and longitudes. The locations of wedding venues and guests were harder to find, as I had to rely solely on the free-form text inputs. The first piece of information I checked for was a 5-digit number, which I interpreted as a US zip code. Zip codes provide approximate coordinates that were accurate enough for my purposes (I used the Python package uszipcode). For all addresses that didn’t include a zip code (such as international addresses), I resorted to geolocating the text through the OpenStreetMap API (using the geopy Python client). I used the OpenStreetMap API because it has no daily limit on the number of queries you can submit. However, it does restrict you to no more than 1 query per second, which meant that geolocating over 50,000 text strings required a couple of days of querying.
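The zip-code-first strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the names extract_zip, locate, zip_lookup, and geocode are hypothetical, and the two lookup callables stand in for whatever uszipcode and geopy calls were actually made. The demo coordinates are approximate and for illustration only.

```python
import re
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# A standalone 5-digit number, optionally followed by a ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?(?!\d)")


def extract_zip(text: str) -> Optional[str]:
    """Return the last standalone 5-digit number in the text, if any."""
    matches = ZIP_PATTERN.findall(text)
    # Assumption: in a US address the zip code comes last, so prefer the
    # final match over, say, a 5-digit street number at the start.
    return matches[-1] if matches else None


def locate(
    text: str,
    zip_lookup: Callable[[str], Optional[Coordinates]],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Try the cheap zip-code lookup first, then fall back to geocoding."""
    zip_code = extract_zip(text)
    if zip_code is not None:
        coords = zip_lookup(zip_code)
        if coords is not None:
            return coords
    # No usable zip code: hand the raw text to the (slow) geocoder.
    return geocode(text)


# Hypothetical stand-ins for the real zip database and geocoding service.
demo_zip_table = {"96815": (21.28, -157.83)}  # approximate, illustration only


def demo_zip_lookup(zip_code: str) -> Optional[Coordinates]:
    return demo_zip_table.get(zip_code)


def demo_geocode(text: str) -> Optional[Coordinates]:
    return None  # pretend geocoder that finds nothing


print(locate("2259 Kalakaua Ave, Honolulu, HI 96815", demo_zip_lookup, demo_geocode))
print(locate("The big pink hotel on Waikiki beach", demo_zip_lookup, demo_geocode))
```

In a real run, the geocode callable would also need to respect the one-query-per-second limit, for example by sleeping between calls or by wrapping the geocoder in geopy’s RateLimiter helper.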
Once I had the geographic location (latitude and longitude) of both the users and the wedding venues, I wanted to identify distinct regions in the world. I could have done this manually, by selecting regions that I might expect to behave differently (e.g. San Francisco, LA, Chicago, New England, etc.). The problem with this manual method is that it is based on preconceived notions rather than data, and my knowledge of regions outside of the US is much more limited. Instead, I decided to let the data tell me where the major regions are located by implementing a k-Means clustering algorithm.
The major hyperparameter that needs to be tuned in a k-Means clustering analysis is the number of clusters to include (this is the k in k-Means). There are no set rules for how to choose an optimal value of k, but one way to guide your choice is to plot how the cost function (the total squared distance between every point and its nearest cluster center) decreases with the number of clusters, and to choose a number of clusters around the “elbow” of this curve, where it starts to flatten out. Based on the elbow plot below, I chose to use 30 clusters. As a sanity check, I visually inspected the clusters produced by the algorithm: they seem reasonable and correctly separate distinct but nearby areas in the United States, such as San Francisco and Los Angeles.
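To make the elbow idea concrete, here is a small, self-contained sketch of k-Means and its cost function on toy data. It is not the project’s code: the function names are hypothetical, the farthest-point initialization is just one deterministic choice, and plain Euclidean distance is used, which is only an approximation if the inputs are raw latitude and longitude values.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def init_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_cost(points: Sequence[Point], k: int, iterations: int = 50) -> float:
    """Run Lloyd's algorithm and return the total squared distance
    between every point and its nearest cluster center."""
    centers = init_centers(points, k)
    for _ in range(iterations):
        # Assignment step: group points by nearest center.
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: three tight blobs, so the elbow should appear at k = 3.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

costs = {k: kmeans_cost(points, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

On this toy data the cost falls steeply as k goes from 1 to 3 and stays small afterwards, which is the shape one looks for in an elbow plot. With real data the curve is usually smoother, so the choice of k involves more judgment.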
What determines guest attendance?
With our data cleaned, transformed, and interpreted, we can finally dig into some analysis! As a first step, I wanted to see how the time of year of the wedding affected guest attendance. The plot below shows the median attendance fraction for the weddings that occurred in each month.
Gray error bars on the data represent the standard error (standard deviation / √N, where N is the number of weddings in that month), and the lightly shaded bars on the bottom of the plot represent a scaled histogram of how many weddings happened in each month. The data shows that the majority of weddings happen in the months of May-July, with very few weddings in fall and winter months. Overall, weddings in the summer have slightly higher attendance fractions, with a small but steady decrease in attendance as the year goes on.
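As a rough sketch of how the per-month numbers behind such a plot could be computed, the snippet below groups weddings by month and reports the median attendance fraction, the standard error, and the count. The function names and the sample values are invented for illustration and do not come from Joy’s data.

```python
import math
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def standard_error(values: List[float]) -> float:
    """Sample standard deviation divided by the square root of N."""
    return statistics.stdev(values) / math.sqrt(len(values))


def monthly_summary(
    weddings: Iterable[Tuple[int, float]],
) -> Dict[int, Tuple[float, float, int]]:
    """Map month -> (median attendance fraction, standard error, count).

    `weddings` is an iterable of (month, attendance_fraction) pairs.
    Months with fewer than two weddings are skipped, because the sample
    standard deviation is undefined for a single value.
    """
    by_month: Dict[int, List[float]] = defaultdict(list)
    for month, fraction in weddings:
        by_month[month].append(fraction)
    return {
        month: (statistics.median(vals), standard_error(vals), len(vals))
        for month, vals in sorted(by_month.items())
        if len(vals) > 1
    }


# Made-up sample: (month, fraction of invited guests who attended).
sample = [(6, 0.8), (6, 0.9), (6, 1.0), (12, 0.7), (12, 0.8), (3, 0.85)]
for month, (median, err, count) in monthly_summary(sample).items():
    print(month, round(median, 3), round(err, 3), count)
```

The count per month is what the scaled histogram at the bottom of the plot would be drawn from.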
Guests do not like to travel more than 100 miles
While time of year appears to have a minimal impact on wedding attendance, the distance that guests have to travel to get to a wedding does have a substantial impact, as seen in the following graph. It appears that 100 miles is the maximum distance that guests will travel before thinking twice.
As in the previous graph, the black line is the median attendance fraction of weddings, this time binned by the average distance their guests had to travel. The gray error bars represent the standard error. The red dashed lines give examples of the types of distances covered in this graph, and can be interpreted as the distance from Palo Alto to various large cities in California (and New York). One way to read this graph is: if your guests all live in Palo Alto, about 85% of them will come to a wedding in San Francisco. However, if you move that same wedding to New York and ask your guests to travel almost 3000 miles, then the fraction of guests that attend your wedding will drop to ~80%.
Nearby weddings (< 100 miles away) have a fairly consistent attendance fraction of about 85%, but as soon as travel distance becomes greater than 100 miles, guest attendance begins a steady decrease with distance. Intuitively, this makes sense, as 100 miles is about the distance you can cover in a 2-hour car ride. Beyond 100 miles, it appears that guests become less and less likely to put up with the hassle of travel to attend weddings!
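For readers who want to estimate the average travel distance for their own guest list, one option is the great-circle (haversine) distance between each guest’s coordinates and the venue. This is an assumption on my part for illustration: the distance formula used to produce the plot is not specified here, and the function names and coordinates below are hypothetical or approximate.

```python
import math
from typing import Sequence, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def average_travel_distance(
    venue: Coordinates, guests: Sequence[Coordinates]
) -> float:
    """Mean straight-line distance from each guest's hometown to the venue."""
    return sum(haversine_miles(*venue, *g) for g in guests) / len(guests)


# Approximate city coordinates, for illustration only.
palo_alto = (37.4419, -122.1430)
san_francisco = (37.7749, -122.4194)
new_york = (40.7128, -74.0060)

print(round(haversine_miles(*palo_alto, *san_francisco)))
print(round(average_travel_distance(new_york, [palo_alto, san_francisco])))
```

Great-circle distance is a straight-line measure, so it will come out shorter than the road or door-to-door travel distance a guest actually experiences.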
This analysis demonstrates that there are some clear, key factors that can affect wedding attendance. A broad-stroke, one-size-fits-all solution is no longer necessary, and if we leverage the power of data, we can find much more personalized and accurate estimates. For those of you with weddings in your future, consider the distances your guests have to travel and use the above plot to estimate how many guests will attend!
The exponential growth of data over the last decade is affecting almost all aspects of life. The once-in-a-lifetime nature of weddings means that most people who have to plan a wedding are going through the experience for the first time, and do not have the benefit of personal experience. In the past, this meant that couples had to rely almost completely on the experiences of only a handful of close advisors (parents, friends, or a hired professional). By collecting and sharing data on numerous weddings, Joy is allowing new couples to learn from the experiences, failures, and successes of thousands of previous couples and their weddings.