Average Jeans Color by State, 2020
Want to help me improve the map? Fill out the 3-question survey here!
This month I bought the first pair of blue jeans I’ve owned in a decade. I don’t know what inspired me to stray from my usual black-on-black, but regardless, I’m never looking back; I’ve been dressing like an androgynous Jerry Seinfeld every day since.
With denim on my mind, I found myself revisiting a meme I had seen on Twitter months ago: Average Jeans Color Per State. In this post, I’m going to show you how I recreated this map with Python.
The first step in this process involved researching the source and methodology used to create the original map. I scoured the internet, and eventually came to the conclusion that the origin of this map will remain a mystery to us all. In a similar vein, I was also unable to find any datasets about denim color popularity, which led me to think the original map was altogether made up.
Thus, I decided that if the data didn’t exist, I would make my own through primary data collection. I created a survey to collect age, State of residency, and the participant’s favorite shade of denim to wear, chosen from a selection of eight possible shades.
I shared the survey among friend/family circles, on social media channels, and in online forums specific to survey sharing, such as r/SampleSize on Reddit. At the time of writing, I have received 377 responses.
The next step in this process involved deriving RGB and hex values from each of the eight denim picture samples from the survey. This was accomplished using a Python library called Color Thief, which can be used to grab the RGB color palette from an image. Under the hood, Color Thief uses color quantization (a modified median cut algorithm, rather than k-means clustering) to return the most dominant colors in an image. You can specify the number of colors to grab with the color_count parameter:
rgb_palette = ColorThief(img_path).get_palette(color_count=6, quality=10)
After grabbing the RGB values for each image, I wrote a function to display each original image alongside its dominant colors in a pie chart, converted to hex values:
hex_palette = [webcolors.rgb_to_hex(rgb) for rgb in rgb_palette]
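For readers who want to see what the conversion amounts to, here is a minimal sketch of RGB-to-hex formatting using only the standard library. The helper name and the RGB values are made up for illustration; in the project itself the conversion is done by webcolors.rgb_to_hex.

```python
def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple of 0-255 integers as '#rrggbb'."""
    r, g, b = rgb
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

# Made-up palette values, standing in for Color Thief's output.
rgb_palette = [(30, 30, 35), (150, 180, 210)]
hex_palette = [rgb_to_hex(rgb) for rgb in rgb_palette]
print(hex_palette)
```

Each channel is written as two lowercase hex digits, so every color becomes a seven-character string that Matplotlib accepts directly.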
Here are some examples using sample photos:
When applied to a denim swatch, this is the result:
Average RGB Values
After grabbing all of the RGB values from the eight survey images, I needed to map these colors to each survey response. To accomplish this, I constructed a pandas DataFrame from the results .csv file, and then used np.select() to map each shade to its corresponding set of RGB values.
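The mapping step could look roughly like the sketch below. The shade labels and RGB values are hypothetical placeholders (the real survey had eight shades), and the channel() helper is my own illustration rather than the project's actual code.

```python
import numpy as np

# Hypothetical shade labels and RGB values, for illustration only.
shade_rgb = {
    "light wash": (150, 180, 210),
    "medium wash": (80, 110, 160),
    "black": (30, 30, 35),
}

# One entry per survey response.
responses = np.array(["black", "light wash", "medium wash", "black"])

# One boolean mask per shade, in the same order as shade_rgb.
conditions = [responses == shade for shade in shade_rgb]

def channel(i):
    """Pick channel i (0=r, 1=g, 2=b) of each response's shade."""
    choices = [rgb[i] for rgb in shade_rgb.values()]
    # default=-1 flags any response that matches no known shade.
    return np.select(conditions, choices, default=-1)

r, g, b = channel(0), channel(1), channel(2)
print(r, g, b)
```

np.select() walks the conditions in order and takes the matching choice, so one call per channel yields three arrays that can be stored as the r, g, and b columns.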
The next task involved averaging all of the responses together to arrive at one color per State. I did this by grouping the data by State, and taking the mean of each RGB value to derive the average RGB per state. This average RGB tuple was then converted to its corresponding hex value and added to the DataFrame.
grouped = df.groupby('state')[['r', 'g', 'b']].mean()
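Here is a small, self-contained sketch of the averaging step followed by the hex conversion. The rows are invented and the to_hex() helper is hypothetical; it only assumes the DataFrame has state, r, g, and b columns as described above.

```python
import pandas as pd

# Invented responses: one row per respondent, RGB already mapped.
df = pd.DataFrame({
    "state": ["MI", "MI", "NY", "NY", "NY"],
    "r": [30, 150, 80, 80, 30],
    "g": [30, 180, 110, 110, 30],
    "b": [35, 211, 160, 160, 35],
})

# Mean of each channel per state.
grouped = df.groupby("state")[["r", "g", "b"]].mean()

def to_hex(row):
    """Round each averaged channel to an integer and format as '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*(int(round(v)) for v in row))

grouped["hex"] = grouped[["r", "g", "b"]].apply(to_hex, axis=1)
print(grouped)
```

Note the double brackets when selecting several columns after groupby(); recent pandas versions reject the single-bracket tuple form.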
The last step of creating the new map involved locating a shapefile representing United States borders. A shapefile stores GIS data describing the geometry of geographic features, in this case the outline of each state. I used a package called GeoPandas to merge the shapefile with my main DataFrame, creating a GeoDataFrame.
Finally, I used Matplotlib to plot the GeoDataFrame:
The purpose of this project was to learn about extracting dominant colors from images with Python more so than to derive any legitimate insights about jeans. I caution against drawing any conclusions from the mapping aspect of this project for the following reasons:
The results shown here represent 377 survey responses, averaging ~7 respondents per State, which is not nearly a large enough sample to legitimize a relationship between color preference and location.
The survey respondents do not represent an independent, random sample. My primary method of collecting responses was through my own social media accounts, meaning most respondents live in either Michigan or New York. In addition, most respondents fit a similar demographic to me in age, socioeconomic status, and education level. As a result, this sample was not truly representative of the US population, and cannot be used to generalize denim preferences.
A data science instructor once brought to my attention that even the “raw” data we use is biased in some capacity.
I experienced this first-hand through this exercise: I created the survey, meaning I selected the eight denim shades respondents had to choose from, based on my own subjective idea of what a representative sample of denim looks like. Had another colleague created the survey, the outcome of the project may have looked entirely different.
The “raw” data we find was ultimately produced by an entity that is biased in some capacity, as all humans are. Thus, it is important to remember as data scientists that even when our methodology is designed to eliminate as much bias as possible, each person’s perception of the world is subjective, and so is the data they create.
Want to explore the full project? Check out the GitHub repository here!