Average Jeans Color by State, 2020

Published in The Startup, Oct 27, 2020 (5 min read)

Want to help me improve the map? Fill out the 3-question survey here!

Original map: Average Jeans Color per State, 2018

This month I bought the first pair of blue jeans I’ve owned in a decade. I don’t know what inspired me to stray from my usual black-on-black, but regardless, I’m never looking back; I’ve been dressing like an androgynous Jerry Seinfeld every day since.

With denim on my mind, I found myself revisiting a meme I had seen on Twitter months ago: Average Jeans Color Per State. In this post, I’m going to show you how I recreated this map with Python.

Data Collection

The first step in this process involved researching the source and methodology used to create the original map. I scoured the internet and eventually concluded that the origin of this map will remain a mystery to us all. In a similar vein, I was unable to find any datasets about denim color popularity, which led me to think the original map was altogether made up.

Survey

Thus, I decided that if the data didn’t exist, I would make my own through primary data collection. I created a survey to collect age, State of residency, and the participant’s favorite shade of denim to wear, chosen from a selection of eight possible shades.

I shared the survey among friend/family circles, on social media channels, and in online forums specific to survey sharing, such as r/SampleSize on Reddit. At the time of writing, I had received 377 responses.

Color Thief

The next step was to derive RGB and hex values for each of the eight denim sample images from the survey. I did this with a Python library called Color Thief, which extracts a palette of dominant RGB colors from an image. Under the hood, Color Thief uses a modified median cut quantization (MMCQ) algorithm, which repeatedly splits the image’s colors into boxes in RGB space and returns a representative color for each box. You can specify the number of colors to grab with the color_count parameter; the quality parameter controls how densely pixels are sampled (1 is the most thorough, and larger values skip more pixels in exchange for speed).

from colorthief import ColorThief
rgb_palette = ColorThief(img_path).get_palette(color_count=6, quality=10)
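To give a feel for what palette extraction involves, here is a minimal pure-Python sketch of median cut, one classic way to pull dominant colors out of a set of pixels. It is a simplified illustration under my own assumptions, not Color Thief’s actual implementation: the function names (channel_ranges, median_cut) and the toy pixel list are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def channel_ranges(box: List[RGB]) -> List[int]:
    """Return the max-min spread of the R, G and B channels within a box."""
    return [max(p[c] for p in box) - min(p[c] for p in box) for c in range(3)]


def median_cut(pixels: List[RGB], color_count: int) -> List[RGB]:
    """Split pixels into at most `color_count` boxes and return each box's mean color."""
    if not pixels:
        return []
    boxes = [sorted(pixels)]
    while len(boxes) < color_count:
        # Only boxes holding more than one distinct color can be split further.
        splittable = [b for b in boxes if max(channel_ranges(b)) > 0]
        if not splittable:
            break
        # Split the box with the widest spread, along its widest channel, at the median.
        box = max(splittable, key=lambda b: max(channel_ranges(b)))
        ranges = channel_ranges(box)
        channel = ranges.index(max(ranges))
        ordered = sorted(box, key=lambda p: p[channel])
        mid = len(ordered) // 2
        boxes.remove(box)
        boxes += [ordered[:mid], ordered[mid:]]
    # Report the most populated boxes first, each summarized by its mean color.
    boxes.sort(key=len, reverse=True)
    return [
        tuple(round(sum(p[c] for p in box) / len(box)) for c in range(3))
        for box in boxes
    ]


# Toy "image": five dark-blue pixels and five pale-blue pixels (made-up values).
pixels = [(10, 20, 30)] * 5 + [(200, 210, 220)] * 5
palette = median_cut(pixels, color_count=2)
print(palette)
```

A real image has far more pixels and far messier color distributions, which is why libraries add refinements such as pixel sampling and histogram bucketing on top of this basic idea.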

After grabbing the RGB values for each image, I wrote a function to display each original image alongside its dominant colors in a pie chart, converted to hex values:

import webcolors
from PIL import Image
display(Image.open(img_path, 'r'))  # display() is available by default in Jupyter/IPython
hex_palette = [webcolors.rgb_to_hex(rgb) for rgb in rgb_palette]
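For readers who want to see what the RGB-to-hex step amounts to, here is a small standard-library sketch. The helper below is a hypothetical stand-in written for illustration (it is not the webcolors source), and the palette values are made up; for in-range values it should yield the same kind of lowercase "#rrggbb" strings.

```python
from typing import List, Tuple


def rgb_to_hex(rgb: Tuple[int, int, int]) -> str:
    """Format an (r, g, b) tuple of 0-255 integers as a lowercase '#rrggbb' string."""
    if any(not 0 <= value <= 255 for value in rgb):
        raise ValueError(f"RGB values must be between 0 and 255, got {rgb}")
    r, g, b = rgb
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


# Made-up palette standing in for the output of get_palette().
rgb_palette: List[Tuple[int, int, int]] = [(58, 80, 120), (20, 24, 36), (255, 255, 255)]
hex_palette = [rgb_to_hex(rgb) for rgb in rgb_palette]
print(hex_palette)
```

Hex strings are convenient here because Matplotlib accepts them directly as colors, so the same values can label a pie chart and fill its wedges.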

Here are some examples using sample photos:

When applied to a denim swatch, this is the result:

Average RGB Values

After grabbing all of the RGB values from the eight survey images, I needed to map these colors to each survey response. To accomplish this, I constructed a Pandas DataFrame from the results .csv file and then used np.select() to map each shade to its corresponding set of RGB values.
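As a rough sketch of that mapping step, the snippet below uses np.select() to fill r, g, and b columns from a shade label. The shade names, RGB values, and column names are assumptions made up for illustration; the survey’s real eight shades are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical shade labels and RGB values (not the survey's actual swatches).
SHADE_RGB = {
    "light wash": (142, 170, 204),
    "medium wash": (72, 101, 146),
    "black": (35, 36, 40),
}

# Toy survey responses; "acid wash" is deliberately missing from SHADE_RGB.
df = pd.DataFrame({
    "state": ["MI", "NY", "MI", "CA"],
    "shade": ["light wash", "black", "medium wash", "acid wash"],
})

# One boolean condition per shade, in the same order as the dictionary.
conditions = [(df["shade"] == shade).to_numpy() for shade in SHADE_RGB]

for i, channel in enumerate(["r", "g", "b"]):
    # For this channel, the value to assign when each condition matches.
    choices = [rgb[i] for rgb in SHADE_RGB.values()]
    # Rows matching no known shade become NaN instead of a misleading 0.
    df[channel] = np.select(conditions, choices, default=np.nan)

print(df)
```

Using NaN as the default means unrecognized shades drop out of any later mean() rather than dragging the average toward black.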

The next task involved averaging all of the responses together to arrive at one color per State. I did this by grouping the data by State, and taking the mean of each RGB value to derive the average RGB per state. This average RGB tuple was then converted to its corresponding hex value and added to the DataFrame.

grouped = df.groupby('state')[['r', 'g', 'b']].mean()
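Here is a small end-to-end sketch of the averaging step on a toy DataFrame: group by state, take the mean of each channel, then round and format the result as a hex string. The column names and numbers are made up for illustration, and the inline formatting is just one way to do the conversion.

```python
import pandas as pd

# Toy per-response RGB values (made-up numbers).
df = pd.DataFrame({
    "state": ["MI", "MI", "NY"],
    "r": [100, 50, 30],
    "g": [120, 60, 40],
    "b": [200, 100, 50],
})

# Mean of each channel per state; note the double brackets to select several columns.
grouped = df.groupby("state")[["r", "g", "b"]].mean()

# Round each averaged channel to an integer and format the triple as '#rrggbb'.
grouped["hex"] = [
    "#{:02x}{:02x}{:02x}".format(*(int(round(value)) for value in row))
    for row in grouped[["r", "g", "b"]].itertuples(index=False)
]

print(grouped)
```

Averaging channel by channel is simple, though it tends to pull every state toward a muddy middle tone, since the mean of two distinct blues is rarely a shade anyone actually picked.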

Mapping

The last step in creating the new map was locating a shapefile of United States state borders. A shapefile is a GIS format that stores the geometry of geographic features (here, each state’s boundary) along with their attributes. Using a package called GeoPandas, I loaded the shapefile into a GeoDataFrame and merged it with my main DataFrame.

Finally, I used Matplotlib to plot the GeoDataFrame:

Updated map: Average Jeans Color by State, 2020
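The merge-and-plot step can be sketched roughly as follows. To keep the example self-contained, it builds three square placeholder "states" with Shapely instead of reading a real shapefile (in practice the GeoDataFrame would come from something like gpd.read_file() on the shapefile path). The state labels, hex values, and the gray fallback color are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import box

# Placeholder geometries standing in for real state borders.
states = gpd.GeoDataFrame(
    {
        "state": ["A", "B", "C"],
        "geometry": [box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    },
    crs="EPSG:4326",
)

# Made-up average colors; state "C" has no survey responses.
colors = pd.DataFrame({"state": ["A", "B"], "hex": ["#4b5a96", "#1e2832"]})

# A left merge keeps every state's geometry even when it has no color.
merged = states.merge(colors, on="state", how="left")
merged["hex"] = merged["hex"].fillna("#cccccc")  # gray for states with no data

fig, ax = plt.subplots(figsize=(6, 3))
merged.plot(color=merged["hex"], edgecolor="white", ax=ax)
ax.set_axis_off()
ax.set_title("Average jeans color by state (toy data)")
```

Passing the hex column as the color argument fills each polygon with its own color, which suits this map better than a colormap over a numeric column.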

Limitations

The purpose of this project was to learn about extracting dominant colors from images with Python more so than to derive any legitimate insights about jeans. I caution against drawing any conclusions from the mapping aspect of this project for the following reasons:

Sample Size

The results shown here represent 377 survey responses, or roughly 7 respondents per State on average, which is far too small a sample to support any relationship between color preference and location.

Survey Question 3 Responses: What State do you currently live in?

Sample Bias

The survey respondents do not represent an independent, random sample. My primary method of collecting responses was through my own social media accounts, meaning most respondents live in either Michigan or New York. In addition, most respondents fit a similar demographic to me in age, socioeconomic status, and education level. As a result, this sample was not truly representative of the US population, and cannot be used to generalize denim preferences.

Author Bias

A data science instructor once brought to my attention that even the “raw” data we use is biased in some capacity.

I experienced this first-hand through this exercise: I created the survey, meaning I selected the eight denim shades respondents had to choose from, based on my own subjective idea of what a representative sample of denim looks like. Had another colleague created the survey, the outcome of the project may have looked entirely different.

The “raw” data we find was ultimately produced by an entity that is biased in some capacity, as all humans are. Thus, it is important for us as data scientists to remember that even when our methodology is designed to eliminate as much bias as possible, each person’s perception of the world is subjective, and so is the data they create.

Want to explore the full project? Check out the GitHub repository here!