Which Countries Do Unsplash Images Come From?
The #5 camera brand is the iPhone 🤳🤯
The Unsplash Dataset is created by over 200,000 contributing photographers and billions of searches across thousands of applications, uses, and contexts.
But which countries do they come from, and which brands are used the most? Let's find out.
Dataset
I went with the Lite version. You can find both it and the full version here.
First Look
I like to keep the data I work with neatly organized, so I will put it on my Google Drive, unzip it, and read it from there.
from google.colab import drive; drive.mount('/content/gdrive')
!unzip "/content/gdrive/My Drive/Google_Colab_Datasets/unsplash-lite-dataset-25k-nature.zip"
import pandas as pd
df = pd.read_csv('photos.tsv000', sep='\t', encoding='utf-8', engine='c')
df.columns
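A quick shape check confirms the size before we dig in:
df.shape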
A treasure trove with 26 columns and 25k rows! Let's take a quick look at where the photos come from and which cameras are being used.
Top Upload Locations And Cameras
df['exif_camera_make'].value_counts().nlargest(10)
df['photo_location_country'].value_counts().nlargest(10)
The numbers don't quite add up, and some country names appear twice in different languages (e.g. Deutschland = Germany). That said, the data set is surprisingly clean. 🧹
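Those duplicates can be collapsed with an exact-value mapping before counting. A minimal sketch (the dictionary entries are illustrative; extend it with whatever variants show up in your own counts):
country_map = {'Deutschland': 'Germany', 'España': 'Spain'} #illustrative entries, extend as needed
df['photo_location_country'] = df['photo_location_country'].replace(country_map)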
Scatter Plot
Let's start by visualizing the upload locations and the most-used cameras (top 7).
df.brand.value_counts().nlargest(7)
Mostly Canon and Nikon.
A lot of the brand names need to be cleaned:
brand_map = {'go': 'gopro', 'fujifilm': 'fuji', 'iphone': 'apple',
             'nikon corporation': 'nikon'} #and so forth; mapping whole values avoids
#the substring pitfall ('go' -> 'gopro' would also corrupt 'google')
df['brand'] = df['brand'].str.lower().str.strip().str.replace(',', '').replace(brand_map)
#note: I renamed the columns prior with
#df = df.rename(columns={"exif_camera_make":"brand", "other_column":"other_new_name"...})
Next, I will install and import a few packages, some just for convenience:
!pip install geopandas
!pip install pyunpack
import pandas as pd
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import matplotlib.colors as mlpc
import geopandas as gpd
import random
from datetime import datetime
from pyunpack import Archive
Time to plot 🚀
import matplotlib, seaborn as sns, matplotlib.pyplot as plt
from matplotlib import rcParams
df = df.loc[df['brand'].isin(['canon','nikon','sony','fuji','dji','apple','panasonic'])] #top-7 brands only
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce') #coordinates arrive as strings
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')
new_style = {'grid': False}
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (17.5, 17)
rcParams['figure.dpi'] = 250
sns.lmplot(x='longitude', y='latitude', data=df, fit_reg=False, hue='brand', height=13)
The plot resembles a world map, pretty cool. To kick it up a notch, we will have to do some more tidying up first.
Data Cleaning Reloaded
Here is a quick overview of the data cleaning/organizing one-liners I used.
df.column = df.column
.str.replace('old', 'new')
.groupby('date')['brand'].transform('count') #or
.groupby(["column1", "column2"]).size().reset_index(name="Count")
.sort_values('Count')
.drop_duplicates('column', keep='last')
.astype('datetime64[ns]') #to date format, for ordering
.dt.strftime('%b%Y') #2013-06 -> Jun2013, codes: strftime.org
.replace(np.nan, 'None', regex=True) #NaN -> string
.dropna(subset=["col1"], inplace=True) #drop NaN rows
.drop_duplicates()
.T #transpose rows and columns
.str.split(",", expand=True) #For,Example -> For|Example
.isna().sum() & .info() #find missing values
.to_csv(r'df.csv')
.sort_values(by=['col1','col2']) #sort values
shapefile.merge(df) #for later plots
I fix values, organize the set, and finally merge it with a shapefile on a common column.
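To make the chaining concrete, here is a minimal, self-contained sketch on toy data (the values are made up; only the column names mirror the Unsplash set):
import pandas as pd
import numpy as np
toy = pd.DataFrame({'country': ['Germany', 'USA', 'USA', np.nan],
                    'brand': ['nikkon', 'canon', 'canon', 'sony']})
toy['country'] = toy['country'].replace(np.nan, 'None', regex=True) #NaN -> string
toy['brand'] = toy['brand'].str.replace('nikkon', 'nikon') #fix typos
counts = toy.groupby(['country', 'brand']).size().reset_index(name='Count').sort_values('Count')
print(counts)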
Tip: Stuck on the specific syntax for a line? Try live chats: seeking help in a data science chat can turn hours of tweaking into a one-minute solution. Sometimes you even discover entirely new approaches 🛠️
For some sites and use cases, we need our data in a specific format. There are many neat low- and no-code sites for making visualizations like the racing bar chart you saw at the top. To give one example, the easiest route is usually to pivot tables/columns and aggregate the data into the shape such tools expect, as sketched below.
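Here is that reshaping as a sketch, assuming the Lite set's photo_submitted_at timestamp column (the exact format each tool expects will vary):
df['month'] = pd.to_datetime(df['photo_submitted_at'], errors='coerce').dt.strftime('%b%Y')
wide = df.groupby(['month', 'brand']).size().unstack(fill_value=0) #months as rows, one column per brand
wide.to_csv('racing_bar_input.csv')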
One of the reasons I was working on this data, aside from curiosity, was a profile-matching data science competition I discovered, for which I needed unique insights. But let's start at the beginning: renaming, e.g., 'Russian Federation' to 'Russia'.
df = df.rename(columns={'photo_location_country': 'country', 'photo_location_latitude': 'latitude', 'photo_location_longitude': 'longitude', 'exif_camera_make': 'brand', 'exif_camera_model': 'model'})
shapefile = shapefile.rename(columns={'COUNTRY': 'country'})
shapefile["country"].replace({"Russian Federation": "Russia"}, inplace=True) #there are more but those were the most crucial
merged = shapefile.merge(df)
merged.head(1)
The 'brand' column needs a lot more cleaning. We will also merge the data set with a shapefile so we can create some great maps.
There are many brand names that need to be fixed (many uploads have typos such as Nikkon -> Nikon). To be able to see all rows while checking them, we need to remove pandas' display limiter, as shown below.
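The limiter here presumably being pandas' display truncation; the standard options lift it:
pd.set_option('display.max_rows', None) #print every row instead of a truncated preview
pd.set_option('display.max_columns', None) #and every column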
Side-Note: after merging the DF with a shapefile for later plots, I put the result into a new variable called 'merged'.
#same whole-value mapping as before
merged['brand'] = merged['brand'].str.lower().str.strip().str.replace(',', '').replace(brand_map)
#count uploads per (country, brand), then keep each country's most common brand
merged = merged.groupby(["country", "brand"]).size().reset_index(name="Counter")
merged = merged.sort_values('Counter').drop_duplicates('country', keep='last')
merged = shapefile.merge(merged)
Neat! Now it’s finally time for our next plot.
Choropleth
This will visualize not only where most uploads come from but also how many! In this case, it's easier to spread the code over several lines.
vmin, vmax = 0, 2100
fig, ax = plt.subplots(1, figsize=(25, 14.4))
#empty ScalarMappable: the usual trick for attaching a manual colorbar to a GeoPandas plot
sm = plt.cm.ScalarMappable(cmap='Greens', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []
cbar = fig.colorbar(sm, ax=ax)
merged.plot(column='Counter', cmap='Greens', linewidth=0, ax=ax, edgecolor='0', vmin=vmin, vmax=vmax) #color by upload count
ax.axis('off')
Thanks to our preparation earlier, we can reuse the same merged frame to show the dominant brand per country.
fig, ax = plt.subplots(1, figsize=(22.5, 11.6))
merged.plot(column='brand', legend=True, ax=ax); ax.axis('off') #categorical column, so GeoPandas draws a legend instead of a colorbar
plt.savefig('df.png', transparent=True) #saves a transparent pic
Conclusion
The vast majority of photos come from the US, UK, and Zimbabwe (mostly gorgeous nature shots) and are shot with Canon and Nikon cameras, with surprisingly few uploads from iPhones, GoPros, and the like. Many wildlife photos from Africa (which barely shows up on our map) may be uploaded from other countries, possibly via VPN. Also, South Korea has a thing for Fuji cams, while Japan joined the Canon wave. 🤔
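If you want to double-check these takeaways against your own run, the merged frame from the choropleth section already holds each country's dominant brand and upload count:
merged.sort_values('Counter', ascending=False)[['country', 'brand', 'Counter']].head(10)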
If you want to take a shot at the massive full data set, you can request it here.