Adventures in analysing my Instagrams
Firstly a confession: I’m a fairly heavy Instagram user, and I take a lot of pictures of my dogs, food, flowers, sunsets and all the other trite things that ‘Instagrammers’ take pictures of. One of the things I hate about Instagram is that it’s really hard to search and browse your own images. You always end up needing a 3rd party app and the paging is annoying at best and at worst non-existent. I wanted a way to explore all my IGs, ideally offline on my machine so that it would act as a back up. But there’s no ‘back up’ or ‘download my data’ option like, say, Twitter or even Flickr have. Also, most of the 3rd party tools to back up your IG account only grab the images and miss the titles and hashtags. Now you can go the developer API route and build a custom application to do it, but as I already had all the images backed up via a great 3rd party app (Photodesk, which is also good for showing the details, comments etc), I decided to find a way I could use Qlik Sense to explore the images. For those who don’t know what Qlik Sense is, it’s a data analysis and visualisation tool that can pretty much do anything with a little practice, patience and tinkering.
Between Kat and me we have approx 6500 (and counting) IGs. So first off I needed to turn 6500 images into tabular data. Which meant I simply needed to create a list of the files, and that’s as simple as Copy and Paste into a .csv file on the Mac. That gave me a way to list all the images and view them inside Qlik Sense. With a little more tinkering I could derive additional information from the IG filename. When you back up your images from Photodesk it stores them like so: 1317927813210600632_6614136.png, which is the Instagram media_id. That gives you the image number and the user id. Not much but a start, as it allows me to filter by who created the image. But what I really wanted was the meta data for each IG, at least when it was taken and the title. I managed to get a date for the image by extracting the creation date from the image with a simple Python script. Unfortunately this is a little misleading, as there is no timezone info and so no clear way to offset the server timestamp to the local time it was taken (still investigating that). But what this meant was that I had the beginnings of a real data table:
[Filename][image id][user id][date created]
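As a rough illustration of that first table, here is a minimal Python sketch. It assumes the backed-up files follow the `<image id>_<user id>.png` pattern shown above. The function names are hypothetical, and the file modification time is only a stand-in for the creation date, since the exact way the date was extracted isn't described here.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["filename", "image_id", "user_id", "date_created"]


def parse_filename(filename):
    """Split '<image id>_<user id>.<ext>' into its two ids (kept as strings)."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    image_id, _, user_id = stem.partition("_")
    if not (image_id.isdigit() and user_id.isdigit()):
        raise ValueError(f"unexpected filename: {filename!r}")
    return image_id, user_id


def build_rows(folder):
    """Return one dict per image file in `folder`, skipping anything else."""
    rows = []
    for name in sorted(os.listdir(folder)):
        try:
            image_id, user_id = parse_filename(name)
        except ValueError:
            continue  # not an Instagram backup file
        # Assumption: modification time stands in for the real creation date.
        mtime = os.path.getmtime(os.path.join(folder, name))
        created = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
        rows.append({
            "filename": name,
            "image_id": image_id,
            "user_id": user_id,
            "date_created": created,
        })
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file that a tool like Qlik Sense can load."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the ids as strings avoids any risk of a spreadsheet or loader rounding the very long image id.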
After this first dabble with Python, I decided to use it to analyse the colour in the images. I used ColorThief for Python, threw all 6500 images at it and created a couple more data tables mapping the dominant colour and a 4 colour palette for each image. Then I mapped those colours to the closest named colours on the CSS3 colour list, as the individual hex values are too unique for analysis. In addition I pulled out which of the major colour hues the dominant colour fell into, e.g. red, yellow, cyan, green, blue, magenta and then black, white and grey.
[image id][Dominant colour][Dominant colour Hex][Dominant colour HLS][Closest named colour][Closest named colour Hex][major hue]
[image id][Pal 1 Hex][Pal 1 closest named colour][Pal 2 Hex][Pal 2 closest named colour][Pal 3 Hex][Pal 3 closest named colour][Pal 4 Hex][Pal 4 closest named colour]
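The mapping step can be sketched roughly like this. It is only an illustration, not the script I actually ran: the named colour table is a small hand-picked subset of the CSS3 list, "closest" is assumed to mean plain Euclidean distance in RGB, and the lightness and saturation thresholds for black, white and grey are guesses. The input is an (r, g, b) tuple such as the one ColorThief's `get_color()` returns.

```python
import colorsys

# A small subset of the CSS3 named colours; the full list is much longer.
CSS3_SUBSET = {
    "black": "#000000", "white": "#ffffff", "grey": "#808080",
    "dimgrey": "#696969", "silver": "#c0c0c0", "darkslategrey": "#2f4f4f",
    "red": "#ff0000", "yellow": "#ffff00", "green": "#008000",
    "cyan": "#00ffff", "blue": "#0000ff", "magenta": "#ff00ff",
    "saddlebrown": "#8b4513", "tan": "#d2b48c", "steelblue": "#4682b4",
}

# Six 60-degree hue sectors, with red centred on 0 degrees.
HUES = ["red", "yellow", "green", "cyan", "blue", "magenta"]


def hex_to_rgb(value):
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def closest_named_colour(rgb, names=CSS3_SUBSET):
    """Name of the table entry with the smallest squared RGB distance."""
    def distance(name):
        other = hex_to_rgb(names[name])
        return sum((a - b) ** 2 for a, b in zip(rgb, other))
    return min(names, key=distance)


def major_hue(rgb, dark=0.12, light=0.88, min_saturation=0.15):
    """Bucket a colour into black, white, grey or one of six hues.

    The three thresholds are assumptions chosen for illustration.
    """
    r, g, b = (channel / 255 for channel in rgb)
    hue, lightness, saturation = colorsys.rgb_to_hls(r, g, b)
    if lightness <= dark:
        return "black"
    if lightness >= light:
        return "white"
    if saturation < min_saturation:
        return "grey"
    sector = int(((hue * 360 + 30) % 360) // 60)
    return HUES[sector]
```

Swapping the subset for the full CSS3 list only means extending the dictionary; the functions stay the same.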
This all associated perfectly in Qlik Sense via the [image id] and was becoming a fun and fascinating way to explore and analyse my IGs.
Next up was to get the title and that meant getting it from Instagram. To get any useful information or indeed anything out of the Instagram web service you need the shortcode for the image (eg: BJKOF0KBnS4).
Which I didn’t have. At this point I gave up several times and almost signed up for the Instagram Developer API as it seemed impossible. But I eventually found a great post on reverse engineering the media_id (http://carrot.is/coding/instagram-ids#). This explained that the shortcode was a base64 encoding of the first part of the Instagram media_id. Simple, I thought: just run it through a standard encode routine in JS or Python and save the results. Unfortunately not. After trying dozens of online encoders only 1 gave me the correct result, but I didn’t know why! Eventually I realised I needed a custom encoding routine, as IG uses a specific mapping for the alphabet and most of the built-in encoders need a string, not the decimal number, to do the encoding. I found this base62 routine http://stackoverflow.com/questions/1119722/base-62-conversion, tweaked the map and bingo, I had the shortcodes.
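For anyone wanting to try this, here is a minimal sketch of the conversion rather than my exact script. It assumes the alphabet is A-Z, a-z, 0-9, then `-` and `_`, which makes 64 symbols, and it treats the first part of the media_id as one big integer. With that alphabet the example media_id from earlier encodes to the example shortcode above. The function names are my own.

```python
# Assumed alphabet: 64 symbols, in this exact order.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)
BASE = len(ALPHABET)


def media_id_to_shortcode(media_id):
    """Encode the numeric part of '<image id>_<user id>' as a shortcode."""
    number = int(str(media_id).split("_")[0])
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number > 0:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    # Digits come out least significant first, so reverse them.
    return "".join(reversed(chars))


def shortcode_to_image_id(shortcode):
    """Decode a shortcode back to the numeric image id."""
    number = 0
    for char in shortcode:
        number = number * BASE + ALPHABET.index(char)
    return number


print(media_id_to_shortcode("1317927813210600632_6614136"))  # BJKOF0KBnS4
```

The reason standard base64 helpers give different answers is that they encode the bytes of a string, whereas this treats the id as a number and repeatedly divides it by the alphabet size.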
This allowed me to recreate a link back to the Instagram page for each image (as above), as well as opening access to the images themselves at different sizes (t, m, l) via:
Which means I could in theory grab the images myself or indeed reference the media from their servers, if I didn’t want to store them. But most of all it got me access to the JSON for embedding an image using the API, without needing to register. Like so:
Not the nicest of things, but with a little more Python (by this time getting well and truly down and dirty with it), I could write a batch process to loop through all the shortcodes, calling the JSON and extracting just what I needed into another table (it took a couple of hours, but hey, at least it didn’t reject the number of requests I made).
[image id][Shortcode][Title][Photographer][User id]
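The batch loop looks roughly like the sketch below. It is deliberately network-free: `fetch_json` is a hypothetical function you pass in that takes a shortcode and returns the parsed JSON as a dict. The field names `title`, `author_name` and `author_id` are assumptions about what that JSON contains, so check them against a real response before relying on them.

```python
import time


def collect_metadata(shortcodes, fetch_json, delay_seconds=1.0):
    """Build one row per image from a {image_id: shortcode} mapping.

    `fetch_json` is a caller-supplied (hypothetical) function that returns
    a dict for a shortcode, or raises if the request fails.
    """
    rows = []
    for image_id, shortcode in shortcodes.items():
        try:
            payload = fetch_json(shortcode)
        except Exception:
            payload = {}  # keep the row, leave the fields blank
        rows.append({
            "image_id": image_id,
            "shortcode": shortcode,
            # Assumed field names, not confirmed against the service.
            "title": payload.get("title", ""),
            "photographer": payload.get("author_name", ""),
            "user_id": str(payload.get("author_id", "")),
        })
        # Pause between requests to stay polite to the server.
        time.sleep(delay_seconds)
    return rows
```

Passing the fetcher in as an argument also makes it easy to test the loop with canned data before pointing it at several thousand real requests.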
So finally I had all my images (held locally), and a bunch of meta data about them, also stored locally in CSV files for easy loading into Qlik Sense. Once in Qlik Sense I could pull out additional things like hashtags (from the Title), break down the date into periods etc, and make use of the colour codes.
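I did that last step inside Qlik Sense, but the same idea can be sketched in Python. The function names here are hypothetical, the hashtag pattern is a simple guess at what counts as a tag, and the season mapping assumes northern hemisphere meteorological seasons.

```python
import re
from datetime import datetime

HASHTAG = re.compile(r"#(\w+)")
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
# Assumption: northern hemisphere, seasons split by calendar month.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}


def extract_hashtags(title):
    """Return the hashtags in a title, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(title or "")]


def date_periods(timestamp):
    """Break an ISO timestamp string into the periods used for analysis."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "year": moment.year,
        "month": moment.month,
        "weekday": WEEKDAYS[moment.weekday()],
        "season": SEASONS[moment.month],
    }
```

Lower-casing the tags means `#Dogs` and `#dogs` count as the same hashtag, which may or may not be what you want.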
Unfortunately you can’t get comments, likes or location data this way; for that you need to use the authenticated API. Which, after this adventure, would have been the smarter approach, but not as much fun.
Anyway here are some of the things I found in the data.
A few numbers
- 6548 images (3057 for Kat, 3491 for me) — 6536 unique ids (so looks like a few double downloads)
- Earliest post 17th July 2011
- Avg. posts per day 4.83
- Avg. hashtags per post 6.67
I post a lot more black and white, and essentially ‘grey’ images than Kat (631 compared to 41!) when looking at the dominant colour by major hue. However Kat takes more magenta orientated images. Red and yellow are the top hues overall, with over half the dominant image colours falling in those ranges.
As I said, I used ColorThief to extract the dominant colour and 4 secondary colours from the images. This sometimes creates odd values when looking at the image, due to the way it analyses the pixels. Out of the CSS3 named colours the murky ones and the greys take precedence, with Darkslategrey well and truly leading the pack. But that feels a little misleading, especially when you look at the images. You really need to view the colours across all 5 colours in the palette to get a sense of the colour for a specific image.
I was first to Instagram and slowly got going. When Kat joined she got really into it, but over time we’ve ended up posting roughly the same number of images per day on average.
Unsurprisingly Saturday and Sunday are our more active days, and winter is less active than the other seasons overall. However we moved from Denmark to San Francisco in 2015, and when you separate out the two, things look a little different. In SF it’s Autumn that we were least active in, and Spring the most active (only used 2015, as only half way through 2016). Back in Denmark, winter was the least active (because it’s wet, cold and windy!), and Autumn the most.
Kat has some truly prolific days. The 6 days with the most posts are all Kat’s. The most she posted on one day was 37; for me it’s a paltry 17. And on days where we both took pictures she out-posts me.
Basically, I’m inept at this. My tagging is pretty much concentrated around a few key things. Kat is way more into it, using far more hashtags, far more often. Kat averages around 10 hashtags per post, with me coming in at around 3. I have 430 posts that aren’t hashtagged at all; with Kat it’s only 113.
The top hashtags we both use are about where we are and our dogs. But then there are a few which are very specific to each of us. Out of 6779 tags there are only 913 that we both use, 201 of which we have used the same number of times.
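I worked those overlap numbers out in Qlik Sense, but a comparison like that can be sketched in a few lines of Python. This assumes each person's hashtags have been flattened into one list with repeats, and that the total is the count of distinct tags across both of us; the function name is hypothetical.

```python
from collections import Counter


def compare_hashtags(tags_a, tags_b):
    """Compare two people's hashtag usage.

    Each argument is a flat list of tags, one entry per use.
    """
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    shared = set(counts_a) & set(counts_b)
    same_count = {tag for tag in shared if counts_a[tag] == counts_b[tag]}
    return {
        "distinct": len(set(counts_a) | set(counts_b)),
        "shared": len(shared),
        "same_count": len(same_count),
    }
```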
So that’s my first stab at exploring our Instagrams. One of the nicest things is to be able to explore them and then recall where it was, and what we were doing or why we were there.
Ping me or leave a comment if you want the Python scripts I used.