How I archived eight years of my life with JavaScript — Part 1

Michael Smart
7 min read · Oct 29, 2017


Photo by Glen Noble on Unsplash

The plan was simple: create a tool that would allow me to remove Facebook. I have no real grudge against Facebook itself; with that said, I believe that ultimately the users, not the application, make humans unhappy. The developers at Facebook do excellent things for the open source community and have built some of my favorite projects and tools, so it’s essential for me to make the distinction that we make ourselves unhappy on Facebook. It isn’t the application’s fault.

With the full stack of JavaScript at my fingertips, I figured I had two approaches when it came to retrieving data from my account. It is important to note at this stage that, however witty, cutting-edge and brilliant my posts have been over these eight or so years, I only wanted to retrieve my photos.

Plan A: The Facebook Graph API

This approach would be the most legitimate path: Facebook’s official, well-documented API. I would register an application with Facebook, authenticate myself, then make requests against my profile to retrieve photos. Simple.

I don’t mind admitting I learned a valuable lesson going full steam ahead with this solution. I had done some preliminary testing with the Graph API Explorer to determine the feasibility of retrieving my photos using this method.

An example response when retrieving photos from a user’s account.
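The original screenshot isn’t reproduced here, but the request and the shape of the response looked roughly like the following. This is a trimmed, illustrative example based on the Graph API’s documented /me/photos edge and images field, not the exact payload I received:

```
GET /me/photos?type=uploaded&fields=images,created_time&limit=999

{
  "data": [
    {
      "images": [
        {
          "height": 2048,
          "width": 1536,
          "source": "https://scontent.xx.fbcdn.net/..."
        }
      ],
      "created_time": "2014-07-12T18:21:04+0000",
      "id": "..."
    }
  ],
  "paging": {
    "cursors": { "before": "...", "after": "..." }
  }
}
```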

Success! It would be easier than I had ever imagined: using Graph fields to extract additional information, the request would give me exactly what I needed. I could get a URL for each image associated with my profile, tagged or otherwise.

This blind enthusiasm was the first mistake. I had assumed that Facebook wouldn’t return all of the photos I needed in one request, so when the API returned only 369 photos on the first request, even with the limit parameter appended as limit=999, I thought nothing of it; I assumed I’d hit a paginated response and would read the docs, when the time came, on how to retrieve the rest. In retrospect, this was a huge and costly assumption. I was far too excited to notice: I had a response I could work with, so I jumped in feet first and started building the application.

The importance of prototyping

Looking back, this seems too obvious a point to need stating. As I learnt, however, it can be difficult to know where to draw the line with a prototype: we don’t want to invest too much time in unusable, non-production prototype code, whilst on the other hand we need to know that our application is technically possible before committing to write anything we would deem production ready.

In this case, I had stopped prototyping too early. From the preliminary testing, I had assumed the application was technically possible. It was not to be. The response I was receiving was, in fact, all of the photos I could retrieve using the API. As I later discovered, due to Facebook’s privacy policy, unless a user had explicitly set a photo’s permission to Everyone (which is not the default), I could not retrieve it. Alas, I didn’t discover this until I had committed hours to building out an application that was not technically possible. Onwards and upwards.

Plan B: Automation

With retrieving photos through the Graph API no longer an option, I moved on to the next solution: automation. The reason I set out on this journey in the first place was to save the time and effort of doing this myself. It would have been possible to download every photo I needed manually; I could just as well have spent a weekend going through each photograph, clicking the view full size link and saving the result.

Let’s break that down for a minute. For argument’s sake, we’ll assume it would take 30 seconds on average to download one photo manually, bearing in mind this doesn’t account for the time spent making coffee or stopping to play Overwatch during what would be a mind-numbingly dull task. At that rate, the full 2,341 photos work out to roughly 70,230 seconds, or 19.5 straight hours of downloading.

Given those numbers and the repetitive nature of the task, it was clear it should be automated. This approach would at least let me waste those 19.5 hours of my weekend on my own terms. My weapon of choice was Puppeteer.

Puppeteer is a Node library which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

Puppeteer would allow me to control a headless, or head-full for that matter, instance of the Chrome browser. Better yet, I could do this from the comfort of Node.js.
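As a taste of what that looks like, launching a browser and visiting a page takes only a few lines. This is a generic sketch rather than anything from my script; flipping the headless flag is all it takes to watch Chrome do its thing:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Set headless: false to launch a full, visible Chrome window instead.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com');
  console.log(await page.title()); // "Example Domain"

  await browser.close();
})();
```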

Separation of concerns

There were two concerns I considered when approaching this task. The first was getting, or scraping as it’s often called in this context, the view full size versions of my photos. The result would be a URL to a file sitting in a dusty corner of a Facebook CDN.

The second concern would be downloading these files and saving them to disk. These two things could well have existed as a single concern; our script could start downloading a file as soon as it had grabbed the URL. However, we’d run the risk of trying to do too much at once, an approach that inevitably leads to a lengthy debugging session without initially knowing which of our concerns is at fault.

With the concerns separated, one thing that needed consideration was the persistence of data, as stage one would involve accumulating 2,341 URLs that we would later download. I opted for MongoDB, a document-oriented database with a straightforward, well-documented API. I would grab each photo URL and store it in the database.
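For the persistence side, the shape of the code was roughly as follows. The connection string, database and collection names, and the document shape are illustrative rather than lifted from the repository:

```javascript
const { MongoClient } = require('mongodb');

// Store a photo's CDN URL alongside the page it was found on; keeping the
// page URL turns out to be handy later for resuming after a failure.
async function savePhoto(photoURL, pageURL) {
  const client = await MongoClient.connect('mongodb://localhost:27017');

  try {
    await client
      .db('facebook-archive')
      .collection('photos')
      .insertOne({ photoURL, pageURL, savedAt: new Date() });
  } finally {
    await client.close();
  }
}
```

In practice you’d open the connection once and reuse it across inserts; one self-contained function just keeps the sketch easy to read.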

Let’s see some code already!

Just before we dive into the code, one point worth mentioning: I have omitted all error handling from the examples for brevity. Check out the full repository for the unedited versions of the scripts I used.

Setup

In this first block, we require our dependencies and kick off the script. I’m wrapping the main execution in an Immediately Invoked Function Expression (IIFE); this was necessary so that I could leverage the async and await syntax, which should make reasoning about this predominantly asynchronous script easier.
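The gist embedded at this point in the original post isn’t reproduced here, but the setup looked roughly like this; the module names and option values are my best recollection of the shape, not the exact code:

```javascript
const puppeteer = require('puppeteer');
const { MongoClient } = require('mongodb');

// Async IIFE: wrap the run so await can be used throughout the script.
(async () => {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('facebook-archive'); // handed off to whatever persists the URLs

  // headless: false so I could keep an eye on what the browser was doing.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // ...log in, then start walking the photo collection (see below)...

  await browser.close();
  await client.close();
})();
```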

Logging in

We’re leveraging Puppeteer’s incredibly easy-to-follow API here to do a couple of things. Firstly, we use our newly created session to open the first URL. Our target destination for starting to grab photo URLs is the first photo in the collection.

To get to the first photo in the collection with the least effort, I navigated to it in a real session: I manually logged in, clicked on Photos, opened the first in the list, then grabbed the URL from the browser. Now all I needed to do was pass this URL to Puppeteer’s Page.goto(...) method.

The neat thing here is that in an unauthenticated Puppeteer session, Facebook takes us to the login page first, then redirects to the URL of the first photo once we’ve successfully authenticated.
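A sketch of that login step, assuming Facebook’s login form fields are identified by #email, #pass and #loginbutton (those selectors, and pulling credentials from environment variables, are my assumptions here):

```javascript
// The URL of the first photo, grabbed manually from a real browser session.
const FIRST_PHOTO_URL = process.env.FIRST_PHOTO_URL;

async function logIn(page) {
  // An unauthenticated session gets bounced to the login page first...
  await page.goto(FIRST_PHOTO_URL);

  // ...where we fill in the form. Selectors are assumptions, not verified.
  await page.type('#email', process.env.FB_EMAIL);
  await page.type('#pass', process.env.FB_PASSWORD);

  // Submit, then wait for the redirect back to the first photo.
  await Promise.all([
    page.waitForNavigation(),
    page.click('#loginbutton'),
  ]);
}
```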

Photo URLs

I’ve omitted a lot of code from this block to make it as concise as possible. The full version contains error handling, logging and a little trick used to get the date of each photo, all left out here in an attempt to keep you awake during this article.

Thanks to Puppeteer’s API, this function should be relatively easy to follow (a sketch of it follows the list below). We are automating the manual process of:

  1. Clicking the view full size link on a photo.
  2. Saving the photo URL to the database.
  3. Hitting the back button in the browser.
  4. Clicking the next arrow in the UI.
  5. Repeating.
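Stripped of error handling, logging and the date trick, a sketch of that loop might look like the following. The selectors are placeholders, and savePhoto is the hypothetical persistence helper sketched earlier:

```javascript
// Placeholder selectors; the real ones came from inspecting Facebook's markup.
const FULL_SIZE_SELECTOR = 'a[href*="view_full_size"]'; // simplified, see "Caveats"
const NEXT_PHOTO_SELECTOR = 'a.next';                   // placeholder

async function getPhotoURL(page) {
  const pageURL = page.url();

  // 1. Click the view full size link; the redirect lands on the CDN file.
  await Promise.all([
    page.waitForNavigation(),
    page.click(FULL_SIZE_SELECTOR),
  ]);

  // 2. Save the CDN URL (and the page it came from) to the database.
  await savePhoto(page.url(), pageURL);

  // 3. Hit the back button to return to the photo page.
  await page.goBack();

  // 4. Click the next arrow in the UI...
  await Promise.all([
    page.waitForNavigation(),
    page.click(NEXT_PHOTO_SELECTOR),
  ]);

  // 5. ...and repeat for the next photo.
  return getPhotoURL(page);
}
```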

Caveats

With the above snippets, you’d think we’d be safe to kick the process off and put our feet up for the next couple of hours while it recursively grabbed each photo, right? Wrong. There were some fun caveats to working through eight years of application UI.

href

The first issue I encountered was finding the view full size link within the Puppeteer session. There was some inconsistency with the href value of this element. The majority of these linked to a PHP script with a URL containing something along the lines of view_full_size/some/other/path.php that would, in turn, redirect to the CDN URL (the location of our photo).

With this in mind, the first approach was to find these elements with the selector a[href*="view_full_size"], which was effective about 90% of the time. However, I had to adjust this to find each anchor tag on the page and check its innerHTML instead, along the lines of the sketch below.
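The adjusted lookup ended up along these lines; the visible link text is an assumption based on Facebook’s UI at the time:

```javascript
// Inside getPhotoURL: find the view full size link by checking each anchor's
// innerHTML, since the href-based selector only matched around 90% of the time.
const fullSizeHref = await page.$$eval('a', (anchors) => {
  const link = anchors.find((a) => a.innerHTML.includes('View Full Size'));
  return link ? link.href : null;
});
```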

Next photo

Another interesting issue I couldn’t explain was the absence of a next photo link. Around ten photos in my full collection consistently lacked a link to proceed to the next item. This issue warranted the most substantial manual intervention in the process.

To handle these errors, I stored the page URL in the database as well as the photo URL. Storing this information meant I could determine which photo was missing the next photo link. I would then manually, sigh, find that photo in my collection and pass the page URL of the next item into getPhotoURL, which would resume the script from that photo. I hit this exception a handful of times, and each occurrence took about 15 seconds to rectify.
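Because the page URL was stored alongside each photo URL, resuming was just a case of pointing the script at the next photo by hand, something like this (inside the async IIFE from the setup sketch; passing the URL on the command line is my own convention here, not the repository’s):

```javascript
// Resume the crawl from a manually located photo page when the previous
// photo had no next link; the page URL is passed in on the command line.
const resumeFrom = process.argv[2];

if (resumeFrom) {
  await page.goto(resumeFrom);
}

await getPhotoURL(page);
```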

Stay tuned

In the next post in this series, I’ll break down the second concern: saving each photo file to disk. For the unedited files, check out the full repository.
