Scaling Ancestry.com: Billions and Billions of Images at Ancestry

Jeremy Noel Johnson
Ancestry Product & Technology
Mar 24, 2022

Introduction and Challenges:

Did you know that Ancestry has well over 1 billion user-uploaded images for family trees? Did you know that we have well over 5 billion images for a variety of historical records, including the 1940 census, World War II draft cards, and high school yearbooks?

Ancestry is adding hundreds of thousands of family trees and user-generated images per day, and millions of historical document images each month. All of these images present three primary challenges:

  • They must be stored safely and securely
  • They must be quick and easy to retrieve
  • They must be able to be manipulated and changed when needed

I joined Ancestry over 13 years ago. Ancestry has come a long way since then and is doing some incredible things, including handwriting recognition, colorizing black and white images, and releasing the 1950 census! I work as a software engineer on the Media Services team at Ancestry, and we’re responsible for making sure these three challenges stay solved and our solutions keep working at all times. These are large-scale challenges that need large-scale solutions.

Solutions:

Images being stored safely and securely

Imagine you have a laptop with a large number of family photos on its hard drive. All of a sudden the hard drive fails, and you can no longer access your family photos. This is a devastating scenario, and it cannot be allowed to happen with your images at Ancestry.

Storage Solution: We store your images in Amazon S3, a proven object storage service. S3, together with processes our team runs and several databases that hold information about the images, makes sure the images for your family tree are backed up in multiple locations so you never lose them.

  • We use AWS (Amazon Web Services) security protocols to make sure your images are secured and only you have access to them (unless you make them public)
  • We use Amazon Aurora (MySQL) to store specific information about each image
  • S3 stores data in multiple locations, and your images are backed up so you never lose them. We even store images in S3 “Glacier” as a backup of the backup, just in case
  • Our databases are also backed up regularly so that data is never lost
  • We have a process in place for users who want an image removed because it contains information about them

Images being retrieved quickly and easily

Imagine you’ve scanned a 1940s census document at a very high resolution. To be viewed clearly, the image has to be large, so it takes up a lot of space on your computer and takes quite a while to load. You want to see it right away, not wait 5 seconds or more for the image to come back.

So that you can view your images quickly and easily on Ancestry’s website, our team does more than just get the image from S3 and send it back to the page you are on. That is part of what we do, but another important part happens when you view a historical document.

Slicing Solution: Take this record of Franklin D. Roosevelt, the 32nd president of the United States. When you click on the image to view it, you go to Ancestry’s image viewer, the place to view all historical documents. The image of Franklin D. Roosevelt’s record in the 1940 census comes back and you see it. See if you can find his row in the census record.

What you may not realize is that, in order to get the image on your screen as fast as possible, our team does some processing the moment the link to view the image is clicked, even before the Ancestry image viewer page shows you the census record.

Our team’s code slices the large census image into small pieces. When you view the census record, you see several smaller images pieced together like a puzzle, and these come back faster than the entire large image would if it were retrieved all at once.
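
To make the slicing idea concrete, here is a minimal sketch in Python. It is not Ancestry’s actual code: the function names, the 256-pixel tile size, and the viewport logic are all assumptions chosen for illustration. It only computes where the slices would fall, and which of them a viewer would need for the part of the image currently on screen.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile_size: int = 256) -> List[Box]:
    """Return crop boxes that cover a width x height image, row by row.

    tile_size=256 is an assumed value for illustration only.
    """
    if width <= 0 or height <= 0 or tile_size <= 0:
        raise ValueError("width, height, and tile_size must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            # Tiles on the right and bottom edges may be smaller than tile_size.
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            boxes.append((left, top, right, bottom))
    return boxes


def tiles_for_viewport(boxes: List[Box], viewport: Box) -> List[Box]:
    """Return only the tiles that overlap the visible region."""
    v_left, v_top, v_right, v_bottom = viewport
    return [
        (left, top, right, bottom)
        for (left, top, right, bottom) in boxes
        if left < v_right and right > v_left and top < v_bottom and bottom > v_top
    ]


# A hypothetical 600 x 300 scan, with the viewer showing its top-left corner.
all_tiles = tile_boxes(600, 300)
visible_tiles = tiles_for_viewport(all_tiles, (0, 0, 300, 200))
```

In this sketch the viewer would request only the tiles in `visible_tiles` first, which is why a sliced image can appear on screen sooner than one large file.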

This is one of the many things, small and large, that our team does to help ensure you get your images and media safely, quickly, and efficiently at Ancestry. Now, go see if you can find the 1940 census record and image for the 33rd president of the United States, Harry S. Truman.

Caching Solution: Our team also stores these image slices, along with information about images, in an Amazon ElastiCache cluster. This means the images come from memory (RAM) instead of disk, and because memory retrieval is faster than disk retrieval, they come back faster.
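
The general pattern here, check memory first and fall back to slower storage only on a miss, can be sketched as follows. This is an illustration under stated assumptions, not our production code: a plain Python dictionary stands in for the cache cluster, `fake_storage` stands in for the slower storage read, and the class name and key format are hypothetical.

```python
from typing import Callable, Dict, List


class SliceCache:
    """Cache-aside lookup: serve from memory, load from storage on a miss."""

    def __init__(self, load_from_storage: Callable[[str], bytes]) -> None:
        self._load = load_from_storage
        self._memory: Dict[str, bytes] = {}  # stand-in for an in-memory cache
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._memory:
            self.hits += 1
            return self._memory[key]
        # Miss: pay the slower storage read once, then remember the result.
        self.misses += 1
        data = self._load(key)
        self._memory[key] = data
        return data


storage_calls: List[str] = []


def fake_storage(key: str) -> bytes:
    """Hypothetical slow read; records each call so the effect is visible."""
    storage_calls.append(key)
    return b"slice-bytes-for-" + key.encode()


cache = SliceCache(fake_storage)
first = cache.get("census-1940/tile-0-0")   # miss: goes to storage
second = cache.get("census-1940/tile-0-0")  # hit: served from memory
```

After the two calls, storage has been read only once. A real cache would also need an eviction or expiry policy, which this sketch leaves out.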

Thumbnail Solution: Our team also stores a few varieties of thumbnail images (smaller-sized images), keeping a “pre-canned” image for each of a few dimensions that are common across Ancestry. That way, when you need a thumbnail, you get a pre-scaled image instead of waiting for the image to be scaled at request time and then returned.
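
Here is a small sketch of how choosing a pre-scaled thumbnail might work. The preset sizes and function names are assumptions for illustration; the actual dimensions Ancestry stores are not listed in this article.

```python
from typing import Sequence, Tuple

# Hypothetical preset sizes (longest side, in pixels).
PRESET_SIDES = (75, 160, 320, 640)


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale dimensions down so the longer side is at most max_side.

    The aspect ratio is preserved, and images are never scaled up.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def pick_preset(requested_side: int, presets: Sequence[int] = PRESET_SIDES) -> int:
    """Pick the smallest stored preset that is at least as large as requested.

    Falls back to the largest preset when the request exceeds all of them.
    """
    for side in sorted(presets):
        if side >= requested_side:
            return side
    return max(presets)


# A page asks for a 100-pixel thumbnail of a hypothetical 4000 x 3000 scan.
preset = pick_preset(100)
thumb_size = fit_within(4000, 3000, preset)
```

The scaling in `fit_within` happens once, when the thumbnail is created and stored. At request time only the cheap `pick_preset` lookup is needed.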

Image Manipulation

Ancestry is also in the process of allowing black and white images to be colorized and damaged images to be enhanced so they are easier to view. My team is involved in this process, and I’m excited to see our progress over time.

My team is also involved in the many different image operations you can perform on an image when viewing a historical document in the image viewer.

Solution: To perform these image operations, my team has a code library called PicTools that makes these and other image operations easy for us. The most common operations are image rotation, image flipping, inverting colors, and specifying the quality of the image.

Image rotation is helpful if there is text or handwriting that is sideways and you want to be able to read it more easily.

You might invert colors if you find something hard to see or read. Changing the image to a darker tone with white text might make it easier to read.
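
To show what these operations do to the underlying pixels, here is a toy sketch. It is not PicTools and does not reflect its API; it works on a tiny grayscale image represented as rows of 0 to 255 values, and all names are made up for illustration.

```python
from typing import List

# Rows of 8-bit grayscale values: 0 is black, 255 is white.
Grid = List[List[int]]


def rotate_90_clockwise(pixels: Grid) -> Grid:
    """Rotate a quarter turn clockwise: the left column becomes the top row."""
    return [list(row) for row in zip(*pixels[::-1])]


def flip_horizontal(pixels: Grid) -> Grid:
    """Mirror the image left to right."""
    return [row[::-1] for row in pixels]


def invert(pixels: Grid) -> Grid:
    """Invert colors: dark pixels become light and light pixels become dark."""
    return [[255 - value for value in row] for row in pixels]


# A hypothetical 2 x 2 image.
page = [
    [0, 50],
    [100, 255],
]
rotated = rotate_90_clockwise(page)
mirrored = flip_horizontal(page)
inverted = invert(page)
```

Inverting a page of dark handwriting on a light background gives light handwriting on a dark background, which is the effect described above.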

Ancestry Is a Great Place to Work to Solve Hard Problems

One of the great things about working for Ancestry is the freedom and autonomy to dive into difficult challenges and solve them. My team was one of the first to migrate our systems to AWS a few years ago, and we had the flexibility to be creative in how we solved that problem.

Everyone has a voice, whether you are a manager or a junior software engineer who is just getting started. If you have a good idea, you will be heard, and there’s a chance your idea will get implemented.

Ancestry also holds annual coding hackathons and presentations for new product ideas, to see which employee ideas might make sense to put on Ancestry’s website. For instance, the “shaky leaf” hinting system was one such idea that someone had many years ago and that made it onto the website as an actual feature.

Where Can You Find Media?

We’ll end by showing where media at Ancestry can be viewed, and it can be viewed from a variety of locations. Here are a few of them:

Search results: Any time you search for records, if there is an image available, you can view it from the search results page. When you view this image, it is my team’s code that helps you get your image.

You can also view images for any person in your family tree by clicking on that family tree member.

The Gallery is a view of all the media and images you’ve uploaded for a person in your family tree. When you view this media or click on an individual image or media item (like a text document), it is my team’s code that helps you get those images and media.

Lastly, your profile image is stored and retrieved by our image service as well.

What images are you able to find on Ancestry? Share some in the comments below!

Please see our next article in this series: Scaling Ancestry.com: Putting our culture of innovation to the test

*** Join us on our scaling journey, check out the open engineering roles! ***
