Scaling Ancestry.com: Providing Ethnicity Inheritance insights to 22M+ customers

Akash Ramachandran
Ancestry Product & Technology
5 min read · Aug 29, 2022

Ancestry recently launched Ethnicity Inheritance, a feature that predicts which regions each customer inherited from each of their parents. Our team, Genomic Algorithms, was responsible for running this algorithm at scale to make these predictions for all 22+ million customers in our database, the largest consumer genomic database in the world!

As explained in this blog, the algorithm is powered by SideView™, a technology developed by Ancestry that identifies which parts of a customer’s DNA come from each parent, without requiring that their parents be tested by Ancestry. As that blog explains, this is achieved by aligning the DNA of all of the customer’s genetic matches against their own DNA.

The Ethnicity Inheritance view

The Challenge: Horizontal scaling of a file caching system

Prior to processing any customer data through this algorithm, our system was required to:

  1. download the DNA data of the customer’s genetic matches from AWS S3 and
  2. selectively retrieve only the data corresponding to the “match segments”, i.e. the portions of the customer’s DNA that they share with each match.
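
As a rough illustration of step 2, the sketch below assumes the genotype data can be loaded as per-position arrays keyed by chromosome and that match segments arrive as (chromosome, start, end) coordinates. The data layout and function name are hypothetical, not Ancestry’s actual formats.

```python
# Hypothetical sketch of step 2: keep only the positions that fall inside the
# match segments. The data layout here is an assumption for illustration.
from typing import Dict, List, Tuple

# A match segment: (chromosome, start position, end position)
Segment = Tuple[str, int, int]

def extract_match_segments(
    genotypes: Dict[str, List[int]],   # chromosome -> per-position genotype calls
    segments: List[Segment],
) -> Dict[Segment, List[int]]:
    """Return only the slices of DNA data that overlap the match segments."""
    extracted = {}
    for chrom, start, end in segments:
        calls = genotypes.get(chrom, [])
        # Slice out just the matching region instead of keeping the whole chromosome.
        extracted[(chrom, start, end)] = calls[start:end]
    return extracted
```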

Each customer has tens of thousands of matches that were used by this algorithm. This meant that a person’s DNA data needed to be downloaded from AWS S3 thousands of times, once for every customer they matched, and this posed a serious challenge for us!

Solutions

First, to minimize redundant AWS S3 downloads and AWS KMS decrypts, we decided to architect and build a supporting service that would efficiently perform the two steps above and cache those data files on disk.
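
A minimal sketch of such an on-disk cache is shown below, assuming the DNA files are SSE-KMS-encrypted objects in S3 (so every avoided download is also an avoided KMS decrypt). The bucket name, key layout, and cache directory are placeholders, not our production values.

```python
# Minimal sketch of an on-disk cache in front of S3. Bucket name, key layout,
# and cache path are hypothetical. SSE-KMS decryption happens as part of the
# S3 GET, so a cache hit skips both the download and the decrypt.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "dna-data-bucket"        # placeholder
CACHE_DIR = "/mnt/cache"          # local instance storage

def get_dna_file(sample_id: str) -> str:
    """Return a local path to the sample's DNA file, downloading it only once."""
    local_path = os.path.join(CACHE_DIR, f"{sample_id}.bin")
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Cache miss: one S3 GET (and one KMS decrypt) for this sample on this host.
        s3.download_file(BUCKET, f"samples/{sample_id}.bin", local_path)
    return local_path
```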

Second, to optimize DNA data reads from the cache, we architected a highly distributed, optimized compute layer for this service. Processing the DNA data of 22+ million customers in a short period of time required significant horizontal scaling of the compute instances backing this supporting service. This resulted in 10K t3.small AWS EC2 instances organized into 100 AWS Auto Scaling Groups (ASGs) of 100 instances each. For example, if a customer’s DNA matched with 10,000 other individuals, the caching system allowed us to download their data only 100 times from AWS S3 (once per ASG). This was far better than downloading each file 10,000 times, but not as good as downloading it just once per customer.

The backend system that predicts Ethnicity Inheritance for all customers

Further Learning

This scaling exercise led to the following takeaways:

First, due to the nature of the supporting service’s architecture, we could not afford to have these instances taken down for any reason during processing, so we chose on-demand instead of spot instances. The loss of any single instance in an ASG would render all other instances in that ASG (and their expensive caches) unusable.

For higher reliability and uptime, choose on-demand instances.
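
If you create your ASGs programmatically, one way to make that choice explicit is shown in the hedged boto3 sketch below; the group name, launch template, subnets, and sizes are placeholders rather than our actual configuration.

```python
# Hedged sketch: an ASG that uses only on-demand capacity, created with boto3.
# Group name, launch template, subnets, and sizes are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="dna-cache-asg-001",
    MinSize=100,
    MaxSize=100,
    DesiredCapacity=100,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # comma-separated subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "dna-cache-node",
                "Version": "$Latest",
            }
        },
        "InstancesDistribution": {
            # 100% on-demand: no Spot capacity that could be reclaimed mid-run.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 100,
        },
    },
)
```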

Second, we provided a list of subnets (corresponding to Availability Zones) in which these ASGs could spin up instances. We quickly maxed out the number of instances on each subnet and had to extend that list. Over time, we discovered that ASGs automatically try to rebalance instances across subnets should any one subnet get too crowded. By adding new subnets to ASGs that already had instances running, we inadvertently caused the ASGs to spin down some instances and spin up replacements in the newly added subnets.

The takeaway here is that if your application can’t afford to have instances taken down for any reason, do the math to check that your subnets have enough free IP addresses for all the instances you plan to run before spinning them up. Designing an application that is more resilient to instances being automatically spun down is another rabbit hole for another blog :)

Check if your subnets can support the number of instances you need if you can’t afford instances being automatically spun down.
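
One possible mitigation, sketched below under the assumption that you are willing to suspend the ASG’s built-in AZRebalance process for the duration of the run, is to verify free IP capacity per subnet up front and then stop the ASG from terminating instances just to rebalance across zones. The subnet IDs and ASG name are placeholders.

```python
# Hedged sketch: check free IPs per subnet and suspend AZRebalance so the ASG
# does not spin down cached instances to rebalance across Availability Zones.
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

SUBNET_IDS = ["subnet-aaaa1111", "subnet-bbbb2222"]   # placeholders
INSTANCES_NEEDED = 100

resp = ec2.describe_subnets(SubnetIds=SUBNET_IDS)
free_ips = sum(s["AvailableIpAddressCount"] for s in resp["Subnets"])
if free_ips < INSTANCES_NEEDED:
    raise RuntimeError(f"Only {free_ips} free IPs across subnets; need {INSTANCES_NEEDED}")

# Stop the ASG from terminating instances purely to rebalance across zones.
autoscaling.suspend_processes(
    AutoScalingGroupName="dna-cache-asg-001",
    ScalingProcesses=["AZRebalance"],
)
```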

Third, each of the 100 ASGs described above had to download the DNA data of all customers in our database and cache it on the disks of its AWS EC2 instances, purely for high availability. Downloading the same DNA data from S3 100 times, once per ASG, just to handle our desired throughput resulted in tens of billions of downloads of the same files! Our KMS decrypt and S3 request costs skyrocketed.

This is where Amazon FSx for Lustre comes in. It is a fully managed file system built on Lustre, the open-source file system widely used in High Performance Computing, and it can serve as a shared cache for our supporting service.

Consider Amazon FSx for Lustre for large-scale data transfers with sub-millisecond latencies.
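
As a hedged sketch of that direction, an FSx for Lustre file system can be created with an S3 bucket as its data repository, so objects are lazily loaded from S3 once and then shared by every instance that mounts the file system. The capacity, subnet, and bucket below are placeholders, not a tuned configuration.

```python
# Hedged sketch: an FSx for Lustre file system backed by an S3 bucket, so the
# cache is shared across instances instead of duplicated per ASG.
# Capacity, subnet ID, and bucket are placeholders.
import boto3

fsx = boto3.client("fsx")

fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                  # GiB; smallest SCRATCH_2 increment
    SubnetIds=["subnet-aaaa1111"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        # Objects under this prefix are lazily loaded into Lustre on first read.
        "ImportPath": "s3://dna-data-bucket/samples/",
    },
)
```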

Lastly, AWS S3 has a backend system that tries to identify sets of keys that can be partitioned. Each such prefix supports 3,500 PUT/COPY/POST/DELETE requests and 5,500 GET/HEAD requests per second. These partitions are created:

  1. to sustain performance when AWS S3 identifies parts of the keyspace that have a high request rate for extended periods of time, and
  2. to reduce the time for lookups within a partition if there are a large number of keys within it.

We designed the key structure of our AWS S3 buckets to follow the best practices mentioned here. While AWS S3’s partitioning usually happens many times a day and is invisible to the user, we still observed a large number of 503 Slow Down errors because of our very high PUT rate.

S3’s behind-the-scenes partitioning system can cause 5xx errors for extended periods of time.
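
One common way to follow those best practices (a sketch of the general technique, not necessarily our exact key scheme) is to prepend a short hash shard to each key so that writes spread across many prefixes, each of which S3 can partition independently.

```python
# Sketch of hash-prefixed S3 keys so heavy write traffic spreads across many
# prefixes (and therefore many potential partitions). The key layout is an
# illustration, not our production scheme.
import hashlib

def partitioned_key(sample_id: str, num_shards: int = 256) -> str:
    shard = int(hashlib.md5(sample_id.encode()).hexdigest(), 16) % num_shards
    return f"{shard:02x}/samples/{sample_id}.bin"

# Produces something like "3f/samples/A123456.bin" (shard varies with the hash).
print(partitioned_key("A123456"))
```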

During this migration, we experienced this throttling even with the prescribed exponential backoffs and without exceeding the per-prefix GET/PUT request limits, due to the sheer number of files being dropped into S3. It is possible to ask AWS to pre-heat S3 buckets by setting up partitions before starting such a process, to minimize these errors.
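
For completeness, the exponential backoff mentioned above can be configured directly on the client; a minimal boto3 sketch follows, with the retry mode and attempt count chosen as illustrative values rather than tuned settings.

```python
# Minimal sketch: let the SDK retry 503 Slow Down responses with backoff.
# Retry mode and attempt count are illustrative, not tuned values.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)
s3.put_object(Bucket="dna-data-bucket", Key="3f/samples/A123456.bin", Body=b"...")
```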

Future blogs will track how this system is made more fault tolerant and more cost- and time-efficient. If you’ve faced similar problems and had other solutions in mind, feel free to share them in the comments section below!

If you’re interested in joining Ancestry, we’re hiring! Feel free to check out our careers page for more info.
