Record-time feature building: AI Time Machine™ (Part 1)

Ran Levy
MyHeritage Engineering

I am often asked by my friends and colleagues how MyHeritage managed to build the AI Time Machine™ in a week, at such high quality, performance, and scale.

So, after a long pause from posting, I decided to write it down, and I encourage you to travel in time with me by reading this post.

There are several factors that enabled us to release this feature so quickly: technology and development practices, mindset, and one of the best teams one could dream of.

In this post and its sequel, I will expand on each one, saving the most important factor, the team, for last. :)

I will describe briefly some of the major contributors in terms of technology and development practices:

  1. Cloud: our strong collaboration with AWS allowed us to scale out GPU-enabled machines across AZs (availability zones) and regions.
    Advanced cloud services such as SNS, SQS, Lambda, ECS, and others allowed us to focus on the business logic, leverage the cloud ecosystem, and recover from failures easily (via the integrated Dead Letter Queue capabilities).
    The ability to balance on-demand and spot instances, and to define the Auto Scaling Policy for the cluster, was key to controlling cloud cost (this is a topic worthy of its own post, but I will mention here that the slowness of AWS cost reporting updates was not in our favor).
  2. Continuous Deployment: one of the most critical factors for high velocity, especially when several teams work in parallel, is the ability to deliver code to production quickly and frequently. Our entire R&D team is used to delivering code to production dozens of times per day. A merge to master triggers a deployment pipeline that deploys the newly merged code to production within 20 minutes, assuming all levels of testing (unit, integration, and end-to-end) pass successfully.
  3. Microservices and Serverless architecture: the ability to split the work between many teams can also be attributed to the distribution that microservices and serverless architectures allow by nature. Such an architecture is built for scale and fault isolation, and enables high velocity and frequent production changes.
  4. Development environments: we have several development environments that support high velocity. It starts with the ability to code and test on the local machine with the micro-MyHeritage-in-a-box that every developer has. It continues with sandbox environments that allow sharing code and testing in a closed environment, and from there to a development environment that is very similar to production and allows testing complex setups like the one built for AI Time Machine™. Moreover, we have developed an easy (yet riskier) option to test on production, using a feature flags system that was built in-house (I’ll elaborate further below).
  5. Well-defined APIs: all services are consumed through a well-defined GraphQL (Graph Query Language) layer that allows well-structured, high-performing access to the services, with built-in documentation, rate limiting, and access control.
    The existence of such a layer allows rapid development of web UIs and mobile apps in a standard way, utilizing existing APIs that were essential building blocks of the feature.
  6. Feature Flags: many years ago, we built a state-of-the-art feature flag system. In short, this system allows controlling the exposure of a feature to a portion of our users, based on predefined rules (e.g., 10% of users from Canada who use the website in French). Moreover, we use feature flags to dynamically control the behavior of a feature, for instance, how many AI Time Machine™ models and themes we allow for free.
    All feature flag configurations have an easy-to-use UI; every change is reported to Slack and triggers scans by our production monitoring systems to ensure that the system behaves correctly after the change.
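To make the rule-based targeting in item 6 concrete, here is a minimal sketch of how such an evaluation could work. This is not MyHeritage's actual system; the `Rule` shape and function names are hypothetical. The key idea is deterministic bucketing: hashing the flag name together with the user ID keeps each user consistently inside or outside a percentage rollout across requests.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    # Hypothetical rule shape; field names are illustrative only.
    percentage: float                 # 0..100, share of matching users exposed
    country: Optional[str] = None     # None matches any country
    language: Optional[str] = None    # None matches any language

def bucket(flag_name: str, user_id: str) -> float:
    """Deterministically map (flag, user) into [0, 100], so a given user
    stays inside or outside a rollout across requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF * 100

def is_enabled(rule: Rule, flag_name: str, user_id: str,
               country: str, language: str) -> bool:
    # Attribute filters first, then the percentage rollout.
    if rule.country is not None and rule.country != country:
        return False
    if rule.language is not None and rule.language != language:
        return False
    return bucket(flag_name, user_id) < rule.percentage

# The example rule from the post: 10% of users from Canada browsing in French.
ten_pct_ca_fr = Rule(percentage=10, country="CA", language="fr")
```

Hashing per flag (rather than per user alone) also means different flags slice the user base independently, so one rollout does not always hit the same 10% of users.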
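The Dead Letter Queue recovery mentioned in item 1 is something SQS provides out of the box via a redrive policy. The in-memory sketch below only illustrates the pattern; the message shape, handler, and `MAX_RECEIVES` constant are assumptions, not MyHeritage code. A message that keeps failing is parked in a dead letter queue after a bounded number of attempts instead of blocking the main queue forever.

```python
from collections import deque

MAX_RECEIVES = 3  # analogous to maxReceiveCount in an SQS redrive policy

def process_queue(queue: deque, handler) -> deque:
    """Consume messages; failing messages are retried, and a message that
    fails MAX_RECEIVES times is moved to the returned dead letter queue."""
    dlq = deque()
    while queue:
        msg = queue.popleft()
        msg["receive_count"] = msg.get("receive_count", 0) + 1
        try:
            handler(msg["body"])
        except Exception:
            if msg["receive_count"] >= MAX_RECEIVES:
                dlq.append(msg)   # give up: park it for later inspection
            else:
                queue.append(msg) # re-deliver, like a visibility timeout expiry
    return dlq

def handler(body):
    # Illustrative handler: one "poison" message that always fails.
    if body == "poison":
        raise ValueError("cannot process")

dlq = process_queue(deque([{"body": "ok"}, {"body": "poison"}]), handler)
# the "poison" message ends up in the DLQ after 3 failed attempts
```

With real SQS, the broker does all of this for you; operators then inspect the DLQ and redrive messages back once the bug is fixed, which is what makes failure recovery easy.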
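Consuming a service through the GraphQL layer from item 5 follows the standard GraphQL-over-HTTP shape: a POST with a `query` string and a `variables` object. The sketch below is purely illustrative; the query, field names, and token handling are assumptions, not MyHeritage's actual schema.

```python
import json

# Hypothetical query; the schema and field names are illustrative only.
QUERY = """
query PhotoThemes($photoId: ID!) {
  photo(id: $photoId) {
    title
    themes { name era }
  }
}
"""

def build_graphql_request(photo_id: str, token: str):
    """Build the body and headers for a standard GraphQL POST.
    Documentation, rate limiting, and access control live in the
    GraphQL layer itself; the client only supplies a bearer token."""
    body = json.dumps({"query": QUERY, "variables": {"photoId": photo_id}})
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return body, headers

body, headers = build_graphql_request("photo-123", "secret-token")
```

Because every service speaks this one request shape, a web or mobile client only needs a query string per screen, which is part of what makes building on existing APIs so fast.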

I hope you find this first installment interesting! I will post Part 2 in the next few days.
