Moving Meetup to the cloud
The Meetup Engineering team has been re-platforming our infrastructure, core, mobile, and web platforms over the past 12 months so we can improve our product and launch new features faster for our members. We have been moving parts of our infrastructure to the cloud throughout the past year and the final big move will happen this weekend. As a result:
Meetup will be unavailable on March 5 starting at 12:00am EST (GMT -05:00).
We thought a lot about how to avoid downtime for our members, but every technical solution we considered came with too great a risk of data loss. We’re going to be unavailable so we can stop updates to the database (from the website, apps, and various producer daemons and crons) and let the consumers run until all our queues are drained. We’ll then shut down all consumers and the services in AWS, stop replication, promote our AWS data store to master, and start everything up again. We anticipate this taking about 3 hours if all goes according to plan, but the time needed to drain the queues is the biggest unknown and may make it take longer. You can read the details in our playbook.
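The order of those steps matters more than any single step. As a rough illustration only, the sequence could be sketched as below. The function names, the `queue_depth` callable, the polling interval, and the timeout are all hypothetical and are not taken from the playbook; each step is represented by a placeholder string rather than a real operation.

```python
import time
from typing import Callable, List


def wait_for_drain(queue_depth: Callable[[], int],
                   timeout_s: float,
                   poll_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the queue is empty; return False if the timeout is hit first."""
    waited = 0.0
    while queue_depth() > 0:
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
    return True


def cutover(queue_depth: Callable[[], int],
            timeout_s: float,
            sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Return the steps performed, in order (placeholder strings only)."""
    steps = ["stop producers (website, apps, daemons, crons)"]
    if not wait_for_drain(queue_depth, timeout_s, sleep=sleep):
        # Draining has the least predictable duration, so stop here rather
        # than promote a data store that may be missing queued updates.
        steps.append("abort: queues not drained")
        return steps
    steps += [
        "shut down consumers and services",
        "stop replication",
        "promote AWS data store to master",
        "start everything up again",
    ]
    return steps
```

The point of the sketch is the gate in the middle: nothing after the drain runs unless the queues are confirmed empty.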
Once we’re live in AWS, Meetup will be 100% in the cloud. Over the past year we completed three projects: moving our current infrastructure to AWS, launching new infrastructure on GCP, and moving the rest of our data processing pipeline to GCP. Below, I’ll go into highlights of each.
We’re excited to announce that we’re completing our project to migrate our current monolithic web app, API, and related services and infrastructure from our bare metal data centers to AWS. Our AWS infrastructure is built entirely as a set of CloudFormation templates, which lets us provision the entire platform programmatically. The architecture uses Docker on ECS and other managed services wherever possible, so our engineers can focus more on building Meetup and less on operations. Finally, Meetup is deployed for high availability across multiple Availability Zones.
Some highlights of the AWS infrastructure and managed services include:
* Jobs processing: We created a separate jobs ECS cluster that runs email processing, data updates, reminders, and more.
* Web and API serving: Our serving layer consists of an application ECS cluster that runs our web servers, API servers, and admin tools.
* Application services: We created a services ECS cluster that provides member social graph and interest services, as well as contextual search based on Sphinx.
* Cloud services: We’ve implemented a set of cloud services, including AWS Lambda and API Gateway for photo management and resizing, DynamoDB for asset tracking, SNS for alerts and notifications, S3 for photo and asset storage, and Route53 for DNS.
* DMZ & CDN: We run separate Application Load Balancers, an option in the Elastic Load Balancing service, to automatically distribute application traffic across EC2 instances. We also use a modern CDN, which is our first line of defense against DDoS.
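To give a flavor of what building infrastructure as templates means, here is a minimal sketch that generates a CloudFormation template declaring three ECS clusters, one per workload in the list above. The cluster names and the helper function are invented for illustration; the real templates are much larger and are not reproduced here.

```python
import json


def ecs_cluster_template(cluster_names):
    """Build a CloudFormation template (as a dict) with one ECS cluster per name."""
    resources = {}
    for name in cluster_names:
        # Logical IDs must be alphanumeric, so "jobs" becomes "JobsCluster".
        logical_id = name.capitalize() + "Cluster"
        resources[logical_id] = {
            "Type": "AWS::ECS::Cluster",
            "Properties": {"ClusterName": name},
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative only: one ECS cluster per workload",
        "Resources": resources,
    }


# Hypothetical names for the jobs, web/API, and services clusters.
template = ecs_cluster_template(["jobs", "application", "services"])
print(json.dumps(template, indent=2))
```

Because the template is plain data, it can be reviewed, versioned, and re-applied like any other code, which is what makes provisioning repeatable.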
Throughout 2016 we launched a new engineering infrastructure, core services, and web application on Google Cloud Platform. This infrastructure uses a service-oriented architecture and Kubernetes container clusters managed by Google Container Engine. This form of clustering forces our applications to be ephemeral and stateless, which makes horizontal scaling easy and requires little infrastructure orchestration.
We chose this approach because we believe a service-oriented architecture will help us scale our platform and engineering team as we grow. Also, offloading management and uptime of the Kubernetes clusters to Google frees our engineers to focus on building a better Meetup product and iterating faster toward our vision of MEME (a Meetup Everywhere about Most Everything).
The Engineering Effectiveness cluster contains various services that empower engineers to develop and operate, including:
* Selenium Grid for E2E testing
* Jenkins for triggering and running CI gatekeeper builds
* Docker Registry
* Nexus repository for component management and health
* Airflow for scheduling jobs and data pipeline workflows
* JIRA integration services
* Launch monitor tooling
The production cluster houses our new core platform, services, and web application. Users reach it through an autoscaling Google Load Balancer. Outside of our clusters, we’ve chosen to use managed solutions offered by Google, along with a partially managed solution for SQL data stores.
Data processing pipeline
In December 2016 we completed migrating our data processing pipeline to GCP and AWS. The pipeline collects application logs and database extracts, cleans them, runs machine learning and processing jobs (all in GCP), and then moves the results to our data warehouse and application database so that we can serve recommendations and view analyzed data through Looker. A future enhancement and cost reduction we have planned is moving our entire data pipeline infrastructure to GCP.
The above is achieved using a combination of services from AWS and GCP, including the following:
* Amazon S3 (~22TB) for object storage
* Amazon Redshift (~9TB) as our data warehouse
* Google Cloud Storage (~25TB) for object storage
* Google Cloud Dataproc (1 cluster with ~30 jobs) for running machine learning and processing Spark jobs in a managed Spark/Hadoop service
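The flow described in this section (collect, clean, process, load) can be pictured as a chain of stages. The following is only a toy sketch in plain Python: the record fields, the stage functions, and the "most frequent topic" logic are made up for illustration, and the real jobs run as Spark jobs on Dataproc rather than in-process like this.

```python
from collections import Counter
from typing import Dict, Iterable, List


def clean(records: Iterable[Dict]) -> List[Dict]:
    """Drop records missing the fields later stages need."""
    return [r for r in records if r.get("member_id") and r.get("topic")]


def process(records: List[Dict]) -> Dict[str, str]:
    """Stand-in for the ML step: pick each member's most frequent topic."""
    per_member: Dict[str, Counter] = {}
    for r in records:
        per_member.setdefault(r["member_id"], Counter())[r["topic"]] += 1
    return {m: c.most_common(1)[0][0] for m, c in per_member.items()}


def load(results: Dict[str, str], warehouse: Dict[str, str]) -> None:
    """Stand-in for writing results to the warehouse and application database."""
    warehouse.update(results)


def run_pipeline(raw: Iterable[Dict], warehouse: Dict[str, str]) -> None:
    """Run the stages in order: clean, then process, then load."""
    load(process(clean(raw)), warehouse)
```

Keeping each stage a separate function mirrors the way the stages are scheduled and rerun independently in a workflow tool.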