
How YipitData slashed $2.5 million, or 50%, off our AWS bill

Steve Pulec
YipitData Engineering
5 min read · Apr 13, 2020


As organizations look to cut costs, we thought it might be helpful to share how we cut 50%, or $2.5mm, off our AWS bill with no impact on our business.

For context, I’m the CTO of YipitData. We analyze alternative data for 200 of the world’s largest hedge funds and corporations. We do web scraping and data analysis at a massive scale (each month we make billions of requests collecting data from hundreds of websites).

Last year, we conducted a review of our AWS costs. To our surprise, a few simple changes produced over $2.5 million in annual savings — over a 50% reduction!


Below are some of the most significant changes we made; we hope some of them will work for your organization.

By the way, if you make millions of web requests a month, you may be interested in using our infrastructure, instead of incurring all of the costs of managing it yourself. Just email corporate@yipitdata.com or me at steve[at]yipitdata[dot]com.

Use spot EC2 instances rather than on-demand instances

  • Savings: $0.0255/hour to $0.0091/hour, saved 64% ($372k per year).
  • Why? Much of our compute runs scheduled jobs or otherwise rarely needs 100% uptime. While spot instances may be interrupted at any time, we plan for this behavior and harvest significant savings. We’d started doing this years ago, but in 2019 we took a hard look at what really needed continuous uptime and moved everything else to spot (a minimal example of a spot launch follows).
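We won’t reproduce our real launch templates here, but as a minimal sketch (not our exact setup) of what opting into spot looks like with boto3, where the AMI ID, instance type, and max price are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")

    # Launch one spot instance instead of an on-demand one.
    # The AMI ID, instance type, and max price are placeholders.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "MaxPrice": "0.05",  # never pay more than this per instance-hour
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(response["Instances"][0]["InstanceId"])

The important part is designing workloads so an interruption just means a job gets retried elsewhere, not lost.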

Move it all to Ohio (US-East-2)

  • Savings: $0.0091/hour to $0.005/hour for spot EC2, saved 45% ($144k per year).
  • Why? While many default to the more established US-East-1 region (Virginia), US-East-2 (Ohio) has much lower spot EC2 prices. We moved most of our infrastructure there so we could take advantage of low spot prices without incurring large cross-region data transfer costs (a quick way to check the price gap yourself follows).
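You can check the gap yourself with EC2’s spot price history API. A minimal boto3 sketch, assuming a single illustrative instance type (m5.large is a placeholder, not necessarily what we run):

    from datetime import datetime, timezone

    import boto3

    def latest_spot_price(region, instance_type="m5.large"):
        """Return the lowest current Linux spot price across the region's AZs."""
        ec2 = boto3.client("ec2", region_name=region)
        history = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.now(timezone.utc),  # "now" returns only current prices
        )
        return min(float(p["SpotPrice"]) for p in history["SpotPriceHistory"])

    for region in ("us-east-1", "us-east-2"):
        print(region, latest_spot_price(region))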

Use public EC2 instances instead of private instances

  • Savings: $0.045/GB to free, saved 100% ($122k per year).
  • Why? Scraping the web means making a huge number of requests from our servers to the public Internet. Since we don’t transfer sensitive information in these requests and are okay with the security tradeoffs, using public machines eliminated those per-GB data transfer charges entirely (a quick back-of-envelope check follows).
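The $0.045/GB figure lines up with the NAT gateway data-processing rate that instances in private subnets typically pay to reach the Internet, so the savings scale directly with traffic volume. A back-of-envelope check using only the numbers quoted above:

    # How much scraped traffic does $122k/year at $0.045/GB imply?
    annual_savings_usd = 122_000
    per_gb_charge_usd = 0.045  # the quoted per-GB rate

    gb_per_year = annual_savings_usd / per_gb_charge_usd
    print(f"~{gb_per_year / 1_000_000:.1f} PB of scraped traffic per year")
    # -> roughly 2.7 PB per year, i.e. a couple hundred TB per month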

Minimize database expenses by decoupling querying from storage

  • Savings: We went from spending over $1.2mm per year on databases to less than $500k, saving ~60%.
  • Why? In 2019 we finalized our migration away from AWS Redshift to storing our data in (much cheaper) AWS S3 and using services like Athena or Databricks for queries and transformations. We now pay for compute only when our team is querying data, not while data sits idle in an always-on database (a minimal sketch of this pattern follows).
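The shape of the pattern is simple: the data sits in S3 (columnar formats like Parquet keep scans cheap), and a query engine only runs when someone asks a question. A minimal sketch using Athena through boto3, where the database, table, and results bucket are placeholders and an external table is assumed to already be defined over the S3 data:

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-2")

    # Kick off an ad hoc query against Parquet files in S3; you pay per TB scanned.
    execution = athena.start_query_execution(
        QueryString="SELECT domain, COUNT(*) AS pages FROM scraped_pages GROUP BY domain",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes; no cluster is running in the meantime.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[:5])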

Minimize ElastiCache Redis in favor of DRedis and SQS

  • Savings: Our blended Redis/DRedis/SQS costs dropped from $0.068/hour to $0.035/hour, saving 50% ($180k per year).
  • Why? An in-memory data store like Redis is great for very fast lookups and queuing, but we found we didn’t need that level of performance for most of our software. Migrating to SQS and DRedis (our open-source, disk-based Redis implementation) reduced our costs without sacrificing much else (a minimal SQS sketch follows).
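Plain SQS covered most of the queueing we had been doing in Redis: it is pay-per-request and fully managed, so nothing sits idle between jobs. A minimal sketch of the producer/consumer pattern, where the queue name and payload are placeholders:

    import json

    import boto3

    sqs = boto3.resource("sqs", region_name="us-east-2")
    queue = sqs.get_queue_by_name(QueueName="scrape-jobs")

    # Producer: enqueue one unit of work.
    queue.send_message(MessageBody=json.dumps({"url": "https://example.com/page/1"}))

    # Consumer: long-poll for work, process it, then delete it from the queue.
    for message in queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20):
        job = json.loads(message.body)
        print("scraping", job["url"])
        message.delete()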

Not AWS, but just as important: Slashing proxy server costs

Anyone familiar with web scraping or crawling has run into hurdles collecting data from sites efficiently. At our scale, we’ve had to overcome a wide range of challenges, from rate limiting to JavaScript rendering to IP address reliability and beyond.

We’ve invested heavily in proxy pools and tooling that let us collect data from just about any publicly available website. Historically, we partnered with 10+ proxy providers around the world to leverage thousands of IP addresses of varying quality. Over the last year, we’ve simplified our infrastructure so we can easily deploy the most cost-effective proxies for the data we need, which has brought our blended proxy costs down dramatically.


By partnering with several proxy providers, we’re able to match the quality of the proxies to what’s actually needed to scrape a particular site:

  • For most websites, we can send HTTP requests at scale through datacenter proxies
  • Other sites require headless browsing via datacenter proxies, which is still more cost-effective than HTTP requests through residential proxies
  • For the remaining sites, we escalate through tiers of increasingly high-quality proxies as needed (a rough sketch of this escalation follows the list)
  • All our data access tools integrate seamlessly, so we can easily switch between proxies, user agents, and tools without refactoring our scraping code
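Here’s a rough sketch of what that escalation can look like with the requests library; the proxy URLs and tier ordering are illustrative placeholders rather than our actual configuration:

    import requests

    # Cheapest tier first, pricier fallbacks later.
    PROXY_TIERS = [
        {"https": "http://datacenter-proxy.example.com:8080"},
        {"https": "http://residential-proxy.example.com:8080"},
    ]

    def fetch(url, timeout=15):
        """Try each proxy tier in cost order until one returns a usable response."""
        for proxies in PROXY_TIERS:
            try:
                response = requests.get(url, proxies=proxies, timeout=timeout)
                if response.status_code == 200:
                    return response
            except requests.RequestException:
                pass  # blocked or timed out; escalate to the next tier
        raise RuntimeError(f"All proxy tiers failed for {url}")

    page = fetch("https://example.com")
    print(len(page.text))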

Savings: Without sacrificing our ability to collect data, we’ve reduced our annual proxy costs from $860k to $510k, saving 40%.


Any questions?

We hope this post helps you think through ways to optimize your infrastructure costs! Having recently opened up our web scraping infrastructure to other companies, we understand how challenging it can be to solve “data access” problems cost-effectively.

If you have any questions about any of the above, email me at steve[at]yipitdata[dot]com. If you’re interested in using our infrastructure directly, email us at corporate@yipitdata.com and we’ll be in touch.
