Stabilizing & Improving Crunchyroll Service

Michael Dale
Published in Ellation Tech
Apr 1, 2017

As a Crunchyroll user you may have noticed some degradation of service on weekends as our popular simulcasts are released. We have been tracking this issue closely and employing short-, medium-, and long-term strategies to ensure Crunchyroll’s long-term success and continued growth.

What’s the problem?

Crunchyroll recently celebrated its 10th anniversary, and some of the codebase’s design decisions are almost as old. At the same time, the Crunchyroll audience has continued to grow, surpassing 1 million subscribers. The Crunchyroll platform is a “monolithic application” running within a traditional data center environment, which means the entire application has to be scaled together. Various services, web views, analytics, authentication, and API endpoints all share the same databases, so a constraint in one service can slow down other services. In contrast, most modern large-scale web services employ a microservices architecture that can be scaled and analyzed on a per-service level.

On the API itself, the Crunchyroll platform intermixes public resources, like episode metadata, with private resources, like the playhead position of the user requesting that catalog data. This makes traditional resource caching on a CDN (outside of video and static assets) difficult with the existing Crunchyroll apps and clients.
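
To make the distinction concrete, here is a minimal sketch (in Python with Flask, using hypothetical endpoint names rather than the actual Crunchyroll API) of how public catalog data can be split from private, per-user data so that a CDN can safely cache the former:

```python
# Hypothetical illustration, not the actual Crunchyroll API: splitting a mixed
# response into a publicly cacheable resource and a private, per-user resource.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/catalog/episodes/<episode_id>")
def episode_metadata(episode_id):
    # Public: identical for every user, so a CDN can cache it.
    resp = jsonify({"episode_id": episode_id, "title": "placeholder", "duration_sec": 0})
    resp.headers["Cache-Control"] = "public, max-age=300"
    return resp

@app.route("/users/<user_id>/playhead/<episode_id>")
def playhead(user_id, episode_id):
    # Private: unique to the requesting user, so it must bypass the shared cache.
    resp = jsonify({"episode_id": episode_id, "playhead_sec": 0})
    resp.headers["Cache-Control"] = "private, no-store"
    return resp
```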

Sample graph of DB connection load across multiple DBs

One problem we have seen during these degradations is a high number of DB threads/connections during peak load. Because of the monolithic structure of the system, where a single request to the application can access many databases, it’s not trivial to trace which “service” or endpoint is causing the runaway DB connections. During these spikes the servers are still delivering API responses for the majority of users, but naturally we aim for no degradation of the service.
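
For illustration, one quick way to see which schemas and users are holding connections during a spike is to group MySQL’s process list. A minimal sketch, assuming the mysql-connector-python package and a hypothetical monitoring account:

```python
# Rough sketch: summarize open MySQL connections by schema and user.
# Host and credentials are placeholders, not production values.
import mysql.connector

conn = mysql.connector.connect(host="db-host", user="monitor", password="...")
cur = conn.cursor()
cur.execute(
    """
    SELECT db, user, COUNT(*) AS connections,
           SUM(command != 'Sleep') AS active
    FROM information_schema.PROCESSLIST
    GROUP BY db, user
    ORDER BY connections DESC
    """
)
for db, user, connections, active in cur.fetchall():
    print(f"{db or '(none)'}/{user}: {connections} connections, {active} active")
cur.close()
conn.close()
```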

How are we fixing it?

We are pursuing multiple strategies to resolve the issue, including:

  • Moving the Crunchyroll service from the datacenter into AWS, where we can better manage scaling with virtualized resources.
  • Breaking the monolith into microservices, making use of services we have built on our software-defined AWS infrastructure that already powers our newer products such as VRV.
  • Implementing mitigations to ensure users can still access video when the services are degraded.
  • Auditing and updating our Crunchyroll clients to reduce superfluous API calls, better cache their API responses, and reduce database load on the server (see the caching sketch after this list).
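
As a rough illustration of that last point, a client can hold recently fetched API responses for a short TTL instead of re-requesting them on every screen. A minimal sketch with a hypothetical helper, not actual Crunchyroll client code:

```python
# Simplified client-side response cache keyed by URL with a freshness TTL.
import time
import requests

_cache = {}  # url -> (expires_at, parsed_json)

def get_cached(url, ttl_seconds=300):
    now = time.time()
    hit = _cache.get(url)
    if hit and hit[0] > now:
        return hit[1]                      # still fresh: skip the network call
    data = requests.get(url, timeout=10).json()
    _cache[url] = (now + ttl_seconds, data)
    return data
```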

One example of moving to AWS microservices is the shift towards cloud video delivery. Previously, our video infrastructure ran out of the datacenter and required opening DB connections in order to serve the list of video segments hosted on the CDN. Today we serve the video stream list from our cloud services, which are not affected by a Crunchyroll degradation.
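
The key property is that the stream list can be built without touching the application databases. A simplified sketch (hypothetical URL layout and handler, not the production service):

```python
# Stateless stream-list handler: the segment manifest is published to the CDN
# at encode time, so no database lookup is required to answer the request.
from flask import Flask, jsonify
import requests

app = Flask(__name__)
CDN_BASE = "https://cdn.example.com/media"  # hypothetical CDN base URL

@app.route("/streams/<media_id>")
def stream_list(media_id):
    manifest = requests.get(f"{CDN_BASE}/{media_id}/manifest.json", timeout=5).json()
    segments = [f"{CDN_BASE}/{media_id}/{name}" for name in manifest["segments"]]
    return jsonify({"media_id": media_id, "segments": segments})
```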

Another project is transitioning to Redis for our session store. Instead of storing sessions in the MySQL database, we would store them in a shared in-memory cache. This should move a very common DB call to a simpler key-value store and be faster than a MySQL transaction.
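
A minimal sketch of what that could look like with the redis-py client (the key naming and TTL are illustrative assumptions, not the production configuration):

```python
# Session reads and writes against Redis instead of a MySQL table.
import json
import redis

r = redis.Redis(host="session-cache.example.internal", port=6379)
SESSION_TTL = 60 * 60 * 24  # illustrative 24-hour expiry

def save_session(session_id, data):
    # SETEX writes the value and its expiry in a single round trip.
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```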

There are a few other focused projects in this space which we will detail going forward.

There are far too many people to thank in this company-wide effort to stabilize and improve the Crunchyroll service. The teams involved go beyond the engineering org: the content operations team has worked around the clock to push cloud video delivery before all the automated publishing was in place, and the dedicated efforts of the operations team in minimizing negative impact during these peak usage times must also be noted.
