In 2019, Panorama scaled our Student Success platform from 250,000 students across 300 schools to 2 million students across 5,700 schools. One of the biggest challenges during that growth was optimizing how we store and interact with all that data!
Panorama Student Success is an event-sourced system. This means that all changes in our domain are modeled as a strictly ordered series of events. The current state of any piece of the domain is determined by projecting those events into a read model. Unlike a traditional ActiveRecord application, no changes can be made directly to the read model; changes are made only through events. This approach to modeling our domain gives us a tremendous amount of flexibility to change how we interpret events as our system grows and new features are added. To preserve that flexibility, we regularly verify that all events can still be re-projected without issues. We have far too many events to re-project them all into a read model each time a web request is made. Instead, we regenerate our read models every night and store them in PostgreSQL.
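The core idea can be sketched in a few lines of Ruby. This is a minimal illustration of folding an ordered event stream into a read model; the event types and fields here are hypothetical, not Panorama's actual schema.

```ruby
# Each event records one change in the domain, in strict order.
Event = Struct.new(:type, :student_id, :data)

# The read model is never mutated directly -- it is derived entirely
# by projecting (folding over) the event stream.
def project(events)
  events.each_with_object({}) do |event, read_model|
    student = (read_model[event.student_id] ||= { notes: [] })
    case event.type
    when :student_enrolled
      student[:name] = event.data[:name]
    when :note_added
      student[:notes] << event.data[:text]
    end
  end
end

events = [
  Event.new(:student_enrolled, 1, { name: "Ada" }),
  Event.new(:note_added, 1, { text: "Met with counselor" })
]
read_model = project(events)
# read_model[1] => { notes: ["Met with counselor"], name: "Ada" }
```

Because the read model is pure output, changing how an event is interpreted only requires changing the projection and re-running it over the stream.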
At Panorama we value being able to ship as quickly as possible so that we can learn as we go. Our engineering team brings a “just in time” mentality to all the work we produce, and scaling has been no different. At the end of 2018, we were storing each school’s read models in separate PostgreSQL tables. During a school’s regeneration each night, we created a new set of projection tables for that school, called behind tables, and populated them as events were projected. When all events had been projected, the behind tables became the new active read-side tables and we deleted the old active tables. When this was first implemented, we chose this multi-table model because we wanted one school’s regeneration to be able to fail without affecting all of the others, and this was the most straightforward approach. At any time we could have at most twice the number of tables per projection per school. At the end of 2018 we had about 30 read model projections, giving us between 9,000 and 10,000 total tables in our database at any given time. As we grew, this multi-table model meant that we’d have a lot of tables! In early 2019, as we were forecasting our student and school growth, we had a big question: what would happen to our database with 100,000+ tables?
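The table math above works out to a quick back-of-envelope calculation. The ~300-school figure for late 2018 is inferred from the 9,000–10,000 table count mentioned above, and the linear per-school scaling is our assumption:

```ruby
# Back-of-envelope table counts for the table-per-projection-per-school layout.
projections  = 30
schools_2018 = 300   # roughly consistent with the 9,000-10,000 tables we observed

steady_state = schools_2018 * projections  # => 9_000 (one active table per projection per school)
peak         = steady_state * 2            # => 18_000 (active plus "behind" tables mid-regeneration)

# Forecasted growth to 5,700 schools, same layout:
forecast_peak = 5_700 * projections * 2    # => 342_000 tables at peak
```

Even the conservative end of that forecast lands well past 100,000 tables, which is what prompted the question.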
When you run a query in PostgreSQL, the query planner determines the most efficient way to execute it. Creating indexes on your tables is one strategy for making queries along certain dimensions of the data more efficient. To choose a plan, the planner considers available indexes along with details of how the query is built, such as which tables participate in a join and any aggregate functions or filters in the query. The time it takes to return data from any query comprises both this planning time and the time to execute the query. After getting a better understanding of how the query planner works, we hypothesized that query planning time would grow as we added more tables, because there would be more for the query planner to consider. Since overall query time would increase, we expected this to slow down all of our web requests.
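You can see the two components separately with `EXPLAIN ANALYZE`. The table name and timings below are made up for illustration, not from our system:

```sql
EXPLAIN ANALYZE SELECT * FROM student_read_models WHERE school_id = 42;
-- ...plan nodes...
-- Planning Time: 0.210 ms
-- Execution Time: 1.480 ms
```

Our hypothesis, in these terms, was that `Planning Time` would creep up across every query as the table count grew.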
A core value for our engineering team is to take time to verify an approach or test a hypothesis. When we are able to, we like to have data to back up our reasons for implementing things a certain way. Since there is no explicit limit to the number of tables PostgreSQL can support, we ran an experiment to see if we could mimic a future state to get a better understanding of what might happen with so many tables. In our experiment we created a number of fake schools with similar data to our current schools in our production-like test environment. Our hypothesis was that request times would grow so we ran load tests to simulate high web load, expecting to see an increase in total request times.
Hypothesis: Slow Web Requests
During these tests we saw a slight increase in response times, but nothing outside of acceptable ranges. Surprisingly, with the addition of only 500 schools (about 25,000 total tables), PostgreSQL threw OutOfMemory exceptions, refused to process any more queries or accept any more connections, and actually failed over to our hot standby! OutOfMemory exceptions usually happen when PostgreSQL can’t allocate any more memory for its connections or its cache. To figure out what happened, our team started to do more research.
We figured out that PostgreSQL performs best when the data fits in RAM and that in our case, the database could have run out of space in its cache. The cache contains frequently accessed data as well as indexes, so that queries can be planned and served faster. The total cache size can be up to 80-90% of a database’s RAM and can vary based on workload. We also found that each connection to the database holds a copy of the database catalog in its memory, which includes data about every table in the database. The addition of 500 schools pushed our total database size past our database’s RAM. Could the OutOfMemory errors be due to the amount of data in our database?
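Some illustrative arithmetic shows why per-connection catalog memory worried us. The per-table cache cost below is an assumed figure chosen for the sketch, not something we measured:

```ruby
# If each backend caches catalog entries for the database's tables,
# per-connection memory overhead grows with the table count.
tables                = 25_000
per_table_cache_bytes = 10 * 1024   # assumption: ~10 KB of catalog cache per table
connections           = 150

per_connection_bytes = tables * per_table_cache_bytes
total_bytes          = per_connection_bytes * connections

per_connection_mb = per_connection_bytes / (1024.0 * 1024)  # ~244 MB per connection
total_gb          = total_bytes / (1024.0**3)               # ~35.8 GB across all backends
```

At that scale, connection overhead alone could crowd out the RAM the shared cache depends on, regardless of how much data the tables hold.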
Hypothesis: Too Much Data
We allocated more RAM to our database and attempted to replicate the failover we had experienced previously. In this next round of experiments, we saw excellent performance on the larger database with the addition of 500 schools. We then added 1,500 more schools (for a total of 475,000 tables), again pushing our total data size past the new RAM size. During a test with moderate load, response times were a bit slower but still within acceptable ranges, and our database had no performance issues. This showed us that total data size was not the source of our problem and suggested that the culprit was instead high load.
Hypothesis: High Load
We ran another test with high load on the larger database to verify this new hypothesis. After this test, our larger database failed over to the hot standby again, with OutOfMemory exceptions. We had narrowed it down: high load, rather than data size, seemed to be causing the problem!
Overwhelming the database with too many connections is known to cause OutOfMemory exceptions, and to support that high load we had added more web servers, increasing the number of connections to our database. However, we had reached only about 150 connections during these high-load tests, well under our database’s limit of 500. This ruled out the number of connections as the source of the issue. Complex queries are another common cause of OutOfMemory exceptions. Digging through our monitoring tools, we didn’t see unusual request queuing or long transaction traces that would indicate poorly written or extremely complex queries bringing down the database under high load. In our earlier research we had found that each connection to the database holds a copy of the database catalog in its memory, and that catalog contains information about every single table in the database. Could the number of tables in PostgreSQL have been the real problem?
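The connection numbers above come from standard PostgreSQL settings and views, along the lines of:

```sql
SHOW max_connections;                   -- configured limit (500 in our case)
SELECT count(*) FROM pg_stat_activity;  -- connections currently in use (~150 under load)
SELECT count(*) FROM pg_class;          -- relations (tables, indexes, ...) tracked in the catalog
```

The last query hints at the suspect: every backend carries catalog information about all of those relations.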
Hypothesis: Too Many Tables
Our team had a new hypothesis to test: we had effectively exhausted our database’s connection capacity. With this many tables, our database’s practical connection limit was much lower than the configured one, because of the amount of memory each connection required to hold the entire database catalog. We ran a final experiment to test this hypothesis. We kept the data size from our previous experiments with 1,500 additional schools, but this time we put all of the data into a single table per read model rather than a table per read model per school. Our total data size was still larger than total database RAM and we had the same number of database connections, but under our final high-load test we experienced no substantial performance issues and no OutOfMemory exceptions. Our application couldn’t handle that many tables because each connection to PostgreSQL holds a copy of the database catalog in its memory!
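A sketch of the consolidated layout we tested, with illustrative names and columns rather than our actual schema: one shared table per read model, with the school as a column instead of part of the table name.

```sql
CREATE TABLE student_read_models (
  school_id  bigint NOT NULL,
  student_id bigint NOT NULL,
  data       jsonb  NOT NULL,
  PRIMARY KEY (school_id, student_id)
);

-- Queries filter by school instead of targeting a per-school table:
SELECT data FROM student_read_models WHERE school_id = 42;
```

This keeps the catalog small (a few dozen relations instead of hundreds of thousands) no matter how many schools we add.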
Read about how we solved this problem in Part 2 (coming soon)!