How you cannot replace SQL filtering for Python list filtering.

I was working on a dashboard when I walked into this problem where I had to summarize data in a database for many different fields. I kept running a similar query — where I was selecting and counting fields — over and over again when suddenly I had this not-so-amazing idea: Why not get all the data first and then filter it, that would definitely clean up my code. At first, it might seem like a good idea, because you don’t have to send multiple requests to a database for one API endpoint. However, normally the database is stored on the same server as the backend — which is definitely the case in my situation — making considering saving requests because of latency unnecessary. My realization of my stupidity regarding this idea didn’t stop me from testing it out anyway.

Specifically, I wanted to test the difference in speed between filtering objects with a MySql query and using the Python list filter function. To test this, I added 2000 lines to a table in the database. I Also added an endpoint to my Django backend and added two serializers, one which uses different MySql queries to select data and another one which filters the objects in Python. Each serializer had 5 method fields, basically summarising data multiple times, each method field adding more data to process to the test. The response of the endpoint was the time it took to perform the filtering.

Then I added a test to Postman request which adds the time it took — the performance indicator — to a list. I added a postman collection runner, performing the test 1000 times for 1 to 5 method fields for both SQL and list filtering. I exported the data as JSON, wrote a script that converted the right data to a CSV file and collected it all in an Excel sheet. There I did some smart maths to standardize the data, trim the highest and lowest 20% of the values and summarized the processed data. And there it was, the shocking truth:

The difference in speed between the two. Each horizontal dot adds another method field.

Let’s start off by explaining the SQL Filtering method. A database is made to filter data and is very fast in doing so. I sent a query similar to the following to the database: SELECT COUNT(*) FROM TABLE WHERE TYPE='TYPE';. This query returns the count of all rows where TYPE='TYPE', which would always return 2000 because it was true for all rows. Doing this multiple times would linearly add about 2.3ms for each serializer field. Which makes sense because you’re basically repeating the same step over and over again.

Now, for the Python filtering method. It’s really not useful because it requires some really inefficient steps. First I executed a query on initialization: SELECT * FROM TABLE;. Then I would store that data in in a variable. This process took about 80ms, mostly caused by writing the data to memory. Then I would — repeatedly for each serializer field — filter through the data. The data however, contains so much overhead that it’s not at all efficient to loop through. The length of the filtered data could be determined with the following pseudo-code: len(self.objects.filter(type='type')), which is similar to a more Pythonic method: len([x for item in self.objects if item.type == 'type']).

Okay… Thats interesting, but how would that perform with a much bigger database?

Different datasets for SQL and Python filtering. Green 2000 records, orange 10000 records. (Not real sizes for SQL).

For the Python filter method, we can say that the performance indicator changes linearly with the amount of data that is getting processed. 4 times slower for a table size with 10.000 records. But only about 1.5 times slower for the SQL query. Imagine an even bigger database, with maybe a million records?

In conclusion, please don’t filter huge amounts of data in Python when trying to make a dashboard. But instead use a database, because in the end that is what it’s made for. And I haven’t even taken into account memory usage, which would probably blow up. Maybe this use-case doesn't make sense, but it’s good to know the difference in performance between the different methods. Anyway, thanks for reading.