A typical enterprise data-driven application has to do a lot of querying and reading from the database, and if you are making those queries from a Python app, SQLAlchemy is a great tool for the job. A good read on how to improve the performance of SQL code is given here. In this post, we will look at whether such processing is better left to the database or done in the application code, and try to find the best possible balance between the two. Along the way, we will also learn how to measure and weigh the options at our disposal.
Creating a database for the benchmarks
We will be running the measurements on a MySQL/MariaDB database. So, first of all, let's create a database `testing` for our testing purposes.
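A minimal sketch of how this can be done from Python with SQLAlchemy is shown below; the connection string, user and password are placeholders (and the PyMySQL driver is assumed), and you could just as well run the same CREATE DATABASE statement from the mysql client.

```python
from sqlalchemy import create_engine, text

# Placeholder credentials; point this at your own MySQL/MariaDB server.
engine = create_engine("mysql+pymysql://user:password@localhost")

with engine.connect() as conn:
    # Create the database that all the benchmarks will run against.
    conn.execute(text("CREATE DATABASE IF NOT EXISTS testing"))
```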
Next, we will create a transactions table with some dummy columns, which we will later populate with dummy data.
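Something along the following lines will do; the column names and types here are my assumptions for the dummy data, not a prescription.

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/testing")

# Assumed schema for the dummy transactions data.
create_table = text("""
    CREATE TABLE IF NOT EXISTS transactions (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(255),
        amount FLOAT,
        sender VARCHAR(255),
        receiver VARCHAR(255),
        created_at DATETIME
    )
""")

with engine.connect() as conn:
    conn.execute(create_table)
```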
This will create the table with the required columns. We will now fill it with some dummy data. To generate the dummy data we are going to use faker, which lets us produce fake records that are reasonably representative of real datasets.
The code used for this purpose is shown below. First, we import all the dependencies in the script.
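Roughly, the imports look like this (assuming the faker and PyMySQL packages are installed):

```python
import random

from faker import Faker
from sqlalchemy import create_engine, Column, DateTime, Float, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
```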
Then we can define the base class that will be used to interface with the database.
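A sketch of the base class and the mapped model is below; the class and attribute names simply mirror the schema assumed above.

```python
Base = declarative_base()


class Transaction(Base):
    """ORM mapping for the transactions table."""

    __tablename__ = "transactions"

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    amount = Column(Float)
    sender = Column(String(255))
    receiver = Column(String(255))
    created_at = Column(DateTime)


engine = create_engine("mysql+pymysql://user:password@localhost/testing")
Session = sessionmaker(bind=engine)
```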
And then we will use faker to create 50,000 rows of fake data.
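The generation loop could look roughly like this; the batch size and the two-year date range are arbitrary choices of mine.

```python
fake = Faker()
session = Session()

BATCH_SIZE = 1000
for _ in range(50000 // BATCH_SIZE):
    rows = [
        Transaction(
            name=fake.name(),
            amount=random.uniform(1, 1000),
            sender=fake.name(),
            receiver=fake.name(),
            created_at=fake.date_time_between(start_date="-2y", end_date="now"),
        )
        for _ in range(BATCH_SIZE)
    ]
    # Insert in batches so we do not hold all 50,000 objects in memory at once.
    session.bulk_save_objects(rows)
    session.commit()
```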
If you want the whole script, you can find it at the link here. Running this code will populate the transactions table with 50,000 rows.
Below is a small sample of the data, reduced for readability.
https://gist.githubusercontent.com/infinite-Joy/04dde9149aa82f4188a71c821dcaec0a
Since we will be focusing on the date column, let's have the database show us some sample datetimes as well.
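For example, a quick peek at a handful of the generated datetimes:

```python
session = Session()
# Print a few of the datetime values we will be filtering on.
for (created_at,) in session.query(Transaction.created_at).limit(5):
    print(created_at)
```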
Querying the database using SQLAlchemy
Now we will bring SQLAlchemy into the picture, query the DB, and filter it by date. The code to do this is below. The profiled function is taken from the SQLAlchemy documentation, with some modifications for Python 3.
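A sketch of that script is shown below. The profiled context manager follows the profiling recipe from the SQLAlchemy docs; the Transaction model and Session factory are the ones defined earlier, the cutoff date is an arbitrary value inside the generated range, and the function name transactions_filtered_in_db is just illustrative.

```python
import cProfile
import contextlib
import io
import pstats
from datetime import datetime


@contextlib.contextmanager
def profiled():
    # Adapted from the profiling recipe in the SQLAlchemy documentation.
    pr = cProfile.Profile()
    pr.enable()
    yield
    pr.disable()
    s = io.StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats("cumulative")
    ps.print_stats()
    print(s.getvalue())


def transactions_filtered_in_db(session, cutoff):
    # The WHERE clause is built by SQLAlchemy and executed inside MySQL,
    # so only the matching rows come back over the network.
    return session.query(Transaction).filter(Transaction.created_at > cutoff).all()


if __name__ == "__main__":
    with profiled():
        transactions_filtered_in_db(Session(), datetime(2017, 1, 1))
```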
This is basically passing the query to the database and letting the database do the filtering for us. Running the above code gives us the following result.
➜ python test_on_different_methods.py
8092 function calls (7813 primitive calls) in 0.078 seconds
So the query completes in 0.078 seconds. What if we try to do the same thing by querying everything and then filtering the result in the code? Common sense tells us that this should take longer, since databases are optimized for querying and filtering, and passing a lot of data over the network takes far more time than having the database do the work and return only the relevant rows. Let's take a look at the code needed to run this test.
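A sketch of this variant is below, reusing the profiled helper from before; again, the function name is just illustrative.

```python
from datetime import datetime


def transactions_filtered_in_code(session, cutoff):
    # Pull every row over the network, then filter in Python.
    rows = session.query(Transaction).all()
    return [row for row in rows if row.created_at > cutoff]


if __name__ == "__main__":
    with profiled():
        transactions_filtered_in_code(Session(), datetime(2017, 1, 1))
```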
Executing the above code will give us the following result.
➜ python test_on_different_methods.py
1624078 function calls (1623417 primitive calls) in 1.880 seconds
As you can see, this took a lot more time, as expected (1.880 seconds compared to 0.078 seconds).
So far this matches our expectations. But you might object that this was just a single call, so of course deferring the filtering to the database wins. What if we have to apply multiple filters to the same data? Surely it would then be better to make just one call, fetch all of the data, and do all the filtering in the code. Let's check that hypothesis with the help of the code below.
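The sketch below shows the shape of that comparison: one function issues a separate filtered query per cutoff date, while the other fetches everything once and applies every filter in Python. Apart from transactions_filtered_by_querying_multiple_calls, which is referenced later, the names, the cutoff dates and the timing harness are my own illustrative choices.

```python
import time
from datetime import datetime, timedelta

# One cutoff date per "call"; vary NUM_CALLS to test different numbers of calls.
NUM_CALLS = 100
CUTOFFS = [datetime(2017, 1, 1) + timedelta(days=i) for i in range(NUM_CALLS)]


def transactions_filtered_by_querying_multiple_calls(cutoffs):
    # One session (and hence one database connection) per filter value,
    # with MySQL doing the filtering every time.
    results = []
    for cutoff in cutoffs:
        session = Session()
        results.append(
            session.query(Transaction)
            .filter(Transaction.created_at > cutoff)
            .all()
        )
    return results


def transactions_filtered_in_code_single_call(cutoffs):
    # Fetch everything once over a single connection, then apply every
    # filter in Python.
    rows = Session().query(Transaction).all()
    return [[r for r in rows if r.created_at > cutoff] for cutoff in cutoffs]


for func in (transactions_filtered_by_querying_multiple_calls,
             transactions_filtered_in_code_single_call):
    start = time.time()
    func(CUTOFFS)
    print(func.__name__, time.time() - start)
```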
Running the above code for different numbers of calls gives us the following numbers. The numbers are absolute times in seconds.
The whole code can be checked out here. We can see that the results in both cases are quite close (except for the first result) and scale roughly linearly. The last run of `transactions_filtered_by_querying_multiple_calls` failed with a connection error, which suggests we hit the upper bound on the number of connections the SQL database will accept. So for a small number of queries, filtering in the database wins hands down, while for a large number of queries the two approaches are almost the same. Only when we need something like 400 connections do we have to give this a second thought, and at that point it is probably better to move to an in-memory store like Redis.
So the conclusion is that we should keep the number of calls as low as possible and defer general processing like filtering and sorting to the database. This should help us get the best possible performance out of our code.
To get regular updates on more content, learning material and general rants you can follow me on Twitter.