Maximizing SQLAlchemy: Database Query Performance

A typical enterprise, data-driven application has to run a lot of queries and read operations against its database. If you are making those queries from a Python application, SQLAlchemy is a great tool for the job. A good read on how to improve the performance of SQL code is given here. In this post, we will look at the trade-off between doing the work in the database and doing it in application code, and at ways to weigh the options at our disposal.

Creating a database for the benchmarks

We will run the measurements on a MySQL/MariaDB database. So, first of all, let's create a database `testing` for our testing purposes.

```
➜ mysql -u root -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 5
Server version: 10.1.24-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database testing;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use testing;
Database changed
MariaDB [testing]>
```

Next, we will create a `transactions` table with some dummy columns, which we will then populate with dummy data.

```sql
CREATE TABLE transactions (
    id int NOT NULL PRIMARY KEY,
    name varchar(255),
    description varchar(255),
    country_name varchar(255),
    city_name varchar(255),
    cost varchar(255),
    currency varchar(255),
    created_at DATE,
    billing_type varchar(255),
    language varchar(255),
    operating_system varchar(255)
);
```

This creates the table with the required columns. We will now fill the rows with dummy data. To generate it we will use faker, which produces fake data that is reasonably representative of a real dataset. The code used for this is shown below. First, the dependencies for the script:

```python
import random

from faker import Faker
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
```

Then we define the engine, the declarative base class that will be used to interface with the database, and the model for the `transactions` table:

```python
user = 'root'
pwd = ''
host = 'localhost'
db = 'testing'
mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))

Base = declarative_base()


class Transactions(Base):
    __tablename__ = 'transactions'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    description = Column(String)
    country_name = Column(String)
    city_name = Column(String)
    cost = Column(String)
    currency = Column(String)
    created_at = Column(DateTime)
    billing_type = Column(String)
    language = Column(String)
    operating_system = Column(String)

    def __repr__(self):
        """How the class Transactions is shown.

        :returns: the string with the id and the name.
        """
        return "Transaction(id=%s, name=%s)" % (self.id, self.name)
```

And then we leverage faker to create 50,000 rows of fake data:

```python
myFactory = Faker()

Session = sessionmaker(bind=mysql_engine)
session = Session()

billing_type_list = ['cheque', 'cash', 'credit', 'debit', 'e-wallet']
language = ['English', 'Bengali', 'Kannada']
operating_system = 'linux'

# one row per id in this range; adjust the range to control how many rows get created
for i in range(60000, 1000000):
    transaction = Transactions(
        id=int(i),
        name=myFactory.name(),
        description=myFactory.text(),
        country_name=myFactory.country(),
        city_name=myFactory.city(),
        cost=str(myFactory.random_digit_not_null()),
        currency=myFactory.currency_code(),
        created_at=myFactory.date_time_between(start_date="-30y", end_date="now", tzinfo=None),
        billing_type=random.choice(billing_type_list),
        language=random.choice(language),
        operating_system=operating_system,
    )
    session.add(transaction)

session.commit()
```

If you want the whole script, you can click on the link here. Running this code will create a database of 50,000 rows. Below is a small sample of the database, reduced for reading purposes:

https://gist.githubusercontent.com/infinite-Joy/04dde9149aa82f4188a71c821dcaec0a

As we will be focusing on the date, here are some sample datetimes as well:

```
MariaDB [testing]> select name, billing_type, created_at from transactions limit 3;
+------------------+--------------+------------+
| name             | billing_type | created_at |
+------------------+--------------+------------+
| Henry Smith      | cash         | 2015-05-11 |
| Brandon Williams | debit        | 2000-04-14 |
| Natalie Smith    | debit        | 2007-09-02 |
+------------------+--------------+------------+
3 rows in set (0.00 sec)
```

Querying the database using SQLAlchemy

Now we bring SQLAlchemy into the picture, query the DB and filter by date. The code for this is below. The `profiled` function is taken from the SQLAlchemy documentation, with some modifications for Python 3.

```python
import cProfile
import contextlib
import pstats
import time
from io import StringIO

from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String, DateTime
from sqlalchemy import Date, cast
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker


@contextlib.contextmanager
def profiled():
    # profile everything that runs inside a "with profiled():" block
    pr = cProfile.Profile()
    pr.enable()
    yield
    pr.disable()
    s = StringIO()
    ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
    ps.print_stats()
    # uncomment this to see who's calling what
    # ps.print_callers()
    print(s.getvalue())


user = 'root'
pwd = ''
host = 'localhost'
db = 'testing'

Base = declarative_base()


class Transactions(Base):
    __tablename__ = 'transactions'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    description = Column(String)
    country_name = Column(String)
    city_name = Column(String)
    cost = Column(String)
    currency = Column(String)
    created_at = Column(DateTime)
    billing_type = Column(String)
    language = Column(String)
    operating_system = Column(String)

    def __repr__(self):
        """How the class Transactions is shown.

        :returns: the string with the id and the name.
        """
        return "Transaction(id=%s, name=%s)" % (self.id, self.name)


def transactions_filtered_by_querying():
    with profiled():
        mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
        Session = sessionmaker(bind=mysql_engine)
        session = Session()

        # the date filter is part of the query, so MySQL does the filtering
        transactions = (
            session.query(Transactions)
            .filter(cast(Transactions.created_at, Date) == '2007-09-02')
            .all()
        )
        return [(t.id, t.name, t.billing_type) for t in transactions]


transactions_filtered_by_querying()
```

This is basically passing the query to the database and letting the database do the filtering for us.
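If you want to confirm that the filtering really happens on the database side, you can print the SQL that SQLAlchemy generates for this query instead of executing it. A minimal sketch, assuming an engine and the `Transactions` model set up exactly as above (this snippet is illustrative and not part of the original benchmark):

```python
# Build the same filtered query and render it rather than executing it.
# The WHERE clause on created_at is part of the emitted SQL, which is why
# the filtering is done by MySQL rather than in Python.
mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
session = sessionmaker(bind=mysql_engine)()

query = session.query(Transactions).filter(
    cast(Transactions.created_at, Date) == '2007-09-02'
)
print(str(query))  # prints the SELECT statement, including the WHERE clause on created_at
```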
Running `transactions_filtered_by_querying` gives us the following result:

```
➜ python test_on_different_methods.py
8092 function calls (7813 primitive calls) in 0.078 seconds
```

So the query completes in 0.078 seconds. What if we try to do the same thing by querying everything and then filtering the results in Python code? Common sense tells us that this should take longer: databases are optimized for querying and filtering, and shipping a lot of data over the network takes far more time than having the database do the work and send back only the relevant rows. Let's take a look at the code needed for this test.

```python
def transactions_filtered_in_python_code():
    with profiled():
        mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
        Session = sessionmaker(bind=mysql_engine)
        session = Session()

        # fetch every row, then filter in Python
        transactions = session.query(Transactions).all()
        return [(t.id, t.name, t.billing_type)
                for t in transactions
                if str(t.created_at) == '2015-05-11']  # created_at comes back as a date object


transactions_filtered_in_python_code()
```

Executing the above code gives us the following result:

```
➜ python test_on_different_methods.py
1624078 function calls (1623417 primitive calls) in 1.880 seconds
```

As expected, this took a lot more time (1.880 seconds compared to 0.078 seconds).

But you might object that this was just a single call, so of course deferring the filtering to the database wins. What if we have to apply several different filters to the same data? Surely it would then be better to make just one call, fetch all of the data, and do all the filtering in code. Let's check that hypothesis with the help of the code below.

```python
def timing_function(some_function):
    """Outputs the time a function takes to execute."""

    def wrapper(*args, **kwargs):
        t1 = time.time()
        some_function(*args, **kwargs)
        t2 = time.time()
        print("Time it took to run the function: " + str(t2 - t1) + "\n")

    return wrapper


@timing_function
def transactions_filtered_by_querying_multiple_calls(calls):
    # one connection and one filtered query per call
    for _ in range(calls):
        mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
        Session = sessionmaker(bind=mysql_engine)
        session = Session()

        transactions = (
            session.query(Transactions)
            .filter(cast(Transactions.created_at, Date) == '2007-09-02')
            .all()
        )
        [(t.id, t.name, t.billing_type) for t in transactions]


@timing_function
def transactions_filtered_in_python_code_multiple_calls(calls):
    # a single query up front, then repeated filtering in Python
    mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
    Session = sessionmaker(bind=mysql_engine)
    session = Session()

    transactions = session.query(Transactions).all()

    for _ in range(calls):
        [(t.id, t.name, t.billing_type)
         for t in transactions
         if str(t.created_at) == '2015-05-11']
```

Running these two functions for different numbers of calls gives us timings in absolute seconds. The whole code can be checked out here. The results in both cases are quite close (except the first one) and they scale roughly linearly. The last run of `transactions_filtered_by_querying_multiple_calls` failed with a connection error, showing that we had hit the upper bound on the number of connections we can open against the SQL database. So for a small number of queries, letting the database do the filtering wins hands down, while for a large number of queries the two approaches perform almost the same. Only when we need something on the order of 400 connections do we have to give this a second thought; at that point the datastore itself should probably change, for example to an in-memory database like Redis.

The conclusion is that we should keep the number of calls as low as possible and defer general processing such as filtering and sorting to the database. This should help us get the best possible performance out of our code.

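As a concrete illustration of that conclusion, filtering and sorting can both be pushed into a single query so that MySQL does the work and only the relevant rows travel over the network. A minimal sketch, reusing the `Transactions` model and connection settings from above (the ordering column here is just an example, not something from the original benchmark):

```python
# One round trip: the database applies the WHERE clause and the ORDER BY,
# and only the matching rows come back to Python.
mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
session = sessionmaker(bind=mysql_engine)()

transactions = (
    session.query(Transactions)
    .filter(cast(Transactions.created_at, Date) == '2007-09-02')
    .order_by(Transactions.name)  # example ordering column; pick whatever the caller needs
    .all()
)
print(transactions[:3])  # Transaction(id=..., name=...) via __repr__
```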