Need of Optimization
In most IT systems, the database is an integral part. Data needs to be processed from databases to actually use it — and SQL is the standard interface language used to query the Database for data. Most of the time, SQLs written to communicate with the Database are simple, easy and fast. Creating, reading, updating and delete are usual operations done on the Database and they are quite fast most of the time. But they, as it might be evident by now, aren’t the only operations done on databases.
Sometimes, operational databases need to run queries that take too much time. These include operations like aggregation, queries involving multiple tables, and so on. Unlike queries that run on data warehouses, queries run on operational databases are time-critical, which means they need to return data as soon as possible. So, optimization of SQL queries becomes a necessity.
Some of the systems that need optimized SQL include a geoserver. A geoserver includes millions of geographical data, spanned across multiple tables; and the database operation performed on these data (which includes intersection of lands, calculation of area, calculation of distance between two geographical locations, etc) are usually performance heavy. To perform meaningful operation from these huge amounts of data requires complex geo operation and requires data from multiple tables. An unoptimized SQL query for these operations would take days or months — giving rise to the need of optimization.
Tips for Optimization
I have compiled the following list of tips that anyone can use to optimize their SQL query for performance. These tips are the result of my own experiences when working with the Sunly project at Producement. Sunly is a startup helping to develop renewable energy projects and the Sunly project involved finding suitable land for those energy projects. Finding most suitable lands required complex SQL queries which when unoptimized ran for days or weeks. Following tips come from the knowledge I gained while trying to optimize those queries while using PostgreSQL database and PostGIS.
Tip 1: SELECT as little fields as possible
Most developers, for the sake of simplicity, use SELECT *
. This is rather inefficient as it requires the database to return every column. One can speed up the efficiency and, as a result, the speed of the SQL query by only selecting required fields.
For example, in the table persons
containing fields like id
, name
, age
, citizenship_no
, place_of_birth
and so on; to get name and age of the person, instead of doing something like:
SELECT * FROM persons;
we could do:
SELECT id, name, age FROM persons;
Tip 2: Use JOINs instead of Cartesian product
This is another simple tip that most developers who properly learnt SQL would already know. When you are trying to obtain data from two tables based on the relationship between them, it’s better to prefer using JOIN
s instead of doing Cartesian product using WHERE
clause.
Let’s explain this with an example. Suppose we have two tables: persons
and lands
, and the relationship between persons and lands is given by the field ownerId
under lands
table which describes which person owns the land. To list persons along with the lands they own, one can do the following with Cartesian product:
SELECT * FROM persons, lands WHERE persons.id = lands.ownerId;
But this query is too slow, and one can use a more efficient query which uses JOIN
:
SELECT * FROM persons INNER JOIN lands ON person.id = lands.ownerId;
The reason why query using JOIN
is faster is due to the fact that databases would optimize joining of tables when JOIN
is used.
Tip 3: JOIN is not always the answer
Joining tables is not always the answer when operations involving multiple tables are concerned. There are other ways too, like subqueries.
Let us take the example of two tables: person
and phone_number
. Multiple phone numbers can be associated with a single person, and the relation is defined by the owner_id
field present in the phone_number
table. We want to find the list of persons who have more than 5 phone numbers associated with them. A quick thinking would produce a query using JOIN
like as follows:
SELECT person.*
FROM person INNER JOIN phone_number
ON person.id = phone_number.owner_id
GROUP BY person.id
HAVING COUNT(phone_number.id) > 5;
As SQL is a declarative language, you might not need to optimize this query as most databases would optimize for this query by not performing full join before grouping. But, some don’t; wasting a big chunk of time performing joins that aren’t really needed. So, a sure fire way to speed this up would be a query like:
SELECT *
FROM person
WHERE (
SELECT COUNT(phone_number.id)
FROM phone_number
WHERE phone_number.owner_id = person.id
) > 5;
This way, the database won’t have to perform full join just to select people having over 5 phone numbers.
Tip 4: Do some test run using LIMIT
Use the LIMIT
clause of SQL to your advantage by doing some test runs. When you have millions of data to perform an operation on, you don’t want to wait for your query to run fully before realizing that your query was wrong or your query was inefficient. Instead, limit your query to a small amount of data and check its efficiency and validity. When you are satisfied with your query, you can then run it fully.
To limit your query, you can use the LIMIT
clause, like as follows:
SELECT lands.*
FROM lands LEFT JOIN telephone_lines
ON ST_DWithin(lands.geom, telephone_lines.geom, 1000)
WHERE telephone_lines.id IS NOT NULL
LIMIT 1000;
Here, the query is selecting at most 1000 the lands that are within 1000 metres distance from the nearest telephone lines. We are limiting the SQL query to select at most 1000 lands by using the LIMIT
clause. By understanding the time it takes to run the above query, we can understand how much time the whole query takes. Suppose the query takes 3 seconds to run fully. If we estimate the above query to return at least 1 million suitable lands, the whole query would take 3000 seconds to complete, which is 50 whole minutes. We would now understand the query needs to be optimized. Using TIP 3, we could do something like:
SELECT lands.*
FROM lands
WHERE EXISTS(
SELECT id
FROM telephone_lines
WHERE ST_DWithin(lands.geom, telephone_lines.geom, 1000)
)
LIMIT 1000;
Let us suppose that this query now takes only 0.5 seconds to run. According to our earlier estimate of 1 million suitable lands, this query would now take 500 seconds, which is about 8.5 minutes — a lot better than the above query.
Tip 5: Don’t forget to create indices
Indices help to make queries much faster by allowing the database to access data much faster and perform several operations much better. Let us take the SQL query from Tip 4 as example:
SELECT lands.*
FROM lands
WHERE EXISTS(
SELECT id
FROM telephone_lines
WHERE ST_DWithin(lands.geom, telephone_lines.geom, 1000)
)
LIMIT 1000;
Here, if there is no index created on the geom
field of both lands
table and telephone_lines
table, the POSTGIS operation ST_DWithin
is going to be much slower; thereby making the whole SQL much slower. By creating the indices on both the geom
fields, the query will run much faster.
Tip 6: Don’t hesitate to write long complex queries if it saves time significantly
Sometimes, you can write simple queries that don’t require much time to understand but would take too long to actually run. In time-critical operational databases, it would be much better to write complex queries that save time significantly; as long as it’s understandable with some effort.
Tip 7: Create a materialized view or new table
Sometimes, the new data you obtain from some table is reused over and over again in some other queries. Instead of performing those duplicate queries to obtain the same data, you could and should create a materialized view or new table to speed up your queries.
Let us consider a table containing millions of lands information including its geometry information. Finding all the lands having area above a certain number multiple times requires calculating the area of the geometry of land over and over again, once in each query. Instead of doing such a complex and time-consuming calculation in each query, we could precompute the area of each land beforehand and create a new table, and use the new table in our future queries. We could also forgo the hassle of creating a new table and maintaining it by modifying the existing table to include the new information.
Tip 8: Prefer fast data structures
If you have some basic understanding of programming, you would know that string comparison is much slower than integer comparison. Thus, if you do any operation involving strings on databases, find if there is a way to do it with integers instead.
Let us suppose that there is a table of citizens
in a country and another table of lands
in the country. The relationship between them is given by the owner_citizenship_number
filed under the lands
table. For your information, the citizenship number is stored in string as it includes some hyphenation like '123-456-7890'
. If you want to do any queries that involves both tables, you might be inclined to do JOIN
like:
SELECT *
FROM citizens INNER JOIN lands
ON citizens.citizenship_number = lands.owner_citizenship_number;
But this is slow as it uses string comparison. You can speed up this query by using citizenship number in integer form than in string form. As hyphen in '123-456-7890'
was redundant, you could use the integer version 1234567890
. So, using Tip 7, create a new table/materialized view of those two tables having citizenship number field converted to integer and perform the same query much faster as:
SELECT *
FROM citizens INNER JOIN lands
ON citizens.citizenship_number_int = lands.owner_citizenship_number_int;
This was an example of when string (slow data structure) was converted into integer (fast data structure). There will be other situations with other data structures where converting from slow data structure to fast data structure would speed up the query significantly.
Take for example one situation that I faced while working in Sunly. I had to find the area of land not covered by forests. I could have found all the forests that lay inside the land and performed geometry difference of the land and forest and then calculated the area of the remaining part of land. But instead I calculated the total area of forests that lay inside the land and deleted it from the total area of land to find the area of land not covered by forests. As the geometry difference of land and forest is slower than simple integer difference between land area and total forest area, the second option was much faster.
Tip 9: Prefer not to use IN operator
It is sometimes tempting to use the IN
operator to check the existence of something in another table. For example, to find all individuals that are not in sanctions list, one might do something like:
SELECT *
FROM persons
WHERE persons.id NOT IN (
SELECT person_id FROM sanctions
);
But this query is too slow, as it calculates the sub-query first and then checks if the person’s id exists in the list returned from the sub-query. A faster way to do this would be something like:
SELECT *
FROM persons
WHERE NOT EXISTS (
SELECT * FROM sanctions WHERE sanctions.person_id = persons.id
);
Before you go
Not all these tips apply to all databases out there. SQL being a declarative language, some databases might optimize for situations described here and these tips might no longer hold true. So, before you apply them to your SQL queries, first test them and see if it works for you. Tip 4 would come in handy for you in those situations.