Query Optimizations and Database Performance Tuning

Gaurav Kumar Srivastava
Published in stackspacearena
Mar 10, 2024 · 5 min read

Here is a quick example from one of my past experiences at an e-commerce company, where I was leading the transport management service team. Our routing service ran a shortest-path query against a graph database to optimize delivery time once shipping was initiated. On production deployment, the shortest-path API started timing out, even though it had worked seamlessly in the dev and stage environments. After a significant amount of debugging, the issue we found was trivial, and one I had missed in code reviews: one of the queries was filtering on an attribute that was not indexed!

Query optimization is a critical aspect (often underrated) of database management systems, aimed at improving the performance of database queries. Here are some important query optimizations to consider:

Common Optimization Techniques

Indexing:

Creating indexes on columns frequently used in WHERE clauses, JOIN conditions, and ORDER BY clauses can significantly improve query performance by allowing the database engine to locate rows more efficiently.

Example:

-- Without an index
SELECT * FROM users WHERE email = 'example@example.com';

This performs much slower on large datasets than:


-- With an index
CREATE INDEX idx_email ON users (email);
SELECT * FROM users WHERE email = 'example@example.com';
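To see the effect concretely, here is a minimal sketch using Python's built-in sqlite3 as a stand-in database (the `users` table and `idx_email` index are illustrative); `EXPLAIN QUERY PLAN` shows whether the engine scans the whole table or uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports how SQLite will execute the statement
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'user500@example.com'"
print(plan(query))   # full table scan before the index exists

conn.execute("CREATE INDEX idx_email ON users (email)")
print(plan(query))   # now a search using idx_email
```

The same query goes from examining every row to a direct index lookup, which is the difference that bit us in production.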

Query Rewriting:

Sometimes, rewriting a query in a different form can lead to better performance. This might involve restructuring joins, subqueries, or using different SQL constructs.

Example:

-- Using a subquery
SELECT *
FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA');

can often be rewritten as:

-- Using a JOIN
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'USA';

because many optimizers can evaluate an explicit join more efficiently than an IN subquery (though modern optimizers often rewrite one form into the other automatically).
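Before swapping query forms, it is worth confirming they are actually equivalent. A quick sketch with sqlite3 and toy `orders`/`customers` tables checks that both forms return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'USA'), (2, 'France'), (3, 'USA');
INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3), (13, 1);
""")

subquery = """SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')"""
join = """SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'USA'"""

# Both forms are semantically equivalent; only the execution strategy differs
assert sorted(conn.execute(subquery).fetchall()) == \
       sorted(conn.execute(join).fetchall())
```

Checking equivalence like this (or with EXPLAIN on the real database) keeps a performance rewrite from silently changing results.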

Normalization and Denormalization:

Proper database design using normalization techniques can reduce redundancy and improve data integrity. However, in some cases, denormalization may be necessary to optimize query performance, especially for read-heavy workloads.

As a hypothetical example,

-- Normalized tables
-- (shipment table stores only invoice_id)
SELECT * FROM shipment WHERE invoice_id = 123;
-- invoice details must be fetched separately or via a join

The query above may be slower because invoice details require a second lookup or a join. In read-heavy systems, we could get better performance with a denormalized table:

-- Denormalized table
-- (shipment table with invoice details embedded)
SELECT * FROM shipment WHERE invoice_id = 123;
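The trade-off can be sketched with sqlite3 (table and column names here are illustrative): the normalized layout needs a join for every read, while the denormalized layout answers the same question from a single table at the cost of duplicated invoice data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: shipment references invoice by id
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipment (shipment_id INTEGER PRIMARY KEY, invoice_id INTEGER);
INSERT INTO invoice VALUES (123, 99.5);
INSERT INTO shipment VALUES (1, 123);

-- Denormalized: invoice details embedded in the shipment row
CREATE TABLE shipment_wide (shipment_id INTEGER PRIMARY KEY,
                            invoice_id INTEGER, invoice_amount REAL);
INSERT INTO shipment_wide VALUES (1, 123, 99.5);
""")

# Normalized read: a join is needed to get invoice details
normalized = conn.execute("""
    SELECT s.shipment_id, i.invoice_id, i.amount
    FROM shipment s JOIN invoice i ON s.invoice_id = i.invoice_id
    WHERE s.invoice_id = 123""").fetchone()

# Denormalized read: one single-table lookup, duplicated data as the cost
denormalized = conn.execute(
    "SELECT shipment_id, invoice_id, invoice_amount FROM shipment_wide "
    "WHERE invoice_id = 123").fetchone()

assert normalized == denormalized
```

The price of the faster read is that every invoice update now has to touch the shipment rows too, which is why this suits read-heavy workloads.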

Limiting Results:

Limiting the number of rows returned by a query with the LIMIT keyword reduces network overhead and improves response times, especially for queries that might otherwise return a large result set when only a small portion of it is needed.
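A small sqlite3 sketch (the `events` table is illustrative): instead of pulling all 10,000 rows over the wire, fetch only the 10 most recent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"event-{i}",) for i in range(10_000)])

# Fetch only the 10 most recent rows instead of all 10,000
rows = conn.execute(
    "SELECT * FROM events ORDER BY id DESC LIMIT 10").fetchall()
print(len(rows))  # 10
```

Pairing LIMIT with an ORDER BY on an indexed column keeps the database from sorting the full table just to discard most of it.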

Partitioning:

Partitioning large tables into smaller, more manageable chunks based on certain criteria (e.g., range partitioning, hash partitioning) can improve query performance by reducing the amount of data that needs to be scanned. Consider this query; it scans only the relevant partition(s):

-- With partitioning
-- Assuming 'orders' table is partitioned by range on 'order_date'
SELECT * FROM orders WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31';
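SQLite has no native partitioning, so here is a hedged sketch that simulates range partitioning by hand: one table per year, with a router function that only touches the tables whose range overlaps the query (a manual version of the "partition pruning" a real database does automatically).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simulated range partitions: one table per year
for year in (2021, 2022, 2023):
    conn.execute(f"CREATE TABLE orders_{year} (order_id INTEGER, order_date TEXT)")
conn.execute("INSERT INTO orders_2022 VALUES (1, '2022-03-15'), (2, '2022-11-02')")
conn.execute("INSERT INTO orders_2023 VALUES (3, '2023-01-10')")

def query_orders(start, end):
    # "Partition pruning": scan only the tables whose year overlaps the range
    rows = []
    for year in range(int(start[:4]), int(end[:4]) + 1):
        rows += conn.execute(
            f"SELECT * FROM orders_{year} WHERE order_date BETWEEN ? AND ?",
            (start, end)).fetchall()
    return rows

print(query_orders('2022-01-01', '2022-12-31'))  # only orders_2022 is scanned
```

With native partitioning (e.g., PostgreSQL declarative partitions), the planner performs this routing for you based on the WHERE clause.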

Optimized Joins:

Choosing the appropriate join algorithms (e.g., nested loop join, hash join, merge join) based on the size of the tables and available indexes can improve query performance.

Example:

-- The JOIN syntax (ON vs USING) does not by itself change the algorithm;
-- the query planner picks one based on table sizes, statistics, and indexes.
-- Inspect the chosen plan:
EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'USA';

-- In PostgreSQL, planner settings can be toggled while tuning
-- to compare algorithms:
SET enable_nestloop = off;  -- steers the planner toward hash or merge joins

Stored Procedures:

Encapsulating frequently executed queries into stored procedures can reduce network overhead and improve performance by minimizing the amount of data transferred between the application and the database server.

Caching:

Implementing caching mechanisms at various levels (database level, application level) can reduce the overhead of frequently executed queries by storing and retrieving results from memory rather than re-executing the query every time.
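An application-level cache can be as simple as a dictionary in front of the query. A minimal sketch with sqlite3 (the `products` table and `get_product` helper are illustrative), counting how often the database is actually hit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'widget')")

db_hits = 0
cache = {}

def get_product(product_id):
    # Application-level cache: serve repeated reads from memory
    global db_hits
    if product_id in cache:
        return cache[product_id]
    db_hits += 1
    row = conn.execute(
        "SELECT * FROM products WHERE id = ?", (product_id,)).fetchone()
    cache[product_id] = row
    return row

get_product(1)
get_product(1)
get_product(1)
print(db_hits)  # the database was queried only once
```

The hard part in practice is invalidation: cached entries must be evicted or expired (e.g., with a TTL) when the underlying rows change.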

Optimizing Database Configuration:

Configuring database parameters such as memory allocation, parallelism settings, and disk I/O settings according to the workload characteristics can significantly impact query performance. These parameters are database-dependent but always play a crucial role in achieving good performance.

Database-Specific Optimizations

MySQL:

  • Use appropriate storage engines: InnoDB (the default) for transactional workloads; MyISAM only for legacy read-heavy setups that can tolerate table-level locking.
  • Tune configuration parameters like innodb_buffer_pool_size and innodb_flush_log_at_trx_commit.
  • Optimize queries using indexes, covering indexes, and proper join strategies.
  • Cache frequently executed queries at the application layer (the built-in query cache, and its query_cache_size setting, were removed in MySQL 8.0).
  • Partition large tables to distribute the load and improve performance.
  • Use stored procedures to reduce network traffic for frequently executed routines.

PostgreSQL:

  • Utilize indexing strategies including B-tree, GIN, GiST, and BRIN indexes depending on the data type and query patterns.
  • Analyze and vacuum tables regularly to update statistics and reclaim disk space.
  • Optimize queries using CTEs (Common Table Expressions), window functions, and proper join strategies.
  • Utilize table partitioning to manage large datasets efficiently.
  • Adjust configuration parameters like shared_buffers, work_mem, and effective_cache_size based on workload characteristics.
  • Take advantage of advanced features such as JSONB for flexible schema support and full-text search capabilities.

MongoDB:

  • Design schema considering document structure, embedding, and referencing based on query patterns.
  • Create indexes to support query performance, including compound indexes and indexes on embedded fields.
  • Shard collections to distribute data across multiple nodes and improve scalability.
  • Utilize aggregation pipeline for complex data processing and analytics.
  • Use appropriate read preferences and write concerns to balance consistency and performance.
  • Enable WiredTiger storage engine for better compression, concurrency control, and performance.

Cassandra:

  • Design data models based on query patterns and distribution requirements.
  • Utilize compound primary keys and clustering columns to optimize queries and data distribution.
  • Consider denormalization and materialized views to support different query patterns efficiently.
  • Tune consistency levels based on application requirements to balance consistency and performance.
  • Optimize compaction strategies to manage disk space and improve read/write performance.
  • Leave the default Murmur3Partitioner in place for most workloads; order-preserving partitioners such as ByteOrderedPartitioner are discouraged because they tend to create hot spots.

Graph Databases (e.g., Neo4j)

  • Design efficient graph models to represent relationships and optimize traversals.
  • Use relationship indexing and node property indexing to speed up graph queries.
  • Optimize Cypher queries using query profiling and index hints.
  • Utilize relationship types and labels effectively to organize and query graph data.
  • Partition graphs into smaller subgraphs to distribute data across multiple servers.
  • Use caching mechanisms to cache frequently accessed graph data and query results.

Time Series Databases (e.g., InfluxDB, Prometheus)

  • Design efficient schema to store time-series data, considering tag keys, field keys, and timestamps.
  • Utilize retention policies and shard duration to manage data retention and storage efficiency.
  • Optimize queries using appropriate time range selection, downsampling, and aggregation functions.
  • Use continuous queries or tasks for data preprocessing and downsampling.
  • Tune storage configuration parameters like shard duration, retention policies, and data compaction settings.
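The downsampling idea is database-agnostic and can be sketched with sqlite3 as a stand-in (the `cpu` table and 10-second sampling interval are illustrative): raw samples are grouped into fixed-width time buckets and aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts INTEGER, usage REAL)")  # ts = unix seconds
# One sample every 10 seconds for 10 minutes, with varying usage values
conn.executemany("INSERT INTO cpu VALUES (?, ?)",
                 [(t, 50.0 + (t // 10) % 10) for t in range(0, 600, 10)])

# Downsample raw 10-second samples into 1-minute averages
downsampled = conn.execute("""
    SELECT (ts / 60) * 60 AS bucket, AVG(usage)
    FROM cpu
    GROUP BY bucket
    ORDER BY bucket""").fetchall()
print(len(downsampled))  # 600 seconds of data -> 10 one-minute buckets
```

Dedicated time-series databases run this kind of aggregation continuously (InfluxDB continuous queries/tasks, Prometheus recording rules) so dashboards read the small rollup instead of the raw series.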

DynamoDB:

  • Design efficient partition keys to distribute data evenly across partitions and avoid hot partitions.
  • Utilize sort keys and secondary indexes to optimize query performance.
  • Consider DynamoDB Accelerator (DAX) for caching frequently accessed data.
  • Use provisioned capacity or on-demand capacity modes based on workload characteristics.
  • Implement retry and exponential backoff strategies for handling throttled requests.
  • Leverage DynamoDB Streams for real-time data processing and change capture.
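The retry-with-exponential-backoff pattern from the list above is database-agnostic. A minimal sketch in plain Python (the `ThrottledError` class and `flaky_put` function are hypothetical stand-ins for a DynamoDB client call and its throttling error):

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a provisioned-throughput-exceeded error."""

def with_backoff(call, max_retries=5, base_delay=0.05):
    # Retry a throttled call, doubling the wait each attempt (plus jitter)
    for attempt in range(max_retries):
        try:
            return call()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Hypothetical flaky call: throttled twice, then succeeds
attempts = {"n": 0}
def flaky_put():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottledError
    return "ok"

print(with_backoff(flaky_put))  # "ok" after two retries
```

The jitter term spreads retries out so that many throttled clients do not all hammer the table again at the same instant.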

Cosmos DB:

  • Design efficient partition keys and logical partitioning schemes to distribute data across physical partitions.
  • Utilize indexing policies and indexing paths to optimize query performance.
  • Adjust consistency levels based on application requirements to balance consistency and performance.
  • Use partitioned collections and multi-region replication for scalability and high availability.
  • Monitor and optimize request units (RUs) consumption to ensure optimal throughput and performance.
  • Leverage features like stored procedures, triggers, and user-defined functions for server-side processing and optimization.
