Database performance for teams

Leo Zeba
NAZAR
Published in
9 min readOct 3, 2017

Each year since 2011, Stack Overflow has asked developers about their favorite technologies, coding habits and work preferences. This year 64,000 developers took the annual survey. It was the largest group of respondents in their history. For the fifth year in a row, SQL was the second most commonly used programming language.

Stack Overflow Developer Survey Results — https://stackoverf low.com/insights/survey/2017

SQL stands by Structured Query Language is a domain-specific language used in programming and designed for managing data held in a relational database management system. SQL was one of the first commercial languages for Edgar F. Codd’s relational model, as described in his influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” https://en.wikipedia.org/wiki/SQL

SQL and RDBMS are well proved technologies. They have been adopted in a huge amount of applications around the world by companies of all sizes since the 80’s. From small and medium businesses using one single database instance to corporations powered by tens of instances or even global companies like Facebook and Linkedin that rely on clusters of hundreds or thousands of relational databases to scale their operations at maximum performance, SQL is dominant.

As an old and robust technology, most of the SQL techniques used to achieve performance and scalability are well known. Index creation, bulk inserts, stored procedures, table partitions, multiple schemas, sharding, clustering and others are richly documented with lots of examples available (ex: Sharding & IDs at Instagram; Sharding Pinterest: How we scaled our MySQL fleet; Query Analyzer: A Tool for Analyzing MySQL Queries Without Overhead). However, what once was common knowledge for many of my colleagues programmers in the 90’s, does not seem to be the case nowadays. This lack of due attention to the SQL code has cost lots of money.

Throwing machines at the problem

This lack of due attention with the SQL code, will probably turn in a performance issue when the application goes to production and user base grows or when the data volume increases.

When performance issues arises, the preferred solution for many companies around the world is to increase the computing resources available. That is specially true in cloud environments where this action is one click away and wasted dollars to pay.

On the other hand, for those that decide to investigate the root cause of the problem, there is a set of approaches and tools one could use. None of them is better than another. There are risks and trade-offs to be aware in any choice. It will depend of the database engine, environment (Cloud, on-premises or hybrid), load of the database and the team expertise. If the company can afford a dedicated database team like Linkedin, it can even develop own tools and processes to help in the quest. However, for most small and medium businesses, small teams and developers with little previous SQL experience this can be a hard problem to solve that will waste plenty of time and money.

Offensive queries

Our preferred approach when facing performance issues, is to find the top offensive queries. It consists in using the database slow query logs/performance dynamic views with a threshold (ex. 100 ms) and capturing all the queries crossing the threshold, which are then analyzed.

Although this approach cannot capture all the queries, it gives great visibility about which commands the application is sending to the database. We’ve been using this approach with great success in our customers, always obtaining very significant performance improvements, stability, scalability and, in most of the cases, reducing the infrastructure costs by 40% in average, and 90% for some special cases.

*This is not a one size fits all solution*. Logging leads to high IO and, depending of your database load, this can drastically reduces the throughput and degrades performance. But, if used correctly, one can rapidly identify and solve performance issues in those stacks without touching the source code.

This approach has been proved very effective in the following scenarios:

Small and medium businesses

  • When there is none or only a couple of developers with little previous SQL experience;
  • When buying proprietary third-party software with poor support and you lose $ due performance issues.

Managed services

  • When the customer application is killing the database resources in the cloud provider and you need evidence to support your customer service operations;
  • When managing multiple deployments, from different customers, with distinct database engines in a multi cloud, on-premises or hybrid environment.

Corporations

  • When there are multiple teams responsible for the development, security and operations where often takes too much time and effort to coordinate the teams;
  • When there are databases shared by multiple applications inside an organization.

However, setup the environment, tools and context to find the offensive queries for every new customer or application is a very repetitive and time consuming task that usually involves high skilled professionals and makes this task expensive. Another point is that, although most of techniques to fix performance issues in applications backed by RDBMS are very well known for decades, many companies, specially when there are multiple teams involved, struggle to follow the book due to communication problems between them and defensive positions adopted by the teams. That is nonsense because there is real time and money being wasted burning CPUs around the globe with BAD SQL code.

“It is not lack of hardware. It is not network traffic. It is not slow front ends. The main performance problem in the huge majority of database applications is BAD SQL code.” — Joe Celko.

By seeing this hundreds of times happening in front of our eyes, we’ve decided to build a tool to help with this problem. So we built NAZAR. Database performance for teams.

NAZAR is an offensive query monitoring SaaS. It helps teams to solve performance issues, democratizing the access to the BAD SQL commands and enabling collaboration in order to fix the issues faster. Its a productivity tool to save time and money.

Using NAZAR we’ve been successful in identifying and solve the performance issues for our customers in less than 14 days in average. These fixes also lead to infrastructure reduction costs in 40% average, and in 90% for some special cases among our customers. It has saved real money yet.

NAZAR currently supports Aurora, MariaDB, MySQL, PostgreSQL, Oracle and SQL Server relational databases and is platform agnostic. You can also use NAZAR in any cloud, on-premise or hybrid environment.

Relational databases supported by NAZAR

NAZAR continually monitors the database because applications are like living organisms. They are continually evolving and there will be always fresh and new problems to avoid.

*This is not a one size fits all solution*. If you are skeptical about activating logs to investigate and solve performance issues, please request a demo and we will happily show you how NAZAR works and how we can safely and successfully use this approach.

NAZAR is your amulet against performance issues

Agentless No software to install. Less friction when dealing with multiple teams (IT support, developers, managers, etc).

Database It’s all about data. In the database layer we can see and monitor all applications interacting with the data.

Performance We look for slow queries only. These slow queries are, most of the times, the root cause of high CPU and IO degrading performance.

Monitoring We monitor database top offensive queries by 3 different views:

  • most executed
  • most time consuming
  • slowest executions

These are some of the features NAZAR offers to help your team:

Overview

General metrics about the slow queries. Total time spent, total number of executions, number of hosts, number of users and databases that executed slow queries.

Most time consuming

After analyzing more than 3.5bi of slow queries in hundreds of applications we’ve identified that the most time consuming query is responsible for more than 50% in average of the slow query time resources. This metric is the sum of the execution time of each individual query of the same fingerprint (command without parameter values assigned). It helps one to choose the priority of their SQL tuning.

Most executed

Queries being executed too many times can point to N + 1 problems and cache opportunities. N + 1 problems generally occurs when the application fires SQL commands inside a loop structure. In most of the cases, it would be better to fire only one SQL and retrieve all the information at once, avoiding network round trips. Other times code can execute a SQL command every time a new request arrives, but that data rarely gets stale (ex: Cities of California). This is a great cache opportunity and can improve performance in orders of magnitude releasing precious database resources for the app.

Slowest executions

Slowest execution

Sometimes one single slow query can initiate a process where the overall performance starts degrading. In this section, one can easily see and check this potential candidates.

Collaborators

It’s our mission to help teams to write better SQL code. Democratizing the access to this information is fundamental. Sharing this information with multiple roles within a project is essential to get to the fix faster. You can share the reports with multiple collaborators by Slack, email and sms.

Normalized view

Hybrid and multi cloud environment are the common place nowadays. Managing multiple environments with their proprietary tools is a complex and time consuming task. NAZAR gives a normalized view of your databases in any cloud, on-premises or hybrid environment. For all databases supported. It allows database administrators and developers to manage hundreds of databases in a single tool.

Normalized view
Executions per database

Execution per databases

Shows the executions count by database. Helps a lot in environments with multiple instances running in the databases server.

Commands by hour

Shows the slow query commands distribution by the hour for the day. It allows to see the slow query count for every command type at any moment of the day. Helps a lot when there are concentrated spikes in performance degradation.

Filter by database

When there are slow queries in multiple databases is possible to filter the daily report by the database. This provides a in-depth detail of the performance issues for each separate database in a server box.

Groups

A group is a set of servers grouped by a feature the owner chooses. You can group the servers by the database engine (ex. All MySQL databases). It helps a lot when monitoring more than dozen servers.

Attention required

Attention required

Attention required is the group of servers that requires a look due to the spiked performance degradation by the day or week metrics. There are 3 different levels. Yellow, Orange and Red. The metrics evaluated are the number of executions and total time spent in those executions. The current day values are compared against the the average of the last 7 days.

Notifications

We’ve built a set of rules that triggers notifications pointing to specific problems as soon as they are identified by the system. The preferred method of notification is by Slack, but email and sms are also present and can be customized for every collaborator.

Notifications

If you have experienced any performance issue in your applications before, please try NAZAR. If you have any questions or want to know more about NAZAR, just request a demo or email us.

Finally, if you have any feedback or thoughts — I’d love to chat.

--

--