Lessons learned from optimizing performance in multi-layered .NET projects

Published in

ELCA IT

17 min readAug 9, 2022

“Do you want one week more holidays?” I still remember the conference talk, which started with this question. The explanation was that the UI of a product was optimized in the way, that a user could perform their task faster than before and could then save up to one week per year. So, performance of an application really does matter. In general, performant applications help the users to execute their work efficiently. Moreover, when the user can choose, which application or web site to choose, the performance is definitely one of the criteria.

We at ELCA have also had our challenges when wanting to reach the needed performance. For example, one challenge was to rewrite a 12 hour process to run within 30 minutes. Another was to analyze, why an application performs poorly also with a small amounts of data. In this post I go through our experiences with performance issues and their solutions in a .NET projects. Please note that we are not affiliated with any of the vendors of the tools mentioned in this post.

Analyze

When narrowing down the performance issue, following questions help to get started:

As mentioned in fellow article “Tools to detect performance issues in JAVA web applications”, performance can be measured in many ways: For example with availability, response time and utilization. Which kind of performance is required from your application? Should it be highly responsive, meaning focus on interactive actions? Or should it process large amount of data performantly, even with the cost of responsiveness?
Are all the users and the use cases impacted or just some of them? If just some of them, what is the difference between the impacted and non-impacted? For example, is there a difference in amount data or visibility of functions between the users?
Does the performance issue occur constantly or only in certain circumstances? For example, does the issue occur only, when the first person logs in, only at the end of day or only, when there are many users or many parallel operations?
Does the issue occur when the application is used by a person or by another system, like a batch job?

Below an overview of the layers, where the performance issue can be located:

Layers: Database usage, data transfer, data processing, serialization, third party services, network — Overview of the layers, image by author

It is important to recognize, in which layer(s) or component(s) the performance issue occurs. Possibilities are for example:

Network
Serialization
Code
Database
Third party services

Note that performance can mean a different thing for different stakeholder. For example, database expert and backend developer can be interested about how fast the query runs and how fast the request is executed. A frontend developer can be interested about how fast the page is filled. But an end user is interested about when the whole page is ready to use, when all the parallel processes are ready and when the user can finish the whole process described in a story. It is important to monitor and measure the performance as an end-to-end process and also for each layer and for each service (third party service or micro service) to recognize, which to analyze further.

Overview of tools for handling the performance:

Logos of the below mentioned tools — Overview of performance tools, image by author

Tools for analyzing the performance issues

ANTS Performance profiler — Analyzing .NET performance
dotTrace — Analyzing .NET performance
Dynatrace .NET profiler — Analyzing .NET performance
Profiling in Visual Studio Ultimate
WireShark — Analyzing network, shows for example redirects and protocols, which might be not visible when using SDKs

Tools for performance testing

Apache JMeter — Free Java-tool can also be used to test .NET applications. Can be integrated with Jenkins with performance plugin. Before usage, wait fix for the currently open XXE vulnerability.
Gatling — good for high load testing, enterprise version is pricey, but open source version is enough for many basic test cases. Enterprise version can be integrated with Jenkins with a plugin
k6 — Open source, can be integrated with Jenkins with shared libraries after custom parsing of the result-file.
Locust — Open source, needs python skills

Tools for performance monitoring

New Relic — Monitors the code through a .NET language agent and reports of delays on a dashboard
Azure App Insights — Uses SDK or agent for monitoring the application performance

Possible causes and solutions

Let’s talk about possible causes of performance problems, ways to recognize them and possible solutions.

Database layer

Analysis of the queries

Turn on tracing in the database to see, which queries are called, their duration and amount of calls. Analyze more in detail the queries, which take longest times and the queries, which are called the most often. For example, if same query appears often during a short period of time, it could be a symptom of N+1 problem. N+1 issue means that a list of objects is searched, and then for each object the related objects are search, each one in a separate query. This could be resolved with a joined query or a view. If a joined query needs more than 5–7 tables, then simplification of the data model or at least denormalization could help to simplify the query.

Analysis of the slowest queries

Check whether the queried tables are correctly indexed. Are there indexes in the fields, which are frequently used in where-clause? Does the query perform a full-table scan? This can be checked with analyzing the execution plan. See more details here: https://docs.microsoft.com/en-us/sql/relational-databases/performance/execution-plans?view=sql-server-ver16. The queries should be written in the way, that no unnecessary full table scans are performed on large tables.

Analysis of queued queries

Are there more parallel queries than there are open connections? Check whether every query is really needed with the hints of the following application layer chapter. Note that also stored procedures use cursors and need open connections, not only the queries from the code. Check whether the connections get released after the usage. If an object opened the connection in the code and the object is not released, it is possible that the connection stays open for longer time as needed. Below an example of the query, what we used for seeing which sessions are open in our Oracle database:

If the analysis shows that all the queries are indeed needed, then increase the amount of open connections in the database connection configuration.

Resolving deadlocks

If deadlocks appear in the logs, then verify that the transactional level (snapshot isolation) is configured as needed. Configuration depends on the tradeoff between consistency and concurrency. Another solution is to shorten the transaction scope in the code. Verify in the code whether for example the reading operation could be outside of transaction, when the parallel edit of the same data is very unlikely.

Summary of the possible issues in the database layer:

Possible causes: Invalid indexing, dead locks, long queries, queues, unneeded data queried, many queries — Overview of the possible issues in database layer, image by author

Application layer

Analysis of fetching time and amount of data

Verify whether too much data or too few data is fetched. For example, fetching a blob-field from each row before knowing whether is needed, might be unnecessary. In some cases it is better to lazy load some of the fields or related objects. In another cases, it is better to have one big query to eager load the data instead of lazy loading, to avoid the N+1 problem. The right way depends on the use case.

Using a framework like Dapper, EntityFramework or NHibernate helps to write code in a higher abstraction level. But it can also cause performance issue, as it is not more visible, when the loading of the data exactly happens. Make an explicit decision in the code about which data is lazy loaded and which is eager loaded.

When amount of handled data is not known, apply paging. For example, usage of the function and network latency can help to define the page size. With high latency, smaller batch size is preferred. Then the process does not block other processes for too long time and does not run into a timeout. Optionally, batch size and max amount of fetchable data can be configured, to avoid fetching the full table and blocking the other processes.

Analysis of time taking pieces of code

Some aspects of the code can take surprisingly much time, in comparison to what is the benefit of them. For example, in a project we noticed with the help of the ANTs tool, that logging took 10 % of the time in each function. Also usage of reflection took larger amount of time as expected. In these cases, we replaced the generic solution with more specific one, which was more optimized for performance. What I learned, was that when a code piece for resolving a class name is used often, then instead of the elegant one-liner of resolving a class name dynamically with reflection, I rather hard code or pass the class name as parameter, to avoid the time used for the reflection.

Another source of delays can be unnecessary instantiation of classes. There are classes like HttpClient, which are meant to be instantiated only once and not for each usage. If these classes are instantiated multiple times, it can take unnecessary amount of time.

Analysis of transferred data

When much data is transferred between layers, verify the format in which the data is transferred. In some cases, it is faster to zip/unzip the data before sending/receiving. Also the format can make a difference: For example, sending as binary takes less space than sending in xml- or in json-format. In a project we decided to send a large data objects as zipped binary. However, nowadays binary serialization should only be used when the data source is known and trusted. See more details in https://docs.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide. Zipping can be done with the GZipStream-class:

Example based on full code in here https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.gzipstream?view=net-6.0

In our project we used the binary-serialization and zipping only for certain largest requests. Handling also smaller files would have created an overhead and not reasonable performance optimization. Next time when I need a more generic approach based on end points, I might use a different approach. A good option is to enable HTTP gzip-encoding from server endpoint instead of using a custom encoding. With the encoding it is possible for example to define, which mime-types will be compressed. Note that when the compression is enabled for https, then the possible security risks should be handled for example with antiforgery-tokens. Below an example using GZip-compression for PDF-files:

Example is based on the sample here: https://docs.microsoft.com/en-us/aspnet/core/performance/response-compression?view=aspnetcore-6.0

Analysis of synchronous calls

If there are many synchronous calls to time-consuming I/O operations, then the performance can be improved by using asynchronous methods. In synchronous code, when an I/O operation starts, the executing thread enters to a wait-state and does not perform any operations until the I/O operation is ready. In asynchronous code, instead of going to a wait-state, the thread is freed to perform other operations. The difference looks like this:

In synchronous code, the process is waiting the IO. In asynchronous code, the process can perform other tasks, while IO-task is running — Comparison synchronous and asynchronous code, image by author

In .NET, a call to a synchronous function can be executed by a Task to make the execution asynchronous. For example in following code snipped the Import-function runs asynchronously even when the ImportFile-function itself is synchronous.

More best practices for .NET optimizing can be found here:

https://docs.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-6.0

I advice to read about these best practices and to make them well known for the team. When such best practices are followed from the beginning, the project is less likely to run into performance issues.

Summary of possible issues in the application layer:

Unnecessary big response, many calls, fetching static resources, unnessary prosessing, slow IO — Overview of the possible issues in application layer, image by author

Network-layer and resources

Analysis of network latency

If the application is running in the cloud, verify that the locations are geographically close to the possible usage of the application. Specially, when running the application in the hybrid model, where part of the components are in the cloud and the part on premises, the latency can have a significant impact to the performance. Is there data, which could be placed to the content delivery network (CDN), which would be located close to the user and cache the static data like images and videos? Then fetching these resources would not cause a call to the application server.

Analysis of the resources

Do the different components, like application and database, have enough resources (CPU, RAM, GB)? Are the resources used only by this application or are they shared with another applications? For example, is there house-keeping queries running in the database in the night, which could impact to a nightly import process? It is also good to verify, that the configuration of the single component is optimized for the usage. For example, in our project, we notice that an application pool in IIS was configured to use only a quite low amount of available resources. The impact was that the requests from the browser could not be handled very fast by IIS because of limited CPU and at the same time, the resources of the host were far from fully being used. The solution was to analyze, what is the optimal percentage of the CPU for each running application pool. Another option could surely be another host, if a single IIS gets crowded. Another example could be, that on the host there are other processes like Antivirus running, which could check the network calls and the transferred files. Even if the application is running in the cloud and supports auto scaling, it is good to verify, that application does not use unnecessary costly resources.

The limitation of CPU usage is configured in IIS for each application pool under “Advanced settings”, CPU:

More good practices on tuning IIS 10.0:

https://docs.microsoft.com/en-us/windows-server/administration/performance-tuning/role/web-server/tuning-iis-10

Summary of possible issues in the network layer:

Sharing the bandwidth, Not using the possible bandwidth, Distance, Issues with bandwidth — Overview of the possible issues in network layer, image by author

Architecture and design

Analysis of needed data consistency

If a function is indeed time-consuming, consider alternative ways of implementing the story. For example, could an operation run as a background job or as scheduled job? Or can data be cached? Both solutions contain a tradeoff between consistency of the data and performance.

Let’s talk first about the batch-jobs. If the performance issue is related to fetching or combining data from external services, an option is to fetch or prepare the data to a faster local storage regularly. For example, a full dataset could be fetched ever night from the third party service instead of asking single data records during the day. Often fetching full dataset is less error-prone than fetching a delta.

Another option is using a cache. Cache can be in many levels, for example:

Database: materialized views
Code: IMemoryCache (https://docs.microsoft.com/en-us/aspnet/core/performance/caching/memory?view=aspnetcore-6.0)
Component: Request cache, for example Redis-cache.

Selection of the component depends on from which usage the clients profit the most. For example, can one view be used by many clients or is formatting done just before sending a response to a client? In the first case, database view would be optimal, in the second case the request cache. There are also specific caches like map proxy for GIS-data. When planning a cache, consider how long the data is cache, minutes or days and when to refresh cache, scheduled or when the data is changed. Design application in the way, that it can handle eventual consistency. This can be done for example by optimistic locking, which prevents overwriting changes done by another user.

For web applications, pay attention to the amount of requests from client to server. Browsers have a limit of how many parallel requests are send. For example, Edge can send 6 parallel requests. After that, requests are queued (stalled) at the client, before they are sent to the server. Source of many requests can be for example, when each image, JS-file, CSS-file and other resource-files are loaded a single files or if many short rest-calls are made to fetch the data. The usual solution is to bundle, possibly also minify and obscure the text-files to have less files to load. Also images can be bundled. We used for example image sprite to have load only a single image containing all the icons instead of over 90 images. Then when using the image, only a part of the main image is shown. This way, we were able to improve the performance of the application start with a few seconds.

Here an example how the sprite configuration looks in the sprites-less-file and in the less-file of the component. The icon’s name and position are defined in the sprite’s less-file. The usage of the icon is defined with .sprite-image-definition.

Definition in sprites.less-file:

Usage in an component.less:

An example of a part in the base-sprite.png:

6 Icons — Part of sprite image in our project, image by author

To recognize the whether the requests are queued, have a look on the DEV tools of the browser. Below for example the white block shows that the request was queued for 2 seconds before it was executed.

Request was in the queue over 2 seconds — Screen capture of a slow request, image by author

Analysis: Is the logic placed on the right component

Verify whether the business logic is placed in the right component. For example, it is possible to format and combine data in many layers: in database as part of the query or stored procedure, in backend and in frontend. But, if processing the data in the database layer causes database not to serve the incoming calls on time or if combining the data in the frontend causes to load too much data to frontend, then the location of the business logic is in the wrong place. When reviewing the architecture, check whether logic can be moved elsewhere to make the division of the processing less blocking. This is challenging specially in the microservices where data is often joined only in frontend after receiving it from the different services. This can cause unneeded data to be loaded, as it gets filtered out in the frontend. When designing microservices, it is important to split the services in the right way to keep up with the needed performance

Analysis of memory consumption

Usually the garbage collector runs in the background when needed and has no visible impact to the performance.
But when it becomes under pressure, then it needs resources for cleaning up the objects and there is less resources available for running the application. This is visible for end user as slow performance.

To avoid putting the garbage collector under stress

Use structs instead of classes for small, short living objects to use less memory.
Use StringBuilder when concatenating many strings to reduce allocation
Avoid memory leaks by knowing what is normal memory consumption and analyzing the leaks.
Avoid using the GC.Collect method as it would be a blocking and time consuming call

We faced recently this issue in two projects. In the first project we used service instances of a product. The instances started with initial memory consumption containing cached data, increased the memory consumption during data processing and released it after being in an idle state for configurable amount of time. When the application was under pressure, the idle time was never reached and the instances kept their memory consumption high. When the orchestrating service noticed that less than 20% memory was available, it started restarting the instances to release memory. This was visible for the client as slow performance as the instances needed time to start-up, when a request came.

We were able to analyze this by monitoring the resources during heavy load and analyzing the lifecycle of the instances. The issue was fixed by adjusting the configuration of the idle time and increasing memory. In the second project the application was getting slower during the normal usage of a normal working day. With the help of ANTs tool, we found out that not all the resources of images were released and when there was less memory available, the garbage collector started working more often slowing down the other processes. The solution was to change usage of the images and other releasable resources into using-statements instead of having them in the central service, so that they get released earlier.

More best practices for performant .NET architecture can be found here:

https://docs.microsoft.com/en-us/azure/architecture/performance/

Summary of possible issues in the architecture:

Consistency, cache, bundling, memory usage, materialized views — Overview of the possible issues in architecture, image by author

Design considerations

Lastly, a few words about how to prevent performance issues in the early phase of a project. In another words, what to consider, when starting the implementation of the story.

In general, prefer architecture without too many layers. This is because possibly the data is serialized and send over network between the layers, which can increase the processing time.

Find out, how often the function is planned to be used and what is the expected amount of processed data. For example, is the function called by an admin user once a year or by another service every minute. Or is the function called for a few data rows or as part of batch-process for larger amount of data. Optimizing too early or too much adds unneeded complexity to the solution.

As mentioned in the beginning, it is important to recognize, which non-function requirement is more driving for the application: responsiveness meaning interactive actions and optimization for response time or processing meaning optimization for throughput. If the preference is clear, then the same optimization ways can be used for the whole application, either focusing on the responsiveness or the performant processing. If on the other hand, application needs to support both, then a good option is to create different services for each purpose: handling customer inputs and processing. This can be done for example with the CQSR, command query responsibility segregation pattern. In this pattern, there are services for editing (command) and nodes for reading (query). The services might use even a different database, which is optimized for the needed operations. The trade of is eventual consistency of the data. When designing the application, try to identify large patch jobs, which don’t need the real-time data. For example, a when there is a job, which runs every night, month or year, an option could be to copy the needed data to a separate data store and process it in a separate node. This way the performance of the main application could be less impacted. When needed, after the batch processing, the data can be written back the main database or it can be used from the separate data store as an own micro service.

Before releasing an application, it is a good practice to have reference values of the performance of the key components. The measurements should be taken from an environment, which is similar or comparable with a production environment. The initial values can be received for example by running performance tests. After the release, the tests can be run regularly in a non-production environment to see, whether the values change after a new release. The values should be defined for example with following format “95 % of calls to LoadData-service with maximal 1000 rows of data during max 100 users should take max 1,2 seconds.”

Conclusions

To sum up, for our performance issues we have found root cause and possible solutions in the peculiarities of each layer and component, in addition to the design. I advice first to take a step back and make an overall end-to-end analysis to identify the possible layer. It might not be the typical one or the one where you have the most expertise. After identifying the possible component, then it is time to dive deeper to the analysis of the component with the suitable tools.