The Importance of Data Quality & Quantity for Performance and Scale Testing

Shainesh Baheti
Salesforce Architects

--

If your team is tasked with building scalable applications on the Salesforce platform to meet complex business requirements, you are going to need to ensure your application performs well, particularly as it scales. That means you’ll need performance testing to identify, for example, how responsive your application is for a reasonable number of users in terms of efficient SOQL, efficient web pages, and optimized Apex code. And you’ll need scale testing to identify concurrency-related errors, governor limit errors, and similar issues that may only become apparent as the application scales to handle more customers, transactions, users, and so on.

When creating and executing performance and scale tests, the test data you use will determine how effective your tests are. Ideally, you want your test data to match your production data as closely as possible. That is, you want to have roughly the same quantity of test data as production data, and you want your test data to have the same qualities as your production data — with the same variety of data values, access limits, and so on. This post explores the importance of these two dimensions of test data — quantity and quality — in running realistic performance and scale tests that yield valuable insights.

Query plans and query optimization

As you may know, when a SOQL query is issued to retrieve data, it is processed based on a query plan designed to minimize the time needed to complete the query. Like most cost-based query optimizers, the one that Salesforce uses to create query plans relies on statistics gathered about the data. As explained in this post:

  • The database routinely collects statistics about the data in the database. For example, the number of records in each table/object, the cardinality (number of corresponding records) of particular values in an indexed field, and much more.
  • When you execute a query, the optimizer considers such statistics to calculate costs for various execution plans.
  • The optimizer then chooses the plan with the lowest cost to execute the query.

This is why it’s so important to use test data that matches your actual production data as closely as possible. If it doesn’t match well, it’s likely that the statistics collected for it will also not match, the calculated costs of queries will not match, and the query plan chosen will not match. As a result, the performance of your application will likely look very different when it is run against your test data compared to when it is run against your production data, resulting in lots of false positives and false negatives:

  • False positives: Performance and scale issues that are identified in the test environment but are not present in production
  • False negatives: Performance and scale issues that are not identified in the test environment but are present in production

False positives can result in wasted time and resources as teams address non-issues; false negatives, more importantly, can affect the business when potentially serious production issues go undetected by testing.
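If you want to see which plan the optimizer would choose for a given query against your test data, the REST API’s query resource accepts an explain parameter that returns query plan feedback (the same information the Developer Console’s Query Plan tool shows). The sketch below is a minimal Python example of calling it; the instance URL, API version, token, and query are placeholders you would replace with your own.

```python
import requests

# Placeholders: substitute your org's instance URL, API version, and OAuth/session token.
INSTANCE_URL = "https://yourInstance.my.salesforce.com"
API_VERSION = "v59.0"
ACCESS_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"

# Ask the query resource for an execution plan instead of query results.
soql = "SELECT Id, Name FROM Account WHERE Industry = 'Banking'"
resp = requests.get(
    f"{INSTANCE_URL}/services/data/{API_VERSION}/query/",
    params={"explain": soql},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()

# Each candidate plan reports a relative cost, cardinality estimates, and the
# leading operation (index use vs. table scan). Comparing these numbers in a
# sandbox seeded with realistic data vs. a nearly empty one makes the effect
# of data quantity and quality on the chosen plan very visible.
for plan in resp.json().get("plans", []):
    print(plan["leadingOperationType"], plan["relativeCost"], plan["cardinality"])
```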

Why create synthetic test data

Ideally, you would create test data by simply copying your production data to a full copy sandbox and conducting your performance testing there. In many cases, however, this is not feasible, and you’ll need to generate synthetic data. The need for synthetic data can be driven by various factors:

  • Legal. In some cases you may simply not be allowed to use production data (although you still may be able to copy metadata).
  • New use cases or features. You may be working with new custom entities or fields for which there is no data already available in production.
  • Greenfield implementation. If you are a new customer on Salesforce, you won’t have production data to work with.
  • Rapid growth. When you anticipate rapid growth, you may not have enough production data available to conduct realistic tests that reflect performance at the expected scale.

Challenges in synthetic data creation

The creation of synthetic test data requires a thorough understanding of the test data needed and then a series of steps to design, generate, and load the test data.

Understanding test data requirements

To gain a better understanding of your test data needs, ask the following questions:

What use cases require data to be populated? Performance tests only need to be performed for the commonly used use cases, so focus your test data on those.

Which test data entities are transactional, and which are static or master data that are used for lookup only?

  • Static or master data entities are used by an application to look up values for completing business transactions. This test data is generally small in terms of volume but could be complex depending on your application. For a loan processing application, for example, this data includes mortgage terms, current mortgage rates, and relevant credit profiles.
  • Transactional data is created or updated while performing a business transaction. The volume will vary based on the application. For the loan processing application, transactional data might include customer details, loan details, and other values that are inserted or updated throughout loan processing.

How much volume needs to be created?

  • For static or master data, the volume required depends on the target throughput you want to achieve. For example, if one user can perform five business transactions in a minute and your target throughput is 100K transactions per minute, then at least 20K users need to be created (see the quick calculation after this list).
  • For transactional data, the volume depends on how much production data you have. You’ll need to create roughly the same amount synthetically to conduct realistic testing.
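As a quick sanity check of the arithmetic in the example above, you can back into the required user count and a rough transactional volume with a few lines. The throughput and per-user figures are the ones from the example; the test duration is an assumption you would set yourself.

```python
# Figures from the example above (replace with your own targets).
target_throughput_per_min = 100_000   # business transactions per minute
transactions_per_user_min = 5         # transactions one user can perform per minute
test_duration_min = 60                # assumed length of one scale test run

required_users = target_throughput_per_min // transactions_per_user_min
transactional_rows_per_run = target_throughput_per_min * test_duration_min

print(f"Users needed: {required_users:,}")                       # 20,000
print(f"Transactional rows created per run: {transactional_rows_per_run:,}")
```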

Data shape design
As you design your test data, you’ll want to define:

  • A list of objects needed in your test set, as well as the fields of each selected object
  • The relationships between the selected objects
  • The parent-child data skew between selected objects
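One lightweight way to capture these decisions is a small, declarative “shape” definition that your generation script can read. The structure below is purely illustrative; the objects, fields, counts, and skew ratios are assumptions you would replace with values that mirror your production org.

```python
# Illustrative data shape definition: objects, fields, relationships, and skew.
DATA_SHAPE = {
    "Account": {
        "count": 50_000,
        "fields": ["Name", "Industry", "BillingCity", "AnnualRevenue"],
    },
    "Contact": {
        "fields": ["FirstName", "LastName", "Email", "MailingCity"],
        "parent": "Account",  # lookup relationship to Account
        # Parent-child skew buckets: (children per parent, share of parents)
        "skew": [(1, 0.10), (10, 0.60), (100, 0.25), (1000, 0.05)],
    },
}
```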

Test data generation
Once you know what your synthetic test data will look like, you need to write some code or a script to generate it. Keep in mind that complex data models and requirements will take more time to code.
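The generator itself can be as simple as a script that writes CSV files with realistic variety in every field. Here is a minimal sketch; the field names (including the External_Id__c custom field) and value pools are assumptions for illustration only.

```python
import csv
import random
import uuid

CITIES = ["Austin", "Chicago", "Denver", "Seattle", "Atlanta"]
INDUSTRIES = ["Banking", "Retail", "Healthcare", "Energy", "Technology"]

def generate_accounts(path, count):
    """Write `count` Account rows with varied, non-duplicated field values."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["External_Id__c", "Name", "Industry", "BillingCity", "AnnualRevenue"])
        for i in range(count):
            writer.writerow([
                str(uuid.uuid4()),                 # unique value for an indexed external ID
                f"Test Account {i:07d}",           # unique, human-readable name
                random.choice(INDUSTRIES),         # vary picklist values
                random.choice(CITIES),
                random.randint(100_000, 50_000_000),
            ])

generate_accounts("accounts.csv", 50_000)
```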

Test data loading

Finally, you’ll need to load your test data once it’s generated. The Bulk API is a good option for this, but make sure you follow best practices to achieve optimal throughput and avoid row locks.
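One common best practice for avoiding row locks when loading child records is to sort them by parent ID so that children of the same parent land in the same batch, and to keep batches to a sensible size. The sketch below shows only that pre-processing step, assuming a contacts.csv file with an AccountId column; the actual upload could be done with Data Loader, the CLI, or whichever Bulk API client you use.

```python
import csv

BATCH_SIZE = 10_000  # Bulk API batches are limited to 10,000 records

def batch_by_parent(rows, parent_field="AccountId", batch_size=BATCH_SIZE):
    """Sort child rows by parent ID and yield batches.

    Keeping children of the same parent together reduces the chance that two
    batches processed in parallel contend for the same parent record and
    trigger row lock errors.
    """
    rows = sorted(rows, key=lambda r: r[parent_field])
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

with open("contacts.csv", newline="") as f:
    contacts = list(csv.DictReader(f))

for i, batch in enumerate(batch_by_parent(contacts)):
    # Hand each batch to your Bulk API client, or write it out for Data Loader.
    print(f"Batch {i}: {len(batch)} rows")
```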

When you’ll need more test data

There are several scenarios in which you’ll need more test data than you currently have.

Scenario 1: Little test data exists

The most obvious case for needing more data is when you have created no — or very little — test data to start with. If you have only a few dozen rows in your test data set, but you have millions of rows in production, then in testing, all of your queries will be efficient and your response times will be short. This, of course, is a false negative because those same queries may indeed yield unacceptable response times when executed in production.

Scenario 2: Visibility mismatches

If you have not defined sharing rules for your test data entities and fields that match the sharing rules in production, then the visibility of and access to those entities will not be the same across the two environments. If in production a user has access to all entities, but in test the user does not, your tests will produce false negatives. On the other side of the coin, if in production a user has limited access, but in test the user has full access, your tests will produce false positives.

Scenario 3: Skewed data within fields

Consider a situation in which you generated test data by populating some fields for a few rows, and then duplicated those rows repeatedly to generate a large test data set. (Examples include using only “true” for Boolean values, selecting the same value for a picklist, or using the same name, city, and phone number over and over again.) This results in poor quality test data that does not well represent the variety of data and distribution of values present in production. In this case, you’ll want to create more test data that is a closer match to your production data.
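The difference is easy to see in the generation code itself. The duplicated approach and a varied approach might look like the sketch below; the field names (including Active__c) and value pools are illustrative, and ideally you would weight the pools to mirror the distributions you see in production.

```python
import random

FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elena", "Farid"]
CITIES = ["Austin", "Chicago", "Denver", "Seattle", "Atlanta"]

# Poor quality: every row is a near copy, so value distributions look nothing
# like production and selectivity estimates will be misleading.
def duplicated_rows(count):
    return [{"FirstName": "Test", "City": "Austin", "Active__c": True}
            for _ in range(count)]

# Better: draw values from pools so cardinality and value distribution
# resemble the real data.
def varied_rows(count):
    return [{
        "FirstName": random.choice(FIRST_NAMES),
        "City": random.choice(CITIES),
        "Active__c": random.random() < 0.8,  # assumed ~80/20 true/false split
    } for _ in range(count)]
```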

Scenario 4: Duplicate rows with the same data for indexed fields

Similar to the previous scenario, if you simply duplicate rows with the same data for indexed fields to create a large volume of test data, the likely result is an inefficient query plan: indexes will not be used because index selectivity thresholds will be exceeded, resulting in lots of false positives.

Scenario 5: Parent and child entity skew

If the production data set has a large number of child records associated with the same parent record, you’ll want roughly the same skew in your test data set. For example, if you have an average of 100 Contacts for each Account, then you should have the same ratio in your test data. If not, the validity of your performance and scalability tests will suffer.

If you have a mix of parent-child skews in production, then make an effort to create test data skews that at least cover the corner cases and the majority of skew cases. For example, if in production your Account:User skews range from 1:1 to 1:1000 but the majority are between 1:10 and 1:100, then create test data with Account:User skews of 1:1, 1:10, 1:100, and 1:1000 to maximize test coverage. The idea here is to cover the majority of cases without needing to create test data for each and every skew in production.
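To reproduce a mix of skews like this, you can assign each parent a target child count drawn from the ratio buckets you want to cover. The sketch below uses the Account-to-Contact example; the buckets and weights are illustrative figures, not measured values.

```python
import random

# Skew buckets to cover: (children per parent, share of parents).
SKEW_BUCKETS = [(1, 0.10), (10, 0.55), (100, 0.30), (1000, 0.05)]

def children_per_parent(num_parents):
    """Assign each parent a child count so corner cases and common skews are all covered."""
    counts, weights = zip(*SKEW_BUCKETS)
    return [random.choices(counts, weights=weights)[0] for _ in range(num_parents)]

def generate_contact_rows(account_ids):
    rows = []
    for account_id, n_children in zip(account_ids, children_per_parent(len(account_ids))):
        for i in range(n_children):
            rows.append({"AccountId": account_id, "LastName": f"Contact {i}"})
    return rows
```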

Scenario 6: No data for displayed fields

Your test data set should have values for any fields displayed on a page. This means that you need to pay close attention to dates in your test data. For example, if one of your tests involves a list view filter that shows only items from the past month and all the dates in your test data are more than a month old, then your tests will result in false negatives.
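Generating dates relative to the day the test runs, rather than hard-coding them, keeps list views and date filters populated no matter when you execute the test. A small sketch:

```python
import random
from datetime import date, timedelta

def recent_date(max_days_back=30):
    """Return a date within the last `max_days_back` days so 'past month' filters still match."""
    return date.today() - timedelta(days=random.randint(0, max_days_back))

# Example: spread a created-date field across the last month for a batch of records.
created_dates = [recent_date().isoformat() for _ in range(5)]
print(created_dates)
```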

Counterexample: More data than required

Generally speaking, there is no harm in creating more test data than you need. That said, you still need to be careful: if you have too much data, you may start seeing false positives in the form of Apex CPU time limit errors.

Conclusion

Although generating test data with the same qualities, and in the same volume, as your production data is challenging, doing so is imperative for realistic performance and scale testing. If you do need to create synthetic test data, look for an AppExchange data seeding tool or an open source data seeding tool that can simplify the process.

For more on Salesforce performance testing, see Introduction to Performance Testing on the Developers’ Blog.
