What’s LDV and Why Does it Matter for My Nonprofit or School? Part II

Salesforce Architects · Aug 4, 2020 · 11 min read

This is Part 2 of a three-part series on large data volumes (LDV), specifically in the nonprofit and education space.

In Part 1, we covered what we mean by LDV and why the concept is important to understand. This part describes how to design an organization with LDV considerations in mind. Part 3 covers best practices for loading data into an LDV organization.

Designing for Large Data Volumes

If you have large data volumes, or expect that you will soon, the good news is that there are ways to work with them instead of having them work against you.

Minimizing skew

First, let’s talk about the types of skew you may encounter. Generally, data skew happens when more than 10,000 child records are associated with the same parent record. The design choices you make can minimize the effect that data skew has on performance.

Ownership skew

For many nonprofit and educational organizations that are just moving to Salesforce, the concept of record ownership is new. The database applications these organizations previously used to store and access contacts and other data likely had no requirement of record ownership. Because the concept is new, they often assign ownership of all records to a single user rather than to the people who actually work with the records.

When a single user owns 10,000 or more records, this is known as ownership skew. With ownership skew, problems can arise from sharing rules associated with that single user. For example, when the user changes roles, sharing calculations are initiated to update record sharing based on the user’s new role. If the user owns a large number of records, this can cause performance issues.

Having a single user own a large number of records is acceptable, as long as you adhere to the following best practices:

  • Ideally, do not use a user account associated with a real person; for example, create a dedicated user called ‘Integration User’.
  • Place the user in a separate role at the top of the hierarchy and do not move the user out of that top-level role.
  • Keep the user out of public groups that could be used in sharing rules.
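If you want to check whether ownership skew is building up, an aggregate query can show which users own unusually large numbers of records. The snippet below is a minimal sketch using the Contact object and the 10,000-record threshold; in a genuinely large org the rows behind the aggregation can exceed query limits, so you may prefer to narrow the filter or use a report instead.

//Sketch: list users who own more than 10,000 Contact records (run as anonymous Apex)
for (AggregateResult ar : [SELECT OwnerId, COUNT(Id) recordCount
                           FROM Contact
                           GROUP BY OwnerId
                           HAVING COUNT(Id) > 10000]) {
    System.debug(ar.get('OwnerId') + ' owns ' + ar.get('recordCount') + ' contacts');
}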

Account data skew

If you are using NPSP or EDA with the household or administrative account models, you are unlikely to encounter account data skew for contacts, but you may encounter this kind of skew for other associated objects, such as Opportunity.

Symptoms of account data skew include performance issues and record locking. For example, in a bucket account scenario, when an anonymous contact or opportunity is updated, the account is locked for the duration of the update, meaning no other anonymous contact or opportunity can be updated at the same time.

Under normal circumstances you’re unlikely to encounter a record lock when manually updating records through the web UI, but when automated processes are being triggered at the same time, record locking becomes more likely.

If you must use a generic bucket account, consider creating multiple such accounts and distributing the child records evenly across them to keep the number of child records below 10,000. For example, create a new account every month or every six months, depending on your expected growth, and assign any new child records created within that period to the new account.
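As a rough sketch of what that distribution could look like, the Apex below spreads contacts that have no account across a set of hypothetical ‘Anonymous Bucket’ accounts in round-robin fashion. The class and account names are assumptions for illustration; you would call something like this from a before-insert context so no extra DML is needed.

//Sketch: distribute new anonymous contacts across multiple bucket accounts (names are hypothetical)
public class BucketAccountAssigner {
    public static void assignBuckets(List<Contact> newContacts) {
        List<Account> buckets = [SELECT Id FROM Account WHERE Name LIKE 'Anonymous Bucket%' ORDER BY Name];
        if (buckets.isEmpty()) {
            return;
        }
        Integer i = 0;
        for (Contact c : newContacts) {
            if (c.AccountId == null) {
                // Round-robin assignment keeps each bucket well under the 10,000-child threshold
                c.AccountId = buckets[Math.mod(i, buckets.size())].Id;
                i++;
            }
        }
    }
}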

Lookup skew

Lookup skew occurs when a large number of records are associated with a single record in a lookup object. Since lookups can be created on any custom object, lookup skew issues can surface at any time.

When a lookup field is added, a foreign key relationship is created between the objects. When a record is inserted or updated, the lookup target records are locked to maintain referential integrity. The more records in the lookup relationship, the greater the chance of record lock errors. This is especially true when using custom code for process automation.

A common example of this, which we see in both nonprofits and schools, is a campaign with more than 10,000 campaign members. This scenario can lead to timeouts in reports and dashboards.

To mitigate lookup skew, consider the following:

  • Review any automated processes — including process builders and Apex triggers — that access the objects in the lookup relationship, and look for opportunities to reduce the record save time. Refer to Salesforce documentation to ensure best practices are being followed. In particular, try to consolidate multiple triggers or processes into a single Apex trigger or process per object.
  • If possible, use a picklist instead of a lookup; this eliminates locking issues entirely.
  • Distribute the skew, if possible. Adding more lookup target records and spreading the child records across them can significantly reduce or even eliminate lock exceptions.
  • Reduce the load created by automated processes. Consider running automated processes serially during off-peak hours. If automated processing must occur while your end users are active and your end users are encountering locks, consider reducing the batch size to shorten the lock duration, as in the sketch after this list. This effectively prioritizes end user access over automated process access.
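For example, if a nightly Batch Apex job is contending with end users for locks on a skewed lookup parent, a smaller scope size keeps each transaction, and therefore each lock, short. This is a minimal sketch; MyNightlyJob is a placeholder for your own batch class.

//Sketch: shorten lock duration by running an automated job with a smaller scope size
//MyNightlyJob is a placeholder for your own Database.Batchable<SObject> implementation
Id jobId = Database.executeBatch(new MyNightlyJob(), 50);  // default scope size is 200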

For more details, see Managing Lookup Skew in Salesforce to Avoid Record Lock Exceptions and the Trailhead module on Asynchronous Apex.

Optimizing code for performance

Trigger and bulk request best practices

A common development pitfall stems from assuming that the system will only be used by users updating one record at a time. Of course, this isn’t the case; often there are processes, data loads, or third-party apps updating multiple records at once. Apex triggers are optimized to operate in bulk, which, by definition, requires developers to write logic that supports bulk operations.

To ensure your logic does not monopolize shared resources in a multitenant environment, governor limits are enforced. For more information on governor limits, see Execution Governors and Limits.

The following example demonstrates the correct pattern to support the bulk nature of triggers while respecting governor limits.

First, the trigger syntax:

//Trigger on Mileage__c

trigger TDTM_MileageTrigger on Mileage__c (after delete, after insert, after undelete, after update, before delete, before insert, before update) {

npsp.TDTM_Config_API.run(Trigger.isBefore, Trigger.isAfter, Trigger.isInsert, Trigger.isUpdate, Trigger.isDelete, Trigger.isUndelete, Trigger.new, Trigger.old, Schema.SObjectType.Mileage__c);
}

And the handler class:

//Handler Class for Mileage__c Trigger

global class GetMileage_TDTM extends npsp.TDTM_Runnable {

    global override npsp.TDTM_Runnable.DmlWrapper run(List<SObject> newlist, List<SObject> oldlist, npsp.TDTM_Runnable.Action triggerAction, Schema.DescribeSObjectResult objResult) {

        npsp.TDTM_Runnable.DmlWrapper dmlWrapper = new npsp.TDTM_Runnable.DmlWrapper();

        if (triggerAction == npsp.TDTM_Runnable.Action.AfterInsert) {
            // Build a set of record IDs from the records passed into the handler
            Set<Id> ids = new Map<Id, SObject>(newlist).keySet();
            // Query related users once for the entire batch of records
            List<User> relatedUsers = [SELECT Id FROM User WHERE mileageid__c IN :ids];
            dmlWrapper.objectsToUpdate.addAll((List<SObject>) relatedUsers);
        }
        return dmlWrapper;
    }
}

Here the IDs of the records passed into the handler (newlist) are collected into a set, which is used in a single SOQL query. This pattern captures all incoming records within the request while minimizing the number of SOQL queries. Note that this code has also been written to use Table-Driven Trigger Management (TDTM).

For additional examples and best practices for using TDTM, see the EDA Table-Driven Trigger Management overview or the NPSP Table-Driven Trigger Management overview.

Best practices for designing bulkified code

When processing records in triggers, use the following best practices for bulk design patterns:

  • Minimize the number of data manipulation language (DML) operations by adding records to collections and performing DML operations against these collections.
  • Minimize the number of SOQL statements by preprocessing records and generating sets, which can be placed in a single SOQL statement used with the IN clause.
  • When possible, use asynchronous calls to external systems together with a call-back mechanism.
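A minimal sketch of the first two practices, outside of the TDTM framework, is shown below. The Opportunity-to-Account logic and the Description update are illustrative assumptions; the key points are the single SOQL query driven by a set of IDs and the single DML statement against a collection. In an NPSP or EDA org you would place this logic in a TDTM handler, as shown earlier.

//Sketch: bulkified trigger — one SOQL query and one DML statement for the whole batch
trigger OpportunityBulkExample on Opportunity (after insert) {
    // Collect parent Account IDs from every record in the request
    Set<Id> accountIds = new Set<Id>();
    for (Opportunity opp : Trigger.new) {
        if (opp.AccountId != null) {
            accountIds.add(opp.AccountId);
        }
    }
    // One query with an IN clause instead of a query per record
    List<Account> toUpdate = [SELECT Id, Description FROM Account WHERE Id IN :accountIds];
    // Build up the changes in the collection...
    for (Account acc : toUpdate) {
        acc.Description = 'Has recent opportunities';
    }
    // ...and perform a single DML operation against it
    update toUpdate;
}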

Governor limits and concurrency

Governor limits are not only critical in maintaining an efficient multitenant cloud-based architecture, they are also important for managing concurrency.

When a process runs longer than five seconds, it begins to count against an organization’s limit of 10 long-running process threads. As your organization grows, and in turn the data volume begins to increase, you may reach this limit and see an error message like “Unable to process request. Concurrent request limit hit.” Concurrency issues can cause unfinished processes, timeouts during Visualforce page loads, lengthy record locks, error messages presented to end users, and ultimately, a poor user experience.

Keep an eye out for the following common causes of concurrency issues:

  • Synchronous SOQL/SOSL requests running longer than five seconds
  • Synchronous DML actions running longer than five seconds
  • External and Apex web service calls lasting longer than five seconds
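A common remedy is to move long-running work, particularly callouts to external systems, into asynchronous Apex so that it does not occupy one of the ten long-running synchronous threads. The sketch below assumes a hypothetical external endpoint and a Queueable job; your own implementation will differ.

//Sketch: move a slow external callout into a Queueable so it runs asynchronously
public class SlowCalloutJob implements Queueable, Database.AllowsCallouts {
    public void execute(QueueableContext ctx) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint('https://api.example.org/slow-endpoint');  // hypothetical endpoint
        req.setMethod('GET');
        HttpResponse res = new Http().send(req);
        // process the response here
    }
}
//Enqueue from a trigger handler or controller instead of calling out synchronously:
//System.enqueueJob(new SlowCalloutJob());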

For more information on common limits and allocations, check out the Salesforce Developer Limits and Allocations Quick Reference. When designing for LDV, be sure to review limits on the number of custom objects, number of child records for master-detail relationships, total number of records retrieved by SOQL, total number of records processed as a result of DML statements, and batch size.

Deciding where to store data

Storing data outside of Salesforce

One way to minimize the negative effects of LDV is to store data outside of Salesforce and reference it when needed. We refer to this arrangement as a “mashup,” and it has two principal potential advantages. First, the data is never stale because it is referenced from the system of record. Second, you can make use of the capabilities of the external system without having to build or replicate them in Salesforce.

Of course, there are potential drawbacks as well. Integration with an external data source can add to the overall call latency, and in some cases functionality may be limited. You can implement a mashup in a variety of ways, for example, with Salesforce Connect external objects or with callouts to an external API.

Each of these methods has strengths and weaknesses; consider them as you choose the most appropriate method for your use case. For more details, see the Integration Patterns and Practices documentation.

Deleting and archiving

You can minimize the effects of LDV by keeping your Salesforce instance as lean as possible with respect to data. This is a best practice not only from a Salesforce perspective, but also from the perspective of compliance with data regulations (e.g. General Data Protection Regulation or GDPR), purpose limitation principles, and data minimization principles.

It’s important to remember, however, that maintaining a lean instance is not just a matter of deleting records. You’ll want to consider a wide range of questions on deleting and archiving data, including:

Have you defined a hot/warm/cold classification for your data?

How are records going to be deleted/archived?

  • Apex code (a sketch follows this list)
  • A data management tool (ETL)
  • API
  • Manually
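If you choose the Apex route, a minimal sketch of a hard-delete batch job is shown below. The Archive_Flag__c field and the use of Opportunity are assumptions for illustration; in practice you would export or archive the records externally before removing them.

//Sketch: Batch Apex that hard-deletes records flagged for archiving (Archive_Flag__c is hypothetical)
global class ArchiveFlaggedRecordsBatch implements Database.Batchable<SObject> {
    global Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator('SELECT Id FROM Opportunity WHERE Archive_Flag__c = true');
    }
    global void execute(Database.BatchableContext bc, List<SObject> scope) {
        delete scope;                      // moves records to the recycle bin
        Database.emptyRecycleBin(scope);   // hard delete so storage is reclaimed sooner
    }
    global void finish(Database.BatchableContext bc) {}
}
//Run with a modest scope size, ideally during off-peak hours:
//Database.executeBatch(new ArchiveFlaggedRecordsBatch(), 200);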

How are records going to be marked for deletion or archiving?

  • Flag field on each record
  • Third-party tool (e.g. from AppExchange)
  • A data management tool that maintains retention information and archives or removes records on a regular basis

How can deleted or archived records be re-imported?

  • Mashup to an external database
  • Recall record (i.e. define a process to reinstate the record with all its data and related records)
  • Create a new record with a reference (ID) to the old record in an external database

How do you handle different retention periods for different types of data?

  • Field on each object showing retention period

How can archived data be anonymized?

  • With a third-party tool (from AppExchange)
  • Apex code
  • Data management tool (ETL)

Where will archived records be kept?

How is referential integrity going to be maintained?

Is high-level rollup data required from the records being deleted or archived?

  • Create a new custom object for storing consolidated data for reporting purposes, and use Flow or Apex code to update it regularly or as part of the delete/archive process (a sketch follows)
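As a sketch of the Apex option, the snippet below rolls up closed-won opportunity totals per calendar year into a hypothetical Donation_Summary__c object before the detail records are archived. The object and field names are assumptions; in a true LDV org you would run this in batches or restrict it by date range to stay within query limits.

//Sketch: consolidate totals into a hypothetical Donation_Summary__c object before archiving
List<Donation_Summary__c> summaries = new List<Donation_Summary__c>();
for (AggregateResult ar : [SELECT CALENDAR_YEAR(CloseDate) yr, SUM(Amount) total
                           FROM Opportunity
                           WHERE IsWon = true
                           GROUP BY CALENDAR_YEAR(CloseDate)]) {
    summaries.add(new Donation_Summary__c(
        Year__c = (Integer) ar.get('yr'),
        Total_Amount__c = (Decimal) ar.get('total')));
}
insert summaries;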

As you can see, maintaining a lean instance is not necessarily easy. Some large organizations, as part of their design, may incorporate an additional data solution such as a traditional database, data warehouse, or data lake — for analytical, reporting, or archiving purposes. Even with such a solution the challenge of deleting or archiving the correct data remains.

Database tools for query optimization

Database statistics

Salesforce, being a multitenant platform, keeps its own set of statistical information on the amount and type of data stored.

The Lightning Platform query optimizer uses this information to perform queries (generated for reports, list views, and so on) in the most efficient way possible. By default, the statistics are gathered on a nightly basis. As a result, large changes in data on the platform in a short period of time (for example, a mass delete or large upload) can have an adverse effect on query performance when the gathered database statistics are no longer representative of the current data, making it difficult for the optimizer to accurately calculate costs for various execution plans.

When you run scheduled jobs with NPSP or EDA, keep in mind that if the jobs involve large volumes of data, optimizer performance may be affected until statistics are gathered again.

Even a hard delete of a large number of records (e.g. millions) may not happen instantaneously, so plan data management activities accordingly. For example, schedule time between large-scale data changes and initiating batch processes that query the objects affected by those changes.

Indexes

The Query Optimizer uses database indexes when performing queries. Indexes are created by default for most objects (standard and custom) on the following fields:

  • Salesforce ID
  • CreatedDate
  • Systemmodstamp (LastModifiedDate)
  • Name
  • RecordTypeId
  • Foreign key relationships (such as lookups, master-detail)
  • Email (on the Contact and Lead objects)
  • Division (on the Account object)

Any fields marked as External IDs also cause indexes to be created.

In situations where queries are failing or not performing as expected, especially when processing a large number of records, you may want to create custom indexes on specific fields that are frequently used.

By creating a case with Salesforce Customer Support, you can request additional standard or custom fields to be indexed to improve query performance. For example, if a custom object has fields for unique identifiers used in reports and list views, you may want to index those fields. Indexes are not one-size-fits-all. You can work with Salesforce Customer Support to identify which fields should be indexed to address your specific performance issues.
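Index selectivity also depends on how you write your filters. As a rough illustration, assuming a hypothetical External_Ref__c field that has been marked as an External ID (and is therefore indexed), the first query below can use the index, while the second cannot, because a leading wildcard defeats it and forces a scan that degrades quickly at large data volumes.

//Sketch: filters that can and cannot take advantage of an index (External_Ref__c is hypothetical)
//Selective: an exact match on an indexed External ID field can use the index
List<Contact> hit = [SELECT Id FROM Contact WHERE External_Ref__c = 'SIS-000123'];

//Not selective: a leading wildcard prevents the optimizer from using the index
List<Contact> scan = [SELECT Id FROM Contact WHERE External_Ref__c LIKE '%000123'];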

Skinny tables

Skinny tables, also enabled by Salesforce Customer Support, contain frequently accessed fields and help avoid joins, improving query performance. For each object table that you can see, Salesforce maintains other, separate tables at the database level for standard and custom fields. This separation ordinarily necessitates a join when a query contains both kinds of fields. A skinny table contains both kinds of fields and does not include soft-deleted records.

This image shows an Account view, a corresponding database table, and a skinny table that would speed up Account queries.


If you make changes (e.g. add fields) to the query, report, or list view, you will have to open a new case with Salesforce Customer Support to update the skinny table. As with additional field indexes, you will need to work with Salesforce Customer Support to identify which fields to include in a skinny table to address your specific performance issues.

Summary

You’ve seen how to design organizations with LDV in mind by minimizing skew, optimizing code for performance, deciding where to store data, and optimizing queries. In Part 3 of this series you’ll see how to load data into an organization with large data volumes.

About the Authors

Author Chris Rolfe

Chris Rolfe is a Customer Success Architect at Salesforce.org in EMEA. As a member of the advisory team he ensures our EMEA customers in higher education and nonprofit areas are successful in their implementation of Salesforce technologies, enabling them to achieve their mission. Connect with Chris on LinkedIn.

Author Richard Booth

Richard Booth is a Customer Success Architect at Salesforce.org in EMEA. He helps nonprofit and higher education organizations make the best possible use of Salesforce technologies to deliver value and support their mission. Connect with Richard on LinkedIn.

Author Marie van Roekel

Marie van Roekel is a Customer Success Architect at Salesforce.org, based in the Netherlands. As a member of the advisory team, she works with nonprofit and educational institutions to ensure they are successful in their implementations of Salesforce technologies and best practices, enabling them to better achieve their mission.
