Improving RoR Performance When Inserting a Large Dataset into the Database

Monitoring, testing, and trying to improve it.

William (WilliamDesk)


I discovered that one of our APIs had response times that occasionally spiked to 10 seconds. That is far too slow, so I set out to optimize the API's performance.

This API is straightforward. Its purpose is to "Add the member's contacts to the database."

Here is some background you need to know first:

1. The contacts table schema in PostgreSQL

id: integer, primary key, not null
member_id: integer, foreign key to the members table, not null
email: string, not null
name: string
phone: string
created_at: datetime, not null
updated_at: datetime, not null

Indexes:
index_contacts_on_id, unique
index_contacts_on_member_id_and_email, unique
index_contacts_on_member_id
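
For reference, a minimal Rails migration matching this schema might look like the sketch below; this is my reconstruction, not the project's actual migration:

class CreateContacts < ActiveRecord::Migration[7.0]
  def change
    create_table :contacts do |t|
      # t.references also creates index_contacts_on_member_id.
      t.references :member, null: false, foreign_key: true
      t.string :email, null: false
      t.string :name
      t.string :phone
      t.timestamps null: false
    end

    # The unique compound index that upsert_all relies on later.
    add_index :contacts, [:member_id, :email],
              unique: true,
              name: "index_contacts_on_member_id_and_email"
  end
end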

2. Contact API introduction

Clients send a POST request with a JSON body to add one or more contacts for a member:

{
"member_id": 12345,
"contacts": [
{
"email": "contact_1@gmail.com",
"name": "John",
"phone": "+886 987654321"
},
{
"email": "contact_2@gmail.com",
"name": "Carter",
"phone": "+886 123456789"
},
...
]
}
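
For illustration, such a request could be issued from Ruby like this. The host and path are my assumptions inferred from the controller name further below, not the documented route:

require "net/http"
require "json"
require "uri"

# Hypothetical endpoint derived from Api::Internal::MemberInfoController#setup_contacts.
uri = URI("https://example.com/api/internal/member_info/setup_contacts")

payload = {
  member_id: 12345,
  contacts: [
    { email: "contact_1@gmail.com", name: "John", phone: "+886 987654321" },
    { email: "contact_2@gmail.com", name: "Carter", phone: "+886 123456789" }
  ]
}

response = Net::HTTP.post(uri, payload.to_json, "Content-Type" => "application/json")
puts response.code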

When this request reaches the server, the controller does the following:

class Api::Internal::MemberInfoController < ActionController::API
  def setup_contacts
    invalid_contacts = []
    # Run a single iteration to traverse all incoming contact info.
    # Time complexity is O(n).
    contact_infos.each do |contact_info|
      # Check that each email is valid and legitimate.
      # Time complexity is O(n).
      checker = MailCheck.call(contact_info[:email], check_exist: false)
      invalid_contacts << {
        email: contact_info[:email],
        error_key: checker.error.key,
        error_message: checker.error.message
      } and next if checker.failed?

      # Query the database to check whether each contact email exists.
      # If it does, load it into the contact object.
      #
      # If it doesn't exist, build a new object in memory to await writing.
      # Time complexity is O(n).
      #
      # ORM to SQL:
      #   SELECT "contacts".*
      #   FROM "contacts"
      #   WHERE "contacts"."member_id" = $1
      #     AND "contacts"."email" = $2
      #   LIMIT $3
      contact = current_member.contacts.find_or_initialize_by(email: contact_info[:email])

      # Assign the new parameters onto the contact object.
      contact.assign_attributes(contact_info)

      # Execute SQL to write into the DB.
      # Time complexity is O(n).
      # The ORM translates to roughly this SQL
      # (or to an UPDATE when the record already exists):
      #
      #   INSERT INTO "contacts" ("member_id", "email", "name", "phone", "created_at", "updated_at")
      #   VALUES ($1, $2, $3, $4, $5, $6)
      #   RETURNING "id"
      contact.save!
    end
    # Return the member's new contact_list information.
    success_response(current_member.contact_list)
  end
end

At its core, MailCheck uses Resolv::DNS to check whether the email domain has an MX record:

Resolv::DNS.open do |dns|
  mx = dns.getresources(@domain, Resolv::DNS::Resource::IN::MX)
  raise ServiceError.new(:invalid_domain) if mx.blank?
end
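
As a standalone illustration of the same standard-library call (gmail.com is just an example domain):

require "resolv"

Resolv::DNS.open do |dns|
  mx = dns.getresources("gmail.com", Resolv::DNS::Resource::IN::MX)
  # Each MX resource carries a preference and an exchange host,
  # e.g. "5 gmail-smtp-in.l.google.com".
  mx.each { |record| puts "#{record.preference} #{record.exchange}" }
end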

3. Observation service

We use Datadog as our monitoring service. Here is the APM view of this API's latency.

Datadog APM for this API.

Observations and Summary

  1. Time complexity is O(n).
    * Broken down in detail, it is O(n) + O(n) + O(n) = O(3n): still O(n) asymptotically, but with a 3x constant factor.
  2. The incoming contacts array has no upper limit.
    * n can grow arbitrarily large.
  3. The database is processed 2n times per request.
    * n queries to verify whether each contact already exists.
    * n writes to insert or update each contact.

Performance will be abysmal once n is large enough and the table holds too many records.

Every request hits the database hard.

The improvement plan

1. Reducing database processing from O(2n) to O(1).

Here O(2n) and O(1) count database round trips per request: the loop above issues 2n queries, while the fix issues a single statement. My solution is the upsert_all ActiveRecord method.

What does upsert_all do? Translated to raw SQL, it looks roughly like this:

INSERT INTO "contacts" (
  "member_id", "email", "name", "phone",
  "created_at", "updated_at"
)
VALUES
  ($1, $2, $3, $4, $5, $6),    -- first set
  ($7, $8, $9, $10, $11, $12), -- second set
  ...
ON CONFLICT ("member_id", "email") DO UPDATE SET
  "name" = excluded."name",
  "phone" = excluded."phone",
  "created_at" = excluded."created_at",
  "updated_at" = excluded."updated_at"
RETURNING "id"

Explanation of the raw SQL:

  • Step 1: Look up the corresponding records using the index called index_contacts_on_member_id_and_email.
    * The unique compound index on member_id and email locates every record that needs modification.
  • Step 2: If the table has no record for a given member_id and email combination, execute an INSERT for the new data.
  • Step 3: If the table already has a record for that combination, the CONFLICT condition triggers and an UPDATE modifies the name, phone, created_at, and updated_at fields.

Considerations when using the upsert_all method

  • It DOES NOT trigger ActiveRecord validations or callbacks, because it executes SQL directly against the database.
    * For example: after_commit, before_save, and other model-level checks. ( https://rails.ruby.tw/active_record_callbacks.html )
  • It DOES NOT automatically maintain the created_at and updated_at fields, so you must set them manually.
    * As a consequence, created_at is overwritten on every upsert. That is acceptable for the model being processed here, so I accepted the flaw.
  • To get the performance benefit, it is recommended to target a unique index as the key for lookups and updates.
  • All data validation and confirmation before writing must be handled manually, as the sketch below illustrates.
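
Here is a minimal sketch of those rules in action, assuming a Contact model backed by the schema above (the values are placeholders):

now = Time.current
Contact.upsert_all(
  [
    {
      member_id: 12345,
      email: "contact_1@gmail.com",
      name: "John",
      phone: "+886 987654321",
      created_at: now, # timestamps must be supplied manually
      updated_at: now
    }
  ],
  unique_by: :index_contacts_on_member_id_and_email
)
# Running this twice does not create a duplicate row: the second call
# hits the ON CONFLICT branch and updates the existing record instead.
# No before_save / after_commit callbacks run for these rows.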

Refactoring API processing

class Api::Internal::MemberInfoController < ActionController::API
  def setup_contacts
    valid_contacts = []
    invalid_contacts = []

    # Run a single iteration to traverse all incoming contact info.
    contact_infos.each do |contact_info|
      checker = MailCheck.call(contact_info[:email], check_exist: false)
      if checker.failed?
        invalid_contacts << {
          email: contact_info[:email],
          error_key: checker.error.key,
          error_message: checker.error.message
        }
      else
        # Contacts that meet the criteria are collected
        # into the valid_contacts array.
        current_time = Time.current
        valid_contacts << {
          member_id: current_member.id,
          email: contact_info[:email],
          name: contact_info[:name],
          phone: contact_info[:phone],
          created_at: current_time,
          updated_at: current_time
        }
      end
    end

    # Execute all writes at once with upsert_all:
    # a single SQL statement, i.e. one database round trip.
    current_member.contacts.upsert_all(valid_contacts, unique_by: :index_contacts_on_member_id_and_email) if valid_contacts.present?

    # Return the member's new contact_list information.
    success_response(current_member.contact_list)
  end
end

2. Caching MX record lookups

While breaking down the latency in our monitoring, I found that MX record lookups sometimes take a long time for unknown reasons, whether or not the domain exists.

Adding a short-lived cache seemed like a good idea, since well-known mail domains (e.g., gmail.com, hotmail.com) do not need to be resolved on every request.

# MailCheck method

def check_for_domain_existence!
  @domain = @email.partition('@').last
  # A cache hit means this domain resolved successfully within the last
  # 12 hours, so skip the DNS lookup entirely.
  # (Rails.cache.fetch without a block only reads, so read is clearer here.)
  dns_cache_result = Rails.cache.read(domain_cache_key)
  return if dns_cache_result.present?

  Resolv::DNS.open do |dns|
    mx = dns.getresources(@domain, Resolv::DNS::Resource::IN::MX)
    raise ServiceError.new(:invalid_domain) if mx.blank?
    Rails.cache.write(domain_cache_key, mx, expires_in: 12.hours)
  end
rescue Resolv::ResolvError => e
  # Resolver failures are treated as non-fatal: report and let the email pass.
  Sentry.capture_message("DNS lookup failed", extra: { domain: @domain, error: e })
  return
end
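
The domain_cache_key helper is not shown in the snippet above; a minimal assumption of what it could look like:

# Hypothetical helper: namespace the cache key per domain, case-insensitively.
def domain_cache_key
  "mail_check/mx_record/#{@domain.downcase}"
end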

Before and After

A week after deploying to production, we must remember to go back and review the differences.

Differences in API latency before and after deployment (p95 latency).
Differences in API latency before and after deployment (trend line of p95 latency).

Looking back at a month of observations, the p95 latency trend line is downward, decreasing from 556.33ms to 347.94ms.

The reduction is approximately 208.39ms, or (556.33 − 347.94) / 556.33 ≈ 37.5%, a 37.5% improvement in efficiency.

At the very least, we can be more confident that p95 latency has improved significantly.

But honestly, this chart is just for R&D folks or tech enthusiasts to enjoy. If you show it to a frontline business person or explain it to a customer, you may get a response like, "Hmm, okay. But my clients and I still feel it's slow."

Efficiency has improved, and the p95 latency trend shows a reduction of about 0.2 seconds. Does that make a noticeable difference to a user? I think not.

We need to return to the original purpose of this adjustment. Remember what the first screenshot was sorted by? It was Duration.

Therefore, my ultimate goal has always been to minimize the occurrence of extreme peak response times for this API.

The peak duration time is better than before.

Reducing those peaks is the only change "end users" will actually notice when this API hits peak latency, and it is what truly demonstrates the value of the time spent on these improvements.

Conclusion

For fellow engineers striving to improve performance:

  • If you want to become a more professional backend engineer, knowing how to analyze performance and which methods improve it is the stepping stone to becoming a senior engineer.
  • Finding the right balance between maintainability, readability, and performance will make you a top senior engineer.
  • Finally, if you can interpret the results and remember to compare the before and after, proving that the time and resources you spent were worthwhile, you will be recognized by your supervisors and organization that much faster.

Hello World. =D
