Using batch processing to improve performance when working with large datasets in Ruby on Rails

Tauqeer Ahmad
3 min read · Mar 13, 2023


As a Ruby on Rails developer, you will sooner or later need to export data from a database table to a CSV file and upload that file to Amazon S3. With large datasets, however, this process can become slow and resource-intensive, leading to problems such as request timeouts and memory exhaustion.

In this article, we will discuss a solution to this problem that involves streaming data from the database and uploading the fetched data to Amazon S3 using upload_stream.

What is the Problem?

Let’s say you have a database table with a large number of rows, and you need to export that data to a CSV file and upload it to Amazon S3. Here is sample code that exports the data to a CSV file:

require 'csv'

csv_file = CSV.generate do |csv|
  csv << ['id', 'name', 'email']
  User.all.each do |user|
    csv << [user.id, user.name, user.email]
  end
end

File.write('users.csv', csv_file)

This code fetches all the rows from the users table, builds the entire CSV in memory, and then writes it to disk. With a large number of rows this quickly becomes slow and resource-intensive, because User.all.each loads and instantiates every record at once before a single line is written.
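
To see why, it helps to compare the queries each approach issues. A rough sketch, assuming the User model from the example above and Rails’ default integer primary key:

# Loads the whole table in one query and instantiates every User object in memory:
User.all.each { |user| user.email }
#   SELECT "users".* FROM "users"

# Walks the table in fixed-size batches keyed on the primary key:
User.find_each(batch_size: 1000) { |user| user.email }
#   SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1000
#   SELECT "users".* FROM "users" WHERE "users"."id" > <last id of previous batch> ORDER BY "users"."id" ASC LIMIT 1000
#   ... and so on until the table is exhausted

The batched version, which we will use in the solution below, never holds more than one batch of records in memory.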

Solution

To solve this problem, we can fetch data from the database in batches and stream the resulting file to Amazon S3 using upload_stream. Here is sample code that implements this solution:

require 'csv'
require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new(region: 'us-east-1')
bucket = s3.bucket('bucket-name')

header = ['id', 'name', 'email']

# Write the CSV straight to disk, fetching users in batches of 1,000
# so only one batch of records is held in memory at a time.
CSV.open('users.csv', 'w', write_headers: true, headers: header) do |csv|
  User.find_each(batch_size: 1000) do |user|
    csv << [user.id, user.name, user.email]
  end
end

# Stream the file to S3 in chunks instead of reading it all into memory.
File.open('users.csv', 'rb') do |file|
  object = bucket.object('users.csv')
  object.upload_stream do |write_stream|
    IO.copy_stream(file, write_stream)
  end
end

This code streams data from the users table using find_each with a batch_size of 1000. find_each fetches records in batches keyed on the primary key, which keeps memory usage flat and reduces the load on the database. Rather than building the CSV in memory, CSV.open writes each row straight to disk, with the write_headers and headers options adding a header row to the file.
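
If you only need a few columns, one optional variation is to skip building ActiveRecord objects altogether and pluck raw values per batch. A minimal sketch of the CSV step, assuming the same users table and header:

require 'csv'

header = ['id', 'name', 'email']

CSV.open('users.csv', 'w', write_headers: true, headers: header) do |csv|
  # in_batches yields relations of up to 1,000 users at a time;
  # pluck pulls back only the three columns as plain arrays, skipping model instantiation.
  User.in_batches(of: 1000) do |batch|
    batch.pluck(:id, :name, :email).each { |row| csv << row }
  end
end

This trades a little readability for fewer object allocations, which can matter when the table has millions of rows.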

Once the CSV file is on disk, it is uploaded to Amazon S3 using upload_stream. IO.copy_stream copies the file into the upload stream chunk by chunk, and the SDK sends those chunks on to S3, so the whole file never has to be held in memory on the server.
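
If the intermediate file on disk is not needed at all, the two steps can be combined by writing CSV rows straight into the upload stream. A minimal sketch, assuming the same bucket, region, and User model as above:

require 'csv'
require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new(region: 'us-east-1')
bucket = s3.bucket('bucket-name')

bucket.object('users.csv').upload_stream do |write_stream|
  # Write the header row, then append each batch of users as CSV lines.
  write_stream << CSV.generate_line(['id', 'name', 'email'])
  User.find_each(batch_size: 1000) do |user|
    write_stream << CSV.generate_line([user.id, user.name, user.email])
  end
end

Here the data goes from the database to S3 without ever touching the local filesystem, which is handy when disk space on the server is limited.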

Conclusion

In this article, we discussed a solution to a common problem faced by Ruby on Rails developers when exporting large datasets to a CSV file and uploading that file to Amazon S3. By streaming data from the database and uploading the fetched data to S3 using upload_stream, we can improve the performance and scalability of our applications.

If you are considering developing a Ruby on Rails project, you may want to work with Xprolabs (https://xprolabs.com). Xprolabs is a leading software development company with extensive experience in Ruby on Rails development. Our team of experienced developers, designers, and project managers is dedicated to delivering high-quality solutions that meet the unique needs of our clients. You can reach us at tauqeer.ahmad@xprolabs.com.
