Database Seeding with Large CSV Files in Laravel

Aashish Gaba
5 min read · Jun 11, 2020


What is Database Seeding?

Database Seeding is a process in which an initial set of data is provided to a database when it is being installed. It is especially useful when we want to populate the database with data we need during development. This is often an automated process that is executed upon the initial setup of an application. The data can be dummy data or necessary data such as an initial administrator account.

Sometimes you may have a lot of data, say a million rows. Seeding such a big dataset may make your seeding script run for a long time, or even exceed the memory limit, depending on how you hold the data in memory. How do we deal with such large datasets efficiently, in terms of both time and memory?

The aim is to read data from CSV and insert it into our database.

Now, a simple solution would be to read all the data from the CSV file into memory and then insert the entries into the DB, one row at a time.

Now, what could possibly go wrong if we go with the simple solution?

  • Your data could be so big that holding it all in a variable while processing it in the same script may exceed your available RAM. And if there’s a way to significantly reduce the amount of RAM being used, then why not use it?
  • You could end up waiting a few hours just for your database seeding to complete.

What are other possible ways of doing this?

Reading from CSV

  • What if we could just read 1000 lines, do whatever processing we want, and then continue from the 1001st line? Wouldn’t that be so cool?
  • So even if there are 10⁶ entries, we won’t be holding all of the data in memory at once; we’d only use the memory required for 10³ entries. Read 1000 entries, store them in a variable, send them for processing, then come back and store the next 1000 in the same variable.

Isn’t this independent of the number of rows you have in your CSV?

Yes!!

Inserting into the DB

  • Instead of inserting one row at a time, we insert chunks (say, 1000 rows) at a time. Laravel supports bulk insertions through the insert method on the query builder (also available on Eloquent models), which is far faster than inserting the same number of rows one by one.
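For instance, a rough sketch only to show the difference: here $rows stands for an in-memory array of associative arrays, and XYZModel is the placeholder model used later in this post. The chunked reading itself comes next.

// One query per row: slow when there are many rows
// (assumes the attributes are mass assignable).
foreach ($rows as $row) {
    XYZModel::create($row);
}

// One query per chunk of 1000 rows: far fewer round trips to the database.
foreach (array_chunk($rows, 1000) as $chunk) {
    XYZModel::insert($chunk);
}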

You got the idea of reading in chunks, sending the chunk for processing, and then coming back again. But the question is, how do we implement it?

Well, generators in PHP come to the rescue.

What is a generator?

A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate. Instead, you can write a generator function, which is the same as a normal function, except that instead of returning once, a generator can yield as many times as it needs to in order to provide the values to be iterated over.
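As a minimal illustration (not tied to CSVs yet), a generator could look like this:

// A generator function: it yields values one by one instead of building a full array.
function numbersUpTo($limit)
{
    for ($i = 1; $i <= $limit; $i++) {
        yield $i; // execution pauses here and resumes on the next foreach iteration
    }
}

foreach (numbersUpTo(5) as $number) {
    echo $number . PHP_EOL; // prints 1 through 5 without ever holding them all in memory
}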

Let’s take an example.

There are a total of four columns. The first row holds the column names. Each of the remaining rows is an entry that we want to store in our database.
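For illustration, assume the sample file looks something like this (the actual values don't matter):

COL0,COL1,COL2,COL3
data00,data01,data02,data03
data10,data11,data12,data13
...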

arr[i][j] => represents the data of the ith row and jth column.

E.g. arr[0]['COL2'] => data02

The actual number of rows could be huge.

Let’s see the code that’ll read the entire file in the chunks of 1000.
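Here's a sketch of such a reader. The class name readLargeCSV and the method csvToArray() are the ones used in the rest of this post; the property names and minor details are just one way to write it.

class readLargeCSV
{
    private $csvFile;
    private $delimiter;
    private $iterator;

    public function __construct($filename, $delimiter = ",")
    {
        // Open the file that holds the entire data set.
        $this->csvFile = fopen($filename, 'r');
        // Keeps track of the next line that is supposed to be read.
        $this->iterator = 0;
        // Comma by default; could also be a tab.
        $this->delimiter = $delimiter;
    }

    public function csvToArray()
    {
        $data = [];
        $header = [];

        // fgetcsv() reads one line per call and returns false at the end of the file.
        while (($row = fgetcsv($this->csvFile, 0, $this->delimiter)) !== false) {
            $this->iterator++;

            if ($this->iterator == 1) {
                // The first row contains the column names.
                $header = $row;
                continue;
            }

            // Turn the row into ['COL0' => ..., 'COL1' => ..., 'COL2' => ..., 'COL3' => ...].
            $data[] = array_combine($header, $row);

            // After every 1000 data rows, hand the chunk to the caller and start a fresh one.
            if (count($data) == 1000) {
                yield $data;
                $data = [];
            }
        }

        fclose($this->csvFile);

        // If the row count is not a multiple of 1000, yield the remaining n % 1000 rows.
        if (!empty($data)) {
            yield $data;
        }
    }
}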

Don’t worry, we’ll go through it block-wise :)

The constructor

We pass the path of the file that holds the entire data. The iterator is a variable that keeps track of the next line that is supposed to be read. You could also use a tab as the delimiter, but for the sample file shown above, we’re going to use , as our delimiter.
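The corresponding block from the sketch above:

public function __construct($filename, $delimiter = ",")
{
    $this->csvFile = fopen($filename, 'r');
    $this->iterator = 0;
    $this->delimiter = $delimiter;
}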

The while loop condition

fgetcsv is a PHP function that reads one line of a CSV file per call and returns false at the end of the file, so this loop runs until the whole file has been read.
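In the sketch above, that condition is:

while (($row = fgetcsv($this->csvFile, 0, $this->delimiter)) !== false) {
    // ...
}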

We need the data[i] to be an array of the form

data[i] => [
    'COL0' => 'some value',
    'COL1' => 'some value',
    'COL2' => 'some value',
    'COL3' => 'some value'
];

In order to achieve that, we use array_combine.

The first row that we read is composed of the headers (column names for the table in which we want to insert this data).

For the rest of the rows, we want to append such an array to our $data, as in the excerpt below.
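From the sketch above:

if ($this->iterator == 1) {
    // The first row holds the headers (column names).
    $header = $row;
    continue;
}

// Every subsequent row is combined with the headers and appended to $data.
$data[] = array_combine($header, $row);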

The if condition block is the main part where the trick lies.

When it completes reading 1000 lines, the expression in this if block evaluates to true.

The 1000 entries that were stored so far have to be yielded, and we need to clear the data array for the next 1000 rows.
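That block, in the sketch above:

if (count($data) == 1000) {
    yield $data;  // hand the 1000 rows collected so far to the caller
    $data = [];   // clear the buffer for the next 1000 rows
}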

I know what you have in mind :P

What if the number of entries is not a multiple of 1000, say 100357? The first 100000 would be handled fine, but what about the remaining 357? Where are they handled?

Here is the answer.
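It's the block after the while loop in the sketch above:

fclose($this->csvFile);

// Any leftover rows (fewer than 1000) are yielded once the loop is over.
if (!empty($data)) {
    yield $data;
}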

Here, when the reader reaches the end of the CSV file and the number of rows was not a multiple of 1000, the remaining entries (n % 1000, where n is the total number of entries) are sent for processing from this block of code.

This covers reading the file efficiently and sending the chunks off for processing. But what about the place where you receive the data sent by this CSV reader?
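A seeder built around this reader could look something like the following sketch (the seeder class name, model namespace, and file path are placeholders; readLargeCSV is assumed to be autoloaded or included):

use Illuminate\Database\Seeder;
use Carbon\Carbon;
use App\XYZModel; // placeholder model

class XYZTableSeeder extends Seeder
{
    public function run()
    {
        // Placeholder path to the large CSV file.
        $csv_reader = new readLargeCSV(database_path('seeds/csvs/xyz.csv'), ",");

        foreach ($csv_reader->csvToArray() as $data) {
            // $data holds at most 1000 rows (n % 1000 in the last chunk).
            foreach ($data as $key => $record) {
                // Bulk inserts skip Eloquent's automatic timestamps, so add them manually.
                $data[$key]['created_at'] = Carbon::now();
                $data[$key]['updated_at'] = Carbon::now();
            }
            // Bulk insert the whole chunk in one query.
            XYZModel::insert($data);
        }
    }
}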

$csv_reader is an instance of the readLargeCSV class that I mentioned earlier.

$csv_reader->csvToArray()

This call gives us 1000 rows at a time (n % 1000 in the last chunk), which we store in the $data variable.

Laravel doesn’t add the timestamps (created_at, updated_at) in the case of bulk insertions. So if the table you’re seeding has timestamp columns, make sure you add them for every entry (as shown in the code above).

You’re always free to do any preprocessing before you actually insert the data into the database.

The final step is the bulk insertion of the data.

XYZModel::insert($data);

XYZModel is the model of the table in which you want to seed the data.

There might be other ways to do this, maybe even a better way. I tried a few other things but found this one to be the best in terms of time and memory.

I hope the blog pretty much explains the approach. In case of any queries, please leave a comment below. I’d be happy to help you out :)

Thanks,

Aashish
