A Fast Way to Load Large CSV Files into Microsoft SQL Server
So you have a bunch of huge CSV files (perhaps you received them via FTP). We are talking hundreds of megabytes (MB) to several gigabytes (GB). The problem? Loading them into a database takes ages.
The goal of this article is to find the fastest *proven* way to bulk-load CSV files into Microsoft SQL Server. Let’s get into it!
Prerequisites
To follow along, it helps to have the file size conversions fresh in mind. Here they are:
- Kilobyte (KB): One kilobyte is made up of 1,024 bytes. It’s about the size of a very short piece of text (like a paragraph).
- Megabyte (MB): One megabyte is made up of 1,024 kilobytes. To give you a practical example, a medium-quality photo from a smartphone might be around 2 to 5 megabytes in size.
- Gigabyte (GB): One gigabyte is made up of 1,024 megabytes. A gigabyte can store about a few hundred photos (depending on their quality), or it’s roughly enough for a standard-definition movie.
Why Are CSVs Important?
Transferring CSV files is really important because it helps different computer systems share information, even if they use different types of software or databases.
For example, imagine one system uses a NoSQL database (like PayPal’s), while another uses SQL Server. It’s like trying to talk to someone who speaks a completely different language! To solve this problem, we can turn the information from the database into a CSV file, a simple way to organize data that most systems can understand. Then, we can easily import that CSV file into the other system. This saves a lot of time and effort compared to building new software or dealing with complicated ways of connecting the systems together.
Now, to the Experiment!
Here are the tools we will combine to build the fastest data loader possible:
- BULK INSERT command
- BULK INSERT command tweaked from SQL
- BULK INSERT command tweaked from C#
Why Is BULK INSERT The Fastest?
Microsoft themselves say that it is the best insert option for CSVs with 1,000+ rows. Here’s why:
When you use BULK INSERT, it processes data in batches, which means it can insert many rows at once instead of one at a time. This reduces the overhead and processing time associated with each individual insert operation, making the overall process much faster. Additionally, BULK INSERT can minimize logging and locking (given the right recovery model and hints, as we will see with TABLOCK below), further improving performance.
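To make this concrete, here is a minimal BULK INSERT sketch. The table name, file path, and delimiter options are placeholders I am assuming for illustration; adjust them to your own CSV and schema.

```sql
-- Minimal BULK INSERT sketch (table name, file path, and delimiters are
-- placeholder assumptions; adjust to your own CSV and schema).
BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales_2023.csv'
WITH (
    FIRSTROW        = 2,        -- skip the header row
    FIELDTERMINATOR = ',',      -- column delimiter
    ROWTERMINATOR   = '0x0a',   -- line feed (LF) row terminator
    BATCHSIZE       = 100000    -- commit in batches rather than one giant transaction
);
```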
Experiment — BULK INSERT:
Loading the CSV took: CPU time = 46125 ms, elapsed time = 46773 ms, with a plain BULK INSERT in place.
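The CPU and elapsed figures above match the format SQL Server prints when time statistics are enabled, so the measurement was presumably captured with something like the following sketch (placeholder names again):

```sql
-- Measure CPU and elapsed time for the load (placeholder names; the
-- timing output appears on the Messages tab in SSMS).
SET STATISTICS TIME ON;

BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales_2023.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');

SET STATISTICS TIME OFF;
-- Example output:
--   SQL Server Execution Times:
--      CPU time = 46125 ms,  elapsed time = 46773 ms.
```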
BULK INSERT and TABLOCK
We will now add TABLOCK to the BULK INSERT. Here is what TABLOCK does:
- Acquires a Bulk Update (BU) Lock: When you use the TABLOCK hint with a bulk-import operation, SQL Server acquires a bulk update lock on the entire target table.
- Locks the Entire Table: Unlike other locking hints that might lock only specific rows or pages, TABLOCK locks the entire table. This prevents other transactions from reading or modifying any part of the table until the operation is complete.
- Near-Exclusive Access: TABLOCK keeps other transactions off the table for the duration of the operation. The one exception is other bulk loads: BU locks are compatible with each other, which allows multiple clients to bulk load into the same table in parallel (provided the table has no indexes).
- Prevents Conflicts: By locking the entire table, TABLOCK helps prevent conflicts and maintain data integrity during bulk data operations, especially in scenarios where concurrent access might lead to inconsistencies or errors.
- Used for Bulk Operations: TABLOCK is often used in conjunction with bulk data operations, such as bulk inserts, updates, or deletes, where ensuring exclusive access to the entire table is necessary for the operation to succeed efficiently and without conflicts.
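If you want to see the bulk update lock for yourself, you can query the sys.dm_tran_locks view from a second session while the load is running. This is just a sketch; on a busy database you would filter further, for example by OBJECT_NAME(resource_associated_entity_id).

```sql
-- Run from a second session while the bulk load is in flight: a
-- request_mode of 'BU' on an OBJECT resource is the bulk update lock.
SELECT resource_type,
       request_mode,
       request_status,
       request_session_id
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID()
  AND resource_type = 'OBJECT';
```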
Experiment — BULK INSERT + TABLOCK:
Loading the CSV took: CPU time = 19312 ms, elapsed time = 19823 ms, with BULK INSERT and TABLOCK in place. More than twice as fast!
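The change itself is a single hint on the same statement; here is a sketch with the same placeholder names as before:

```sql
-- Same load with the TABLOCK hint added (placeholder names as before).
-- The bulk update (BU) lock this takes also helps qualify the load for
-- minimal logging under the simple or bulk-logged recovery model.
BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales_2023.csv'
WITH (
    FIRSTROW        = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0a',
    TABLOCK                     -- the only change from the previous run
);
```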
Disabling Triggers
Triggers are special types of stored procedures in a database that automatically execute when certain events occur, such as inserting, updating, or deleting data in a table. While triggers can be useful for enforcing business rules or maintaining data integrity, they can also slow down bulk insert operations, because the trigger logic has to process every row that gets inserted.
Loading the CSV took: CPU time = 19281 ms, elapsed time = 19809 ms, with triggers disabled as well. That is essentially the same as the TABLOCK run in this case, so the gain from this step depends on how much trigger logic your target table actually carries.
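For completeness, here is one way to disable and re-enable triggers around the load; the table name is a placeholder. Worth noting: BULK INSERT skips insert triggers by default unless you pass the FIRE_TRIGGERS option, so explicitly disabling them mainly matters when other code paths are also writing to the table during the load.

```sql
-- Disable all triggers on the target table for the duration of the load,
-- then re-enable them (placeholder table name).
ALTER TABLE dbo.SalesStaging DISABLE TRIGGER ALL;

BULK INSERT dbo.SalesStaging
FROM 'C:\data\sales_2023.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a', TABLOCK);

ALTER TABLE dbo.SalesStaging ENABLE TRIGGER ALL;
```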