SQL Server for developers: Improve data loading performance using BULK INSERT & SqlBulkCopy

Introduction

Every developer who has ever worked with any kind of DB knows how to insert new rows into a table. INSERT is one of the basic DML statements in SQL. For example, if we have a table with 3 columns (an ID column which is an IDENTITY column, an int column and a text column), our INSERT statement will look something like this:

INSERT INTO BulkInsertDemo(NumCol, LongTextCol) VALUES (1, 'BlahBlahBlah')

Nothing interesting so far.

In many projects nowadays we use some kind of ORM, such as Entity Framework or NHibernate. In those projects we won't write the INSERT statement manually; it'll be auto-generated by the ORM. As far as SQL Server is concerned, it doesn't matter. All it sees is a simple INSERT statement.

The Problem

When doing an INSERT to add a single row, 10 rows or even 1,000 rows, the time it takes is most likely reasonable. But what happens when you run INSERT 100,000 times to add many rows? That takes far longer. In the following graph you can see how long it takes, in milliseconds, to add n rows. Each inserted row contains two int columns and one nvarchar column with 380 characters.

[Graph: time in milliseconds to insert n rows using individual INSERT statements]

Of course, it makes sense that the more rows we insert, the longer it takes. But let's try to analyze what we have here – what happens when we run many INSERT statements one after another:

  • Network latency: Any request, as small as it might be, that has to go over the network has some overhead. In my example we almost don't suffer from this overhead, since I ran the test with the application and the DB on the same server. But in a real-world scenario, we'll usually have a separate DB server, and every access attempt 'pays' the cost of network access. It may be just 1ms if we're in the same LAN, but that adds up to big numbers when doing multiple INSERTs.
  • Locks: When we write to a table we have to take some logical locks on the table (and some latches – lightweight 'locks' on some internal objects). In some scenarios, if our DB is busy, we may have to wait to acquire these locks.
  • CPU cost: Obviously, an INSERT involves some CPU cost. It may be small, but it still exists.
  • Transaction Log writes: A modification doesn't touch the DB data files directly; first we write a "transaction log record" that describes the operation we're going to perform into a separate file (called the transaction log). This enables rollback if required, and keeps the data consistent. You can (and should!) read more about the transaction log here.
    This means that every write to the actual table also involves writing a record to the transaction log.

What's our goal? To improve these times when we want to insert large bulks of data into our DB. We actually want to reduce the overhead involved in each insert.

The Solution: BULK INSERT

SQL Server includes built-in support for this "data loading" scenario, which is actually quite common. This support comes in the form of a T-SQL statement called BULK INSERT, which accepts a path to a file full of data and a path to another file that describes the format of that input file.

That's nice – but of course, as software developers we want to do this from code, preferably C# code (or whatever language and platform we're using; I'll give my examples in .NET). For that, we have a built-in class in the .NET Framework called System.Data.SqlClient.SqlBulkCopy.

Let’s see a short usage example:

//Prep data
DataTable dt = new DataTable();
dt.Columns.Add("NumCol", typeof(int));
dt.Columns.Add("LongTextCol", typeof(string));
for (int i = 0; i < rows; i++)
{
    var row = dt.NewRow();
    row["NumCol"] = i;
    row["LongTextCol"] = _longString;
    dt.Rows.Add(row);
}

//Load data
using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "BulkInsertDemo" })
{
    bulkCopy.ColumnMappings.Add("NumCol", "NumCol");
    bulkCopy.ColumnMappings.Add("LongTextCol", "LongTextCol");
    bulkCopy.WriteToServer(dt);
}

Before we figure out what exactly we're doing in that code snippet, let's take a look at a graph that compares the time it takes to add multiple rows using SqlBulkCopy with multiple INSERT statements (just to be clear, this benchmark is only meant to give a sense of the time difference – if you're looking for a more detailed comparison, there are multiple benchmarks on the internet):

[Graph: time to insert n rows, SqlBulkCopy vs. multiple INSERT statements]

As you can see, that is a dramatic difference, especially when we look at the growth rate of the graph. Inserting 200,000 new rows into the table using multiple (200,000) calls to INSERT takes about 39.5sec, while using SqlBulkCopy takes around 3.1sec (and that includes the time of creating the DataTable and adding data to it; the actual WriteToServer operation took only 1.8sec).
Even for smaller amounts of data, BULK INSERT has significantly better performance – for 1,000 rows it's 17ms for BULK INSERT and 205ms for multiple calls to INSERT.

So, now that we've seen that it works and gives us a real benefit, let's take another look at the code snippet from before.

The first thing we do is create a DataTable object and fill it with some data. The reason is that WriteToServer, the SqlBulkCopy method that does the actual data insertion into the table, has an overload that accepts a DataTable (we'll see more options later on).
After filling the DataTable, we create an instance of SqlBulkCopy with the connection string we want (we don't give it any other parameters, so it just uses the defaults). In addition, we set DestinationTableName to the name of the table we're inserting the rows into.

Next, we create a mapping between the names of the columns in the table and the names of the columns in the DataTable. In our example they're identical, so we simply map each column name to itself.

Then we call WriteToServer, which does the actual data write to the table using BULK INSERT. There's also an async version called WriteToServerAsync that can be used when working with async-await.
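
For example, a minimal sketch of the async variant, assuming the same DataTable (dt) and connection string (_connectionString) as in the snippet above:

//Sketch: async version of the same load, so the calling thread isn't blocked during the copy
using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "BulkInsertDemo" })
{
    bulkCopy.ColumnMappings.Add("NumCol", "NumCol");
    bulkCopy.ColumnMappings.Add("LongTextCol", "LongTextCol");
    await bulkCopy.WriteToServerAsync(dt); //must be called from an async method
}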

API

Working with DataTables just for the sake of the BULK INSERT doesn't look so good, and there isn't a real reason to do so if we already have some abstraction for the tables in use (if we're working with an ORM, for example).

If you're working with Entity Framework, you can use EntityFramework.BulkInsert. That's a NuGet package that adds BULK INSERT functionality to your DbContext object. It lets you call context.BulkInsert and give it an IEnumerable of the entities you want to add, and then enjoy the BULK INSERT performance.
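
A minimal usage sketch – the LogEvent entity and MyContext names are hypothetical, and the exact signature may vary between package versions:

//Sketch: the BulkInsert extension method added by the EntityFramework.BulkInsert NuGet package.
//LogEvent and MyContext are hypothetical names used only for illustration.
var logEvents = Enumerable.Range(0, 100000)
                          .Select(i => new LogEvent { NumCol = i, LongTextCol = "BlahBlahBlah" })
                          .ToList();

using (var context = new MyContext())
{
    context.BulkInsert(logEvents);
}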

If you're using a different ORM, or have your own abstractions over the DB tables, then you should read this StackOverflow answer, which contains a basic implementation of an ObjectDataReader (WriteToServer has another overload that accepts an IDataReader). This allows you to use SqlBulkCopy with your own objects (using a 1:1 mapping between property names and column names, or using attributes that define that mapping). It saves you the need to use a DataTable directly and to copy all the data you already have in your objects into that DataTable.
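
Assuming you have such an ObjectDataReader<T> (the implementation itself comes from that StackOverflow answer and isn't shown here), the usage could look roughly like this:

//Sketch: feeding SqlBulkCopy from objects via the IDataReader overload of WriteToServer.
//ObjectDataReader<T>, LogEvent and GetEventsToInsert() are assumed/hypothetical.
IEnumerable<LogEvent> eventsToInsert = GetEventsToInsert();

using (var reader = new ObjectDataReader<LogEvent>(eventsToInsert))
using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "LogEvents" })
{
    bulkCopy.WriteToServer(reader); //the overload that accepts an IDataReader
}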

Why is BULK INSERT Faster?

Reducing network overhead

While working with SQL Server installed on a remote machine, we "pay" with latency for each packet we send. If, for example, we have 1ms latency per packet, then doing 200K INSERTs one-by-one will cost us at least 200,000ms just on network latency. That's over 3 minutes, just for network I/O.
When using BULK INSERT we combine the data of multiple rows into a single packet, and so reduce the number of packets and the time wasted on network latency.
Of course, it doesn't all go in just one packet, because the packet size is limited (by SQL Server, and even by the physical layer – we're limited by the MTU), but even so we can significantly reduce the number of packets we send.

Reducing the SQL Server overhead

For every INSERT we make, SQL Server has some work to do. It needs to find the page the data should be in or allocate a new one, it needs to create (or re-use) an execution plan, and if we use page compression (for example) it has to decompress and then compress the data – all of this for each INSERT statement.
If we send multiple rows in a bulk, then instead of "wasting" this overhead on every single row, we group rows together, and SQL Server has to do less redundant work.

Minimal-logging

If our DB is in the Simple or Bulk-Logged recovery model (if you don't know what a 'recovery model' is, think of it as the policy of how data is written to the transaction log and how it gets removed from it – is it removed automatically when it's no longer needed, or only after it was backed up somewhere else?), then SQL Server enables a feature called minimal logging (it has some additional prerequisites too). When minimal logging takes place, we write less data to the transaction log and improve our I/O performance, which makes loading data faster.
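
As a rough illustration only, switching a database to the Bulk-Logged recovery model before a large load could look like this (MyDb is a placeholder name, and changing the recovery model affects your backup strategy, so don't do it without understanding the consequences):

//Sketch only: switch a (placeholder) database to BULK_LOGGED before a big load.
//This has backup/restore implications - coordinate with whoever owns the backups.
using (var connection = new SqlConnection(_connectionString))
{
    connection.Open();
    using (var command = new SqlCommand("ALTER DATABASE [MyDb] SET RECOVERY BULK_LOGGED", connection))
    {
        command.ExecuteNonQuery();
    }
}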

Things you should know when working with BULK INSERT

Now that we've seen the basic usage of BULK INSERT using SqlBulkCopy, there are some things you should know about how BULK INSERT works. All of them can be controlled when working with SqlBulkCopy.

BULK INSERT Flags

In the SqlBulkCopy constructor, we can use the SqlBulkCopyOptions enum to set some options for the BULK INSERT operation. For example:

using (var bulkCopy = new SqlBulkCopy(_connectionString, SqlBulkCopyOptions.TableLock | SqlBulkCopyOptions.FireTriggers) { DestinationTableName = "BulkInsertDemo" })

Let's go over some of these options:

  • CheckConstraints: Controls whether SQL Server will enforce constraints on the data inserted using SqlBulkCopy. By default, constraints are not enforced when using SqlBulkCopy.
  • FireTriggers: When set, AFTER INSERT triggers will be fired for the rows inserted using SqlBulkCopy.
  • KeepIdentity: By default, values supplied in the inserted rows for IDENTITY columns are overridden by new values SQL Server generates for that column. If set, the IDENTITY columns keep the values given by the user (similar to SET IDENTITY_INSERT ON).
  • TableLock: When set, the table is locked while the new rows are inserted. This lets us enjoy the benefits of minimal logging.

Another important setting we can control is the BatchSize property. By default, when we call WriteToServer all the rows are inserted as a single batch. We can make the BULK INSERT treat each batch as a separate transaction (by setting the SqlBulkCopyOptions.UseInternalTransaction flag), so if one batch fails, the rollback is limited to that batch only (and not to all the rows inserted by the BULK INSERT).
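
For example, a sketch that loads the rows in batches of 5,000, each batch in its own internal transaction (5,000 is an arbitrary number – the right value depends on your row size and workload):

//Sketch: insert in batches of 5,000 rows, each batch in its own internal transaction
using (var bulkCopy = new SqlBulkCopy(_connectionString, SqlBulkCopyOptions.UseInternalTransaction)
{
    DestinationTableName = "BulkInsertDemo",
    BatchSize = 5000
})
{
    bulkCopy.ColumnMappings.Add("NumCol", "NumCol");
    bulkCopy.ColumnMappings.Add("LongTextCol", "LongTextCol");
    bulkCopy.WriteToServer(dt);
}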

Another important setting that isn't even part of the SqlBulkCopy parameters is the packet size. We can add this setting to the connection string we're using to increase the insertion throughput. The default is 4096, but if we're working in a LAN that supports jumbo frames (MTU up to 9000) then we probably want to at least double this number. Such changes should be made with caution, since they can affect the memory usage on the server.
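
For example, using SqlConnectionStringBuilder (8000 here is just an illustrative value – measure before and after changing it):

//Sketch: raise the network packet size via the connection string
var builder = new SqlConnectionStringBuilder(_connectionString)
{
    PacketSize = 8000
};

using (var bulkCopy = new SqlBulkCopy(builder.ConnectionString) { DestinationTableName = "BulkInsertDemo" })
{
    //...column mappings and WriteToServer as before
}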

Multiple BULK INSERTs simultaneously

Like any operation in SQL Server, BULK INSERT also requires and uses some locks. But as long as you didn't use the SqlBulkCopyOptions.TableLock flag, you can run multiple BULK INSERTs on the same table at the same time.
You should be aware of an important difference between the parallelism that can be achieved by multiple clients working with SQL Server simultaneously, and a parallel operation done by SQL Server itself.

If you don't use TableLock, then multiple clients (different threads, processes, machines – whatever) can execute a BULK INSERT operation into the same table at the same time. BUT, each BULK INSERT gets a normal, serial execution by SQL Server. If you use TableLock, then only one BULK INSERT operation can run on a table at a time, but then SQL Server can perform that one BULK INSERT as a parallel load if there are no indexes on the table (usually relevant when loading into staging tables).
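
A rough sketch of the "multiple clients" flavor – several tasks, each with its own connection and SqlBulkCopy, loading different chunks of the data into the same table:

//Sketch: several concurrent loaders, each with its own connection and SqlBulkCopy.
//SplitIntoChunks is a hypothetical helper that splits the source DataTable into smaller ones.
IEnumerable<DataTable> chunks = SplitIntoChunks(dt, 4);

var tasks = chunks.Select(chunk => Task.Run(() =>
{
    using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "BulkInsertDemo" })
    {
        bulkCopy.ColumnMappings.Add("NumCol", "NumCol");
        bulkCopy.ColumnMappings.Add("LongTextCol", "LongTextCol");
        bulkCopy.WriteToServer(chunk);
    }
})).ToArray();

Task.WaitAll(tasks);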

BULK INSERT & Table Relationships

One of the most convenient features of Entity Framework is the way it handles relationships between tables and turns them into relationships between objects. We can work with objects, and then when we call SaveChanges, Entity Framework generates the INSERT statements to add the new rows in such a way that the relationships between objects become relationships in the tables, using foreign keys.
The way it's done is that after every INSERT, Entity Framework uses the value returned from SCOPE_IDENTITY. Meaning, it retrieves the ID of the newly created row and uses it when inserting associated objects into their tables.

When working with BULK INSERT we insert all of our data as a single bulk, so we can't use SCOPE_IDENTITY (or other mechanisms) to get back the values SQL Server assigned to the IDENTITY columns (or any other auto-generated columns). This means we don't get the same behaviour as with the SaveChanges method – we don't add an object graph, we just add a set of rows of a specific type.

This might cause us some problems, but there are some tips that allow us to overcome these issues:

  • Denormalization of these tables: Suppose (for example) we want to load massive amounts of data on a regular basis into the table LogEvents. And let's say that in our current design there's a 1:1 relationship between LogEvents and LogEventsExtendedDetails. Some of the most common metadata is stored in LogEvents, and some other data is stored in LogEventsExtendedDetails.
    If normally we'd be able to use code like 'logEvent.LogEventsExtendedDetails = new LogEventsExtendedDetails…' and know it'll work when we call 'SaveChanges' – it won't when we add a set of LogEvents rows using SqlBulkCopy.
    In that case, denormalization – combining these two tables into one table that contains all the data – may be the easiest solution to our problem, and it will allow us to use BULK INSERT without worrying about keeping the relationship between the tables.
  • Using values decided by the client, and not by SQL Server: Let's consider the following example – we have a LogEvents table, and a LogEventsParameters table that contains, for each LogEvents row, a few parameters we want to save. In that case we don't want to just combine them into the same table and duplicate the data for each parameter; we want to keep them in separate tables with relationships.
    Instead of using IDENTITY columns for the relationship, or any other columns whose value is auto-generated by SQL Server, we can generate the data in a way that the connection between the two tables is based on a GUID (uniqueidentifier columns) that we set on the client. Then we do two BULK INSERT operations: one to add all the new rows to LogEvents, and one to add the new rows to LogEventsParameters. All the rows are already 'connected' by the GUID we generated in the client (see the sketch after this list).
    We also need to consider whether the data must be 100% consistent. For example, we can share a transaction between these two BULK INSERT operations, but we might choose not to – if we want to increase performance, reduce transaction log space, etc. In that case, we have to understand the implications for our data and our applications.
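
A minimal sketch of the client-generated GUID approach (the table and column names are just the example names from above, and the schema is assumed; without explicit ColumnMappings the columns are matched by position):

//Sketch: link LogEvents and LogEventsParameters by a client-generated GUID,
//then load each table with its own BULK INSERT. Table/column names are assumed.
var events = new DataTable();
events.Columns.Add("EventId", typeof(Guid));
events.Columns.Add("Message", typeof(string));

var parameters = new DataTable();
parameters.Columns.Add("EventId", typeof(Guid));
parameters.Columns.Add("ParamName", typeof(string));
parameters.Columns.Add("ParamValue", typeof(string));

var eventId = Guid.NewGuid(); //the client decides the key
events.Rows.Add(eventId, "Something happened");
parameters.Rows.Add(eventId, "Severity", "High");
parameters.Rows.Add(eventId, "Source", "Web01");

using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "LogEvents" })
{
    bulkCopy.WriteToServer(events);
}
using (var bulkCopy = new SqlBulkCopy(_connectionString) { DestinationTableName = "LogEventsParameters" })
{
    bulkCopy.WriteToServer(parameters);
}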

What happens when BULK INSERT Fails?

By default, when BULK INSERT fails it doesn't roll back the rows that were inserted before the error occurred. If we set the UseInternalTransaction flag, then the current batch (which contains BatchSize rows – by default, all the rows are in the same batch) will be rolled back.
If the SqlBulkCopy was part of a transaction, then that transaction will be rolled back.
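
If you need all-or-nothing semantics for the whole load, one option (sketched below, using the same example table) is to enlist the SqlBulkCopy in an explicit SqlTransaction:

//Sketch: all-or-nothing load by enlisting the bulk copy in an explicit transaction
using (var connection = new SqlConnection(_connectionString))
{
    connection.Open();
    using (var transaction = connection.BeginTransaction())
    {
        try
        {
            using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction)
            {
                DestinationTableName = "BulkInsertDemo"
            })
            {
                bulkCopy.ColumnMappings.Add("NumCol", "NumCol");
                bulkCopy.ColumnMappings.Add("LongTextCol", "LongTextCol");
                bulkCopy.WriteToServer(dt);
            }
            transaction.Commit();
        }
        catch
        {
            transaction.Rollback();
            throw;
        }
    }
}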

Monitoring BULK INSERT in production environment

We can track the performance of BULK INSERT operations using Performance Monitor (Start -> Run -> perfmon.msc). The relevant counters are under the 'SQLServer:Databases' object: 'Bulk Copy Rows/sec' and 'Bulk Copy Throughput/sec'.

Improving INSERT Performance

Although it's not the main subject of this post, I want to mention some options you have for improving INSERT performance without using SqlBulkCopy. This is obviously not everything you should consider, just some highlights.

Inserting multiple rows using INSERT

A single INSERT statement in SQL can have multiple values:

INSERT INTO BulkInsertDemo(NumCol, LongTextCol) 
VALUES 
(1, 'BlahBlahBlah'),
(1, 'BlahBlahBlah'),
(2,'a')

The number of rows per statement is limited to 1,000. For 1,000 rows, using 1,000 separate INSERT statements took about 205ms, using SqlBulkCopy took about 17ms, and using a single INSERT with multiple values took about 143ms.
It's better than 1,000 INSERTs, but not as good as BULK INSERT. This syntax feature isn't really about performance, but it has some benefits.

Using Delayed Durability

A relatively new feature in SQL Server is delayed durability. It allows us to improve INSERT performance by gathering multiple writes to the transaction log into a buffer, and writing it to the transaction log when it's big enough or when some time has passed. The most important thing is that our application doesn't have to wait for the actual write to the transaction log, and by not waiting for the I/O we can improve performance. Of course, this also means that we can lose data in case of a failure between the moment our application wrote the data and the moment the data was actually persisted. You can read more about delayed durability here.
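
As a rough sketch, enabling it at the database level could look like this (MyDb is a placeholder; ALLOWED lets individual transactions opt in, while FORCED applies it to all of them):

//Sketch only: allow delayed durability at the database level (MyDb is a placeholder).
//Delayed durability trades durability for speed - recently committed data can be lost on failure.
using (var connection = new SqlConnection(_connectionString))
{
    connection.Open();
    using (var command = new SqlCommand("ALTER DATABASE [MyDb] SET DELAYED_DURABILITY = ALLOWED", connection))
    {
        command.ExecuteNonQuery();
    }
}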

Using In-Memory OLTP

Another relatively new feature in SQL Server is In-Memory OLTP. This feature allows us to work with tables whose data is entirely stored in the server's memory, and which work in a different way that is optimized for data guaranteed to be in RAM (using different types of indexes, lock-free mechanisms, etc.). This can greatly improve performance, including INSERT performance. You can read about it in the official documentation.

Summary

In this post we saw BULK INSERT and how it can help us improve the performance of inserting new rows. We also saw SqlBulkCopy, the .NET class that lets us use BULK INSERT from our .NET applications.
We saw some of the challenges we can face when working with BULK INSERT, and some workarounds that can help us.

Hope it helps.

Shahar.
