Importing a large file into a SQL Server database using some statistics
I imported a 2.5 GB file into our SQL Server database today. What
made this import challenging is that the SQL Server Import and Export Wizard failed to
read the data. The file is tab-delimited, with CR/LF as the record
separator. Lines are truncated when the remaining fields are empty, resulting in a
variable number of columns per record. There is also a memo field that can itself contain
CR/LF, so CR/LF cannot reliably be used to break the file into records.
I had to write my own import utility. The utility does
the import in two phases: an analysis phase and an import phase. In the analysis phase, the
utility reads the entire file to determine the number of columns, the data type
of each column, its width, and whether it allows nulls. To address the
CR/LF problem, the utility uses the following algorithm (a sketch follows the steps):
1. Read a line and determine the number of columns in it.
2. Peek at the next line and determine its number of columns. If the sum of the two
counts is less than the expected column count, the second line is a candidate for
merging into the first. Repeat this step for the following lines.
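Here is a minimal C# sketch of that merging step (C# because the utility apparently targets .NET, given the SqlBulkCopy mention later). The tab separator comes from the post; the method names, and the way the expected column count is obtained (e.g. from a header row or the maximum count seen so far), are my assumptions, and the exact merge cutoff may differ from the real utility.

```csharp
using System.Collections.Generic;
using System.IO;

static class LineMerger
{
    // Merge physical lines back into logical records.
    public static IEnumerable<string> MergeLines(TextReader reader, int expectedColumns)
    {
        string current = reader.ReadLine();
        while (current != null)
        {
            string next = reader.ReadLine();
            // Keep absorbing the next physical line while the combined column
            // count still falls short of the expected count (the post's merge
            // criterion); such a short pair is assumed to be one record that a
            // CR/LF inside the memo field split in two.
            while (next != null &&
                   CountColumns(current) + CountColumns(next) < expectedColumns)
            {
                current = current + "\r\n" + next;  // restore the embedded CR/LF
                next = reader.ReadLine();
            }
            yield return current;
            current = next;
        }
    }

    private static int CountColumns(string line) => line.Split('\t').Length;
}
```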
When determining the column type, I first initialize each
column with a narrow type and then widen the column when necessary, in the
following order: bool -> integral -> floating -> varchar. Because I am not
completely sure that I merged the lines correctly, I relied on probabilities
rather than on the first occurrence to decide when to widen a field. This allows me
to run the analysis phase in a single pass. The drawback is that I do not have 100%
confidence in the extracted schema; data that does not fit the schema has to be
rejected in the import phase. The analysis phase took about 10 minutes on a fairly
slow (by today's standards) dual-core Pentium machine.
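The post does not show how these widening statistics are kept, so the following is only a plausible sketch: per-column counters of how many values parse as each candidate type, with the narrowest type chosen if it covers a large enough fraction of the non-null values. The class name, the counters, the SQL type choices, and the 0.95 threshold are illustrative assumptions, not the post's actual numbers.

```csharp
using System;

class ColumnStats
{
    // Counters for how many observed values parse as each candidate type.
    public long Total, Nulls, Bools, Integrals, Floats;
    public int MaxWidth;

    public void Observe(string value)
    {
        Total++;
        if (string.IsNullOrEmpty(value)) { Nulls++; return; }  // track nullability
        MaxWidth = Math.Max(MaxWidth, value.Length);
        if (bool.TryParse(value, out _)) Bools++;
        if (long.TryParse(value, out _)) Integrals++;
        if (double.TryParse(value, out _)) Floats++;
    }

    // Pick the narrowest SQL type (bool -> integral -> floating -> varchar)
    // that accounts for "most" of the non-null values. Using a fraction rather
    // than requiring every value to parse tolerates a few badly merged lines;
    // rows that do not fit the final schema are rejected in the import phase.
    public string InferSqlType(double threshold = 0.95)
    {
        long nonNull = Total - Nulls;
        if (nonNull == 0) return "varchar(1)";
        if ((double)Bools / nonNull >= threshold) return "bit";
        if ((double)Integrals / nonNull >= threshold) return "bigint";
        if ((double)Floats / nonNull >= threshold) return "float";
        return $"varchar({Math.Max(1, MaxWidth)})";
    }

    public bool AllowsNull => Nulls > 0;
}
```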
In the import phase, I did line merging similar to that in the
analysis phase. The only difference is that, with the statistics from the first
phase, I was able to determine whether a short line (i.e. a line with a
small number of columns) should be merged with the previous line or with the next line.
The criterion is that the fields produced by the merged line have to satisfy the
schema. I imported over 8 million records and had to reject only one record. I
visually inspected the rejected record and the data is indeed bad. I used SqlBulkCopy to load the data in 1,000-record batches. It took about
1 hour and 30 minutes to import the data over the WAN into our SQL Server
instance in the cloud.
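The post only says that SqlBulkCopy was used with 1,000-record batches, so here is one plausible shape of that call; the connection string handling, the table name, and the zero timeout are placeholders and assumptions on my part.

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    // Flush a buffered batch of parsed records to SQL Server.
    public static void Load(DataTable buffer, string connectionString)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.ImportedRecords";  // placeholder table name
            bulk.BatchSize = 1000;      // send/commit in 1,000-record batches
            bulk.BulkCopyTimeout = 0;   // disable the timeout for a slow WAN link
            bulk.WriteToServer(buffer);
        }
    }
}
```

In practice the caller would fill the DataTable from the merged, schema-validated lines, call Load periodically, and clear the buffer between calls.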
In conclusion, rather than determining the table schema fully deterministically, I employed a little bit of statistics in the process. I was able to determine the schema faster (in one pass) and with higher fidelity (rejecting bogus data rather than accepting it by over-widening the schema). After all, statistics is an important part of machine learning today, and it allows us to inject a little bit of intelligence into the ETL process.