Importing a large file into a SQL Server database using some statistics
I imported a 2.5 GB file into our SQL Server today. What
made this import challenging is that the SQL Server Import
and Export Wizard failed to read the data. The file is
tab-delimited with CR/LF as the record separator. Lines are
truncated when the remaining fields are empty, so records
have a variable number of columns. Also, there is a memo
field that can itself contain CR/LF, so CR/LF alone is not
a reliable way to break the file into records.
I had to write my own import utility. The utility does the
import in two phases: an analysis phase and an import
phase. In the first phase, the utility reads the entire
file to determine the number of columns and, for each
column, its data type, its width, and whether it allows
nulls. To deal with the CR/LF problem, the utility uses the
following algorithm:
1. Read a line and determine the number of columns in the
line.
2. Peek at the next line and determine the number of
columns in that line. If the sum of the two counts is less
than the expected column count, the second line is a
candidate for merging into the first line. Repeat this
step. (A sketch of this peek-ahead merge follows these
steps.)
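A minimal sketch of how such a peek-ahead merge could look, assuming tab-delimited input and an expected column count learned during analysis; the names (LineMerger, ReadRecords, expectedColumns) and the exact merge condition are illustrative assumptions, not the actual utility's code:

```csharp
// Sketch of the peek-ahead merge (illustrative names, not the actual utility).
// A "short" line whose column count, combined with the next line's, still fits
// within the expected count is treated as a record that was broken by a CR/LF
// inside the memo field, so the two physical lines are joined into one record.
using System;
using System.Collections.Generic;
using System.IO;

static class LineMerger
{
    public static IEnumerable<string[]> ReadRecords(string path, int expectedColumns)
    {
        using var reader = new StreamReader(path);
        string? line = reader.ReadLine();
        while (line != null)
        {
            string[] fields = line.Split('\t');
            string? next = reader.ReadLine();

            // Keep merging while the current record is short and the next physical
            // line could plausibly be its continuation (joined fields still fit).
            while (next != null && fields.Length < expectedColumns)
            {
                string[] nextFields = next.Split('\t');
                if (fields.Length + nextFields.Length - 1 > expectedColumns)
                    break; // merging would overflow; treat the next line as a new record

                // The memo field was split: glue the last field of this record
                // to the first field of the next line, then append the rest.
                var merged = new List<string>(fields);
                merged[^1] = merged[^1] + "\n" + nextFields[0];
                merged.AddRange(nextFields[1..]);
                fields = merged.ToArray();

                next = reader.ReadLine();
            }

            yield return fields;
            line = next;
        }
    }
}
```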
When determining a column's type, I first initialize each
column with a narrow type and then widen the column when
necessary, in the following order:
bool -> integral -> floating -> varchar. Because I am not
completely sure that I merged the lines correctly, I relied
on the probability of each type over all observed values,
rather than the first occurrence that demanded a wider
type, to decide when to widen a field. This allowed me to
run the analysis phase in only one pass. The drawback is
that I do not have 100% confidence in the extracted schema;
data that does not fit the schema has to be rejected in the
import phase. The analysis phase took about 10 minutes on a
fairly slow (by today's standards) dual-core Pentium
machine.
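A sketch of how this statistical widening could work, assuming each column keeps counters of how many values parse as each type; the confidence threshold and all names here are illustrative assumptions, not the utility's actual code:

```csharp
// Sketch of statistical type widening (illustrative names and threshold).
// Each column counts how many observed values parse as bool, integer,
// floating point, or only as text; the final type is the narrowest one that
// covers nearly all observations, so a handful of mis-merged lines cannot
// force every column to varchar.
using System;
using System.Globalization;

enum ColumnType { Bool, Integral, Floating, Varchar }

class ColumnStats
{
    public long BoolCount, IntCount, FloatCount, TextCount;
    public int MaxWidth;
    public bool AllowsNull;

    public void Observe(string value)
    {
        if (string.IsNullOrEmpty(value)) { AllowsNull = true; return; }
        MaxWidth = Math.Max(MaxWidth, value.Length);

        if (bool.TryParse(value, out _) || value == "0" || value == "1") BoolCount++;
        else if (long.TryParse(value, NumberStyles.Integer, CultureInfo.InvariantCulture, out _)) IntCount++;
        else if (double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out _)) FloatCount++;
        else TextCount++;
    }

    // Pick the narrowest type that explains at least `confidence` of the
    // non-null values; anything rarer is assumed to come from a bad merge
    // and will be rejected in the import phase.
    public ColumnType Infer(double confidence = 0.999)
    {
        long total = BoolCount + IntCount + FloatCount + TextCount;
        if (total == 0) return ColumnType.Varchar;
        if (BoolCount >= total * confidence) return ColumnType.Bool;
        if (BoolCount + IntCount >= total * confidence) return ColumnType.Integral;
        if (BoolCount + IntCount + FloatCount >= total * confidence) return ColumnType.Floating;
        return ColumnType.Varchar;
    }
}
```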
In the import phase, I did line merging similar to the
analysis phase. The only difference is that, with the
statistics from the first phase, I was able to determine
whether a short line (i.e., a line with too few columns)
should be merged with the previous line or the next line.
The criterion is that the fields produced by the merge have
to satisfy the schema. I imported over 8 million records
and had to reject only one record; I visually inspected the
rejected record and its data is indeed bad. I used
SqlBulkCopy to load the data in 1,000-record batches. It
took about 1 hour and 30 minutes to import the data over
the WAN into our SQL Server in the cloud.
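A hedged sketch of the bulk-load step described above, using the Microsoft.Data.SqlClient package (System.Data.SqlClient exposes the same SqlBulkCopy API); the connection string, table name, and the code that builds each DataTable batch are placeholders:

```csharp
// Sketch of the batched bulk load. The connection string, table name, and the
// DataTable construction are placeholders; BatchSize matches the 1,000-record
// batches described in the post.
using System.Data;
using Microsoft.Data.SqlClient;

static class BulkLoader
{
    public static void Load(string connectionString, string tableName, DataTable batch)
    {
        using var bulk = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = tableName,
            BatchSize = 1000,   // commit every 1,000 rows
            BulkCopyTimeout = 0 // no command timeout; the WAN link is slow
        };
        bulk.WriteToServer(batch);
    }
}
```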
In conclusion, rather than determining the table schema fully deterministically, I employed a little bit of statistics in the process. I was able to determine the schema faster (in one pass) and with higher fidelity (rejecting bogus data rather than accepting it by widening the schema). After all, statistics is an important part of machine learning today, and it lets us inject a little bit of intelligence into the ETL process.