Nearly all of the time I create a table, I include a primary key, and often that PK is implemented as a clustered index. Those two don't always have to go together, but in my world they almost always do.
On a recent project, I was working on a data warehouse and a set of SSIS packages to import data from an OLTP database into my data warehouse. The data I was importing from the business database into the warehouse was mostly new rows, sometimes updates to existing rows, and sometimes deletes. I decided to use the MERGE statement to implement the insert, update or delete in the data warehouse, I found it quite performant to have a stored procedure that extracted all the new, updated, and deleted rows from the source database and dump it into a working table in my data warehouse, then run a stored proc in the warehouse that was the MERGE statement that took the rows from the working table and updated the real fact table.
CREATETABLE Integration.MergePolicy (PolicyId int, PolicyTypeKey int, Premium money, Deductible money, EffectiveDate date, Operation varchar(5))
CREATETABLE fact.Policy (PolicyKey int identity primary key, PolicyId int, PolicyTypeKey int, Premium money, Deductible money, EffectiveDate date)
CREATEPROC Integration.MergePolicy as
Merge fact.Policy as tgt
Using Integration.MergePolicy as Src
On (tgt.PolicyId = Src.PolicyId) When not matched by Target then Insert (PolicyId, PolicyTypeKey, Premium, Deductible, EffectiveDate)
values (src.PolicyId, src.PolicyTypeKey, src.Premium, src.Deductible, src.EffectiveDate) When matched and src.Operation = 'U' then
PolicyTypeKey = src.PolicyTypeKey,
Premium = src.Premium,
Deductible = src.Deductible,
EffectiveDate = src.EffectiveDate
When matched and src.Operation = 'D' thenDelete ;delete from Integration.WorkPolicy commit
Notice that my worktable (Integration.MergePolicy) doesn't have any primary key or clustered index. I didn't think this would be a problem, since it was relatively small table and was empty after each time I ran the stored proc.
For one of the work tables, during the initial loads of the warehouse, it was getting about 1.5 million rows inserted, processed, then deleted. Also, because of a bug in the extraction process, the same 1.5 million rows (plus a few hundred more each time) was getting inserted, processed, and deleted. This was being sone on a fairly hefty server that was otherwise unused, and no one was paying any attention to the time it was taking.
This week I received a backup of this database and loaded it on my laptop to troubleshoot the problem, and of course it took a good ten minutes or more to run the process. However, what seemed strange to me was that after I fixed the problem and happened to run the merge sproc when the work table was completely empty, it still took almost ten minutes to complete.
I immediately looked back at the MERGE statement to see if I had some sort of outer join that meant it would be scanning the target table (which had about 2 million rows in it), then turned on the execution plan output to see what was happening under the hood. Running the stored procedure again took a long time, and the plan output didn't show me much - 55% on the MERGE statement, and 45% on the DELETE statement, and table scans on the work table in both places. I was surprised at the relative cost of the DELETE statement, because there were really 0 rows to delete, but I was expecting to see the table scans.
(I was beginning now to suspect that my problem was because the work table was being stored as a heap.)
Then I turned on STATS_IO and ran the sproc again. The output was quite interesting.Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Table 'Policy'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.Table 'MergePolicy'. Scan count 1, logical reads 433276, physical reads 60, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
I've reproduced the above from memory, the details aren't exact, but the essential bit was the very high number of logical reads on the table stored as a heap. Even just doing a SELECT Count(*) from Integration.MergePolicy incurred that sort of output, even though the result was always 0.
I suppose I should research more on the allocation and deallocation of pages to tables stored as a heap, but I haven't, and my original assumption that a table stored as a heap with no rows would only need to read one page to answer any query was definitely proven wrong. It's likely that some sort of physical defragmentation of the table may have cleaned that up, but it seemed that the easiest answer was to put a clustered index on the table.
After doing so, the execution plan showed a cluster index scan, and the IO stats showed only a single page read. (I aborted my first attempt at adding a clustered index on the table because it was taking too long - instead I ran TRUNCATE TABLE Integration.MergePolicy first and added the clustered index, both of which took very little time).
I suspect I may not have noticed this if I had used TRUNCATE TABLE Integration.MergePolicy instead of DELETE FROM Integration.MergePolicy, since I'm guessing that the truncate operation does some rather quick releasing of pages allocated to the heap table.
In the future, I will likely be much more careful to have a clustered index on every table I use, even the working tables.