Contents tagged with SQL

  • The real reason SELECT * queries are bad: index coverage

    Are SELECT * queries bad? Sure, everyone know that. But, why?

    It's returning too much data, right?

    That's the common answer, but I don't think it's the right one. If you're working with a reasonably normalized database, the actual network traffic difference is pretty small.

    Let's take a look at a sample. The following two queries select 326 rows from the TransactionHistoryArchive table in the AdventureWorks database (which has a total of 89K rows). The first uses a SELECT * query, the second selects a specific column:

     

    SELECT * FROM Production.TransactionHistoryArchive WHERE ReferenceOrderID < 100 SELECT ReferenceOrderLineID FROM Production.TransactionHistoryArchive WHERE ReferenceOrderID < 100

     

    In this case, the difference in network traffic is only 15K, roughly a 10% difference (180K vs. 165K). It's worth fixing, but not a huge difference.

    SELECT * makes the Table / Index Scan Monster come

    Often, the bigger problem with SELECT * is the effect it will have on the execution plan. While SQL Server primarily uses indexes to look up your data, if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage. In the above example, the first query results in a Clustered Index Scan, whereas the second query uses a much more efficient Index Seek. In this case, the Index seek is one hundred times more efficient than the Clustered Index Scan.

    SelectStarQueryPlan

    Unless you've indexed every single column in a table (which is almost never a good idea), a SELECT * query can't take advantage of index coverage, and you're likely to get (extremely inefficient) scan operations.

    If you just query the rows you'll actually be using, it's more likely they'll be covered by indexes. And I think that's the biggest performance advantage of ignoring SELECT * queries.

    The Stability Aspect

    SELECT * queries are also bad from an application maintenance point of view as well, since it introduces another outside variable to your code. If a column is added to a table, the results returned to your application will change in structure. Well programmed applications should be referring to columns by name and shouldn't be affected, but well programmed applications should also minimize the ways in which they are vulnerable to external changes.

    Shameless Plug: I go into this (and a lot other important performance tips) in more detail in a soon-to-be-released book for SitePoint.

    Read more...

  • [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

    It's relatively easy to store data with a time dimension, but querying it is another matter. If you select from a temporal tables (one which includes historical information indicated by a timestamp or datetime column) based on your ID, you'll get a lot of duplicate records; only one or a few of those records will be applicable to a given time or timespan.

    For example, the following query (from the AdventureWorks sample database) returns 746 records:

     

    SELECT ActualCost FROM Production.TransactionHistory WHERE ProductID = 784

     

    A table with History in the name is expected to work that way, but I've run into plenty of tables which allow for duplicates of every column but the primary key, differentiated by a datetime column. I'm starting to use these more now, partly due to project requirements, and partly because SubSonic has built-in support for CreatedOn and ModifiedOn columns.

    In the past, I've used stacked views or nested subqueries to handle this madness. It's a pain in the neck, and often leads to frustrations with the GROUP BY clause since queries with an aggregate term can't include a column in a resultset that's not in the GROUP BY clause (but adding those columns to the GROUP BY clause isn't the right solution, as it alters the rows you're returning).

    The general idea is to write a query which gets the latest update date for a unique ID combination, then join against it. It works, but it's a pain in the neck, and it's error prone. Fortunately, SQL Server 2005's RANK / PARTITION features make this a lot easier.

    Let's talk specifics - a query against the AdventureWorks Production.TransactionHistory table which returns the latest record Product Name and Number by Transaction Date. There are 113K rows in that table, but if we only want the latest entry for each Product we're down to 490 rows.

    Unlike the traditional GROUP BY based clause, the RANK / POSITION based queries are a bit more intuitive once you've got the hang of them. The important thing is to use PARTITION in the same way you'd think of a GROUP BY - the column(s) you want unique should show up in the PARTITION clause.

    One limitation on RANK queries is that you can't use the RANK value directly in a where clause. That's easy to work around by placing the RANK value in a subquery and the WHERE clause filtering on the RANK in the outer query.

    UPDATE: Based on comments, my original example was a little too simple. I've added a few joins and return columns; the point is that we can continue to add other columns to the result set without having to worry about how the grouping is handled. Yes, you can do this with subqueries, but in many cases those will continue to grow more complicated as you add tables and columns; this syntax doesn't.

    SELECT Product.Name, Product.ProductNumber, TransactionDate, TransactionID, ProductCategory.Name, ProductSubcategory.Name, ProductSubcategory.ModifiedDate FROM ( SELECT DISTINCT Production.TransactionHistory.ProductID, Production.TransactionHistory.TransactionDate, Production.TransactionHistory.TransactionID, RANK() OVER ( PARTITION BY Production.TransactionHistory.ProductID ORDER BY Production.TransactionHistory.TransactionDate DESC) AS DateOrder FROM Production.TransactionHistory ) CurrentTransactionHistory INNER JOIN Production.Product ON Production.Product.ProductID = CurrentTransactionHistory.ProductID INNER JOIN Production.ProductSubcategory ON Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID INNER JOIN Production.ProductCategory ON Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID WHERE DateOrder = 1 ORDER BY ProductSubcategory.ModifiedDate

    Read more...