[T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields
It's relatively easy to store data with a time dimension, but querying it is another matter. If you select from a temporal tables (one which includes historical information indicated by a timestamp or datetime column) based on your ID, you'll get a lot of duplicate records; only one or a few of those records will be applicable to a given time or timespan.
For example, the following query (from the AdventureWorks sample database) returns 746 records:
SELECT ActualCost FROM Production.TransactionHistory WHERE ProductID = 784
A table with History in the name is expected to work that way, but I've run into plenty of tables which allow for duplicates of every column but the primary key, differentiated by a datetime column. I'm starting to use these more now, partly due to project requirements, and partly because SubSonic has built-in support for CreatedOn and ModifiedOn columns.
In the past, I've used stacked views or nested subqueries to handle this madness. It's a pain in the neck, and often leads to frustrations with the GROUP BY clause since queries with an aggregate term can't include a column in a resultset that's not in the GROUP BY clause (but adding those columns to the GROUP BY clause isn't the right solution, as it alters the rows you're returning).
The general idea is to write a query which gets the latest update date for a unique ID combination, then join against it. It works, but it's a pain in the neck, and it's error prone. Fortunately, SQL Server 2005's RANK / PARTITION features make this a lot easier.
Let's talk specifics - a query against the AdventureWorks Production.TransactionHistory table which returns the latest record Product Name and Number by Transaction Date. There are 113K rows in that table, but if we only want the latest entry for each Product we're down to 490 rows.
Unlike the traditional GROUP BY based clause, the RANK / POSITION based queries are a bit more intuitive once you've got the hang of them. The important thing is to use PARTITION in the same way you'd think of a GROUP BY - the column(s) you want unique should show up in the PARTITION clause.
One limitation on RANK queries is that you can't use the RANK value directly in a where clause. That's easy to work around by placing the RANK value in a subquery and the WHERE clause filtering on the RANK in the outer query.
UPDATE: Based on comments, my original example was a little too simple. I've added a few joins and return columns; the point is that we can continue to add other columns to the result set without having to worry about how the grouping is handled. Yes, you can do this with subqueries, but in many cases those will continue to grow more complicated as you add tables and columns; this syntax doesn't.
SELECT Product.Name, Product.ProductNumber, TransactionDate, TransactionID, ProductCategory.Name, ProductSubcategory.Name, ProductSubcategory.ModifiedDate FROM ( SELECT DISTINCT Production.TransactionHistory.ProductID, Production.TransactionHistory.TransactionDate, Production.TransactionHistory.TransactionID, RANK() OVER ( PARTITION BY Production.TransactionHistory.ProductID ORDER BY Production.TransactionHistory.TransactionDate DESC) AS DateOrder FROM Production.TransactionHistory ) CurrentTransactionHistory INNER JOIN Production.Product ON Production.Product.ProductID = CurrentTransactionHistory.ProductID INNER JOIN Production.ProductSubcategory ON Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID INNER JOIN Production.ProductCategory ON Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID WHERE DateOrder = 1 ORDER BY ProductSubcategory.ModifiedDate