[T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

Friday, July 13, 2007

It's relatively easy to store data with a time dimension, but querying it is another matter. If you select from a temporal tables (one which includes historical information indicated by a timestamp or datetime column) based on your ID, you'll get a lot of duplicate records; only one or a few of those records will be applicable to a given time or timespan.

For example, the following query (from the AdventureWorks sample database) returns 746 records:

SELECT ActualCost 
FROM Production.TransactionHistory 
WHERE ProductID = 784

A table with History in the name is expected to work that way, but I've run into plenty of tables which allow for duplicates of every column but the primary key, differentiated by a datetime column. I'm starting to use these more now, partly due to project requirements, and partly because SubSonic has built-in support for CreatedOn and ModifiedOn columns.

In the past, I've used stacked views or nested subqueries to handle this madness. It's a pain in the neck, and often leads to frustrations with the GROUP BY clause since queries with an aggregate term can't include a column in a resultset that's not in the GROUP BY clause (but adding those columns to the GROUP BY clause isn't the right solution, as it alters the rows you're returning).

The general idea is to write a query which gets the latest update date for a unique ID combination, then join against it. It works, but it's a pain in the neck, and it's error prone. Fortunately, SQL Server 2005's RANK / PARTITION features make this a lot easier.

Let's talk specifics - a query against the AdventureWorks Production.TransactionHistory table which returns the latest record Product Name and Number by Transaction Date. There are 113K rows in that table, but if we only want the latest entry for each Product we're down to 490 rows.

Unlike the traditional GROUP BY based clause, the RANK / POSITION based queries are a bit more intuitive once you've got the hang of them. The important thing is to use PARTITION in the same way you'd think of a GROUP BY - the column(s) you want unique should show up in the PARTITION clause.

One limitation on RANK queries is that you can't use the RANK value directly in a where clause. That's easy to work around by placing the RANK value in a subquery and the WHERE clause filtering on the RANK in the outer query.

UPDATE: Based on comments, my original example was a little too simple. I've added a few joins and return columns; the point is that we can continue to add other columns to the result set without having to worry about how the grouping is handled. Yes, you can do this with subqueries, but in many cases those will continue to grow more complicated as you add tables and columns; this syntax doesn't.

SELECT
    Product.Name,
    Product.ProductNumber,
    TransactionDate,
    TransactionID,
    ProductCategory.Name,
    ProductSubcategory.Name,
    ProductSubcategory.ModifiedDate
FROM
(
    SELECT DISTINCT
    Production.TransactionHistory.ProductID,
    Production.TransactionHistory.TransactionDate,
    Production.TransactionHistory.TransactionID,
    RANK() OVER (
        PARTITION BY Production.TransactionHistory.ProductID 
        ORDER BY Production.TransactionHistory.TransactionDate DESC) 
    AS DateOrder
    FROM Production.TransactionHistory
) CurrentTransactionHistory
INNER JOIN Production.Product 
    ON Production.Product.ProductID = CurrentTransactionHistory.ProductID
INNER JOIN Production.ProductSubcategory ON
    Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID
INNER JOIN Production.ProductCategory ON
    Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID
WHERE DateOrder = 1 
ORDER BY ProductSubcategory.ModifiedDate

8 Comments

maybe i'm misunderstanding your requirement but wouldn't it be easier to just join to the subquery in which you get only the max date grouped by id?

Mladen - Friday, July 13, 2007 1:19:07 AM

@Mladen - I simplified my example quite a bit, so the benefit's less clear.

My production code was doing some complex joins with several tables, in which case joining to a subquery was a lot more complex. In my case, I needed a combination of ID's in each row.

The RANK / PARTITION features don't really allow you to do things you couldn't before, but they do make it easier. If you start by joining to a subquery and need to add fields later, it can get out of control quickly. If you start with RANK / PARTITION, adding new columns makes no difference to your grouping logic.

Jon Galloway - Friday, July 13, 2007 1:28:33 AM

It seems that you wanted to use row_num instead of rank because if you have duplicate records by TransactionDate you would get duplicates if you dont have duplicates, rank is not useless...

BAlexandrov - Friday, July 13, 2007 2:30:23 AM

@Jon: If you start by joining to a subquery and need to add fields later, it can get out of control quickly.

Can you provide an example of this? Based on your statement of "The general idea is to write a query which gets the latest update date" then the RANK/PARTITION feature is likely overkill and possibly more confusing.

SELECT
p.[Name],
p.ProductNumber,
th.TransactionDate,
th.TransactionID
FROM
Production.TransactionHistory th
INNER JOIN
Production.Product p
ON p.ProductID = th.ProductID
where
exists (
select 1
from Production.TransactionHistory thx
where
-- this is our local 'primary key'
th.ProductID = thx.ProductID
group by
-- group by our local 'primary key'
thx.ProductID
having
-- what predicate are we after?
max( thx.TransactionDate )
= th.TransactionDate
)

To me, based on your requirements, and any requirement that is based on a MAX/MIN/SUM (basically simple aggregates) then this particular format of query it far superior and likely a better performer. Where the RANK/PARTITION would be superior would be cases where you need the TOP X of something over a range of data, or some other subset of that data, or possibly if your data in your actual search predicate could be duplicated as well (as having duplicate datetime values by ProductID in the example) - but the latter could be handled easily in the outer query.

David L. Penton - Friday, July 13, 2007 7:43:47 AM

thanks i like your code fun time at me pants

guyishappy - Monday, July 23, 2007 10:44:48 PM

Hi Jon,

The information above is a bit helpful in terms of starting to understand the logic of structuring a joint query.

However, the example above is not working for me. I need to structure a nested query for one table which contains historical data to pull various records with the most recent update date without pulling duplicate records and I cannot seem to structure the query correctly. If possible, would you please provide an example of this type of query?

- Lavita

Lavita - Wednesday, August 6, 2008 8:49:19 PM

Hi i tried your code. It worked but i want to use rownum with this query as well. Can you give me the code with rownum.

following is the code...

SELECT * FROM tl_flexcab th where exists ( select 1 from tl_flexcab thx where th.FL_FNN_JOIN = thx.FL_FNN_JOIN and FL_OWN_CD = 'ZAG' group by thx.FL_FNN_JOIN having max( thx.FL_DT_TM ) = th.FL_DT_TM )

Sandeep - Thursday, August 28, 2008 3:40:25 AM

Thanks for your example... I would have been totally lost trying to figure out how to do this sort of thing with traditional t-sql statements.

paulains - Wednesday, December 31, 2008 1:47:36 PM

Comments have been disabled for this content.