[T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

It's relatively easy to store data with a time dimension, but querying it is another matter. If you select from a temporal tables (one which includes historical information indicated by a timestamp or datetime column) based on your ID, you'll get a lot of duplicate records; only one or a few of those records will be applicable to a given time or timespan.

For example, the following query (from the AdventureWorks sample database) returns 746 records:

 

SELECT ActualCost FROM Production.TransactionHistory WHERE ProductID = 784

 

A table with History in the name is expected to work that way, but I've run into plenty of tables which allow for duplicates of every column but the primary key, differentiated by a datetime column. I'm starting to use these more now, partly due to project requirements, and partly because SubSonic has built-in support for CreatedOn and ModifiedOn columns.

In the past, I've used stacked views or nested subqueries to handle this madness. It's a pain in the neck, and often leads to frustrations with the GROUP BY clause since queries with an aggregate term can't include a column in a resultset that's not in the GROUP BY clause (but adding those columns to the GROUP BY clause isn't the right solution, as it alters the rows you're returning).

The general idea is to write a query which gets the latest update date for a unique ID combination, then join against it. It works, but it's a pain in the neck, and it's error prone. Fortunately, SQL Server 2005's RANK / PARTITION features make this a lot easier.

Let's talk specifics - a query against the AdventureWorks Production.TransactionHistory table which returns the latest record Product Name and Number by Transaction Date. There are 113K rows in that table, but if we only want the latest entry for each Product we're down to 490 rows.

Unlike the traditional GROUP BY based clause, the RANK / POSITION based queries are a bit more intuitive once you've got the hang of them. The important thing is to use PARTITION in the same way you'd think of a GROUP BY - the column(s) you want unique should show up in the PARTITION clause.

One limitation on RANK queries is that you can't use the RANK value directly in a where clause. That's easy to work around by placing the RANK value in a subquery and the WHERE clause filtering on the RANK in the outer query.

UPDATE: Based on comments, my original example was a little too simple. I've added a few joins and return columns; the point is that we can continue to add other columns to the result set without having to worry about how the grouping is handled. Yes, you can do this with subqueries, but in many cases those will continue to grow more complicated as you add tables and columns; this syntax doesn't.

SELECT Product.Name, Product.ProductNumber, TransactionDate, TransactionID, ProductCategory.Name, ProductSubcategory.Name, ProductSubcategory.ModifiedDate FROM ( SELECT DISTINCT Production.TransactionHistory.ProductID, Production.TransactionHistory.TransactionDate, Production.TransactionHistory.TransactionID, RANK() OVER ( PARTITION BY Production.TransactionHistory.ProductID ORDER BY Production.TransactionHistory.TransactionDate DESC) AS DateOrder FROM Production.TransactionHistory ) CurrentTransactionHistory INNER JOIN Production.Product ON Production.Product.ProductID = CurrentTransactionHistory.ProductID INNER JOIN Production.ProductSubcategory ON Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID INNER JOIN Production.ProductCategory ON Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID WHERE DateOrder = 1 ORDER BY ProductSubcategory.ModifiedDate
Published Thursday, July 12, 2007 10:50 PM by Jon Galloway
Filed under: ,

Comments

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

maybe i'm misunderstanding your requirement but wouldn't it be easier to just join to the subquery in which you get only the max date grouped by id?

Friday, July 13, 2007 4:19 AM by Mladen

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

@Mladen - I simplified my example quite a bit, so the benefit's less clear.

My production code was doing some complex joins with several tables, in which case joining to a subquery was a lot more complex. In my case, I needed a combination of ID's in each row.

The RANK / PARTITION features don't really allow you to do things you couldn't before, but they do make it easier. If you start by joining to a subquery and need to add fields later, it can get out of control quickly. If you start with RANK / PARTITION, adding new columns makes no difference to your grouping logic.

Friday, July 13, 2007 4:28 AM by Jon Galloway

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

It seems that you wanted to use row_num instead of rank because if you have duplicate records by TransactionDate you would get duplicates if you dont have duplicates, rank is not useless...

Friday, July 13, 2007 5:30 AM by BAlexandrov

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

@Jon: If you start by joining to a subquery and need to add fields later, it can get out of control quickly.

Can you provide an example of this?  Based on your statement of "The general idea is to write a query which gets the latest update date" then the RANK/PARTITION feature is likely overkill and possibly more confusing.

SELECT

 p.[Name],

 p.ProductNumber,

 th.TransactionDate,

 th.TransactionID

FROM

 Production.TransactionHistory th

INNER JOIN

 Production.Product p

 ON p.ProductID = th.ProductID

where

 exists (

   select 1

   from Production.TransactionHistory thx

   where

     -- this is our local 'primary key'

     th.ProductID = thx.ProductID

   group by

     -- group by our local 'primary key'

     thx.ProductID

   having

     -- what predicate are we after?

     max( thx.TransactionDate )

       = th.TransactionDate

 )

To me, based on your requirements, and any requirement that is based on a MAX/MIN/SUM (basically simple aggregates) then this particular format of query it far superior and likely a better performer.  Where the RANK/PARTITION would be superior would be cases where you need the TOP X of something over a range of data, or some other subset of that data, or possibly if your data in your actual search predicate could be duplicated as well (as having duplicate datetime values by ProductID in the example) - but the latter could be handled easily in the outer query.

Friday, July 13, 2007 10:43 AM by David L. Penton

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

thanks i like your code fun time at me pants

Tuesday, July 24, 2007 1:44 AM by guyishappy

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

I am getting the error that RANK() is not a valid function.  I am using MSSQL2000. Anyways, here is my situation.

I have a table Stories (ID, Heading, SubHeading,..., EffectiveDate)

When a story is updated, the Heading and SubHeading never change and a new record is inserted and the old one is history.

ie:

ID | Heading   | SubHeading | EffectiveDate

55 | testing   | testing123 | 22 Feb 2008

56 | testing   | testing123 | 26 Feb 2008

58 | different | testing456 | 26 Feb 2008

I need to list all the lastest stories  only.  So I should get the last 2 records shown above.  There could be multiple entries for each "heading & subheading" and when I list, I need all the items where the effectivedate is the latest...

What is the solution?

Wednesday, February 27, 2008 6:10 AM by Glen

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

Hi Jon,

The information above is a bit helpful in terms of starting to understand the logic of structuring a joint query.  

However, the example above is not working for me.  I need to structure a nested query for one table which contains historical data to pull various records with the most recent update date without pulling duplicate records and I cannot seem to structure the query correctly.  If possible, would you please provide an example of this type of query?

- Lavita

Wednesday, August 06, 2008 11:49 PM by Lavita

# re: [T-SQL] Getting distinct, current entries from tables with timestamp / datetime fields

Hi i tried your code. It worked but i want to use rownum with this query as well. Can you give me the code with rownum.

following is the code...

SELECT * FROM tl_flexcab th where exists ( select 1  from tl_flexcab thx  where th.FL_FNN_JOIN = thx.FL_FNN_JOIN  and FL_OWN_CD = 'ZAG' group by  thx.FL_FNN_JOIN having max( thx.FL_DT_TM ) = th.FL_DT_TM )

Thursday, August 28, 2008 6:40 AM by Sandeep

Leave a Comment

(required) 
(required) 
(optional)
(required)