Erik Porter's Blog

Life and Development at Microsoft and Other Technology Discussions

News

    ROW_NUMBER() OVER Not Fast Enough With Large Result Set

    So I'm working on improving the performance of the next version of Channel 9.  For those of you not familiar, our team created a new platform for our department to build community sites (blogs, forums, videos, tagging, etc).  You can see an example of the platform running on 10.  Currently, there's not a whole lot of data in the 10 database so we've been able to get away with murder from a performance standpoint.  Now that I've imported all the data from Community Server (Channel 9) to 10 (EvNet Platform), we've been finding that 250,000 rows doesn't perform very well with our current code.

    The architecture of our platform is pretty simple.  There's an Entry table that houses Entries, Threads & Comments and the relationships between them.  When you're viewing a blog, you're looking at this table.  When you're looking at a forum, you're looking at this table.  We differentiate everything in our system by tags.  There's one for each forum, each blog, and content tags to describe what's in each entry.

    Currently, to do paging we've been using the ROW_NUMBER OVER function.  It's a new feature of SQL Server 2005 and is very simple and easy to use.  Unfortunately, it doesn't work very well with a lot of rows (250,000 for example).  I did some searching and came across this gem of an article.  It uses an interesting trick to use SET ROWCOUNT to get the first record to start with in a paged result set, then you just run the query again and set the row count again to the number you want where the values are greater than the first row from the starting point of the paged result set and man is it snappy.  Do check it out if you're having troubles with performance of ROW_NUMBER() OVER.

    Oh yah, I forgot to mention how much of an improvement this change made.  Before the change queries were taking about 8 seconds on average.  After the change, the queries now take less than 1 second.  Depending on if SQL has decided to cache the results or not, it's pretty much instantaneous.

    Comments

    FransBouma said:

    The problem with SET ROWCOUNT is that if you use a multiple PK it can give wrong results. It's been a while since I did tests with this, so the exact details faded away, but what I do remember is that SET ROWCOUNT is unreliable. What I find weird is that ROW_NUMBER() OVER is apparently so slow, as it should be faster than a temptable, which should you already give a query result way beyond 1 second for 250.000 rows. Tests on millions of rows with a paging query using a temptable showed result times of a second or thereabout. Ofcourse depending on your hardware, but I don't assume Microsoft runs its major sites on a nintendo gameboy.
    # October 18, 2006 4:19 AM

    HumanCompiler said:

    But running it on a Gameboy WOULD be fun!  ;)

    I honestly thought I was doing something wrong with my query because it was so slow, but I seriously wasn't doing anything complicated with it.  After seeing this article list the one I mentioned in the entry about it being more performant, I just went with it.

    http://www.asp.net/learn/dataaccess/tutorial25vb.aspx?tabid=63

    "This tutorial implements custom paging using the ROW_NUMBER() keyword. For more information on using the table variable and SET ROWCOUNT technique, see A More Efficient Method for Paging Through Large Result Sets."

    Oh and the table I'm querying has a single PK, so no worries there.  Seems to be reliable so far, but I guess we'll see.  Thanks for the heads up.

    # October 18, 2006 4:29 AM

    Wim said:

    Well, well.

    I'm using ROW_NUMBER() OVER as well for efficient paging for a forums app I'm working on.

    After reading your article, I dumped over 300,000 records into the table holding the posts. And to my surprise - it hit the SQL Connection timeout!!! EEKS.

    I hardly believed your blog post when I read it first, as ROW_NUMBER() OVER is the recommended way of performing 'efficient' data paging in SQL Server 2005, but I'm beginning to doubt that.

    Would be interesting to see what Scott Guthrie's response to this is.

    Cheers,

    Wim

    # October 18, 2006 10:14 AM

    Wim said:

    Actually - it was the Command timeout I hit first...

    # October 18, 2006 10:22 AM

    Karl said:

    One of the problems I faced with the Row_Number() OVER was the inability to have it live longer than a single SELECT. The nice thing about the temp table approach is you can do multiple things with that temp table...Sure you can do the same thing with Row_Number AND a temp table..but it's just easier to declare an identity field and use the ROWCOUNT trick for performance...
    # October 18, 2006 8:33 PM

    Brian said:

    We tried to implement this patern of using RowCount a couple months ago. The problem we ran into, is how changing the rowcount would impact the logic of function sand subqueries. Functions would usually not have a problem, unless its a function that returns a table with greater then your page size. Same issue exists for sub queries.
    # November 3, 2006 11:39 AM

    HumanCompiler said:

    Thanks for the link, but I don't really get it.  I see how that could help for a while, but when I get down to needing pages towards the end of my large result set, I'd still have to do a row number over pretty much all the rows.  Also, you can't use a variable with TOP, and I can't hard code for every page set I want.  So far, the method I talked about on this blog entry seems to be the best way.

    # November 6, 2006 2:09 AM

    Arnoud van Bers said:

    In fact: in SQL Server 2005 you CAN use a variable with the TOP clause: declare @n int set @n=3 select top (@n) * from MyTable2 furthermore: Did you create an index on the OVER() columns? This would greatly help because SQL Server must sort the results..
    # November 6, 2006 4:02 AM

    Arnoud van Bers said:

    in fact, in SQL Server 2005 you can define a variable for the top clause also.. dit you create an index for the order by fields?
    # November 6, 2006 8:42 PM

    Lewis Moten said:

    I've also had problems at first when using ROW_NUMBER() on some large tables during testing (1 million rows). It was about twice as fast then when I had used cursors and a table variable to fetch id's from an absolute row and the next N+ rows for the page. I tried making another procedure to use the set rowcount method, but it took a little bit longer than the ROW_NUMBER method.

    I created a table Item(ItemId uniqueidentifier, ItemName varchar(36)) and assigned a PK as well as an Asc index on ItemName.

    My basic tests were the following: Search against 2,097,152 records for the letter "a" on page 60,000 with 10 records per page sorting on "ItemName".

    INSERT INTO item VALUES (NEWID(), CAST(NEWID() AS VARCHAR(36)))

    -- keep running this next line until you have over a million records affected

    INSERT INTO item SELECT NEWID(), CAST(NEWID() AS VARCHAR(36)) FROM item

    ROW_NUMBER took avg 22 seconds

    SET ROWCOUNT took avg 28 seconds

    DYNAMIC READ_ONLY CURSOR took avg of 44 seconds.

    setting rowcount had a good effect on my original cursor query, but it still couldn't compete with ROW_NUMBER.

    select

    ItemId,

    ItemName

    FROM

    (

    SELECT

    ROW_NUMBER() OVER

    (

    ORDER BY

    case

    when @Sort = N'ItemName' then ItemName

    end asc,

    case

    when @Sort = N'ItemName DESC' then ItemName

    end DESC) RowNumber,

    ItemId,

    ItemName

    from

    Item WITH (FASTFIRSTROW)

    where

    ItemName LIKE N'%' + @Search + N'%'

    ) AS Page

    where

    Page.RowNumber BETWEEN @StartRowIndex AND @EndRowIndex

    ORDER BY

    Page.RowNumber asc

    # December 5, 2007 6:31 PM

    Ryk said:

    I have a weird problem with the last optimised solution in his stored proc. It returns a completely different resultset.

    the problem seems to be that it ignores my where clause I specified on the SELECT @first_id = ....

    WEIRD.

    # February 13, 2008 9:52 PM

    Ryk said:

    Aaah, sorted it, it is because of my complex order by's.

    One question about this whole scenario though. It is all well and good to have these ligtning fast solutions that always shows how to do the quick returns on queries, but in reality this is far from a real-world solution. For example, I have 180,000 records that can easily be returned using this super fast ROWCOUNT SET query, but there will always be order by columns etc. And in reality not just one or two, but multiple, then you are pretty much back to square one where you have to serialize all data and then select a subset, OR am I wrong, is there a better way?

    # February 14, 2008 12:32 AM

    HumanCompiler said:

    Ryk, we use the set rowcount method with lots of order bys and it's still really fast.  one thing you can do to increase the performance (but greatly adds to the complexity of the sql statement) is that if the page you're requesting is past halfway through your "pages" of data, then reverse the sort order.  Actually, it might not complicate it too much, just use a case statement in your order by.

    We're using this method on 4 high traffic sites at the moment (5th coming soon) and have no problems with performance.

    # February 14, 2008 12:39 AM

    Dizi said:

    I was going to use SQL server but when i saw your blog, i am planinng to use oracle for db

    # February 22, 2008 4:46 AM

    dmose said:

    HumanCompiler:  How do you use this method with sorting (other then the ID field) ... ?  Can you provide an example, I can't seem to figure it out

    # March 10, 2008 8:42 PM

    Colnaghi said:

    Linq vs. Row_Number

    I've been testing ROW_NUMBER on real big databases (4+ million rows) and it just fails to provide decent results.

    Using the rowcount gave me a solution. Even using the TOP/NOT IN approach works for page not so far from the begging.

    Now, the point is that Linq over Sql 2005 generates the row_number approach. It just blows the machine memory before returning any results.

    The question is, to make this work on this context, do I have to write my own SPs? Why don't Linq implements a better algorithm?

    This line

    (from e in db.Events orderby e.EventDesc select e)

    .Skip(pageIndex * pageSize).Take(pageSize);

    Generates this

    exec sp_executesql N'SELECT [t3].[EventDesc], [t3].[EventDate], [t3].[EventStatusName], [t3].[EventTypeName]

    FROM (

       SELECT ROW_NUMBER() OVER (ORDER BY [t0].[EventDesc]) AS [ROW_NUMBER], [t0].[EventDesc], [t0].[EventDate], [t1].[EventStatusName], [t2].[EventTypeName]

       FROM [dbo].[Event] AS [t0]

       INNER JOIN [dbo].[EventStatus] AS [t1] ON [t1].[EventStatusID] = [t0].[EventStatusID]

       INNER JOIN [dbo].[EventType] AS [t2] ON [t2].[EventTypeID] = [t0].[EventTypeID]

       ) AS [t3]

    WHERE [t3].[ROW_NUMBER] BETWEEN @p0 + 1 AND @p0 + @p1

    ORDER BY [t3].[ROW_NUMBER]',N'@p0 int,@p1 int',@p0=10,@p1=990

    # March 19, 2008 11:50 AM

    Emad said:

    Hi , I see your point is true in case of Low memory server ,

    but when i tried in my real server 8GB ram and on a table the heavilly searched it works very fast ( less than 1 second)

    WITH CC AS

    (

       SELECT *,

       ROW_NUMBER() OVER (ORDER BY forename) AS 'RowNumber'

       FROM contact  

    )

    SELECT *

    FROM CC

    WHERE RowNumber BETWEEN 200000 AND 200100;

    # April 29, 2008 7:03 AM

    cdowd said:

    Hi, the only way I see this method working is if you order by the id.  If you order by anything else, what would you use in the >= part of the where clause?

    What about sorting by multiple columns?

    # July 24, 2008 10:33 PM

    Spencer said:

    Try indexing the columns you plan to sort by..  it helps. Trust me.

    # December 4, 2008 1:41 PM