Attention: We are retiring the ASP.NET Community Blogs. Learn more >

Reporting on massive amounts of data in real time: Need input

Since I've been in the content business online for eight years or so, advertising is the life line that pays the bills. Not my first choice since it has been relatively unstable since the start, but it is what it is.

Eventually the time came where I needed to develop my own ad serving application. The truth is, it works extremely well in terms of serving ads. Where it doesn't work that well is reporting from the millions upon millions of rows generated.

Here's the situation. Every time an ad is served, a row is recorded that contains it's ID, it's location (an ID for a specific ad position on a page), the time it was served, and two GUID's, one for the user's session and one for the user's lifetime identification. The last two are used for session and overall impression capping. They're indexed, and surprisingly enough inserts aren't hindered. I seem to recall an index on the ad's ID as well.

Reporting queries ask for total counts in a given time span, either by the ID or the location, and need to be displayed with hourly or daily totals. Originally I had a background thread that counts up totals every hour and puts those in a new table, but even querying that data is too slow.

Can anyone with more SQL experience offer any suggestions? 

7 Comments

  • Since you need to group these reports by timespan, potentially using a BETWEEN clause, I'd suggest removing the clustered index for the ID column (assuming that's what you have) and creating a clustered index on the actual datetime column.

  • My suggestion will only offer some improvement if your PK column is NOT an identity column but a GUID with a clustered index on (which would be pretty much useless).

  • You have not mentioned what's ur database version. If it's SQL2005, then you can replace GUID with SequentialGUID with a clustered index. It's a better choice for an indexed identity column

    If its SQL2000, then indexing a GUID column would be of no use. Btw, is ur tables partitioned across multiple servers in anetwork? If not why use GUID at all, why not use Identity column itself?

  • a) Index the time column. b)Index the indexes. c)Make the time the primary key. d)faster server.

  • Dude, it's really hard to read your post when you don't spell out words.

    The database can be whatever, but I'd like to avoid tying myself down to any one in particular. Why is indexing a GUID column of no use?

  • Putting a clustered GUID index isn't of any use because there is no benefit in having GUID's that start with the same characters to be on the same physical database page.

    It would if you run queries like:

    SELECT CustGuid,Name FROM Customer WHERE CustGuid LIKE '4451%'

    The point is, you need to make sure that your ad impression records with similar timestamps are physically on the sequential database pages and are not scattered, which you can achieve by putting a clustered index on the datetime column.

    That will save you harddisk IO, since the DB needs to fetch less database pages from disk to return results for your query, hence increasing performance.

    By default, SQL Server will make the primary key a clustered index, which may not always be what you want. And since you can only have one clustered index on a table for obvious reasons (the records can only be stored in one way on disk), you might want to move the clustered index from the PK column to the datetime column.

  • Though intially records from one table may be written on the same DB page until the page is fully allocated, there is no guarantee that will happen. Different table records that are being inserted at the same time could end up on the same DB page.
    As a result of these dynamics, database pages do get heavily fragmented overtime; the clustered index will ensure that your records with similar timestamps remain together on their individual DB pages, and that's important because that's the main way you will be querying these impression records.

    For an aggregate type query for say impressions per hour for a specific ad campaign, try something along the lines of:

    DECLARE @youradid int
    DECLARE @yourdatestring varchar(20)

    SET @youradid=1
    SET @yourdatestring='25 Aug 2006'

    SELECT ad.title,DATEPART(hh,impression.dateserved) AS thehour,COUNT(impression.id) as impressions
    FROM ad
    JOIN impression ON ad.id = impression.adid
    WHERE ad.id=@youradid and CAST(FLOOR(CAST(impression.dateserved AS FLOAT)) AS DATETIME)=@yourdatestring
    GROUP BY ad.title,datepart(hh,impression.dateserved)

Comments have been disabled for this content.