Jeff and .NET

The .NET musings of Jeff Putz

Sponsors

News

My Sites

Archives

Reporting on massive amounts of data in real time: Need input

Since I've been in the content business online for eight years or so, advertising is the life line that pays the bills. Not my first choice since it has been relatively unstable since the start, but it is what it is.

Eventually the time came where I needed to develop my own ad serving application. The truth is, it works extremely well in terms of serving ads. Where it doesn't work that well is reporting from the millions upon millions of rows generated.

Here's the situation. Every time an ad is served, a row is recorded that contains it's ID, it's location (an ID for a specific ad position on a page), the time it was served, and two GUID's, one for the user's session and one for the user's lifetime identification. The last two are used for session and overall impression capping. They're indexed, and surprisingly enough inserts aren't hindered. I seem to recall an index on the ad's ID as well.

Reporting queries ask for total counts in a given time span, either by the ID or the location, and need to be displayed with hourly or daily totals. Originally I had a background thread that counts up totals every hour and puts those in a new table, but even querying that data is too slow.

Can anyone with more SQL experience offer any suggestions? 

Posted: Aug 24 2006, 10:52 AM by Jeff | with 10 comment(s)
Filed under: ,

Comments

Wim said:

Since you need to group these reports by timespan, potentially using a BETWEEN clause, I'd suggest removing the clustered index for the ID column (assuming that's what you have) and creating a clustered index on the actual datetime column.

# August 24, 2006 11:28 AM

Wim said:

My suggestion will only offer some improvement if your PK column is NOT an identity column but a GUID with a clustered index on (which would be pretty much useless).

# August 24, 2006 11:39 AM

AnjanaRam said:

You have not mentioned what's ur database version. If it's SQL2005, then you can replace GUID with SequentialGUID with a clustered index. It's a better choice for an indexed identity column

If its SQL2000, then indexing a GUID column would be of no use. Btw, is ur tables partitioned across multiple servers in  anetwork? If not why use GUID at all, why not use Identity column itself?

# August 24, 2006 1:50 PM

Edward Ezzell said:

a) Index the time column. b)Index the indexes.  c)Make the time the primary key.  d)faster server.

# August 24, 2006 2:05 PM

Jeff said:

Dude, it's really hard to read your post when you don't spell out words.

The database can be whatever, but I'd like to avoid tying myself down to any one in particular. Why is indexing a GUID column of no use?

# August 24, 2006 2:05 PM

RoyOsherove said:

I'd suggest reading parts of this book: http://weblogs.asp.net/rosherove/archive/2006/08/21/Free-E_2D00_Book_3A00_-Developing-Time_2D00_Oriented-Database-Applications-in-SQL.aspx
# August 24, 2006 5:11 PM

Jeff said:

Wow, that is awesome stuff. That's very much worth the read!

# August 24, 2006 5:37 PM

Wim said:

Putting a clustered GUID index isn't of any use because there is no benefit in having GUID's that start with the same characters to be on the same physical database page.

It would if you run queries like:

SELECT CustGuid,Name FROM Customer WHERE CustGuid LIKE '4451%'

The point is, you need to make sure that your ad impression records with similar timestamps are physically on the sequential database pages and are not scattered, which you can achieve by putting a clustered index on the datetime column.

That will save you harddisk IO, since the DB needs to fetch less database pages from disk to return results for your query, hence increasing performance.

By default, SQL Server will make the primary key a clustered index, which may not always be what you want. And since you can only have one clustered index on a table for obvious reasons (the records can only be stored in one way on disk), you might want to move the clustered index from the PK column to the datetime column.

# August 24, 2006 6:08 PM

Jeff said:

That makes a load of sense, thank you. Would it be logical, though, that as records are created, they'd be physically written next to each other anyway?

Any suggestions about getting aggregate data sets, like totals per hour?

# August 25, 2006 8:33 AM

Wim said:

Though intially records from one table may be written on the same DB page until the page is fully allocated, there is no guarantee that will happen. Different table records that are being inserted at the same time could end up on the same DB page.

As a result of these dynamics, database pages do get heavily fragmented overtime; the clustered index will ensure that your records with similar timestamps remain together on their individual DB pages, and that's important because that's the main way you will be querying these impression records.

For an aggregate type query for say impressions per hour for a specific ad campaign, try something along the lines of:

DECLARE @youradid int

DECLARE @yourdatestring varchar(20)

SET @youradid=1

SET @yourdatestring='25 Aug 2006'

SELECT ad.title,DATEPART(hh,impression.dateserved) AS thehour,COUNT(impression.id) as impressions

FROM ad

JOIN impression ON ad.id = impression.adid

WHERE ad.id=@youradid and CAST(FLOOR(CAST(impression.dateserved AS FLOAT)) AS DATETIME)=@yourdatestring

GROUP BY ad.title,datepart(hh,impression.dateserved)

# August 25, 2006 10:34 AM
Leave a Comment

(required) 

(required) 

(optional)

(required)