Mike Diehl's WebLog

Much aBlog about nothing...

I've been working on SSIS packages that extract data from production databases and load it into data warehouses, and recently I hit an issue with the Bulk Insert task that bit me real good.

 When you create a Bulk Insert task in the control flow of your package, the properties you generally edit are:

1. The target connection (which references a connection manager)

2. The target table

3. The source file (which references a file-type connection manager).

 I did that, ran the package in Visual Studio with my local file against a dev SQL database on a test server and it all worked just fine.

I ran it again, and it failed due to a primary key violation. So I needed to make the execution of the task conditional: if the table was empty, I would run the task; if it contained anything, I would skip it.

This was harder to do than I thought it would be. I started by creating a variable to hold the row count of the table, then added an Execute SQL Task to run a statement on the target table (select count(*) as RowCount from targetTable) and set the variable value from that column in the statement's result set.

Next I went looking for an IF construct, and there isn't any such thing. The closest was a For Loop container, and I went down a rabbit trail trying to make it execute either zero times or once, but I couldn't get that to work. Is there magic between the @variable syntax in the initialize, condition, and iteration expressions and the package User::variable declarations that makes those work together? I still don't know the answer to that.

Then I thought of using the Expression on the dependency arrow (the precedence constraint) coming out of the task that got the row count from the target table. So I joined the Row Count task to the Bulk Insert task using the green arrow, then edited the dependency so that it depended both on Success of the row count task and on the value of the User::rowCount variable I had created. That worked.
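To make the pattern concrete outside of SSIS, here is a minimal sketch in Python using the standard sqlite3 module. It is an analogy, not the package itself: the table name, columns, and the load_if_empty helper are all made up for illustration. In the package, the precedence constraint expression would look something like @[User::rowCount] == 0, evaluated together with the Success constraint.

```python
import sqlite3

def load_if_empty(conn, rows):
    """Insert rows only when target_table is empty; return True if the load ran."""
    # Step 1: the "row count" task - store the count in a variable.
    row_count = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
    # Step 2: the precedence constraint - the count query succeeded AND returned zero.
    if row_count != 0:
        return False
    # Step 3: the "bulk insert" task, committed as one transaction.
    with conn:
        conn.executemany("INSERT INTO target_table (id, name) VALUES (?, ?)", rows)
    return True

# Hypothetical target table with a primary key, as in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(1, "alpha"), (2, "beta")]

first_run = load_if_empty(conn, rows)   # table is empty, so the load runs
second_run = load_if_empty(conn, rows)  # table has rows, so the load is skipped
print(first_run, second_run)
```

Running the function twice is the whole point: the second call skips the insert instead of tripping over the primary key.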

 Believe it or not, that isn't really what bit me in the butt.

Now I had a package that I could execute multiple times and it would work properly. My buddy Jeremy would say that it is "idempotent".

 What bit me was when I went to execute the package in another environment.

 I moved the .dtsx file to a test server and used the Execute Package Utility. I set the values for the connection managers in the package to the new server connections (and the new location of the bulk copy file), and ran the package, and it worked.

Just to make sure it was "idempotent", I ran it again.

It failed this time.

Another PK violation. Why?

It took me a while to find the problem. Eventually it came down to the target table property of the Bulk Insert Task - the value of this property was not just a two part table name, but it also included the database name.

It just so happened that the database I had been testing against from Visual Studio was on the same server I was now targeting with the Execute Package Utility.

So, the first time I ran it with the Execute Package Utility with the modified connection manager settings, it was querying the *real* target database for the number of rows, and getting back 0. Then it executed the bulk insert task into the *original* database I had been testing with on that server (where I happened to have cleared the rows out of the table), and the bulk insert worked. The second time, the number of rows was still 0, and it tried to do the bulk insert into that same original database, despite the fact that the connection manager was pointing to a different database.

I can understand why this was done this way, since when you use the bcp command line utility, you need to database-qualify the table you are moving data into, because bcp doesn't specify the database otherwise. But the Bulk Insert task was using the T-SQL statement BULK INSERT which is already database specific, so you don't generally qualify the table name with the database. With the whole database name in the target table property, the task isn't very responsive to changes to the connection manager at runtime.
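The failure mode is easy to reproduce in miniature. The sketch below is an analogy in Python with sqlite3, not SSIS or SQL Server: the "main" database stands in for whatever the connection manager points at, and an attached database named devdb (a made-up name) stands in for the original dev database that was baked into the target table property. The run_package helper is hypothetical too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # "main" = the database the connection points at
conn.execute("ATTACH DATABASE ':memory:' AS devdb")  # stand-in for the original dev database
for db in ("main", "devdb"):
    conn.execute(f"CREATE TABLE {db}.target_table (id INTEGER PRIMARY KEY)")

def run_package(conn, insert_target):
    """Count rows via the connection's own database, then insert into insert_target."""
    # The row count query uses an unqualified name, so it follows the connection.
    count = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
    if count != 0:
        return "skipped"
    try:
        with conn:
            conn.executemany(f"INSERT INTO {insert_target} (id) VALUES (?)", [(1,), (2,)])
    except sqlite3.IntegrityError:
        return "pk violation"
    return "loaded"

# Database-qualified target: the count and the insert look at different databases.
qualified_runs = [run_package(conn, "devdb.target_table") for _ in range(2)]
# Unqualified target: both steps follow the connection, so the package is idempotent.
unqualified_runs = [run_package(conn, "target_table") for _ in range(2)]
print(qualified_runs, unqualified_runs)
```

With the qualified name, the first run loads and the second run hits the primary key, exactly the sequence described above. With the unqualified name, the second run is skipped as intended.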

Here is how I fixed it, and it's a HACK.

You can't just free-form enter the table name in the Bulk Insert task; it only allows you to pick from the list, which is based on the target connection you specify. So, in the Expressions tab of the Bulk Insert task, I used an expression to set the target table property to the non-database-qualified name of the table. It's kinda hidden, and it now overrides whatever table you select from the drop-down later, and you wouldn't know it.

I hope that the SSIS Bulk Insert task gets fixed so it doesn't include the database name in the target table; it isn't needed, and it gets in the way of runtime changes to the target connection.

Mike

Posted by MikeD | with no comments

At Imaginet, we use Visual Studio Team Edition for Database Professionals (Data Dude) on our projects to manage database schemas, keep them in source control, run unit tests, and take advantage of lots of other nice features.

But it doesn't do database models well. Or at all, for that matter. I really would like the Database Diagramming tool in SQL Management Studio and Visual Studio to be able to go against a database project. But no, it can only go against an actual database.

Here is what we do to be able to model our tables and relationships with the diagramming tool and still use Data Dude.

For every project, we have a number of database "instances" - usually named after the project (I'll use the name Northwind from here on) with a suffix for the "environment", such as Northwind_Dev and Northwind_Test.

We also have another called Northwind_Schema, which is considered the "gold" standard for the schema of the project database. I'll start by creating that schema database and create tables in it using the database diagramming tool in SSMS. I can fairly quickly create a number of tables, and have a diagram for each subject area of the data. It also means my documentation is getting built at the same time as my database (in my world, the diagram forms the large part of the required database documentation). And these diagrams, like Xml comments in C# or VB, are also very close to "the code", and will keep current with the state of the schema database. Models created in other tools then exported to a database are very hard to keep accurate in the long run. When it comes time to snapshot the documentation for the database, we can fairly quickly embed pictures of the database models in Word or OneNote or some other documentation tool.

At the same time as I am modelling the database in Northwind_Schema, I create a database project in Visual Studio called Northwind. If I have the Northwind_Schema database in a state that I like (for first draft), I will use the Import Schema from Database wizard when creating the new database project. Otherwise, I'll just create an empty database project.

When I am happy with Northwind_Schema, I use a Schema Comparison to compare the Northwind_Schema database to the Northwind database project. I will update the database project with the changes that are in Northwind_Schema, then run any local tests against the database project before checking in.

Upon checkin, we have Team System automatically build the database and deploy it to Northwind_Dev, which is available for any developers on the project to use as they code other areas of the project. In the project I am working on now, we use LINQ and CSLA-based entities for our data access layer, so I will keep our LINQ model synchronized with the database project as well (usually by dragging tables onto the LINQ designer surface from the Northwind_Schema database).

If we ever lose Northwind_Schema, it is easy to rebuild it from the database project, because the database project in source control is "more true" than the Northwind_Schema instance. (However, we can lose the diagrams by rebuilding Northwind_Schema).

As I said above, I would actually prefer to do my diagramming in Visual Studio, against a database project rather than a database, and in that way I could also keep the diagrams in source control. But with the Northwind_Schema database, I can model new subject areas or do fairly major refactoring prior to checking out the database project files.

In my next post, I'll talk about how we build and manage stored procedures in project databases.

Posted by MikeD | with no comments

Here are my notes on the Monday morning keynote:

  • About 3000 attendees at the conference, over 60 countries represented.
  • There is BI in Halo3: whenever you look at competitor stats or weapon effectiveness, this is implemented using BI tech

Madison - MS has acquired DATAllegro, a company that was accomplishing low-TCO MPP (massively parallel processing) scale-out of BI. Using standard enterprise servers, you can process queries on very large data warehouse databases very quickly. They demonstrated a hardware setup of an MPP cluster: one control node, 24 compute nodes, and at least as many storage nodes (i.e. shared disks). They loaded 1 trillion (yes, trillion) rows into the fact table, plus a bunch of dimension tables, such that the data warehouse contained over 150 terabytes of data. Then they sliced the fact table up across the 24 SQL instances on the compute nodes (each compute node then had 1/24 of the trillion rows) and replicated the dimension tables to all compute nodes. Using SQL 2008 (and its new star join optimization), they then issued a query on the fact table and the related dimension tables to the cluster: the control node passed the query along to the compute nodes, each one processed it, and the results were returned to the client.
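The shape of that setup (slice the fact table, replicate the dimensions, fan the query out, merge the partial answers) can be sketched in a few lines of Python. This is only an illustration of the scatter-gather idea as I understood it from the demo, not how Madison or DATAllegro is implemented; the data, table names, and both functions are invented, and threads stand in for compute nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 24  # compute nodes, as in the demo

# Dimension table: small, so every compute node gets a full copy.
product_dim = {1: "Bikes", 2: "Helmets", 3: "Gloves"}

# Fact rows as (product_key, sales_amount). Synthetic data, for illustration only.
fact_rows = [(1 + i % 3, 10 * (1 + i % 5)) for i in range(2400)]

# Slice the fact table so each node holds 1/24 of the rows.
slices = [fact_rows[n::NUM_NODES] for n in range(NUM_NODES)]

def compute_node(fact_slice, dim):
    """Join the local fact slice to the local dimension copy; return partial totals."""
    partial = Counter()
    for product_key, amount in fact_slice:
        partial[dim[product_key]] += amount
    return partial

def control_node(slices, dim):
    """Send the same query to every compute node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(lambda s: compute_node(s, dim), slices))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the amounts together
    return dict(total)

sales_by_category = control_node(slices, product_dim)
print(sales_by_category)
```

Because the dimension table is replicated, every node can do its join locally, and the control node only has to add up small partial results.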

On one screen they had Reporting Services (the client app) and on another, a graphic display of the CPU and disk stats for the control node and all 24 compute nodes, each node having 8 CPUs. When the report was being displayed, the query got processed, and you could see the CPU usage go up on many of the nodes, then disk usage on each of the nodes, then the activity would subside and the reporting view would display the results. It was all done in under 10 seconds. It was truly impressive. Now, that was with essentially read-only data, so you could probably "roll your own" MPP system, given the time and hardware. It's not a huge technical problem to scale out read-only data. If they could show the same demonstration except with an SSIS package *loading* a trillion rows into the cluster, that would have been astounding - it's a much different and more difficult problem. Still, I was impressed.

Gemini - this is "BI Self Service" - the first evidence of this is an Excel add-in that the always-entertaining Donald Farmer demonstrated. He used the add-in to connect to a data warehouse and showed 20 million rows in a spreadsheet. We didn't *see* all 20 million rows, but he did sort it in under a second, and then filtered it (to UK sales only, about 1.5 million rows) in under a second. That performance and capacity was on what he said was a <$1000 computer with 8 GB RAM, similar to what he purchased for home a few weeks ago.

Aside from the jaw-dropping performance, he used the add-in to dynamically link the data from Analysis Services with another spreadsheet of user-supplied data (I think it was "industry standard salary" or something). The add-in was able to build a star schema in the background automatically and then make it available in the views they wanted in Excel (a graph or something? I can't remember). So it was showing the fact that sometimes the data warehouse doesn't have all the data needed for users to make decisions, so they got the data themselves, rather than wait for IT to get it in the DW. Ok, cool. So then he published that view into SharePoint using Excel Services, and the user-supplied data went along with it. Centrally publishing that view means it can be utilized by others in the enterprise, rather than being shared via email or a file share or something.

From the IT perspective, he showed a management view (dashboard) in SharePoint showing usage stats of "Sandboxes" (the thing they are currently calling these publications) and they could see how popular this particular sandbox was, and then take steps to formalize it into the enterprise. The tantalizing link on that web page was "Convert to Performance Point" - the idea was that you could take the sandbox view and convert it into a PPS web part. That looked cool too.

So Gemini looks very interesting.

Timeframes: the next major release of SQL will be 24-36 months from release of SQL 2008, but in the meantime, there are a number of releases coming: Madison and Gemini will be coming in the first half of 2010, and CTP's will be available sometime early next year. There are some incremental releases of Analysis Services, Integration Services and Reporting Services coming - the next gen of Reporting Services in particular will become available in a Feature Pack "real soon now".

 

Posted by MikeD | with no comments

It has been over a year since I last blogged, but I want to restart with some posts about the BI Conference I am attending this week.

Chris and I flew to Vancouver yesterday and drove down to Seattle in a Camry Hybrid. Sitting in the lineup at the US border for an hour drained the batteries on the Camry so it had to restart the engine to recharge a couple of times, for about 10 minutes each time. Seemed odd to discharge so much battery just sitting in a lineup and moving 10 feet every five minutes. Anyway...the trip display shows that our fuel efficiency was under 8 liters/100km on the trip down. That also seems a little poor compared to my Golf TDI that gets 4.5 liters/100 km regularly.

We registered last night and wandered the Company Store for a bit - saw uber-geek stuff there and we thought of getting something for Cam, our uber-geek on the team at Imaginet. The conference package was predictable: a nice back-pack, a water bottle, a pen, a 2 GB USB stick, a SQL Server magazine, not as many sales brochures as last year, and a conference guidebook.

Last year's guide book was a small coil bound notebook with a section of blank pages at the end for taking notes. This year's edition has the same content - a description of all the sessions and keynotes and speakers, as well as sponsor ads, but it is missing the note-taking section. I really liked that section last year, so today I found myself scribbling notes on loose paper, and running out. I specifically left my (paper) notebook at home because I liked the smaller conference book instead, but now I am going back to the Company Store to buy a small notebook for the rest of the sessions.

The conference is trying to be more environmentally friendly - in the backpack was a water bottle and they encouraged you to refill that at the water stations rather than having bottled water. That's cool. For me, I would have preferred a coffee mug, since I had three cups of coffee over the day (in paper cups, and no plastic lids). In a strange twist, the breakfast and lunch dishes were on paper plates and not the real dishes like last year - one step forward, two steps back I guess. I can't figure it out.

In the main hall before the keynote address, there was a live band playing 80's hits. They were pretty good, but it seemed odd to have a bouncy, energetic group on stage at 8:30 on a Monday morning while everyone was filing in and sitting down, morning coffee still just starting to kick in. The bass player was one or two steps beyond bouncy-happy. It reminded me of someone on a Japanese game show.

 

Posted by MikeD | with no comments

I suppose this might be in the manual, but...

 If you want to rename a Build Type that you have created in a Team System Project, you need to open the Source Control Explorer window, dig down into the TeamBuildTypes folder under the project, and rename the folder that corresponds to the build type you want to change. After you check in that change, refresh the Team Builds folder in Team Explorer and you'll see your newly named Build Type.

 Remember to change any scheduled tasks you may have created to run your builds automatically.

One more thing about naming Build Types - because we like to have an email sent out to the team members after a build, we have found that a naming convention for the build types makes it easier to recognize and organize the build notifications. We use a standard that includes the environment, the Team Project name, and the sub-solution as the name of the build. So we have build names like

DEV Slam Customer Website - This builds the CustomerWebsite.Sln in the $\Slam\DEV branch.

QA Slam Customer Website - This builds the CustomerWebsite.Sln in the $\Slam\QA branch.

DEV Slam Monitor Service - This builds the MonitorService.Sln in the $\Slam\DEV branch.

QA Slam Monitor Service - This builds the MonitorService.Sln in the $\Slam\QA branch.

 Having the project name in the build type helps because if you are a subscriber of lots of different builds for different projects, you cannot tell by looking at the email (other than this naming convention) which project the build is from.
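If you generate or check build type names with a script, the convention boils down to a one-liner. This is just a sketch in Python; the build_type_name function is made up for illustration and is not part of Team Build.

```python
def build_type_name(environment, team_project, sub_solution):
    """Compose a build type name as '<ENVIRONMENT> <Team Project> <Sub-solution>'."""
    return f"{environment.upper()} {team_project} {sub_solution}"

# Reproduce the four example names from the convention above.
names = [
    build_type_name(env, "Slam", solution)
    for solution in ("Customer Website", "Monitor Service")
    for env in ("dev", "qa")
]
for name in names:
    print(name)
```

Putting the environment first means the notification emails sort by environment, then by project, in most mail clients.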

 

Posted by MikeD | with no comments

The evening reception was at the Experience Music Project/SciFi Museum Hall of Fame. Gary and I walked through the SciFi Museum. It was really great, lots of memorabilia from all the TV series and movies, as well as books, comics, magazines, scripts, photos, videos. Really cool. The only underrepresented sci-fi series was Dr. Who - I saw one thing from that, the "Fun Gun". No Daleks. Gary has read a lot of sci-fi, I found out.

I thought it was going to be a banquet and awards ceremony - they had awards, but it was more like a standup reception. No tables except in a tent outside. It was pretty stuffy inside, so I hung out with Gary in the fresh air (well, he was smoking, and so were a lot of others around me, but for the most part it was fresh air).

It would have been a great night to have my wife along - she would have loved the sci-fi museum (and the Experience Music Project, very little of which I saw), and she would have been a classy-looking woman to have with me too. We'll come see this place another time.

Tomorrow morning keynote is Steve Ballmer. Somehow the chant "Business Intelligence, Business Intelligence, Business Intelligence" or "Information worker, information worker, information worker" doesn't really roll off the tongue. What will be his hook tomorrow?

 

 

Posted by MikeD | with no comments

Apparently the Power Hour has been something that has been happening at TechEd in past years. I vaguely recall seeing something about it once.

This was a great session - two slides in total I think. All demos, and the demos were different - kinda crazy, but still educational. Lots of free stuff thrown into the crowd. I got something for my daughter.

1st demo - Magic 8-ball vs Data Mining Neural Net algorithm.

In an Integration package, the guy took a table of customer demographics, and ran it in parallel through two different algorithms to predict whether the customer was a homeowner or not. One algorithm was a DM NeuralNet algorithm, the second was a Script that launched the Magic 8-ball window. Looking at the results, the 8-ball didn't do too badly. The demo was interesting in that it showed you could solicit feedback from the user who was executing it (the 8-ball was in a Windows Form, created on the fly in the package).

2nd Demo - by Hitachi Consulting, he demoed an implementation of Analytics for mobile devices. The framework they built helped push out reports, alerts, forms, to a mobile device. They used MS Communication Server to send an SMS text message to the phone, and when the phone received the text message, it used web services to pull back the content (alert, report, form, etc). So he sent out a "Price Change" alert. An RMA authorization form. A Sales report. He said they also had a method to ping the phone and tell it to erase all its content, in case it got lost or stolen. Very cool.

3rd demo - the guy said he wanted to find the geekiest thing to do with Integration Services. He took two sets of a million random numbers between 0 and 1, and through selection of them and applying an algorithm, he basically calculated the value of PI. He didn't tell us what it was until at the end it became obvious. Terribly geeky.
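My guess is that this was the classic Monte Carlo estimate: treat the two sets of random numbers as x and y coordinates in the unit square, count how many points land inside the quarter circle of radius 1, and multiply that fraction by 4. The presenter didn't spell out his algorithm, so the Python sketch below is my assumption of what the package was doing, not a copy of it.

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi from two sets of n random numbers in [0, 1)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    xs = [rng.random() for _ in range(n)]  # first set of random numbers
    ys = [rng.random() for _ in range(n)]  # second set of random numbers
    # "Selection": keep only the points that fall inside the quarter circle.
    inside = sum(1 for x, y in zip(xs, ys) if x * x + y * y <= 1.0)
    # The quarter circle covers pi/4 of the unit square, so scale the fraction by 4.
    return 4.0 * inside / n

print(estimate_pi(1_000_000))
```

With a million points the estimate is typically good to two or three decimal places, which is about when it would "become obvious" to the audience.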

4th demo - The guy had built a custom Reporting Services item, which took in a dataset (a summary of sales amounts for three sales reps in three categories) and provided an interactive KPI mechanism for displaying the data. It presented the categories and reps in a 3x3 matrix, with a green and a red button in the corner of each cell. If you decided the amount was bad, you clicked the red button and the cell got a red X. If you thought it was a good amount, you clicked the green button, and the cell got a green O. (Get it? X's and O's in a 3x3 matrix?)

Last demo - Using Performance Point Server, they showed a web page with ten suitcases on it, and they invited someone to come up and play Deal or No Deal. She won $10 (in play money I think). Her highest offer was nearly $485,000.

Between each demo they threw out schwag, like t-shirts and hats and stuff.

We should definitely try this the next demo we do at Imaginet.

Posted by MikeD | with no comments

A rolling stone grows no MOSS.

This guy was very MOSSy, because he didn't roll very much. I was nodding off in this session because the presenter wasn't very passionate, or funny, or showing me anything that hadn't already been shown in the keynotes, or at the MSDN tour in December.

(Microsoft Office SharePoint Server, btw).

The cool thing about SQL Server 2005 SP2 is that it adds much better integration of SQL Server Reporting Services into MOSS: MOSS contains the reports repository, and the published reports become a document library, with much better web parts for integrating into SharePoint.

I snoozed a little during this session. Good thing I did, because I was really glad I was awake for the next session.

Posted by MikeD | with no comments

Two thumbs down.

Well, it probably was a great session, but I got there five minutes early, and there were already 30 people waiting outside the "room" it was in. So I went for lunch instead.

Chalk talks are the 2nd or 3rd class citizens in the sessions here at the conference, but they have the potential to be the most valuable. At least from my perspective. These sessions are real world, not lovey-dovey like the main sessions.

Microsoft, the chalk talks deserve a 1st-class upgrade. Please, I'm begging you.

Posted by MikeD | with no comments

A woman who was formerly from ProClarity, now a product manager in the Performance Point Server group, presented this session on ProClarity.

"Interface to insight" - answering the WHY? in BI.

Tools for decision makers to explore large amounts of data and get rapid insight.

Simple data navigation, powerful calculations, and advanced visualizations of data.

Reports tell you what happened. Dashboards tell you what is happening now, and ProClarity Analytics helps understand Why it is happening.

ProClarity Analytics Server (PAS) is an IIS App. There is also a SQL database of business metadata. The clients are thin-client web-based, thick client web-based (ActiveX control), and Windows-based thick client.

This product, to me, addresses a lot of the "last mile" gap between SQL Server Analysis Services cubes and stuff, and the user.

The KPI builder makes it easy to create calculated measures with no MDX at all and publish them to the PAS.

Advanced visualizations: the heat map, like SpaceMonger, shows boxes stacked together, with the size indicating one measure (sales amount), and the colour indicating another KPI (profit margin, good/warning/bad).

Decomposition tree - view a measure, break it down by category, then by another dimension, and so on. Hard to describe, cool to see.

She was great - perfect balance of architecture slides to show how the thing fits together, with lots of demo time with the product itself.

 

Posted by MikeD | with no comments