VS Tools for DB Professionals - Does it make sense to generate test data?

The new VS for DB Professionals supports generating test data for your data model.

For each column, you can specify the generator you want to use, and there is a bunch of predefined ones (numbers in a range, strings that comply with a regular expression, values from a list, and so on). There are also some sophisticated features, like using different distribution algorithms for the data, or reading an existing table to find patterns and then generating test data that follows them (for example, 'the CustomerName has between 10 and 15 characters with an average of 12').
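To make the idea concrete, here is a minimal sketch of what column-level generators look like conceptually. This is not the VS for DB Professionals API; the function and column names are made up for illustration.

```python
# Hypothetical column-level generators (illustration only, not the real tool's API).
import random
import string

def int_in_range(low, high):
    """Generator for a numeric column constrained to a range."""
    return random.randint(low, high)

def from_list(values):
    """Generator that picks one value from a fixed list."""
    return random.choice(values)

def string_with_length(min_len, max_len):
    """Generator for a string column that mimics an observed pattern,
    e.g. 'CustomerName has between 10 and 15 characters'."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(string.ascii_letters) for _ in range(length))

def customer_row():
    # One generated row for a hypothetical Customers table.
    return {
        "CustomerId": int_in_range(1, 1_000_000),
        "CustomerName": string_with_length(10, 15),
        "Country": from_list(["Argentina", "Spain", "USA"]),
    }

if __name__ == "__main__":
    for row in (customer_row() for _ in range(5)):
        print(row)
```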

Now, since the generators insert data into the database tables directly, they bypass your business logic layer. Unless you have all the validations and triggers in the database itself, the data in your tables won't be consistent, and in that case it's difficult to use it as a set of data for my tests.
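Here is a rough sketch of that problem, assuming a hypothetical Invoices table and a validation rule that lives only in application code: the generator's raw insert simply skips the rule.

```python
# Sketch of a rule enforced only in the business layer being bypassed by a raw insert.
# Table and column names are hypothetical.
import sqlite3

def create_invoice_via_bll(conn, customer_id, total):
    """Business logic layer: validates before inserting."""
    if total <= 0:
        raise ValueError("Invoice total must be positive")  # rule exists only in code
    conn.execute("INSERT INTO Invoices (CustomerId, Total) VALUES (?, ?)",
                 (customer_id, total))

def generator_insert(conn, customer_id, total):
    """What a data generator effectively does: a raw insert, no validation."""
    conn.execute("INSERT INTO Invoices (CustomerId, Total) VALUES (?, ?)",
                 (customer_id, total))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Invoices (CustomerId INTEGER, Total REAL)")

generator_insert(conn, 42, -100.0)        # succeeds, but violates the business rule
# create_invoice_via_bll(conn, 42, -100.0)  # would raise ValueError
```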

Additionally, there are scenarios where creating test data is very difficult. Suppose you have a table with the product inventory, and each time you create an invoice you check whether there is enough inventory and then decrease it. You need to make sure that the invoices in your test data don't drive the inventory negative, because in that case your constraints will fail. In real-world applications this kind of scenario happens all the time. Unless we have a model where we can declare rules like that, which is not trivial, I don't see a way to solve this issue.
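A constraint-aware generator would have to track state across rows while it generates them, something like the sketch below (table and column names are hypothetical, and as far as I know this is not something the built-in generators do).

```python
# Sketch of a stateful, constraint-aware generator for the Invoice/Inventory example:
# it tracks remaining stock so no generated invoice line drives inventory below zero.
import random

def generate_invoice_lines(inventory, num_lines):
    """inventory: dict of product_id -> units in stock (mutated as rows are generated)."""
    lines = []
    for _ in range(num_lines):
        in_stock = [p for p, qty in inventory.items() if qty > 0]
        if not in_stock:
            break  # nothing left to sell; stop instead of violating the constraint
        product = random.choice(in_stock)
        qty = random.randint(1, inventory[product])  # never exceed remaining stock
        inventory[product] -= qty
        lines.append({"ProductId": product, "Quantity": qty})
    return lines

inventory = {1: 10, 2: 3, 3: 25}
print(generate_invoice_lines(inventory, 8))
print("remaining stock:", inventory)
```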

So, I think the data generator feature is a little naive, and I'm not sure it can be applied in the real world. I hope I'm wrong, because it looks cool ;).

 

1 Comment

  • Hi,

    You raise a good point regarding making the test data generation appropriate to the scenario.

I think the test data generation is a good idea. There are systems that need data we don't necessarily populate via our business logic. For example, what if you are writing an application that talks to pre-existing databases that hold, say, customer information, and the DBAs who own the data won't let you have any unless it's de-sensitised to remove all identifiable information from the customer DB? In some scenarios de-sensitised data is almost useless, because you want to see some real-life names and addresses, not just Mr. FirstName Surname who lives at AddressLine1 in Country 1. That way, when you generate documents or PDF reports from the data, the entities are more identifiable.

Secondly, it makes sense from a load-testing and volume-testing standpoint. One way to test under load is to just fill in a million rows of rubbish data, add some good data, and check how the system performs. But what makes it even better is to generate real-life data with random names and addresses, so that the data is skewed and the indexes are not packed nicely with organised values such as name1, name2, etc., which naturally contribute to good performance.

Thirdly, for the Invoice example you mentioned: yes, you need to take care to generate the data correctly, but surely your system should be able to cope gracefully with negative values and throw the appropriate exceptions? What better way to test edge cases?

Finally, please don't neglect good database design practices. Just because you have constraints and error checks in the business logic doesn't mean you should ignore the database mechanisms for enforcing integrity. Are you sure your business logic layer will always be used for loading data? What if someone wants to use your system in a place where they already have a million invoices and wants to use, say, DTS, to load all the pre-existing data into your database? Will they have to use the API or web service for this bulk data load?

    Just a few thoughts...
    Cheers,
    Benjy
