In the past weeks I've read several articles / blog-posts and other digitally expressed thoughts about relational databases, query systems and how they all suck compared to K/V stores, CQRS, OODBs or whatever Hype of the Day-term. While most of them were simply re-labeling 20+ year old common knowledge, others were pretty stupid and downright sending the (novice) reader the wrong message. With 'wrong' I mean: the conclusions are based on false 'facts', assumptions and hand-waving n==1 pseudo-science.
Instead of writing a long essay here, I'll quote from and link to several Wikipedia articles and other articles which can help you learn what relational models and databases are all about, what theory they're based on, why they work and what tools (as in: methodologies) are at your disposal. It's not meant to sell you the picture of 'OODB==bad, RDBMS==good', as that would be silly and as short-sighted as the articles I mentioned above. Instead you should see this small subset of knowledge about relational models and databases as a starting point for yourself when you are researching what to use and how to face a problem domain. After all, you can only make an informed decision if you know what you're talking about.
Relational model and theory
Relational databases are based on the relational model as described by E.F. Codd. Its core is based on predicate logic, and operations on the model are expressed in relational algebra, which is equivalent in expressive power to relational calculus, a query notation built on first-order predicate logic.
Links:
Why is this so important? It's important because it will teach you the idea behind grouping attributes together into entities and using them to define the meaning of data, and above all: creating new entities from them by using projections.
Normalization and De-normalization
Relational models are in a given normal form. This normal form describes how much redundant information is stored in the data set defined by the relational model. Transforming a model from a lower normal form to a higher normal form is called normalization; going the other way is called de-normalization. Normalization is considered a good thing because it prevents the problems (insert, update and delete anomalies) which arise from redundant information in a de-normalized model when you perform data manipulation operations.
Links:
Important quote, which should describe for you why normalization is a Good Thing:
Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others
This is precisely the point of using a relational model: you store data in a model which gives it meaning in a general form, so you can create new information from it by using relational algebra, in such a way that it doesn't matter what query you might need in the future, it's already suitable for dealing with that.
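To make that a bit more concrete, here's a minimal sketch using Python's built-in sqlite3 module. The tables, columns and data are made up for illustration; the point is only that two queries nobody planned for when the tables were designed can both be answered with plain joins, projections and aggregates.

```python
import sqlite3


def build_normalized_db():
    """Create a tiny, hypothetical normalized model: each fact is stored once."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT NOT NULL);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL);
        CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                               customer_id INTEGER NOT NULL REFERENCES customer(id),
                               product_id  INTEGER NOT NULL REFERENCES product(id),
                               quantity    INTEGER NOT NULL);
        INSERT INTO customer VALUES (1, 'Alice', 'NL'), (2, 'Bob', 'US');
        INSERT INTO product  VALUES (1, 'Widget', 2.5), (2, 'Gadget', 10.0);
        INSERT INTO orders   VALUES (1, 1, 1, 4), (2, 1, 2, 1), (3, 2, 2, 3);
    """)
    return con


def revenue_per_country(con):
    # One 'future' question: how much revenue does each country generate?
    return con.execute("""
        SELECT c.country, SUM(o.quantity * p.price)
        FROM orders o
        JOIN customer c ON c.id = o.customer_id
        JOIN product  p ON p.id = o.product_id
        GROUP BY c.country
        ORDER BY c.country
    """).fetchall()


def buyers_per_product(con):
    # A completely different question, answered from the same unchanged tables.
    return con.execute("""
        SELECT p.name, COUNT(DISTINCT o.customer_id)
        FROM orders o
        JOIN product p ON p.id = o.product_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()
```

Neither query required a change to the stored model, which is exactly the property the quote describes.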
Normalization has a downside: reading data from a normalized model can lead to expensive operations. For performance reasons, relational models sometimes get a de-normalized variant. This variant is then used for situations where a lot of queries have to be run on many different tables with many joins to obtain the requested result. By using a copy of the data for reads, in an optimized de-normalized form obtained by projecting the data (using normal relational algebra) from the normalized model onto the de-normalized model, read queries can be sped up drastically. As the de-normalized form is a copy of the original data and isn't used for data manipulation, it doesn't run into the downsides of a de-normalized model.
In modern relational databases, these de-normalized variants are implemented with materialized views. Materialized views (in SQL Server they're called indexed views) are queries whose results are stored on disk like a normal table. You can add indexes to these 'tables' to make them even better suited for fast reads. Materialized views can be partitioned across multiple systems to be able to handle massive sets of data. In today's hyped-up term CQRS one can clearly recognize these two variants of the same model: for data manipulation work, the normal relational model is used, and for reads, the de-normalized variant with materialized views.
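The idea of a projected read copy can be sketched with Python's sqlite3 module. SQLite has no materialized views, so this sketch imitates one with a plain table that is rebuilt on demand; the table and function names are hypothetical, and a real database would keep the copy up to date for you.

```python
import sqlite3


def build_source(con):
    """Normalized model used for data manipulation (made-up example data)."""
    con.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE post   (id INTEGER PRIMARY KEY,
                             author_id INTEGER NOT NULL REFERENCES author(id),
                             title TEXT NOT NULL);
        INSERT INTO author VALUES (1, 'Ann'), (2, 'Ben');
        INSERT INTO post   VALUES (1, 1, 'Joins'), (2, 1, 'Views'), (3, 2, 'Keys');
    """)


def refresh_read_model(con):
    """Rebuild the de-normalized copy by projecting it from the normalized tables."""
    con.executescript("""
        DROP TABLE IF EXISTS post_read_model;
        CREATE TABLE post_read_model AS
            SELECT p.id AS post_id, p.title AS title, a.name AS author_name
            FROM post p JOIN author a ON a.id = p.author_id;
        CREATE INDEX ix_post_read_model_author ON post_read_model(author_name);
    """)


def titles_by_author(con, author_name):
    # Reads only touch the flat, indexed copy: no joins at query time.
    rows = con.execute(
        "SELECT title FROM post_read_model WHERE author_name = ? ORDER BY post_id",
        (author_name,),
    ).fetchall()
    return [title for (title,) in rows]
```

Writes keep going to author and post only; the copy is stale until refresh_read_model runs again, which is the trade-off you accept for cheaper reads.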
Links:
Information Analysis
No matter what people say, always remember that performing analysis of your problem domain, which functionality is required, what the abstract entity definitions are that you can recognize in your problem domain etc. is a good thing. If you don't know what you're dealing with, you can't create successful software to solve the problem at hand. Period.
The big problem with analysis is: if you use a flaky analysis method, you could end up with a skewed picture of your problem domain and therefore your software will likely suck. However, how do you know your analysis method is flaky?
In short, analysis comes down to: gather as much information as you need to make informed, well-reasoned decisions during the software development process. So you need to convince yourself, based on facts, that your analysis was complete and you have gathered all the information to make these informed decisions. This is a complex process and you need a methodology which allows you to do proper analysis of your problem domain without leaving important areas untouched. One of the pioneers on this subject is E. Yourdon. Today, some people consider his work very 'dated' but I firmly disagree: analysis of problem domains hasn't suddenly changed, and the tools Yourdon provides are as alive and useful today as they were 20+ years ago. If you look closer and understand what Yourdon meant with all the elements he described, you'll see that most of them have got new names in recent years, and are actually not that dated at all.
As the field of analysis is very wide and deep, it's impossible to describe all possible forms and methodologies. I'll give you a list of links below which is very incomplete, but should get you started.
Links:
Friction and the Impedance Mismatch
One important aspect of a relational model is that it doesn't define behavior. Behavior is defined outside the relational model, through relational algebra using operations. This is a mismatch with modern day Object Oriented software design where data and behavior are combined in an object. If you look at software as code, you'll run into a problem if you want to store live objects you have in memory into a relational database: relational databases work with tables and relational algebra, not with objects and their embedded behavior.
This has resulted in a different form of database: the OO database or OODB. OODBs are nothing new, they've been around for many years. They're ideal if you want to store data in the same shape it has in memory, so your live in-memory objects can be stored without any conversion and you can get them back later on without a conversion (there are conversions, but they're not 'in your face', they're hidden behind the scenes). You effectively look at the data as if it's inside your object model, so you navigate from one object to the next, not from one table to the next.
If you go back to the relational model and the quoted important aspect of a normalized model, you'll see that an OODB has an important side-effect: it's not really suitable for queries which don't match your object model (or don't use objects at all). This gets particularly problematic when the software using the OODB is replaced with a different system: the database outlives the software which required its existence, but it isn't usable in a different form, as it was set up and designed to meet the requirements of the replaced software.
Being agile means to be able to cope with this, to be able to deal with change. If you use a relational model which is designed to represent reality, the facts you recognize in the problem domain, the software consuming it doesn't dictate how the model is defined, the reality does, as the relational model represents reality. That's an important aspect which makes it clear why relational models are so important, even in OO software: if the software is replaced with something else, the relational model can be re-used as well as all its data, simply because it's not tied to (or as I should say coupled with) the software consuming it.
To overcome this impedance mismatch between OO software and relational models, the concept of Object / Relational mapping has been introduced. This 'mapping' between two sides A and B is based on the idea that in A there is an element E which is directly related to an element F in B and vice versa. This theoretical connection between E and F, so the connection the mapping is based on, is the sole reason why it works in the first place. That's why you can save an object to a relational model and fetch it back (you're not saving the object, you're saving its data, the entity instance, but I've already written about that some time ago, see below). It works because both sides give the same meaning to the data. This is important because you can then work with a relational database in an OO fashion and you don't run the risk of being unable to deal with change.
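As a rough illustration of that E-to-F connection, here's a minimal hand-rolled mapping in Python with sqlite3. The Customer class, the customer table and the mapping dictionary are all hypothetical; a real O/R mapper does far more, but the core idea is the same one-to-one correspondence between an attribute and a column.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    full_name: str


# The mapping: every attribute on the object side corresponds to one column on the table side.
CUSTOMER_TABLE = "customer"
CUSTOMER_MAPPING = {"customer_id": "id", "full_name": "name"}


def save(con, entity):
    """Persist the entity's data (not the object itself) as a row."""
    columns = ", ".join(CUSTOMER_MAPPING.values())
    placeholders = ", ".join("?" for _ in CUSTOMER_MAPPING)
    values = [getattr(entity, attribute) for attribute in CUSTOMER_MAPPING]
    con.execute(
        f"INSERT INTO {CUSTOMER_TABLE} ({columns}) VALUES ({placeholders})", values
    )


def fetch(con, customer_id):
    """Read the row back and rebuild an object from it using the same mapping."""
    columns = ", ".join(CUSTOMER_MAPPING.values())
    row = con.execute(
        f"SELECT {columns} FROM {CUSTOMER_TABLE} WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return None
    return Customer(**dict(zip(CUSTOMER_MAPPING.keys(), row)))
```

The table itself knows nothing about the Customer class, so any other program can still query it with plain SQL.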
As many databases outlive the software they were initially created for, it's important to realize this. All that matters is this: if you look at your data, can you distill information from that data in any shape or form you might wish / need, without needing the special / original software the database was created for?
Links:
I hope this article has given you some insight into how to deal with databases, what they are, what the essential aspects of databases and relational models are, why they're important and how to successfully embed them into your own software. Now, go and build great software based on well informed decisions.
For the people who know me a little it's no surprise, but in case you didn't know: I love algorithms. I think they're the cornerstone of good software and they should be your first source of wisdom for every piece of software you're creating. This post will show an example of what I mean by that and how easy it is if you have a set of algorithms at your disposal which are solid, proven and correct.
The nice thing about algorithms is that you can reason about them without writing a single line of code. Another nice thing about algorithms is that many smart people out there have already documented and proven even more algorithms. This is important, because a proven algorithm has a key feature: it's correct within the boundaries stated by the algorithm itself (e.g. 'correct for all positive numbers'). A correct algorithm is great for writing software because you don't have to worry whether the algorithm is buggy: it's not. All you have to do is implement it correctly. That can be done by taking baby-steps in projecting the algorithm to code, reviewing the code and, if you don't trust your own skills, writing tests for the boundaries stated for the algorithm. Compare that to code where the algorithm also has to be tested and you'll quickly understand how important good solid algorithms are: they allow you to write correct software without a lot of effort.
With LLBLGen Pro v3.0, which will hit its first beta in January 2010 (fingers crossed!), we'll also ship the source-code of our algorithm library Algorithmia (and every v3 licensee is allowed to re-use that code in whatever form they see fit, within the boundaries of the flexible license of course). Algorithmia contains algorithms for graphs, queues, heaps, commands etc. etc. and offers a ready-to-rock set of classes and methods to build software with, without worrying whether the algorithm works in all cases. Algorithmia is written with .NET 3.5 and contains only classes which are not already found in .NET 3.5's BCL. This means that we only implemented functionality not found in .NET itself. For example, there's a KeyedCommandifiedList<T>. This List<T> lookalike class is command aware, which means it's fully aware of undo/redo: every element addition/removal action can be undone and re-done. It's also keyed and can update its own index based on changes of itself or the elements within itself without any help from the outside. Still it's a list and an enumerable. Its FindByKey() method is roughly an O(1) operation (amortized), no matter how big the set of elements in the list is.
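To give an idea of what 'keyed and command aware' means, here's a small Python sketch of such a container. It is not Algorithmia's implementation or API: the class name and methods are made up, and unlike the real class it doesn't track key changes inside the elements. It only shows how a key index and an undo/redo stack can live behind a list-like surface.

```python
class KeyedUndoableList:
    """List-like container with a key index and undo/redo for add and remove."""

    def __init__(self, key_func):
        self._key_func = key_func
        self._items = []
        self._index = {}       # key -> list of items with that key
        self._undo_stack = []  # (do, undo) pairs of callables
        self._redo_stack = []

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def find_by_key(self, key):
        # Dictionary lookup, so independent of the number of elements in the list.
        return list(self._index.get(key, ()))

    def add(self, item):
        position = len(self._items)
        self._run(lambda: self._insert(position, item),
                  lambda: self._delete(position))

    def remove(self, item):
        position = self._items.index(item)
        self._run(lambda: self._delete(position),
                  lambda: self._insert(position, item))

    def undo(self):
        if self._undo_stack:
            command = self._undo_stack.pop()
            command[1]()
            self._redo_stack.append(command)

    def redo(self):
        if self._redo_stack:
            command = self._redo_stack.pop()
            command[0]()
            self._undo_stack.append(command)

    def _run(self, do, undo):
        do()
        self._undo_stack.append((do, undo))
        self._redo_stack.clear()  # a new action invalidates the redo history

    def _insert(self, position, item):
        self._items.insert(position, item)
        self._index.setdefault(self._key_func(item), []).append(item)

    def _delete(self, position):
        item = self._items.pop(position)
        key = self._key_func(item)
        self._index[key].remove(item)
        if not self._index[key]:
            del self._index[key]
```

Because commands are undone in reverse order, the stored positions are always valid at the moment a command is undone or redone.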
Another example is the set of graph classes (directed and non-directed) and accompanying classes and algorithms. Not only are the graph classes usable in a commandified environment (and can be made undo/redo aware in full), they also can be used with SubGraphView classes, which are views on a graph and which contain a sub-set of the elements (vertices) and edges of the graph they're defined on. These views manage themselves, so if an element is removed from the graph, they detect that and also remove the element from themselves. You can subclass the views and override methods to intercept additions to the main graph to maintain views automatically as well. You can see this in action in a video I recently posted about LLBLGen Pro v3.0's QuickModel feature: the view you're looking at on the main graph updates itself when a new element is added to the main graph and it matches a set of rules.
But enough about that, let's look at another practical example of these classes in action: let's look at validation and examining data in data-structures with algorithms. The case in point is to check whether the fields of an entity B, which is a split-off entity of entity A, which aren't mapped in A are optional (nullable). If they're not, we have to report that to the user with an error. LLBLGen Pro v3.0 has a deep validation system to make sure the model is valid before further action is taken (e.g. code is generated, it's updated from refreshes etc.). This validation system is of course extensible so you can add your own validation as well, and it has per-target framework validation so it validates in the scope of the target framework chosen (e.g. LLBLGen Pro runtime framework, Linq to Sql etc.).
What exactly is a split-off entity? Say you have the Employees table in Northwind. This Employees table has two big fields: Photo, which is an Image (BLOB for Oracle fans) typed field and Notes which is an NText (CLOB for Oracle fans) typed field. LLBLGen Pro today, in v2.6, has the feature to exclude them from a fetch and load them at a later point. However, v3.0 will also support other frameworks and for example in the Entity Framework, it's solved in a different way: it supports multiple entities mapped onto the same table with a 1:1 PK-PK relationship between them. This means in practice that the entities which are 'split off' have to be merged into the same record in the table as the root of the entities when they're persisted to the database.
In the example of Employees, let's map two entities onto this table, EmployeeNoBlob and EmployeeBlob. EmployeeNoBlob has fields mapped to all Employees fields except Photo and Notes, and EmployeeBlob has, besides the primary key EmployeeId, only the fields Photo and Notes. EmployeeNoBlob is the 'root' of the two: if a new entity instance of that type is saved, it's inserted into the table Employees. However, if an instance of EmployeeBlob is saved, it always results in an UPDATE statement. This is because it's actually an entity which follows the root, as its primary key field depends on the primary key field of EmployeeNoBlob.
To validate the fields of EmployeeNoBlob and EmployeeBlob, we first have to find all entities which form a split group, thus which are all mapped onto the same target and one of them is the root and the rest is split off this root. We have to make sure we don't take into account entities which are also mapped onto this target but which aren't part of the group, otherwise we'll get false positives.
So how do you find these groups and how do you make sure you know what the root of such a group is?
All entities and their relationships are stored in a non-directed graph in the project: the EntityModel graph. So we first have to find all 1:1 relationships which are between two primary keys. Then we have to make sure both sides are mapped onto the same target.
Once we have that set of relationships, we can create a new graph from those relationships (which are the edges in the graph) and tell a fancy algorithm in Algorithmia, the DisconnectedGraphsFinder, to find all sub-graphs in this new graph which are disconnected from each other. Each sub-graph is then returned as a SubGraphView and we can process each view further, as each view is a split-group.
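Finding disconnected sub-graphs is the classic 'connected components' problem. The sketch below shows the idea in Python with a depth-first traversal; it is not Algorithmia's DisconnectedGraphsFinder, and the function name and the entity names in the usage note are made up for illustration.

```python
from collections import defaultdict


def find_disconnected_subgraphs(edges):
    """Return a list of vertex sets, one per connected component.

    edges: iterable of (vertex_a, vertex_b) pairs of a non-directed graph.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited vertex collects one whole component.
        group = set()
        stack = [start]
        while stack:
            vertex = stack.pop()
            if vertex in seen:
                continue
            seen.add(vertex)
            group.add(vertex)
            stack.extend(adjacency[vertex] - seen)
        groups.append(group)
    return groups
```

Fed with the edges (EmployeeBlob, EmployeeNoBlob), (OrderNotes, Order) and (OrderAudit, Order), it yields two groups: one with the two employee entities and one with the three order entities.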
To find the root of a split-group, we can use another algorithm, the TopologicalSorter. We create from every relationship in a SubGraphView a directed edge in a directed graph, and sort that graph topologically. Topological sorting is a well-known algorithm which orders the vertices of a directed acyclic graph by their dependencies: here every entity ends up after the entity it depends on, so the root, which depends on nothing, comes first.
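A depth-first topological sort fits in a few lines. The Python sketch below is an illustration of the algorithm, not the TopologicalSorter from Algorithmia; it assumes each edge is given as a (dependent, dependency) pair, which matches the FK side to PK side direction described above, and it returns dependencies first.

```python
def topological_sort(edges):
    """Order vertices so that every vertex comes after the vertices it depends on.

    edges: iterable of (dependent, dependency) pairs of a directed graph.
    Raises ValueError if the graph contains a cycle.
    """
    depends_on = {}
    for dependent, dependency in edges:
        depends_on.setdefault(dependent, []).append(dependency)
        depends_on.setdefault(dependency, [])

    ordered = []
    state = {}  # vertex -> "visiting" or "done"

    def visit(vertex):
        if state.get(vertex) == "done":
            return
        if state.get(vertex) == "visiting":
            raise ValueError("graph contains a cycle")
        state[vertex] = "visiting"
        for dependency in depends_on[vertex]:
            visit(dependency)
        state[vertex] = "done"
        ordered.append(vertex)  # appended only after all its dependencies

    for vertex in depends_on:
        visit(vertex)
    return ordered
```

For a split-group the first element of the result is the root, because it is the only vertex without outgoing edges.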
Once we have the root per group, we're practically done because we can then verify which fields are not mapped in the entities in a group which aren't the root and verify whether they're optional or not.
So how does this look in code? Let's look at the method which produces the list of split-off entities first. This method isn't inside the EntityModel as entity splitting depends on the mapped target and therefore it's located in the DatabaseMappingStore class.
/// <summary>
/// Gets all split entities, per target. A split entity is really a group of entities which are mapped onto
/// the same target and which all have 1:1 pk-pk relationships with one entity in their group. If A, B, C and D are
/// mapped onto the same target T and B and D have a 1:1 pk-pk relationship towards A (pkside), and C does not,
/// it means that B and D are 'split off' of A. C is not part of the split and is ignored. Returned is then A with
/// its split off companions B and D. A is then returned as key with values B and D.</summary>
/// <param name="containingProject">The containing project.</param>
/// <returns>Multivalue dictionary with as key the root of the group and as values all split off entities.</returns>
/// <remarks>Split-off entities are used for validations. Not all frameworks support split-off entities.</remarks>
public MultiValueDictionary<EntityDefinition, EntityDefinition> GetAllSplitEntities(Project containingProject)
{
    var allOneToOnePkPkRelationships = containingProject.EntityModel.Edges
            .Where(e => e.RelationshipType == EntityRelationshipType.OneToOne)
            .Cast<NormalRelationshipEdge>().Where(e => e.FkFieldsFormPkOfFkSide);
    var allOneToOnePkPkRelationshipsWithBothSideOnSameTarget = allOneToOnePkPkRelationships
            .Where(e => _entityMappings.FindFirstByKey(e.EntityPkSide).MappedTarget ==
                        _entityMappings.FindFirstByKey(e.EntityFkSide).MappedTarget).ToHashSet();
    // now add all found edges in a graph and determine all disconnected subgraphs. These are our groups. We do that
    // by traversing the graph we create with a disconnected graphs finder, which is a DFS based algorithm for graphs.
    var groupFinderSourceGraph = new NonDirectedGraph<EntityDefinition, NonDirectedEdge<EntityDefinition>>();
    foreach(var foundRelationship in allOneToOnePkPkRelationshipsWithBothSideOnSameTarget)
    {
        groupFinderSourceGraph.Add(new NonDirectedEdge<EntityDefinition>(
                foundRelationship.EntityFkSide, foundRelationship.EntityPkSide));
    }
    var groupFinder = new DisconnectedGraphsFinder<EntityDefinition, NonDirectedEdge<EntityDefinition>>(
            () => new SubGraphView<EntityDefinition,
                    NonDirectedEdge<EntityDefinition>>(groupFinderSourceGraph),
            groupFinderSourceGraph);
    groupFinder.FindDisconnectedGraphs();
    var toReturn = new MultiValueDictionary<EntityDefinition, EntityDefinition>();
    foreach(var subgraphView in groupFinder.FoundDisconnectedGraphs)
    {
        // create a new directed graph which is topological sorted, based on the edges in the subgraphview. The
        // ordering will give us the root and the rest. fk side is start vertex, pk side is end vertex in the non-directed
        // edges in the view (which is a view on groupFinderSourceGraph).
        var sortedSourceGraph = new DirectedGraph<EntityDefinition, DirectedEdge<EntityDefinition>>();
        foreach(var edge in subgraphView.Edges)
        {
            sortedSourceGraph.Add(new DirectedEdge<EntityDefinition>(edge.StartVertex, edge.EndVertex));
        }
        var sorter = new TopologicalSorter<EntityDefinition, DirectedEdge<EntityDefinition>>(sortedSourceGraph);
        sorter.Sort();
        // views are always filled with at least 1 edge, which means at least 2 entity definitions, as edges are
        // relationships between pk's, so sides are always different. The first entity is the root, the rest is the
        // set of split off entities.
        toReturn.Add(sorter.SortResults[0], sorter.SortResults.Skip(1).ToHashSet());
    }
    return toReturn;
}
You might see some extension methods or classes you don't know, like the MultiValueDictionary, they're all in Algorithmia. Looks pretty straightforward and actually pretty easy, simply because it's based on building blocks which are already working and solid. I don't have to worry if the Topological sorter really finds the right order, it does, because the algorithm is correct and the implementation works.
Once we have these entities, we can simply traverse them and match the fields:
var allSplitEntities = GetAllSplitEntities(containingProject);
// now validate the entities per group (root(key) + rest (values)). All entities in allSplitEntities have mappings
foreach(var kvp in allSplitEntities)
{
    // key is group root, values is rest of group (split off entities)
    var rootMapping = _entityMappings.FindFirstByKey(kvp.Key);
    var targetFieldsMappedInRoot = rootMapping.GetAllMappedFieldTargets();
    foreach(var splitOffEntity in kvp.Value)
    {
        var splitOffMapping = _entityMappings.FindFirstByKey(splitOffEntity);
        var fieldsNotMappedInRoot = splitOffMapping.GetAllMappedFieldTargets().Except(targetFieldsMappedInRoot);
        // each field in fieldsNotMappedInRoot now has to be nullable.
        foreach(var splitOfTargetField in splitOffMapping.GetAllMappedFieldsForGivenTargets(fieldsNotMappedInRoot, false))
        {
            if(!splitOfTargetField.IsOptional)
            {
                // error.
                toReturn = false;
                EntityDefinition involvedEntity = splitOffEntity;
                EntityDefinition involvedRootEntity = kvp.Key;
                string involvedFieldPath = splitOfTargetField.PathAsString;
                MessageManagerSingleton.GetInstance().DispatchMessage(
                    new ErrorMessage(involvedEntity.FullName, true,
                        "For the database with driver '{0}', the field '{1}' in entity '{2}' isn't marked as optional, while its containing entity is a split-off entity of entity '{3}' which doesn't have this field mapped."
                            .AsFormatted(dbDescription, involvedFieldPath, involvedEntity.FullName, involvedRootEntity.FullName))
                    .AddCorrection("Open entity '{0}' in its editor and manually mark the field as Optional or otherwise correct the error.".AsFormatted(involvedEntity.FullName),
                        1, () => MessageManagerSingleton.GetInstance().RaiseGoLocationRequested(involvedEntity.CreateSourceLocationDataObject())));
            }
        }
    }
}
In the code above you'll also see a glimpse of the designer's internal message system. This message system is a central hub which receives messages from all kinds of classes and which dispatches them in a central queue. This queue is then visualized in the UI by a form. The advantage of this system is that everywhere in the core system (so that's not the UI, it has no awareness of the UI), you can dispatch an error to the UI without knowing there is a UI. Furthermore, you can attach corrections. These are displayed below the error / warning and the user can click a link to activate the correction: for example it can open an editor for you to go to the location where the error was, but it can also offer you a correction immediately, for example by removing an element directly. It gives you a high degree of de-coupling between elements which otherwise wouldn't have a relationship anyway, and it also gives you an easy way to 'break out' of a deep object hierarchy so you don't have to pass elements back up the call chain.
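The shape of such a hub is easy to sketch. The Python below is not the designer's message system: MessageHub, ErrorMessage and their members are hypothetical names, chosen only to show how core code can dispatch an error with an attached correction without knowing who, if anyone, displays it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ErrorMessage:
    source: str
    text: str
    corrections: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)

    def add_correction(self, description, action):
        """Attach a user-activatable fix; returns self so calls can be chained."""
        self.corrections.append((description, action))
        return self


class MessageHub:
    """Central queue: producers dispatch, listeners (e.g. a UI form) get notified."""

    def __init__(self):
        self.queue = []
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def dispatch(self, message):
        self.queue.append(message)
        for listener in self._listeners:
            listener(message)
```

The producer only depends on the hub, so the same validation code runs unchanged with a UI listener, a logging listener or no listener at all.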
It's things like this which will give you high productivity without a lot of effort. Oh, and Linq + lambdas of course. But you already noticed that from the code snippets I guess.
LLBLGen Pro v3.0 is slated to hit its first beta in January 2010.
In .NET there's a class called StringComparer. It has some handy helpers, like the InvariantCultureIgnoreCase StringComparer. These classes also implement a method called GetHashCode(string), which produces the hashcode in the scope of the comparer, so if you're calling that method on the InvariantCultureIgnoreCase variant, you get the hashcode for that scope.
This is handy as hashcodes are important, for example to find duplicates. We recently ran into an issue with this, as passing a large string to this method caused it to throw an OutOfMemoryException, but ... there was plenty of memory left. What was even stranger was that the string length at which this happens differs per appdomain and even per machine!
So I wrote a little app, sourcecode is below. It fiddles with digits to find the maximum string length one can pass to GetHashCode before it throws this exception. Of course, this is of little use, but it illustrates the problem and is a good repro-case for Microsoft as well. The code below will crash with an OutOfMemoryException, as it tests the found length by increasing it by 1. I'll post this to Connect (yes, I'm that naive, but perhaps this time they'll fix it). Tested on .NET 3.5 SP1 and XP SP3 as well as .NET 2.0 and XP SP3 (I'm pretty sure the error is in Win32, so it might even be OS dependent).
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace OOMTester
{
    public class Program
    {
        static void Main(string[] args)
        {
            int digitIndex = 0;
            char[] digits = new char[8];
            StringComparer comparer = StringComparer.InvariantCultureIgnoreCase;
            while(digitIndex<digits.Length)
            {
                for(int i=9;i>=0;i--)
                {
                    digits[digitIndex] = i.ToString()[0];
                    for(int j=digitIndex+1;j<digits.Length;j++)
                    {
                        digits[j] = '0';
                    }
                    int length = Convert.ToInt32(new string(digits));
                    bool succeeded = false;
                    try
                    {
                        int hashCode = comparer.GetHashCode(new string('X', length));
                        succeeded = true;
                    }
                    catch(OutOfMemoryException)
                    {
                        // failed.
                    }
                    catch(ArgumentException)
                    {
                        // out of range
                    }
                    if(succeeded)
                    {
                        digitIndex++;
                        Console.WriteLine("Digit index increased: {0}. Full digits: {1}",
                            digitIndex, new string(digits));
                        break;
                    }
                }
            }
            Console.WriteLine("MaxLength: {0}", new string(digits));
            int maxLength = Convert.ToInt32(new string(digits));
            string xmlData = new string('X', maxLength);
            int hashcode = comparer.GetHashCode(xmlData);
            Console.WriteLine("Length: {0}. Hashcode: {1}", xmlData.Length, hashcode);
            maxLength++;
            xmlData = new string('X', maxLength);
            hashcode = comparer.GetHashCode(xmlData);
            Console.WriteLine("Length: {0}. Hashcode: {1}", xmlData.Length, hashcode);
        }
    }
}
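The search the repro performs can be looked at separately from the .NET call it probes. Below is a small Python sketch of the same digit-by-digit idea, with the probe replaced by an arbitrary predicate; the function and parameter names are made up, and it assumes that if the probe succeeds for a value it also succeeds for every smaller value.

```python
def find_max_passing_value(succeeds, digit_count=8):
    """Find the largest value with at most digit_count digits for which succeeds(value) holds.

    Works from the most significant digit down: per position, try 9..0 and keep
    the first (largest) digit for which the probe succeeds. That costs at most
    10 * digit_count probes instead of one probe per candidate value.
    """
    digits = [0] * digit_count
    for position in range(digit_count):
        for candidate in range(9, -1, -1):
            digits[position] = candidate
            # Positions after this one are still 0, as in the original repro.
            value = int("".join(str(d) for d in digits))
            if succeeds(value):
                break
    return int("".join(str(d) for d in digits))
```

In the repro above the predicate is 'GetHashCode on a string of this length doesn't throw'; here any monotone check can be plugged in, for example find_max_passing_value(lambda n: n <= 31415926).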
Update: Connect issue.
Devnology, the Dutch developer community which isn't tied to one specific platform, has now put its 3rd podcast online, and it consists entirely of an interview with yours truly! The podcast runs for about an hour.
Below I've linked a short video which demonstrates, among other things, the Quick Model feature of LLBLGen Pro v3.0. Quick Model is a feature which allows the user to specify model elements very quickly using a simple command input system combined with a visual model viewer. The scenario when this feature is ideal is when you're interviewing a Domain expert and you want to store the information you gather in a re-usable way. This feature allows you to do that in such a way that the model is immediately presented to you and the Domain expert (so s/he immediately sees if it's correct or not). Another advantage is that the model is already in your project, so if a developer has to continue with the project, you don't need a translation phase and you don't have to discuss which entities were determined during the interview, they're already in the model. All you need is a little fine tuning perhaps, using the other editors in the LLBLGen Pro designer. As the Quick Model feature is ... quick, you can type while discussing / interviewing, so the interview isn't stalled by you having to perform slow toolbox-jedi-tricks or other slow modeling wizardry.
The video creates a simple model from scratch and maps it to Oracle, creating meta-data from the model. If you pay attention at the end of the video, you will notice that in the end, not all meta-data is created (e.g. no FK constraints and no PK constraint). This is because I forgot to press the Validation button in the toolbar during the video recording. The validation process has two steps: one validates the model and mappings and stops there, and a second step (if executed) also modifies the meta-data according to the model mapped onto it. This is a separate process as you can't keep a model in sync with meta-data in all possible model edit steps, for example foreign key constraints in the meta-data can only be created after all the required elements are available (relationship, PK fields, everything is valid).
As I'm not a video recording expert I made a mistake and accidentally deleted a frame midway, so you'll notice an error description in the log pane, while the statement causing it isn't in the video as it was on the frame I removed (as DemoBuilder recorded it a bit clumsily. Oh well..).
Please click on the screenshot below or click this link.
LLBLGen Pro v3.0 is scheduled to go beta in January 2010 and will support at RTM the O/R mapper frameworks: LLBLGen Pro runtime library, NHibernate 2.x, Entity Framework 1 & 4 and Linq to Sql, and more frameworks scheduled after that.
After I graduated from the HIO Enschede (B.Sc level) in '94 I have worked with a lot of different platforms and environments: from 4GL's like System Builder, uniVerse and Magic to C++ on AIX to Java to Perl on Linux to C# on .NET. All these platforms and environments had one thing in common: their creators were convinced their platform was the best and greatest and easiest to write software with. To some extent, each and every one of them was a decent platform and it was perfectly possible to write software with them, though I'll leave the classification whether they were / are the greatest and easiest to the reader. I'll try to make clear below why this dull intro is important.
Yesterday I watched the live stream of the PDC '09 keynote and in general it made me feel uncomfortable but I couldn't really figure out why. This morning I realized what it was and I'll try to explain it in this blog.
Cloudy skies
If one word was used more often than anything else in the keynote it was likely the word 'cloud'. Cloud, cloud, cloud, azure, cloud, cloud, azure, cloud, azure... and so on. Perhaps it's the weather in Seattle which made Microsoft fall so in love with clouds, I don't know, but all this cloud-love made me a little uneasy. This morning I woke up and realized why: it's too foggy. You see, the whole time I was watching the keynote, I had the idea I was watching the keynote of some conference about some science I have no knowledge about whatsoever.
"Cool, another guy talking about azure clouds with yet another set of fancy UIs I've never seen, giving me the feeling that not using those is equal to 'doing it wrong', but what the heck azure clouds are and what problem they're solving is beyond me". That kind of thing.
A long line of people was summoned on stage to tell something about some great tool / framework / idea / wizardry related to clouds, and with every person I lost more of my grip on what problem they all wanted to solve. All I saw was a long line of examples of Yet Another Platform with its own set of maintenance characteristics, maintenance UIs, maintenance overhead and thus maintenance nightmares.
More UIs, more aspects about things which were apparently new to software engineering yet utterly essential to writing good software... more UIs I've never seen before, more cloudy weather, more azure flavors, more UIs I've never seen, more...
"Aaaaarrrgg!"
As I've tried to explain in the first paragraph, I've been around the block a couple of times. I have lived through internet bubbles, read McNealy's 'The Network is the computer' articles / propaganda, shook my head when I heard about Ellison's Java client desktop idea, waded through the seas of SOA and SOA-related hype material, so I have a bit of an idea what "Big computer with software somewhere + you" means. In this 'modern age' it's dubbed 'Cloud computing', though to me it looks like the same old idea that has been presented by various people in the past but with new labels. With all the platforms presented in the past, there was really one issue: what was the problem they all tried to solve? Why would one want to use it? With Cloud computing, that same old question hasn't been answered.
"I built it, you run it"
One aspect all these 'big computer with software + you' systems tried to sell was that they could run the software you wrote for you and you didn't have to worry about a thing. Well, you had less to worry about, but you still had to worry about things, as the system was still Yet Another Platform with its own set of characteristics, flaws and weaknesses and most importantly: differences with the development and test environments the software was written with.
The problem with software once it is written, tested and ready for deployment is that last stage: will it run in the environment on-site the way it runs locally in the test environment? And is that on-site environment easy to maintain?
In other words: the problem is that the environment the software has to run in isn't necessarily the same as the environment the software was written with / tested in, which could cause a lot of problems during deployment and after deployment. Other aspects like updating the environment due to security flaws, bugs in software etc. are also factors which add to the overall unpleasant experience of deploying and keeping software running.
So the answer to that problem should be a system which provides the following things:
- The environment equal to the one the software was written and tested with.
- The resources to keep the software running when the software requires them.
- The security that the software keeps running, no matter what.
In other words: the software engineers built the software, tested it and defined the environment (as they've done that for development and testing anyway) and shipped that in one package, and at the place where the software has to run, that exact same environment is provided, together with the resources required (like memory, cpu, a database connection). So "I built it, you run it". How the environment is re-created isn't important, the important thing is that the exact same environment is provided to the software, 24/7.
Are EC2, Azure and other cloudware solving the problem?
No. They provide Yet Another Platform but not the same environment. As they're yet another platform, you have to develop for that platform. The most typical example of that is that 'AppFabric', the newly announced application server from Microsoft, has two flavors: one for Windows and one for Azure. Why would anyone care? Isn't it totally irrelevant for a system in the 'cloud' what software (or what hardware) it is running? All that matters is that it can provide the environment the developer asked for so the developer knows the software will run the way it was intended.
Let's look at a typical example: a website of some company with a small database to serve the pages, a small forum and some other data-driven elements, not really complex. Today, this company has to rent some webspace somewhere, database space, bandwidth and most importantly: uptime. To make the web application run online, it has to match the rules set by the hosting environment. If that's a dedicated system, someone has to make sure the system contains all software the web application depends on, and that the system is secure and stays that way. If it's a shared hosting environment, the web application has to obey the ISP's rules for hosted web applications, e.g. it can use 100MB memory max., can't recycle more than 2 times in an hour etc.
When Patch Tuesday arrives, and the web application runs on a dedicated server (be it a VM or dedicated hardware, doesn't matter), someone has to make sure that the necessary patches are installed, and that those patches don't break the application. Backups have to be made so if disaster happens, things can be restored. These all count as 'uptime' costs.
With a VM somewhere on a big machine this doesn't change: you still have to make sure the VM offers the environment the application asks for. You still have to patch the OS if a patch for it is released, and you still have to babysit the environment the application runs in or hire someone to do that for you. Either way, it always involves manual labor to make sure the environment online is equal to the environment during development and testing.
In the whole keynote I didn't hear a single argument how Microsoft Azure is doing this differently. Sure, I can upload some application to some server and it is run. However, it doesn't run in the environment I ask for, but inside the environment Azure offers. That's a different thing, because it requires the developer to write software with Azure in mind. If I have a .NET web application running on a dedicated server which uses Oracle 10g R2 as its database and I want to 'cloudify' that web application with Azure, I can't, because I have to make all kinds of modifications: for example, I have to drop the Oracle database for something else and also make other changes, as the environment provided by Azure isn't the same as the one locally.
EC2 and other cloudware do the same thing: they all provide 'an' environment with a set of characteristics, but not your environment. So in other words, they're not solving the problem, they only add another platform to choose from when writing software. As if we didn't have enough of those already. Sure, they offer some room for scaling when it comes to resources, but what happens when the image has to reboot due to a security fix that has been installed? Is the application automatically moved to another OS instance? Without loss of any in-memory data, so it looks like the application just ran along fine without any hiccup?
So what's the solution? What should Cloud computing be all about instead?
It should be about environment virtualization. I give you a myapp.zip and an environment.config and you run it. And keep running it. All software dependencies of my application, like 3rd party libraries, are enclosed in the application's image. That's not an image of an OS with the app installed, it's just the application. The environment.config file is a file which contains the description of the environment that the software wants, e.g. .NET 3.5 SP1, Oracle 10g R2 database, 2GB RAM minimum, IIS7, domain name example.com registered to the app, folder structure etc. etc. So I outsource any babysitting of the environment of my application.
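To make the "I give you an environment description, you provide exactly that" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Environment` type, its field names and the `unmet_requirements` function are hypothetical, and no real cloud platform is claimed to work this way. It only shows the check a host would have to pass before it could honestly say "you built it, I run it".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """What an application asks for, or what a host is able to offer (hypothetical)."""
    runtime: str
    database: str
    web_server: str
    ram_gb: int


def unmet_requirements(requested: Environment, offered: Environment) -> list:
    """Return the reasons why the offered host can't run the application as-is."""
    problems = []
    # Software components have to match exactly: 'a' database isn't 'my' database.
    for name in ("runtime", "database", "web_server"):
        want, have = getattr(requested, name), getattr(offered, name)
        if want != have:
            problems.append(f"{name}: wanted {want!r}, host offers {have!r}")
    # Resources are a minimum: the host may offer more, but not less.
    if offered.ram_gb < requested.ram_gb:
        problems.append(
            f"ram_gb: wanted at least {requested.ram_gb}, host offers {offered.ram_gb}"
        )
    return problems


# The environment from the example in the text.
requested = Environment(".NET 3.5 SP1", "Oracle 10g R2", "IIS7", ram_gb=2)

# A made-up host which offers 'an' environment, but with a different database.
host = Environment(".NET 3.5 SP1", "some other database", "IIS7", ram_gb=4)

for problem in unmet_requirements(requested, host):
    print(problem)
```

An empty list means the host offers the environment that was asked for. Anything else means the developer has to change the application to fit the platform, which is precisely the problem described above.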
That is incredibly complex. It might not even be doable. But it's the only way to make cloud computing something other than a new name for an old idea, despite the long list of well-known names who showed an even longer list of UIs and tools during a keynote.
Can Azure do what I described above? I honestly haven't the faintest idea, even after watching the keynote yesterday and reading up on some of the marketing material. That doesn't give me confidence, as it's in general not a good sign if a vendor has a hard time explaining what problem a product solves.
I created a small video (flash movie) of a neat feature of the upcoming LLBLGen Pro v3.0 designer: creating a typed list definition from search results obtained in the designer by running a custom piece of code (C#, with Linq to objects. VB.NET is also supported)! So any query you want to run on the model meta-data is allowed.
Please click on the screenshot below to open the page with the video. You need flash to play the video. No sound included.
LLBLGen Pro v3.0 is scheduled to go beta at the end of 2009 and will support the LLBLGen Pro runtime framework, Entity Framework, Linq to Sql and NHibernate.
The video opens in a new window. Update: uploaded a better HTML file, so the video is no longer resized improperly.
Today it's exactly 6 years ago that we released the first version of LLBLGen Pro, v1.0.2003.1, after a development period of roughly 9 months (Sunday September 7th 2003, late in the evening). It was a big gamble: would it succeed or fail? We got our first customer within 9 minutes after release and we then knew it would be a success. And it still is, with thousands of companies using it world-wide, from small mom & pop shops to the biggest banks on the planet. Honestly, we hoped for success, but that it took off this big was beyond our expectations. A big thank you! to all of our loyal customers who trusted our work in the past 6 years and who keep trusting it.
Needless to say, we're still going strong and are looking forward to v3.0 which is scheduled to go beta at the end of the year. It will actually be our 10th major version (1.0.2003.1, 1.0.2003.2, 1.0.2003.3, 1.0.2004.1, 1.0.2004.2, 1.0.2005.1, 2.0, 2.5, 2.6) since the initial release, and will be the first release which will support other frameworks besides our own runtime framework and will also add another major new approach: model first.
Looking back at those 6 years, I think the biggest asset we deliver is quality you can count on. From the get-go we strove for that, with top-notch support which is free and bug-fixes which are usually delivered within 24 hours. A data-access technology isn't something you just pick out of a pool of tools: it has to fit the way you want to write software and work with data and what you want to do in your application, and above all it has to be rock-solid, so you don't run into surprises, an unexpected lack of support for common features or a wall of disbelief when you ask for help or support or a bugfix. So in other words, a data-access technology is one of the pillars your software has to count on. From the start we realized this, and with every feature we added we made sure that our customers could indeed count on our work and the quality we deliver.
During these 6 years, we worked full time on implementing more features, like a new paradigm (Adapter), support for more databases, multiple ways to do inheritance, more powerful code generator engines, template editor, linq provider etc. and it was and still is simply great working on this every day. On to the next 6 years! 
LLBLGen Pro works with SQL Azure, that is, the generated code and the runtime library. There are a couple of things you should be aware of, and I'll list them briefly below. The thing which doesn't work is creating a project from a SQL Azure database, as SQL Azure has no meta-data tables publicly available to the connected user (also a reason why for example SQL Server Management Studio doesn't work with SQL Azure at the moment).
The things to be aware of when you want to work with SQL Azure and LLBLGen Pro are the following:
- SQL Azure doesn't support catalog names in the queries. Although LLBLGen Pro supports multiple catalogs per project, and thus cross-catalog queries, with SQL Azure you can only use one catalog in your project.
- To avoid catalog names in the queries, you should use the feature called 'Catalog Name Overwriting', which simply means that you configure the runtime to use a different string than the catalog name. You should configure the runtime to overwrite the catalog name of your project to "", so the catalog name is not emitted into the SQL query.
- Our tests and those performed by some of our customers showed that if you use a schema which isn't the default schema, it also seems to make SQL Azure throw errors. So to be safe, either use 'dbo' as the schema, or if you must: define the used schema as the default schema of your user using:
ALTER USER username WITH DEFAULT_SCHEMA = schemaname
That's it. If you make sure of that, which is just a couple of simple things to check, you can use LLBLGen Pro generated code on SQL Azure. Happy azuring!
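To illustrate what overwriting the catalog name with "" does to the emitted SQL, here is a small sketch in Python. It is not LLBLGen Pro's API or its actual implementation: the `qualified_name` function, its parameters and the 'Northwind' catalog name are all made up for this example. It only shows the effect described in the list above: an overwrite with an empty string makes the catalog part disappear from the object name.

```python
def qualified_name(catalog, schema, table, catalog_overwrites=None):
    """Build a bracketed SQL object name, applying catalog name overwrites.

    catalog_overwrites maps a catalog name known at design time to the string
    to emit instead. Overwriting with "" drops the catalog part entirely.
    (Hypothetical helper, for illustration only.)
    """
    overwrites = catalog_overwrites or {}
    # Use the overwrite if one is configured, otherwise keep the original name.
    catalog = overwrites.get(catalog, catalog)
    # Empty parts are skipped, so an empty catalog never reaches the query.
    parts = [part for part in (catalog, schema, table) if part]
    return ".".join(f"[{part}]" for part in parts)


# Without overwriting: the catalog name ends up in the query.
print(qualified_name("Northwind", "dbo", "Customers"))
# prints: [Northwind].[dbo].[Customers]

# With the catalog overwritten to "": only schema and table are emitted.
print(qualified_name("Northwind", "dbo", "Customers", {"Northwind": ""}))
# prints: [dbo].[Customers]
```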
Direct profile url: http://twitter.com/FransBouma
I don't promise to follow everybody, but for the few people who want to follow what I have to say, I'll try to use it for shorter blurbs than this blog, as my blog posts here seem to be pretty big (and time-consuming to write) overall.