Soft-deletes are bad, m'kay?

Have you ever ran into database tables with a field which is used to mark if a row has been 'deleted' ? Probably. These fields are used to implement 'soft-deletes'. For the soft-delete impaired, a quick introduction. Soft-deletes are row deletes which are not really happening: instead of removing the row from the database table, a field in the row is set to a value which signals that the row is officially 'deleted'. Every SELECT statement on that table is then filtering on that field's value so only rows which aren't marked as 'deleted' are returned (as the deleted data is not there anymore, semantically).

If this sounds rather awkward, it is. However, there are people who insist in having soft-deletes instead of real deletes, because it allows them to go back in time, to look back at the data that was deleted, as all data is, well... , still there. A small group of those people even believes that soft-deletes allows them to roll-back to deleted data, a kind of 'undo' facility.

The truth is, soft-deletes using status fields in rows is a bad solution to the real problem. Fortunately there are alternatives.

First let's have to look at why people would want soft-deletes. In general there are two reasons, which are already mentioned above: to be able to look at deleted data and to be able to roll-back to deleted data. Let's discuss the second reason first: rolling back to deleted data.

Roll-back to deleted data is hard.

Let's use Northwind as our example database. Let's say we don't delete rows from that database, but flag them as 'deleted' using a new field, IsDeleted (bit), added to every table. If you want to roll-back a deleted Order row, it looks as simple as setting the 'IsDeleted' field to 0, right? Though, what if that Order row refers to a deleted Customer row using its CustomerID foreign key? For the RDBMS, it's OK, as the 'deleted' Customer row is still in the Customers table, it just has its IsDeleted field set to 1. However, executing a SELECT statement which fetches the just recovered Order row with its Customer row will run into a problem: the Customer isn't technically there: the mandatory IsDeleted filter prohibits that the Customer row is showing up. The only solution to this is to also recover the deleted Customer row. Order might have had OrderDetail rows as well, which requires the OrderDetail rows to be recovered as well.

In short, recovering a row is not what's this is all about, it's about recovering a graph. Recovering graphs instead of table rows is much more complicated, due to the dependencies between the involved entity instances (rows). Writing roll-back code for this is therefore likely a complex affair, especially if you want to make sure the data-integrity of the working set is still 100% correct (so all rows involved in the recovered graph indeed have their IsDeleted flag set to 0 and are part of the working set). In the end you'll run into issues where rows have to be merged, similar to source-control systems (e.g. in the situation where a row becomes deleted several times in different graphs). So rolling back graphs is not likely going to be implemented in the average system, it's therefore not the main reason for soft-deletes.

On to the first reason, looking back at old data.

Old data is old for a reason.

When a system deletes rows from a table, it's hopefully done inside a transaction, so when something goes wrong, the delete is 'undone'. When the transaction completes, the delete is final, and the data is gone. If you ever would want to look at the old data (the data you just deleted) again, you can't, it's gone. Let's ignore that some people can't throw anything away ("I might want to look at it in 2 years, then what!? <wild panic>") and focus on what 'delete' actually means: it means that the data is considered 'no longer needed' and therefore removed from the working set. If the data was necessary after the delete, don't execute the delete in the first place, it's the same with your file system really.

However, when 'data is no longer needed' doesn't mean it is totally worthless in all situations: there's a situation where old data could be useful: for reports on history of an entity, when happened what, etc.. Though when will those reports be ran? Every 5 minutes? Will the average user of the system look at historic data all day long or work with the actual working set? The answer to these questions is likely the same: functionality consuming historic (deleted) data is rarely used compared to the functionality consuming working set (not deleted) data.

Let it be clear that it sometimes is required for legal reasons not to toss away data, however in other situations the same requirement actually lead to different solutions: old email is archived in archives and not kept in the in-box. Is that solution useful for this situation as well?

Implications of using soft-deletes

So the main reasons why some people want this is clear, however what are the implications when soft-deletes are used? Below I've mentioned a couple, though I'm sure there are more, though I think the list below is already convincing enough to look at another solution instead.

  • Queries become very complex. Make no mistake, once you introduce a field to signal if a row has been deleted, you have to make sure every table reference is accompanied by a filter on that IsDeleted status field. Every table reference. If you forget one, it's over: your data is then officially not correct. Maintainability will become more cumbersome, and as most time on a software project is spend on maintenance, it's something which will hurt the project, hard.
  • Queries become slower over time. Every 'deleted' row is still there, and for the database system the row is just an ordinary row like any other. This means that DML operations on the rows will become slower, but especially SELECT statements will become slower over time: the percentage of rows which are 'live' of the total number of rows in the table is getting smaller, as more and more rows will become 'deleted'. This baggage could hurt in the long run, especially in tables with a lot of inserts/deletes: the working set might stay the same (e.g. 10K rows) but the total set of rows grows every day so the total number of rows might be millions. When you have millions of rows in the table while the actual set of rows which are 'not deleted' is a small percentage of that, it will influence the performance of queries dramatically. Compare that to a table with the actual set of rows you've to work with.
  • Using constraints (UC, FK) is impossible. Using a unique constraint (UC) is not really possible, as the RDBMS will take into account the rows which are 'deleted' as well. So a value might be unique for the rows in the working set, but for the total set of rows in the table it doesn't have to be unique and the update or insert fails. In our example above with Customer and Order, we've seen that foreign key constraints are not really working anymore either, as they don't protect us from 'deleting' a PK side while keeping the FK side: for the RDBMS, both rows are still there and everything is fine (which is not the case)

A better solution to these requirements

There's a better solution, and I've already mentioned it briefly: archiving. RDBMS's (I assume you're using a proper, professional, solid ACID compliant database, not some toy RDBMS) usually sport a system called triggers. Triggers are neat things: they get called when something happens. You can compare them to event handlers really: an event occurs (e.g. a row gets deleted from a table) and the trigger responsible for handling that event is called. Additionally, the trigger is called in the same transaction as the code which triggered the trigger. So all actions taken inside a trigger are rolled back when the transaction containing the code which triggered the trigger is rolled back.

If you look closely at the two main reasons for soft-deletes, you'll recognize that both can be satisfied with simply keeping the data around which is deleted by DELETE statements: looking back at old, deleted data is possible wherever the old data is located and rolling back deleted data is not going to be less complex when the data is located elsewhere, as the complexity is in the graph rollback mechanism, not the location of the data to rollback to.

This leads to the solution of an archiving database. An archiving database is a catalog (or schema if you wish) which contains the same table definitions as the real database, with perhaps no UC or FK constraints, as data integrity is implied by the data integrity of the source data (after all, it's just for archiving data consistent data). Every table of which you want to keep deleted data around in the real database now gets a DELETE trigger which simply grabs the row(s) deleted and inserts them in the same table in the archiving database. If the transaction fails, the inserted rows in the archiving database roll back too, if the transaction succeeds, the data is successfully archived and still available. Additionally, you could add date/time fields to the rows in the archiving database to store exact dates and times when the row was deleted, the trigger can insert these values when the deleted row is inserted.

This makes sure the data is still available, so the reasons why people want this are still met, though it doesn't pollute the working set for the application anymore, and the implications of soft-deletes are gone. The only thing you've to be sure of is that the triggers and the archive database are maintained together with the real database (so schema changes in the real database are applied to the archive database as well, or you could go overboard and add a new archive database!). However, that's a small price to pay compared to the overly complicated queries one has to write (even with O/R mappers) and work with, queries which also have to be maintained and documented for the length the application is in use. Using the triggers and the archive database, the application can be written normally, can be tested normally, and no data is thrown away. Ever. One could extend this system with an UPDATE trigger as well, so updates are also tracked, so value deletes on the field level could be tracked as well.

So do yourself a favor, next time someone tells you to use soft-deletes, discuss the implications and offer this alternative solution. Everyone will be better off: you, the customer, and the group of people who will maintain the system for the next 20 years.

Published Thursday, February 19, 2009 11:09 AM by FransBouma

Comments

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 6:51 AM by Omer van Kloeten

"Queries become very complex" - Use views instead of tables when selecting.

"Queries become slower over time" - Use filtered indexes, instead of the regular indexes.

"Using constraints (UC, FK) is impossible" - Unless you don't overwrite keys and use constraints correctly.

I agree that soft deletes are not to be taken lightly, but discarding the idea entirely is not the right thing to do.

Also, every DBA I have come across in my entire professional career (be it Oracle, SQL Server, etc.) has discouraged using Triggers entirely. Some even go so far as to discourage DELETEs and UPDATEs on variable length fields.

Best approach, IMHO, is to use soft-deletes, but have a batch maintenance operation every once in a while that moves soft-deleted data to an archive database, deletes it from the origin and then optimizes the remaining data.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 7:11 AM by FransBouma

@Omer

"Use views instead of tables when selecting."

It's a way to deal with the extra complexity, though this too can get more complex over time (where views on the tables now should use the views as well)

"Use filtered indexes, instead of the regular indexes."

Not everything is done through indexes. A table scan is going to be slower. And updating indexes is also slower over time. It's the same difference as with a table with 1K rows vs. a table with 1million rows. Also not all databases support filtered indexes (sqlserver 2005 for example doesnt)

"Unless you don't overwrite keys and use constraints correctly."

No, that's not going to help. See my example, where the FK should be raising an error but doesn't. Also, re-using a UC value can't be done, as the row still occupying the unique value is still there (but not in the working set).

"I agree that soft deletes are not to be taken lightly, but discarding the idea entirely is not the right thing to do."

I didn't discard it, I proposed an alternative solution to the problem

"Also, every DBA I have come across in my entire professional career (be it Oracle, SQL Server, etc.) has discouraged using Triggers entirely. Some even go so far as to discourage DELETEs and UPDATEs on variable length fields."

erm... so creating views for every table, having to deal with non-working FK's etc. is something which is not discouraged, but triggers are? Isn't this just about not willing to use triggers because they seem complicated? Yes, you have to keep track of them, but you also have to do that with your views. Discouraging triggers is often done for the wrong reasons.

Your approach with batch moves is similar to my archive db proposal, however you still have to deal with all the complexity of having ALL data in the same table AND you still have to archive, while the trigger approach doesn't have to deal with any complexity AT ALL. Big difference.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 7:21 AM by Omer van Kloeten

@Frans,

I agree that this isn't the best solution, however you should note that not all soft-deletes are the same. Some require that the foreign keys and constraints remain alive, etc.

As far as I understood from said DBAs, triggers are discouraged because they cause a heavy workload when doing large DML queries. Not sure exactly, but after hearing so many different, independent professional sources say the exact same thing - I don't think I'll doubt their wisdom.

It all comes down to what your constraints are as a developer - do you have to make sure your solution scales well? are you under pressing time constraints to complete development? do you need to keep the solution as simple as possible? are your tools even able to include triggers, filtered indexes or even views?

Some of these points may conflict with each other.

My point is simply that there's no silver bullet to anything and that includes soft deletes.

(btw, my solution included hard-deleting the data after archiving)

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 8:34 AM by sake

I really enjoy this blog. All articles are so insightful.

Hope Frans could update it more often.

Thanks

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 8:46 AM by Joel

I agree that soft deletes are bad.  I once worked on a project where every table in the db had an is_deleted field.  We had a garbage collection process which every night was to go out and hard delete any soft deleted data that was over two weeks old.  It was a very complicated, resource intense process that took a lot of time to test and develop.  

I would suggest using delete replication as opposed to triggers to move off the deleted data to a history database.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 8:51 AM by Michael Dorfman

Actually, Frans, I think it is worse than you point out.  To use soft deletes to "go back in time" would require a datestamp in addition to the "delete flag", and that would imply a similar datestamp to capture alterations of data, and now we're suddenly talking about a different database schema altogether.  In other words, the "delete flag" is a partial and insufficient solution to the problem at hand...

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 10:13 AM by Jeff

I'm a big fan of your blog, Frans, but I really think that taking an absolute and dogmatic approach to this is not the best way to go. I use soft deletes in my forum app because in the context of watching moderation, it makes total sense. I allow edits to be physically deleted, and logged on a different table, but my approach in this case is far more simple in terms of querying than anything you've suggested.

I might conceded that it's a good guideline not to use soft deletes, but it's not a golden rule by any stretch.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 10:37 AM by FransBouma

@Joel: what exactly do you mean with Delete Replication?

@Michael: good point!

@Jeff: Isn't that a different thing? Making things not visible or accessable to some people isn't the same thing as removing it from the working set altogether, as that semantically means that it IS gone, the data isn't there. You could roll back to a previous state, but that implies more work than simply switching a flag.

I took the somewhat dogmatic stance because it's a recurring question for our o/r mapper: how can we implement soft-deletes. If you take into account that derived tables could pop up, inheritance can be involved spanning multiple tables, you're in for a complex problem which doesn't really have a good answer other than looking at the real problem and solve that problem differently.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 11:39 AM by zvolkov

Oh my god Frank thank you so much for this post. I've been thinking along the same lines all my life and now "an expert" has finally came up with solid analysis. Now I can quote you to my colleagues and bosses for years to come!

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 12:21 PM by someone

I'm sorry but I'm a bit confused. How will the 2-databases approach work in this scenario:

Let's say we have a Customer info screen, which among other things contains a link to the original Employee that made the first contact with this customer.

Somewhere else we have another screen which displays a list of Employees in our company,

but since employees do not stay around forever, we only want to see the current "active" employees.

But when I request the employee info that is no longer "active" from another screen (like the Customer info screen), I expect to see his data.

What this means is I can't just permanently remove an Employee row from the db once he has left our company. I have to mark him "inactive" so that other screens can continue to refer to his info even after he leaves. Otherwise I would have to join from both the "Active db data" AND the "Archive db" data for the employees.

In this case I don't see how I can get rid of the "is_active" column.

(and there are many other examples that come in mind too. Columns like "created_by" pointing to a user in the system that is no longer "around" but still we have to keep it there as inactive user for reporting purposes)

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 12:50 PM by Jeff

It is a semantically different thing, you bet, but your position seemed to be that there weren't any variations. My point is simply that there is no one-size-fits-all solution when thinking about historic entities and whatever scope that may include.

But then, that's the thing about O/R mapping, right? It is intended to fit as many feet as possible.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 12:56 PM by WoofWoof

Jeff is touching on the right point here; "soft deletes" aren't well defined.  Does it make a difference whether you call a column IsDeleted, IsActive or IsVisible?  "Delete" has a database meaning, but it also has a business meaning, and the two aren't always identical, that's often why people use "soft deletes".

But there's no solid wall between the two, it's a sliding scale.  Frans, even in your example, IsDeleted really  meant "only show it to me in Admin mode so I can undelete it", which is almost exactly what it meant in Jeff's case.  So what's the difference?

Like most things, it's a trade-off.  For example, you often can't delete employee records because the record references other tables even after the employee leaves, but you do want to mark them inactive.  And, yes, that means more work down the road since most of your employee queries are going to have to include "WHERE IsActive=True".  

Sure, in some cases people use soft deletes when they should really just delete the data.  But in most cases, people use soft delete because Delete doesn't really fit the business model, it's really Inactivate or Hide.  But to suggest, as I believe you are, that when it's a Bad Thing we'll call it "soft delete" but when it's OK we'll call it something else doesn't really address the issue.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 1:03 PM by Daniel

We have a client that has totally abused the approach you recommend.  They mandate every table in a database has a corresponding <tablename>_history and <tablename>_changes table.  These are populated by triggers on the main table, but the triggers require a username and other information to be passed as well, which means every insert,update,delete must also be pass additional parameters. This amounts to hidden, confusing business logic in the database. Frequently, their applications need to query across the tables, meaning UC and FK constraints are still not possible.

While IsDeleted is not the proper solution, it _does_ indicate is that the developers are not thinking through the domain requirements and are being too data-centric.  In your Customer/Order example, the requirement is typically not that a customer needs to be deleted, it's that Joe in accounting doesn't want to see inactive customers on some report.  Bob in Sales wants to see the inactive customers to try to make them active again. The "working set" to Bob is different from what it is to Joe.  Thus, the developer needs to define an "inactive" customer, and incorporate that into the database and into his business logic and reporting.  He _may_ do that by deleting from the "main" customer table and moving related data to a reporting database for sales, but I'm not sure hiding that business logic in triggers is a good idea.  I'd much rather see it inside a customer.Inactivate() method.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 2:10 PM by JW

K, While this all sounds great...  In practice, soft deletes are working fine for us.  

The majority of the tables in our CRM system flag records as soft deleted, and there are even audit log tables to track every touch too.  The main 3 tables total over 100 million rows, and we do not suffer from the issues you have mentioned.  We actually enjoy having records soft deleted.  Think of it like buying insurance (awesome when you need it..otherwise useless).

As you work in larger environments with more complicated requirements, EVERY solution will be larger and more complicated.  Complex queries....all day...real world is not all northwind and hello world.  Slow queries over time....only if you are doin' it wrong.  Contraints...c'mon (can't constraints be built to look at both a primary key and a record status [ours are]).

The other comments are right when they are telling you that individual environments and projects will determine if a soft delete is required or not.  Not just because you "think" it is a bad idea.

For me, we'll stick with the soft deletes.  They work great, and keep 1000s of users from multiple IUs happy at the same time.

Good luck

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 2:39 PM by Ryan Ternier

soft-deletes are key. I work with an online reporting application that deals with anything from traffic collisions to citations.

Reports can span 30+ tables, so with your method, deleting 1 report will cause us to delete records from 30+ tables.

Rather, we could always just change the DateDeleted field on our Row to the current date - and thus deleted.

Data Retreival, from any other source takes too long and isn't the best solution. If a customer deletes something that took them 3 days to do, and needs it back asap, you don't want someone to spend 30 minutes to get it back.

As far as complexity, get a wiki or hire some programmres that know how to execute proper SQL.

However, this is always a discussion to talk about when you're designing your system. I for one, vote for soft deletes.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 3:33 PM by Remco Ros

@Frans

How would you handle the case when only one entity in a graph needs to be deleted.

ie:

- a user posts an image.

- images gets displayed including the users display name

- user removes his/her account

how do you get the users' display name if the user is in an archived database?

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 5:03 PM by nes

But what if you have a reference table you still want to be able to query but not be allowed to use it for new entries. We had a doctor's scheduling application and when a doctor retired we would move him to a historical table with all his appointments. All good until somebody managed to enter a new doctor with the same PK of the retired one and all our historical appointment data went bad. We ended up keeping all doctors and flagging them as "can't assign appointment to him" so nobody would create twin doctors.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 6:38 PM by Richard Wright

I used to use soft deletes in most of my apps, mostly to assist in rolling back. I gave it up because, as you pointed out, you end up recovering a graph which quickly became quite complex. I also found I was often writing a wizard (I work almost exclusively in winforms) for when the user wanted to purge the deleted data.

It quite quickly got out of hand and I stopped using soft deletes.I occasionally use an inactive field, but this serves a different purpose.

I hadn't thought of using triggers, and I will give it a go.

Thanks for the post.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 8:03 PM by Brian

I worked on a project where soft deletes were required because of sox compliance, so there are instances where it would/could be required.

# re: Soft-deletes are bad, m'kay?

Thursday, February 19, 2009 10:50 PM by JohnMarsing

Interesting blog.  I thought I would share my database design regarding archiving.  Last year I created a registration database that I use to manage an annual week long camping event.  

Here is a list of the relevant tables.

Registration:   contains the data on the family

Person:         contains data on the each member of the family

Lodging:        contains data on where the family is staying, families can have more than one lodging place

PersonActivity: Splits the M:M relationship between the Person and the Activity Table

Activity:       More or less a simple look table indicating the activities that people want to participate in.

There are other tables but these are the interesting ones regarding this subject matter.

At the end of the event I wanted to clear out the data to make room for this years event.

As the database evolved from 2008 to 2009 I realized that I needed to remove the 2008 data so I can start the process over again for 2009.  The reality of my database was that I expected a high percentage of users that attended in 2008 to also attend in 2009, so I needed to archive the data.  I did this by creating similar tables (like your blog suggests) for the above tables (prefaced with zHist_) and created a sproc that moves the 2008 data to these archived tables.  With these archived tables I can then ask users who is re-registering if they want me to pre-populate the data for 2009 with 2008 data.

With this evolution process, I added one thing to my Activity table by adding an ActiveFlag bit field to it.  This was because there was a lot of activities that was decided that nobody really cared about and we were going to make them unavailable for 2009.  I didn't want to delete them because I wanted to retain a FK to zHist_PersonActivity.

This seems like a reasonable thing to do, what do you think?  I suppose my example is a little bit different than yours (i.e. apples and oranges).  Maybe the distinction is that I'm talking about historically useful and valuable data where yours is about deleted data that, for now, it is presumed to be of no value.  This deleted data is still preserved however but out of the way an put into archived tables.

Second question, is my scenario easy to implement with LLBLGEN?

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 4:24 AM by FransBouma

I see some people have a bit of a problem with grasping what 'delete' means. Simple example:

instead of

DELETE FROM Table WHERE predicateX

you do

UPDATE Table SET IsDeleted=1 WHERE predicateX

and for selects you don't do:

SELECT ... FROM Table WHERE predicateX

but you do

SELECT ... FROM Table WHERE predicateX AND IsDeleted=0

It's not about different working sets for different users, it's about THE working set for the application. So it has nothing to do with role-based security or other subsets of data based on filters which are part of the application, it's about using an UPDATE statement instead of a DELETE statement, to paraphrase commenter JohnMarsing: "To keep around data which is presumed to be of no value". I.o.w.: your application would work the same (except perhaps the accidentally rollback) if you would execute DELETE instead of the UPDATE, because data of no value is not part of the application's domain, is of no value to any part of the application (except for backup reasons like "I want to look at his history records of 10 years ago") at the time it is considered deletable. That's what this post is about.

About triggers: It seems that triggers have an even worse rep than dynamic generated parameterized sql ;). Triggers aren't more complex to maintain, they are tied to the table they govern. If you have defined soft-deletes, they are part of your design, and thus documented, and thus it's known that an entity definition E's instances in table TE are archived through a trigger. Does that impact performance? Only on deletes, as the triggers are delete triggers. They aren't executed on inserts nor on updates. So they don't impact performance on regular database use, only on deletes they insert another row in a separate catalog/schema so a delete might take a tiny bit longer.

Triggers seem to be harder to maintain in this case, but are they really? Consider that the alternative is to apply WHERE IsDeleted = @param filters _everywhere_ in your queries. For the people who think this can be automated, that's not easy. The problem is for example derived tables, scalar queries in aggregates, complex joins... views might help, though a simple delete trigger which inserts a row into a different table is very simple, so it doesn't really get much simpler than that.

Some individual replies:

@Remco Ros: your post is an example of not understanding what DELETE means. DELETE means the entity instances are destroyed, removed from the _database_. If doing so for entity E means the application doesn't work anymore, E can't be deleted, so soft-deletes ALSO don't apply: remember an entity instance E with IsDeleted set to true semantically means the SAME as having E deleted with a DELETE FROM statement: the data isn't there. Gone.

So in your example, the user's name either is filled in with a placeholder (e.g. '[deleted]') or a user can't delete an account, only deactivate it. 'Deactivating' doesn't mean IsDeleted=true, everyone who thinks that doesn't understand what DELETE means (so please start reading this reply again). Deactivating means deactivating. Deactivated accounts might not show up in some sets inside the application. That's not the same as some rows with IsDeleted=true not showing up in sets in the application: the IsDeleted=true rows are semantically gone, they're not there, for the application: if I ask: get me all User rows, I also get the Deactivated. If I use soft-deletes on User, and I ask "get me all User rows" I get only the rows with IsDeleted=false. They're not part of the database the application works on.

@nes: History data is something which is a tough subject. For example, what if employee X first works for department A and then for department B? The history of X working for A is gone if you don't log the FK change. I.o.w.: If you look at history data for X, you have to know for which department X worked at time T (where T is some time ago). This could mean that history data is part of the domain, as in: the application takes care of the transformation of entity instances to history entity instances so it has meaning in the context of the history database. Here too graphs play a role, as described with the employee-department example I gave. In your case, the IDs for doctors are unique, through time. This means that a new ID has to be unique, as the history data depends on it. You could have chosen for unique IDs in the form of sequences / identity IDs or GUIDs. This leads to unique IDs without the necessity of keeping the doctor data around forever, as the scheduling data is likely moved to a different database altogether.

@JohnMarshing: if data is no longer part of the application's LIVE data but only lives in the history data, why keep it around in the LIVE database? The FK lives in the history data, not in the live data. It seems you want to keep the users around for easy re-registering, but want to move historic data to a separate database for history purposes. It's then IMHO a good idea to move 2008's data to a history database, and remove 2008 specific data from the live database. This is an example of having an application database with two parts: a live part and a historic part. The main idea is to keep them separated: history data is history data, not live data. So effectively your problem description is: I want to archive 2008's specific data to clean up the live database for 2009. Which could be done with a simple proc indeed. Your last paragraph sums it up greatly: history data which is valuable is something else than data which is considered of no use and thus is DELETEd. No-one deletes data which is of value for the application, as the data is still part of the domain of the application. Instead history data of value is archived. The post indeed was about solving the solution of keeping data which is considered worthless around for legal reasons or for the rare ability to look back at that data 'just in case', as a kind of backup. My proposal was to solve it the same way as with history data which is of value: through archiving. As it's normal data-access whatever data-access code you pick will be able to work with your description.

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 6:00 AM by Dawid Joubert

I think everyone in the discussion needs to think very clearly about when a row should is part of the working set or not.

On my software there are dependencies such that every row will probably be referenced somewhere else and for those purposes must stay in the working set.

That is why we offer a built-in archive feature into our website. So a user can cleanup his desktop without having to delete data.

The historic reason is more relevant than not... Even if a user wants to delete his account we need the account in the table with his name/username because without it due object relation mapping the messaging system won't be able to display his name.

Not everyone has views

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 1:26 PM by Chris Carter

We had different requirements than many people, but at my last company we used soft-deletes very successfully.  Here's what we did:

* Every table had a GUID surrogate primary key.

* We used a timestamp for our deleted flag, which defaulted to the max date

* We partitioned our tables based on that timestamp

* When we deleted an item, we appended the GUID PK to one of the "unique" fields if there was a UC.  Sure, we had to strip it out when displaying, but honestly, even in our app which had over 200 entity types, we only had to do that on about 30 of them.  Really, it was about an hour of coding to add the "strip" code in.

We didn't run into most of the problems you list above, didn't have the additional overhead of a trigger based archive system or an additional database to maintain, and performance didn't degrade over time thanks to the partitioning.  Sure, we had to make sure that our collection queries were filtering on that deleted field, but since we were using LLBLGen we just overwrote the generator templates to take care of that for us.

Will that work for everybody?  No.  Not everyone has access to partitioning or wants to write additional code.  But for us, the couple days invested in the infrastructure more than made up for the massive amount of time saved from reversable deletes as well as the in-place audit trail.

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 1:49 PM by BruceK

I see a third reason for 'soft delete'.  In our system we use soft deletes for performance.  The system is an OLTP system, but a single transaction will impact as many as 39 tables.  We made significant performance improvements by implementing a 'soft delete' based on user action, with the 'hard delete' accomplished by an off-hours scheduled job.

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 4:48 PM by dfguy

so you didn't post my last comment since i basically proved you wrong and called you a moron in the process. not to mention shattered your ego and your precious MVP. you such a bitch.

# re: Soft-deletes are bad, m'kay?

Friday, February 20, 2009 4:57 PM by FransBouma

@dfguy: I didn't post it because it was a useless remark, like your last one. I posted this one so people can see what some people apparently seem to find 'appropriate' these days to respond to a post about soft-deletes: they start to call names and flame without any reasoning. I mean... which sane person would start flaming like you did and be so utterly rude about a post about a solution to some problem related to databases/software.

# re: Soft-deletes are bad, m'kay?

Saturday, February 21, 2009 12:27 AM by tmorton

Triggers? No way, not in my house.  The hidden logic behind triggers typically remains elusive.  As such, I think they should be avoided.

# re: Soft-deletes are bad, m'kay?

Saturday, February 21, 2009 1:03 AM by Charles

Frans, I completely agree here.

@tmorton, someone has been propagandizing you. In most cases triggers are not a good idea. However, this is a very, very specific case, namely archiving deleted data. Once you know that archived data is archived via triggers, you know it. It's not hidden magic and voodoo. That's just propaganda. It is much preferable to complexity associated with so-called "soft-deletes."

# re: Soft-deletes are bad, m'kay?

Tuesday, February 24, 2009 9:37 AM by Greg

I have audit data that refers to user information. There is a foreign key from the audit data table back to the user table. The audit data would live on even if a user is deleted.

So, which do you prefer:

1. Use soft deletes on the user table

2. Remove the foreign key

Thanks.

Greg

# re: Soft-deletes are bad, m'kay?

Tuesday, February 24, 2009 10:03 AM by FransBouma

@Greg: If data is required at some point in the future, it's therefore not useless data, and therefore can't be deleted. That's the misunderstanding of many people reading this blogpost. If you need the user data to live on to be useful to the audit data, you can't delete the user data, and therefore have to archive it. it's equal to the situations where you have scheduling info for year X and a new year arrives.

In your particular case, the users aren't deleteable. Though you also can't use soft-deletes, as in: I will use a flag instead of DELETE FROM Users WHERE ...

The point is: soft-deletes are hidden, they hide information for you as if it's not there. So if I want to do:

SELECT * FROM Users;

with softdeletes, I have to add the predicate to filter on deleted rows, as these aren't there anymore (as they're deleted). However, inactive users instead of soft-deleted users, ARE showing up in that simple query. So an inactive/active flag on the users table allows you to filter on which users are active and which aren't, but the flag doesn't hide information from you in simple queries. So your domain logic has to be aware that it has to work with active users. That's a domain requirement and the filter therefore is provided by the domain logic.

Soft-deletes are filtered out at the database level, as the data is semantically not there. That's a difference: soft-deleted rows are hidden for domain logic.

# re: Soft-deletes are bad, m'kay?

Tuesday, February 24, 2009 2:09 PM by Robz

I see this as a very relevant discussion and almost feel bad for those not "mature" enough to handle it that way.

In the world of absolutes, those are usually a bad thing to use. Frans I applaud you taking a stance, although I am not 100% in agreement.

I am in agreement that you should always discuss the implications of a decision like this with everyone and help them understand (possibly through a short demonstration?) of exactly what they are buying off on.

I've been on both sides of this argument and for that I always say, it depends.

"Should" and "it depends" are perhaps the two best words in a developer's vocabulary. :D

# re: Soft-deletes are bad, m'kay?

Wednesday, February 25, 2009 1:44 AM by Dmitry

IMO, soft deletes are often a very good idea. Imagine an online store/club than has a yearly fee for its customer. A customer has an order history, frequent buyer discounts, item reviews, etc.

Now the economy is bad and the buyer decided not to renew the membership for this year or possibly forever. The business decided to keep all the inactive/expired customer data for 18 months before it gets deleted from the system by a scheduled script. The customer can renew the subscription at any day until that time.

How do you deal with a scenario like this? You obviously cannot just delete the data as soon as the membership expires. Archiving the data might not be a bad idea. But if the customer renews his fee online you would need to have a real-time process to move all the data back (customer, orders, ....) as a graph.

A soft delete field makes everything much easier especially if are using a customer as an aggregate with a repository responsible for the rest of the graph. Even if you have a moderate amount of aggregate roots, it is still easier to change a field value in each one than moving a bunch of rows related by PK/FK from a different database/table structure.

# re: Soft-deletes are bad, m'kay?

Thursday, February 26, 2009 10:12 PM by Alex Di Marco

Both soft deletes and Archiving are not an easy solution to implement and IMO one should not say (this is the only solution, avoid the other). They both generate implementation problems.

Soft deletes have complexity as Frans already pointed out when you have joins (or outer joins to make things worse) on records that could have been "soft deleted".

Archiving suffers of the same problems. How do I display (from the archived same schema) the name of a supplier if my "archived" purchase order references it by PK and name column has not been de-normalized? (being the PK a surrogate key).

Will I pull this from a copy of the supplier tables that I have to keep synchronized between Archived Table and Work set table or do I pull it from the Work set only?

In the first case it means that I need to have a mechanism to update always two tables (suppliers in work-set and archiving because the supplier name might change over time) and in the second one I would have the problem that if I archive a supplier record my join will have to change (because the row moves from the work-set to the archived copy).

In both cases there are challenges and no solution gives you the magic pill.

FYI I have worked on both solutions on a medium sized project (application made of around 700 tables) and it's been a pain no matter which solution was implemented.

So I am for both options when it makes sense to do so. After all it can be really tough to hard delete a row that has lots of child records (an account code having millions of Journal Entries.. imagine the user waiting hours after pressing SAVE).

BTW, trigger rules. In most RDBMS they can be executed by statement and not on each row, you'll have the entire set of deletions available in a table variable (in SQL Server this is the only option, in Sybase you can choose for example).

We use them on each of our tables to implement our database independent replication (so that we can replicate data from Sybase to Sql Server to Oracle) and they work well without issues. There are many application for triggers where they prove their usefulness.

(obviously I won't fight with people who don't like the idea, just wanted to express my preference).

# re: Soft-deletes are bad, m'kay?

Friday, February 27, 2009 12:37 PM by Dmitry S.

I agree that joins make soft-deletes more difficult. But a good ORM helps a lot here. You can do cascading updates and deletes as well as providing your own implementation for the operations. Such an ORM would also support filters that can be used to make sure the IsDeleted condition is added to every necessary table in queries.

# re: Soft-deletes are bad, m'kay?

Tuesday, March 03, 2009 12:03 AM by TSlavens

@morton: AFAIK, a trigger (usually combined with a sequence) is the only way to achieve surrogate keys on Oracle, and probably a few other RDBMSs.

# re: Soft-deletes are bad, m'kay?

Tuesday, March 03, 2009 6:34 AM by moi@walteralmeida.com

Hello Frans,

Interesting post that lead to many reactions!

Fact is this is not an easy subject and that there is always the two sides problem of talking of the same thing (what do we mean by delete) and be generic enough to address many domains and specific cases. Because yes: every company/software project has its own constraints and rules and there is no easy, ready to use solution to any software design problem. If there would be, then developing good quality software applications would be straightforward and we can outsource all our software projects to big armies of developers somewhere in India or any emergent country with many available people at low prices. However it’s not the case and designing&developing good quality software is hard and many big offshore projects failed for not having understood that fact …

But anyway this is not the subject. I like your post Frans, and agree with your analysis of why soft deletes can be bad, why one should first analyse the need for soft deletes and the implication of doing so before going for it (and yes in some cases it might be a good solution) or for another alternative. It is therefore a matter of analysing the real business needs, find out the available technical solutions to be in a position to choose the more relevant one for the matter.

I have been thinking through the soft delete scenario. I am a great fan of Llblgen and therefore thought trough how I could use Llblgen to solve this problem. Triggers are a good solution at the database level but in my understanding there is no solution with Llblgen pro to model and automate their creation (well llblgen is data-centric and relies on an already existing database schema. I must say I would love to see Llblgen move to the path of modelling first and generating DB + business layer, even though generating DB might not always be the best solution).

I thought of using the auditing facility of Llblgen pro for solving the problem of archiving deleted rows: audit the delete actions on entities to store the deleted entities information (+ timestamp and any other information I might want to store, like for instance the identity of the caller). What do you think of this solution Frans? Maybe not as well-performing as triggers? Great advantage is that it can be part of the model and can be generated though code templates, and extensible to add extra business information as required.

Another thought regarding the idea that soft-deletes are useful for reporting purposes on old data: In big enterprise solutions, the relational database (or better said: OLTP) IS NOT supposed to be used for reporting purpose. Well it can be used for real time kind of reporting and real-time views on data. But for advanced reporting purposes we usually set up a data warehouse and a star schema, including scheduled import of data from the relational database. And this data warehouse will contain all historical data, and is really relevant for reporting on historical data (therefore even on data not available anymore on the main OLTP database). For this scenario at least, the data warehouse eliminates the need for soft-delete.

My last point is just to insist on the fact that there are many design issues to be covered, many solutions and ways to achieve the same result, and that in most cases no already baked answers that would eliminate the need for a proper analysis and design phase.

Well just food for thoughts:)

# re: Soft-deletes are bad, m'kay?

Monday, March 09, 2009 10:42 AM by Ted Wise

Soft deletes aren't bad, they're a tool.  In a system with undo requirements and extensive foreign key constraints, they are practically a necessity.  Simply put, when you need them there aren't any realistic options.  Shadow tables won't help you with foreign key constraints.  You can use a date field as the deletion flag.  If the date field is null then the record isn't deleted.  This lets you filter out deleted records (which is not an expensive operation), go back in time and have a purge process which removes deleted records over a certain age.

# re: Soft-deletes are bad, m'kay?

Tuesday, March 10, 2009 4:58 AM by Jerwin

@Dmitry

Why do you want to soft delete the inactive/expired customers? It should only tag as "In Active", that's it.

Expired membership should NOT be deleted (rather soft or hard).

# re: Soft-deletes are bad, m'kay?

Thursday, March 12, 2009 5:50 PM by Dmitry S.

@Jerwin,

As far as the e-commerce domain is concerned, inactive records do not exist (ala deleted). On the database level the records are not deleted so you can renew them using a service.

Marking a database record with "Active = 0" is exactly what I consider a soft delete.

I completely agree that marking records inactive without an intent to restore them makes no sense. There point here is there are reasons to do soft-deletes.