Soft-deletes are bad, m'kay?

Have you ever run into database tables with a field used to mark whether a row has been 'deleted'? Probably. These fields are used to implement 'soft-deletes'. For the soft-delete impaired, a quick introduction. Soft-deletes are row deletes which don't really happen: instead of removing the row from the database table, a field in the row is set to a value which signals that the row is officially 'deleted'. Every SELECT statement on that table then has to filter on that field's value, so only rows which aren't marked as 'deleted' are returned (as the deleted data is, semantically, not there anymore).
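
For example, a minimal sketch in SQL (the table and the IsDeleted column are illustrative, not from any particular schema):

    -- 'Deleting' a customer with soft-deletes: the row stays, a flag flips.
    UPDATE Customers SET IsDeleted = 1 WHERE CustomerID = 'ALFKI';

    -- Every subsequent SELECT has to repeat the filter to hide 'deleted' rows.
    SELECT CustomerID, CompanyName
    FROM Customers
    WHERE IsDeleted = 0;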

If this sounds rather awkward, it is. However, there are people who insist on having soft-deletes instead of real deletes, because it allows them to go back in time and look at the data that was deleted, as all data is, well..., still there. A small group of those people even believes that soft-deletes allow them to roll back to deleted data, a kind of 'undo' facility.

The truth is, soft-deletes using status fields in rows are a bad solution to the real problem. Fortunately there are alternatives.

First, let's look at why people would want soft-deletes. In general there are two reasons, both already mentioned above: to be able to look at deleted data and to be able to roll back to deleted data. Let's discuss the second reason first: rolling back to deleted data.

Rolling back to deleted data is hard.

Let's use Northwind as our example database. Say we don't delete rows from that database, but flag them as 'deleted' using a new field, IsDeleted (bit), added to every table. If you want to roll back a deleted Order row, it looks as simple as setting its IsDeleted field to 0, right? But what if that Order row refers to a deleted Customer row through its CustomerID foreign key? For the RDBMS, everything is OK, as the 'deleted' Customer row is still in the Customers table; it just has its IsDeleted field set to 1. However, a SELECT statement which fetches the just-recovered Order row together with its Customer row runs into a problem: semantically, the Customer isn't there, because the mandatory IsDeleted filter prevents the Customer row from showing up. The only solution is to also recover the deleted Customer row. The Order might have had OrderDetail rows too, which then have to be recovered as well.
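
To make this concrete, a sketch in T-SQL against Northwind (the IsDeleted columns are, again, my illustration):

    -- Recovering the Order looks trivial:
    UPDATE Orders SET IsDeleted = 0 WHERE OrderID = 10248;

    -- ...but a query fetching the recovered Order with its Customer still
    -- comes up empty: the mandatory filter hides the 'deleted' Customer.
    SELECT o.OrderID, c.CompanyName
    FROM Orders o
    INNER JOIN Customers c ON c.CustomerID = o.CustomerID
    WHERE o.IsDeleted = 0
      AND c.IsDeleted = 0
      AND o.OrderID = 10248;

    -- So the Customer, and any Order Details rows, must be recovered too:
    UPDATE Customers
    SET IsDeleted = 0
    WHERE CustomerID = (SELECT CustomerID FROM Orders WHERE OrderID = 10248);

    UPDATE [Order Details] SET IsDeleted = 0 WHERE OrderID = 10248;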

In short, this isn't about recovering a row, it's about recovering a graph. Recovering graphs instead of single table rows is much more complicated, due to the dependencies between the involved entity instances (rows). Writing roll-back code for this is therefore likely a complex affair, especially if you want to make sure the data integrity of the working set stays 100% correct (so all rows involved in the recovered graph indeed have their IsDeleted flag set to 0 and are part of the working set). In the end you'll run into issues where rows have to be merged, similar to source-control systems (e.g. when a row is deleted several times in different graphs). So rolling back graphs is not likely to be implemented in the average system; it's therefore not the main reason for soft-deletes.

On to the first reason: looking back at old data.

Old data is old for a reason.

When a system deletes rows from a table, it's hopefully done inside a transaction, so when something goes wrong, the delete is undone. When the transaction completes, the delete is final, and the data is gone. If you ever want to look at that old data again, you can't: it's gone. Let's ignore that some people can't throw anything away ("I might want to look at it in 2 years, then what!? <wild panic>") and focus on what 'delete' actually means: it means the data is considered no longer needed and is therefore removed from the working set. If the data is still necessary after the delete, don't execute the delete in the first place; it's the same with your file system, really.

However, 'no longer needed' doesn't mean the data is totally worthless in every situation: old data can be useful for reports on the history of an entity, what happened when, etc. But how often will those reports be run? Every 5 minutes? Will the average user of the system look at historic data all day long, or work with the actual working set? The answer to these questions is likely the same: functionality consuming historic (deleted) data is rarely used compared to functionality consuming working-set (not deleted) data.

Let it be clear that it is sometimes required for legal reasons not to toss away data; however, in other situations the same requirement leads to a different solution: old email is moved to an archive, not kept in the inbox. Could that solution be useful here as well?

Implications of using soft-deletes

So the main reasons why some people want this are clear, but what are the implications when soft-deletes are used? I've listed a couple below; I'm sure there are more, but I think this list is already convincing enough to look at another solution instead.

  • Queries become very complex. Make no mistake: once you introduce a field to signal whether a row has been deleted, you have to make sure every table reference is accompanied by a filter on that IsDeleted status field. Every table reference (see the sketch after this list). If you forget one, it's over: your data is then officially not correct. Maintainability becomes more cumbersome, and as most time on a software project is spent on maintenance, it's something which will hurt the project, hard.
  • Queries become slower over time. Every 'deleted' row is still there, and to the database system it's just an ordinary row like any other. This means DML operations on the table become slower, but especially SELECT statements: the percentage of 'live' rows in the total set shrinks as more and more rows become 'deleted'. This baggage hurts in the long run, especially in tables with a lot of inserts/deletes: the working set might stay the same (e.g. 10K rows), but the total set of rows grows every day and might reach millions. When you have millions of rows in a table while the rows which are 'not deleted' are a small percentage of that, it dramatically influences the performance of queries, compared to a table containing only the rows you actually have to work with.
  • Using constraints (UC, FK) is impossible. Using a unique constraint (UC) is not really possible, as the RDBMS takes the 'deleted' rows into account as well. So a value might be unique within the working set, but not within the total set of rows in the table, and then the update or insert fails. And in the Customer/Order example above, we've seen that foreign key constraints don't really work anymore either, as they don't protect us from 'deleting' the PK side while keeping the FK side: for the RDBMS, both rows are still there and everything is fine (which is not the case).
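
To illustrate the first point, a sketch of an 'innocent' three-table Northwind query once soft-deletes are in place; forget a single IsDeleted predicate and the results are silently wrong:

    SELECT c.CompanyName, o.OrderID, od.ProductID
    FROM Customers c
    INNER JOIN Orders o ON o.CustomerID = c.CustomerID
    INNER JOIN [Order Details] od ON od.OrderID = o.OrderID
    WHERE c.IsDeleted = 0
      AND o.IsDeleted = 0
      AND od.IsDeleted = 0;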

A better solution to these requirements

There's a better solution, and I've already mentioned it briefly: archiving. RDBMSs (I assume you're using a proper, professional, solid, ACID-compliant database, not some toy RDBMS) usually sport a feature called triggers. Triggers are neat things: they get called when something happens. You can compare them to event handlers, really: an event occurs (e.g. a row gets deleted from a table) and the trigger responsible for handling that event is called. Additionally, the trigger runs in the same transaction as the code which fired it, so all actions taken inside a trigger are rolled back when that transaction is rolled back.

If you look closely at the two main reasons for soft-deletes, you'll recognize that both can be satisfied by simply keeping the data deleted by DELETE statements around: looking back at old, deleted data is possible wherever that old data is located, and rolling back deleted data is not going to be less complex when the data lives elsewhere, as the complexity is in the graph roll-back mechanism, not in the location of the data to roll back to.

This leads to the solution of an archiving database. An archiving database is a catalog (or schema, if you wish) which contains the same table definitions as the real database, perhaps without the UC and FK constraints, as data integrity is implied by the integrity of the source data (after all, it only archives data that was already consistent). Every table of which you want to keep deleted data around now gets a DELETE trigger in the real database, which simply grabs the row(s) being deleted and inserts them into the same table in the archiving database. If the transaction fails, the inserted rows in the archiving database roll back too; if the transaction succeeds, the data is successfully archived and still available. Additionally, you could add date/time fields to the rows in the archiving database to store exactly when each row was deleted; the trigger can fill these values when it inserts the deleted row.
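
A minimal T-SQL sketch of such a trigger for Northwind's Customers table (the ArchiveDB catalog and its extra ArchivedOn column are assumptions for the example, not part of Northwind):

    CREATE TRIGGER trg_Customers_Archive
    ON Customers
    AFTER DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Copy every deleted row into the archive catalog, stamped with the
        -- deletion date/time. The trigger runs inside the deleting transaction,
        -- so a rollback also rolls back the archived copies.
        INSERT INTO ArchiveDB.dbo.Customers
            (CustomerID, CompanyName, ContactName, ContactTitle, Address,
             City, Region, PostalCode, Country, Phone, Fax, ArchivedOn)
        SELECT CustomerID, CompanyName, ContactName, ContactTitle, Address,
               City, Region, PostalCode, Country, Phone, Fax, GETDATE()
        FROM deleted;
    END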

This makes sure the data is still available, so the reasons why people want this are still met, yet it no longer pollutes the application's working set, and the implications of soft-deletes are gone. The only thing you have to be sure of is that the triggers and the archive database are maintained together with the real database (so schema changes in the real database are applied to the archive database as well, or you could go overboard and add a new archive database!). However, that's a small price to pay compared to the overly complicated queries one otherwise has to write (even with O/R mappers) and work with, queries which also have to be maintained and documented for as long as the application is in use. Using the triggers and the archive database, the application can be written normally, can be tested normally, and no data is thrown away. Ever. One could extend this system with an UPDATE trigger as well, so updates are also tracked and even value deletes at the field level are preserved.

So do yourself a favor, next time someone tells you to use soft-deletes, discuss the implications and offer this alternative solution. Everyone will be better off: you, the customer, and the group of people who will maintain the system for the next 20 years.

29 Comments

  • "Queries become very complex" - Use views instead of tables when selecting.
    "Queries become slower over time" - Use filtered indexes, instead of the regular indexes.
    "Using constraints (UC, FK) is impossible" - Unless you don't overwrite keys and use constraints correctly.

    I agree that soft deletes are not to be taken lightly, but discarding the idea entirely is not the right thing to do.

    Also, every DBA I have come across in my entire professional career (be it Oracle, SQL Server, etc.) has discouraged using Triggers entirely. Some even go so far as to discourage DELETEs and UPDATEs on variable length fields.

    Best approach, IMHO, is to use soft-deletes, but have a batch maintenance operation every once in a while that moves soft-deleted data to an archive database, deletes it from the origin and then optimizes the remaining data.

  • @Omer
    "Use views instead of tables when selecting."
    It's a way to deal with the extra complexity, though this too can get more complex over time (anything built on those tables now has to use the views, so views end up stacked on views)

    "Use filtered indexes, instead of the regular indexes."
    Not everything is done through indexes. A table scan is going to be slower. And updating the indexes also gets slower over time. It's the same difference as between a table with 1K rows and a table with 1 million rows. Also, not all databases support filtered indexes (SQL Server 2005, for example, doesn't).

    "Unless you don't overwrite keys and use constraints correctly."
    No, that's not going to help. See my example, where the FK should be raising an error but doesn't. Also, re-using a UC value can't be done, as the 'deleted' row occupying that unique value is still there (just not in the working set).

    "I agree that soft deletes are not to be taken lightly, but discarding the idea entirely is not the right thing to do."
    I didn't discard it, I proposed an alternative solution to the problem.

    "Also, every DBA I have come across in my entire professional career (be it Oracle, SQL Server, etc.) has discouraged using Triggers entirely. Some even go so far as to discourage DELETEs and UPDATEs on variable length fields."
    erm... so creating views for every table, having to deal with non-working FKs etc. is not discouraged, but triggers are? Isn't this just about not being willing to use triggers because they seem complicated? Yes, you have to keep track of them, but you also have to do that with your views. Discouraging triggers is often done for the wrong reasons.

    Your approach with batch moves is similar to my archive-db proposal; however, you still have to deal with all the complexity of having ALL data in the same table AND you still have to archive, while the trigger approach doesn't have to deal with any complexity AT ALL. Big difference.

  • @Frans,

    I agree that this isn't the best solution, however you should note that not all soft-deletes are the same. Some require that the foreign keys and constraints remain alive, etc.

    As far as I understood from said DBAs, triggers are discouraged because they cause a heavy workload when doing large DML queries. Not sure exactly, but after hearing so many different, independent professional sources say the exact same thing - I don't think I'll doubt their wisdom.

    It all comes down to what your constraints are as a developer: do you have to make sure your solution scales well? Are you under pressing time constraints to complete development? Do you need to keep the solution as simple as possible? Are your tools even able to include triggers, filtered indexes or even views?
    Some of these points may conflict with each other.

    My point is simply that there's no silver bullet to anything and that includes soft deletes.

    (btw, my solution included hard-deleting the data after archiving)

  • I really enjoy this blog. All articles are so insightful.
    Hope Frans could update it more often.

    Thanks

  • Actually, Frans, I think it is worse than you point out. To use soft deletes to "go back in time" would require a datestamp in addition to the "delete flag", and that would imply a similar datestamp to capture alterations of data, and now we're suddenly talking about a different database schema altogether. In other words, the "delete flag" is a partial and insufficient solution to the problem at hand...

  • I'm a big fan of your blog, Frans, but I really think that taking an absolute and dogmatic approach to this is not the best way to go. I use soft deletes in my forum app because in the context of watching moderation, it makes total sense. I allow edits to be physically deleted, and logged on a different table, but my approach in this case is far more simple in terms of querying than anything you've suggested.

    I might concede that it's a good guideline not to use soft deletes, but it's not a golden rule by any stretch.

    @Joel: what exactly do you mean by Delete Replication?

    @Michael: good point!

    @Jeff: Isn't that a different thing? Making things not visible or accessible to some people isn't the same as removing them from the working set altogether, as that semantically means the data IS gone, it isn't there. You could roll back to a previous state, but that implies more work than simply switching a flag.

    I took the somewhat dogmatic stance because it's a recurring question for our o/r mapper: how can we implement soft-deletes? If you take into account that derived tables can pop up and that inheritance spanning multiple tables can be involved, you're in for a complex problem which doesn't really have a good answer other than looking at the real problem and solving that differently.

  • I'm sorry but I'm a bit confused. How will the 2-databases approach work in this scenario:

    Let's say we have a Customer info screen, which among other things contains a link to the original Employee that made the first contact with this customer.

    Somewhere else we have another screen which displays a list of Employees in our company,
    but since employees do not stay around forever, we only want to see the current "active" employees.

    But when I request the employee info that is no longer "active" from another screen (like the Customer info screen), I expect to see his data.

    What this means is I can't just permanently remove an Employee row from the db once he has left our company. I have to mark him "inactive" so that other screens can continue to refer to his info even after he leaves. Otherwise I would have to join from both the "Active db data" AND the "Archive db" data for the employees.

    In this case I don't see how I can get rid of the "is_active" column.

    (and there are many other examples that come in mind too. Columns like "created_by" pointing to a user in the system that is no longer "around" but still we have to keep it there as inactive user for reporting purposes)

  • We have a client that has totally abused the approach you recommend. They mandate that every table in a database has a corresponding _history and _changes table. These are populated by triggers on the main table, but the triggers require a username and other information to be passed along, which means every insert, update and delete must also pass additional parameters. This amounts to hidden, confusing business logic in the database. Frequently, their applications need to query across the tables, meaning UC and FK constraints are still not possible.

    While IsDeleted is not the proper solution, it _does_ indicate that the developers are not thinking through the domain requirements and are being too data-centric. In your Customer/Order example, the requirement is typically not that a customer needs to be deleted, it's that Joe in accounting doesn't want to see inactive customers on some report. Bob in Sales wants to see the inactive customers to try to make them active again. The "working set" to Bob is different from what it is to Joe. Thus, the developer needs to define an "inactive" customer, and incorporate that into the database and into his business logic and reporting. He _may_ do that by deleting from the "main" customer table and moving related data to a reporting database for sales, but I'm not sure hiding that business logic in triggers is a good idea. I'd much rather see it inside a customer.Inactivate() method.

  • soft-deletes are key. I work with an online reporting application that deals with anything from traffic collisions to citations.

    Reports can span 30+ tables, so with your method, deleting 1 report will cause us to delete records from 30+ tables.

    Rather, we could always just set the DateDeleted field on our row to the current date - and it's thus deleted.

    Data retrieval from any other source takes too long and isn't the best solution. If a customer deletes something that took them 3 days to do, and needs it back ASAP, you don't want someone to spend 30 minutes getting it back.

    As far as complexity goes, get a wiki or hire some programmers that know how to write proper SQL.

    However, this is always a discussion to have when you're designing your system. I, for one, vote for soft deletes.

  • @Frans

    How would you handle the case where only one entity in a graph needs to be deleted?

    i.e.:
    - a user posts an image.
    - images get displayed, including the user's display name.
    - the user removes his/her account.

    How do you get the user's display name if the user is in the archive database?

  • But what if you have a reference table you still want to be able to query, but aren't allowed to use for new entries? We had a doctor's scheduling application, and when a doctor retired we would move him to a historical table with all his appointments. All good until somebody managed to enter a new doctor with the same PK as the retired one, and all our historical appointment data went bad. We ended up keeping all doctors and flagging them as "can't assign appointments to him", so nobody would create twin doctors.

  • I used to use soft deletes in most of my apps, mostly to assist in rolling back. I gave it up because, as you pointed out, you end up recovering a graph which quickly became quite complex. I also found I was often writing a wizard (I work almost exclusively in winforms) for when the user wanted to purge the deleted data.
    It quite quickly got out of hand and I stopped using soft deletes. I occasionally use an inactive field, but this serves a different purpose.
    I hadn't thought of using triggers, and I will give it a go.
    Thanks for the post.

  • I worked on a project where soft deletes were required because of SOX compliance, so there are instances where they would/could be required.

  • I think everyone in the discussion needs to think very clearly about when a row is part of the working set and when it is not.

    In my software there are dependencies such that every row will probably be referenced somewhere else, and for those purposes it must stay in the working set.

    That is why we offer a built-in archive feature on our website, so a user can clean up his desktop without having to delete data.

    The historic reason is more relevant than not... Even if a user wants to delete his account, we need the account row in the table, with his name/username, because without it, due to the object-relational mapping, the messaging system won't be able to display his name.

    Not everyone has views

  • We had different requirements than many people, but at my last company we used soft-deletes very successfully. Here's what we did:

    * Every table had a GUID surrogate primary key.
    * We used a timestamp for our deleted flag, which defaulted to the max date
    * We partitioned our tables based on that timestamp
    * When we deleted an item, we appended the GUID PK to one of the "unique" fields if there was a UC (a sketch of this step follows below). Sure, we had to strip it out when displaying, but honestly, even in our app, which had over 200 entity types, we only had to do that on about 30 of them. Really, it was about an hour of coding to add the "strip" code.
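
    A sketch of that last step (hypothetical table and column names; SQL Server syntax):

        -- Soft-delete by timestamp, freeing the unique ProductCode value
        -- for re-use by appending the row's GUID PK to it:
        UPDATE Products
        SET DeletedOn = GETDATE(),
            ProductCode = ProductCode + '_' + CAST(ProductID AS nvarchar(36))
        WHERE ProductID = @productId;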

    We didn't run into most of the problems you list above, didn't have the additional overhead of a trigger-based archive system or an additional database to maintain, and performance didn't degrade over time thanks to the partitioning. Sure, we had to make sure that our collection queries filtered on that deleted field, but since we were using LLBLGen we just overrode the generator templates to take care of that for us.

    Will that work for everybody? No. Not everyone has access to partitioning or wants to write additional code. But for us, the couple of days invested in the infrastructure were more than repaid by the massive amount of time saved through reversible deletes, as well as by the in-place audit trail.

  • I see a third reason for 'soft delete'. In our system we use soft deletes for performance. The system is an OLTP system, but a single transaction will impact as many as 39 tables. We made significant performance improvements by implementing a 'soft delete' based on user action, with the 'hard delete' accomplished by an off-hours scheduled job.

  • @dfguy: I didn't post it because it was a useless remark, like your last one. I posted this one so people can see what some apparently find 'appropriate' these days as a response to a post about soft-deletes: name-calling and flaming without any reasoning. I mean... what sane person starts flaming like you did, and is so utterly rude, about a post describing a solution to a database/software problem?


  • Triggers? No way, not in my house. The hidden logic behind triggers typically remains elusive. As such, I think they should be avoided.

  • Frans, I completely agree here.

    @tmorton, someone has been propagandizing you. In most cases triggers are not a good idea. However, this is a very, very specific case, namely archiving deleted data. Once you know that deleted data is archived via triggers, you know it. It's not hidden magic and voodoo; that's just propaganda. It is much preferable to the complexity associated with so-called "soft-deletes."

  • I have audit data that refers to user information. There is a foreign key from the audit data table back to the user table. The audit data would live on even if a user is deleted.

    So, which do you prefer:

    1. Use soft deletes on the user table
    2. Remove the foreign key

    Thanks.

    Greg

  • @Greg: If data is required at some point in the future, it's not useless data, and therefore it can't be deleted. That's the misunderstanding many people reading this blog post have. If you need the user data to live on, to be useful to the audit data, you can't delete the user data, and therefore have to archive it. It's similar to the situation where you have scheduling info for year X and a new year arrives.

    In your particular case, the users aren't deletable. But you also can't use soft-deletes, as in: 'I will use a flag instead of DELETE FROM Users WHERE ...'
    The point is: soft-deletes are hidden, they hide information from you as if it's not there. So if I want to do:
    SELECT * FROM Users;
    with soft-deletes, I have to add a predicate to filter out the deleted rows, as these aren't there anymore (they're deleted). However, inactive users, unlike soft-deleted users, ARE showing up in that simple query. So an inactive/active flag on the Users table allows you to filter on which users are active and which aren't, but the flag doesn't hide information from you in simple queries. So your domain logic has to be aware that it has to work with active users. That's a domain requirement, and the filter is therefore provided by the domain logic.

    Soft-deletes are filtered out at the database level, as the data is semantically not there. That's the difference: soft-deleted rows are hidden from the domain logic.
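
    In other words (a sketch; the flag columns are illustrative):

        -- With soft-deletes, the filter is mandatory everywhere,
        -- or the results are simply wrong:
        SELECT * FROM Users WHERE IsDeleted = 0;

        -- With an active/inactive flag, the plain query is still correct;
        -- filtering on the flag is a domain decision, not a data-visibility rule:
        SELECT * FROM Users;
        SELECT * FROM Users WHERE IsActive = 1;  -- only where the domain asks for it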

  • I see this as a very relevant discussion and almost feel bad for those not "mature" enough to handle it that way.

    Absolutes are usually a bad thing to deal in. Frans, I applaud you for taking a stance, although I am not 100% in agreement.

    I am in agreement that you should always discuss the implications of a decision like this with everyone and help them understand (possibly through a short demonstration?) exactly what they are buying into.

    I've been on both sides of this argument and for that I always say, it depends.

    "Should" and "it depends" are perhaps the two best words in a developer's vocabulary. :D

  • IMO, soft deletes are often a very good idea. Imagine an online store/club that has a yearly fee for its customers. A customer has an order history, frequent-buyer discounts, item reviews, etc.

    Now the economy is bad and the buyer decides not to renew the membership for this year, or possibly forever. The business decides to keep all the inactive/expired customer data for 18 months before it gets deleted from the system by a scheduled script. The customer can renew the subscription on any day until that time.

    How do you deal with a scenario like this? You obviously cannot just delete the data as soon as the membership expires. Archiving the data might not be a bad idea, but if the customer renews his fee online, you would need a real-time process to move all the data back (customer, orders, ....) as a graph.

    A soft-delete field makes everything much easier, especially if you are using the customer as an aggregate with a repository responsible for the rest of the graph. Even if you have a moderate number of aggregate roots, it is still easier to change a field value in each one than to move a bunch of rows related by PK/FK back from a different database/table structure.

  • I agree that joins make soft-deletes more difficult, but a good ORM helps a lot here. You can do cascading updates and deletes, as well as provide your own implementation for the operations. Such an ORM would also support filters that can be used to make sure the IsDeleted condition is added to every necessary table reference in queries.

  • @morton: AFAIK, a trigger (usually combined with a sequence) is the only way to achieve surrogate keys on Oracle, and probably a few other RDBMSs.

  • Soft deletes aren't bad, they're a tool. In a system with undo requirements and extensive foreign key constraints, they are practically a necessity. Simply put, when you need them, there aren't any realistic alternatives. Shadow tables won't help you with foreign key constraints. You can use a date field as the deletion flag: if the date field is null, the record isn't deleted. This lets you filter out deleted records (which is not an expensive operation), go back in time, and have a purge process which removes deleted records over a certain age.
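
    A sketch of that scheme (column name and retention period are illustrative; SQL Server syntax):

        -- Working set: rows not yet soft-deleted.
        SELECT * FROM Orders WHERE DeletedOn IS NULL;

        -- Soft-delete: stamp the row instead of removing it.
        UPDATE Orders SET DeletedOn = GETDATE() WHERE OrderID = @orderId;

        -- Purge: hard-delete rows soft-deleted more than 90 days ago.
        DELETE FROM Orders WHERE DeletedOn < DATEADD(day, -90, GETDATE());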

  • @Dmitry

    Why do you want to soft-delete the inactive/expired customers? They should only be tagged as "inactive", that's it.

    Expired memberships should NOT be deleted (whether soft or hard).

  • @Jerwin,

    As far as the e-commerce domain is concerned, inactive records do not exist (i.e. they are 'deleted'). At the database level the records are not deleted, so you can renew them using a service.
    Marking a database record with "Active = 0" is exactly what I consider a soft delete.

    I completely agree that marking records inactive without an intent to restore them makes no sense. The point here is that there are reasons to do soft-deletes.
