A firewall between web and database servers can cause intermittent problems
This story starts, as all stories should, with me sitting on a beach outside of São Paulo, Brazil. My team was releasing some bug fixes to our production setup (we do this every week). When I got back, I saw that our sites were having very intermittent problems executing SQL against our database. A trace through the code found that the sites were able to establish the connection to the database with no problems, but when the code actually ran a query (or stored procedure) using that connection, we'd see various mystifying error messages: "out of memory", "SQL Server does not exist or access denied", and some DBNETLIB errors.
Let me step back further and outline the setup. We have two load-balanced web servers running a mixed ASP and ASP.NET application. It's a homegrown CMS built over the last 7 years by a series of talented (and many not-so-talented) developers. The data comes from a SQL Server 2005 cluster that resides on a different subnet from the web servers. Redundant Nokia firewalls control all traffic into, out of, and between the networks, and those firewalls are managed by a third party.
The most inexplicable thing about this whole scenario was that the sites would experience these problems only during periods of low usage. During peak activity times they clicked along nicely. It was confusing at first. Our research eventually pointed us to the firewalls, and we found the source of the problem: the firewalls would sever any connection that showed no activity for 3600 seconds (one hour). The web servers would open database connections and pool them. After an hour of not being used, a connection was severed at the firewall level, but the web servers still believed it to be valid and usable. When a user then hit the site, one of these dead connections would be drawn from the pool and an error would result.
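To make the failure mode concrete, here is a minimal Python sketch of a pool that refuses to hand out connections that have sat idle for too long. It is an illustration only, not the pooling code in our ASP/ASP.NET stack: the `IdleAwarePool` class, the `connect` factory, and the 3000-second `MAX_IDLE` margin are all hypothetical names and values chosen for the example. Only the 3600-second firewall timeout comes from the story above.

```python
import time
from collections import deque

FIREWALL_IDLE_TIMEOUT = 3600  # seconds; the firewall setting described above
MAX_IDLE = 3000               # assumed safety margin, kept below the firewall timeout


class IdleAwarePool:
    """Toy connection pool that discards connections idle for too long."""

    def __init__(self, connect, max_idle=MAX_IDLE, clock=time.monotonic):
        self._connect = connect    # hypothetical factory that opens a new connection
        self._max_idle = max_idle
        self._clock = clock        # injectable so the idle logic can be tested
        self._idle = deque()       # (connection, time it was returned to the pool)

    def acquire(self):
        now = self._clock()
        while self._idle:
            # Take the most recently returned connection first.
            conn, returned_at = self._idle.pop()
            if now - returned_at < self._max_idle:
                return conn
            # Idle too long: the firewall may have silently dropped it,
            # so close it (if it supports close) instead of reusing it.
            close = getattr(conn, "close", None)
            if close is not None:
                close()
        # Nothing usable in the pool, so open a fresh connection.
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

A pool without the idle check is exactly what bit us: it would happily return a connection that had been sitting untouched for more than an hour, and the first query on it would fail. Most real pooling layers expose some form of connection lifetime or validation setting that serves the same purpose, though the names differ by driver.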
Fortunately for us, our firewalls were configurable. We placed a policy change request with our firewall team and increased the timeout period. The problems immediately ceased. Our current working theory about why this problem began while I was away is that one of the bug fixes we put live removed a bug that had been subtly generating extra traffic between the web and database servers. That extra traffic had kept the pooled connections from ever sitting idle long enough to hit the firewall timeout. We believe this to be true because we noticed a 2500% increase in packets dropped on port 1433 on our firewall after we released the bug fixes.