At ORCS Web, we've recently started to use DFS (Distributed File System) for some of our high-availability offerings that use a central NAS (Network Attached Storage) content server. We're using DFS for the content server, both for replication and for automatic failover to a backup server in the event of maintenance or a server failure.
There were a number of things that I learned while researching, testing and rolling out DFS for webfarm content hosting that I'll share here. This isn't a step by step walkthrough, but rather some pointers that you will hopefully find useful.
DFS has many uses, ranging from keeping content in sync between different physical sites to giving a single easy-to-remember path that can serve up content from a variety of folders across a local or wide area network (thus the 'distributed' in DFS).
DFS in its simplest form is a way to have a single friendly UNC path on your network which can have folders distributed across multiple servers. This friendly UNC path stays the same while the real folders that it accesses behind the scenes can be almost anywhere. Subfolders can point to completely different locations on disk or to different servers on your network. This flexibility is great for our webfarm situation and allows a primary and at least one backup server to handle the content with a clean failover solution in the event that the primary server fails.
The installation is fairly straightforward once you understand the concepts. Partial DFS functionality is already installed on Windows Server 2003. The replication side of things needs to be installed separately. As long as you've upgraded to Windows Server 2003 R2, you can install this from Add/Remove Programs under the Distributed File System category. I recommend installing all three optional features, as the extra management tools are better for managing your redundant DFS system. If you will use replication, this needs to be installed on the servers hosting the namespaces and on the servers hosting the folder targets.
The extra replication features of R2 do require Active Directory changes. If you have already upgraded your domain controllers to R2, then no additional action is required. If you haven't upgraded your domain controller to R2, no worries, you aren't required to do so, but you do need to extend the schema. Here is a link on how to do that:
Like anything of this nature, make sure to have a good disaster recovery plan in place and do this at a non-peak time. But the schema installation is straightforward and doesn't cause any interruption of service in Active Directory.
Once DFS is installed, there are three hotfixes that should be applied:
http://www.microsoft.com/windowsserver2003/technologies/storage/dfs/hotfixes.mspx
One is required for the client failback feature, which lets clients fail back to the primary content server when it's back online after a failure. Another allows you to have multiple domain-based DFS namespaces on Windows Server 2003 Standard Edition if you desire. The third supposedly fixes a potential RPC issue with replication, although I didn't run into this issue. KB Article 898900 needs to be installed on all of the servers accessing DFS (the web nodes). The other two need to be installed on the DFS content servers.
You have two graphical tools to use at this point, and both support most features. My preference is the DFS Management tool, which is available after the Add/Remove Programs step above. You'll find this in Administrative Tools.
There are three terms/levels to take note of: Namespace, Folder and Folder Target. The terminology changed with R2, so don't get confused with terms you used in the past.
Top Level - Namespace
A namespace is a container to hold the folder and replication settings. The path to the namespace might be something like \\Domain\Webfarm. You can have multiple namespaces per server.
Second Level - Folder
A folder is a virtual DFS folder which can have one or more target folders. The name of the folder is what is used in the UNC path. For example \\Domain\Webfarm\Site1, where Site1 is the Folder.
Third Level - Folder Target
A folder target is the real location of the content. This path is masked though and not seen in the DFS UNC path.
You can have multiple folder targets which point to different physical locations. There are various options to determine which folder target is used, but in our case we want to always point to a primary content server and only fail over to the backup content server when the primary server is unavailable.
Active Directory comes into play too with domain-based namespaces but management is still done from DFS Management.
Here's where it gets fun. To have everything fully redundant in the event that a server fails, every part of this needs to be mirrored. I'll discuss the various levels of redundancy here.
The namespace server holds the metadata for the namespace. Be sure that this doesn't depend on a single server. The data stored here is often pretty small unless you have hundreds or thousands of folders in the namespace, so a dedicated server isn't necessarily required for this role as long as the namespace server can always respond quickly to any queries. The namespace servers can be the same servers as your content if you want.
To create a mirrored copy of the namespace, in the DFS Management tool, right-click on the Namespace and click on "Add Namespace Computer". Here you can point to an existing share on a different server or create a new share.
DFS masks which server is used for the folder target. To fully use DFS in this situation, you will need to point to multiple folder targets. In my situation, I want to have one server always used as long as it's available. I don't want to hit a random server because there could be data integrity issues. DFS replication is good, but it doesn't handle data locking or data write-through. This means that there could be a delay from when something is written on disk until it has replicated to all other servers. For that reason, I only want to fail over when absolutely necessary.
To achieve this there are a few things that are necessary.
- The failback hotfix mentioned above needs to be installed.
- All webfarm nodes need to be running Windows Server 2003 SP1 or later.
- The caching duration for the folders needs to be changed. The default is 1800 seconds (30 minutes), which is too long for our situation. A long cache duration means that fewer requests are made to the namespace server, but it also means that the failback could take up to 30 minutes after the primary server is back online. You can update this by right-clicking on the folder in "DFS Management", going to Properties and then the Referrals tab. Make sure to do this on each new folder. You can also change the cache duration on the namespace, but the default is already 300 seconds (5 minutes).
- In the Referrals tab of the namespace properties, check the "Clients fail back to preferred targets" checkbox.
- In the Referrals tab of the folder properties, check the "Clients fail back to preferred targets" checkbox.
- On the properties of the primary folder target, in the Advanced tab, enable "Override referral ordering" and select "First among all targets".
- On the properties of the backup folder targets, in the Advanced tab, enable "Override referral ordering" and select "Last among all targets".
Now you have a primary/backup server configuration that will always use the primary server as long as it is available.
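To make the effect of the referral ordering concrete, here is a minimal Python sketch of how a client ends up choosing between folder targets under these settings. It is only a toy model and not how DFS is implemented. The server names content1 and content2 and the numeric rank values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FolderTarget:
    path: str            # hypothetical UNC path of the real share
    rank: int            # 0 = "First among all targets", 1 = no override, 2 = "Last among all targets"
    online: bool = True  # whether the server hosting this target is reachable


def pick_target(targets):
    """Return the path of the first online target in referral order, or None."""
    for target in sorted(targets, key=lambda t: t.rank):
        if target.online:
            return target.path
    return None


primary = FolderTarget(r"\\content1\Site1", rank=0)
backup = FolderTarget(r"\\content2\Site1", rank=2)
targets = [backup, primary]

print(pick_target(targets))  # the primary is used while it is online
primary.online = False
print(pick_target(targets))  # the backup is used only while the primary is down
primary.online = True
print(pick_target(targets))  # with failback enabled, clients return to the primary
```

The last step only holds in real life when the failback hotfix and the "Clients fail back to preferred targets" setting are in place. Without them, a client would keep using the backup after the primary returns.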
The Active Directory part of things is done automatically and apart from the steps mentioned already, doesn't need any extra configuration. Just be sure to have redundant domain controllers in your Active Directory environment.
Links and Paths
There is a growing list of links and paths that can be used for testing purposes. Let me summarize them here, assuming that the folder is called Site1 and the folder targets are also given the same name.
Using the DFS path directly: (DFS level)
Accessing directly using the first namespace server: (namespace level)
Accessing directly using the second namespace server: (namespace level)
Accessing content directly on primary server without using DFS: (folder target level)
Accessing content directly on second server without using DFS: (folder target level)
Notice that it's the DFS path (\\domain\webfarm\Site1) which will be used on the web servers and for most purposes. It will always be the same, regardless of how the namespace servers or folder targets change over time. The other paths are for testing and troubleshooting and could change over time.
With R2, DFS replication uses what is called Remote Differential Compression (RDC), which sends only the changed parts of a file across the wire rather than the entire file. This is especially handy when replicating across a wide area network, but it's also good for this situation.
If you set up two or more folder targets using DFS Management, the wizard should have asked you if you want to set up replication, but if you did things in a different order, you can set it up manually after the fact. This can be done using the DFS Management tool as well.
Replication between the servers isn't immediate, so DFS doesn't work well for transactional data where both servers need to be 100% in sync within a couple of seconds of each other. But for a website situation that is mostly read intensive, DFS works great.
You have a few replication topology options, but in our situation we'll use the full mesh, which means that every server replicates to every other server. This means that in a failure situation, the content changes made on the backup server will be pushed back to the primary server when it is online again.
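As a quick illustration of what a full mesh implies, the Python sketch below lists every replication connection for a set of servers. The server names are hypothetical, and the code only models the topology; it doesn't configure anything in DFS.

```python
from itertools import permutations


def full_mesh(servers):
    """Return every directed replication connection as a (sender, receiver) pair."""
    return list(permutations(servers, 2))


# Hypothetical primary and backup content servers.
for sender, receiver in full_mesh(["content1", "content2"]):
    print(f"{sender} -> {receiver}")
```

With two servers this gives one connection in each direction, which is why changes made on the backup during an outage flow back to the primary. The number of connections grows quickly as servers are added (n * (n - 1) for n servers).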
How Good Is It?
DFS failovers are pretty impressive. If the primary content server becomes unavailable, DFS will fail over to the backup content server within a few seconds. In this webfarm situation, almost every time that the primary server fails, the HTTP request will retry for a few seconds until IIS is able to serve up a successful page.
This means that there is effectively zero downtime if the primary content server fails. The only issue I ran into in testing was when a page using master pages or web controls was half loaded at the moment the primary server failed. ASP.NET could potentially process half of the page and fail processing the rest. But this is pretty rare, and I would say that the failover is as close to perfect as can be.
A failure of the namespace server is even smoother, resulting in no noticeable downtime or slowness.
File Change Notification in ASP.NET
There is one thing to keep in mind during a failover and failback situation. ASP.NET and IIS use what is called File Change Notification (FCN) to let IIS know of any changes to files. For example, if you add a new .dll to your /bin folder, ASP.NET will recycle the AppDomain and reload and recompile some of the site. During a failure, although the switchover is smooth, it does take a few seconds, which is abrupt enough for IIS and ASP.NET to reestablish the File Change Notification handle using the different content server.
The issue comes with the failback. The failback is so smooth that the File Change Notification isn't updated back to the restored server. This means that if you make any changes to ASP.NET files on the restored content server, the changes aren't noticed by IIS and ASP.NET. Even deleting the entire /bin folder won't be recognized by ASP.NET if the site was visited and cached while running on the backup server. Static pages don't have this issue, but the caching in ASP.NET makes this a problem. At the time of this writing, I'm working with Microsoft Product Support Services (PSS) to try to find a good solution for this. In the meantime, the workaround is to recycle the app pool of the affected site(s), after which they will start to function normally again. So, this isn't necessarily a show-stopper, but it is something to keep in mind with the failover/failback.
Caching and DFS
DFS client computers (webfarm nodes in this case) cache the DFS information for the length of time that you specify, as I mentioned already. This shouldn't be too low or you will have too much traffic to the Namespace server, but it shouldn't be too high or changes to the namespace and failbacks to a restored server will take a long time to be noticed. It is up to your environment what you want to set this at, but in every situation, it's important to know that there is some caching that takes place.
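The tradeoff can be put into rough numbers. The Python sketch below assumes a deliberately simple model in which each webfarm node asks for a fresh referral once per cache period, so treat the output as an order-of-magnitude guide rather than a measurement. The node count of 4 is a made-up example.

```python
def referral_tradeoff(cache_seconds, nodes):
    """Rough model: each node requests a fresh referral once per cache period.

    Returns (referral requests per hour across all nodes,
             worst-case minutes before a node notices a failback or namespace change).
    """
    requests_per_hour = nodes * 3600 / cache_seconds
    worst_case_minutes = cache_seconds / 60
    return requests_per_hour, worst_case_minutes


# 1800 and 300 are the folder and namespace defaults; 60 is an arbitrary shorter value.
for cache_seconds in (1800, 300, 60):
    requests, delay = referral_tradeoff(cache_seconds, nodes=4)
    print(f"{cache_seconds:>5}s cache: about {requests:.0f} requests/hour, "
          f"changes noticed within about {delay:.0f} min")
```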
Keep in mind that a new folder added to your DFS namespace won't be noticed immediately. You can force the DFS client cache to be cleared by running dfsutil /PktFlush on the client machine (the webfarm node). dfsutil.exe is a tool that is available in the /support/tools folder of the Windows Server 2003 installation CD. I simply copy that file to C:\Windows\System32 so that I can run dfsutil from the command prompt.
When setting up new sites, make sure to wait until the new site has been recognized by all of the webfarm nodes, or force a cache flush from all of the nodes before attempting to set up or update the site.
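One way to do the waiting part is to poll for the new folder from each node. The following Python sketch is an assumption about how that could be scripted, not part of any DFS tooling. The function name and the timeout values are made up for illustration.

```python
import os
import time


def wait_for_folder(path, timeout_seconds=600, poll_seconds=10):
    """Poll until `path` is visible as a directory; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if os.path.isdir(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Run on a webfarm node, wait_for_folder(r"\\domain\webfarm\Site1") would return True once that node can see the new folder through the DFS path. It has to be run on every node, since each node keeps its own cache.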
Backups of the Namespace
Make sure to make regular backups of your Namespace. This can be done easily using DFSUtil. Simply export to an .xml file on a regular basis and have your backup process back up that file. An example of the syntax needed is:
dfsutil /root:\\OW\webfarm /export:c:\NameSpaceBackups\DateToday\webfarmroot.xml
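If you want the dated folder name filled in automatically, a small script can build the command line. This Python sketch is only one possible approach. It uses the same /root and /export switches as the example above, the function name and date format are arbitrary choices, and it assumes the dated folder already exists.

```python
import datetime
import ntpath


def build_export_command(root, backup_dir, name, today=None):
    """Build the dfsutil export command for a backup file in a dated folder."""
    today = today or datetime.date.today()
    # ntpath joins with Windows separators regardless of where the script runs.
    target = ntpath.join(backup_dir, today.strftime("%Y-%m-%d"), name + ".xml")
    return ["dfsutil", f"/root:{root}", f"/export:{target}"]


command = build_export_command(r"\\OW\webfarm", r"c:\NameSpaceBackups", "webfarmroot")
print(" ".join(command))
```

On a server where dfsutil is on the path, the resulting list could be passed to subprocess.run from a scheduled task.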
I did run into something when importing the namespace. I received the following error:
System error 1168 has occurred.
Element not found.
After some research and stumbling through it, I found out that I was using the domain name 'orcsweb.com' instead of NetBIOS name 'OW' in the UNC path, which the import didn't like. OW is used by DFS in this case. The export worked with either name, but the import only worked with \\OW\ which is what was in the exported XML file.
Links and Resources
Here are a number of resources that I've found helpful:
Microsoft DFS Landing page
DFS hotfixes, post R2
Whitepaper on designing Distributed File Systems
There is a lot to consider with DFS and I've only scratched the surface, but I hope that this has been helpful to cover a few common configuration settings that are required for configuring DFS on Windows Server 2003 R2 in a webfarm situation.