Functional .NET - Fighting Friction in the BCL with Directory.GetFiles

Very recently on a project, I ran into significant issues with System.IO.Directory.GetFiles: I was getting access denied errors that halted further crawling of certain directories, and its performance on large directory trees was poor.  I wasn't happy with the design of the API either.  So I set out to fix some of these issues and come up with a design that better addressed my concerns, using some techniques from functional programming.

 

The Issues

There are several issues that led me to come up with an alternative for getting all the files in a directory. 

  1. Arrays shouldn't be returned from method calls
  2. Processor intensive: it iterates over the entire directory tree up front instead of calculating only what I need, when I need it
  3. Filtering is weak, limited to file name patterns
  4. Access denied errors from internal calls occur for no apparent reason and halt the method

Let's talk about each just a little bit more.

The first issue with GetFiles is that it returns an array.  Eric Lippert, of the C# team, recently posted "Arrays considered somewhat harmful", in which he describes why, in most cases, arrays should not be returned from public APIs.  In that post, he states:

You probably should not return an array as the value of a public method or property, particularly when the information content of the array is logically immutable. Let me give you an example of where we got that horridly wrong in a very visible way in the framework.  If you take a look at the documentation for System.Type, you'll find that just looking at the method descriptions gives one a sense of existential dread. One sees a whole lot of sentences like "Returns an array of Type objects that represent the constraints on the current generic type parameter." Almost every method on System.Type returns an array it seems. 

Given that the GetFiles method returns an array, callers are free to mutate its contents and no one is the wiser.  There is no need, once calculated, to change these values; each file path is what it is.  If you need to transform the contents, a Map (Select in LINQ) works just fine over an IEnumerable<string> instead.  The problem with arrays is that they are a big bucket of mutability.  The functional programmer in me doesn't like that as an option, and would rather hand back a sequence that doesn't invite mutation at all.
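
To make the contrast concrete, here is a small illustration (not from the BCL docs, just a sketch, and it assumes C:\Work exists and contains at least one file): the array result invites in-place mutation, while projecting over a sequence produces new values and leaves the originals alone.

// Assumes using System.IO and System.Linq.
string[] fromArray = Directory.GetFiles(@"C:\Work");
fromArray[0] = @"C:\Work\oops.txt";  // compiles fine and silently rewrites the result

// Projecting over the sequence creates new values instead of mutating old ones.
IEnumerable<string> fileNames = fromArray.Select(path => Path.GetFileName(path));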

Secondly, there are performance implications of recursing through a large directory tree all at once; it's very much a blocking call in the application.  Another key point in the functional programming world is to calculate only what you need, when you need it, otherwise known as lazy evaluation.  If I'm not going to use every item, eagerly computing the entire list of files is wasted work.  So there is a big win in returning an IEnumerable<T> and using yield return to produce each value on demand.
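
As a rough sketch of the idea (using a hypothetical EnumerateAllFiles helper, not the implementation from this post), yield return hands each path to the caller as it is discovered, so consuming only the first few results never touches the rest of the tree.

// Hypothetical helper for illustration; each directory is read only when
// the enumeration actually reaches it.  Assumes using System.IO and System.Linq.
static IEnumerable<string> EnumerateAllFiles(string root)
{
    foreach (var file in Directory.GetFiles(root))
        yield return file;

    foreach (var subdirectory in Directory.GetDirectories(root))
        foreach (var file in EnumerateAllFiles(subdirectory))
            yield return file;
}

// Only walks enough of the tree to produce ten paths.
var firstTen = EnumerateAllFiles(@"C:\Work").Take(10).ToList();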

Thirdly, the search option for deciding whether to include a given file or directory is just a string mask on the file name.  There are more expressive ways to filter than that one simple mechanism.  What if I want to return only hidden files, only compressed files, and so on?  The level of effort gets noticeably higher.  Instead, you could provide a Predicate<string> function to decide which files to include as the items are calculated.
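
For example, a predicate can express checks that a file mask never could, attributes and size among them (these particular filters are just illustrations, not from the post):

// Filters the wildcard syntax cannot express: attributes, size, and so on.
Predicate<string> isCompressed =
    path => (File.GetAttributes(path) & FileAttributes.Compressed) != 0;

Predicate<string> isLargeZip =
    path => Path.GetExtension(path).Equals(".zip", StringComparison.OrdinalIgnoreCase)
            && new FileInfo(path).Length > 10 * 1024 * 1024;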

Finally, there was the big blocking issue that brought this up in the first place.  There are cases in which the implementation throws rather odd internal Windows errors while searching certain hidden directories, so as implemented it simply wouldn't work for my needs. 

Given these issues that I faced here, what could I do about it?

 

Handling the Native Code

In order to fix the overall issue with the Directory.GetFiles method, we need to interact with native code.  This involves standard P/Invoke calls into kernel32.dll.  Three functions interest us for enumerating files and folders: FindFirstFile, FindNextFile and FindClose.  FindFirstFile, given a file name or search pattern, returns a SafeFindHandle and fills in a WIN32_FIND_DATA structure, which contains file information such as the file name, file size, date stamps and so on.  FindNextFile then takes that SafeFindHandle and returns whether it succeeded, along with another WIN32_FIND_DATA containing the next file or directory, should one be found.

We implement a custom handle class called SafeFindHandle to dispose of the underlying native handle.  Deriving from SafeHandle is the recommended way to ensure a handle is released reliably, even in the face of exceptions.  We override the ReleaseHandle method to call FindClose, which closes our handle. 

Let's take a look at the code to implement our native code calls:

internal sealed class SafeFindHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    [SecurityPermission(SecurityAction.LinkDemand, UnmanagedCode = true)]
    internal SafeFindHandle() : base(true) { }

    protected override bool ReleaseHandle()
    {
        return NativeMethods.FindClose(handle);
    }
}

internal static class NativeMethods
{
    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    internal static extern SafeFindHandle FindFirstFile(
        string fileName, 
        [In, Out] WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    internal static extern bool FindNextFile(
        SafeFindHandle hndFindFile, 
        [In, Out, MarshalAs(UnmanagedType.LPStruct)] WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll"), ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    internal static extern bool FindClose(IntPtr handle);

    [Serializable, StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto), BestFitMapping(false)]
    internal class WIN32_FIND_DATA
    {
        internal FileAttributes dwFileAttributes;
        internal int ftCreationTime_dwLowDateTime;
        internal int ftCreationTime_dwHighDateTime;
        internal int ftLastAccessTime_dwLowDateTime;
        internal int ftLastAccessTime_dwHighDateTime;
        internal int ftLastWriteTime_dwLowDateTime;
        internal int ftLastWriteTime_dwHighDateTime;
        internal int nFileSizeHigh;
        internal int nFileSizeLow;
        internal int dwReserved0;
        internal int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]  // MAX_PATH
        internal string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]   // 8.3 short name
        internal string cAlternateFileName;
    }
}
 

Now that we've defined the basis for our solution, let's move onto implementing the GetFiles method in a better fashion.

 

Implementing the Solution

We defined above a few issues that I'd like to solve with this implementation.  Functional programming concepts such as lazy evaluation, immutability, and higher-order functions will all play a part in the solution.  Let's answer each of the issues above in turn:

  1. Use IEnumerable<T> instead of a mutable array
  2. Use yield return/yield break to compute only what is needed
  3. Use a Predicate<T> instead of a weakly typed search string for more powerful filtering
  4. Use native P/Invoke to get around the blocking access denied issues

Given these answers, let's look at my implementation of GetFiles.  My normal practice is to abstract IO methods into a class called FileSystem, which implements the IFileSystem interface.  This way, I can isolate any side-effecting code that needs to deal with the file system directly. 
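
The interface itself isn't shown here, so its exact members are an assumption; a minimal sketch might declare just the method we need:

// Sketch only: the real IFileSystem may declare more members than this.
public interface IFileSystem
{
    IEnumerable<string> GetFiles(
        string directory,
        SearchOption option,
        Predicate<string> pred);
}

With that contract in mind, here is the GetFiles implementation on the FileSystem class: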

public IEnumerable<string> GetFiles(
    string directory, 
    SearchOption option, 
    Predicate<string> pred)
{
    var findData = new NativeMethods.WIN32_FIND_DATA();
    using (var findHandle = NativeMethods.FindFirstFile(
        directory + @"\*", findData))
    {
        // An invalid handle (access denied, for example) means we simply skip
        // this directory rather than throwing and halting the whole walk.
        if (!findHandle.IsInvalid)
        {
            do
            {
                if ((findData.dwFileAttributes & 
                    FileAttributes.Directory) != 0)
                {
                    // Skip the "." and ".." pseudo-directories.
                    if (findData.cFileName != "." && 
                        findData.cFileName != "..")
                    {
                        if (option == SearchOption.AllDirectories)
                        {
                            var subdirectory = Path.Combine(
                                directory, 
                                findData.cFileName);
                            foreach (var file in GetFiles(subdirectory, option, pred))
                                yield return file;
                        } // if - option
                    }
                } // if - findData.dwFileAttributes
                else
                {
                    var filePath = Path.Combine(directory, findData.cFileName);
                    if (pred(filePath))
                        yield return filePath;
                }
            } while (NativeMethods.FindNextFile(findHandle, findData));
        }
    }
}
 

This implementation could still use some polishing in terms of error checking, and some performance improvements such as accepting an Expression over a Predicate<T> (a sketch of that follows below).  But as soon as I started using this approach, the performance of my application improved greatly, as I'm only computing the values as I need them and not beforehand.  This solution also allows for much better filtering.
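
One way such an overload might look, purely as a sketch and not part of the implementation above, is to accept an Expression<Func<string, bool>> and compile it to a delegate before enumerating:

// Sketch of a possible overload (requires System.Linq.Expressions); it simply
// compiles the expression once and defers to the Predicate<string> version above.
public IEnumerable<string> GetFiles(
    string directory,
    SearchOption option,
    Expression<Func<string, bool>> filter)
{
    var compiled = filter.Compile();
    return GetFiles(directory, option, path => compiled(path));
}

With that noted, let's look at a few examples of using the Predicate<T> version.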

// Get zip files only in C:\Work
var fileSystem = new FileSystem();
var files = fileSystem.GetFiles(@"C:\Work", SearchOption.TopDirectoryOnly,
    path => string.Compare(Path.GetExtension(path), ".zip",
                StringComparison.OrdinalIgnoreCase) == 0);

foreach (var file in files) Console.WriteLine(file);
 
// Get hidden files in C:\Program Files       
var hiddenFiles = fileSystem.GetFiles(@"C:\Program Files", SearchOption.AllDirectories,
    path => (File.GetAttributes(path) & FileAttributes.Hidden) != 0);

foreach (var file in hiddenFiles) Console.WriteLine(file);
 

As you can see from the above code, it's much more functional: I'm using higher-order functions in the form of Predicate<T>, relying on lazy evaluation, and not returning a mutable array.  But how is the performance?  Since I'm only taking what I need, when I need it, nothing is computed until I actually call MoveNext on the IEnumerator<T>.  Even when forced to iterate the entire result and compared against Directory.GetFiles, the two are close, with the edge going to this approach.
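
To put the deferred execution in concrete terms, something like the following stops walking the tree as soon as the first match is produced (again using the FileSystem class from above):

// Enumeration stops as soon as one .zip file has been yielded.
var fileSystem = new FileSystem();
var firstZip = fileSystem
    .GetFiles(@"C:\Work", SearchOption.AllDirectories,
        path => Path.GetExtension(path).Equals(".zip", StringComparison.OrdinalIgnoreCase))
    .FirstOrDefault();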

There are alternatives to this approach, such as using IQueryable<T> instead of a Predicate<T>.  Such an example might look like this:

var fileSystem = new FileSystem();
// f => true includes every file; the filtering here happens in the where clause instead.
var results = from file in fileSystem.GetFiles(@"D:\Work", SearchOption.AllDirectories, f => true).AsQueryable()
              where (File.GetAttributes(file) & FileAttributes.Hidden) != 0
              select file;
foreach (var result in results) Console.WriteLine(result);

Both approaches work well, and since we have lazy evaluation either way, it's pretty flexible.  But what are your thoughts?

 

Wrapping It Up

This is just one of many improvements that can be made to the BCL.  Had generics and lazy evaluation been around since the beginning of the .NET Framework, I'd guarantee that many of the design decisions in the libraries would look quite different.  Instead, there are several friction points we run into that we must mitigate.  Solving these problems with functional programming techniques greatly improves both the experience and the API design.  Understanding functional programming and the concepts behind it can improve your APIs going forward, and I hope this was proof that you can in fact do that.


