March 2006 - Posts

If you've followed my postings on associative/relationalistic approaches to information representation and were interested in implementations of the Pile concepts (see [1] and [2] for words from the horse's mouth), you might also have stumbled across a C implementation of a so-called Pile Engine at the SourceForge site of the project. (You can find my implementation there, too, but it's in C#.)

Now, the C implementation has boggled quite a few minds for quite some time. It was fast and used little memory, but it was very, very hard to understand for a couple of reasons. Unfortunately this has led to some misunderstandings of the basic Pile concept. To some it seemed as if the concept/theory was somehow tightly coupled to this specific C implementation. It seemed as if "something magic" was happening in this Pile Engine which was essential and needed to be understood first, or even copied, by future Pile Engine implementations.

Well, to make a long story short: this is not the case! And to prove it, I wrote up my findings in a paper you can read here.

I undertook the effort of explaining the original C-based Pile Engine (OPE) in order to finally and clearly separate theory from implementation. Any research on the potential benefits of the Pile concept needs to be independent of a particular Pile implementation, even if this implementation stems from the Pile originators.

After digging into the unfortunately not well documented intricacies of the OPE I can now say: It's interesting and still holds some records in terms of performance and memory usage. Well done, Erez and Miriam! But there's nothing in it that couldn't or shouldn't be done differently (within a certain range) in order to get a working Pile Engine.

And I'm relieved to say: The Pile concept nevertheless is very interesting and needs much more research. View Pile relations as a different kind of byte, i.e. the smallest unit of information processing, and take a Pile Engine as an API on the same level of abstraction as System.IO in the .NET Framework.

Then: How would working with and "thinking in" relations change how we do information processing? Think of the implications... What would it mean to easily connect any (!) piece of information with any (!) other piece on any (!) level of abstraction? And such connections would be the only thing to worry about. No dichotomy of data and connections anymore. But I'm getting carried away... :-)

Links

[1] Pile Systems - the commercial side of Pile, www.pilesys.com

[2] Pileworks - the open source Pile forum, www.pileworks.org

 

Posted by ralfw | 4 comment(s)

In my previous posting I presented a way to avoid unnecessarily copying data when reading/writing binary files by using low-level C CRT file functions like fread() instead of System.IO. This works just fine, but it requires you to ship an unmanaged C DLL, if not to do some C programming yourself.

Now, thanks to Christof Sprenger at Microsoft Germany, this hurdle has been removed. He showed me how to use the existing low-level file handle of a stream with the kernel32.dll ReadFile()/WriteFile() functions. No more C programming is necessary, although unsafe code is still needed. But that's ok, I'd say.

Here's an example of how simple your code becomes if you use the new BinaryFileStream class:

struct ID3v1Tag
{
    ...
};

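ID3v1Tag tag;

BinaryFileStream bfs = new BinaryFileStream("track01.mp3", System.IO.FileMode.Open);

unsafe
{
    // The ID3v1 tag occupies the last 128 bytes of the file.
    bfs.Seek(-128, System.IO.SeekOrigin.End);
    bfs.Read<ID3v1Tag>(&tag);

    tag.Title = ...;
    ...

    // Reading moved the file pointer to the end of the file,
    // so seek back before overwriting the tag in place.
    bfs.Seek(-128, System.IO.SeekOrigin.End);
    bfs.Write<ID3v1Tag>(&tag); // taking &tag requires an unsafe context, too
}

bfs.Close();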

And here's the BinaryFileStream itself:

using System.Runtime.InteropServices;

namespace System.IO
{
    public unsafe class BinaryFileStream : FileStream
    {
        #region CTORs

        public BinaryFileStream(string path, FileMode mode)
            : base(path, mode)
        {}

        public BinaryFileStream(string path, FileMode mode, FileAccess access)
            : base(path, mode, access)
        {}

        public BinaryFileStream(string path)
            : base(path, FileMode.Open)
        { }

        #endregion


        public int Read<StructType>(void* buffer) where StructType : struct
        {
            return Read(buffer, Marshal.SizeOf(typeof(StructType)));
        }


        public void Write<StructType>(void* buffer) where StructType : struct
        {
            Write(buffer, Marshal.SizeOf(typeof(StructType)));
        }


        #region Low-level file access

        protected int Read(void* buffer, int count)
        {
            int n = 0;

            // IntPtr.Zero for lpOverlapped: the handle is used synchronously.
            if (!ReadFile(this.SafeFileHandle, buffer, count, out n, IntPtr.Zero))
                throw new IOException(string.Format("Error {0} reading from file!", Marshal.GetLastWin32Error()));

            return n;
        }

        protected void Write(void* buffer, int count)
        {
            int n = 0;

            if (!WriteFile(this.SafeFileHandle, buffer, count, out n, IntPtr.Zero))
                throw new IOException(string.Format("Error {0} writing to file!", Marshal.GetLastWin32Error()));
        }


        // lpOverlapped is a pointer, so it is declared as IntPtr
        // to remain correct on 64-bit Windows as well.
        [DllImport("kernel32.dll", SetLastError = true)]
        private static extern unsafe bool ReadFile(
            Microsoft.Win32.SafeHandles.SafeFileHandle hFile,
            void* pBuffer,
            int numberOfBytesToRead,
            out int numberOfBytesRead,
            IntPtr overlapped
            );

        [DllImport("kernel32.dll", SetLastError = true)]
        private static extern unsafe bool WriteFile(
            Microsoft.Win32.SafeHandles.SafeFileHandle handle,
            void* bytes,
            int numBytesToWrite,
            out int numBytesWritten,
            IntPtr overlapped
            );

        #endregion
    }
}

Copy and enjoy!

Quite a bit has been written about reading structured binary data from files or writing it to them (see [1,2,3]). [1], for example, compares three different approaches. Unfortunately none is as straightforward as C/C++ code would be. Here's how you could read the ID3v1 tag from an MP3 file in C/C++:

struct ID3v1Tag
{
   char tag[3]; // == "TAG" 
   char title[30]; 
   ...
};

ID3v1Tag t;

FILE *f = fopen("mysong.mp3", "rb"); // binary mode, so no newline translation occurs

fseek(f, -128, SEEK_END);

fread(&t, 1, 128, f);

printf("%.30s\n", t.title);

fclose(f);
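For those who want to try this out, here is a slightly fuller sketch in plain C. The field list is the standard 128-byte ID3v1 layout; the function names read_id3v1() and write_sample_file() are made up for this illustration, and the second one only exists to produce a test file.

```c
#include <stdio.h>
#include <string.h>

/* ID3v1 layout: 128 bytes at the very end of the file. All fields are
   char-sized, so the compiler inserts no padding between them. */
struct ID3v1Tag {
    char tag[3];          /* always "TAG" */
    char title[30];
    char artist[30];
    char album[30];
    char year[4];
    char comment[30];
    unsigned char genre;
};

/* Reads the last 128 bytes of `path` straight into *out.
   Returns 1 if a tag was found, 0 otherwise. */
int read_id3v1(const char *path, struct ID3v1Tag *out)
{
    FILE *f = fopen(path, "rb");
    int ok = 0;

    if (f == NULL)
        return 0;

    if (fseek(f, -(long)sizeof *out, SEEK_END) == 0 &&
        fread(out, 1, sizeof *out, f) == sizeof *out &&
        memcmp(out->tag, "TAG", 3) == 0)
        ok = 1;

    fclose(f);
    return ok;
}

/* Hypothetical helper: writes some dummy bytes followed by a tag,
   so the reader above has something to work on. */
int write_sample_file(const char *path, const char *title)
{
    struct ID3v1Tag t;
    size_t len = strlen(title);
    int ok;
    FILE *f = fopen(path, "wb");

    if (f == NULL)
        return 0;

    memset(&t, 0, sizeof t);
    memcpy(t.tag, "TAG", 3);
    if (len > sizeof t.title)
        len = sizeof t.title;          /* titles are cut off at 30 bytes */
    memcpy(t.title, title, len);

    fputs("not really audio data", f);
    ok = fwrite(&t, 1, sizeof t, f) == sizeof t;
    fclose(f);
    return ok;
}
```

Calling write_sample_file("test.mp3", "My Song") followed by read_id3v1("test.mp3", &t) fills t without any intermediate buffer. The title is not necessarily zero-terminated, hence the "%.30s" in the printf() above.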

Now, if you wanted to accomplish the same with C#... it would not look that easy anymore. The reason: you cannot read data from a file (stream) directly into a struct. A stream always requires a byte array as the target for read operations, and the ReadBytes() method of a BinaryReader likewise returns a byte array. In either case the data read into the byte array needs to be copied into the target struct.

[1] uses Marshal.PtrToStructure() to do this, and [3] offers a much more elegant solution using an unsafe assignment like this:

[StructLayout(LayoutKind.Sequential, Pack=1)]
unsafe struct ID3v1Tag
{
    ...

    public ID3v1Tag(byte[] data)
    {
       fixed (byte* pData = data)
       {
           this = *(ID3v1Tag*)pData;
       }
    }
}


Alternatively you could read data from an input stream in little chunks using a BinaryReader, which would mean you deserialize the data into each field by hand. This avoids the extra copy of data, but requires much effort on your side. You're trading performance for lines of code.

That's what can be said about reading (and writing) binary data using C# (or managed code in general).

However, due to a customer engagement I recently started thinking about this. The customer needs to port C++ code which interacts massively with binary files to C#. The approaches found in the literature, though, are too slow for him. The need for an extra data copy really hurts the application's performance. So he kept essential parts of the code in C++ to benefit from the language's ease of use when accessing binary data.

I felt challenged by this problem. And here's my solution: Easy reading/writing of binary structured data using C# 2.0 - without the need for an extra data copy. Look at the following code for reading the ID3v1 tag of an MP3 file:

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public unsafe struct ID3v1Tag
{
 private fixed sbyte tag[3];
 private fixed sbyte title[30];
 ...
}

using (System.IO.BinaryFile fmp3 = new System.IO.BinaryFile("myfile.mp3", System.IO.FileMode.Open))
{
 ID3v1Tag t;
 
 unsafe
 {
  fmp3.Seek(-128, System.IO.SeekOrigin.End);
  fmp3.ReadStruct<ID3v1Tag>(&t);
 }

 if (t.Tag == "TAG")
 {
  Console.WriteLine("title: " + t.Title); ...
 }
}


I'd say it's as easy to read/write as the C++ equivalent above. And it's just generic functions that get called. And no extra copies of data are needed. The ID3v1 tag data is read directly into the ID3v1Tag struct passed to the ReadStruct() method.

How is this done?

Well, I removed the premise that underlies the usual literature on this topic: I don't use System.IO to access the file, but the old CRT fxxx() functions. The above BinaryFile class encapsulates the calls to the following C DLL functions:

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private static extern int FileOpen(string filename, string mode);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private static extern void FileClose(int hStream);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private unsafe static extern bool FileReadBuffer(int hStream, void* buffer, short bufferLen);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private unsafe static extern bool FileWriteBuffer(int hStream, void* buffer, short bufferLen);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private unsafe static extern bool FileSeek(int hStream, int offset, short origin);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private unsafe static extern bool FileGetPos(int hStream, out int pos);

[System.Runtime.InteropServices.DllImport("CRTFileIO.dll")]
private unsafe static extern bool FileFlush(int hStream);

I just wrote a small unmanaged DLL wrapper around the basic stdio C functions like fopen(), fread() etc. That's all the magic there is. Look at my C function for reading data from a file:

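extern "C" DLLEXPORT short __stdcall FileReadBuffer(FILE *stream, void *buffer, int bufferLen)
{
 // fread() copies the bytes straight into the caller's buffer;
 // report success only if all requested bytes arrived.
 size_t n = fread(buffer, 1, bufferLen, stream);
 return n == (size_t)bufferLen;
}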
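The other wrappers could look roughly like the following sketch. This is not the code from the download: the bodies are my assumption of what the DllImport declarations above imply, written in plain C. The stream handle is passed as a FILE* (as in FileReadBuffer), and the DLLEXPORT and CALLCONV macros are hypothetical stand-ins, defined away on non-Windows compilers so the sketch builds anywhere.

```c
#include <stdio.h>

/* Hypothetical decoration macros: export + stdcall on Windows, nothing elsewhere. */
#if defined(_WIN32)
#define DLLEXPORT __declspec(dllexport)
#define CALLCONV __stdcall
#else
#define DLLEXPORT
#define CALLCONV
#endif

/* Opens a file with an fopen() mode string such as "rb" or "r+b". */
DLLEXPORT FILE * CALLCONV FileOpen(const char *filename, const char *mode)
{
    return fopen(filename, mode);
}

DLLEXPORT void CALLCONV FileClose(FILE *stream)
{
    if (stream != NULL)
        fclose(stream);
}

/* Writes bufferLen bytes directly from the caller's buffer. */
DLLEXPORT short CALLCONV FileWriteBuffer(FILE *stream, const void *buffer, int bufferLen)
{
    return fwrite(buffer, 1, (size_t)bufferLen, stream) == (size_t)bufferLen;
}

/* origin is passed through to fseek(); this assumes the caller uses the
   SEEK_SET / SEEK_CUR / SEEK_END values of the C runtime. */
DLLEXPORT short CALLCONV FileSeek(FILE *stream, int offset, short origin)
{
    return fseek(stream, offset, origin) == 0;
}

/* Reports the current position; fails for positions ftell() cannot report. */
DLLEXPORT short CALLCONV FileGetPos(FILE *stream, int *pos)
{
    long p = ftell(stream);

    if (p < 0)
        return 0;
    *pos = (int)p;
    return 1;
}

DLLEXPORT short CALLCONV FileFlush(FILE *stream)
{
    return fflush(stream) == 0;
}
```

Each function returns non-zero on success, so the managed side can treat the result as a simple success flag. Note that a FILE* only fits into a C# int in a 32-bit process.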

This function is called by a method of a wrapper class to make it easier for application code to work with binary files. BinaryFile hides the CRT file handle and looks much like a FileStream (that's also the reason why I put BinaryFile into the System.IO namespace):

public unsafe bool ReadStruct<StructType>(void *buffer) where StructType : struct
{
  return Read(buffer, (short)System.Runtime.InteropServices.Marshal.SizeOf(typeof(StructType)));
}

public unsafe bool Read(void* buffer, short bufferLen)
{
 ...
 return FileReadBuffer(hFile, buffer, bufferLen);
}

You just need to pass this Read() method the address of the target struct that is to receive the data from the file, plus the number of bytes to read. That's it. fread() will put the data right into the C# struct. No extra byte[], no explicit deserialization of fields. ReadStruct() even determines the number of bytes for you. You just need to be willing to use unsafe code:

unsafe
{
 fmp3.ReadStruct<MyStruct>(&myStructVar);
}

I'd say it cannot become much easier or faster than this when reading from binary files.

If you'd like to give this approach a try, you can download sources here.

In order to use the BinaryFile class just add a reference to CRTFileIO.Import.dll to your C# project and make sure the C wrapper CRTFileIO.dll gets copied to the same directory as CRTFileIO.Import.dll.

Enjoy!

Resources

[1] Anthony Baraff: Fast Binary File Reading with C#, http://www.codeproject.com/csharp/fastbinaryfileinput.asp

[2] Robert L. Bogue: Read binary files more efficiently using C#, http://www.builderau.com.au/architect/webservices/0,39024590,20277904,00.htm

[3] Eric Gunnerson: Unsafe and reading from files, http://blogs.msdn.com/ericgu/archive/2004/04/13/112297.aspx
