Reality Check: Chunked operations take a lot of code and are hard to get right (a thread-safe chunked file writer)
I've spent a good portion of the day writing the facilities needed for a multi-threaded chunked download off the web. I often run into situations where I want to grab a file, but I'm on Wi-Fi or the remote site is throttling the bandwidth. The best way to overcome both of these problems is to open multiple connections and have each one request its own byte range using HTTP 1.1 range requests.
Deciding how to overcome the bandwidth problem is the easy part. Implementing the code required to make this a reality is a different story. Basically what you have is a nasty situation where 10-20 threads are all trying to write to a file at the same time, all at different locations. The IO classes handle this scenario pretty well; they just aren't optimized for it. They are optimized for file access from a single thread, so we'll have to handle all of the queuing ourselves. I'm going to go backwards through most of the code and start with the Chunk. This is the logical unit passed from the download thread to the file writer. We store a buffer to write out, an offset and length into the buffer, and a target location in the resulting file.
struct Chunk {
    public byte[] ChunkData;
    public int Offset;
    public int Length;
    public long FileOffset;

    public Chunk( byte[] chunkData, int offset, int length, long fileOffset ) {
        this.ChunkData = chunkData;
        this.Offset = offset;
        this.Length = length;
        this.FileOffset = fileOffset;
    }
}
Once we actually have these prepared, we just hand them off to the file writer class and use some locking to access a queue. This is a Whidbey Queue<T>, so we are strongly typed and optimized. I'm making use of a locking object instead of locking the collection itself. Lately the MS guys have been kind of up in the air regarding how the lock statement should be used and what the best usage is going to be. For that reason, I used the tried and true method of having a special object whose job in life is to be the target of a lock statement. I actually do this a lot, since we have lots of methods to protect.
public Queue<Chunk> chunks = new Queue<Chunk>( 25 );
public object chunkLock = new object();

public void QueueFileChunk( Chunk chunk ) {
    lock ( chunkLock ) {
        chunks.Enqueue( chunk );
    }
}
The big work goes on in the Start method. We need to tell the file writer how long the final file is going to be so we can set the length up front. If the final file will be 8 megs, then we need to allocate the 8 megs before the first chunk arrives. If we couldn't do this, the process wouldn't work at all; we'd instead have to write out chunks as multiple files and combine them later. That just sucks.
public void Start( long length ) {
    lock ( writeLock ) {
        if ( writeStarted ) { return; }
        writeStarted = true;
    }

    this.finalLength = length;
    this.currentComplete = 0;

    try {
        fileStream = new FileStream( output, FileMode.CreateNew );
        fileStream.SetLength( this.finalLength );
    } catch {
        if ( fileStream != null ) {
            fileStream.Close();
            fileStream = null;
        }
        throw;
    }

    ThreadPool.QueueUserWorkItem( new WaitCallback( this.WriteFileChunks ) );
}
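To make the pre-sizing idea concrete, here is a minimal standalone sketch (not part of the writer class) that sets a file's length first and then fills it in out of order, the way chunks from different download threads would arrive. The PreallocateDemo class name and the 8-byte file size are made up for illustration.

```csharp
using System;
using System.IO;

// Hypothetical demo class, not part of the file writer above.
static class PreallocateDemo {
    // Sizes the file up front, then writes two blocks in reverse order.
    public static byte[] WriteOutOfOrder( string path ) {
        using ( FileStream fs = new FileStream( path, FileMode.Create ) ) {
            fs.SetLength( 8 );                              // reserve the final size

            fs.Position = 4;                                // second half arrives first
            fs.Write( new byte[] { 5, 6, 7, 8 }, 0, 4 );

            fs.Position = 0;                                // first half arrives later
            fs.Write( new byte[] { 1, 2, 3, 4 }, 0, 4 );
        }
        // Read the file back so the caller can inspect the result.
        return File.ReadAllBytes( path );
    }
}
```

Calling PreallocateDemo.WriteOutOfOrder with a scratch file path should hand back the bytes 1 through 8 in order, even though the halves were written backwards.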
The method should return right away, so we are using the thread pool to handle the queue checking code. If anything goes wrong making the file, we'll toss an exception back out. If anything ruins the start process, this object will be useless, because we set a bool flag to prevent the Start method from being called multiple times. This is just my attempt at not designing the proper API and code-path set to handle error conditions ;-)
WriteFileChunks is another lesson in thread-safety, and we use a couple more of those lock variables. I wound up with a cancel lock for terminating the file writing process (not used), the chunk lock, which we re-use for accessing our queue, and a progress lock so we always have a count of how much of the file has been written.
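private void WriteFileChunks( object state ) {
    while ( currentComplete < finalLength ) {
        // Bail out early if a cancel was requested.
        lock ( cancelLock ) {
            if ( cancel ) {
                break;
            }
        }

        // Pull the next chunk, if any, while holding the queue lock.
        Nullable<Chunk> chunk = null;
        lock ( chunkLock ) {
            if ( chunks.Count > 0 ) {
                chunk = chunks.Dequeue();
            }
        }

        if ( chunk.HasValue ) {
            // Only this thread touches fileStream, so the write itself needs no lock.
            fileStream.Position = chunk.Value.FileOffset;
            fileStream.Write( chunk.Value.ChunkData, chunk.Value.Offset, chunk.Value.Length );
            lock ( progressLock ) {
                this.currentComplete += chunk.Value.Length;
            }
        } else {
            // Nothing queued yet: yield briefly instead of spinning at full speed.
            Thread.Sleep( 1 );
        }
    }

    fileStream.Flush();
    fileStream.Close();
}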
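The loop above checks the queue over and over until something shows up. One alternative worth considering is to let the writer thread block until a producer signals it, using Monitor.Wait and Monitor.Pulse on the lock object. The sketch below is only an illustration of that idea; BlockingChunkQueue, Close, and TryDequeue are hypothetical names and none of this is in the class as I wrote it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper, not part of the original file writer.
class BlockingChunkQueue<T> {
    private readonly Queue<T> items = new Queue<T>();
    private readonly object sync = new object();
    private bool closed;

    public void Enqueue( T item ) {
        lock ( sync ) {
            items.Enqueue( item );
            Monitor.Pulse( sync );          // wake one waiting consumer
        }
    }

    // Signals that no more items will arrive (for example, on cancel).
    public void Close() {
        lock ( sync ) {
            closed = true;
            Monitor.PulseAll( sync );       // wake every waiting consumer
        }
    }

    // Blocks until an item is available.
    // Returns false once the queue is closed and fully drained.
    public bool TryDequeue( out T item ) {
        lock ( sync ) {
            while ( items.Count == 0 && !closed ) {
                Monitor.Wait( sync );       // releases the lock while waiting
            }
            if ( items.Count > 0 ) {
                item = items.Dequeue();
                return true;
            }
            item = default( T );
            return false;
        }
    }
}
```

With something like this, the writer thread would sit in a while ( queue.TryDequeue( out chunk ) ) loop, and a cancel would just be a call to Close. The trade-off is one more class to get right.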
Last thing we'll do is adorn the class with some properties so we can get that progress data I was talking about.
public long TotalBytes {
    get {
        return this.finalLength;
    }
}

public long WrittenBytes {
    get {
        lock ( progressLock ) {
            return this.currentComplete;
        }
    }
}
Note that the code above doesn't make a complete class, nor would I even want to put this into the hands of anyone else until I've worked through a proper API design. It also doesn't help that you need another 100 lines of code or so in order to implement the HTTP range mechanism and bind it to this file chunking class.
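As a taste of what the range side involves, here is a rough sketch of just the bookkeeping: splitting a known file length into inclusive byte ranges, one per connection, and formatting each as a Range header value. ByteRange and RangePlanner are hypothetical names, and the actual HTTP requests are left out entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; not part of the file writer above.
struct ByteRange {
    public long Start;   // first byte, inclusive
    public long End;     // last byte, inclusive (HTTP byte ranges are inclusive)

    public ByteRange( long start, long end ) {
        this.Start = start;
        this.End = end;
    }

    public long Length {
        get { return End - Start + 1; }
    }

    // Formats the range as a Range request header value.
    public string ToHeaderValue() {
        return "bytes=" + Start + "-" + End;
    }
}

static class RangePlanner {
    // Splits totalLength bytes into at most 'parts' contiguous ranges.
    public static List<ByteRange> Split( long totalLength, int parts ) {
        if ( totalLength <= 0 ) throw new ArgumentOutOfRangeException( "totalLength" );
        if ( parts <= 0 ) throw new ArgumentOutOfRangeException( "parts" );

        List<ByteRange> ranges = new List<ByteRange>();
        long size = ( totalLength + parts - 1 ) / parts;    // ceiling division
        for ( long start = 0; start < totalLength; start += size ) {
            long end = Math.Min( start + size, totalLength ) - 1;
            ranges.Add( new ByteRange( start, end ) );
        }
        return ranges;
    }
}
```

Each range's Start doubles as the FileOffset for the first Chunk that connection produces, which is what ties the download threads back to the file writer.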