Plug-In Framework (Foreword): Defining the end goal

Abstract:
I've been working really hard to complete the Plug-In Framework from front to back before posting additional articles.  Part 1 got a huge readership at over 130 views (thanks guys), so I wanted to make sure I got the rest of the series perfect before posting it.  However, in working through the various issues I came to realize that not all of my goals were going to be easily accomplished, and that some of them were outside the realm of what the .NET Framework is capable of without some advanced extensions.  I am going to use this document to do three things: define existing plug-in infrastructures and what they offer, along with how they offer it (this is primarily a discussion of the .NET Terrarium); point to recent blog postings, with some wrap-ups based on feedback from Brian Grunkmeyer and his pointers to articles by Chris Brumme (thanks guys); and finally list the features that I am going to require from the finalized framework.

This is going to be a fun experience.  While writing this series I'll be working with the Terrarium team on a final source release of the Terrarium, so I get to be reminded of all of the extensive retro-fitting we had to do in order to make things work back in V1.  I may also get some additional insight as to how much easier things are going to be in Whidbey.  Some other features of the framework are probably going to be provided by MS Research and some of the libraries they are releasing lately.  All in all I have to bring together a lot of different framework features and lack of features to truly nail the end goal.

TOC:

    1. Current Infrastructures - This is primarily a view of the Terrarium, the most advanced infrastructure to date
    2. Questions of Plug-in Security - I'm lumping all of the blog entries on plug-in scalability into this one section.  The .NET Framework is well suited for plug-in architectures, but maybe not as well suited as something like Python or a custom scripting language.
    3. Levels of Plug-in Security - Different plug-in systems require different types of security.  There are varying levels of trust that can be associated with the plug-ins for purposes of general security.  There are also levels of internal trust for purposes of application specific trust.  Further there are varying levels of what I call arbitrary trust.  Arbitrary trust is contained in the execution of arbitrary code that could contain any number of hacks or make use of product features that aren't inherently security holes, but instead provide access to some form of application hole.
    4. Defining the Terrarium Model - The Terrarium is a model in and of itself.  I'd like to define a model where the Terrarium can operate with 100% true security and 100% playability security.
    5. Defining the World Builder Model - This model defines a concept world where people log in to a central server and can write code to control the behavior of objects in the world.  This is slightly more secure than even the Terrarium since access to run-time services and libraries needs to be limited.
    6. Defining the Pin Point Model - The pin-point model defines a concept application that allows time slicing, arbitrary code execution, complete control over memory constraints, complete control over member access, and complete control over the types a plug-in has access to.
    7. Defining the Game Player Model - This model is really fun and describes a model where AI can be programmed and inserted into a game as a player.  Various access modifiers have to be put into play to make sure the process of playing against AI is secure and to ensure the AI can't access arbitrary engine code.
    8. Defining Programmer-Control Models - This assumes the programmer of the application has full control over the plug-ins.  They sign them and distribute them.  This allows me to step down the various restrictions and show how easy it is to write controlled code architectures versus arbitrary code infrastructures.
    9. Conclusions - I think a wrap-up defining how each of the models interacts is necessary.  There is a lot of overlap in the various models and a lot of features where simple definitions may have more impact than my verbose writing style.  A user should be able to use just the conclusion to find out how hard/easy it will be to implement their own plug-in architecture by picking the features they are going to add to their feature set.
    10. Towards the Future - I'm going to add some additional architecture goals for future extensibility.  The main extensibility here will define Longhorn features and how to interoperate with them.  Lately plug-ins have focused on skinning and changing the behavior of applications.  This can often be to the detriment of the user.  Normally plug-in architectures leave it up to the user to provide the security awareness, rather than provide some innate security.

1.  Current Infrastructures
I've seen quite a few plug-in articles on the web.  Most of them tend to focus on the esoteric goals of getting assemblies loaded dynamically, interacting dynamically with types, and then making use of interfaces or some other form of interaction to communicate with the plug-in.  What most infrastructures leave out is the great deal of security or hosting code required to ensure the plug-ins can't take the system down.  Traditionally plug-in systems were made to be used by the development team of the original product.  This meant that all plug-ins were well coded and behaved well with the system.  In other words all plug-ins were certified.

This brings us to the big lump that defines almost all current infrastructures that use .NET, the Certified Plug-In Infrastructure.  This infrastructure assumes the user will not try to hack or hang the system, gives plug-ins unlimited time to execute, and makes a whole slew of other very gross allowances.  There is an even more gross plug-in infrastructure, I think: the Internal Plug-In Infrastructure, which makes various assumptions about plug-ins, like blindly assuming that types exist, that those types implement the interfaces they are about to be cast to, etc...  In other words, if the assembly with the plug-in isn't well formed, it generally crashes the application.
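
The difference between blindly assuming and checking is small in code.  Here is a minimal sketch, written in Python rather than .NET purely to show the shape of the checks.  The names Plugin, PluginLoadError, and load_plugin are invented for illustration, and in-memory modules stand in for plug-in assemblies.

```python
import types


class PluginLoadError(Exception):
    """Raised when a plug-in module is not well formed."""


class Plugin:
    """Hypothetical contract every plug-in must implement."""

    def run(self) -> str:
        raise NotImplementedError


def load_plugin(module: types.ModuleType, type_name: str) -> Plugin:
    # Do not assume the named type exists in the module.
    candidate = getattr(module, type_name, None)
    if not isinstance(candidate, type):
        raise PluginLoadError(f"{module.__name__} has no type named {type_name}")
    # Do not assume the type implements the expected contract.
    if not issubclass(candidate, Plugin):
        raise PluginLoadError(f"{type_name} does not implement Plugin")
    # Construction can fail too, so report that as a load failure.
    try:
        return candidate()
    except Exception as exc:
        raise PluginLoadError(f"{type_name} could not be constructed") from exc


class Greeter(Plugin):
    def run(self) -> str:
        return "hello"


# Two in-memory modules standing in for plug-in assemblies.
good = types.ModuleType("good_plugin")
good.Greeter = Greeter

bad = types.ModuleType("bad_plugin")
bad.Greeter = object  # a type, but not a Plugin
```

With these checks a malformed plug-in is rejected with a clear error at load time instead of crashing the host later on.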

The next plug-in system is what I call the Partial Trust Infrastructure.  This becomes most apparent when you run a .NET component within a web page.  This is nothing more than loading the component as a plug-in to Internet Explorer and running under the constraints of the Internet policy which disallows a great deal of access to much of the run-time that might be considered not so secure.  Up to this point, only the CAS model is being used, or the innate security features of the .NET Framework.

At some point, you start to have to code in your own security constraints.  The largest I think is secured loading, unloading, and time slicing.  I call this the Shared Services Infrastructure.  Basically, it identifies a series of services that must be shared by all plug-ins, and no plug-in can take full control over these services.  This could mean that access to web services is shared using a round-robin caching mechanism, that access to internal program data is shared, and/or that the CPU becomes a resource and no plug-in should be able to take full control over the CPU as a resource.  This is where you start to write a lot of code to get some major security benefits.  I consider controlled resource access a huge security benefit myself.  At this level the .NET Framework still provides a great deal of services, but they start to break down.  See section 2 on various blog entries to see the hurdles of time slicing using the CLR threading model.

With the SSI in place, you start to wonder what other attacks a plug-in might be able to perform.  Instances can still communicate using statics, they can still control their own serialization, they can still use a great deal of memory that you can't really predict or control, and a whole slew of other nasties that aren't on the CLR team's list of true nasties.  The .NET Terrarium had to fend off some of these nasties through a great deal of hosting code.  This next level of hosting is going to be called the Limited Access Infrastructure.  Code at this level becomes very hard to write, since you need to write your own time-slicing code and implement an IL verifier (most of the issues at this level can't be overcome without one), and even then you still can't control memory.  I'll leave memory control to the final level of infrastructure, primarily because the Terrarium is an extremely secure application, but never did focus on memory control.  I'll talk a lot more about the LAI in later sections.

The final tier of plug-ins falls into the Full Control Infrastructure.  At this level you can safely timeslice, load arbitrary code, ensure that code only calls very specific methods or types of methods, control how much memory it is capable of using, and a whole slew of other options.  I like to define this level using the following example:

I've written a highly specialized social interaction system where users can log in and participate in a game world that has been entirely defined by other users.  For speed purposes, much of the interaction with users happens on the client side, so arbitrary code is being downloaded on the user's system.  Any user can write code within the system, submit it, and have other users run it.  In this manner, imagine a city system where all of the buildings, doors, desks, any object you can think of has been coded and placed into the game world by an arbitrary user.  Once this item is placed into the game world and begins running, it has to be timesliced (it can't hang the machine), and it can only access very specific methods, which rules out even some methods that don't have any security restrictions (basically an inclusion list of what it can call, rather than a very insecure exclusion list).  The object can only consume X amount of memory, so as its memory consumption or memory graph tree grows, eventually it won't be able to allocate any more (through ICorProfiler interfacing, or by disallowing new'ing altogether).

How much of this is currently in place?  Well, the .NET Terrarium has an ILVerifier that it runs over code.  I think that the ILVerifier will play the largest role in the FCI tier because the .NET Framework security system really starts to break down when you want control at a very fine-grained level.  The ILVerifier planned for this series of articles will allow any number of IL rules for verifying assemblies, including rules that disable constructs normally considered safe, such as static members, static member access, access to unprotected resources like the System.Collections namespace, the usage of finally blocks which tend to make code non-terminable, etc...
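
To make the idea of "any number of IL rules" concrete, here is a rough sketch of a rule-driven verifier.  It is Python, not the Terrarium's ILVerifier, and it works over a simplified instruction listing that I'm assuming has already been pulled out of an assembly.  The Instruction type and the rule functions are invented for illustration; the opcode names (ldsfld, ldsflda, stsfld, endfinally, newobj) and the .cctor method name are real IL.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instruction:
    method: str        # containing method, e.g. "Think" or ".cctor"
    opcode: str        # IL opcode mnemonic
    operand: str = ""  # textual operand, if any


# A rule inspects the whole listing and returns a list of violations.
Rule = Callable[[List[Instruction]], List[str]]

STATIC_FIELD_OPCODES = {"ldsfld", "ldsflda", "stsfld"}
CALL_OPCODES = {"call", "callvirt", "newobj"}


def no_static_fields(code: List[Instruction]) -> List[str]:
    return [f"{i.method}: static field access ({i.opcode} {i.operand})"
            for i in code if i.opcode in STATIC_FIELD_OPCODES]


def no_static_initializers(code: List[Instruction]) -> List[str]:
    # .cctor is the name the CLR gives a type's static constructor.
    if any(i.method == ".cctor" for i in code):
        return [".cctor: static initializers are not allowed"]
    return []


def no_finally_blocks(code: List[Instruction]) -> List[str]:
    # A finally handler ends with endfinally, so its presence is a cheap signal.
    return [f"{i.method}: finally block" for i in code if i.opcode == "endfinally"]


def no_references_to(prefix: str) -> Rule:
    # Builds a rule that rejects calls into a namespace, e.g. System.Collections.
    def rule(code: List[Instruction]) -> List[str]:
        return [f"{i.method}: reference to {i.operand}"
                for i in code if i.opcode in CALL_OPCODES and prefix in i.operand]
    return rule


def verify(code: List[Instruction], rules: List[Rule]) -> List[str]:
    violations: List[str] = []
    for rule in rules:
        violations.extend(rule(code))
    return violations


rules = [no_static_fields, no_static_initializers, no_finally_blocks,
         no_references_to("System.Collections.")]

sample = [
    Instruction("Think", "ldsfld", "int32 Creature::sharedCounter"),
    Instruction("Think", "newobj",
                "instance void [mscorlib]System.Collections.ArrayList::.ctor()"),
    Instruction("Think", "endfinally"),
    Instruction("Think", "ret"),
]
```

Calling verify(sample, rules) reports the static field access, the collections reference, and the finally block.  A real verifier would read the metadata and exception handling tables rather than match on text, but the rule list is the part that matters: adding a new restriction means adding one more function.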

2.  Questions of Plug-in Security
[Coming Soon]

3.  Levels of Plug-in Security
[Coming Soon]

4.  Defining the Terrarium Model
I am most familiar with this model because I've worked on the Terrarium source code on and off for over two years.  The Terrarium model implies multiple levels of trust, since there is game engine logic that needs to be trusted, a trusted interface and base class library, and a bunch of untrusted code that is being transferred over the internet.  Let's attack the security in levels and talk about what the .NET Framework provides and what the Terrarium had to build in.

The first step for the Terrarium was to define an AppDomain policy.  This AppDomain policy was documented by Erik Olson on GDN for a while in the form of an article, but I can't seem to find it, so I'll briefly describe it here.  The basic premise is that all code is granted no permissions.  That includes disallowing execute permissions, which are required to even load an assembly into the AppDomain.  By disallowing all permissions you won't even be able to load malicious code into the process.

// All Code - Nothing
//   My Computer - Nothing
//      Terrarium Cache Directory/Private Assembly Cache - Execute
//      System.dll Code base - Full Trust (for XML serialization emitted assemblies)
//      Terrarium Key - Full Trust
//      MS Name - Full Trust
//      ECMA Name - Full Trust
//      Terrarium.Exe Code Dir - Execute Only (for Assembly.Load(byte[]) support)
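
If you haven't worked with code groups before, the tree above resolves roughly like this: a group's children are only consulted when the group itself matches, and the grant is the union of every matching group's permissions.  The sketch below models that in Python as a simplification, not the actual CAS implementation.  The evidence keys, the cache path, and the key names are all made-up placeholders, and only a few of the groups are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

Evidence = Dict[str, str]

NOTHING: FrozenSet[str] = frozenset()
EXECUTE: FrozenSet[str] = frozenset({"Execute"})
FULL_TRUST: FrozenSet[str] = frozenset({"FullTrust"})


@dataclass
class CodeGroup:
    name: str
    matches: Callable[[Evidence], bool]
    grants: FrozenSet[str]
    children: List["CodeGroup"] = field(default_factory=list)

    def resolve(self, evidence: Evidence) -> FrozenSet[str]:
        # A group that does not match contributes nothing, children included.
        if not self.matches(evidence):
            return frozenset()
        granted = set(self.grants)
        for child in self.children:
            granted |= child.resolve(evidence)
        return frozenset(granted)


CACHE_DIR = "C:/Terrarium/cache/"  # hypothetical path

policy = CodeGroup("All Code", lambda e: True, NOTHING, [
    CodeGroup("My Computer", lambda e: e.get("zone") == "MyComputer", NOTHING, [
        CodeGroup("Terrarium cache",
                  lambda e: e.get("url", "").startswith(CACHE_DIR), EXECUTE),
        CodeGroup("Terrarium key",
                  lambda e: e.get("key") == "terrarium-key", FULL_TRUST),
        CodeGroup("MS name",
                  lambda e: e.get("key") == "ms-key", FULL_TRUST),
    ]),
])

creature = {"zone": "MyComputer", "url": CACHE_DIR + "creature.dll"}
engine = {"zone": "MyComputer", "url": "C:/Terrarium/bin/engine.dll",
          "key": "terrarium-key"}
stranger = {"zone": "Internet", "url": "http://example.invalid/evil.dll"}
```

Under this model the creature resolves to Execute only, the engine to full trust, and the stranger to nothing at all, which is the deny-by-default behavior the policy is after.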

For the most part, this provides a great deal of protection against malicious code.  We add some additional features in code.  For instance, we disallow creatures that are signed with our key (darn developers trying to cheat eh ;-).  We check to make sure strong name checking is enabled and refuse to load if it isn't.  We even use permission asserts whenever we do file IO in order to prevent scenarios where we have bad path data.

The second major step for us was breaking our code into many assemblies.  When doing this you have to have some public interfaces through which to call.  This can be an area where creatures can take advantage of various game resources.  The .NET Framework jumped in here and fixed our problem before V1 shipped by adding the AllowPartiallyTrustedCallers attribute.  This broke us overnight, but the simple fix was to add the attribute to the OrganismBase assembly.  Now, creatures aren't allowed to call into any of our other strongly named assemblies because they only have partial trust.  This is a huge level of added protection that most people are unaware of.  The MS assemblies that partially trusted code is meant to call have the attribute, and instead rely on code permissions to disable access.  We could have done the same, but that would have been a pain in the butt.

That covers the extent of the built-in security we'll be granted.  The rest was hand-coded.  We'll start by examining assembly IL verification.  For the Terrarium we had to disallow a number of constructs including static member access (a cheating scenario by which creatures of the same type could communicate), static initializers (possibly hang the app inside one of these), finally blocks (hang the app again), specific unprotected resources that the CLR doesn't consider harmful, and a slew of other items.  I won't go into them all here, since I'll save that for later, but needless to say we had to check for lots of badness within the IL.

Once creatures are loaded the protection continues.  We have to make all of our state types immutable so they can't change the data we pass to them.  We have to control all of their access to member functions so they can't get data they aren't entitled to (note we failed at this slightly in some cases because our multiple assembly breakout was done late in the game).  The protection we provide here comes in two forms.  The first form is data protection and the second form is time protection.  We'll talk about these separately.
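
The immutable state idea is simple to show.  The sketch below uses Python's frozen dataclasses as a stand-in; CreatureState and apply_damage are invented names, not Terrarium types.  The engine hands out a snapshot, and any change produces a new snapshot rather than modifying the one the creature holds.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CreatureState:
    creature_id: str
    energy: int
    position: Tuple[int, int]


def apply_damage(state: CreatureState, amount: int) -> CreatureState:
    # Engine-side update: build a new snapshot, never touch the old one.
    return replace(state, energy=max(0, state.energy - amount))


state = CreatureState("c1", 10, (0, 0))
hurt = apply_damage(state, 3)
```

Assigning to state.energy raises dataclasses.FrozenInstanceError.  Note that Python only enforces this by convention (object.__setattr__ can get around it), so treat it as an illustration of the design and not of the level of enforcement a hostile plug-in requires.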

The data protection comes in because we have to transport creatures across the internet and serialize them to disk.  Allowing them to save their own data meant they could put whatever code they wanted to in their deserialization routines.  If done correctly they could deserialize protected types they shouldn't be allowed to have access to.  We provide an abstraction here called a wrapper that allows us to control the serialization process, store the serialized data as a byte array, and then control deserialization later.  Additionally we added serialization binders so that only specific types could be deserialized from streams.  This prevents a hack scenario where a user modifies the serialization stream on the wire.  Boy, you start to see the problems with this model and how much work we had to do to make it safe.  This is only the beginning!
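
The binder idea carries over to other runtimes.  As an analogy only (this is Python's pickle module, not .NET serialization), the sketch below overrides Unpickler.find_class so that a stream can only name types on an allow list.  The allow list contents are an arbitrary choice for the example.

```python
import io
import pickle
from collections import OrderedDict
from fractions import Fraction

# Types a stream is permitted to name, as (module, type name) pairs.
ALLOWED = {("collections", "OrderedDict")}


class AllowListUnpickler(pickle.Unpickler):
    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = allowed

    def find_class(self, module, name):
        # Called whenever the stream asks for a type by name.
        if (module, name) not in self._allowed:
            raise pickle.UnpicklingError(f"type {module}.{name} is not allowed")
        return super().find_class(module, name)


def safe_loads(data: bytes, allowed=ALLOWED):
    return AllowListUnpickler(io.BytesIO(data), allowed).load()


ok_stream = pickle.dumps(OrderedDict(energy=5))
bad_stream = pickle.dumps(Fraction(1, 2))  # names a type that is not on the list
```

safe_loads(ok_stream) round-trips the dictionary, while safe_loads(bad_stream) raises UnpicklingError before any Fraction is constructed.  That is the same property the binders gave us: a tampered stream can't smuggle in a type we didn't expect.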

The time protection scenario was huge for us.  We couldn't allow creatures to go over their time allotment and hang the machine.  That meant some ingenious multi-threaded programming.  Unfortunately, the .NET threading system doesn't like it when you abort threads, and so we needed solutions.  We resorted to Win32 calls in order to terminate threads, get accurate timings for threads (note threads can be running but not getting time, and we needed to detect how much time they were actually receiving), and do handle maintenance.  The timing loop was monstrous and slightly unstable, but it served its purpose well.  In essence it adds the ability to accurately time threads, terminate them on time overages, and properly handle all forms of thread violation (stack overflow, security exceptions, etc...).  The fact that nothing like this exists within the framework isn't a huge surprise, and the fact that very few of the asynchronous operations provided by the framework support cancelling or any form of time slicing protection is probably directly related to how much of a pain in the butt this was.
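
I can't reproduce the Win32 timing loop in a few lines, so here is a stand-in that shows the same contract: plug-in code gets a budget and is cut off when it runs over.  This Python sketch swaps in a different technique, counting executed source lines through the interpreter's trace hook instead of measuring thread time.  BudgetExceeded and run_with_budget are invented names.

```python
import sys


class BudgetExceeded(BaseException):
    """Raised inside plug-in code once its budget is spent.

    Derives from BaseException so that a plug-in's `except Exception`
    does not swallow it; a bare `except:` still could.
    """


def run_with_budget(func, max_lines):
    executed = 0

    def tracer(frame, event, arg):
        nonlocal executed
        if event == "line":
            executed += 1
            if executed > max_lines:
                raise BudgetExceeded(f"exceeded {max_lines} lines")
        return tracer  # keep tracing this frame and any it calls

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        return func()
    finally:
        # Restore whatever tracer (debugger, coverage tool) was active before.
        sys.settrace(previous)


def well_behaved():
    return sum(range(10))


def runaway():
    n = 0
    while True:
        n += 1
```

run_with_budget(well_behaved, 1000) returns normally, and run_with_budget(runaway, 1000) raises BudgetExceeded instead of hanging.  Notice that a plug-in with a bare except or a finally block could still intercept the cut-off, which is exactly why the IL verifier bans finally blocks.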

The Terrarium is the second most complex model I've ever worked with.  I'm saying second most, because it still doesn't solve many problems of running arbitrary code.  It leaves out memory management, explicit binding (only allowing access to certain types), and other features that are available to other plug-in frameworks (or at least the languages like Python that many plug-in frameworks are built upon).  The one thing the Terrarium does do well is support multiple languages without having a bunch of conditional code.  This is a huge win for the .NET Framework in that implementing a plug-in framework would allow you to program your resources in any language.  While many C++ plug-in architectures will load any arbitrary COM object or link to any exported DLL function, they don't provide all of the protection features we've talked about above.

5.  Defining the World Builder Model
We'll define the world builder model based on what we've already talked about in the Terrarium model.  The World Builder model adds all of the lacking features to support running arbitrary code and allowing users to come into the system and simply code new items.  In this scenario, the IL verifier is simply extended with the following rules.

      1. Ensure all referenced libraries are signed with a strong-name key in a specified set of keys
      2. Ensure all referenced libraries are at a specific, supported version
      3. Ensure all referenced libraries are in a given set of allowable libraries
      4. Ensure all referenced types are in a given set of allowable types

The above may seem a bit confusing, however, I'll explain each rule.  The first rule is a generic rule that will allow you to include library references to any MS, ECMA, or personally signed assemblies.  This may be all the protection you require, but be careful, MS might forget a permission somewhere or provide a library they've signed that you didn't know about when you shipped.  The second rule is designed to make sure only supported versions of various libraries are referenced.  This'll ensure the user doesn't run some beta version of the Framework or some set of libraries that might have issues, or that they don't compile against an old version of your application and try to introduce something that might possibly break.
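
The four reference rules can be sketched as a single check over a plug-in's reference list.  This is Python and purely illustrative.  AssemblyRef, ReferencePolicy, and check_references are invented names, and the key tokens, library names, and type names are placeholders, not real values.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class AssemblyRef:
    name: str
    version: str
    key_token: str


@dataclass(frozen=True)
class ReferencePolicy:
    allowed_keys: FrozenSet[str]       # rule 1
    required_versions: Dict[str, str]  # rules 2 and 3: allowed library -> version
    allowed_types: FrozenSet[str]      # rule 4


def check_references(refs: Iterable[AssemblyRef], type_refs: Iterable[str],
                     policy: ReferencePolicy) -> List[str]:
    errors: List[str] = []
    for ref in refs:
        if ref.key_token not in policy.allowed_keys:
            errors.append(f"{ref.name}: signing key {ref.key_token} is not trusted")
        if ref.name not in policy.required_versions:
            errors.append(f"{ref.name}: library is not on the allowed list")
        elif ref.version != policy.required_versions[ref.name]:
            errors.append(f"{ref.name}: version {ref.version} is not supported")
    for type_name in type_refs:
        if type_name not in policy.allowed_types:
            errors.append(f"{type_name}: type is not on the allowed list")
    return errors


policy = ReferencePolicy(
    allowed_keys=frozenset({"engine-key"}),
    required_versions={"WorldBase": "1.0.0.0"},
    allowed_types=frozenset({"WorldBase.Door", "WorldBase.Desk"}),
)

good_refs = [AssemblyRef("WorldBase", "1.0.0.0", "engine-key")]
bad_refs = [
    AssemblyRef("WorldBase", "0.9.0.0", "engine-key"),
    AssemblyRef("System.Windows.Forms", "1.0.5000.0", "other-key"),
]
```

A plug-in that references only WorldBase 1.0.0.0 and WorldBase.Door passes with no errors.  The bad list trips the version rule once, and the second reference trips both the key rule and the library rule.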

The final two rules are hugely important for true protection.  In many cases you only want the user accessing your own personal libraries.  You need to ensure they don't reference other libraries that might allow them to load up types you don't know how to protect against.  Take the example of the collections classes.  Collections classes allow arbitrarily large growth, however, you might want to let the user only grow up to say 50 elements.  You may also only want them to get access to 5 or 6 collections maximum.  By creating a factory they have access to instead of the collection classes you can protect yourself.  Ensuring they can't access any library except a single library you provide can be a huge win for you in terms of securing your application, but at a cost to the user since they'll have to either recreate all of those classes or learn how to use yours.  Going one step further you can even allow access to only specific types.  While I don't like this very much, it can come in handy.  The Terrarium, for instance, disallows access to specific types using an exclusive system.  Going a step further, you can gain more protection by using an inclusive system.  Take a version of Longhorn right now and load up the various assemblies.  You'll notice many Avalon classes don't have any protections on them.  It would suck to move to a new platform and find your plug-ins can now load and activate UI.
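
The collection factory from the example above might look something like this.  It is a Python sketch with invented names (CollectionFactory, BoundedList, LimitExceeded), and the limits are the ones from the text: a handful of collections, 50 elements each.

```python
class LimitExceeded(Exception):
    """Raised when a plug-in asks for more than its allotment."""


class BoundedList:
    """A list that refuses to grow past a fixed number of items."""

    def __init__(self, max_items: int):
        self._items = []
        self._max_items = max_items

    def append(self, item) -> None:
        if len(self._items) >= self._max_items:
            raise LimitExceeded(f"list is capped at {self._max_items} items")
        self._items.append(item)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]

    def __iter__(self):
        return iter(self._items)


class CollectionFactory:
    """The only way a plug-in is allowed to obtain a collection."""

    def __init__(self, max_collections: int = 5, max_items: int = 50):
        self._max_collections = max_collections
        self._max_items = max_items
        self._created = 0

    def new_list(self) -> BoundedList:
        if self._created >= self._max_collections:
            raise LimitExceeded(
                f"plug-in is capped at {self._max_collections} collections")
        self._created += 1
        return BoundedList(self._max_items)


factory = CollectionFactory(max_collections=2, max_items=3)
inventory = factory.new_list()
for item in ("key", "map", "torch"):
    inventory.append(item)
```

One more append on inventory raises LimitExceeded, and so does a third call to new_list.  The factory only helps if the verifier also blocks every other route to a collection, which is what rules 3 and 4 are for.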

The following additional IL rules can also help to provide better protection under the world builder model.

      1. Disallow the construction of types, including user types.  Instead use factory methods to create them.
      2. Disable large stack allocations (.maxstack)
      3. Disallow IL instruction sets longer than some maximum number of instructions
      4. Include all of the previous rules from the Terrarium model about statics and what not.
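
The first three rules above are cheap per-method checks.  Here is a sketch, again in Python over a simplified method description that I'm assuming has already been extracted from the assembly.  MethodBody and check_method are invented names and the default limits are arbitrary; newobj is the real IL opcode for constructing an object, and max_stack mirrors the .maxstack value a method declares.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MethodBody:
    name: str
    max_stack: int            # the method's declared .maxstack
    opcodes: Tuple[str, ...]  # opcode mnemonics in order


def check_method(body: MethodBody, max_stack: int = 8,
                 max_instructions: int = 200) -> List[str]:
    errors: List[str] = []
    # Rule 1: no direct construction, plug-ins must go through factories.
    if "newobj" in body.opcodes:
        errors.append(f"{body.name}: constructs a type directly (newobj)")
    # Rule 2: no large declared evaluation stack.
    if body.max_stack > max_stack:
        errors.append(f"{body.name}: .maxstack {body.max_stack} exceeds {max_stack}")
    # Rule 3: no overly long instruction streams.
    if len(body.opcodes) > max_instructions:
        errors.append(
            f"{body.name}: {len(body.opcodes)} instructions exceeds {max_instructions}")
    return errors


tidy = MethodBody("Think", 4, ("ldarg.0", "call", "ret"))
greedy = MethodBody("Spam", 64, ("newobj", "newobj", "newobj", "ret"))
```

check_method(tidy) comes back empty, while check_method(greedy, max_instructions=3) reports all three problems.  Rule 4 is not repeated here since it is the same statics and finally checks the Terrarium model already needs.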

6.  Defining the Pin Point Model
[Coming Soon]

Published Thursday, February 05, 2004 2:05 AM by Justin Rogers