Approaching multi-tenancy with cloud options

Wednesday, February 5, 2020

Building multi-tenancy into your app is an interesting (and dare I say fun) problem to solve, because there are a number of ways to approach it. And now that we don't have to spin up closets full of hardware in some basement, there are better options that we didn't have in the dark ages. I'll talk a little bit about the options that I've used, and how I solved the problem this time around with hosted POP Forums.

Establishing tenant context

To start, let's define it: Multi-tenancy is where an app has to accommodate a number of segregated customers, where the data doesn't overlap. This is different from just having multiple users, because while there are overlapping concerns between your users, they're all associated with a root customer. Think about the granddaddy of SaaS apps, Salesforce. When you're using it, you don't encounter the data of Company B down the street, only Company A, where you work. You're potentially using the same hardware instance, and definitely the same software, but you see different data. That's multi-tenancy.

Fundamentally, I believe your app needs what I call tenant context. Regardless of what part of the app is acting on something, whether it's a web-based action, API, some serverless thing (Azure Function or AWS Lambda), etc., it should know which tenant data it's working with. Setting the context is the first step of any action. At the web end of things, this is straight forward enough, because you could operate based on the domain name (customer1.contoso.com), the path (contoso.com/companyA/home), the query string (that's so 1999) or whatever your authentication and authorization mechanism is (auth tokens, for example). Some asynchronous thing, like a serverless component running in response to a queued message, may have tenant context baked into the message.

Partitioning the data

Once you have tenant context established, you have to decide how your data will be divided up. The strategies here are about as diverse as the persistence mechanisms. Many no-SQL databases already partition data by some kind of sub-key or identifier. The easiest way to isolate tenant data in a SQL database is to use a compound or composite key on every record, where you combine tenant ID and a typical identity field as your key. If you use Entity Framework, there are even a number of ways and libraries to add plumbing to all of your queries and commands to specify the tenant context, often by having your entities derive from a common base class that has a TenantID property. You could also divide up tenants by schema in SQL, or just provision a totally separate database entirely for each tenant. That's less scary than it sounds when you automate everything.

I've had to personally solve this problem several times over the years, and I never did it the same way twice. However, this allowed me to make some educated decisions when it came time to do it for POP Forums. Architecturally, the app consists of two web apps, one that serves the forums, and one that acts as the management interface for customers (billing, usage reporting and such), while a bunch of Azure Functions do a great many things in the background, triggered by timers and queue messages. The databases are SQL, using Azure SQL Elastic Pools.

Running code, in context

Let's walk through these parts. If we go to the source code, you'll find that the application is technically a monolith. The web app and Azure functions share code and they share the same database. (Moving compute around doesn't really break down your app into "microservices," because the parts still need each other and share state via a common database.) There is an interface called ITenantService, and the default implementation wired up in the dependency injection has no implemented SetTenant(tenantID) method. That's because if you're cloning this app and running it as is, there are no tenants.

Hosted POP Forums has two replacement implementations. For the web app, the SetTenant(tenantID) method puts the tenant ID into the HttpContext.Items collection. The GetTenant() method fetches it from the same place. (If you're wondering, there's a great new interface you can add to your container called IHttpContextAccessor which will get you to HttpContext, making all of this stuff easy to test.) Middleware placed early in the Startup sequence uses this implementation to set the context based on domain name. The web app also has a replacement for POP Forums' IConfig interface, which normally just reads values from a JSON config file. The replacement looks at the tenant context and builds up the right connection string, which is in turn used by all of the default SQL code. Obviously there is some error checking and redirecting for bad domain names and such, but that's (mostly) it for the web app.

The Azure Functions have an even simpler version of ITenantService. Since the functions are one-time use and thrown away, it gets the tenant ID from the queue message, stores it in the instance and reads it back as necessary. Relative to the open source code, there are not a ton of replacement implementations in the hosted version of the app. The biggest one writes error logs to a common instance of table storage in Azure instead of writing to the database.

So what about the timer based functions? The timers each queue messages in a fan-out pattern to run each task in tenant context. For example, one of the tasks is to clean up the session table, which feeds the "users online" list on the forum home page. It just nukes rows where the last updated time is too old. The timed function queues messages composed only of a tenant ID, one for each tenant. The listening function runs against that tenant.

Design rationale

One of the biggest requirements when I started the project was to modify as little as humanly possible from the open source project. I'm reasonably consistent about doing something with it on a weekly basis, and I don't want to change a consuming project beyond updated project references. So with that in mind, that's why I went with separate databases. Elastic pools make this economical and super easy, and when (if) I hit the ceiling for number of databases in a pool, it isn't hard to add some little bit of context about which "server" to look at for databases in the new pool. The downside, yes, is that any schema change has to be made in every database. Still, that's historically a small blast radius, and using something like DbUp, I could likely update hundreds of databases in a few minutes.

The domain name as key to defining the tenant is just obvious. People like to name things, and with additional automation, I can even provision custom domain names and certificates.

I've done some initial load testing in an unsophisticated way, and with two S1 app service nodes running on Linux and the old-school 100 DTU database pricing model, I can hit about 2,500 requests per second with sub-second response times. The unknown here is that my tests aren't hitting huge datasets, so the cache load (there is two-level caching going on with Redis and local memory) is not large. I'd like to design a better test and try P2v2 app services with three nodes, and flip the database to the vCore model. I think I'd need some impressive customers to really drag it down.

Overall, multi-tenancy was the easy part of the project. The more interesting parts were the recurring billing and provisioning. This week I'll be migrating one of our big sites on to the platform.

Establishing tenant context

Partitioning the data

Running code, in context

Design rationale

No Comments