TL;DR: We rewrote parts of OpenFGA in pure Postgres to make it easier to keep our authorization system up-to-date with our main database. Everything is available on GitHub.
In software engineering, ‘auth’ is a bit of a loaded term. It’s generally used to describe the process of both authenticating users (determining who a user is) and authorizing users (determining what they’re allowed to do).
Authentication (often referred to as ‘authn’) is pretty much a solved problem when you’re writing a user-facing application. There are cookie-cutter solutions that you can usually just drop into your code and get going - think OAuth, passkeys, magic links, etc.. Even if you’re doing the hip thing and rolling your own auth, you’ll pretty much always be following one of these patterns. They’re what users have come to expect, and doing differently is more likely to single you out as an annoyance than win you any UX awards.
Authorization (‘authz’), on the other hand, is a lot more tricky. How you check whether a user is or isn’t allowed to do something will depend heavily on your application and the audience you’re intending to serve. On top of that, authz checks aren’t something your users are likely to be interacting with directly all that often - they usually happen silently in the background - so there’s a lot more flexibility.
In the wild, you’ll usually run into the following systems:
RBAC - role-based access control. Each user has a number of roles, each of which has permissions that allow users to do things. The manager
role might have the approve_expenses
permission, while the employee
role only has submit_expenses
. This method is simple and scalable for a lot of use cases, but can become quite complex if your app has a lot of moving parts.
PBAC - policy-based access control. This is what governs most of AWS, and, personally, my least favourite of the bunch. Access is governed by policies - usally in JSON or YAML - that describe the actions a user is or isn’t allowed to take, and under what conditions they’re allowed to take them. For example, users who have the AWSCodeCommitReadOnly
policy will have read access to AWS CodeCommit repos, but won’t be able to push.
ABAC - attribute-based access control. This one is a little less common, but is interesting nonetheless. Essentially, access is granted based on “attributes” (usually key-value pairs) on the users, resources and the environment. A user may only be able to access documents in a folder if their department
attribute matches that of the folder. ABAC is incredibly fine-grained, but that can make it complex to manage and scale.
ReBAC - relationship-based access control. This is what you’ll see with apps like Google Drive. Alice is an owner
of a folder and can therefore read and write any documents that belong to it. Bob is a viewer
of a specific document in that folder, but he can’t see any of the others, and can’t edit anything. ReBAC is a nice fit when you’re using a relational database, because it fits with your existing mental model of your data.
So, what about Rover?
Rover is an incident prevention platform mostly aimed at engineering teams rather than individuals (though we do cater to loners solo users, too!). As such, our main organizational unit is, well, an organization
. An organization can have many repositories, which in turn have many alerts, codebase indexes, and so on.
Each organization (currently) has a single owner, and a number of admins and members. The user’s role dictates some of the actions they’re allowed to take (though, at the time of writing, there are very few differences between what each can do).
At first glance, it might look like we could model everything with RBAC alone. The owner
and admin
roles get the edit_repository
permission, while the member
role only gets the read_repository
permission.
And that would be fine… if Rover wasn’t a multi-tenant application. Because we run SaaS, we have to be able to support multiple organizations sharing a database. While it’s technically feasible to have a one-database-per-customer model using something like Turso, where RBAC would work perfectly, we want users to be able to be part of multiple organizations and switch seamlessly between them. That’s much easier when everything is in one database.
At that point, your authz checks necessarily become relationship-based. A member
should only have read_repository
for repos in the organization
(s) they’re a member
of, and not for any other repositories. That requires looking at the relationship between the repo and its parent org, and the user in question and that same org.
So, what about Rover? Well, we decided to go with a ReBAC system… kinda. We chose to use an open-source solution called OpenFGA, which is based on Google’s authz system, Zanzibar.
OpenFGA
In essence, OpenFGA is centred around the concept of ‘relationship tuples’. These take the form { user, relation, object }
. For example, to represent ‘Isaac’ being a ‘member’ of the ‘Rover’ organization, you’d create a tuple that looks something like this: { user: “user:Isaac”, relation: “member”, object: “org:Rover” }
The user:
and org:
bits are namespaces, and they define the type of the user/object (in OpenFGA they’re the same thing, and they’re all called ‘users’).
You use the same format for querying. When you’re trying to determine if a user has permission to perform an action on an object, you’re essentially checking to see if a user has a particular relationship with the object.
Having to create every possible tuple for every possible permission would be nightmarish though, and would be awful to maintain as things change in the database. That’s why OpenFGA also has the concept of an authorization model. This defines the possible relationships in your system, but, crucially, it also defines how your relationships are related to each other.
This makes sense - you want your admin
permissions to be a superset of your member
permissions (this is called a concentric relationship), and you owner
permissions to be a superset of that.
That way, if I say that:
{ user: “user:Isaac”, relation: “owner”, object: “org:Rover” }
When I then query :
{ user: “user:Isaac”, relation: “member”, object: “org:Rover” }
I should get a positive response.
More than that though, OpenFGA allows you to define inheritance between objects. For example, you could say that:
{ user: “repo:arda”, relation: “organization”, object: “org:Rover” }
And could then set things up such that any member
s of org:Rover
automatically have a read_repository
relationship with repo:arda
(yes, that is what we call our internal monorepo - we’re all Tolkien nerds 🧙♂️).
OpenFGA allows you to define this with a configuration language, and here’s a snippet of what we use at Rover that outlines the above:
type user
type organization
relations
define owner: [user]
define admin: [user] or owner
define member: [user] or admin # Owners and admins are also members
type repository
relations
# Parent resource
define organization: [organization]
# Crud
define can_read: member from organization
We define our user
, organization
and repository
types, some our concentric relationship for our organization roles, and finally set up permissions on repos such that any member of the parent org can read the repo.
I won’t go into the OpenFGA configuration language here, but you can read more about it in their documentation.
That all sounds great, and it works really well! So what’s the problem?
The problem
A deployed OpenFGA instance is comprised two parts: one or more API server containers, and a Postgres database to host the model and store the relationship tuples. We set this up, and as per OpenFGA’s production recommendations, used a separate Postgres database to our production DB.
This works really well… until it doesn’t. The big problem that we kept running into was keeping everything in sync all the time. Every time we add a new user, organization, repository, etc. in our database, we also had to add the corresponding tuples in OpenFGA.
‘Fine,’ we thought.
‘That’s the cost of having a robust authz system, and we have a centralised data access layer (DAL) internally anyway, so we can just put the FGA tuple writes in with the DB writes.’
‘Oh, but now we have to have a transaction system (which OpenFGA doesn’t really have a good solution for) so we can roll back the tuple inserts if the DB inserts have failed or vice versa.’
‘Okay, whatever, it’s just a command pattern. We can do that.’
‘Ah, but now we need to implement user account deletion to comply with EU data protection laws. That should be fine, right?’
‘Sure. Postgres allows us to cascade, so when we delete the user’s personal org (or any orgs they’re the owner of), we can cascade down to the repos and so on-
‘OpenFGA. F*ck. Now we’ve got to implement some sort of cascading deletion system internally and make sure the order it uses is up-to-date with our current authorization model.’
‘Oh, we can also never go in and manually edit the database, because changes that don’t go through the DAL won’t sync to OpenFGA.’
It was just one thing after another that made having our authz system separate to our main database just that little bit too painful to deal with for a small team such as ours.
On top of this, while OpenFGA is fantastic, and I don’t at all mean to besmirch the efforts of all the folks who’ve put time into creating such a great tool, all totally open-source and free, but the fga
CLI is missing some pretty key features.
Every time you create an authorization model, it gets a new ID. That’s okay, except the IDs are randomly generated ULIDs, and when developing locally you’re constantly having to edit your local .env
file.
On top of this, there’s no way to check whether the current authorization model in the database matches your schema file, so, to be safe, you have to run a migration on every deployment.
And we deploy to our dev environment on every push to main
.
All this to say, OpenFGA wasn’t working out for us, but we really liked the philosophy and ReBAC as a whole. Desperate to keep working with it, we explored ways to keep OpenFGA up to date with our application database automatically. It’s technically possible by using something like Debezium Server to stream the Postgres write-ahead log into our RabbitMQ instance, then having a service on the other end listening to those events and updating OpenFGA, but just… no.
The operational complexity was too much for us to handle, so, with heavy hearts, we started looking at other options.
Until, a day or so later, it struck me.
SQL is Turing-complete.
The solution
Okay, well, there are arguments to be made as to whether or not plain SQL is Turing-complete, but it’s pretty close, and Postgres’ procedural language, PL/pgSQL, definitely is.
So, could I replicate OpenFGA in pure Postgres, allowing us to run all our authz checks from within our database? Would this mean we could keep use relationship-based access control, but using the relationships we’re already defining as foreign keys?
Turns out, yes! And it wasn’t even that difficult.
We don’t take full advantage of OpenFGA’s offerings - features like usersets and conditional relationship tuples don’t really apply to us. As such, it was fairly straightforward to port OpenFGA to pure Postgres. It just took a table to define our authorization model, a view to define our tuples, and a recursive function to traverse the two to answer queries.
Let’s take a look.
authz_model
This table stores the authorization model defined by our schema in the OpenFGA configuration language.
Anyway, the table looks like this:
CREATE TABLE "authz_model" (
-- We use TypeIDs (jetify-com/typeid) as our primary keys
"id" text PRIMARY KEY DEFAULT typeid_generate_text('authz') NOT NULL,
"schema_version" bigint NOT NULL,
"entity_type" text NOT NULL,
"relation" text NOT NULL,
-- Subject type represents `[thing]` schema in OpenFGA used for direct relations,
-- and can be null in the case where the relation is only inherited
"subject_type" text,
-- e.g. 'member' is implied_by someone being an 'admin'
"implied_by" text,
-- Parent relation represents the `from` syntax in OpenFGA for relations
-- on related objects, e.g. a someone may have permissions on a 'repository'
-- if they are an 'admin' of the repository's 'organization', which is
-- the parent relation
"parent_relation" text,
CONSTRAINT "id_prefix_check_authz" CHECK (typeid_check_text(id, 'authz'))
);
Thanks to the fga
CLI’s ability to transform schemas in the configuration language into JSON, we’re actually able to keep using the same schemas with a little glue code for inserting the database records.
In this case, the entity_type
column represents our namespaces (user
, organization
, etc.). For ease of use, we keep these the same as our application table names.
authz_relationship
To avoid having to keep our relationship tuples up to date manually, we use a Postgres view that computes them on-demand from our existing tables. This way, we can just update our tables normally and our tuples will adjust accordingly. Now we can take full advantage of things like on delete cascade
and other great SQL features.
The definition looks like this, but extended to include every relationship type:
-- A user's role is determined by the organization_user table
select
'user' as user_type,
ou.user_id as user_id,
ou.role as relation,
'organization' as object_type,
ou.organization_id as object_id
from organization_user as ou
union all
-- Repositories are tied to their owning org
select
'repository' as user_type,
repository.id as user_id,
'organization' as relation,
'organization' as object_type,
repository.organization_id as object_id
from repository
Currently, we have to keep this up-to-date with our authz model manually, but we’re exploring ways to have it generated using our schema.
check_permission
The check_permission
function is the real meat of our system (which we’ve lovingly dubbed ‘pgFGA’). It’s a bit longer, so rather than dumping it straight into Substack’s editor, you can check it out on GitHub by clicking here.
Essentially, the function runs recursively, searching the graph defined by our authorization model and tuples for a tuple matching the given query. It works pretty much the same as OpenFGA does, but we’ve adjusted things slightly to prioritise the relationship types that predominate in our authz system (we use a lot of inherited tuples).
The primary overload of the function takes in an array of authz_model
rows, but we also have overloads that fetch that from the table based on either:
A given version, in set by an environment variable in production environments
The latest version, so local development is easier
Utility scripts
We’ve also got some utility scripts written using Bun and TypeScript that manage things like work-in-progress schemas, creating new authz schema migrations, applying them to the database and so on.
In conclusion…
This rewrite (which didn’t take much longer than a weekend) is probably one of the best decisions we’ve made regarding our auth system as a whole. It’s so much easier to maintain, and we’re not at the scale where we need a separate database and service just for our authorization checks. Once we get to that point, it’ll be fairly straightforward to scale just using some Postgres read replicas.
At some point, we might run into performance issues stemming from repeatedly computing our authz_relationship
view, but that can be mitigated by materializing the view and refreshing on write to any relevant tables.
If you’re interested in using a system like this yourself, we’ve made much of it open source over on GitHub under the MIT license.
We’re hoping to be able to share more of the cool stuff we develop internally in the future, so sign up to this Substack if you’re interested 😄
Ciao ciao!