Explaining what happened to my rooms: A post-mortem analysis

Anyone who is a member of one of my rooms will have noticed by now: something went terribly wrong. There was no malicious intent at any point that led to this, and while certain peculiarities of the Matrix spec may have had an influence, the fault was mainly a mistake on my side… Everything was related to server ACLs in these rooms, so that is where I will start.

What are server ACLs?

"ACL" is short for "Access Control List". ACLs are used to, well, control which entities are permitted to access which resources. In Matrix, a room can have a server ACL that determines which servers may participate in that room. Like most things in Matrix, it is defined in JSON, with an "allow" array as a whitelist and a "deny" array as a blacklist. The former typically contains a "*" element to allow any server to participate (* being a wildcard for an arbitrary string).

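To make the allow/deny logic concrete, here is a minimal sketch of how such an ACL could be evaluated. The function names (glob_matches, server_allowed) and the example server names are my own, hypothetical choices, and the sketch leaves out the handling of IP-literal server names, so treat it as an illustration rather than what any homeserver actually runs.

```python
import re
from typing import Optional


def glob_matches(pattern: str, server_name: str) -> bool:
    """Match a server name against an ACL glob.

    '*' stands for any run of characters (including none), '?' for exactly one.
    """
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, server_name) is not None


def server_allowed(server_name: str, acl_content: Optional[dict]) -> bool:
    """Decide whether a server may take part in a room, given the ACL content."""
    if acl_content is None:
        # No m.room.server_acl event in the room state: everyone may participate.
        return True
    if any(glob_matches(p, server_name) for p in acl_content.get("deny", [])):
        # The blacklist wins over the whitelist.
        return False
    # A server has to match at least one "allow" entry, otherwise it is denied.
    return any(glob_matches(p, server_name) for p in acl_content.get("allow", []))


# Example content of an m.room.server_acl event (server names are made up).
acl = {"allow": ["*"], "deny": ["*.evil.example"]}
print(server_allowed("matrix.org", acl))
print(server_allowed("chat.evil.example", acl))
```

Note the last line of server_allowed: a server that matches nothing in "allow" is denied, which is why the "*" entry matters so much.
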
What did I do, and why?

I had recently configured a Mjolnir bot to help with moderating my rooms. Said bot wanted to configure a server ACL in my rooms. I had heard that this frequently caused performance issues, because the aforementioned whitelist rule has to be checked against every server, so I thought redacting the event it had sent would revert the room to its previous state, where no ACL had existed. Wrongly so, as I learned soon afterwards.

What happened?

I did not notice anything at first. It seemed like I could not prevent the bot from doing this, so I decided that instead of using the more efficient Mjolnir rules like @*:server.tld, which we use in the Techlore room and which do not cause any server-side load, I would just use server ACLs for what they were meant for. The bot was on my server and everything seemed to work fine; I could even receive messages from other servers. Until MMJD told me that from his side it looked like Mjolnir had banned every other server… Contrary to what I had thought, the redaction only removed the content of the ACL, not the ACL itself, leaving it with an empty array for "allow". This caused every other server in the rooms to ignore any events (messages, settings changes and so on) in these rooms. A truly weird situation, as my server could receive things just fine - only the other servers would refuse to do so, even though they still sent events out. They would even ignore the fact that I had changed the server ACL yet again…
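
The difference between "no ACL" and "an ACL with nothing in it" can be sketched in a few lines. This is only an illustration under assumptions: redact() models the redaction as stripping every key from the event content, which is what I observed in my rooms, and is_allowed() uses Python's fnmatchcase as a rough stand-in for Matrix's wildcard matching. All names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(server_name, acl_content):
    """Rough ACL check; fnmatchcase only approximates Matrix's '*' and '?' globs."""
    if acl_content is None:
        # The room has no m.room.server_acl event at all.
        return True
    if any(fnmatchcase(server_name, p) for p in acl_content.get("deny", [])):
        return False
    # A missing "allow" key is treated like an empty list: nobody matches.
    return any(fnmatchcase(server_name, p) for p in acl_content.get("allow", []))


def redact(content):
    """Assumption: the redaction leaves the event in place with empty content."""
    return {}


bot_acl = {"allow": ["*"], "deny": []}
states = {
    "no ACL event (what I expected)": None,
    "ACL set by the bot": bot_acl,
    "redacted ACL (what I got)": redact(bot_acl),
}

for label, acl in states.items():
    verdicts = {s: is_allowed(s, acl) for s in ["matrix.org", "other.example"]}
    print(label, verdicts)
```

Only the third state denies everyone: the event is still part of the room state, it just no longer allows anybody.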

The solution

I could manually patch servers back in by changing the ACL for these rooms on them as well. However, this would only work for servers that had a moderator for the room, as other servers, while still able to have users join the room, would not receive the fact that someone had been promoted in there… A state reset could also have solved it, but those are hard to come by nowadays, as Matrix has become more reliable… So my solution was to create new rooms, reclaim the aliases from the old ones for them, and invite every old user over, while trying to close the old rooms as well as possible (tombstoning, deny permissions etc.; I am not sure how much of that arrived on the other servers, however). I hope everyone has found their way into the new rooms by now. If not, feel free to tell me.

The moral

NEVER EVER redact m.room.server_acl events unless you know exactly what you are doing. Actually, do not redact any state event unless you can be absolutely sure nothing will break.
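
For completeness, what I probably should have done instead of redacting is to replace the ACL with a permissive one. The sketch below only builds the request for the client-server API's "send state event" endpoint and does not send anything; the function name, homeserver URL, room ID and token are placeholders I made up. It is meant as an alternative to the redaction, not as a repair afterwards: as described above, the other servers ignored my later ACL changes.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_permissive_acl_request(homeserver_url, room_id, access_token):
    """Build (but do not send) a PUT request that sets an allow-everyone ACL."""
    content = {"allow": ["*"], "deny": []}
    # m.room.server_acl uses an empty state key, so the path ends at the event type.
    url = (
        f"{homeserver_url}/_matrix/client/v3/rooms/"
        f"{quote(room_id, safe='')}/state/m.room.server_acl"
    )
    return Request(
        url,
        data=json.dumps(content).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, for illustration only.
req = build_permissive_acl_request(
    "https://matrix.example.org", "!abc:example.org", "YOUR_ACCESS_TOKEN"
)
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would actually send it, and the account behind the token needs enough power in the room to change the ACL.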