Best practices for a "maintenance mode"

We’ve been thinking about what would be the best way to put the server into some form of “maintenance mode” where new matches would not be allowed and clients would gracefully be pushed out without interrupting ongoing matches. Abruptly closing all RPC calls and terminating matches would lead to unhappy players.

The only discussion about the topic that I can find is:

But it moves the responsibility to load balancers and services higher up the stack, way before the server. That’s not ideal as it makes it hard to do graceful shutdowns. We would ideally also allow certain clients to use the server as normal so that we can do verification that things are indeed working correctly before re-opening the server.

I’m not looking for some magic built-in flag in Nakama, but for ideas on how to implement this in a not too invasive way. It’s easy to disallow new matches from being started, but we use an unholy amount of custom RPC calls to do all kinds of things, as well as some of the built-in RPC calls. There doesn’t seem to be a central “before RPC call” hook where we could stop calls, so we’d need to add a maintenance mode check to each and every RPC call handler.

Another way would be to push out a “maintenance mode” flag to all connected clients using a notification and then let the clients handle everything. This, however, still leaves the door open for script kiddies to keep making RPC calls if they have snooped out the protocol. One reason to put the server in maintenance mode is to stop some exploit from being used, for example a bug that gives players infinite resources. So having a system where the server can stop the calls is better than relying on clients being honest.

Any comments?

I think you pretty much went over all the possible solutions available at this time. If you don’t want to check on every RPC handler, you could stop matches from being started, and then configure the load balancer/reverse proxy to block incoming traffic during maintenance mode. This also allows you to whitelist specific clients to do any verifications needed.

To add to what @ftkg mentioned:

There is a server flag called shutdown_grace_sec which, if set, instructs the system to allow a grace period before the server is shut down.

Inside Nakama, this grace period is propagated to all the subsystems, including socket handling and match handlers. No new matches are created on the current node (new ones can be created on other nodes in the same cluster - Heroic Cloud feature), all RPCs will succeed and continue, and the match handler is given the grace period as well, so that it can tell all clients in that match to react to this event. Typically, developers choose to migrate the match, including its state, to another node in the cluster. We have documentation here.
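On the match handler side, the Go runtime passes the grace period to `MatchTerminate` as its `graceSeconds` argument. A rough sketch of telling the clients about it follows. It assumes a simplified `Broadcaster` interface in place of Nakama's `runtime.MatchDispatcher` (whose `BroadcastMessage` takes more parameters), and the op code, payload shape and `NotifyShutdown` helper are hypothetical.

```go
package main

import "encoding/json"

// Broadcaster is a simplified stand-in for runtime.MatchDispatcher. The
// real BroadcastMessage also takes presences, a sender and a reliable flag.
type Broadcaster interface {
	BroadcastMessage(opCode int64, data []byte) error
}

// BroadcastFunc adapts a plain function to the Broadcaster interface.
type BroadcastFunc func(opCode int64, data []byte) error

// BroadcastMessage calls the underlying function.
func (f BroadcastFunc) BroadcastMessage(opCode int64, data []byte) error {
	return f(opCode, data)
}

// OpCodeShutdown is a hypothetical op code that clients would treat as
// "this match is ending here, prepare to reconnect".
const OpCodeShutdown int64 = 100

// ShutdownNotice is a hypothetical payload carrying the remaining grace time.
type ShutdownNotice struct {
	GraceSeconds int `json:"grace_seconds"`
}

// NotifyShutdown is the kind of helper a MatchTerminate implementation
// could call with the graceSeconds value it was given.
func NotifyShutdown(d Broadcaster, graceSeconds int) error {
	data, err := json.Marshal(ShutdownNotice{GraceSeconds: graceSeconds})
	if err != nil {
		return err
	}
	return d.BroadcastMessage(OpCodeShutdown, data)
}
```

What the client does with the notice (reconnect UI, rejoining a migrated match) is up to the game; the sketch only shows the server-side broadcast.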

Lastly, the shutdown_grace_sec config flag is set on Nakama; in Heroic Cloud, however, the system is clever enough to propagate that information across the entire cluster, as well as to all Heroic Cloud subsystems - so everything is configured automatically once the value is larger than 0 (including load balancer behaviour, availability of nodes, rolling restart behaviour, etc.).

This should be the behaviour you are looking for. Once it is set, you can tell your game client to disconnect gracefully, show a re-connection UI (if necessary) and continue seamlessly.