After digging a lot more into this I think* I have the cause of this issue figured out. There are two main facets to it. One is the RetryInvoker timer logic and the second is the overloading of the meaning of ApiResponseException when it comes to RpcCalls.
First, why doesn’t the following code complete correctly if the server side rpc throws an exception?
try
{
var payload = new Dictionary<string, string> { { "code", code } };
var response = await Client.RpcAsync(Session.Result, "code_redeem", payload.ToJson());
Debug.Log("Code redeemed sucessfully" + response);
}
catch( ApiResponseException ex )
{
Debug.LogFormat("Error: {0}", ex.Message);
}
catch( Exception ex )
{
Debug.LogError("What else could have happened? " + ex.Message);
}
It stems from the RetryInvoker logic. On an exception, there is a check made to see if it should be considered a “transient” exception. (From the source file comments: “For example, timeouts can be transient in cases where the server is experiencing temporarily high load.”).
ApiResponseExceptions with a status code of 500 or more are considered transient. Using throw new Error("");
in typescript does indeed result in that kind of exception. Here is the http response: status=500, {"code":13,"error":{"StackTrace":"Error: No such code as 'h'\n\tat rpcRedeemCode (.:38:15(76))\n","Type":"exception"},"message":"Error: No such code as 'h' at rpcRedeemCode (.:38:15(76))"}
So, if the exception is transient, the SDK will re-attempt the server call (that applies not only to rpc but all server calls). Before re-attempting, the client will wait some time. How long is based on a RetryConfiguration object. The default one is configured with 500ms and a max retry attempt of 4.
public RetryConfiguration GlobalRetryConfiguration { get; set; } = new RetryConfiguration(500, 4, (RetryListener) null, new Jitter(RetryJitter.FullJitter));
The exact time to wait between each attempt is further based on the previous number of failed attempts
private Retry CreateNewRetry(RetryHistory history)
{
int int32 = Convert.ToInt32(Math.Pow((double) history.Configuration.BaseDelay, (double) (history.Retries.Count + 1)));
int jitterBackoff = history.Configuration.Jitter((IList<Retry>) history.Retries, int32, new Random(this.JitterSeed));
return new Retry(int32, jitterBackoff);
}
The following math is a bit suspicious Math.Pow((double) history.Configuration.BaseDelay, (double) (history.Retries.Count + 1))
. With a base delay of 500ms, on the first retry, you get 500ms. Good. On the second retry you get 500^2. That’s starting to be a rather large number. On the 3rd attempt we are up to ~34h…
Therefore, the original call to RpcAsync is simply waiting for hours before actually completing, that’s what is going on. It is easy enough to work around this issue by providing a RetryConfiguration object with a maxRetry of 1.
I’ll be posting this in original issue in the .Net client but I think they’ve already figured out the problem as there was a commit 5 days ago that changes the logic of the delay calculation. However, this update has not propagated to the Unity SDK DLL yet.
Which begs the question: what is the proper way to return errors from rpc calls? The documentation samples I’ve seen so far are using (for typescript) throw new Error("something");
First, this results in the unfortunate side effect of the client automatically retrying the rpc because the exception is deemed “transient”. Secondly, it is hard to extract the actual error from the resulting exception stack (it is a sub-exception to the main TaskCanceledException)
Also, the errors a rpc call can returns do not all fall under the umbrella of a failed api call. There could simply be a logical failure: code doesn’t exist, entry has already been removed, parameters are outside the desired range. Is there a way to distinguish those from a failed database connection or a divide-by-zero kind of exception (on the client side). Could we throw new LogicalError("something)
that wouldn’t get wrapped in an ApiResponseException?
Reading the new commit on the .Net client from 5 days ago, I see that it will be possible to override the check on whether an exception is “transient”. That would solve part of the problem. Are there plans to update the Unity DLL?