From Thursday 22nd to Saturday 24th October, the Tray Workflow Builder and the Embedded APIs were occasionally unavailable for a portion of our users. Workflow execution was partially delayed, and our trigger ingestion service was intermittently unavailable, occasionally responding with 500 HTTP status codes. Most third-party services would have retried these failed trigger attempts until they succeeded.
The Workflow Builder and all Tray APIs are back to their normal availability, and you can resume work. All delayed executions and recorded triggers have caught up, and no action is needed from you.
If you are a Tray Embedded customer or a Tray API user and received an error response to a Tray trigger, you should retry the request; a 500 response from the Tray API means that the action was not processed. We advise making retries with exponential backoff the default behaviour of your client.
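To illustrate, here is a minimal sketch of that retry pattern in TypeScript using the built-in fetch API. The helper name, attempt count, and delay values are illustrative defaults rather than anything Tray prescribes; any HTTP client with equivalent behaviour works.

```typescript
// Sketch of retrying a Tray API call with exponential backoff.
// `callWithRetry`, the attempt count, and the delays are illustrative only.
async function callWithRetry(
  url: string,
  init: RequestInit,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch(url, init);
      // A 5xx response means the request was not processed, so it is safe to
      // retry; 4xx responses are client errors and retrying will not help.
      if (response.status < 500) {
        return response;
      }
    } catch {
      // Network-level failures (timeouts, connection resets) are also retried.
    }
    if (attempt < maxAttempts) {
      // Exponential backoff with a little jitter: roughly 0.5s, 1s, 2s, 4s, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Request to ${url} failed after ${maxAttempts} attempts`);
}
```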
On Thursday, 22nd October, a rapid increase in disk usage on one of our database servers triggered an automatic response to scale the machine size up to accommodate the additional usage. We rely on Amazon Web Services for our infrastructure, and this process is usually seamless.
On this occasion, the database stopped accepting new connections after the scaling process completed, resulting in several API instances starting in an unhealthy state. Engineers manually terminated the affected instances; auto-scaling mechanisms then replaced them and the problem gradually resolved. On Friday, a similar issue again prevented new database connections.
In both cases, because the database was unavailable, our application servers reported degraded health. When an application instance is degraded, it is automatically removed from service and replaced with a new instance. Since the fault lay with the database rather than the application, this placed our application servers in a restart loop that put additional strain on our databases and may have exacerbated the issues.
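One common way to avoid this kind of loop is to separate process liveness from dependency readiness in the health checks, so that an unreachable database takes an instance out of rotation without triggering its replacement. The sketch below illustrates that pattern; it is not our actual implementation, and the `/livez` and `/readyz` paths, the port, and the `databaseIsReachable` check are placeholder names.

```typescript
import { createServer } from "node:http";

// Placeholder dependency check; a real service would ping its database here,
// e.g. with a `SELECT 1` query against its connection pool.
async function databaseIsReachable(): Promise<boolean> {
  try {
    return true;
  } catch {
    return false;
  }
}

createServer(async (req, res) => {
  if (req.url === "/livez") {
    // Liveness: reports only on the process itself. A down database does not
    // fail this check, so the orchestrator will not replace the instance.
    res.writeHead(200).end("ok");
  } else if (req.url === "/readyz") {
    // Readiness: includes dependencies. A down database takes the instance
    // out of the load balancer without triggering a restart loop.
    const ready = await databaseIsReachable();
    res.writeHead(ready ? 200 : 503).end(ready ? "ready" : "database unavailable");
  } else {
    res.writeHead(404).end();
  }
}).listen(8080);
```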
Early Saturday morning (UTC), our on-call engineers, with assistance from AWS, determined that our primary database was unhealthy and made the decision to fail over to a secondary.
In the short term, we have increased the size of the affected database instances. We also pushed a fix to our application to better report on its health status and to prevent unnecessary restarts. Our engineers are also evaluating the health and size of our other database servers and proactively upgrading any instances that might be at similar risk.
We are actively working on several strategic initiatives to increase the stability of our services, and we are continuing to work directly with the AWS team to prevent these issues from recurring.
We continue to monitor our database server instances, and we are introducing new alerting and monitoring systems as well as new failover mechanisms to prevent, or further reduce the impact of, future disruptions.