Degraded performance across the application
Incident Report for Tray.io
Postmortem

From Thursday 22nd to Saturday 24th October, the Tray Workflow Builder and the Embedded APIs were occasionally unavailable for a portion of our users. Workflow execution was partially delayed, and our trigger ingestion service was intermittently unavailable, occasionally responding with 500 HTTP status codes. Most third-party services would have retried these failed trigger attempts until they were successful.

The Workflow Builder and all Tray APIs are back to their normal availability, and you can resume work. All delayed executions and recorded triggers have caught up, and no action is needed from you.

If you are a Tray Embedded customer or a Tray API user and received an error response on a Tray trigger, you should try again; a 500 response from the Tray API means that the action was not processed. We advise implementing multiple retries as the default behaviour of your client, with exponential backoff.
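As a sketch of the retry behaviour we recommend, here is a minimal, generic backoff helper in Python. This is illustrative only, not part of any Tray SDK: the function names and parameters (`retry_with_backoff`, `base_delay`, and so on) are our own for this example, and you would wrap your actual HTTP call in place of `call`.

```python
import time


def retry_with_backoff(call, max_retries=5, base_delay=1.0,
                       retryable=(Exception,), sleep=time.sleep):
    """Run `call`; on a retryable error, wait and retry with exponential backoff.

    Attempt n sleeps base_delay * 2**n seconds before retrying
    (1s, 2s, 4s, ... with the defaults). The last failure is re-raised
    so the caller can log or alert on it.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

In practice you would treat a 500 response as a retryable error (the trigger was not processed) and treat 4xx responses as non-retryable, since retrying a malformed request will never succeed. Adding random jitter to each delay also helps avoid synchronised retry storms across many clients.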

What happened?

On Thursday, 22nd October, a rapid increase in disk usage on one of our database servers triggered an automatic response to scale the machine size up to accommodate the additional usage. We rely on Amazon Web Services for our infrastructure, and this process is usually seamless.

On this occasion, the database stopped accepting new connections after the scaling process completed, resulting in several API instances starting in an unhealthy state. Engineers manually terminated some instances; auto-scaling mechanisms then kicked in, and the problem gradually resolved. On Friday, a similar incident again prevented new connections.

In both cases, because the database was unavailable, our application servers registered degraded health. When an application instance is degraded, it is automatically removed from service and replaced with a new instance. Since the fault lay with the database rather than the application, this sent our application servers into a restart loop that placed additional strain on our databases and may have exacerbated the issue.
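A common mitigation for this failure mode (a general pattern, not necessarily the exact fix we shipped) is to report liveness and readiness separately: a dependency outage should take an instance out of the traffic rotation without marking the process itself as broken. The sketch below is hypothetical; the names `health_status`, `app_ok`, and `db_ok` are ours for illustration.

```python
def health_status(app_ok: bool, db_ok: bool) -> dict:
    """Separate liveness (restart me) from readiness (route traffic to me).

    Reporting a dependency outage as "not live" triggers the restart loop
    described above; reporting it only as "not ready" lets the orchestrator
    stop routing traffic without replacing otherwise-healthy instances.
    """
    return {
        "live": app_ok,             # restart only if the process itself is broken
        "ready": app_ok and db_ok,  # serve traffic only when dependencies are up
    }
```

With this split, a database outage leaves every instance live but not ready, so the fleet waits out the outage instead of churning through replacement instances that all fail against the same unavailable database.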

How did we fix it?

Early Saturday morning (UTC), our on-call engineers, with assistance from AWS, determined that our primary database was unhealthy and made the decision to fail over to a secondary.

How are we preventing possible similar incidents in the future?

In the short term, we have increased the size of the affected database instances. We also pushed a fix to our application to better report on its health status and to prevent unnecessary restarts. Our engineers are also evaluating the health and size of our other database servers and proactively upgrading all instances that might be at similar risk.

We are actively working on several strategic initiatives to increase the stability of our services, and continuing to work directly with the AWS team to prevent these issues from happening in the future.

We continue to monitor our database server instances and are introducing new alerting, monitoring, and failover mechanisms to prevent, or further reduce the impact of, future disruptions.

What did we do right?

  • Our on-call engineers were notified and responded to the incident within minutes.
  • The on-call engineer woke up other members of the team, so we had multiple eyes on the situation.
  • The on-call team involved AWS Support from the start.
  • Our system is designed so as not to lose any workflow execution once it enters the system; while workflows may have been delayed, they all eventually executed and caught up. Similarly, any trigger that was received was eventually processed, even if it could not be processed right away.
  • We have several layers of redundancy and did not rely on a single database instance, so we were able to fail over to a live replica with no loss of data.

What could we have done better?

  • We should not have relied solely on auto-scaling mechanisms; we should also have proactively set a higher baseline capacity to absorb peak traffic volumes.
  • The failover was delayed by approximately two hours once the decision had been made because a backup was in progress. We failed to take this into consideration, and it prolonged our downtime. We should have failed over before the backup started.
Posted Oct 28, 2020 - 11:10 GMT

Resolved
This incident has been resolved.
Posted Oct 25, 2020 - 00:25 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 24, 2020 - 08:10 BST
Update
We have identified the issue and are now in recovery mode. We still have some degraded performance but will be fully recovered shortly.
Posted Oct 24, 2020 - 04:58 BST
Update
We are aware of issues with the application. Our engineering team is continuously investigating the problem and working on a possible solution. We appreciate your understanding.
Posted Oct 24, 2020 - 04:03 BST
Update
We are seeing elevated error rates from our API. Users may be unable to perform some actions. We are investigating the issue.
Posted Oct 24, 2020 - 01:26 BST
Investigating
We are seeing elevated error rates from our API again. Users may be unable to perform some actions. We are investigating the issue.
Posted Oct 24, 2020 - 01:23 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 24, 2020 - 00:52 BST
Investigating
We are currently investigating this issue.
Posted Oct 23, 2020 - 23:46 BST
This incident affected: Workflow Execution, Embedded (Embedded API, Embedded App), and Workflow Builder (Workflow Builder API, Workflow Builder App, Workflow Builder Logs).