CSV Editor Connector Failures

Incident Report for Tray.io

Postmortem

From Wednesday, December 9th to Friday, December 11th, Tray.io customers using the CSV Editor experienced degraded service.

Customer workflows running between 1 AM and 6 PM (UTC) on December 9th failed intermittently, returning an internal error message.

The CSV Editor is now back to normal operation and customers can continue utilising the connector in workflows. All backlogged step executions have been processed and no action is required from customers.

What happened?

Early in the morning of December 9th, a rapid increase in concurrent CSV Editor step executions caused our database to run out of connections.

The database stopped accepting new connections, creating a backlog of steps that repeatedly retried to connect and ultimately failed.

In the morning, the engineering team introduced global execution throttling for all customers, which eased the load on the database and allowed the infrastructure to recover and clear the request backlog.

By the evening, the team had removed global throttling and enacted fine-grained execution optimisation targeted at only a few specific workflows and operations, to ensure overall platform stability.
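The difference between the two throttling modes described above can be illustrated with a minimal sketch. This is not Tray.io's implementation: the class name ExecutionThrottle, its methods, the limits, and the workflow name "bulk-import" are all hypothetical, chosen only to show how a global cap affects every workflow while a targeted cap affects only the workflows it names.

```python
from collections import defaultdict


class ExecutionThrottle:
    """Toy admission controller for step executions (illustrative only)."""

    def __init__(self, global_limit=None, per_workflow_limits=None):
        # global_limit caps all running executions; None means no global cap.
        self.global_limit = global_limit
        # Only workflows listed here are capped individually.
        self.per_workflow_limits = dict(per_workflow_limits or {})
        self.running = defaultdict(int)

    def try_start(self, workflow_id):
        """Return True and count the execution if it may start now."""
        total = sum(self.running.values())
        if self.global_limit is not None and total >= self.global_limit:
            return False
        limit = self.per_workflow_limits.get(workflow_id)
        if limit is not None and self.running[workflow_id] >= limit:
            return False
        self.running[workflow_id] += 1
        return True

    def finish(self, workflow_id):
        """Release the slot held by a finished execution."""
        if self.running[workflow_id] > 0:
            self.running[workflow_id] -= 1


# Global throttling: every workflow competes for the same two slots.
global_mode = ExecutionThrottle(global_limit=2)

# Targeted throttling: only the hypothetical "bulk-import" workflow is capped.
targeted_mode = ExecutionThrottle(per_workflow_limits={"bulk-import": 1})
```

In the global mode a burst from one workflow delays everyone, which protects the database quickly but bluntly. In the targeted mode, unlisted workflows start without restriction while the listed ones are held to their own cap.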

On Thursday and Friday, the team transitioned to monitoring the incident and started work on several long-term improvements. The database and the underlying services have fully recovered.

How did we fix it?

Early Wednesday morning (UTC), our technical support and engineering teams determined that CSV Editor service performance was degraded. Soon after, the on-call team introduced bandwidth management for the workflows causing the issue, bringing CSV Editor performance back to normal.

How are we preventing possible similar incidents in the future?

In the short term, we have introduced bandwidth control to prevent the database from running out of connections, as well as wider resource alarms to notify the team proactively if database health degrades.

We are actively working on several strategic initiatives to increase our services' stability and continue to work cross-organisation to prevent these issues from happening in the future.

In the long term, we are looking at improving the compartmentalisation of database resources and increasing performance segregation across our database instances.

Finally, we are working with the product team to enhance the connector with features that support a wider range of CSV Editor use cases and increased load.

What did we do right?

Once the incident was created, our engineers were quick to identify the issue and introduce a change that allowed the database to recover, minimising the impact of the incident as much as possible.
After the initial global bandwidth management, the team identified the few workflows creating the concurrent execution issue and throttled only their executions.

What could we have done better?

The CSV Editor connector is one of the longest-standing and most rapidly evolving connectors in the Tray Platform, and its internal documentation was slightly outdated. This created a knowledge gap that aggravated the problem. We will look into updating our internal documentation policies for connectors in the immediate future.
While the database in question had proactive monitoring and alerting, the alerting focused on CPU utilisation rather than on concurrent connections or DDL latency, which prevented us from identifying the issue proactively.
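As a rough illustration of the kind of connection-based alert that was missing, the sketch below classifies connection utilisation against two thresholds. The function name, the 80% and 95% thresholds, and the level names are assumptions made for this example and do not reflect our actual alarm configuration.

```python
def connection_alert_level(active, max_connections,
                           warn_ratio=0.80, critical_ratio=0.95):
    """Classify database connection usage (hypothetical thresholds)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # Fraction of the connection limit currently in use.
    utilisation = active / max_connections
    if utilisation >= critical_ratio:
        return "critical"
    if utilisation >= warn_ratio:
        return "warning"
    return "ok"


# Example: 170 of 200 connections in use is 85% utilisation.
level = connection_alert_level(active=170, max_connections=200)
```

An alert of this shape fires as the connection limit is approached, independently of CPU utilisation, which can stay low while connections are being exhausted.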

Posted Dec 15, 2020 - 18:40 GMT

Resolved

CSV Editor performance has been stabilised.

Posted Dec 10, 2020 - 19:25 GMT

Monitoring

A fix has been implemented and we are monitoring the results. Users should experience improved CSV-editor performance and a reduction in unexpected errors.

Posted Dec 10, 2020 - 12:08 GMT

Update

We have managed to improve the performance of the CSV Editor but it is still degraded to a certain degree. Therefore, CSV Editor steps might still be delayed and fail unexpectedly. Our engineering teams are working continuously to implement a solution.

Posted Dec 09, 2020 - 19:15 GMT

Identified

The issue has been identified. Customers may still experience CSV-editor related step failures and delays. We are working hard to fix this issue.

Posted Dec 09, 2020 - 12:22 GMT

Investigating

We are currently investigating the issue. Users may experience delays and execution failures with unexpected errors related to the CSV-editor connector.

Posted Dec 09, 2020 - 11:29 GMT

This incident affected: Workflow Execution.