From Wednesday, December 9th to Friday, December 11th, Tray.io customers using the CSV Editor experienced degraded service.
Customer workflows running between 1 AM and 6 PM (UTC) on December 9th failed intermittently, responding with an internal error message.
The CSV Editor is now back to normal operation and customers can continue utilising the connector in workflows. All backlogged step executions have been processed and no action is required from customers.
Early in the morning of December 9th, a rapid increase in concurrent CSV Editor step executions caused our database to run out of connections.
Once the connection limit was reached, the database stopped accepting new connections, creating a backlog of steps that retried connecting and ultimately failed.
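The failure mode described above can be illustrated with a minimal sketch. It is a toy model and not the production system: the `ConnectionPool` class, the `run_burst` helper, and the connection limit used are all hypothetical names and values chosen for illustration. It shows how a burst of concurrent steps larger than a hard connection limit leaves the excess steps unable to connect.

```python
import threading


class ConnectionPool:
    """Toy stand-in for a database with a hard connection limit (hypothetical)."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False when every connection is already in use.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def run_burst(pool, step_count):
    """Start step_count steps at once; each holds its connection until the burst ends."""
    accepted = 0
    rejected = 0
    for _ in range(step_count):
        if pool.try_acquire():
            accepted += 1
        else:
            rejected += 1  # In a real system, these steps would retry and add to the backlog.
    for _ in range(accepted):
        pool.release()
    return accepted, rejected
```

With a limit of 5 connections and a burst of 8 steps, `run_burst(ConnectionPool(5), 8)` accepts 5 steps and rejects 3. If the rejected steps retry immediately, they compete with newly arriving steps for the same connections, which is how a backlog builds up.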
In the morning, the engineering team introduced global execution throttling for all customers, which eased the load on the database and allowed the infrastructure to recover and clear the request backlog.
By the evening, the team had removed global throttling and enacted fine-grained execution optimisation targeted at specific workflows and operations only, to ensure overall platform stability.
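The difference between global throttling and targeted throttling can be sketched as follows. This is an illustrative example under stated assumptions, not a description of how the platform implements it: the `WorkflowThrottle` class, the workflow identifiers, and the limits are hypothetical. A single default limit applied to every workflow models the global approach, while per-workflow overrides model the targeted approach.

```python
from collections import defaultdict


class WorkflowThrottle:
    """Caps concurrent step executions per workflow (hypothetical sketch)."""

    def __init__(self, default_limit, overrides=None):
        self.default_limit = default_limit      # limit applied to every workflow
        self.overrides = dict(overrides or {})  # tighter limits for specific workflows
        self.running = defaultdict(int)         # current executions per workflow

    def limit_for(self, workflow_id):
        return self.overrides.get(workflow_id, self.default_limit)

    def try_start(self, workflow_id):
        # Refuse to start a step once the workflow has reached its limit.
        if self.running[workflow_id] >= self.limit_for(workflow_id):
            return False
        self.running[workflow_id] += 1
        return True

    def finish(self, workflow_id):
        if self.running[workflow_id] > 0:
            self.running[workflow_id] -= 1
```

For example, `WorkflowThrottle(default_limit=100, overrides={"heavy-csv": 2})` restricts only the hypothetical `heavy-csv` workflow to two concurrent steps and leaves all other workflows at the default limit.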
On Thursday and Friday, the team transitioned to monitoring the incident and started working on several long-term improvements. The database and the underlying services have fully recovered.
Early Wednesday morning (UTC), our technical support and engineering teams determined that CSV Editor service performance was degraded. Soon after, the on-call team introduced bandwidth management for the workflows causing the issue, bringing CSV Editor performance back to normal.
In the short term, we have introduced bandwidth control to prevent the database from running out of connections, as well as wider resource alarms that notify the team proactively if database health is degrading.
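A proactive resource alarm of the kind mentioned above can be sketched as a simple threshold check on connection usage. This is a minimal illustration only: the `connection_alarm` function and the 70% and 90% thresholds are assumptions made for the example, not the values or tooling used in production.

```python
def connection_alarm(active, max_connections, warn_ratio=0.7, critical_ratio=0.9):
    """Classify database connection usage (thresholds are illustrative assumptions)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage >= critical_ratio:
        return "critical"  # close to exhaustion: page the on-call team
    if usage >= warn_ratio:
        return "warning"   # usage is climbing: notify before connections run out
    return "ok"
```

The point of the lower threshold is to raise a notification while there is still headroom, so that throttling can be applied before the database stops accepting connections.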
We are actively working on several strategic initiatives to increase the stability of our services, and we continue to work across the organisation to prevent these issues from recurring.
In the long term, we are looking at improving the compartmentalisation of database resources and at stronger performance isolation across our database instances.
Finally, we are working with the product team to enhance the connector with features aimed at supporting a wider range of CSV Editor use cases and increased load.