Order delay postmortem

The week before last, we experienced a customer-impacting outage (22nd-23rd August). I want to give some context on what happened, why it happened, and how we’re fixing it for the future.

Firstly, I’d like to apologise to anyone who experienced failed orders or delayed deposits. The quality of service we provide is key to our product and clearly we can do better.

What happened and why?

On Thursday 22nd August we noticed an elevated number of errors when attempting to send deposit files. On investigation, we were seeing an increased failure rate when attempting to send files via FTP (file transfer protocol). We use this to transfer files between ourselves and our service providers. Our current trading partner uses FTP to receive deposits and also to send us contract notes.

We worked with our FTP provider to resolve the issue and, although we saw some improvements, we had yet to identify the root cause. While the issue persisted we were able to manually retry bank deposits, but we decided to disable Google Pay and Apple Pay to reduce customer pain.

On Friday morning we continued to manually process bank deposits. Our FTP provider suspected an issue with Amazon Web Services but was unable to confirm as we were the only customer affected at that time.

At approximately 3 pm on Friday, our FTP provider identified the issue as a breaking configuration change from Amazon Web Services. Once the issue was resolved, our deposit processing and contract note processing started to recover. Unfortunately, this was not quick enough for our batch orders. The backlog of contract notes caused our trading partner to crash at approximately 3:30 pm, disrupting the batch.

We attempted to recover the processing of batch orders but were unable to do so before the market closed, meaning our batch orders didn’t go through.

All customer data was secure at all times.

How we’re fixing it

Since before the outage, we’ve been working to increase the resilience of our FTP connection. This will reduce the amount of manual intervention required to process deposits and contract notes. Some changes have already been applied and we’ll continue to monitor and improve this over the next few days.
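
For the technically curious, here is a minimal sketch of the general idea behind reducing manual retries: a failed transfer is retried automatically with exponential backoff instead of waiting for a person to resend it. This is an illustration only, not the production code. The function name `send_with_retry`, its parameters, and the choice of `OSError` as the retryable error are all assumptions made for the example.

```python
import time


def send_with_retry(send, payload, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload), retrying with exponential backoff on failure.

    Hypothetical helper: `send` stands in for whatever function uploads
    a file; nothing here reflects a real provider API.
    """
    for attempt in range(attempts):
        try:
            return send(payload)
        except OSError:
            # Network errors such as ConnectionError are subclasses of OSError.
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error for manual handling.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the waiting can be replaced in tests. If every attempt fails, the last error is raised, so a persistent outage would still need a human to look at it.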

Over the next few weeks, we’ll completely remove the dependency on FTP for generating contract notes. This will have the extra benefit of making contract notes faster, which should be a welcome improvement for all.

Over the coming months, we’ll also remove the dependency on FTP for all our deposits. This also comes with the added benefit of significantly improving the time it takes to see a deposit. Once a deposit is registered in our platform it will take seconds rather than minutes to appear in your account.

We hope this goes some way to explaining what happened last week and clarifies what we’re doing to improve things in the future. Again, I’d like to apologise personally and on behalf of the team. We’re committed to providing the world’s best investment platform. This experience fell short of the standards we expect to deliver and we’ll work tirelessly to make sure we can redress it.

If you have any other questions, please do send me an email or reach out to me here.

The post on our blog:

27 Likes

thank you for taking the time to clarify - these things happen and it’s to be expected!

3 Likes

A definite nod to Monzo here with this - the process issue explanations given on their community are superbly detailed.

4 Likes

:astonished: presumably you mean SFTP or FTPS rather than FTP ??

To be specific we actually access it over HTTPS through a REST API. Our trading partner uses SFTP. Nothing is sent in the clear.

Thanks for clarifying though!

2 Likes

cheers

:ok_hand: looking forward to the new FT investment platform sorting out these issues …

1 Like

So the reason for the outage is being blamed on AWS changing config? Or someone at the provider or your side made a breaking config change in AWS?

Hi Lewis,

Ultimately the responsibility stops with us. And we’re working to remove dependencies as quickly as we can. With this postmortem we also want to be transparent to give people the complete picture.

In this instance AWS made a breaking change to S3 without communicating it to our FTP provider. This impacted the FTP service they were able to provide to us which led to the issues I’ve described above.

Hope that helps?

Ian

4 Likes

Appreciate the detail in this post. Keeping customers in the loop about these things is a great way to maintain our confidence in you. Keep it up :+1:

3 Likes

We’ve had configuration changes that break internal systems before; it was Vodafone causing our service provider connections to go down. Then the backups were taking too long to bring online, so they decided to just focus on fixing the original problem. Change governance and controls should pick up the issues before going live.

Sometimes it’s hard to know the weaknesses until they break, good luck in reducing your dependence! I hope you have monetary penalties written into your SLAs :wink:

1 Like