On 2023-02-06, Square experienced a service disruption impacting Square payments. Starting at 19:17 UTC, all transactions to Discover started failing due to an external outage. Starting at 19:54 UTC, the disruption spread beyond Discover transactions.
In this postmortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve the disruption, and share our analysis and actions to ensure that we are properly defending our customers from service interruptions like this in the future.
2023-02-06 19:17 UTC Beginning of Discover impact: all authorizations and verifications to Discover start timing out.
19:23 Automated alerting notifies engineering, and multiple teams start investigating.
19:42 issquareup.com updated.
19:54 Beginning of wide impact: timeouts cascade and Square's payment processing is degraded globally.
20:41 After exhausting the quick configuration changes that could isolate the card network traffic, engineers start preparing code changes to quickly reject Discover transactions.
21:45 End of wide impact: the code change that declines card network auths reaches production.
22:00 Discover network recovers, but code to quickly decline auths remains in place.
22:37 Quick declines of Discover transactions are turned off. Card network impact largely ends. Some data remains cached, so a few errors continue.
2023-02-07 00:45 Discover impact ends: All caches have been refreshed and auth declines have returned to normal levels.
00:46 issquareup.com updated to resolved.
This incident revealed areas for improvement in both our technical infrastructure and our engineering processes, several of which we are actively working on.
The widespread impact was caused by a small portion of authorization traffic timing out and our services handling those timeouts poorly. We discovered that for a single upstream processing partner, Square’s systems mark a connection as unhealthy after any timeout of a financial message such as an authorization. As a result, the timeouts from the Discover issue eventually marked all of our connections to that partner as unhealthy, which impacted other transactions. From 19:23 to 19:54 we had enough healthy connections to serve traffic, but once enough connections had gone unhealthy at 19:54, we were unable to serve a significant portion of other traffic. We are actively working on addressing this behavior and will begin testing these improvements next week.
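To illustrate the failure mode described above, here is a minimal sketch of a connection pool in which a single timeout marks a connection as unhealthy. It is a toy simulation, not Square's code: the `Connection` class, the `simulate` function, the pool size, the traffic mix, and the threshold parameter are all assumptions made for illustration. A threshold of 1 reproduces the "any timeout" behavior; a higher threshold shows one possible alternative policy, which is not necessarily the fix Square is pursuing.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """One hypothetical connection to an upstream processing partner."""
    timeout_threshold: int          # consecutive timeouts tolerated before going unhealthy
    consecutive_timeouts: int = 0
    healthy: bool = True

    def record(self, timed_out: bool) -> None:
        """Update health after one request completes or times out."""
        if timed_out:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.timeout_threshold:
                self.healthy = False
        else:
            self.consecutive_timeouts = 0


def simulate(timeout_threshold: int, pool_size: int = 4, repeats: int = 10):
    """Run mixed traffic through the pool.

    Returns (healthy connections remaining, non-Discover requests that
    could not be served because no healthy connection was left).
    """
    pool = [Connection(timeout_threshold) for _ in range(pool_size)]
    # Assumed traffic mix: one request in five goes to the failing network.
    traffic = ["visa", "mastercard", "discover", "visa", "amex"] * repeats
    unserved_other = 0
    for i, network in enumerate(traffic):
        healthy = [c for c in pool if c.healthy]
        if not healthy:
            if network != "discover":
                unserved_other += 1
            continue
        conn = healthy[i % len(healthy)]  # simple round-robin over healthy connections
        conn.record(timed_out=(network == "discover"))
    return sum(c.healthy for c in pool), unserved_other


if __name__ == "__main__":
    print("any timeout marks unhealthy:", simulate(timeout_threshold=1))
    print("three consecutive timeouts: ", simulate(timeout_threshold=3))
```

In this sketch, each Discover timeout removes one connection when the threshold is 1, so a minority of traffic eventually drains the whole pool and unrelated requests go unserved. With a higher threshold, successful requests from other networks reset the counter and the pool stays healthy.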
This outage also illustrated the need to be able to quickly disable traffic that is threatening other Square infrastructure. Had this capability existed, the impact would not have spread beyond Discover transactions. We will be adding it to multiple layers of the payments stack. Until that change is released, our emergency mitigation will remain in place so that it can be re-enabled if needed.
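The idea of quickly disabling a threatening slice of traffic can be sketched as a runtime switch that is checked before a request is sent upstream. This is a hypothetical illustration only: `NetworkKillSwitch`, `authorize`, the `send_upstream` callback, and the `"DECLINED_FAST"` result are invented names, and the sketch does not describe how Square's mitigation or planned change is built.

```python
import threading
from typing import Callable


class NetworkKillSwitch:
    """Hypothetical runtime switch that fast-declines traffic for selected card networks."""

    def __init__(self) -> None:
        self._disabled = set()
        self._lock = threading.Lock()

    def disable(self, network: str) -> None:
        """Start fast-declining all traffic for the given network."""
        with self._lock:
            self._disabled.add(network.lower())

    def enable(self, network: str) -> None:
        """Resume normal processing for the given network."""
        with self._lock:
            self._disabled.discard(network.lower())

    def is_disabled(self, network: str) -> bool:
        with self._lock:
            return network.lower() in self._disabled


def authorize(
    network: str,
    amount_cents: int,
    switch: NetworkKillSwitch,
    send_upstream: Callable[[str, int], str],
) -> str:
    """Decline immediately if the network is switched off; otherwise call upstream."""
    if switch.is_disabled(network):
        # No upstream call is made, so no timeout can occur on a shared connection.
        return "DECLINED_FAST"
    return send_upstream(network, amount_cents)
```

For example, after an operator calls `switch.disable("discover")`, every Discover authorization returns immediately without touching the upstream connection, while authorizations for other networks continue to flow through `send_upstream` as usual.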
We know any disruption is painful for our customers. We are in the midst of a longer-term effort to identify critical payment flows for our sellers and improve those systems’ resiliency to disruption.