What went wrong?
At around 2020-11-18 03:36:00 PST there was an issue with a server we use for internal DNS (Domain Name System) caching. This server resolves a hostname (for example api50.prv.domain.com) to an IP address, and we cache these lookups to maintain speed in our app. The effects of this service being down would have taken a variable amount of time to manifest due to differences in server-side memory caches and in the time between critical DNS lookups.
This caching service is not the name server used by outside resolvers or by our direct monitoring systems, which are hosted elsewhere for redundancy. As a result, those monitoring systems unfortunately did not immediately alert us to the root issue.
The issue took longer than usual to diagnose since its effects were seen gradually and were hard to detect.
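To illustrate why the effects appeared gradually, here is a minimal Python sketch of a hostname cache with a time to live (TTL). It is not our production code: the CachedResolver class, the 300 second TTL, and the 10.0.0.50 address are assumptions made up for the example. It shows a lookup that keeps succeeding from cache after the upstream DNS server goes down and only fails once the cached entry expires.

```python
import time


class CachedResolver:
    """Toy in-process hostname cache with a per-entry time to live (TTL)."""

    def __init__(self, upstream, ttl_seconds, clock=time.monotonic):
        self.upstream = upstream        # callable: hostname -> IP string, raises OSError on failure
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so the example can control time
        self._cache = {}                # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        now = self.clock()
        entry = self._cache.get(hostname)
        if entry is not None and entry[1] > now:
            return entry[0]             # still fresh: the DNS server is not contacted
        ip = self.upstream(hostname)    # missing or expired: the DNS server is needed
        self._cache[hostname] = (ip, now + self.ttl_seconds)
        return ip


def simulate():
    """Return lookup results from before and after a simulated DNS outage."""
    now = [0.0]
    healthy = [True]

    def upstream(hostname):
        if not healthy[0]:
            raise OSError("DNS server unreachable")
        return "10.0.0.50"              # made-up address for illustration

    resolver = CachedResolver(upstream, ttl_seconds=300, clock=lambda: now[0])
    host = "api50.prv.domain.com"
    results = [resolver.resolve(host)]  # warms the cache while DNS is healthy

    healthy[0] = False                  # the outage begins
    now[0] = 120.0
    results.append(resolver.resolve(host))  # answered from cache, no visible error

    now[0] = 301.0                      # the cached entry has now expired
    try:
        results.append(resolver.resolve(host))
    except OSError:
        results.append("FAILED")
    return results
```

In this sketch, servers that warmed their caches at different times would start failing at different times, which matches the staggered symptoms described above.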
At around 2020-11-18 10:25:00 PST we were able to resolve the issue and restore normal operation.
During this period our servers were unable to communicate effectively with each other and normal operations were disrupted.
Following the fix there was a delay in processing some tasks, caused by a backlog of work that had built up during the affected period.
Some events, such as writing feed items, were de-prioritized during this time to ensure that time-sensitive things like journeys and emails caught up as quickly as possible. As a result, some clients may have perceived a longer delay than actually occurred.
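The catch-up ordering described above can be sketched with a priority queue. This is an illustrative Python example only: the PRIORITY table, the task names, and the drain_backlog function are hypothetical and do not reflect our actual job system.

```python
import heapq
import itertools

# Hypothetical priorities: a lower number is processed first.
PRIORITY = {"journey": 0, "email": 0, "feed_item": 1}


def drain_backlog(tasks):
    """Return tasks in processing order.

    Time-sensitive kinds come first; within the same priority,
    tasks keep their original arrival order.
    """
    counter = itertools.count()         # tie-breaker that preserves arrival order
    heap = []
    for kind, payload in tasks:
        heapq.heappush(heap, (PRIORITY[kind], next(counter), kind, payload))

    ordered = []
    while heap:
        _, _, kind, payload = heapq.heappop(heap)
        ordered.append((kind, payload))
    return ordered


backlog = [
    ("feed_item", "f1"),
    ("email", "e1"),
    ("journey", "j1"),
    ("feed_item", "f2"),
]
print(drain_backlog(backlog))
```

Here the feed items arrive first but are written last, so a client watching only the feed would see a longer lag than one watching emails or journeys.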
How will we handle this better going forwards?
We have done a post-mortem to better understand the technical cause of the issue and what was affected. The service involved (BIND 9) is a very standard open source tool that is considered extremely reliable, and we have never had a previous issue with it. Given the evidence we have around what went wrong, we do not think it is likely to be a recurring problem. All the same, we have already updated the architecture we are building going forward to avoid this type of weakness.
We have added an additional, specific alert for this kind of issue which will reduce the time it takes to identify the problem in the future.
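As a rough idea of what such an alert can look like, here is a minimal Python sketch. It is an assumption-laden illustration rather than our actual monitoring: the probe function, the DnsAlert class, and the threshold of three consecutive failures are all hypothetical. The probe uses the standard library call socket.getaddrinfo, which goes through the host's configured resolver, so it only exercises the internal caching server if the host is set up to use it and no other local cache answers first.

```python
import socket


def probe(hostname, resolve=socket.getaddrinfo):
    """Return True if hostname resolves through the host's configured resolver."""
    try:
        resolve(hostname, None)
        return True
    except OSError:                     # socket.gaierror is a subclass of OSError
        return False


class DnsAlert:
    """Fires once when probes have failed `threshold` times in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one probe result; return True when the alert should fire."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly at the threshold so a long outage does not re-alert on every probe.
        return self.consecutive_failures == self.threshold
```

A scheduler would call alert.record(probe("api50.prv.domain.com")) at a fixed interval from inside the private network and page someone when it returns True. Requiring several consecutive failures is a design choice to avoid alerting on a single dropped lookup.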