DNS lookup failures
Incident Report for AchieveIt
Postmortem

Summary

On Thursday, September 2, 2021 from approximately 14:06 UTC until 16:22 UTC our US commercial and government hosting environments both experienced a service interruption related to a configuration error in the service we use to manage our domain name and DNSSEC configuration, GoDaddy. We observed that our monitoring tools reported increasing errors of DNS queries for domain names used by our application, such as my.achieveit.com, during the 136 minutes of the interruption. As the errors increased, it is possible that some users were not able to reach the system. At approximately 16:20 UTC we cleared the interruption by manually resetting the state on our DNSSEC configuration, and this resulted in queries succeeding across all our monitoring regions within minutes.

Root Cause

We confirmed with GoDaddy support that on September 2, 2021, they experienced a failure in their managed DNSSEC service that caused some of the regular updates to DNSSEC records to not succeed. We observed that the outcome of this failure was that one of RRSIG signing key used to sign the DNSKEY records for the achieveit.com domain was not properly rotated and thus expired at 14:05 UTC on September 2. When the key expired, DNS resolvers that were handling queries for our domain began to error out due to the invalid key. Although this was a good indication that DNSSEC protection of our domain was working as desired, in this case it was due to an operational error and not a security fault.

After investigating the issue and working with GoDaddy support to understand the root cause, we decided to toggle the DNSSEC configuration in an attempt to reset the expiration on the RRSIG signing key. At approximately 16:20 UTC we toggle the configuration and immediately observed that the RRSIG expiration was updated and DNS queries began to succeed again. Within a minutes, we observed that all our DNS monitoring errors cleared worldwide.

Mitigation Actions

The main mitigation action was taken by GoDaddy: they verified that they released patches for the DNSSEC system on September 2 and September 3 to correct the failure. AchieveIt will continue to monitor the rotation cycles of our DNSSEC records to ensure the fix GoDaddy put in place works as expected.

Posted Sep 10, 2021 - 15:14 EDT

Resolved
We have confirmed that DNS lookups are working functioning as expected and all services are accessible again. We will perform additional investigation and post a post mortem as soon as we have identified the root cause of the interruption.
Posted Sep 02, 2021 - 12:55 EDT
Monitoring
We have updated a configuration in our DNS records and we believe that has allowed lookups to succeed again. We will continue to monitor the issue and provide an update with additional resolution details shortly.
Posted Sep 02, 2021 - 12:35 EDT
Identified
We have identified a problem with part of the security configuration in our DNS settings that is likely causing the lookup failures. We are working with our DNS provider to try to resolve the issue.
Posted Sep 02, 2021 - 12:13 EDT
Investigating
We are receiving reports that some users are not able to access our production web applications that appear to be related to DNS lookups failing in some public DNS resolvers. We are investigating the issue and will update with additional details.
Posted Sep 02, 2021 - 11:00 EDT
This incident affected: Web Application - Commercial Environment and Web Application - US Government Environment.