On Thursday, September 2, 2021 from approximately 14:06 UTC until 16:22 UTC our US commercial and government hosting environments both experienced a service interruption related to a configuration error in the service we use to manage our domain name and DNSSEC configuration, GoDaddy. We observed that our monitoring tools reported an increasing number of errors on DNS queries for domain names used by our application, such as my.achieveit.com, during the 136 minutes of the interruption. As the errors increased, it is possible that some users were not able to reach the system. At approximately 16:20 UTC we cleared the interruption by manually resetting the state on our DNSSEC configuration, and this resulted in queries succeeding across all our monitoring regions within minutes.
We confirmed with GoDaddy support that on September 2, 2021, they experienced a failure in their managed DNSSEC service that caused some of the regular updates to DNSSEC records to not succeed. We observed that the outcome of this failure was that one of the RRSIG signing keys used to sign the DNSKEY records for the achieveit.com domain was not properly rotated and thus expired at 14:05 UTC on September 2. When the key expired, DNS resolvers that were handling queries for our domain began to error out due to the invalid key. Although this was a good indication that DNSSEC protection of our domain was working as desired, in this case it was due to an operational error and not a security fault.
After investigating the issue and working with GoDaddy support to understand the root cause, we decided to toggle the DNSSEC configuration in an attempt to reset the expiration on the RRSIG signing key. At approximately 16:20 UTC we toggled the configuration and immediately observed that the RRSIG expiration was updated and DNS queries began to succeed again. Within minutes, we observed that all our DNS monitoring errors cleared worldwide.
The main mitigation action was taken by GoDaddy: they verified that they released patches for the DNSSEC system on September 2 and September 3 to correct the failure. AchieveIt will continue to monitor the rotation cycles of our DNSSEC records to ensure the fix GoDaddy put in place works as expected.
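A check of the kind that catches a missed rotation before resolvers start failing, shown here as an added example (it is not AchieveIt's or GoDaddy's tooling). It parses RRSIG records in the text format printed by `dig +dnssec` and grades each signing key by its freshest signature; the sample records, key tags, function names and the three-day warning window are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Constructed sample in `dig +dnssec DNSKEY` presentation format (RFC 4034).
# Only the 14:05 UTC September 2 expiry mirrors the incident; the rest is placeholder data.
SAMPLE = """\
achieveit.com. 3600 IN DNSKEY 257 3 13 PLACEHOLDERKEY==
achieveit.com. 3600 IN RRSIG DNSKEY 13 2 3600 20210902140500 20210819140500 11111 achieveit.com. PLACEHOLDERSIG==
achieveit.com. 3600 IN RRSIG DNSKEY 13 2 3600 20210916140500 20210902130500 22222 achieveit.com. PLACEHOLDERSIG==
"""

def _ts(field):
    # RFC 4034 allows YYYYMMDDHHmmSS or decimal seconds since the epoch.
    if len(field) == 14:
        return datetime.strptime(field, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return datetime.fromtimestamp(int(field), tz=timezone.utc)

def parse_rrsigs(text, covered="DNSKEY"):
    """Return one dict per RRSIG that covers the given record type."""
    sigs = []
    for line in text.splitlines():
        f = line.split()  # owner ttl class RRSIG covered alg labels ottl exp inc tag signer sig
        if len(f) < 13 or f[3] != "RRSIG" or f[4] != covered:
            continue
        sigs.append({"key_tag": int(f[10]), "signer": f[11],
                     "expires": _ts(f[8])})
    return sigs

def check(text, now, warn=timedelta(days=3)):
    """Return (worst status, per-key-tag status), judging each key by its freshest signature."""
    freshest = {}
    for s in parse_rrsigs(text):
        cur = freshest.get(s["key_tag"])
        if cur is None or s["expires"] > cur["expires"]:
            freshest[s["key_tag"]] = s
    if not freshest:
        return "MISSING", {}  # signed zone answering without signatures is itself an alert
    report = {}
    for tag, s in freshest.items():
        left = s["expires"] - now
        report[tag] = "EXPIRED" if left <= timedelta(0) else "WARN" if left <= warn else "OK"
    order = ["OK", "WARN", "EXPIRED"]
    return max(report.values(), key=order.index), report

if __name__ == "__main__":
    print(check(SAMPLE, datetime(2021, 9, 2, 14, 6, tzinfo=timezone.utc)))
```

Grading per key tag matters because a still-valid signature from one key does not compensate for an expired signature from the key that resolvers need to build the chain of trust.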