Slowness loading some plan information
Incident Report for AchieveIt
Postmortem

Summary

On Wednesday, January 12, 2022 at approximately 14:52 UTC (9:52 AM EST) we were notified that some customers were experiencing errors when trying to load plan item data in certain views, including List View and list widgets in Custom Dashboards. We were able to replicate the errors in certain scenarios and began investigating the cause within 10 minutes of the initial report. We also confirmed that most application features and operations on plans and plan items continued to function normally. By 17:00 UTC (12:00 PM EST), we confirmed that the issue was likely related to an increase in the time it was taking to return some data from queries for plan items, and that the issue had only begun on January 12.

Over the succeeding 9 hours, we continued to research and make adjustments to the database. Some of these adjustments caused increased overall load, particularly between 21:00 UTC and 23:00 UTC (4:00 PM - 6:00 PM EST). During this time customers likely experienced significant slowness throughout the application. At approximately 2:00 UTC on January 13 (9:00 PM EST on January 12) we identified a misconfiguration in the primary production database. We corrected that configuration and saw all system operations immediately return to normal.

This incident did not affect the US Government production environment. No data was lost or corrupted in either environment.

Root Cause

The database misconfiguration was caused by maintenance activities we performed during a production release on the evening of January 11, 2022. During the maintenance we inadvertently missed a specific configuration related to the performance of retrieving comments on plan item progress updates. As a result, queries that included this comment data took much longer to complete in certain instances. At times, the application would time out waiting on the query and display an error to the user.

Mitigation Actions

We manage all configurations for our environments in version control, just as if they were code. We have already reviewed and corrected the database configuration under version control to ensure that it is not missing any other configuration information. We are also changing how we store database configuration changes, which are made from time to time to manage system performance and security, so that they reside in a more centralized location and can be reviewed more effectively in our quality processes.
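
As an illustration only, the kind of review described above can be thought of as a comparison between the configuration recorded in version control and the configuration actually present in a database. The following is a minimal sketch under stated assumptions, not a description of our actual tooling: the function name, the item names, and the placeholder definitions are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class ConfigDrift(NamedTuple):
    missing: List[str]      # recorded in version control, absent from the database
    unexpected: List[str]   # present in the database, not recorded in version control
    changed: List[str]      # present in both, but with different definitions


def find_config_drift(expected: Dict[str, str], actual: Dict[str, str]) -> ConfigDrift:
    """Compare version-controlled configuration against what the database reports."""
    missing = sorted(name for name in expected if name not in actual)
    unexpected = sorted(name for name in actual if name not in expected)
    changed = sorted(
        name for name in expected
        if name in actual and expected[name] != actual[name]
    )
    return ConfigDrift(missing, unexpected, changed)


# Hypothetical data: these item names and definitions are illustrative only.
expected_config = {
    "plan_item_lookup": "definition-a",
    "progress_update_comment_lookup": "definition-b",
}
live_config = {
    "plan_item_lookup": "definition-a",
}

drift = find_config_drift(expected_config, live_config)
print(drift.missing)  # ['progress_update_comment_lookup']
```

A check like this, run after maintenance, would flag an item that exists in version control but was not applied to the database.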

Posted Jan 14, 2022 - 08:41 EST

Resolved
We have confirmed that all plans and plan items are loading as expected. We will publish a postmortem with additional information about the incident within the next two days.
Posted Jan 12, 2022 - 21:38 EST
Monitoring
We have confirmed system performance has returned to normal. We will continue to monitor activity for the next hour to confirm that all performance issues have been resolved.
Posted Jan 12, 2022 - 20:15 EST
Update
We have finished applying the changes to the database to improve performance. We are seeing improvements but are continuing to evaluate whether there are any scenarios where displaying list views of plans is not functioning as expected.
Posted Jan 12, 2022 - 18:33 EST
Update
We are in the process of applying changes to the database to alleviate the performance issues. In the process of making these changes, some areas of the application may have limited functionality. We will continue to update as we complete these changes.
Posted Jan 12, 2022 - 16:32 EST
Identified
We have identified that the performance degradation is related to a database configuration. We are working to confirm the root cause and also make configuration adjustments to alleviate the immediate performance issue.
Posted Jan 12, 2022 - 15:44 EST
Investigating
We are investigating an issue that appears to be causing information for some plans to load slowly, and in some instances fail to load.
Posted Jan 12, 2022 - 13:52 EST
This incident affected: Web Application - Commercial Environment.