WiscWeb Outages Update

I would like to provide you with some additional information on the recent outages: what happened, and what we are doing to prevent similar issues in the future.

The outage on Wednesday 1/17 was caused by one of our application administrators starting a database copy process without first verifying that the server had enough disk space to complete it. When the server ran out of disk space partway through, the entire WordPress application paused to wait for additional space. We added a significant amount of disk space as quickly as we could, and once there was enough, the service resumed and the sites came back up. The good news is that no data was lost; the bad news is that there was an outage. To prevent this in the future, we have tripled the amount of disk space on our server, put a monitor in place that notifies us when 60% of the existing space has been used, and updated our procedures to include a step verifying that three times the disk space a process needs is available before we start it. When we move to Amazon Web Services, this issue should not recur, because the disk space will grow automatically to accommodate any requirements.
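For those who want to see what the two new safeguards amount to, here is a minimal sketch in Python. It is an illustration only, not the script we actually run: the function names are hypothetical, and only the two numbers (three times the required space, and the 60% alert level) come from the description above.

```python
import shutil

SAFETY_FACTOR = 3        # require triple the space the process needs
ALERT_THRESHOLD = 0.60   # notify once 60% of the disk has been used


def has_enough_space(required_bytes, free_bytes, factor=SAFETY_FACTOR):
    """Return True if free space covers the requirement times the safety factor."""
    return free_bytes >= required_bytes * factor


def usage_exceeds_threshold(used_bytes, total_bytes, threshold=ALERT_THRESHOLD):
    """Return True if the used fraction of the disk has reached the alert level."""
    return used_bytes / total_bytes >= threshold


def check_path(path, required_bytes):
    """Run both checks against the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "ok_to_run": has_enough_space(required_bytes, usage.free),
        "alert": usage_exceeds_threshold(usage.used, usage.total),
    }
```

A pre-flight step could call `check_path` with the estimated size of the database copy and refuse to start unless `ok_to_run` is true.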

The outage on Thursday 1/18 was caused by a server administrator who, while trying to delete a domain, clicked the wrong buttons. His actions deleted not only the domain but also the files and the database. This should never have happened, but it did. To recover, we needed to restore both the database and the files from backup. It took us about 30 minutes to figure out what had happened and that a restore was needed, and the restore itself took about 2½ hours, so the full outage lasted about 3 hours. Changes made between 3:30am (when the nightly backup was taken) and 10:03am (when the outage occurred) were lost. We were not as prepared for this type of outage as we should have been. In our infrastructure design, we had concentrated so hard on preventing external attacks that might cause this kind of damage that we did not prepare adequately for recovery from internal staff mistakes. We have met to discuss the best approach for system recovery and will address this by making regular copies of the files and database throughout the day and storing them separately. In the event of another catastrophic issue with the database and files, we will be able to point the WordPress application at the copies and restore service much sooner, with less loss of data. We have developed a notification list that reaches the administrators of all of the sites, and we have added a front page to the service so that when sites go down, visitors see an outage page instead of a database error page. We have put solid outage reporting processes in place for all team members to follow. We have also worked closely with the individual who made the error to ensure that nothing like this happens in the future. He is helping write and update procedures so that no one else makes the same mistake. In all honesty, we will never be able to fully prevent this type of mistake from happening. It shouldn’t have happened on Thursday, but we can focus on how to recover more quickly and more effectively, and I am confident that we have done this.
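To make the effect of more frequent copies concrete, the short Python sketch below computes how much data is at risk for a given copy schedule. It is an illustration under stated assumptions: the one-hour interval is hypothetical (we have not yet settled on a schedule), the function names are invented for this example, and the year in the dates is assumed, since only the times of day matter here.

```python
from datetime import datetime, timedelta


def data_loss_window(last_copy, incident):
    """Time between the most recent copy and the incident."""
    return incident - last_copy


def last_copy_before(incident, first_copy, interval):
    """Most recent scheduled copy at or before the incident,
    given copies every `interval` starting at `first_copy`."""
    completed = (incident - first_copy) // interval  # whole intervals elapsed
    return first_copy + completed * interval


# Times from Thursday's outage (year assumed for illustration).
nightly_backup = datetime(2018, 1, 18, 3, 30)
incident = datetime(2018, 1, 18, 10, 3)

# With only the nightly backup: 6 hours 33 minutes of changes were at risk.
print(data_loss_window(nightly_backup, incident))  # 6:33:00

# With a hypothetical hourly copy starting at 3:30am, the latest copy
# would have been taken at 9:30am, leaving 33 minutes at risk.
hourly = last_copy_before(incident, nightly_backup, timedelta(hours=1))
print(data_loss_window(hourly, incident))  # 0:33:00
```

In general, the worst-case loss is bounded by the interval between copies, which is why copying throughout the day shortens the window so much compared with a single nightly backup.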

The outage on Friday 1/26 was caused by a process running on the server that produced a significant spike in server load, which in turn caused the server to run out of memory. When this happened, the server started to kill off processes. One of the processes it stopped was MySQL, which then shut down WordPress. We immediately tried to restart MySQL, but it took a couple of attempts before everything started up again. We became aware of the outage at 12:13pm and were back up at 12:35pm. We made some configuration changes Friday afternoon in hopes of preventing a server crash if this happened again. This morning, 1/29, we experienced another significant spike in server load. The good news is that this time WordPress appears to have paused rather than crashed. However, sites did not load as expected for about 10 minutes. We are currently going through the server logs to learn which process or processes caused these load spikes, and what we need to do to make sure that they do not have such a significant effect on your sites. As soon as we have that information, I will share our findings with you, but this could take a little time. I do want to assure you that we will address this issue in the short term with our current infrastructure and in the long term with Amazon Web Services.

All of these outages are completely unacceptable. The fact that they all happened within such a short period is horrible. I want to assure you, though, that we are fully committed to our customers and that we know how detrimental these outages are to your department. We are doing everything in our power to prevent them from occurring in the future; however, we know full well that issues happen, so we are also focusing on how to notify our customers and restore the service quickly.

If you have any questions about any of this, or if you would like to meet to discuss any concerns that you have, please let me know. I sincerely apologize for all of this disruption.

Cathy Riley