Server crash of August 28, 2015

Time: 0720

The first indication was customers reporting that Wintix was no longer working. Emails were streaming in. Our support lines were jammed. The problem was with the Data5 server – our busiest.

I checked it and found the server was not responding. I logged into the Amazon console. It said the server was working. I started a backup and it completed. I gave a sigh of relief. I started a restore with the knowledge that if we had a good backup, the server would be up in 10 minutes. Alas, that was not the case. The restore never completed.

Time: 0830

I called Amazon tech support. They agreed there was a problem. They said they would investigate and get back to me. Thirty minutes later they called back and confirmed it: “You have corrupted tables. Your data has been restored – but in a read-only mode. You need to do a database dump and restore.” I groaned because now the 10-minute process would take 3-4 hours.

I checked the backup that was supposed to be restoring. It was still running – after an hour. Tech support explained that the backup was made from damaged data. Therefore, it would not be possible to restore from it. I started the database dump.

Time: 1100

I had a few questions and got connected with a tech support agent. He was friendly. I asked him where he was located. He said Dallas, Texas. I then warned him that one of our customers was near him (in Abilene). She goes around with a six-gun on her hip. She would come looking for him if the server was not back up soon. He was quiet. Evidently, such things are taken seriously in Texas.

Time: 1330

The 220 entries of logins, passwords and permissions were put in. Then Webtix was almost working. The remaining problem turned out to be that the DNS settings needed to be updated. Once they were, Webtix started working. Wintix, however, still could not write to the database. It was reading the data on the old Data5 server. This was a second DNS entry that needed to be fixed.
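The two stale DNS entries are the kind of thing a small script can catch after a server move. What follows is only a sketch of that idea, not the procedure we used that day. The hostnames and addresses are made-up placeholders (example.com names and documentation-range IPs), and `find_stale_dns` is a hypothetical helper name.

```python
import socket


def find_stale_dns(expected, resolve=socket.gethostbyname):
    """Return {hostname: (expected_ip, actual_ip)} for every name that
    does not resolve to the address we expect.

    actual_ip is None when the name does not resolve at all.
    `resolve` can be swapped out so the check can be tried offline.
    """
    stale = {}
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            got = None
        if got != want:
            stale[host] = (want, got)
    return stale


if __name__ == "__main__":
    # Placeholder names and address: substitute the real ones.
    new_server_ip = "203.0.113.10"
    names = {
        "webtix-db.example.com": new_server_ip,
        "wintix-db.example.com": new_server_ip,
    }
    for host, (want, got) in find_stale_dns(names).items():
        print(f"{host}: expected {want}, got {got}")
```

Run after the DNS changes, an empty result would mean every listed name points at the new server. Note that a local DNS cache can still return the old address for a while, so a clean run on one machine does not prove every customer sees the change.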

Time: 1930

At this point, the server was up again. We had a new server drive. It was set up as a replication server. It was accessible by everyone. And, it was operating at the proper speed.

We had reached our limits though. We were exhausted and frazzled. We had eaten nothing all day. We were in no shape to continue. We quit.

In retrospect, the problem was hardware related – similar to a hard disk crash. However, our servers don’t use hard drives. Instead, they use solid-state drives (SSDs), which have no moving parts. They are faster and more reliable. But, they are not infallible.

Amazon is the fifth hosting company we have used. And, it has been the most reliable. But it is still not 100% reliable.

For the future, we have switched to Amazon’s version of a replication server. This means that data gets written in two locations. If one piece of hardware fails (or the whole data center disappears) the other part of the network will take over and make the data available. The failover time (time to recover) is about five minutes. We don’t expect this to happen again. But, if it does, we are prepared.
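A failover of about five minutes still means a short window in which connections fail. One common way for client software to ride out such a window is to retry with increasing pauses. The sketch below illustrates that general technique only. It is not how Wintix or Webtix actually behave, and `call_with_retry` and its parameters are hypothetical names chosen for this example.

```python
import time


def call_with_retry(operation, max_wait_seconds=300.0, first_delay=1.0,
                    max_delay=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the time budget runs out.

    Retries only on ConnectionError, doubling the pause after each
    failure up to max_delay. The default budget of 300 seconds is an
    assumption matching a failover of roughly five minutes.
    """
    deadline = clock() + max_wait_seconds
    delay = first_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up if the next pause would run past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Here `operation` would be whatever function opens the database connection or runs a query. Any error other than `ConnectionError` is passed straight through, since retrying will not fix a bad password or a malformed query.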

 
