The afternoon of May 27th was rather quiet, until our phones started ringing constantly. The callers were our customers, and they were all saying the same thing: “Webtix is down.” We checked and, sure enough, Webtix was down. Oh my gosh. We have a lot of customers. What do we do now?
A little more checking revealed a few more things:
The tix1.centerstageticketing.com server was down. However, the tix2, tix3 and tix5 servers were just fine.
Even though Webtix was down, Wintix was working properly on all servers. We were not sure for how long, though. This was especially interesting because each data center uses 2 servers: one for Webtix and one for Wintix. We asked ourselves, if 1 server was affected, why not the others?
We have 7 servers and only 1 was affected.
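For anyone who wants to automate that kind of "which servers are actually down?" check, here is a minimal sketch. It is not the tool we used that day. The function names and the choice of port 443 are assumptions for illustration, and a successful TCP connection only shows that a server is reachable, not that Webtix itself is healthy.

```python
import socket


def tcp_probe(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 443 is an assumption; use whatever port the service listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_hosts(hosts, probe=tcp_probe):
    """Map each hostname to True (reachable) or False (unreachable)."""
    return {host: probe(host) for host in hosts}


def down_hosts(status):
    """Return the sorted list of hosts whose probe failed."""
    return sorted(host for host, up in status.items() if not up)
```

Calling `down_hosts(check_hosts(["tix1.centerstageticketing.com"]))` returns the hosts that did not answer. The `probe` argument is replaceable so the logic can be tried without touching the network.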
We tried calling the company that provided the servers. We couldn’t get through. Evidently their phone lines were jammed.
We started the disaster plan.
Our first priority was to get back up in a hurry. That meant saving everything that could be saved.
A) We checked our backups. They were a couple of hours old, but otherwise fine.
B) We started doing another backup on each server so we could restore to the latest point in time.
C) We started looking around for another company that could supply a server and disk space in a hurry.
Just as we were getting underway with our plan, the tix1 server came back online. We checked it out. Everything was fine. We started to breathe a little easier. We did continue backing up, though.
We checked the (geek) news and rumor mill. It had the story. http://www.theregister.co.uk/2014/05/28/joyent_cloud_down/
Basically, what happened was that one of the sysadmins intended to reboot one of the servers. He accidentally rebooted everything in the data center.
The panic is over. Our customers are fine. And their customers are fine. Everyone is happy. Are there any lessons, though? Yes.
The first thing is that all of this computing stuff is still fallible. Cloud computing has made great strides in reliable service. But it can still fail.
Lesson: Have a backup plan. Backups are still important. If we had not had the backups we did, the failure would have been much more stressful.
Lesson: Back up every day and check the backup. It doesn’t cost anything and you will sleep better.
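If "check the backup" sounds vague, here is a minimal sketch of what a daily check could look like. It is an illustration, not our actual procedure. It assumes the backup is a plain copy of a single file taken while that file was not changing, and the function names and the 24-hour limit are made up for the example.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(source, backup, max_age_hours=24, now=None):
    """Return a list of problems; an empty list means the backup looks healthy.

    Assumes `backup` is a byte-for-byte copy of `source` (hypothetical setup).
    """
    if not os.path.isfile(backup):
        return ["backup file is missing"]
    problems = []
    if os.path.getsize(backup) == 0:
        problems.append("backup file is empty")
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(backup)) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.1f} hours old")
    if sha256_of(source) != sha256_of(backup):
        problems.append("backup checksum does not match source")
    return problems
```

Run it from a scheduled job and alert on any non-empty result. A checksum match only shows the copy is intact. The real test of a backup is still restoring it somewhere and making sure it works.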