The Great TMN Crash of 2014
Posted: Sun Sep 21, 2014 4:28 pm
First, allow me to apologize to each and every one of you for what happened over the weekend. I view any downtime as unacceptable, and I certainly view what happened this weekend as a complete breakdown in how our site should run. For those who are interested, I'm starting this thread to explain and discuss what happened. Here's a breakdown of what went down, starting on Friday night.
Friday:
At 7:45PM I was notified by a member that the site was down. After initial troubleshooting I contacted our hosting provider and learned they were having database server issues, which they hoped to have fixed quickly. I wasn't given an estimate for when we would be back online but was assured there was nothing we could do. Given that our primary database and any backups are managed by our host, moving the site to another server elsewhere wasn't an option.
Saturday:
At 11:30AM I contacted the host for an update. They said they were still looking at it and no completion time was available.
At 2:30PM I contacted the host again. This time I was told that our site wasn't affected by the issue and that our database had too many connections because the software was poorly written. When I told them that the database code hadn't changed in 6 months, they had no answer.
Between 3:00 and 7:30PM I contacted the host numerous times with escalating levels of frustration. I learned they had done unannounced maintenance on our server and it had corrupted a number of databases, of which ours was one. They had tried to repair our database, but it was too corrupted to be saved. I told them to restore from a backup, which they tried, only to find out that the backup was corrupted as well.
At 8:30PM I asked to discuss the situation with a supervisor. At this point we were faced with not having any database backups newer than the end of last year, since all of the ones from this year were on their servers and were all apparently corrupted. After a heated discussion they allowed me to have the raw database files for our site. I began repairing each file manually. The corrupted files were the table that stores posts, with 306,000 records, and the private messages table, with 26,000 records. Over the course of last night and today I have been able to successfully recover both tables.
At 3:45PM today the database had been repaired, checked and installed on a server that we control. The site was brought back up and is functioning.
The site will be a little sluggish for now because the install was hurried in order to get the site running again.
Moving forward, we have been left with no option but to move our site from a VPS environment to a dedicated server of our own. That way we can control backups, site and database settings, and all other aspects of the server. Until we find a new home, the site can run on its current setup without any issues.
There will be a brief downtime when we move the site again. That downtime will be 15-20 minutes and will be one of our normal downtimes that happen late at night.
If you see anything not functioning on the website, let us know. So far everything looks good. Hopefully it will stay that way.
Again, on behalf of the leadership team, I apologize for the downtime. Moving forward we will take steps to make sure it doesn't happen again.