VaultWiki - Wiki for Forum Communities - Recent Server Unavailability

Recent Server Unavailability
by
pegasus

Published on May 26, 2011 12:02 PM

Some time around last Friday, May 20, our server began encountering critical problems, and the site was intermittently unavailable until the machine crashed completely.

What Happened
As it was a vacation weekend for many of us here, we were largely unaware of the issue until we returned on Monday, at which point we contacted the WiredTree data center to have them investigate.

First indications pointed to a runaway cron job, because memory usage rose to 100% before the server first went down. However, after a day of investigation that theory had not panned out, and we ran into another problem: even after each server reboot, we were unable to log into WHM or cPanel to administer the server because "400 attempts to create a session failed". We could not even make changes to the server via SSH, which gave a more informative message: "read-only file system."

Apparently a file system is switched to read-only when there is a serious problem with I/O to the hard drive, which protects the data from further damage. The people at the data center cloned and swapped out the hard drive to no avail; it wasn't until the motherboard was replaced that the server returned to normal. The onboard SATA controller had essentially failed.
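
For anyone who wants to check for this condition on their own machine, here is a minimal sketch, assuming a Linux server where the mount table can be read from /proc/mounts. It is not something we ran during the outage. The function name read_only_mounts and the sample text are hypothetical and only for illustration. Each line of the mount table lists the device, the mount point, the file system type, and a comma-separated list of options, and a read-only mount carries the "ro" option.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample data, not taken from our server.
SAMPLE = (
    "/dev/sda1 / ext3 ro,relatime 0 0\n"
    "/dev/sda2 /home ext3 rw,relatime 0 0\n"
    "proc /proc proc rw,nosuid 0 0\n"
)

print(read_only_mounts(SAMPLE))  # prints ['/']
```

On a real Linux machine, the same function could be given the contents of /proc/mounts instead of the sample text. Some pseudo file systems are normally mounted read-only, so a hit on the root or data partition is the one that matters.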

What We Did Afterwards
The root problem was a faulty controller, which probably queued file updates until the queue consumed all available memory. Even so, we didn't like that memory could be exhausted at all.

We went on a spree of updating server software. Two of the updates (PHP 5.3.6 and XCache 1.3.2) fixed memory leaks present in earlier versions, and the PHP leak was actually likely to occur in a vBulletin forum installation like the one we have here. While the site was down, we also updated MySQL to 5.5 for its reported performance benefits.

What This Means for the Future
The site was down for an extended period of time and during that time all services were unavailable. We apologize for that. However, the issue was largely out of our hands (hardware failure) and as far as we can tell it was not due to any malicious attack by a third party.

If anything like this occurs again in the future (and with computers, we all know it eventually does), I assure you, we will be back. This event has caused us to set up additional measures to protect some services so that we can alert visitors and customers to the problem while it's happening.

As this took most of our attention, almost no work was done on VaultWiki development during that time. As a result, we are about a week behind schedule, but we are already back to work today.
2 Comments
Mokonzi - May 27, 2011

Thanks for the update, pegasus. I've recently had a similar nightmare situation, and can well understand the frustration.

gibigbig - June 13, 2011

great job guys