What Happened
Because it was a vacation weekend for many of us here, we were unaware of the issue until shortly before we returned on Monday, at which point we contacted the WiredTree data center to have them investigate.
First indications pointed to a runaway cron job, because memory usage rose to 100% before the server first went down. However, after a day of chasing that lead without result, we ran into another problem. Even after each server reboot, we were unable to log into WHM or cPanel to administer the server because "400 attempts to create a session failed". We could not make any changes to the server via SSH either, although SSH gave a more informative message: "read-only file system."
A file system typically becomes read-only when there is a serious problem with I/O to the hard drive; the operating system blocks further writes to protect the data from damage. The people at the data center cloned and swapped out the hard drive to no avail; it wasn't until the motherboard was replaced that the server returned to normal. The onboard SATA controller had essentially failed.
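For anyone who wants to check for this condition on their own Linux server, here is a minimal sketch. It is not something we ran during the outage. It assumes the standard `/proc/mounts` layout (device, mount point, file system type, comma-separated options), and the function name `read_only_mounts` is our own invention for illustration.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mount point, file system type, options, then two numbers.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


if __name__ == "__main__":
    # Assumption: running on Linux, where /proc/mounts exists.
    with open("/proc/mounts") as f:
        for mount_point in read_only_mounts(f.read()):
            print("read-only:", mount_point)
```

Some mounts are read-only on purpose, so a hit is only a warning sign when it is a file system that is normally writable, such as `/`.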
What We Did Afterwards
Even though the problem was a faulty controller, which probably queued file updates until the queue consumed all available memory, we still didn't like that memory was exhausted at all.
We went on a spree of updating server software. Two of the updates (PHP 5.3.6 and XCache 1.3.2) fixed memory leaks present in earlier versions. The PHP memory leak was actually likely to occur in a vBulletin forum installation like the one we have here. While the site was down, we also went ahead and updated MySQL to 5.5 to get the reported performance benefits.
What This Means for the Future
The site was down for an extended period, and during that time all services were unavailable. We apologize for that. However, the issue was largely out of our hands (hardware failure), and as far as we can tell it was not due to any malicious attack by a third party.
If anything like this occurs again in the future (and with computers, we all know it eventually does), we assure you, we will be back. This event has prompted us to set up additional measures to protect some services so that we can alert visitors and customers to a problem while it's happening.
Because this took most of our attention, almost no work was done on VaultWiki development during that time. As a result, we are about a week behind schedule, but we are already back to work today.