I thought it might be useful to provide a summary of what happened during the recent downtime. The direct cause of the failure was a defective network card, which set off a sequence of additional problems over the course of three days.
1. User rootdude said that he had 2 GH/s of hashpower he would like to connect to the pool, but that he was limited by port restrictions. We told him that if he was willing to dedicate his equipment here for a significant period of time, we would spend money to accommodate him.
2. Chris purchased the cheapest Gigabit Ethernet card he could find at Micro Center for $12.99. The card was defective.
3. He installed the card in the existing server, then created a new virtual machine and installed a proxy server on it for rootdude. The proxy server listens on a separate IP address bound to the new Ethernet card, which allows rootdude to reach the pool through port 443 (a rough sketch of the idea appears after this list).
4. Chris found that the proxy server could not connect to the Internet, so he troubleshot it. After some research, he mistakenly concluded that the cause of the problem was that the server's BIOS was outdated and incompatible with the Ethernet card.
5. Chris flashed the new BIOS successfully and rebooted the server, but it failed to recognize the disks on the MegaRAID controller. It turns out that LSI's MegaRAID software has a bug where the boot partition cannot be found after certain BIOS version changes.
6. After hours of troubleshooting, Chris tried to find the old BIOS version on the Internet, which would have resolved the issue, but the manufacturer had removed it from its website. The only remaining option was to reinstall the hypervisor.
7. To reinstall the hypervisor, Chris first needed to make a backup onto a 4 TB external hard drive. Since a plain copy on Linux gives no progress output, he was unaware for 12 hours that the copy was crawling because the USB drive kept disconnecting (see the progress-check sketch after this list).
8. When Chris instructed the server to shut down for the reinstallation, it gave the daemon server only 300 seconds to stop all of its coin daemons, which was not enough time. Any daemon that had not shut down cleanly within that window was killed, corrupting its blockchain.
9. While Chris had originally thought he could simply recreate the hypervisor and retain the data drive, he determined that reinstalling Debian required reformatting the entire drive, which meant restoring from the backup he had just made.
10. Chris reinstalled Debian 8 on the hypervisor (rather than sticking with Debian 7, because he didn't want to have to upgrade later), but found that many Debian 8 packages had bugs, which he had to resolve.
11. While the backup was being copied back to the server, the USB controller failed completely, halting the copy while Chris was asleep.
12. When Chris woke up, he pulled the broken USB controller and connected the drives to one of the new servers we had just purchased. He then had to copy the data onto the new servers and transfer it across a network connection to the old server. Because the new servers' hard drives were smaller than the old server's, he also had to split the files into pieces and rejoin them when they arrived back on the main server (see the split/join sketch after this list).
13. Chris was able to start four of the virtual machines successfully. At this point, he discovered that the network card itself had been defective all along, and that none of this process had been necessary in the first place; he could simply have returned the card for a replacement. He exchanged the card at the store, installed the new one, and it worked.
14. When he tried to restart the daemon server, it thrashed and took a long time to load. Chris assumed the daemons were simply slow to start, so he got on the bus back to State College, planning to connect through the cell network and declare everything working.
15. On the bus somewhere near Harrisburg, Chris determined that the daemons were taking so long to load because their blockchains were corrupted and failing integrity checks. He stopped all the daemons and ran a disk check to locate the corrupt sectors in the daemon server's virtual machine image.
16. The transfer bus didn't arrive on schedule, and he wasted two hours waiting on the street without wireless connectivity.
17. The disk check took several hours. When it completed, Chris found that he would have to redownload the blockchains of many daemons, so he started doing that. He also discovered that the litecoin private key had been corrupted, so he had to restore that from backup.
18. After the corrupt blockchain data was deleted, the daemons spent about 12 hours resyncing to get back up to date.
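For anyone curious about item 3, the proxy itself is nothing fancy: something on our side listens on port 443 (which restrictive firewalls usually allow) and relays the raw TCP connection to the pool's normal stratum port. The sketch below only illustrates that idea; the hostnames and the stratum port 3333 are placeholders, not our actual configuration.

```python
# Minimal TCP relay sketch: accept miner connections on port 443 and
# forward them to the pool's stratum port. Hosts and ports below are
# placeholders, not the production setup.
import asyncio

LISTEN_PORT = 443                 # port the firewalled miner is allowed to reach
POOL_HOST = "pool.example.local"  # assumed: wherever the stratum server lives
POOL_PORT = 3333                  # assumed: a typical stratum port

async def pipe(reader, writer):
    """Copy bytes one way until the connection closes."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_miner(miner_reader, miner_writer):
    pool_reader, pool_writer = await asyncio.open_connection(POOL_HOST, POOL_PORT)
    # Relay traffic in both directions at once.
    await asyncio.gather(
        pipe(miner_reader, pool_writer),
        pipe(pool_reader, miner_writer),
    )

async def main():
    server = await asyncio.start_server(handle_miner, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```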
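On item 7, the underlying problem was that a plain cp gives no feedback, so a stalled transfer looks exactly like a slow one. Something as simple as copying in chunks and timing each one would have flagged the flaky USB drive much sooner. The paths, chunk size, and stall threshold below are made up for illustration.

```python
# Sketch: copy a large file in chunks, print progress, and warn when a
# single chunk takes suspiciously long (a hint that the drive is dropping
# out). All names and thresholds are illustrative.
import os
import sys
import time

SRC = "/vms/daemonserver.img"        # hypothetical source image
DST = "/mnt/usb/daemonserver.img"    # hypothetical backup destination
CHUNK = 16 * 1024 * 1024             # 16 MiB per read
STALL_SECONDS = 60                   # warn if one chunk takes this long

total = os.path.getsize(SRC)
copied = 0

with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        started = time.monotonic()
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        elapsed = time.monotonic() - started
        if elapsed > STALL_SECONDS:
            print(f"\nwarning: chunk took {elapsed:.0f}s; drive may be disconnecting",
                  file=sys.stderr)
        print(f"\r{copied / total:6.1%} ({copied // 2**20} MiB copied)", end="", flush=True)
print()
```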
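And on item 12, the extra split-and-join step was only needed because no single disk on the new servers could hold a full image. Conceptually it's the same as split and cat; here is a rough sketch, with the piece size and filenames chosen arbitrarily for illustration.

```python
# Sketch: cut a large image into fixed-size pieces that fit on the smaller
# intermediate drives, then reassemble them in order on the destination.
# The piece size and the .partNNN naming are arbitrary choices.
import glob

PIECE = 500 * 1024**3    # assumed: 500 GiB pieces fit on the smaller drives
BUF = 64 * 1024 * 1024   # stream in 64 MiB reads so memory use stays small

def split_file(path):
    """Write path.part000, path.part001, ... each at most PIECE bytes."""
    with open(path, "rb") as src:
        index = 0
        while True:
            first = src.read(min(BUF, PIECE))
            if not first:
                break
            with open(f"{path}.part{index:03d}", "wb") as out:
                out.write(first)
                written = len(first)
                while written < PIECE:
                    buf = src.read(min(BUF, PIECE - written))
                    if not buf:
                        break
                    out.write(buf)
                    written += len(buf)
            index += 1

def join_file(path):
    """Concatenate the numbered pieces back into a single file."""
    with open(path, "wb") as dst:
        for part in sorted(glob.glob(f"{path}.part[0-9]*")):
            with open(part, "rb") as src:
                while buf := src.read(BUF):
                    dst.write(buf)
```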
In conclusion, there was nothing that could have been done to prevent this issue, short of spending money we don't have on overpriced, but more reliable, servers from a place like Dell. If Chris had suspected that the network card was defective in the first place, the entire sequence of events would never have occurred.
Summary of what happened
- Steve Sokolowski
- Posts: 4585
- Joined: Wed Aug 27, 2014 3:27 pm
- Location: State College, PA
Re: Summary of what happened
Damn. Just ... damn. The only other thing that comes to mind is the phrase 'perfect storm'. Are you sure Chris doesn't drink? 'Cause after that... Well, once the server got back on its little metaphorical rubber feet, I'd want to get maybe a little drunk.
Re: Summary of what happened
Sounds like a day in my life as a tier 3 sysadmin
Good to see things back to normal again (super-normal, perhaps, given the bug squashing that accompanied the work)... I do so love it that the litany of events started with some guy named 'rootdude' asking for an accommodation though. I guess everyone can blame me if they want to.
Once you guys get a handle on the stability of the new proxy on that IP (send me the info and I'll move my other machines there as well to test) - I'll do my part and ante up for more PSUs and rack space and get the extra rig running.
rootdude
- Steve Sokolowski
- Posts: 4585
- Joined: Wed Aug 27, 2014 3:27 pm
- Location: State College, PA
Re: Summary of what happened
We should have this proxy server ready by tomorrow, so we'll get back to you in the morning and you can go to town.