Update on yesterday's issues
Posted: Wed Nov 23, 2016 9:52 am
I wanted to provide a brief update on yesterday's issues.
Chris is hard at work at the hosting site this week installing two new servers and transferring data from the weaker servers being decommissioned to the new ones. The goals of these upgrades include improving the performance of these forums, reducing orphan rates by decreasing CPU response time, and providing excess capacity so that multiple algorithms can be launched in a few months. So far, we have successfully migrated the forums server, and as you can see, its performance has improved dramatically: it now has eight cores available (compared to two) and a fast SSD instead of a slow HDD.
Chris is still moving coin daemons to one of the new servers, and the second new server has yet to be delivered. To improve coin daemon performance, Chris changed the network configuration so that data transmitted between virtual machines on the same server stays on the same hypervisor instead of flowing out to the router one level up. He also teamed network adapters so that 2 Gbps of bandwidth can be used. While changing the network configuration, Chris discovered a bug in the Dell switch firmware. He determined that when a new server is teamed, the existing servers are disconnected and do not reconnect until either the Ethernet cable is removed and reinserted or the switch is rebooted. This bug hit the existing server twice as Chris changed the configurations of the other servers. The first time, he mistakenly believed that the server was offline because a cable had been disconnected, and he unknowingly fixed the problem by removing and reconnecting the cables to be sure they were connected.
The solution to the problem is to unplug the mining server's network cables and plug them back in so that the switch reinitializes, but it took him hours to figure that out. We don't expect the bug to have any significant effect during the rest of this maintenance: now that we know the cause, Chris will simply disconnect and reconnect the server when the bug appears, and it will come back online immediately.
Thanks for your patience!