Testing today with one mining server
Posted: Mon Mar 26, 2018 9:02 am
Today, I'll be restarting one of the mining servers a few times; I'll choose the server at random. Only about 1/4 of customers will be affected by these tests.
When the multiple algorithm code was released in April 2017, it became apparent that we could no longer iterate through every miner and sum their hashrates each time we needed to determine the system's total hashrate. Instead, we wrote code to build lookup dictionaries that were updated every time someone connected to or disconnected from the system. A problem inherent in lookup tables is that they can easily fall out of sync with the actual data behind them if just one update in millions is missed. At the time, the tables tended to diverge from the underlying data because of unknown bugs, so we had to write a routine that synchronized the tables with the real data every five minutes to correct these accumulated errors.
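To make the idea concrete, here is a minimal sketch of what such an incrementally maintained lookup might look like. This is illustrative only; the class and method names are hypothetical and not Prohashing's actual code:

```python
from collections import defaultdict

class HashrateTracker:
    """Hypothetical sketch: keeps running per-algorithm hashrate totals
    so the system-wide total is a dictionary read instead of a sum over
    every connected miner."""

    def __init__(self):
        self.miners = {}                  # miner_id -> (algorithm, hashrate)
        self.totals = defaultdict(float)  # algorithm -> cached total hashrate

    def connect(self, miner_id, algorithm, hashrate):
        # Add the miner's contribution to the cached total on connect.
        self.miners[miner_id] = (algorithm, hashrate)
        self.totals[algorithm] += hashrate

    def disconnect(self, miner_id):
        # Remove the contribution on disconnect. A single missed or
        # double-applied update here is enough to make the cache drift
        # from the real data.
        algorithm, hashrate = self.miners.pop(miner_id)
        self.totals[algorithm] -= hashrate

    def total_hashrate(self, algorithm):
        # O(1) read of the cached total.
        return self.totals[algorithm]
```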
Now that we're getting down to investigating the more minor bugs, I discovered that those five-minute recalculations take up to eight seconds, during which stale shares and disconnects can occur. Before trying to make them more efficient, I want to see whether they are still necessary at all. Therefore, I wrote code that periodically checks the in-memory data against the lookup tables without recalculating them. If this code detects no errors after a few hours of running, then whatever bugs were causing the inconsistencies have been eliminated, and we can reduce the number of network connectivity issues by eliminating the recalculations.
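A diagnostic along those lines might look like the following sketch, continuing the hypothetical tracker above: it recomputes the totals from the raw miner records, compares them against the cached lookups, and logs any drift without correcting it. This is my own illustration of the technique, not the actual verification code:

```python
import logging
from collections import defaultdict

def check_lookup_consistency(tracker):
    """Recompute per-algorithm totals directly from the raw miner records
    and compare them to the cached lookup values. Purely read-only: drift
    is logged, never corrected, so it reveals whether the old bugs remain
    without touching the tables."""
    recomputed = defaultdict(float)
    for algorithm, hashrate in tracker.miners.values():
        recomputed[algorithm] += hashrate

    drifted = []
    for algorithm in set(tracker.totals) | set(recomputed):
        cached = tracker.totals[algorithm]
        actual = recomputed[algorithm]
        if abs(cached - actual) > 1e-6:  # tolerance for float summation order
            logging.warning("lookup drift on %s: cached=%.3f actual=%.3f",
                            algorithm, cached, actual)
            drifted.append(algorithm)
    return drifted
```

Because the check only reads and compares, a clean run over several hours would be evidence that the synchronization routine no longer has anything to correct.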
Since the development environment currently reveals no bugs, I'll need to accumulate this data on one of the production mining servers. I'll deploy the code, collect the data, and then revert. If there is an obvious bug, I'll fix it and redeploy. Once I have enough data, I'll determine the best course for this lookup recalculation: remove it, run it less frequently, make it more efficient, or spend weeks trying to reproduce whatever issue is still causing the tables to fall out of sync with the real data.