System downtime earlier today; breakthrough in tickets

Post by **Steve Sokolowski** » Thu Feb 13, 2020 2:26 pm

The system went offline earlier today for an hour, and was operating suboptimally for two days prior.

One week ago, we noticed that the orphan rates for ethash coins were always close to zero percent, because uncled blocks were being counted as full-credit blocks. The problem caused ethash miners to earn far more than the expected payouts. We issued a temporary correction to count all uncle blocks as full orphans, but then ethash miners were earning too little, so we started work on a permanent fix in the mining servers, which was completed two days ago.

The fix queried the database and incorrectly set the orphan rate to (1 - the actual orphan rate). As a result, coins with few orphans, like Litecoin, appeared to have orphan rates near 100% and went into error because of a mining server failsafe that stops mining coins that have orphan rates higher than 90%. The new code was released on February 11, but the release failed because of this bug, and Chris reverted the changes.
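For anyone who wants to see why an inverted rate hits low-orphan coins hardest, here is a minimal sketch. It is not the actual mining server code; the function names and the 2% example rate are made up for illustration, and only the 90% threshold comes from the description above.

```python
# Hypothetical illustration, not the production code.
FAILSAFE_THRESHOLD = 0.90  # failsafe stops mining above a 90% orphan rate


def buggy_orphan_rate(actual_rate: float) -> float:
    """Reproduces the bug: returns 1 - rate instead of the rate itself."""
    return 1.0 - actual_rate


def failsafe_trips(orphan_rate: float) -> bool:
    """True when the coin would be put into error by the failsafe."""
    return orphan_rate > FAILSAFE_THRESHOLD


actual = 0.02  # assumed example: a coin that orphans 2% of its blocks
print(failsafe_trips(actual))                     # correct rate
print(failsafe_trips(buggy_orphan_rate(actual)))  # inverted rate
```

With the correct rate the failsafe stays quiet, while the inverted value of 0.98 is well above the threshold. Any coin with an actual orphan rate under 10% would be affected the same way.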

After the revert, however, unknown to anyone, incorrect expected payouts were stored in memory in the share inserters and the website public API, which average the last 200 updates to determine the expected payouts for some operations. The failsafe continued to be triggered for some coins periodically over the course of February 12, until all the oldest bad orphan rates were flushed out the back of the averages.
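The lingering effect can be sketched with a fixed-size window. This is only a rough model under assumed numbers: the window size of 200 is from the description above, but the class, the 2% and 98% rates, and the count of ten bad updates are hypothetical, and the real share inserters and API may combine the values differently.

```python
from collections import deque

WINDOW = 200  # the last 200 updates are averaged


class RollingAverage:
    """Keeps the most recent WINDOW values; the oldest falls off the back."""

    def __init__(self, size: int = WINDOW) -> None:
        self.values = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.values.append(value)

    def mean(self) -> float:
        return sum(self.values) / len(self.values)


def updates_to_flush(bad_updates: int, good: float = 0.02, bad: float = 0.98) -> int:
    """Counts good updates needed after a revert before no bad value remains."""
    avg = RollingAverage()
    for _ in range(WINDOW):
        avg.add(good)          # healthy history before the release
    for _ in range(bad_updates):
        avg.add(bad)           # bad release is live briefly
    count = 0
    while max(avg.values) > 0.5:   # a bad value is still in memory
        avg.add(good)              # reverted code reports good values again
        count += 1
    return count


print(updates_to_flush(10))
```

In this model the revert alone changes nothing in memory. The last bad value only leaves after a full window of good updates, or immediately if the process holding the window is restarted.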

The release was attempted again earlier today, and the same issue occurred, so a revert was quickly done again. This time, however, the mining servers were left online for a few minutes before reverting, so more bad orphan rates entered the averages. The bad orphan rates caused the coins to go into and out of error for hours, even after the revert occurred. At that point, the bad code was permanently released, because the fact that the revert did not fix the problem incorrectly led us to believe that the new code could not have been the cause of the issue.

After some more time, the orphan rate calculation in the new code was actually fixed, but the problem was still not resolved: because the bad orphan rates were still in the queue, the coins kept going into and out of error. Finally, Chris recognized that the entire system needed to be shut down and restarted - not just the mining servers - and everything returned to normal.

To prevent an issue like this from occurring again, Chris will update his release procedure to restart all services in the system, not just the mining servers, when a mining server release is issued.

This is a major breakthrough. It is likely that this problem has caused many of the "late" unreproducible tickets we receive after releases that I mentioned recently. Since the problem has been present since December 8, 2017, there have been hundreds of instances where we issue a release, someone submits a ticket stating the issue is not fixed, and then by the time we answer the ticket the problem has gone away and never returns. Changing this release process will make the system much more stable in the future.

We apologize for the inconvenience!