The cause of the problem is a race condition. When the inserter is disconnected, it tries to reconnect by creating a new instance of itself each time a new share arrives. However, if another share arrives after a reconnect attempt has started but before that attempt is confirmed successful, an additional reconnect attempt is started. Each attempt also sets a timeout to check whether its instance is still connected after 60s. The superseded instances never are, since no data is sent over those old connections, so they reconnect again. The correct behavior is to use the auto_reconnect constructor argument, which I wasn't aware of when I originally wrote the code. I have fixed the problem and included the fix in the July 2 performance improvements release.
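For readers who want to see the shape of the race, here is a minimal sketch in Python. It is not the inserter's actual code, and it does not use any real WAMP library: FakeConnection, NaiveInserter, GuardedInserter and run_demo are all hypothetical names made up for illustration. The real fix relied on the library's auto_reconnect argument. The guard shown here is just one generic way to make sure only a single reconnect attempt is in flight at a time.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a client connection; counts how many are opened."""

    opened = 0

    @classmethod
    async def open(cls):
        cls.opened += 1
        await asyncio.sleep(0.01)  # simulated connect latency
        return cls()


class NaiveInserter:
    """Starts a new reconnect for every share that arrives while disconnected."""

    def __init__(self):
        self.conn = None

    async def on_share(self, share):
        if self.conn is None:
            # Race: shares arriving before this await completes also see
            # conn is None and each open a connection of their own.
            self.conn = await FakeConnection.open()


class GuardedInserter:
    """Shares one in-flight reconnect attempt among all waiting shares."""

    def __init__(self):
        self.conn = None
        self._reconnecting = None  # the pending reconnect task, if any

    async def _reconnect(self):
        try:
            self.conn = await FakeConnection.open()
        finally:
            # Allow a fresh attempt after a later disconnect.
            self._reconnecting = None

    async def on_share(self, share):
        if self.conn is None:
            if self._reconnecting is None:
                self._reconnecting = asyncio.create_task(self._reconnect())
            # Every share waits on the same attempt instead of starting another.
            await self._reconnecting


async def count_connections(inserter_cls, shares):
    """Deliver `shares` shares at once and report how many connections were opened."""
    FakeConnection.opened = 0
    inserter = inserter_cls()
    await asyncio.gather(*(inserter.on_share(i) for i in range(shares)))
    return FakeConnection.opened


def run_demo(shares=50):
    naive = asyncio.run(count_connections(NaiveInserter, shares))
    guarded = asyncio.run(count_connections(GuardedInserter, shares))
    return naive, guarded


if __name__ == "__main__":
    print(run_demo())
```

In this toy setup the naive version opens one connection per share that arrives during the reconnect window, while the guarded version opens exactly one. At real share rates, and with each stale instance retrying on its own 60s timer, that difference is what lets the attempts pile up.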
In the meantime, earlier today I took the WAMP server offline for 1 minute to test what was causing it to respond so slowly to calls from customers. That caused the share inserter to make hundreds of thousands of reconnect attempts, run out of memory, and crash. Chris will therefore correct balances over the next hour.
This is a huge find. When the fix is deployed tomorrow, it will resolve all of these issues that have plagued the system for months:
- The CPU usage of the WAMP server was always at 80%, sometimes causing delays in processing requests. After the fix on the development system, usage declined to 2%.
- Bandwidth utilization will decline even further from 2Mbps to 40Kbps. That's down from 60Mbps last week before we turned on compression.
- The share inserter's memory leak has finally been found. The leak caused at least 15 separate incidents in which balance corrections had to be made.
- In testing, the share inserter can now handle 20 times more shares than it could before. It turns out that the complex changes to insert fewer shares into the database were unnecessary. Combined with those changes, the inserter can handle 100x more shares than it could three weeks ago.
- The mining server's CPU usage declined by 3%, a significant improvement on that highly-optimized server.
- The problem of "potentially unhandled rejections" on the forums and website appears to have vanished on the test server, and the "NaN" hashrates have disappeared.