Simple math tells us that we should be seeing numbers closer to 7,800 h/s (780 h/s x 10 instances). However, I was only getting 7,200 h/s – about 92% of my expected rate. If we actually look at the average by instance across those two hours, it becomes very apparent that only specific systems were suffering.
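For reference, pulling that per-instance average is just a `GROUP BY` over the reported hash rates. Below is a minimal sketch of the idea, assuming a MySQL-flavored RDS table named `hashrates` with `instance_id`, `hashrate`, and `reported_at` columns – the table and column names are stand-ins for illustration, not my actual schema:

```python
import pymysql

def average_by_instance(conn, window_hours=2):
    """Return a {instance_id: average hash rate} map over the trailing window."""
    sql = """
        SELECT instance_id, AVG(hashrate) AS avg_rate
        FROM hashrates
        WHERE reported_at > NOW() - INTERVAL %s HOUR
        GROUP BY instance_id
    """
    with conn.cursor() as cur:
        cur.execute(sql, (window_hours,))
        return {instance_id: float(avg) for instance_id, avg in cur.fetchall()}

# Usage: averages = average_by_instance(pymysql.connect(host=..., user=..., password=..., db=...))
```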
With all this data aggregation in place, I figured it’d be simple enough to start weeding out the bad instances within my real c4.2xlarge fleet. The Lambda script responsible for calculating each instance’s average over a specific window was now also charged with tearing down any instances that fell below an arbitrary threshold. I initially chose a fairly conservative value (or so I thought) of 2,500 h/s, roughly 83% of the best-case hash rate of 3,000 h/s. I also added some logging so I could track when instances were getting torn down and at what hash rate. First things first, I looked into how quickly I was tearing down machines.
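The teardown step then just walks that per-instance map and terminates anything below the bar. Here is a rough sketch of that reaper logic using boto3 and the `average_by_instance` helper sketched above – a simplified illustration, not my actual Lambda code:

```python
import logging
import boto3

THRESHOLD_HS = 2500  # cutoff used for this run; roughly 83% of the 3,000 h/s best case
logger = logging.getLogger(__name__)
ec2 = boto3.client("ec2")

def reap_slow_instances(averages):
    """Terminate (and log) every instance whose windowed average is below the threshold."""
    slow = {iid: rate for iid, rate in averages.items() if rate < THRESHOLD_HS}
    for instance_id, rate in sorted(slow.items()):
        logger.info("terminating %s at %.0f h/s", instance_id, rate)
    if slow:
        ec2.terminate_instances(InstanceIds=list(slow))
    return slow
```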
Far more instances ended up falling under this threshold than I expected. What I found interesting, however, is that while I was initially destroying a large number of boxes, things eventually leveled out and I went several hours without having any instances cross the threshold again. It should be noted that this cluster ran for around 36 hours, but the graph only goes up to hour 16 because no other instances fell below the threshold during those final 20 hours.
Next, I wanted to verify that I had set a sane threshold value, so I looked into what hash rates instances were reporting when they got terminated.
Surprisingly, the rates seemed to be fairly evenly distributed between the minimum and maximum values reported at termination. This would make it easy to move the threshold up or down to get a fairly linear increase or decrease in shutdown rates. I had initially expected more clustering of values, assuming that if a neighbor was thrashing their CPU enough to impact my work, the degradation would be more consistent from one affected instance to the next.
While optimizing the performance of each individual instance is great, I was definitely taking a hit in overall performance. There was one point where my fleet was far under the requested 50-instance cap because I was continuously destroying instances. Below is a graph of the number of instances in my fleet over time:
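The fleet-size graph came from sampling the instance count on each run of the monitoring job. A small sketch of how that count could be pulled, assuming the fleet is an EC2 spot fleet (the request ID below is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

def active_instance_count(spot_fleet_request_id):
    """Return how many instances are currently attached to the spot fleet request."""
    resp = ec2.describe_spot_fleet_instances(SpotFleetRequestId=spot_fleet_request_id)
    return len(resp["ActiveInstances"])

# Log this value on every run to build the fleet-size-over-time graph:
# print(active_instance_count("sfr-00000000-0000-0000-0000-000000000000"))
```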
An additional interesting metric is the lifespan of each instance. What I found was that the vast majority of underperforming instances were discovered shortly after launch. There were very few occurrences of an instance performing well for a long period of time before dipping below the threshold.
| Lifespan | Count |
|---|---|
| <5 minutes | 222 |
| 5-10 minutes | 44 |
| 10-30 minutes | 15 |
| 1-2 hours | 52 |
| 5-6 hours | 1 |
| 13-14 hours | 2 |
| 15-16 hours | 1 |
| >24 hours | 50 |
Note: I removed table rows which had a count of zero for the sake of saving space.
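These lifespan figures fall out of the same termination logging: subtract each instance’s launch time from its termination (or last-seen) time and tally the results. A sketch of the tallying, assuming the lifespans are already available as `timedelta` values:

```python
from datetime import timedelta

# Bucket upper bounds matching the table: minute-granularity under an hour,
# hourly buckets up to 24 hours, then a catch-all ">24 hours".
BUCKETS = [
    ("<5 minutes", timedelta(minutes=5)),
    ("5-10 minutes", timedelta(minutes=10)),
    ("10-30 minutes", timedelta(minutes=30)),
    ("30-60 minutes", timedelta(hours=1)),
] + [(f"{h}-{h + 1} hours", timedelta(hours=h + 1)) for h in range(1, 24)]

def bucket_lifespans(lifespans):
    """Tally instance lifespans (timedelta values) into the table's buckets."""
    counts = {label: 0 for label, _ in BUCKETS}
    counts[">24 hours"] = 0
    for span in lifespans:
        for label, upper in BUCKETS:
            if span < upper:
                counts[label] += 1
                break
        else:
            counts[">24 hours"] += 1
    return counts
```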
Overall, this turned into an interesting look into something I had never considered when working with large compute groups. I would like to be able to track this kind of data over a much longer time period to both validate some of the trends identified above and possibly tease out some additional ones. However, cloud mining tends to have very short periods of profitability, and this time was no exception. As of writing this, you lose about $50 a day running a fleet this size.
It’s also worth noting that my use case lent itself exceptionally well to this kind of data collection and tracking – the worker conveniently already reported its processing (hash) rate, making it easy to compare across all nodes of the fleet. That will likely not be the case for the majority of work that gets processed on top of EC2 fleets. However, similar functionality could be implemented by running CPU benchmark tools on nodes prior to accepting work and comparing the results against values from known-good nodes. This won’t catch a node that begins underperforming later, but it could at least tell you if a neighbor is thrashing their CPU when you launch.
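As a concrete (if hypothetical) version of that idea: run a short CPU benchmark such as sysbench on boot, compare the score against a baseline measured on a known-good node, and refuse work if the box comes up short. Nothing below comes from my actual setup; the baseline and tolerance values are placeholders you’d measure yourself:

```python
import re
import subprocess
import sys

BASELINE_EVENTS_PER_SEC = 1000.0  # measured once on a known-good instance (placeholder)
TOLERANCE = 0.85                  # accept work only at >=85% of baseline (placeholder)

def cpu_benchmark_score():
    """Run a short sysbench CPU test and return its events/second figure."""
    out = subprocess.run(
        ["sysbench", "cpu", "--time=10", "run"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"events per second:\s+([\d.]+)", out)
    return float(match.group(1))

if __name__ == "__main__":
    score = cpu_benchmark_score()
    if score < BASELINE_EVENTS_PER_SEC * TOLERANCE:
        print(f"underperforming node ({score:.0f} ev/s); refusing work")
        sys.exit(1)
    print(f"node OK ({score:.0f} ev/s); starting worker")
```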
There is also something to be said for diminishing returns. This kind of architecture can get tricky to implement and costs money to maintain (most notably the RDS instance). While applying this strategy to an applicable fleet may help optimize your instance cost:performance ratio, it very well may be cheaper to use the money you’d spend on a monitoring architecture to just add more boxes to your fleet.