12 Feb, 2018

CPU Degradation and EC2 Spot Fleets OR Why Don't My Miners Run At 100%?

by Jonn Callahan

This post delves into some unanticipated behavior I was seeing with specific instances within my spot fleet: namely, a non-trivial amount of CPU performance degradation.

Intro

The last decade has seen the rise of cryptocurrencies as a prevalent economic force, with billions of dollars flowing through the system on a daily basis. I’ve personally been involved with mining various currencies since around the big Bitcoin (BTC) spike of 2014, when it hit a mind-boggling $1,200. But in late 2017, BTC had another brief but meteoric rise to nearly $20,000. Between this and the release of CPU-mineable coins, I figured it was time to look back into mining via EC2 spot instances. In this post, I’ll explore a particular unforeseen challenge in standing up a large fleet of Verium (VRM) miners.

Setup

First things first, I built a t2.micro running the Verium (VRM) wallet as I intended to solo mine. Latency is one of the most important factors when solo mining, so I didn’t want to route my mining fleet over the internet back to the wallet on my home network. I then created an AMI that was pre-configured to use the popular Verium (VRM) miner fork by effectsToCause. This image would begin mining at launch and start a web server for reporting hash rates.

With this AMI in place, I began benchmarking hash rates against several instance types. Because the VRM mining algorithm is incredibly CPU intensive, I only bothered looking at the C4/5 and M4/5 instance types. After comparing those benchmarks to current spot pricing across all US regions, I found that the best price:performance ratio was achieved by running a cluster of c4.2xlarges within us-east-2. With all of this in place, I spun up a fleet of 50 instances and began mining. Lo and behold, the hash rate across the entire cluster was only about 80% of what I was expecting. Even when cloud mining is profitable, margins are still razor thin. This performance loss removed all profitability and ended up costing me more to run than I was making back.
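For reference, spinning up a fleet like that comes down to a single boto3 call. The snippet below is a minimal sketch rather than my exact configuration; the AMI ID, key pair, and IAM fleet role ARN are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Minimal spot fleet request; the AMI, key pair, and role ARN are placeholders.
response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::111111111111:role/spot-fleet-role",
        "TargetCapacity": 50,                 # size of the mining fleet
        "AllocationStrategy": "lowestPrice",  # chase the cheapest capacity pools
        "Type": "maintain",                   # replace instances that get reclaimed
        "LaunchSpecifications": [
            {
                "ImageId": "ami-0123456789abcdef0",  # pre-baked miner AMI
                "InstanceType": "c4.2xlarge",
                "KeyName": "miner-key",
            }
        ],
    }
)

print(response["SpotFleetRequestId"])
```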

Digging In

My first thought was that my t2.micro was my bottleneck, but a quick check of its metrics showed this was unlikely. Both bandwidth utilization and CPU usage were far under the maximum values. A quick SSH into the box to check RAM usage showed similar results. Next, I decided to check the hash rates of individual machines within the cluster. While the first few were performing more-or-less as I expected, I eventually came across one that was hashing at about 60% of the benchmark.

After giving it some thought, it made sense that this was happening. After all, EC2 instances are just virtual machines. More than likely, this was simply caused by another VM running on the same physical machine attempting to thrash the CPU at the same time I was. However, I admittedly did not take this into account initially and was taken aback when I first found an instance so heavily underperforming.

We Must Go Deeper

Intrigued by this, I decided to add some additional functionality to my AMI. Specifically, I wrote up a quick script to monitor the hash rate and push the results up to a Lambda function sitting behind API Gateway. The Lambda function pushed the results into an RDS store with their respective instance IDs and a timestamp. I then had a second Lambda function run every five minutes via a CloudWatch event. This function would grab all active instances and calculate the average hash rate over a five-minute window.
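For the curious, here is a rough sketch of the two ends of that pipeline: the on-instance reporter that polls the miner's local stats endpoint and posts the rate to API Gateway, and the ingest Lambda that writes each report into RDS. The endpoint URLs, the stats field name, and the database details are all placeholders rather than the values I actually used.

```python
import json
import time
import urllib.request

import pymysql  # assumes a MySQL-flavored RDS instance

# --- On-instance reporter (sketch) ---
API_URL = "https://abc123.execute-api.us-east-2.amazonaws.com/prod/hashrate"  # placeholder
MINER_STATS_URL = "http://127.0.0.1:4048/summary"  # placeholder local miner stats endpoint


def get_instance_id():
    # EC2 instance metadata service
    return urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()


def report_forever(interval=60):
    instance_id = get_instance_id()
    while True:
        stats = json.load(urllib.request.urlopen(MINER_STATS_URL, timeout=5))
        payload = json.dumps({
            "instance_id": instance_id,
            "hash_rate": stats["hash_rate"],  # assumed field name
            "timestamp": int(time.time()),
        }).encode()
        req = urllib.request.Request(
            API_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=5)
        time.sleep(interval)


# --- Ingest Lambda behind API Gateway (sketch) ---
def handler(event, context):
    body = json.loads(event["body"])  # API Gateway proxy integration
    conn = pymysql.connect(host="mining-db.example.internal",  # placeholder
                           user="miner", password="changeme", db="mining")
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO hash_rates (instance_id, hash_rate, reported_at) "
            "VALUES (%s, %s, FROM_UNIXTIME(%s))",
            (body["instance_id"], body["hash_rate"], body["timestamp"]),
        )
    conn.commit()
    return {"statusCode": 200, "body": "ok"}
```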

After getting all this configured, I created a spot fleet request of 10 instances and let it run for about two hours. Because I was more or less just experimenting, I opted for the cheaper c4.large instance type. I benchmarked this instance at around 780 hashes per second (h/s). Below is a graph of the total average processing rate across the entire cluster over those two hours:

Sum average hash rate

Simple math tells us that we should be seeing numbers closer to 7,800 h/s (780 h/s x 10 instances). However, I was only getting around 7,200 h/s, roughly 92% of my expected rate. If we actually look at the average by instance across those two hours, it becomes very apparent that only specific systems are suffering.

Average hash rate per instance

Proactive Mitigation

With all this data aggregation in place, I figured it’d be simple enough to start weeding out the bad instances within my real c4.2xlarge fleet. The Lambda function responsible for calculating each instance’s average over a specific window was now also charged with tearing down any instances that fell below an arbitrary threshold. I initially chose a value I thought was fairly conservative: 2,500 h/s, about 83% of the best-case hash rate of 3,000 h/s. I also added some logging so I could track when instances were getting torn down and at what hash rate it was occurring (a sketch of the teardown logic is below). To start, I looked at how quickly I was tearing down machines.
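The teardown logic itself is short. As a sketch, with placeholder connection details and the hypothetical schema from the earlier snippet, the Lambda boils down to:

```python
import boto3
import pymysql

THRESHOLD_HS = 2500     # threshold described above (~83% of the 3,000 h/s best case)
WINDOW_MINUTES = 5      # averaging window


def handler(event, context):
    # Placeholder connection details; schema matches the earlier ingest sketch.
    conn = pymysql.connect(host="mining-db.example.internal",
                           user="miner", password="changeme", db="mining")
    with conn.cursor() as cur:
        # Average hash rate per instance over the last five minutes,
        # keeping only the instances that fall below the threshold.
        cur.execute(
            "SELECT instance_id, AVG(hash_rate) AS avg_rate "
            "FROM hash_rates "
            "WHERE reported_at > NOW() - INTERVAL %s MINUTE "
            "GROUP BY instance_id "
            "HAVING avg_rate < %s",
            (WINDOW_MINUTES, THRESHOLD_HS),
        )
        laggards = cur.fetchall()

    if laggards:
        for instance_id, avg_rate in laggards:
            # Shows up in CloudWatch Logs for the teardown tracking described above.
            print(f"terminating {instance_id}: {avg_rate:.0f} h/s")
        ec2 = boto3.client("ec2", region_name="us-east-2")
        # With a fleet of type "maintain", replacements eventually restore capacity.
        ec2.terminate_instances(InstanceIds=[row[0] for row in laggards])

    return {"terminated": len(laggards)}
```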

Number of instances torn down

Far more instances ended up falling under this threshold than I had expected. However, what I find interesting is that while I was initially destroying a large number of boxes, things eventually leveled out and I went several hours without any instances crossing the threshold again. It should be noted that this cluster ran for around 36 hours, but the graph only goes up to hour 16 because no other instances fell below the threshold during those final 20 hours.

Next, I wanted to verify that I had set a sane threshold value, so I then looked into what the reported hash rates were when instances would get terminated.

Rates of instances when torn down

Surprisingly, the rates seemed to be fairly evenly distributed between the min and max values reported at termination. This would make it easy to move the threshold up and down to get a fairly linear increase or decrease in shutdown rates. I had initially expected there to be more clustering of values, assuming that if a neighbor was thrashing their CPU enough to impact my work, the impact would be more static.

While optimizing the performance of each individual instance is great, I was definitely taking a hit in overall throughput. There was one point where my fleet was far under the requested 50-instance cap as I was continuously destroying instances. Below is a graph of the number of instances in my fleet over time:

Number of running instances in fleet

Another interesting metric is the lifespan of the instances. What I found was that the vast majority of underperforming instances were discovered directly after launch. There were very few occurrences of an instance performing well for a long period of time before dipping below the threshold.

Lifespan        Count
<5 minutes      222
5-10 minutes    44
10-30 minutes   15
1-2 hours       52
5-6 hours       1
13-14 hours     2
15-16 hours     1
>24 hours       50

Note: I removed table rows which had a count of zero for the sake of saving space.

Final Thoughts

Overall, this turned into an interesting look into something I had never considered when working with large compute groups. I would like to be able to track this kind of data over a much longer time period to both validate some of the trends identified above and possibly tease out some additional ones. However, cloud mining tends to have very short periods of profitability, and this time was no exception. As of writing this, you lose about $50 a day running a fleet this size.

It’s also worth noting that my use-case lent itself exceptionally well to this kind of data collection and tracking – the worker conveniently already reported the processing (hash) rate, making it easy to compare across all nodes of the fleet. This will likely not be the case for the majority of work that gets processed on top of EC2 fleets. However, similar functionality could be implemented by running CPU benchmark tools on nodes prior to accepting work and comparing results to those values from known-good nodes. This won’t allow you to determine if a node begins underperforming, but could at least inform you if a neighbor is thrashing their CPU on launch.
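As a rough illustration of that last idea, the snippet below times a fixed chunk of CPU-bound work at boot and compares it against a baseline measured on a known-good instance of the same type. The baseline and tolerance values are made up for the example and would need to be calibrated per instance type; it is also single-threaded, so it only approximates a noisy-neighbor check.

```python
import hashlib
import time

# Both values are illustrative and must be calibrated against a known-good
# instance of the same type before relying on them.
BASELINE_SECONDS = 4.0   # time the workload takes on a healthy node
ACCEPTABLE_RATIO = 1.15  # tolerate up to ~15% slowdown before refusing work


def cpu_benchmark(rounds=2_000_000):
    """Hash a counter repeatedly to create a deterministic CPU-bound workload."""
    start = time.perf_counter()
    digest = b"seed"
    for i in range(rounds):
        digest = hashlib.sha256(digest + i.to_bytes(4, "little")).digest()
    return time.perf_counter() - start


def node_looks_healthy():
    elapsed = cpu_benchmark()
    print(f"benchmark took {elapsed:.2f}s (baseline {BASELINE_SECONDS:.2f}s)")
    return elapsed <= BASELINE_SECONDS * ACCEPTABLE_RATIO


if __name__ == "__main__":
    # Exit non-zero so an init script can decline to accept work on a slow node.
    raise SystemExit(0 if node_looks_healthy() else 1)
```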

There is also something to be said for diminishing returns. This kind of architecture can get tricky to implement and costs money to maintain (most notably the RDS instance). While applying this strategy to an applicable fleet may help optimize your instance cost:performance ratio, it very well may be cheaper to use the money you’d spend on a monitoring architecture to just add more boxes to your fleet.