Just wanted to start a thread collecting evidence on the resource footprints for bootnodes and validator nodes. So far I’ve used a B1s to host the bootnode and a B1ms to host the validator node, and this has worked out very well:
Sokol validator node
I’m using EastUS since Azure VM costs are lowest there.
Can you help educate me and the rest of the Validators?
When you scan the graphs above, what are the three things you look at to evaluate the health/status of the node?
In this case I’m doing capacity planning to verify whether the nodes I’ve deployed are capable enough to sustain the workloads of the POA bootnode and validator roles. If you see the CPU spending too much time above 70%, you need to scale up your node. Our workloads are currently clearly not CPU bound, though that may change as more people run DApps on the POA networks.
Same goes for I/O - if you see IOPS trending up into double digits, we should explore faster storage options for the node. Since blockchain nodes are I/O-heavy workloads, SSD is always recommended for the validator node. For the bootnode, using an HDD should be fine. Luckily, Azure gives you a very inexpensive B1s instance that also has SSD, so I won’t say no to good freebies.
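To make the CPU rule of thumb concrete, here’s a minimal sketch of the check I’m describing. The 70% threshold comes from the point above; the fraction of samples that counts as “too much time” is my own assumption and tunable:

```python
# Sketch: flag sustained CPU pressure from a series of utilization samples
# (e.g. pulled from your Azure VM graphs). The 70% threshold matches the
# rule of thumb above; max_fraction is an assumed tunable, not an Azure default.

def needs_scale_up(cpu_samples, threshold=70.0, max_fraction=0.25):
    """Return True if more than max_fraction of samples exceed threshold."""
    if not cpu_samples:
        return False
    over = sum(1 for s in cpu_samples if s > threshold)
    return over / len(cpu_samples) > max_fraction

# A mostly idle node with one brief spike is fine...
print(needs_scale_up([12, 15, 9, 85, 11, 14]))   # False
# ...but sustained time above 70% means it's time to scale up.
print(needs_scale_up([72, 80, 75, 30, 90, 78]))  # True
```

The same shape of check works for IOPS: swap in your I/O samples and a double-digit threshold.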
Network ingress/egress is more of a cost factor as we pay for all the network activity crossing the datacenter boundaries, though so far it doesn’t seem significant enough. The local storage I/O footprint, on the other hand, has been increasing monotonically, with a 3x spike at the end of January on both my Sokol bootnode and validator node, so I’m watching that metric closely.
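Since I’m watching that storage metric, here’s a rough sketch of the two things I look for: steady monotonic growth, and sudden jumps like the one at the end of January. The spike factor of 3.0 mirrors that observation and the daily numbers are made up:

```python
# Sketch: watch a per-day storage I/O metric for monotonic growth and for
# sudden spikes. spike_factor=3.0 mirrors the ~3x jump mentioned above;
# both it and the sample data are illustrative assumptions.

def is_monotonic_increasing(series):
    return all(a <= b for a, b in zip(series, series[1:]))

def spike_indices(series, spike_factor=3.0):
    """Indices where a value jumped to >= spike_factor times the previous one."""
    return [i for i in range(1, len(series))
            if series[i - 1] > 0 and series[i] >= spike_factor * series[i - 1]]

daily_io = [10, 12, 13, 15, 16, 50]  # hypothetical daily I/O units
print(is_monotonic_increasing(daily_io))  # True
print(spike_indices(daily_io))            # [5] -- the 16 -> 50 jump
```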
Does that help answer your question?
Thanks for your guidance and explanation. I’ll have to digest your response by monitoring my nodes and looking at the graphs/data … likely further questions will arise.
I do know that the Aura consensus we are using is tightly bound to clocks being synchronized across nodes, for both performance and security, so clock drift is a risk, as is network latency (I imagine). So I do have some questions in this regard, but again they are not well formed yet.
Resource utilization monitoring is a first order approximation for doing capacity planning, but it often works “well enough” (i.e. just throw more cores at it). To ensure latencies meet SLAs, we need to track critical paths of the transactions that make Aura tick. I’m not familiar enough with how it works, but work transfers in any distributed system can be instrumented.
You’d need an APM for this - one that tracks incoming requests and then correlates those with dependencies (call-outs). That’s the only reliable way to quickly triage and diagnose statistical long poles that slow down system throughput. We have an APM in Azure that can handle this, which I described in Scalable, extensible, and robust POA monitoring.
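To illustrate the idea (not the Azure APM itself), here is a toy version of request/dependency correlation: each incoming request gets an id, each call-out records its duration against that id, and you can then rank the long poles. All names here are hypothetical:

```python
# Sketch: the request/dependency correlation an APM performs, reduced to a
# toy tracer. Illustrative only -- not the Azure APM API.

import time
import uuid
from contextlib import contextmanager

TRACES = {}  # request_id -> list of (dependency_name, duration_seconds)

@contextmanager
def request_trace():
    """Start a trace for one incoming request; yields its correlation id."""
    request_id = uuid.uuid4().hex
    TRACES[request_id] = []
    yield request_id

@contextmanager
def dependency(request_id, name):
    """Time one dependency call-out and record it against the request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TRACES[request_id].append((name, time.perf_counter() - start))

# Usage: trace one request with two call-outs, then find its longest pole.
with request_trace() as rid:
    with dependency(rid, "disk_read"):
        time.sleep(0.02)  # stand-in for a slow dependency
    with dependency(rid, "peer_rpc"):
        time.sleep(0.01)

longest = max(TRACES[rid], key=lambda d: d[1])
print(longest[0])  # disk_read
```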