Some ideas on monitoring / recovery and communication


#1

While it is the individual responsibility of each independent validator to secure their node and make sure it performs at its best, we as a group of POA Network validators should also make sure that the network is secure, with 100% up-time, and that transactions are processed without delays.

Highly Available nodes:

During our discussion, Jim had an idea of detecting a problematic node and taking action on it. I don’t think he meant “remove this node from consensus for some time”.

What could be done?

Before you take a look at the list, I would like to point out that this would be done by individual validators IF they choose to do so. It could be done from their control VM or another server that they own, and NOT from some central location. These are just ideas on how to help each validator fulfill their responsibilities better and more easily.

So let’s say each validator has an option to download a script from a Github repo in order to:

  • Detect if the validator node has a problem
  • Within seconds (ideally without ANY downtime) swap Virtual Servers and have a brand new node running. (I can elaborate on how to achieve that in a separate forum post or Wiki instruction. Already tested the “swapping nodes” part; see the sketch after this list.)
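A minimal sketch of the detection half, assuming the node exposes the standard JSON-RPC interface locally. RPC_URL, the thresholds, and swap_to_standby() are all placeholders for whatever a validator’s own setup uses:

```python
# Health-check sketch, run from the validator's own control VM (NOT a
# central location). Assumptions: local JSON-RPC on RPC_URL;
# swap_to_standby() is a hypothetical hook that promotes the standby VM.
import time
import requests

RPC_URL = "http://127.0.0.1:8545"   # assumed local RPC endpoint
CHECK_INTERVAL = 10                 # seconds between checks
STALL_LIMIT = 3                     # failed checks before swapping


def block_number():
    """Return the node's latest block number, or None if unreachable."""
    try:
        resp = requests.post(
            RPC_URL,
            json={"jsonrpc": "2.0", "method": "eth_blockNumber",
                  "params": [], "id": 1},
            timeout=5,
        )
        return int(resp.json()["result"], 16)
    except (requests.RequestException, KeyError, ValueError):
        return None


def swap_to_standby():
    """Hypothetical hook: promote the standby Virtual Server."""
    print("Swapping to standby node...")


def main():
    last_block, stalls = None, 0
    while True:
        current = block_number()
        # A stall is either no RPC answer or no new block since last check.
        if current is None or current == last_block:
            stalls += 1
        else:
            stalls, last_block = 0, current
        if stalls >= STALL_LIMIT:
            swap_to_standby()
            stalls = 0
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```

The swap itself (re-pointing traffic to the standby Virtual Server) is provider-specific, which is why it is left as a stub here.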

Network monitoring system (again, this is just an additional tool; each independent validator is STILL responsible for their own nodes)

What could be done?

POA DEV-OPS Bot

  • Monitor the network and automatically detect any problems
  • Automatically notify all validators by posting to a dedicated dev-ops channel. (That could be Telegram, Slack, IRC, or SMS via Twilio; it is up to the validators’ group to decide. A notification sketch follows below.)
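If Telegram were the chosen channel, the notification half of the bot could be as small as this. BOT_TOKEN and CHAT_ID are placeholders, and sendMessage is part of the public Telegram Bot API:

```python
# Notification sketch, assuming Telegram is the chosen channel.
import requests

BOT_TOKEN = "<bot-token>"   # placeholder
CHAT_ID = "<chat-id>"       # placeholder: the dedicated dev-ops channel


def notify(message: str) -> None:
    """Post an alert message to the dev-ops channel."""
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": message},
        timeout=5,
    )


notify("Validator node: 'NAME' creates 5-second delays on the core network")
```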

Examples of dev-ops bot messages:

  • “Validator node: ‘NAME’ creates 5-second delays on the core network”
  • “Number of Active nodes has changed. Two nodes went offline.”
  • “Activated node swapping for validator node: ‘NAME’”

Acknowledgement process

If there is a problem with an individual node and it gets posted to the dev-ops channel, the owner of that node is responsible for acknowledging the message, so all other validators know that action is being taken.

Example 1:
Dev-Ops-Bot: “Validator node ‘Marat Pekker’ creates 5-second delays on the core network”
Marat Pekker: “Got it”

Example 2:
Dev-Ops-Bot: “HF on core network is scheduled within 7 days. Upgrade is needed.”
Validator-1: “Upgraded”
Validator-2: “Upgraded”
Validator-3: “Upgraded”
Dev-Ops-Bot: “HF on core network is scheduled within 4 days. 3 validators already upgraded: Val-1, Val-2 and Val-3”
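To make acknowledgements machine-readable, the bot could keep simple bookkeeping per pending upgrade. A toy sketch (validator names and message wording are illustrative only):

```python
# Toy sketch of the acknowledgement bookkeeping the bot could keep.
VALIDATORS = {"Val-1", "Val-2", "Val-3"}


class UpgradeTracker:
    """Track which validators have acknowledged a pending upgrade."""

    def __init__(self, validators):
        self.pending = set(validators)
        self.done = set()

    def acknowledge(self, validator: str) -> None:
        """Record one validator's 'Upgraded' reply."""
        if validator in self.pending:
            self.pending.discard(validator)
            self.done.add(validator)

    def status(self, days_left: int) -> str:
        """Compose the periodic reminder message."""
        names = ", ".join(sorted(self.done)) or "none"
        return (f"HF on core network is scheduled within {days_left} days. "
                f"{len(self.done)} validators already upgraded: {names}")


tracker = UpgradeTracker(VALIDATORS)
for v in ("Val-1", "Val-2", "Val-3"):
    tracker.acknowledge(v)
print(tracker.status(4))
```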


#2

I think these are very good use cases, Marat. Thank you for stating this so clearly.

Jim


#3

Regarding the swapping of nodes, I just realized that the enode info would also need to be updated and added to bootnodes.txt.
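One way to automate that step, assuming the new node is Parity and exposes the parity_enode RPC method (a Geth node would use admin_nodeInfo instead). The RPC URL and file path are placeholders:

```python
# Sketch: refresh bootnodes.txt with the swapped-in node's enode URI.
import requests

RPC_URL = "http://127.0.0.1:8545"   # assumed local RPC endpoint
BOOTNODES_FILE = "bootnodes.txt"    # placeholder path


def current_enode() -> str:
    """Ask the freshly swapped-in Parity node for its enode URI."""
    resp = requests.post(
        RPC_URL,
        json={"jsonrpc": "2.0", "method": "parity_enode",
              "params": [], "id": 1},
        timeout=5,
    )
    return resp.json()["result"]


def append_bootnode(enode: str) -> None:
    """Add the enode to bootnodes.txt if it is not already listed."""
    with open(BOOTNODES_FILE, "a+") as f:
        f.seek(0)
        if enode not in f.read():
            f.write(enode + "\n")


append_bootnode(current_enode())
```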