Validators’ node downtime and action protocol


#21
  1. What is a reasonable period of validator node downtime before we can call it an issue? This must exclude common software upgrades.
    30 minutes - Start trying to notify the validator.
    16 hours - If not solved, regardless of whether they have been communicating or not, we kick off a ballot to “pause” the validator. I’m aware this mechanism doesn’t currently exist, but it would be nice to have. I pick 16 hours as being between 12 and 24 so that, regardless of timezone, even if they are sleeping or flying or anything, they are hopefully in an internet-connected location and can address the issue.

  2. How should we notify fellow validators if we observe that one of the nodes goes offline or falls out of sync, causing network performance degradation?
    Telegram support channel; I would also be willing to give my personal phone number and email to other validators so they can notify me in such instances. This would also be a nice addition to https://github.com/poanetwork/poa-governance-notifications (see the monitoring sketch after this list).

  3. How much time should validators wait between notifying a validator about the problem and activating the removal protocol designed by the POA governance model for not meeting performance expectations?
    See my answer to question 1.

  4. Should validators be responsible for compensating missed emission funds for the network if the problem gets fixed within the expected time frame and the removal protocol has not been activated?
    I don’t really think that’s necessary because the amount wouldn’t be significant, but I don’t have a problem with it if everyone else likes this idea.

  5. Should we define an accepted total number of node downtime hours for such incidents within a certain time frame before we can say that a validator does not meet performance expectations? If so, what should that be?
    I also don’t really think it’s necessary, because it’s in the validator’s best interest to keep their node running. If it becomes an issue, we ask them to switch their node service (for example, if a region on AWS or DigitalOcean is acting strangely).
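
To make the notification idea in question 2 more concrete, here is a minimal sketch of a watcher that polls a node’s JSON-RPC endpoint and posts to a Telegram channel once the chain head stops advancing for the 30-minute window suggested above. The RPC URL, bot token, chat ID and intervals below are placeholders I made up for illustration, not anything the validators have agreed on.

```python
# Minimal downtime-alert sketch. Assumptions (mine): the node exposes standard
# JSON-RPC at NODE_RPC_URL, and alerts go to a Telegram bot/chat the validators
# set up themselves. All values below are placeholders.
import time
import requests

NODE_RPC_URL = "http://127.0.0.1:8545"          # hypothetical node endpoint
TELEGRAM_TOKEN = "<bot-token>"                   # hypothetical bot token
TELEGRAM_CHAT_ID = "<support-channel-chat-id>"   # hypothetical chat id
CHECK_INTERVAL = 60          # seconds between checks
STALL_THRESHOLD = 30 * 60    # 30 minutes without a new block -> start notifying


def latest_block():
    """Return the node's latest block number via eth_blockNumber."""
    resp = requests.post(NODE_RPC_URL, json={
        "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1,
    }, timeout=10)
    return int(resp.json()["result"], 16)


def send_alert(text):
    """Post an alert message to the Telegram support channel."""
    requests.post(
        f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
        json={"chat_id": TELEGRAM_CHAT_ID, "text": text},
        timeout=10,
    )


def main():
    last_block, last_change = None, time.time()
    while True:
        try:
            block = latest_block()
            if block != last_block:
                last_block, last_change = block, time.time()
        except requests.RequestException:
            pass  # an unreachable node counts as "not advancing"
        if time.time() - last_change > STALL_THRESHOLD:
            minutes = int((time.time() - last_change) // 60)
            send_alert(f"Validator node appears down: no new block for {minutes} minutes.")
            last_change = time.time()  # only re-alert after another full threshold
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```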


#22

Very reasonable, well stated.


#23

I think all of @johndangerstorey’s suggestions are really good.

On question 1, I would say try to notify via direct message right away if someone notices a node is down. The faster a situation can be remedied, the better.

Question 5 is a tough one. I imagine every validator is different with regard to response time and action. Most have work, family, and other commitments during which they might not necessarily be near a computer to remedy the situation, so specifying arbitrary timelines for this might not be a good idea. For example, software devs who work from home are going to have a faster response time during daytime hours than someone with a different profession.


#24

Hi John, thank you for the constructive feedback. I do have a couple of clarifying questions:

I completely agree that a “pause” ballot would be a better interim measure for a node going offline for a prolonged time; however, as I mentioned above, this mechanism does not exist, and as far as I know there are no plans to work on it. So this won’t work for now.

Even though we are already using all of that to some extent, it doesn’t always work. So how does this help?

You mentioned at the end that you might support this idea if others do too; however, I would like to clarify: what counts as insignificant in this case?
Here is some context: we currently have 22 validating nodes running 24 hours a day, 7 days a week, 365(6) days a year. A single node’s downtime results in a network loss of 32.5 POA per hour (besides the validator’s personal reward). I would agree that one short incident might not be significant, but I am looking at the bigger picture.
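
For transparency, here is the back-of-the-envelope way I would sanity-check a per-hour figure like that. The ~5-second block time and 1 POA of emission per block are assumptions I am making for illustration, not official figures, and they land in the same ballpark as the 32.5 POA mentioned above.

```python
# Rough sanity check of the hourly loss from one missed validator.
# Assumptions (for illustration only): ~5-second blocks, 1 POA of emission
# per block, 22 validators taking turns, and a down validator simply
# skipping its turns.
BLOCK_TIME_S = 5
EMISSION_PER_BLOCK_POA = 1
VALIDATORS = 22

blocks_per_hour = 3600 / BLOCK_TIME_S              # ~720 blocks per hour
missed_per_hour = blocks_per_hour / VALIDATORS     # ~32.7 blocks skipped per hour
lost_emission = missed_per_hour * EMISSION_PER_BLOCK_POA
print(f"~{lost_emission:.1f} POA of emission missed per hour of downtime")
```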

Please correct me if I am wrong, but the way I understood your statement, it covers the case of a repeatedly flaky node where the validator is aware of the issue. If that’s the case, then it should be a no-brainer. But what happens if the validator doesn’t acknowledge the problem?
What this post is addressing is the case where the node goes down and the validator is either unresponsive or unable to address the issue for a prolonged period of time. Simply asking them again and again to switch to another provider won’t have any effect, precisely because they are unresponsive in this scenario.

I do appreciate the feedback; however, I feel it doesn’t fully cover the problem described in the post. The ideas and cases you mentioned are worth discussing in a separate topic, though.


#25
  1. Transient and uncommon downtime for individual nodes should be tolerated; after all, we are a fault-tolerant distributed network. When a node is down for 24h or more, the lost rewards should be repaid by the validator in question. After the first occurrence, we can lower the threshold for the next occurrence to 12h, resetting back to 24h after 6 months.

  2. Do we have a way to track network lag?

  3. The best approach to ensuring fault tolerance is to run multiple instances of the validator node. I hear it’s possible - can we publish a wiki on how to do this? A validator who runs two instances in unrelated environments is extremely unlikely to experience downtime on both nodes concurrently.
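
To make the two-instance idea a bit more concrete, here is a rough sketch of a watchdog that compares the block heights of a primary and a standby instance and only reports problems. It deliberately does not auto-promote the standby, since (my reading, not something settled here) two live instances signing with the same key could be worse than a short outage. The endpoints and lag threshold are placeholders.

```python
# Sketch of a watchdog for a primary + warm-standby validator setup.
# The endpoints, lag threshold and alerting (plain print here) are
# placeholders; this only reports problems and never switches which
# instance is actually signing.
import requests

PRIMARY_RPC = "http://primary.example:8545"   # hypothetical primary instance
STANDBY_RPC = "http://standby.example:8545"   # hypothetical standby instance
MAX_LAG_BLOCKS = 20                           # how far the standby may trail


def block_number(rpc_url):
    """Return the latest block number reported by one instance, or None if unreachable."""
    try:
        resp = requests.post(rpc_url, json={
            "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1,
        }, timeout=10)
        return int(resp.json()["result"], 16)
    except (requests.RequestException, KeyError, ValueError):
        return None


def check():
    primary = block_number(PRIMARY_RPC)
    standby = block_number(STANDBY_RPC)
    if primary is None:
        print("ALERT: primary instance unreachable -- consider promoting the standby manually")
    if standby is None:
        print("WARNING: standby instance unreachable -- no fallback available")
    if primary is not None and standby is not None and primary - standby > MAX_LAG_BLOCKS:
        print(f"WARNING: standby is {primary - standby} blocks behind the primary")


if __name__ == "__main__":
    check()
```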