Validators’ node downtime and action protocol


#1

POA Network validators are a diverse and self-governed group of individuals. They each have been entrusted in a decentralized manner to promote network health, performance and security. Validators must monitor their own nodes and the ecosystem at large to maintain optimal performance.

Objective: As POA Core network validators, we need to define an action protocol for the case when one of the validators’ nodes experiences downtime, degrading network performance and slowing the generation of emission funds designed for network sustainability.

Questions:

  1. What is a reasonable period of validator node downtime before we can call it an issue? This must exclude common software upgrades.

  2. How should we notify fellow validators if we observe that one of the nodes has gone offline or out of sync, causing network performance degradation?

  3. How much time should validators wait between notifying a validator about the problem and activating the removal protocol defined by the POA governance model for not meeting performance expectations?

  4. Should validators be responsible for compensating the network for missed emission funds if the problem gets fixed within the expected time frame and the removal protocol has not been activated?

  5. Should we define an accepted total number of node downtime hours for such incidents within a certain time frame before we can say that a validator does not meet performance expectations? If so, what should that number be?

I encourage all validators to participate in the discussion. The agreed-upon protocol may become a wiki page.


#2

Hello Henry,

Alex asked me to share this here:

Governance Discussion - Validator Node Uptime Protocols


#3

Could you please elaborate more on that?

It was discussed before, and I would agree that this is a fair summary of that discussion. However, it was not claimed to be the ultimate truth. Please suggest specific corrections to the questions, or perhaps define new questions related to the described scenario.


#4

I think that by requiring a validator to reimburse the network for missed emission rewards, we put in place a natural forcing function that incentivizes validators to minimize their downtime.


#5

I don’t think having so many questions is useful for what is basically a first attempt at a discussion. When you ask a group for input and then frame certain vectors at such a granular level, you will basically end up with a drop-off of any meaningful discussion.

I could understand giving context for why a post was created in the first place, maybe giving one’s own opinion on some ideas (provided that is well communicated), and then seeing how the discussion develops.

I’ll be over in the other post, asking people to weigh in with their thoughts. Taking this one slow is better, given the history. As was said recently in the “POA Core Support” Telegram channel, this was discussed over a year ago and the end result was less than ideal.

Going slow to ensure positive forward momentum (regardless of direction) ought to take priority.

UPDATE: Governance Discussion - Validator Node Uptime Protocols


#6

It’s not productive to support multiple topics in the same discussion. I’d encourage validators to stay focused on the problem described in the first post and on possible solutions in the best interests of the network.

I very much expect validators to come up with a reasonable action protocol for the case of a node (or nodes) being down and not participating in block validation.


#7

On the contrary, a post with a clearly defined problem and corresponding questions has a better chance of resulting in an action protocol, which is the main goal of this post. I encourage validators to stay focused on that point. The network has been operating for more than a year, yet we don’t have a concrete, documented action protocol for this very essential operational part of it. Would you agree this is a problem?


#8

Context

Here is the Role of Validator wiki that outlines the key responsibilities of validators. We all agreed to it when we became validators. However, that document lacks a base set of instructions on what to do when those requirements are not met or are violated.

In this particular topic we are discussing the responsibility of running and monitoring the node. A validator’s node is the key operational unit of the network. It is up to each independent validator to ensure their node’s safety and performance. However, reality is less than ideal: nodes may go down for many reasons (updates, infrastructure failure, attacks, etc.). While it’s impossible to cover all cases, we must come up with a base protocol that we invoke in case of such a failure.
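
As a purely illustrative sketch (not part of the proposed protocol), the kind of basic monitoring each of us could run is a script that polls the node’s JSON-RPC `eth_blockNumber` and alerts when the reported height stops advancing. The endpoint URL, polling interval, and thresholds below are arbitrary assumptions for illustration only.

```python
import json
import time
import urllib.request

# Assumed local JSON-RPC endpoint of the validator node (adjust to your setup).
RPC_URL = "http://127.0.0.1:8545"
POLL_SECONDS = 60          # how often to check
STALL_THRESHOLD = 5        # consecutive checks with no new block before alerting


def get_block_number(url: str) -> int:
    """Query eth_blockNumber via JSON-RPC and return the height as an int."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_blockNumber",
        "params": [],
        "id": 1,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.loads(resp.read())["result"], 16)


def main() -> None:
    last_height = -1
    stalled_checks = 0
    while True:
        try:
            height = get_block_number(RPC_URL)
            if height > last_height:
                last_height = height
                stalled_checks = 0
            else:
                stalled_checks += 1
        except Exception as exc:  # an unreachable node counts as a stall
            stalled_checks += 1
            print(f"RPC error: {exc}")
        if stalled_checks >= STALL_THRESHOLD:
            # Replace this print with your own alerting (email, Telegram, etc.).
            print(f"ALERT: node has not advanced past block {last_height}")
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```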

In the current situation with 22 validating nodes, 1 node-hour of downtime results in a ~5% increase in block validation time and the loss of 32.5 POA for the emission contract that is designed to keep the network sustainable.
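
For reference, here is the rough arithmetic behind those figures, assuming (for illustration) a 5-second block time and 1 POA of emission per block:

```python
# Rough arithmetic behind the ~5% / 32.5 POA figures.
# Assumptions (for illustration): 5-second block time, 1 POA of emission per block.
BLOCK_TIME_S = 5
EMISSION_PER_BLOCK_POA = 1
VALIDATOR_COUNT = 22

blocks_per_hour = 3600 / BLOCK_TIME_S                # 720 blocks per hour
one_node_share = blocks_per_hour / VALIDATOR_COUNT   # ~32.7 blocks per node-hour
print(one_node_share * EMISSION_PER_BLOCK_POA)       # i.e. roughly the quoted 32.5 POA

# One missed step per round of 22 means 21 blocks where there would have been 22,
# stretching the average time per produced block by roughly 1/21.
slowdown = VALIDATOR_COUNT / (VALIDATOR_COUNT - 1) - 1
print(f"{slowdown:.1%}")                             # ~4.8%, i.e. the quoted ~5%
```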

Action

The most obvious solution is to initiate a ballot to vote out the offending node. I would like to remind everyone that this is currently the only existing option in our on-chain governance. There are no “temp” solutions at the moment, so please refrain from proposing new types of ballots here. If such new ballots are implemented in the future, we can always revisit this.

Pre-Action

However, it’s not optimal to initiate a vote-out ballot every time a node goes down. We must agree on a reasonable amount of time and number of attempts to contact the validator and fix the node. Spinning up a new node from scratch takes no longer than 30 minutes; recovering from a backup takes about 10 minutes. Any troubleshooting attempt that takes longer than 1 hour should be abandoned as not optimal and the recovery option activated. There is absolutely no reason for a node to be down for, say, 3 hours if the validator is online and aware of the problem.

Another question is what to do if the validator is not online or doesn’t have access to the needed tools, also taking into consideration the human aspect: time of day and personal circumstances. I think it’s reasonable to initiate the Action part if the node has been down for 24+ hours with no response or feasible solution from the validator.

However, to boost motivation and minimize downtime, we should also consider penalization options. For example, for X (not sure about the number here) or more hours of downtime, the validator with the offending node should reimburse the network for the emission reward lost while offline, as sketched below. Such a motivation mechanism should also stimulate discussion around redundancy and better security.
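
To make the reimbursement idea concrete, here is a hypothetical sketch under one possible reading of the proposal; the X-hour threshold and the per-hour loss figure are placeholders, not agreed values:

```python
# Hypothetical reimbursement calculation; the threshold X and the per-hour
# emission loss are placeholder assumptions, not agreed protocol values.
EMISSION_LOSS_PER_HOUR_POA = 32.5   # estimate above, with 22 validators
THRESHOLD_HOURS_X = 12              # the "X" from the proposal, to be agreed on


def reimbursement(downtime_hours: float) -> float:
    """POA owed to the emission funds contract if downtime reaches X hours."""
    if downtime_hours < THRESHOLD_HOURS_X:
        return 0.0
    # One reading: reimburse the full emission loss for the whole offline period.
    return downtime_hours * EMISSION_LOSS_PER_HOUR_POA


print(reimbursement(8))    # 0.0   -> below the X-hour threshold
print(reimbursement(30))   # 975.0 -> 30 hours * 32.5 POA
```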

I realize I haven’t considered several aspects of the proposed protocol and have most likely missed something; I will add/update as we discuss.

Thoughts?


#9

"To spin up a new node from scratch takes no longer than 30 mins. Recover from backup – ~10 mins. "

Re-sync has been shown on multiple occasions to take significantly longer than this. Spinning up from scratch takes even longer… 24 hours? How is that reasonable?

But maybe in your opinion it is. Fine. I’m going to push for whatever we as validators agree to (which is itself another topic) to be hard-coded… so that we DO NOT need to begin the balloting process. I have seen what I can only describe as the moving of goalposts, or selective enforcement, so I would rather this not even be an option.

What I mean: X demands we ought to have Y in the name of security / safety / the children / whatever. This is then applied to people Z. Then when this same issue arises for X, well… they had a rough day and maybe show some grace. <- I wish to avoid this entirely on this subject.

So be sure that you can yourself perform to whatever goal you publicly state. That’s all I’m saying.


#10

Clarifying questions to ponder:

  1. AWS regions have gone dark in the past. What do you do in this case? Should this validator be doubly penalized?

  2. Some validators have, in the past, purposely used under-spec’ed VMs… resulting in network lag. Should they be hit up to pay too?

  3. Validators who stated that they would be traveling and away from the internet, and who asked for a process or procedure (hell, dare I say “help”) to ensure uptime, were given no clear direction. Some validators basically stated that you should always be prepared. Are these individuals, who are trying to do the right thing, to be penalized should a Black Swan event happen?

In the blockchain world, you get what you incentivize. Downtime == payout. Hmm…

I worry that this would simply incentivize fellow validators to launch even more attacks on the network, but now against specific nodes. Granted, we should never tell anyone where our instance(s) are hosted, but I suspect this information is trivially obtained.


#11

I’m wondering where you got that number from. As far as I am aware, we do not do a full re-sync on a new node in order to get it operational. Spinning up a node using ansible takes no longer than 7 minutes; I added another 23 minutes for a sync of non-ancient blocks and snapshots (if that is the case). Usually a freshly spun-up node is fully operational within 30 minutes.

Please correct me if I am wrong cc: @phahulin, @igorbarinov

Isn’t that the point of initiating and/or supporting the discussion? To agree on and implement the action protocols that apply to all validators?


#12

I launched a new POA Core instance on Vultr (they accept bitcoin) with the following spec:

CPU: 4 vCore
RAM: 8192 MB
Storage: 100 GB SSD

Here is a video of the snapshot sync (about 2 minutes to sync):
https://asciinema.org/a/va3RylmwhImnmp7JoEpfUiOyJ


#13

I would like to point out again, this whole thing is open for a discussion. I encourage everyone to ask questions, but most importantly answer those already asked.

Cases like that are up for debate. I think that if a WHOLE region goes down, we would have a different set of problems. However, I am happy to answer: usually, if a “region” (let’s not go too deep into definitions here) goes down, it’s either a natural disaster or some sort of malfunction. In the classic world, almost every contract has a clause about natural disasters, so let’s leave that one out. If it is a provider malfunction, most global providers are very quick about addressing those kinds of issues. We are talking minutes; in rare cases it might take 3+ hours. For example, the longest recent AWS downtime I can remember was about 5 hours, in 2015. That is still within the proposed 12-hour grace window of no response and no feasible solution from the validator (I hadn’t specified a number initially, but for the sake of discussion I will give one).

I don’t see any reason to enforce this retrospectively; it just doesn’t make sense. For starters, the network emission contract took effect only 2 months ago. This is still up for debate if you wish, but I realize it would be hard to provide proof of every past downtime and to calculate the loss given the variable validator count. I propose to keep things simple.

In my opinion, this would still apply. Traveling and other personal matters are not an excuse to neglect your primary validator responsibilities; let’s separate the professional and personal aspects of it. As a validator, you ARE responsible for the security and stability of your node. Please correct me if I am wrong. Notifying other validators of upcoming travel or offline status is always a great idea and will direct the course of action. However, if you are traveling, your node is down for a prolonged period of time, and you are neither responsive nor making an attempt to fix it, I don’t see any other way than to vote it out.

I would like to remind you of the current timelines: I don’t propose any governance action for node downtime below 24 hours, so realistically we are talking about 24+ hours of the node being offline and the validator being non-responsive. Add 48 hours (the current minimum ballot time), and that results in at least 72 hours for the validator to address the issue before action measures take effect. Also, you are not obligated to vote right away and may wait until closer to the end of the ballot to allow the validator a chance to restore the node’s functionality. Nonetheless, as a validator, you are responsible for enforcing network security and stability, as well as participating in governance, as written in the Role of Validator wiki. I would also propose posting the results of such governance actions in the forum post related to the issue, for documentation purposes.

I would like to assure you this is not the intent of such a protocol. All validators should have a common interest – network stability and security. And I am willing to work with others to ensure that.

Hope I answered all your questions in full. Feel free to add more constructive and reasonable suggestions to the matter.


#14

Thanks. A good example is better than 1000 words.
I guess I was too generous with my estimations of 10-30 mins to allow extra time to remember all the configs and commands.


#15

We do sometimes require a resync during an upgrade, if there were changes in the underlying blockchain database or issues were found that required a resync.
Once snapshot syncing is complete, syncing ancient blocks can take more time. During this process, some validator nodes have been observed to be sluggish and to occasionally delay/skip blocks.


#16

Thanks for the input. I believe this very much aligns with the original statement:

Planned updates, re-syncs, and maintenance should be considered and described in the action protocol.


#17

Updates can be performed on a second validator node. It’s OK for validators to host two nodes in order to avoid downtime.


#18

This topic is my raw attempt to define a validator node uptime/downtime action protocol that validators can refer to if or when node downtime happens in the future. I believe this is very much needed for Proof of Autonomy. All 5 of the original questions are just conversation starters that I thought would be helpful in defining such a protocol. We have some good input already, and I want to know what other validators think.


#19

#20

Sorry, I can only tag 10 at a time in one post.