Sokol Node Down - A Deconstruction of What Happened

So my node was down - and perhaps it is of interest to understand what happened, why, and what corrective actions were needed.

Murphy’s law was in full force. Woke up, looked at the networks real quick - both Sokol and Core looked great for me. Went to work. As I was beginning my day, Jim texted me asking that I look at my Sokol node, as it was down. “Wunderbar,” I thought, as I was just starting my classes. I really appreciated that heads up, but felt terrible, as I was ‘stuck’ without a means of fixing it right away…

You see, all my passwords live in a password manager and my ssh keys are contained within my OnlyKey (https://onlykey.io/), meaning I could have performed node maintenance… I just would have needed to know the IP address of the instance needing work.

I typically get this by logging into my AWS account to make sure that I’m on the correct VM. And this was where the issues began. I use physical MFA devices where possible, and AWS (until recently) only allowed the use of a fob by Gemalto. Think big-o-key fob, old school style. And it gets a lot worse: this physical key can only be associated with a single IAM user. So I literally had to buy 9 of these and create multiple IAMs. I keep a set in my safe and the others offsite - i.e. not on me, as my pockets just are not that big.

It should be noted that AWS recently began allowing the use of U2F devices, and switching over was on my todo list, as I already carry such a device on my keychain.

So my inability to log into AWS and double-check the IP address I needed to use was a result of my using a key fob from the 90s and being physically separated from it. The corrective action here is to keep a copy of node IP addresses in a secure location for easy lookup (the key word being secure).
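
One low-tech way to do this, just as an example (the filename here is only an illustration), is a small gpg-encrypted text file:

    # encrypt the list; gpg prompts for a passphrase and writes node-ips.txt.gpg
    gpg --symmetric --cipher-algo AES256 node-ips.txt
    # remove the plaintext once the encrypted copy exists
    shred -u node-ips.txt
    # decrypt and print the list whenever an address is needed
    gpg --decrypt node-ips.txt.gpg

A password manager entry works just as well; the point is that the lookup shouldn’t depend on first getting past AWS MFA.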

Even after performing the necessary updates, my node was down and out. It couldn’t find peers, so I updated my bootnodes.txt file (as I hadn’t done this since starting my node) using:
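
Something like the following, assuming the current list is published on the sokol branch of the poa-chain-spec repo (the URL and destination path here are assumptions, so double-check them against your setup docs):

    # from the validator folder, pull the latest bootnodes list
    cd /home/validator
    sudo curl -o bootnodes.txt https://raw.githubusercontent.com/poanetwork/poa-chain-spec/sokol/bootnodes.txt
    # restart parity so the new list is picked up
    sudo systemctl restart poa-parity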

And still nothing. Well, it seemed like my node was just limping along. This is where Jim really helped out, as he suggested that I kill my db. I must admit, I wouldn’t have thought of this - I was still going on the idea that my instance was just having a hard time seeing peers. The steps were:

  1. Log into the node via ssh
  2. Stop parity ( >sudo systemctl stop poa-parity)
  3. Enter the validator folder ( >cd /home/validator)
  4. Kill the db ( >sudo ./parity --config node.toml db kill)
  5. Restart parity ( >sudo systemctl start poa-parity)

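Put together, the whole sequence on the node was:

    sudo systemctl stop poa-parity               # stop the parity service first
    cd /home/validator                           # the validator's working directory
    sudo ./parity --config node.toml db kill     # wipe the chain database
    sudo systemctl start poa-parity              # bring parity back up; it re-syncs

Note that db kill removes the blockchain database but leaves keys and config alone, which is why this is a safe move on a validator.
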
And my node came up in no time at all. I was concerned that it would take forever to re-sync; however, that was not the case.

Hopefully this is of use to someone else in the future.