Scalable, extensible, and robust POA monitoring

First, a bit about why I’m qualified to offer an opinion on the subject :) My professional background is QoS and QoE (Quality of Service/Experience) of large-scale systems, with a specialization in performance engineering. I’ve spent the past decade building full-stack SLA-bound systems informed by usage data, as well as driving product and services development leveraging such systems. My most recent experience in this field includes ownership of the Power BI fundamentals from inception through its GA, as well as the new Azure APM called Application Insights, which enables multi-cloud/hybrid/on-prem SaaS monitoring with visibility into the underlying PaaS/IaaS layers.

While POA is still very young, in particular given its distributed nature and the spectrum of validators’ familiarity with its Ops requirements, we are already running into use cases where actionable monitoring is needed. Following our teleconference today I wanted to start a thread to help us evolve our visibility into and understanding of the POA Ops world.

As POA validators, we have an obligation to the POA network to ensure high availability and fast block emission rates. We are also responsible for maintaining distributed trustless consensus at all times. Since validators are running the nodes in test/prod environments, we are de facto putting on the hat of the Ops persona, which interfaces with, yet is quite different from, the DevOps persona (our core Dev crew). Establishing a playbook for POA Ops is a crucial part of our BCDR (Business Continuity / Disaster Recovery) model. Ops cares a great deal about three metrics:

  1. Time to detection (TTD): latency from occurrence till acknowledgement
  2. Time to triage (TTT): latency from acknowledgement till identification of component at fault
  3. Time to mitigation (TTM): latency from successful triage till mitigation

Ops does not solve the actual code-level issue - that’s deferred to a typically much longer-running DevOps metric: Time to resolution (TTR), which tracks latency from successful triage till a fix that precludes future occurrences. End-users of the POA network ultimately only care about the Total Incident Duration (TID) = TTD + TTT + TTM. Note that TTR is not explicitly visible to end-users, but it may impact how often incidents recur. In other words, maintaining end-user satisfaction oftentimes falls on the validators, not the dev team.
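To make the arithmetic concrete, here is a minimal sketch of how these metrics could be derived from four timestamps per incident. The `Incident` class, its field names, and the sample timestamps are all made up for illustration; a real IcM system would supply its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions, not an IcM schema.
    occurred: datetime      # when the fault actually began
    acknowledged: datetime  # when a validator acknowledged the alert
    triaged: datetime       # when the component at fault was identified
    mitigated: datetime     # when service was restored for end-users

    @property
    def ttd(self) -> timedelta:
        """Time to detection: occurrence till acknowledgement."""
        return self.acknowledged - self.occurred

    @property
    def ttt(self) -> timedelta:
        """Time to triage: acknowledgement till component identified."""
        return self.triaged - self.acknowledged

    @property
    def ttm(self) -> timedelta:
        """Time to mitigation: successful triage till mitigation."""
        return self.mitigated - self.triaged

    @property
    def tid(self) -> timedelta:
        """Total incident duration, as seen by end-users."""
        return self.ttd + self.ttt + self.ttm


# Invented sample incident: 12 min to detect, 28 to triage, 25 to mitigate.
incident = Incident(
    occurred=datetime(2018, 1, 15, 10, 0),
    acknowledged=datetime(2018, 1, 15, 10, 12),
    triaged=datetime(2018, 1, 15, 10, 40),
    mitigated=datetime(2018, 1, 15, 11, 5),
)
print(incident.ttd, incident.ttt, incident.ttm, incident.tid)
```

Tracking the three components separately, rather than TID alone, shows whether alerting, triage tooling, or the mitigation playbook is the slow step.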

In the world of POA Ops, validators therefore need effective alerting capabilities to reduce TTD (acknowledgements of SMS-based alerts do tend to have significantly lower latencies than email-based alerts, but it’s good to have both). Employing a modern IcM (Incident Management) system will help us both stay on top of incidents and track our ability to detect and mitigate them effectively. Furthermore, to reduce TTT we need an effective triage solution that helps us identify the specific components at fault and whether an infra or a dependency issue is behind the incident.

There are several monitoring options on the market, and we could explore multiple solutions as part of a cost-benefit analysis specific to POA needs and constraints. For instance, being approachable by non-technical validators is an important consideration and, I think, a requirement to enable viral growth of distributed businesses based on the POA networks. In particular, that means we need a monitoring management solution that is easy to deploy and access. A strong freemium model for low-volume inception/early adoption phases is another such requirement, to help bootstrap POA and future distributed businesses that decide to run on top of POA networks. We also need a secure monitoring solution, where some KPIs are exposed publicly, while access to other KPIs is limited for NCP reasons.

Service Level Objectives
Before we go implementing a process, we first need to agree on which metrics POA validators care about tracking and what type of SLA (Service Level Agreement) we want to offer to the POA network users. Specific SLOs (Service Level Objectives) need to be quantified, and alerts must be set up to track whether we are hitting those goals at the Nth percentiles (it’s common to track the 95th and 99th for mature services).
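As a rough illustration of percentile-based SLO tracking, the sketch below computes nearest-rank percentiles over a batch of latency samples and compares them against targets. The sample values and the target thresholds are invented for the example; the actual SLO numbers are exactly what we still need to agree on.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(samples, targets):
    """Map each percentile in `targets` to (observed value, within target?).

    `targets` maps a percentile (e.g. 95) to the maximum allowed value.
    """
    report = {}
    for pct, limit in targets.items():
        observed = percentile(samples, pct)
        report[pct] = (observed, observed <= limit)
    return report


# Invented latency samples in seconds: mostly 5 s, with a slow tail.
latencies = [5] * 17 + [6, 9, 14]

# Hypothetical targets: p95 within 10 s, p99 within 12 s.
print(slo_report(latencies, {95: 10, 99: 12}))
# {95: (9, True), 99: (14, False)}
```

In this made-up batch the 95th percentile meets its target while the 99th does not, which is the kind of tail regression an average would hide.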

Validators are then collectively on point to ensure that we achieve such SLO goals through early detection and mitigation, while the Dev team is on point to provide permanent remediation of recurring incidents. Looking for problems to fix, rather than focusing on the right problems to fix, is the most common way to go wrong with QoS/QoE improvements, even with all the right motives in place.

For the POA network, the first and foremost KPI we will likely need to track is its block emission latency. Having too many validator nodes go offline can adversely impact this KPI and, with it, our SLO. Other KPIs I see us tracking in the future include validators’ responsiveness to governance ballots, incident recurrence rates, network compromise probability index, core DApps geo-distributed availability, etc.
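A minimal sketch of the block emission latency KPI, assuming we already have `(block number, Unix timestamp)` pairs pulled from a node. How the blocks are fetched is out of scope here, and both the sample data and the 10-second threshold are invented for illustration.

```python
def emission_intervals(blocks):
    """Seconds elapsed between each block and its predecessor.

    `blocks` is a list of (block_number, unix_timestamp) pairs sorted by
    block number. Returns (block_number, interval_seconds) pairs.
    """
    return [
        (number, ts - prev_ts)
        for (_, prev_ts), (number, ts) in zip(blocks, blocks[1:])
    ]


def slow_blocks(blocks, max_interval_s):
    """Blocks whose emission interval exceeded the allowed maximum."""
    return [
        (number, gap)
        for number, gap in emission_intervals(blocks)
        if gap > max_interval_s
    ]


# Invented sample: block 103 arrives late, e.g. after validators went offline.
sample = [(100, 1000), (101, 1005), (102, 1010), (103, 1027), (104, 1032)]

print(emission_intervals(sample))  # [(101, 5), (102, 5), (103, 17), (104, 5)]
print(slow_blocks(sample, max_interval_s=10))  # [(103, 17)]
```

The resulting intervals are the raw samples that a percentile-based SLO check or an alert rule would consume.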

Azure Monitoring
There are multiple monitoring options on the market of variable maturity, utility, and usability. Each cloud provider has its own built-in solution: AWS CloudWatch+X-Ray, Google Stackdriver, Microsoft Application Insights. There are also plenty of third-party offerings, including New Relic, AppDynamics, Dynatrace, Sumo Logic, Splunk, Solr, ELK, Grafana, SignalFx, etc. Most of these solutions are niche players, offering either strong metric monitoring (CloudWatch), strong log analytics (Sumo), or strong E2E tracing (e.g. AppDynamics, Dynatrace). There is only one I’m aware of that does it all and lets you benefit from economies of scale. That’s the service I’ve been working on for the past two years, which builds on a decade of big data work at SQL BI/Power BI and the BizOps culture in System Center and Azure. That service is App Insights.

Let’s review some of its key strengths that should appeal to both Ops and DevOps:

  1. Cost: Offering inexpensive solutions in the monitoring space requires operating at grand scale, and there are very few companies that can afford to offer their monitoring solutions as loss-leaders until economies of scale pick up. Most solutions in this space cost money, and a lot of it. Some have freemium offerings with limited functionality that upsell you into more. App Insights offers 5 GB of ingested telemetry for free every month for every subscription, with a 90-day retention policy and its full feature set. That means that Sokol and Core can get 5 GB/month for free each. After the first 5 GB, every additional GB is only $2.30 or 10 POA (at the pre-sale ETH cost with ETH@$1000). Geo-distributed pingalive tests are also bundled for free. For POA that likely means a free full-stack monitoring solution for months to come and negligible OpEx thereafter.

  2. Availability: App Insights bundles free geo-distributed pingalive tests and lets you onboard multistep tests, including Selenium, to exercise your use cases (in the case of POA, we can proactively validate that core DApps unit tests are operational across the globe). The sixteen currently supported locations include Australia East, Brazil South, France South/Central, East Asia, Japan East, Europe North/West/East, UK South/West, Asia Southeast, US West/Central/East/North Central/South.

  3. Log Analytics: Some solutions in this space (e.g. Sumo, Splunk) let you parse through unstructured logs effectively and build good dashboarding on top. However, they lack visibility into the semantic model of the application layer (i.e. requests, dependencies, exceptions, events) and are often reserved for DevOps diagnostics scenarios, since tracing formats change often. App Insights is built on top of the Log Analytics platform, which is more powerful and far less expensive to use than both Sumo and Splunk (two top contenders in this space). All custom DevOps/Ops dashboarding in Azure is built on top of Log Analytics and Application Insights, and it also powers Power BI BizOps dashboards consumed by the senior leadership. Any Ops/DevOps/BizOps question can be tracked analytically and visually. Alerting can be scheduled using the same analytical queries and is a great way to monitor complex and evolving SLAs.

  4. Adaptively-sampled Semantic Application E2E Transactions: For robust Ops/BizOps, relying on logs alone neither offers enough insight (no built-in transactional correlation, inability to generate insights without built-in semantic knowledge) nor is robust enough for operationalization (devs change log formats too often). SLA monitoring requires that we track the application layer both top-down and E2E, including operational and dependency latencies, failures, exceptions, etc. App Insights is the only solution on the market that collects the full semantic application layer: it adaptively samples incoming requests based on throughput and, for each sampled request, collects all the metrics, events, traces, dependencies, and exceptions within that transaction. This provides out-of-the-box monitoring and diagnostics based on real user monitoring (RUM) for service availability, reliability, responsiveness, efficiency, and usage (as opposed to monitoring just simulated workloads). Facilities for cross-node transaction-based correlation are available to monitor, triage, and diagnose issues in micro-services architectures, such as the distributed network that POA is.

  5. Infra monitoring: Many application issues are caused by resource saturation/contention/starvation. In addition to application layer telemetry, App Insights also collects performance counters for key resources, including CPU, storage, and network, as well as any custom counters exposed by the application. For POA this can help us diagnose when it’s time to scale up our validator nodes, and see whether having nodes fall behind is caused by infra, bugs in the code, or dependencies on other nodes. This will help drive our TTT down significantly.

  6. Application topology: Services form logical topologies between multiple roles/instances (i.e. nodes). App Insights offers an on-demand visual representation of that topology built based on the dynamically detected communication to/from these nodes. This helps significantly reduce TTT and TTM.

  7. Live Metrics: App Insights is the only APM on the market (outside of the niche product SignalFx) to offer a live stream of metrics from your servers for both triage and diagnostics. This is very useful when deploying/upgrading nodes in production. During hard forks, the live metrics stream lets you monitor the health of the network not unlike NETSTATS, but with far greater customizability and interactivity (e.g. resource usage by process, immediate access to events/traces/exceptions as they occur, etc.).

  8. OSS SDKs: App Insights has a rich OSS SDK portfolio on GitHub. In particular, Python is community supported and Node.JS is officially supported by the PG. Since Solidity’s syntax is JavaScript-like and much of the DApp tooling around it is JS-based, it should be straightforward to enable the Track* APIs.

  9. Smart Detection Insights: App Insights continuously monitors incoming data for regressions and anomalies and lets you know when your durations have regressed or failure rates have spiked. It also gives likely causes for why the incident occurred (e.g. load burst, dependency slowdown, etc.). This helps drastically reduce TTD, TTT, and TTM.

  10. Client-side Monitoring: App Insights offers E2E visibility across both the browser and the front/backend. A JavaScript snippet is injected into client-side pages as they are served by the frontend. This enables usage analytics as well as tracking of client-side QoS (AJAX exceptions, slow AJAX calls, etc.). As validators we want to know when DApps are lagging/failing on POA.

  11. Usage Analytics: In addition to QoS, App Insights also offers usage analytics (similar to Google Analytics), where you can monitor usage patterns, including adoption/retention/funnels/flows/etc. Unlike Google Analytics, App Insights usage analytics isn’t limited to the browser, letting you build funnels that also include server-side telemetry. Furthermore, it enables investigations into how specific server-side issues impact usage. POA can leverage this to monitor how the core DApps are being used by the validators, what other DApps are running on top of POA, as well as how specific QoS issues are impacting POA QoE for end-users.

  12. Dashboarding: Everything listed above can be pinned to a dashboard with a full RBAC security model. Furthermore, Power BI integration enables building both tenant-scoped as well as public anonymous reports and dashboards that can be hosted on websites. If you can query it, you can monitor it.
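To put the pricing from item 1 into numbers, here is a back-of-the-envelope sketch using the figures quoted above (5 GB free per subscription per month, $2.30 per additional GB). It assumes that fractional GBs are billed pro rata and that nothing else is charged; both are simplifications on my part, and the ingestion volumes are hypothetical.

```python
FREE_GB_PER_MONTH = 5.0        # free allowance per subscription, per the post
PRICE_PER_EXTRA_GB_USD = 2.30  # price per GB beyond the allowance, per the post


def monthly_cost_usd(ingested_gb):
    """Estimated monthly bill for one subscription.

    Assumes pro-rata billing of fractional GBs (an assumption, not a
    statement about actual billing rules).
    """
    billable_gb = max(0.0, ingested_gb - FREE_GB_PER_MONTH)
    return round(billable_gb * PRICE_PER_EXTRA_GB_USD, 2)


# Hypothetical ingestion volumes in GB/month for the two networks.
for network, volume_gb in {"Sokol": 3, "Core": 12}.items():
    print(network, monthly_cost_usd(volume_gb))
```

With these made-up volumes, Sokol stays within the free tier and Core pays for 7 GB, i.e. about $16 a month.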

There are many other benefits specific to the DevOps world (e.g. profiling, snapshot debugging, integration with IDEs, release annotations, Azure DevOps integration, etc.), but those aren’t as relevant to the POA Ops discussion.

Best, MM



Thanks, and wow, for the thoughtful and thorough explanation/guidance. I particularly like that you have provided a framework for thinking about this issue: an explanation of QoS concepts, a list of existing options, and the Ops/DevOps monitoring landscape.

Wondering what might be the next step toward creating a document with a minimal or recommended set of monitoring and alerting requirements for a validator node …

  • Should a small subset of existing Validators collaborate with your guidance?
  • Should we as validators get an independent audit of nodes/recommendations and share results?

Also, I’m wondering about DDoS/intrusion detection and mitigation. Do you have any insights there?

Thanks again,

Security monitoring is a big topic in its own right. There are multiple products in this space, none are cheap. Azure has a Security Center, but I don’t know enough about it (yet) to recommend. Will need to play with it some and explore.

As far as QoS monitoring goes, paired with a POA core team dev, I’d be happy to guide us through the onboarding and the creation of a dashboard that gives us insights beyond the default Netstats roles. Azure App Insights gives us a lot of monitoring and analytical capacity at no cost.