Node-Watcher is a Slack application that monitors and reports the status of nodes in an OSPF network. It currently works with data sources that are specific to NYC Mesh. Work to make it more generic is planned and is tracked on the 'make-generic' branch.
- Needs a Slack app with the following OAuth permissions in all channels it will be posting in: chat:write, reactions:read, reactions:write, channels:history, groups:history
- Python3, using built-in modules
- Linux
- Pull the repo: git clone git@github.com:scottongithub/node-watcher.git
- Edit the environment variable situation in node_watcher_launcher.sh (info specific to your Slack environment)
- All tuneables are at the top of node_watcher.py, some separated by ('dev', 'prod') environments
- Run node_watcher_launcher.sh - you will be prompted to paste in the bot's API token. It will then detach and run in the background
- Run ps aux | grep node_watcher to find its PID. Use kill -9 <PID> to stop it
Node-Watcher pulls the entire OSPF LSDB every 60 seconds. If a node is in the LSDB and is not being filtered (see the Filters section), it gets monitored. Node-Watcher reports when a node is observed to be down for longer than the defined threshold alert_threshold_ms (default 5 min), or when it is determined to be flappy (default 12 flaps over the course of 24 hours). The first time a node is observed as either down or flappy, a message for that node is created in the Node-Watcher channel.
This message will be the start of the node's history thread that will contain all future activity of the node. An example of a node's history thread with a bunch of observed state changes:
Besides the node's history thread, a separate message is added to the Node-Watcher channel for each state change of the node:
This separate message, which lives in the main channel rather than inside the node's history thread, is deleted when that node's state changes again, and a new message replaces it. This minimizes the amount of noise that a flappy node can produce. A node's full history is available in the node's history thread, viewable by clicking the 'node history' link in the alert message.
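The down and flappy criteria described above can be sketched as two small checks. This is only an illustration under stated assumptions: alert_threshold_ms comes from the text, while flap_threshold_qty, flap_window_ms, and both function names are hypothetical and may not match node_watcher.py. Treating exactly 12 flaps as flappy is also an assumption.

```python
# Defaults taken from the description above. Only alert_threshold_ms is a
# name from the project; the other names are hypothetical.
alert_threshold_ms = 5 * 60 * 1000       # down longer than this -> alert
flap_threshold_qty = 12                  # hypothetical name
flap_window_ms = 24 * 60 * 60 * 1000     # hypothetical name


def is_down_alertable(down_since_ms, now_ms):
    """True once a node has been down longer than the alert threshold."""
    if down_since_ms is None:  # node is currently up
        return False
    return now_ms - down_since_ms > alert_threshold_ms


def is_flappy(flap_times_ms, now_ms):
    """True if enough state changes fall inside the trailing 24 h window."""
    recent = [t for t in flap_times_ms if now_ms - t <= flap_window_ms]
    return len(recent) >= flap_threshold_qty
```

Both checks take timestamps in milliseconds, so they could be run once per 60-second LSDB pull.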
Certain functions can be invoked by leaving reactions on the parent message of a node's history thread, or on the node's alert message in the channel (but not on messages inside the node's history thread):
👀 -> the user that left this will be @'ed on (only) the next alert message from this node
❤️ -> the user that left this will be subscribed to all future alerts from this node
💔 -> the user that left this will be unsubscribed from alerts from this node
⏱️ -> silence all alerts from this node for 3 hours
📅 -> silence all alerts from this node for 24 hours
❌ -> silence all alerts from this node forever. Remove the ❌ to re-enable alerts from this node
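One way to picture the reaction behavior listed above is as a small per-node state object. This is a sketch, not the project's implementation: the class, its method names, and the use of emoji characters as keys are all assumptions (a real Slack reaction event carries a reaction name rather than the character).

```python
from datetime import datetime, timedelta

# Emoji constants are for readability only; names are hypothetical.
EYES, HEART, BROKEN_HEART = "👀", "❤️", "💔"
STOPWATCH, CALENDAR, CROSS = "⏱️", "📅", "❌"

SILENCE = {STOPWATCH: timedelta(hours=3), CALENDAR: timedelta(hours=24)}


class NodeAlertState:
    """Hypothetical per-node record of reaction-driven settings."""

    def __init__(self):
        self.mention_once = set()      # users to @ on the next alert only
        self.subscribers = set()       # users to @ on every alert
        self.silenced_until = None     # datetime while timed-silenced
        self.silenced_forever = False

    def on_reaction_added(self, emoji, user, now):
        if emoji == EYES:
            self.mention_once.add(user)
        elif emoji == HEART:
            self.subscribers.add(user)
        elif emoji == BROKEN_HEART:
            self.subscribers.discard(user)
        elif emoji in SILENCE:
            self.silenced_until = now + SILENCE[emoji]
        elif emoji == CROSS:
            self.silenced_forever = True

    def on_reaction_removed(self, emoji, user):
        if emoji == CROSS:  # removing the cross re-enables alerts
            self.silenced_forever = False

    def is_silenced(self, now):
        timed = self.silenced_until is not None and now < self.silenced_until
        return self.silenced_forever or timed

    def take_mentions(self):
        """Users to @ on the next alert; one-shot watchers are then cleared."""
        users = self.subscribers | self.mention_once
        self.mention_once.clear()
        return users
```

Keeping the one-shot watchers in a separate set from subscribers makes the "only the next alert" rule a single clear() call.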
Leave a message in the channel starting with nw (no slash) to use the following commands:
nw show subscriptions|subs: show your current node/link subscriptions
nw subscribe|sub <router id>: subscribe to node
nw subscribe|sub <router id> <advertised_router_id> <metric>: subscribe to link
nw unsubscribe|unsub <router id>: unsubscribe from node/link
nw show router <router id>: show node's OSPF links to neighbors (can be copy-pasted into nw subscribe afterwards), recent flap history of node, along with history of its links and history of its OSPF neighbors
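The command syntax above could be parsed along these lines. This is a minimal sketch that follows only the forms listed; the function name, the action labels it returns, and the handling of unrecognized input are assumptions rather than Node-Watcher's actual behavior.

```python
def parse_nw_command(text):
    """Parse a channel message into (action, args), or None if not an nw command.

    Action labels are hypothetical and only mirror the commands listed above.
    """
    words = text.split()
    if not words or words[0] != "nw":
        return None
    rest = words[1:]
    if len(rest) == 2 and rest[0] == "show" and rest[1] in ("subscriptions", "subs"):
        return ("show_subscriptions", [])
    if len(rest) == 3 and rest[:2] == ["show", "router"]:
        return ("show_router", rest[2:])
    if rest and rest[0] in ("subscribe", "sub"):
        if len(rest) == 2:   # <router id>
            return ("subscribe_node", rest[1:])
        if len(rest) == 4:   # <router id> <advertised_router_id> <metric>
            return ("subscribe_link", rest[1:])
    if len(rest) == 2 and rest[0] in ("unsubscribe", "unsub"):
        return ("unsubscribe", rest[1:])
    return ("unknown", rest)
```

For example, parse_nw_command("nw sub 10.0.0.1") yields ("subscribe_node", ["10.0.0.1"]), and a message that does not start with nw yields None.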
When 5 or more nodes (set by hub_down_node_qty) go down and stay down past a time threshold (hub_down_alert_time_ms, default 3 min), it is considered a hub-down outage and the alert will look like this:
🔥🔥 12 nodes down at once, looking like a hub went down 3 min ago. Suspected root cause node: aa.bb.cc.dd
The number of fire emojis is the number of down nodes divided by 5 (rounded), so the two emojis in the example above correspond to roughly 10 nodes down (12 in this case). This hub-down message will serve as a thread for all info that pertains to the outage, e.g. when a node comes back up, when all nodes are back up etc:
While waiting for all nodes in a degraded hub to come back up, it may be useful to see a report of which of the hub's dependent nodes have not yet come back up. Place a 👀 on the parent message of the hub-down thread and a report of still-down nodes will be posted to the thread every minute. Remove the 👀 to stop the reporting.
If 25 or more nodes (set by hub_down_raise_qty) go down at once, an additional escalation message is sent to SLACK_ESCALATION_CHANNEL, which is set in node_watcher_launcher.sh
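The hub-down thresholds and the fire-emoji count can be worked through in a short sketch. The three tuneable names come from the text; the function names, the return shape, and the round-half-up rounding mode are assumptions.

```python
hub_down_node_qty = 5                    # nodes down at once -> hub-down outage
hub_down_raise_qty = 25                  # nodes down at once -> escalation
hub_down_alert_time_ms = 3 * 60 * 1000   # how long they must stay down


def fire_emoji_count(down_qty):
    """Down nodes divided by 5, rounded half up (rounding mode is assumed)."""
    return int(down_qty / 5 + 0.5)


def hub_down_status(down_qty, down_for_ms):
    """Return (is_hub_down, emoji_prefix, should_escalate)."""
    is_hub_down = (down_qty >= hub_down_node_qty
                   and down_for_ms > hub_down_alert_time_ms)
    if not is_hub_down:
        return (False, "", False)
    prefix = "🔥" * fire_emoji_count(down_qty)
    return (True, prefix, down_qty >= hub_down_raise_qty)
```

With 12 nodes down, 12 / 5 = 2.4 rounds to 2, which matches the two fire emojis in the example message.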
Node-Watcher will send out a report every day at a set time (reporting_hour, reporting_minute) showing which nodes are down and for how long, along with which nodes are flappy and flap quantity over the past 24 hours:
Nodes that have been down for more than 14 days (set by abandoned_threshold_ms) will be removed from reporting and monitoring, until the node shows back up in the LSDB
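As a rough illustration of how the abandoned threshold could interact with the daily report, the sketch below lists down nodes with their downtime and skips any that exceed abandoned_threshold_ms. Only that name comes from the text; the function names, the input shape (a dict of node to down-since timestamp in ms), and the report line format are assumptions.

```python
abandoned_threshold_ms = 14 * 24 * 60 * 60 * 1000  # 14 days


def format_duration(ms):
    """Render a millisecond duration as days, hours, and minutes."""
    minutes = ms // 60_000
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours}h {mins}m"


def daily_report_lines(down_since_ms_by_node, now_ms):
    """One line per down node, omitting nodes considered abandoned."""
    lines = []
    for node, since_ms in sorted(down_since_ms_by_node.items()):
        down_for_ms = now_ms - since_ms
        if down_for_ms > abandoned_threshold_ms:
            continue  # abandoned: dropped until it reappears in the LSDB
        lines.append(f"{node} down for {format_duration(down_for_ms)}")
    return lines
```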
- NYC Mesh volunteers, for their help with testing and for their practical and creative suggestions
- @Andrew-Dickinson for providing OSPF data via bird-ospf-link-db-parser and node-explorer
This project is licensed under the MIT License - see the LICENSE.md file for details




