Monitoring solution

PaulCa · 31 Mar 2021 at 18:03

My home automation system has expanded rapidly over the past year. I have probably a dozen or two devices, some wifi, some rf, some zigbee. I also have a dozen or so microservices and a few servers.

I was considering DIY'ing the monitoring, as quite a few of the checks would be bespoke statuses, such as "stale data", "device last seen", "last message on MQTT topic", "status from REST endpoint".

However the guts of it is just a network monitoring tool with a few exotic plugins/scripts.

I looked at Nagios as it's famed for being customisable. However, it seems dated and reminds me of the Cacti days: everything is manually configured. All the good stuff that actually makes it useful in enterprise is payware add-ons.

I looked at OpenNMS, but got lost.. no bored.. in the complexity.

Can anyone make any other suggestions for pluggable network monitoring applications or should I persevere with the two above? Ideally, "FREE[tm]"

Buffalo2102 · 31 Mar 2021 at 21:12

I've found Zabbix to be quite good.

Tooms · 31 Mar 2021 at 23:59

I’ve just checked and Paessler still offer a free PRTG license which covers 100 ‘sensors’ (monitored data points). I haven’t used PRTG in 6 years or so but it was pretty good when I did. It’s still used by some other teams at my employer (IT service provider) and no one moans about it.

https://blog.paessler.com/paessler-offers-prtg-100-for-free

PaulCa · 1 Apr 2021 at 10:31

Thanks. While I ponder and procrastinate on that, I decided to start with a "sensor", if you will.

Using Python, I:
* Opened the DHCP server's leases file and scanned it for MAC addresses.
* For every MAC address, queried the DHCP server's OMAPI API to get the full latest lease information, including: ip-address, lease state, expiry time, client-hostname, ddns-fwd-name (name forced on the device).
* For every MAC address, did a vendor lookup and stored it.
* For every valid lease, took the IP address and spawned a thread to ping it once with a 1 second timeout, storing the ping RTT, packet loss and status.
* Waited for all the pings to complete, max 1 second.
* Converted all the data points into InfluxDB line protocol format.
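For anyone wanting to try something similar, here is a rough sketch of the leases scan, the threaded ping and the line protocol step. It is not my actual script: the function names, the `dhcp_ping` measurement name and the field names are made up for illustration, the ping assumes the Linux iputils `ping` command (`-c` count, `-W` timeout in seconds), and the OMAPI query and vendor lookup are left out entirely.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Matches the "hardware ethernet aa:bb:cc:dd:ee:ff;" statements in dhcpd.leases.
MAC_RE = re.compile(r"hardware ethernet ([0-9a-fA-F:]{17});")
# Matches "time=1.23 ms" (or "time<1 ms") in ping output.
RTT_RE = re.compile(r"time[=<]([\d.]+) ?ms")


def macs_from_leases(text):
    """Return the unique, lower-cased MAC addresses found in leases text."""
    return sorted({m.lower() for m in MAC_RE.findall(text)})


def ping_once(ip, timeout_s=1):
    """Ping an address once (assumes Linux iputils ping) and return fields."""
    down = {"up": False, "packet_loss": 100}
    try:
        proc = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), ip],
            capture_output=True, text=True, timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return down
    match = RTT_RE.search(proc.stdout)
    if proc.returncode == 0 and match:
        return {"up": True, "packet_loss": 0, "rtt_ms": float(match.group(1))}
    return down


def ping_all(ips):
    """Ping every address in its own worker thread; returns {ip: fields}."""
    with ThreadPoolExecutor(max_workers=max(1, len(ips))) as pool:
        return dict(zip(ips, pool.map(ping_once, ips)))


def escape_tag(value):
    """Backslash-escape the characters line protocol reserves in tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # check bool first, it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, ts_ns):
    """Build one InfluxDB line protocol record with a nanosecond timestamp."""
    tag_part = "".join(f",{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {ts_ns}"
```

Printing one `to_line(...)` result per device to stdout is enough for Telegraf to pick the data up if the script is run from an exec-style input with the data format set to influx.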

Using Telegraf, I can then run that script and publish the data to Influx, which I can then find a way to visualize in Grafana.

That only provides me with a low-layer overview of what is on the network segments which got its config from DHCP. It does not cover the static IP addresses. I could list those manually, or I could run a periodic subnet sweep with nmap or ping/arping to discover "rogues". A rogue is most likely a device I completely forgot I had running somewhere with a static IP I didn't know I had.
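The rogue check itself is just a set difference once the sweep results are in hand. A minimal sketch, assuming the sweep (however it is done) has already been boiled down to an IP-to-MAC dict; `find_rogues` and its arguments are hypothetical names, not part of any tool mentioned here.

```python
def find_rogues(seen, dhcp_macs, static_macs):
    """Return {ip: mac} for swept devices that are in neither known list.

    seen        -- {ip: mac} from a subnet sweep (assumed already parsed)
    dhcp_macs   -- MACs taken from the DHCP leases
    static_macs -- hand-maintained list of known static devices
    """
    known = {m.lower() for m in dhcp_macs} | {m.lower() for m in static_macs}
    return {ip: mac for ip, mac in seen.items() if mac.lower() not in known}
```

Anything it returns is either a forgotten static device to add to the manual list, or something that genuinely should not be there.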

I also need to plumb this into a monitoring and alerting system to set rules on, say, the presence of a MAC address with a successful ping at least once in the last 5 minutes. Otherwise send an SMS or flash the house lights.
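As a sketch of what such a rule boils down to, assuming the last successful ping time per MAC has already been pulled out of the metrics store into a dict (the `stale_macs` name and the data shape are made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def stale_macs(last_ok, expected, now, max_age=timedelta(minutes=5)):
    """Return the expected MACs with no successful ping within max_age.

    last_ok  -- {mac: datetime of the most recent successful ping}
    expected -- MACs that should always be present
    now      -- the time the rule is evaluated at
    """
    return sorted(
        mac for mac in expected
        if mac not in last_ok or now - last_ok[mac] > max_age
    )
```

Whatever ends up evaluating the rule would call this every minute or so and hand any non-empty result to the notifier, be that SMS or the house lights.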

#Chri5# · 6 Apr 2021 at 11:50

PRTG has built-in sensors for MQTT and APIs, so you might be re-inventing the wheel a little.

https://www.paessler.com/rest-api-monitoring
https://www.paessler.com/mqtt

Static device monitoring is easy as you can add them by IP, then set up ping tests etc. It needs a bit more thought to query the DHCP table and test from that. It will monitor DHCP availability as well - I have a vNIC on a VM which is set to DHCP and PRTG checks it can obtain a lease.

Bobcat · 6 Apr 2021 at 15:48

I use PRTG here at home with the free licence for up to 100 sensors. It works fine and I've never had a problem with it. There is even a tool to upload your own cert, if you are into that kind of thing. All alerts go to my domain name email address, which comes through on my phone. Say, for instance, power has been lost to the plug that powers one of the freezers: it kicks off alerting until I resolve the issue. I'm monitoring everything from PiHole to router/AP throughput. It's a good piece of kit for free. In my case it's installed on a VM Server OS, but it can be installed on different OSes.

Lots of fancy sensors to choose from out of the box and lots of things pre-configured. Just remember the free licence has an upper limit of 100 sensors. If you set off around 20-30 different devices all auto-discovering, each putting in place the sensors it thinks are correct, you may end up with a whole load of sensors which are blue/paused because you have blown the 100 sensor limit. You then need to tailor things a little to how you want them: remove the sensors you don't want and leave the ones you do, to get under the limit. Once you get it all set up it's great, with a good web interface. I wouldn't be without it.

I did try the Nagios appliance (VM) route, as that has a similar sensor limit for free, but it's a little more complex and they then keep bugging you via email to take out a paid-for licence. So I avoided it and went back to PRTG personally. PRTG just lets you get on with it without the constant bugging/chasing to purchase a licence. It does what it says on the tin with little hassle, and it can be clustered too if you want to go that far.

bledd · 9 Apr 2021 at 08:44

I really really like Observium.

Can run it from a Pi too.

Get it installed, set a static IP.
Configure SNMP on your network devices to allow read access to that IP.
Add devices in.
Watch them populate the GUI in all its glory.

PaulCa · 12 Apr 2021 at 14:23

Both PRTG and Observium appear to be jack of all trades fully encased, encapsulated, locked in, cripple-ware, proprietary efforts. No thanks.

100 sensors? Wuhahaha! I currently track about 6000 time series. Granted quite a number of those I don't need and should weed out, like network stats for all docker bridge networks etc.

The thing is I have a data logging and graphing system, so I'm all good on metrics and displaying them graphically.

It's the thresholds, limits and diagnostic tests that I'm more interested in. I know the above include that, but they also appear to be the database, the agent, the UI, the API and the graph engine all at once. I just want the alerting system.