Status: To Do
Affects Version/s: None
Fix Version/s: None
Epic Name: Monitor s4: Connectivity
Risk & mitigation: TBD
Epic Total Estimate: 18
Given support for service (MEN-4616) and log (MEN-4793) monitoring, the final key aspect of monitoring current uptime is connectivity.
This is a bit different from the other types of monitoring because:
- Loss of connectivity can only be detected by the server, because a disconnected device cannot send an Alert
- The basic version of this (thresholds based on Last check in) is already available in Mender (Open Source without add-ons) and needs to remain available
For all editions of Mender (including Open Source w/o add-ons), we introduce the concept of a device being offline or online. This becomes an attribute of the device, available in the UI (device details, dashboard) and the APIs (also to be added to mender-cli; get all offline/online devices).
The offline / online state of a device is determined by the following attributes:
- Last seen: This should be when the Mender server last had any connection (or the most frequent type of connection) to the device. For now we can continue to use the "Last check in" attribute for this (last inventory poll), though this is a bit misleading.
- Offline threshold: A configurable relative time that defines how long after the device was last seen it should be considered offline, e.g. 1 day or 1 week.
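The rule described by these two attributes can be sketched as follows. This is a minimal illustration, not the Mender implementation: the function name `is_offline` and the choice of a strict comparison (a device seen exactly one threshold ago still counts as online) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_offline(last_seen: datetime, offline_threshold: timedelta, now: datetime) -> bool:
    """Return True when the device has not been seen within the threshold.

    Assumption: the comparison is strict, so a device last seen exactly
    `offline_threshold` ago is still considered online.
    """
    return now - last_seen > offline_threshold


# Example: a device last seen 30 hours ago.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_seen = now - timedelta(hours=30)

print(is_offline(last_seen, timedelta(days=1), now))   # True
print(is_offline(last_seen, timedelta(weeks=1), now))  # False
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.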
We should support the offline threshold at the device group level because different customers may have different environments and expectations around being online. Also, test and production devices generally behave differently in this respect (it may be normal that production devices are not seen frequently, and test devices may be turned off at night). Typical intervals are X hours or Y days, so we suggest starting with these units (we may need to add minute granularity later).
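One possible shape for group-level thresholds is sketched below, assuming the threshold is stored as a whole number of minutes and only converted to hours or days for display. The names `threshold_for_group`, `format_threshold`, and `DEFAULT_THRESHOLD_MINUTES`, the 1-day default, and the fallback to a default for groups without their own setting are all hypothetical.

```python
from typing import Dict, Optional

MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR

# Hypothetical default used when a group has no threshold of its own: 1 day.
DEFAULT_THRESHOLD_MINUTES = MINUTES_PER_DAY


def threshold_for_group(
    group: Optional[str],
    group_thresholds: Dict[str, int],
    default_minutes: int = DEFAULT_THRESHOLD_MINUTES,
) -> int:
    """Return the offline threshold in minutes for a device group."""
    if group is None:
        return default_minutes
    return group_thresholds.get(group, default_minutes)


def format_threshold(minutes: int) -> str:
    """Render a threshold in the largest unit that divides it evenly."""
    if minutes <= 0:
        raise ValueError("threshold must be a positive number of minutes")

    def plural(value: int, unit: str) -> str:
        return f"{value} {unit}" if value == 1 else f"{value} {unit}s"

    if minutes % MINUTES_PER_DAY == 0:
        return plural(minutes // MINUTES_PER_DAY, "day")
    if minutes % MINUTES_PER_HOUR == 0:
        return plural(minutes // MINUTES_PER_HOUR, "hour")
    return plural(minutes, "minute")


# Example: production devices check in rarely, test devices are off at night.
thresholds = {"production": 7 * MINUTES_PER_DAY, "test": 12 * MINUTES_PER_HOUR}
for name in ("production", "test", "staging"):
    print(name, format_threshold(threshold_for_group(name, thresholds)))
```

Storing a single unit keeps comparisons simple, and the display conversion can be extended later without changing stored values.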
If the Monitor add-on is enabled, we also support (email) notifications when a device changes state from online (OK) to offline (CRITICAL). This is a special type of Alert, as the other Alerts are all configured on the Device-side.
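Because this Alert is raised on the server when the state changes, rather than on every check while the device stays offline, the logic amounts to edge detection. A minimal sketch follows, assuming a hypothetical `Alert` record and `connectivity_alert` function; how the email is actually sent is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional

CRITICAL = "CRITICAL"


@dataclass
class Alert:
    device_id: str
    level: str
    message: str


def connectivity_alert(
    device_id: str,
    was_offline: bool,
    is_offline_now: bool,
    monitor_enabled: bool,
) -> Optional[Alert]:
    """Return an Alert only on the online (OK) -> offline (CRITICAL) transition.

    Hypothetical helper: no Alert is produced without the Monitor add-on,
    or while the device simply remains offline.
    """
    if not monitor_enabled:
        return None
    if not was_offline and is_offline_now:
        return Alert(device_id, CRITICAL, "Device changed state from online to offline")
    return None


# Example: the first evaluation after the threshold is crossed yields an Alert,
# later evaluations while still offline do not.
print(connectivity_alert("device-1", was_offline=False, is_offline_now=True, monitor_enabled=True))
print(connectivity_alert("device-1", was_offline=True, is_offline_now=True, monitor_enabled=True))
```

Keeping the previous state as an input means the server has to persist the last evaluated state per device, which is a design assumption of this sketch.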
User value (why)
- Improve customer satisfaction by quickly detecting connectivity problems which may cause the IoT product to malfunction
- Lower cost of support by proactively detecting incorrect or suboptimal network connectivity
- Detect potential theft of devices
- There is a tenant-wide configuration setting, available at the device group level, for the offline threshold of each device group, with the most common granularity being X hours or Y days [the API should likely support Z minutes only, with the UI converting] (all plans)
- It is possible to filter to find all offline devices, also within a given group, both in the UI and API (all plans)
- If a device is offline, this is clearly indicated in the UI Devices list and the device details (all plans)
- The dashboard shows the share of devices offline for different groups (configurable), with the possibility to list the offline devices
- If the Monitor add-on is enabled, an email Alert is sent as soon as a device changes state from online (OK) to offline (CRITICAL) (Professional w/ Monitor or Enterprise w/ Monitor only)
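The filtering and dashboard items above can be illustrated with a small in-memory sketch. The `Device` record and the functions `offline_devices` and `offline_share_by_group` are hypothetical and stand in for whatever the UI and API end up exposing; each device is assumed to belong to exactly one group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Device:
    id: str
    group: str
    offline: bool


def offline_devices(devices: Iterable[Device], group: Optional[str] = None) -> List[Device]:
    """Return offline devices, optionally restricted to one group."""
    return [d for d in devices if d.offline and (group is None or d.group == group)]


def offline_share_by_group(devices: Iterable[Device]) -> Dict[str, float]:
    """Return the fraction of offline devices per group (0.0 to 1.0)."""
    total: Dict[str, int] = defaultdict(int)
    offline: Dict[str, int] = defaultdict(int)
    for d in devices:
        total[d.group] += 1
        if d.offline:
            offline[d.group] += 1
    return {group: offline[group] / count for group, count in total.items()}


# Example fleet with two groups.
fleet = [
    Device("a", "production", offline=False),
    Device("b", "production", offline=True),
    Device("c", "test", offline=True),
]
print([d.id for d in offline_devices(fleet, group="production")])
print(offline_share_by_group(fleet))
```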