Currently, the client gets rate-limited by hosted-mender because of its calls to deployments/next.
TL;DR: The client must gracefully handle HTTP 429 responses, which the server returns when it rate-limits a client that is moving too quickly.
Note: This happens only when the client is rate-limited, which right now applies to hosted-mender only (unless an on-prem customer has configured rate limiting themselves).
The issue stems from the changes in https://tracker.mender.io/browse/MEN-5096, which made the client poll the server for update-control-maps at every opportunity.
Hosted-mender rate-limits this endpoint to 5 requests/s. As a result, when the client moves from one state to the next too quickly (for example from Download_Leave to ArtifactInstall), the server answers the client's POST deployments/next with an HTTP 429. The client then falls back to the v1 POST endpoint, gets a 204, and reports `deployment aborted from the server`.
Relevant log lines from a failing deployment:
2022-01-28 11:51:16 +0000 UTC info: State transition: update-after-store [Download_Leave] -> mender-update-control-refresh-maps [none]
2022-01-28 11:51:17 +0000 UTC info: State transition: update-install [ArtifactInstall] -> mender-update-control-refresh-maps [none]
2022-01-28 11:51:18 +0000 UTC debug: request not accepted by the server: (POST https://hosted.mender.io/api/devices/v2/deployments/device/deployments/next): Response code: 429
2022-01-28 11:51:18 +0000 UTC debug: Connecting to server http://localhost:46331
2022-01-28 11:51:18 +0000 UTC debug: Request: "" "" "https" "hosted.mender.io" "/api/devices/v1/deployments/device/deployments/next"
2022-01-28 11:51:19 +0000 UTC debug: Successful (authorized) request: (POST https://hosted.mender.io/api/devices/v1/deployments/device/deployments/next): Response code: 204
2022-01-28 11:51:19 +0000 UTC debug: Received response:204 No Content
2022-01-28 11:51:19 +0000 UTC debug: No update available
2022-01-28 11:51:19 +0000 UTC error: transient error: The deployment was aborted from the server
2022-01-28 11:51:19 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> rollback [ArtifactRollback]
2022-01-28 11:51:19 +0000 UTC debug: Transitioning to error state
Reference discussion on Slack: https://northern-tech.slack.com/archives/C0XM0KX9C/p1643374074791979
The client will have to be changed in two ways:
- Handle being rate-limited explicitly, instead of treating a 429 like any other error code.
- Add special handling for updates running with update-control-maps. There is no point in falling back to the v1 POST endpoint (and its 204 response) when the client is using control maps.
Requirements:
- The client must gracefully handle HTTP 429 responses.
- The client must have special handling of update polling when it is already going through an update with update-control-maps.
- Integration tests must cover the new functionality.