Currently, if the Mender client is restarted or the target is rebooted during the download phase (update-store state), the deployment is aborted on the next boot. Although the code in update_resumer.go allows the client to pick up downloading where it left off after a network interruption or timeout, the same is not possible after a reboot.
My use case has devices deployed on battery-operated, mobile platforms which may turn on/off frequently and which operate over potentially low-bandwidth networks. The inability to "trickle download" updates in the background across possibly several system restarts severely limits the usefulness of the OTA update capability provided by Mender.
The good news is that I think this issue can be overcome by periodically "checkpointing" the progress of the rootfs-image extraction process. The basic idea would be:
- As the download to the underlying device is taking place, periodically (either every so many minutes or every so many bytes written to the disk) a "checkpoint" is written to the lmdb Store. The checkpoint would contain all information required to resume writing to the block device such as:
  - Current Block Device Offset
  - Current Artifact Offset (in Reader stream)
  - Probably other things required here.
- If the system restarts while in the update-store state, we could look for these entries in the Store and resume writing to the Block Device near where we left off. Since checkpoints are only saved in the Store periodically, we might repeat ourselves a bit (writing the same data to the same block device location more than once) occasionally, but it would be much better than having to restart the entire deployment.
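The checkpoint idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Store` interface, the in-memory `memStore`, the `Checkpoint` field names, the JSON encoding, and the `update-store-checkpoint` key are all hypothetical stand-ins, not the client's actual store API or data layout.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Store is a hypothetical minimal key-value interface standing in for
// the client's lmdb-backed store.
type Store interface {
	WriteAll(name string, data []byte) error
	ReadAll(name string) ([]byte, error)
}

var errNotFound = errors.New("key not found")

// memStore is an in-memory Store used only to make this sketch runnable.
type memStore map[string][]byte

func (m memStore) WriteAll(name string, data []byte) error {
	m[name] = append([]byte(nil), data...) // store a private copy
	return nil
}

func (m memStore) ReadAll(name string) ([]byte, error) {
	data, ok := m[name]
	if !ok {
		return nil, errNotFound
	}
	return data, nil
}

// Checkpoint holds the two offsets listed above. A real implementation
// would probably need more state (the "other things" bullet).
type Checkpoint struct {
	BlockDeviceOffset int64 `json:"block_device_offset"`
	ArtifactOffset    int64 `json:"artifact_offset"`
}

// checkpointKey is a made-up key name for this sketch.
const checkpointKey = "update-store-checkpoint"

// SaveCheckpoint serializes the checkpoint and writes it to the store.
func SaveCheckpoint(s Store, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return s.WriteAll(checkpointKey, data)
}

// LoadCheckpoint reports found=false when no checkpoint exists, in which
// case the download would start from the beginning.
func LoadCheckpoint(s Store) (cp Checkpoint, found bool, err error) {
	data, err := s.ReadAll(checkpointKey)
	if errors.Is(err, errNotFound) {
		return Checkpoint{}, false, nil
	}
	if err != nil {
		return Checkpoint{}, false, err
	}
	if err := json.Unmarshal(data, &cp); err != nil {
		return Checkpoint{}, false, err
	}
	return cp, true, nil
}

func main() {
	s := memStore{}
	if err := SaveCheckpoint(s, Checkpoint{BlockDeviceOffset: 8 << 20, ArtifactOffset: 3 << 20}); err != nil {
		panic(err)
	}
	cp, found, err := LoadCheckpoint(s)
	fmt.Println(cp, found, err)
}
```

On boot in the update-store state, the client would call something like `LoadCheckpoint` and, if a checkpoint is found, seek the block device and the artifact stream to the saved offsets instead of aborting.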
The complicated part is of course persisting the gzip.Reader (decompressor) state. Gzip was not designed to be random-access, but I think "resumability" is achievable in our case because:
- Gzip (DEFLATE) requires only a 32 KiB "lookback buffer" of the most recent decompressed output, which it uses when resolving backreferences to prior data in the stream.
- The underlying DEFLATE format is itself block-oriented, potentially giving us convenient "checkpoint" locations from which we can restart decoding.
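This would mean that in practice all we'd have to do is issue an HTTP Range request starting at (Artifact Offset - 32 KiB) to cover the "lookback buffer". One caveat I'm not sure about: the lookback buffer holds decompressed data, so re-fetching 32 KiB of compressed bytes may not be enough on its own. The checkpoint may also need to save the last 32 KiB of decompressed output (or re-read it from the block device) and sit on a DEFLATE block boundary.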