This may be one of those bugs we will never get to the bottom of, but the log was interesting so I thought a report was in order.
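It looks like the kernel occasionally panics instead of booting correctly when we are simulating a power loss. I have attached the log from the incident in Jenkins; notice how the client log stops abruptly at the panic. In the excerpt below, JBD2 journal recovery on mmcblk0p2 fails with an invalid checksum, the root filesystem cannot be mounted (error -5), and the kernel panics: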
mender-client_1 | [ 2.206994] EXT4-fs (mmcblk0p2): couldn't mount as ext3 due to feature incompatibilities
mender-client_1 | [ 2.224294] random: fast init done
mender-client_1 | [ 2.233068] EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
mender-client_1 | [ 2.238877] EXT4-fs (mmcblk0p2): write access will be enabled during recovery
mender-client_1 | [ 2.440121] JBD2: Invalid checksum recovering block 2 in log
mender-client_1 | [ 2.516621] JBD2: recovery failed
mender-client_1 | [ 2.520742] EXT4-fs (mmcblk0p2): error loading journal
mender-client_1 | [ 2.529380] VFS: Cannot open root device "mmcblk0p2" or unknown-block(179,2): error -5
mender-client_1 | [ 2.533118] Please append a correct "root=" boot option; here are the available partitions:
mender-client_1 | [ 2.537779] 1f00 131072 mtdblock0
mender-client_1 | [ 2.537823] (driver?)
mender-client_1 | [ 2.545085] 1f01 32768 mtdblock1
mender-client_1 | [ 2.545103] (driver?)
mender-client_1 | [ 2.551792] b300 614400 mmcblk0
mender-client_1 | [ 2.551827] driver: mmcblk
mender-client_1 | [ 2.558827] b301 16384 mmcblk0p1 b4329424-01
mender-client_1 | [ 2.558860]
mender-client_1 | [ 2.565948] b302 221184 mmcblk0p2 b4329424-02
mender-client_1 | [ 2.565962]
mender-client_1 | [ 2.573318] b303 221184 mmcblk0p3 b4329424-03
mender-client_1 | [ 2.573339]
mender-client_1 | [ 2.584947] b304 131072 mmcblk0p4 b4329424-04
mender-client_1 | [ 2.584964]
mender-client_1 | [ 2.592335] VFS: Unable to mount root fs on unknown-block(179,2)
mender-client_1 | [ 2.595921] User configuration error - no valid root filesystem found
mender-client_1 | [ 2.599878] Kernel panic - not syncing: Invalid configuration from end user prevents continuing
mender-client_1 | [ 2.603996] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.26-yocto-standard #1
mender-client_1 | [ 2.607742] Hardware name: ARM-Versatile Express
mender-client_1 | [ 2.612761] [<c0015da8>] (unwind_backtrace) from [<c00129ac>] (show_stack+0x10/0x14)
mender-client_1 | [ 2.616731] [<c00129ac>] (show_stack) from [<c0261e4c>] (dump_stack+0x88/0x9c)
mender-client_1 | [ 2.621124] [<c0261e4c>] (dump_stack) from [<c00aa780>] (panic+0xdc/0x248)
mender-client_1 | [ 2.625375] [<c00aa780>] (panic) from [<c065d390>] (mount_block_root+0x288/0x294)
mender-client_1 | [ 2.629343] [<c065d390>] (mount_block_root) from [<c065d49c>] (mount_root+0x100/0x108)
mender-client_1 | [ 2.633196] [<c065d49c>] (mount_root) from [<c065d5f4>] (prepare_namespace+0x150/0x198)
mender-client_1 | [ 2.637161] [<c065d5f4>] (prepare_namespace) from [<c065cec0>] (kernel_init_freeable+0x284/0x294)
mender-client_1 | [ 2.641055] [<c065cec0>] (kernel_init_freeable) from [<c04f1630>] (kernel_init+0x8/0xf0)
mender-client_1 | [ 2.644874] [<c04f1630>] (kernel_init) from [<c000f818>] (ret_from_fork+0x14/0x3c)
mender-client_1 | [ 2.649161] ---[ end Kernel panic - not syncing: Invalid configuration from end user prevents continuing
If the filesystem driver itself is misbehaving, there is not much we can do, but that seems unlikely given how widely used it is. Some alternative explanations I can think of:
- We are not using a good method for simulating power loss.
  - I think it's /proc../reboot-something we are using, right? I think it should be the best one, but maybe not?
- There is an actual bug in our implementation: we are not handling power loss correctly and end up corrupting something.
  - Not sure what it would be, but it can't be ruled out.
- We are somehow corrupting the partition table, which would explain why mmcblk0p2 would also be corrupted.
This problem is happening semi-frequently, so it is worth keeping an eye on it and tracking any findings in this ticket.
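
To make it easier to track recurrences, here is a minimal sketch of a log scanner that flags this failure signature in a client log. The message patterns are copied from the excerpt above; the names `SIGNATURES`, `classify_boot_log` and `is_journal_panic` are hypothetical and not part of our test framework, and how the Jenkins log is fetched is left out.

```python
import re

# Kernel messages taken from the log in this ticket. The patterns are
# searched anywhere in the line, so the "mender-client_1 |" prefix and
# the timestamp do not need to be stripped first.
SIGNATURES = {
    "journal_checksum": re.compile(r"JBD2: Invalid checksum recovering block \d+ in log"),
    "journal_recovery_failed": re.compile(r"JBD2: recovery failed"),
    "root_mount_failed": re.compile(r"VFS: Unable to mount root fs on unknown-block\(\d+,\d+\)"),
    "kernel_panic": re.compile(r"Kernel panic - not syncing"),
}


def classify_boot_log(lines):
    """Return the sorted names of all signatures found in the log lines."""
    found = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return sorted(found)


def is_journal_panic(lines):
    """True if the log shows both a failed journal recovery and a panic."""
    found = classify_boot_log(lines)
    return "journal_recovery_failed" in found and "kernel_panic" in found
```

Running `is_journal_panic` over the lines of each failed run's client log would tell us whether a given failure matches this ticket or is something else, such as a panic without any JBD2 errors.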