Mender / MEN-2285

Client fills page cache causing out-of-memory condition, ignores resulting fsync failure causing corrupted firmware updates.


    Details

    • Type: Bug
    • Status: Done
    • Priority: (None)
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0

      Description

      Background

      Thank you for developing Mender – it's a useful solution to a problem of growing importance, and exactly what my project needed.

      I've integrated Mender into my IoT sensor network product (running Debian on BeagleBone Black) and have been putting the system through its paces. I'm running my own Mender update server and pushing OTA updates to 15-20 devices every night to exercise the overall solution. I'm running Mender client version 1.6.0 (note: I cross-build the mender binary myself).

      The Symptoms

      In a number of cases, we've seen the image delivered to the inactive partition become corrupted. We've experienced the following types of failures:

      • In some cases (if we're lucky), the reboot into the updated rootfs fails due to one or more corrupted files, in which case the BBB is typically able to fall back to the existing partition and the deployment is marked as failed.
      • In other cases, we're able to reboot into the updated rootfs and the update commits (and is marked successful in the deployment UI), but our application code crashes because of corruption in other parts of the filesystem.

      Notes about the nature of the corruption:

      • In one case, we had a .so (shared library) file that was failing to load, causing our application to crash. When I looked at the contents of the .so file it contained Python source code. Other files that should have had textual data contained what looked like binary code.
      • I double-checked the file contents of my .mender artifact file and found that the files in the artifact were correct.
      • fsck returns clean on the filesystems containing these corrupted files.
      • I dd-ed the entire eMMC device into a "forensic backup" file for analysis. There were no complaints from the eMMC layer (so the problem is not due to a worn-out or dying eMMC device).

      All of the above led me to believe that only part of the update rootfs-image was actually being written – as if the filesystem metadata (we're running ext4 on eMMC) had made it to the eMMC but not all of the actual file contents, leaving the metadata pointing at invalid/junk "stale" content.

      Investigation

      Digging into the logs helpfully provided by the Mender UI, I saw the following:

      2018-12-08 10:50:31 +0000 UTC info: wrote 1840250880/1840250880 bytes of update to device /dev/mmcblk1p2
      2018-12-08 10:50:34 +0000 UTC error: failed to fsync partition /dev/mmcblk1p2: sync /dev/mmcblk1p2: input/output error 
      2018-12-08 10:50:34 +0000 UTC error: closing device /dev/mmcblk1p2 failed: sync /dev/mmcblk1p2: input/output error
      2018-12-08 10:50:34 +0000 UTC debug: installer: successfully read artifact [name: MY_ARTIFACT_VERSION; version: 2; compatible devices: [xxx]]
      

      This is a telltale sign that something went wrong while writing the update image to the inactive partition, but the client seems to have ignored the issue and continued on.

      This message is produced at: https://github.com/mendersoftware/mender/blob/31caee72ba26b266ba0a0f344796b6acd11d3a6e/block_device.go#L106

      However, due to a typo bug at: https://github.com/mendersoftware/mender/blob/31caee72ba26b266ba0a0f344796b6acd11d3a6e/device.go#L135, the error is ignored and we move on, rebooting into a potentially-corrupted rootfs.

      I think that should be {{if cerr != nil}} on line 135.
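
      To make the intent concrete, here is a minimal sketch of the behavior I would expect (illustrative only – this is not the actual device.go code, and the package, function, and variable names are mine):

      package example

      import "io"

      // installUpdate is an illustrative stand-in for device.InstallUpdate().
      // The key point is that the error from syncing/closing the block device
      // must be propagated rather than silently dropped.
      func installUpdate(dev io.WriteCloser, image io.Reader) error {
          if _, err := io.Copy(dev, image); err != nil {
              dev.Close()
              return err
          }
          // Closing the device wrapper syncs it; an I/O error here means the
          // written image cannot be trusted and the deployment must fail.
          if cerr := dev.Close(); cerr != nil {
              return cerr // this is the check the typo currently skips
          }
          return nil
      }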

      Root Cause Investigation

      OK, so the client is ignoring the Sync() error, which is not good, but why is that error happening in the first place? The "input/output error" message is due to the underlying fsync() call returning -EIO, which typically indicates some pretty fundamental issues in the storage layers of the kernel.
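
      For reference, that log text matches what Go's (*os.File).Sync() produces when fsync(2) fails: it returns an *os.PathError with Op "sync" wrapping the errno, which formats as "sync /dev/mmcblk1p2: input/output error". A trivial illustration (the device path is just an example):

      package main

      import (
          "fmt"
          "os"
      )

      func main() {
          // Open the raw block device (example path; requires root).
          f, err := os.OpenFile("/dev/mmcblk1p2", os.O_WRONLY, 0)
          if err != nil {
              fmt.Println(err)
              return
          }
          defer f.Close()

          // If the kernel's writeback of dirty pages for this device hit an
          // I/O error, fsync(2) returns EIO and Sync() surfaces it as an
          // *os.PathError that prints as
          // "sync /dev/mmcblk1p2: input/output error".
          if err := f.Sync(); err != nil {
              fmt.Println(err)
          }
      }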

      I was able to go back and grab the contents of dmesg out of the systemd journal leading up to this error on the unit that failed, and sure enough I found:

      Dec 08 10:15:17 arm kernel: mmcqd/1: page allocation failure: order:2, mode:0x208c020
      Dec 08 10:15:17 arm kernel: CPU: 0 PID: 66 Comm: mmcqd/1 Not tainted 4.4.44-ti-r85 #1
      Dec 08 10:15:17 arm kernel: Hardware name: Generic AM33XX (Flattened Device Tree)
      Dec 08 10:15:17 arm kernel: [<c001acd8>] (unwind_backtrace) from [<c0015a34>] (show_stack+0x20/0x24)
      Dec 08 10:15:17 arm kernel: [<c0015a34>] (show_stack) from [<c05ba9d8>] (dump_stack+0x9c/0xb0)
      Dec 08 10:15:17 arm kernel: [<c05ba9d8>] (dump_stack) from [<c0165a88>] (warn_alloc_failed+0x110/0x150)
      Dec 08 10:15:17 arm kernel: [<c0165a88>] (warn_alloc_failed) from [<c016914c>] (__alloc_pages_nodemask+0x890/0xa54)
      Dec 08 10:15:17 arm kernel: [<c016914c>] (__alloc_pages_nodemask) from [<c0169540>] (alloc_kmem_pages+0x28/0xc0)
      Dec 08 10:15:17 arm kernel: [<c0169540>] (alloc_kmem_pages) from [<c0185cd8>] (kmalloc_order+0x20/0x38)
      Dec 08 10:15:17 arm kernel: [<c0185cd8>] (kmalloc_order) from [<c0185d1c>] (kmalloc_order_trace+0x2c/0xdc)
      Dec 08 10:15:17 arm kernel: [<c0185d1c>] (kmalloc_order_trace) from [<c01b19b0>] (__kmalloc+0x184/0x2c8)
      Dec 08 10:15:17 arm kernel: [<c01b19b0>] (__kmalloc) from [<c0633898>] (edma_prep_slave_sg+0xa8/0x33c)
      Dec 08 10:15:17 arm kernel: [<c0633898>] (edma_prep_slave_sg) from [<c088f4c0>] (omap_hsmmc_request+0x490/0x664)
      Dec 08 10:15:17 arm kernel: [<c088f4c0>] (omap_hsmmc_request) from [<c0875168>] (__mmc_start_request+0x50/0xe0)
      Dec 08 10:15:17 arm kernel: [<c0875168>] (__mmc_start_request) from [<c0875358>] (mmc_start_request+0x160/0x294)
      Dec 08 10:15:17 arm kernel: [<c0875358>] (mmc_start_request) from [<c08766cc>] (mmc_start_req+0x2f0/0x420)
      Dec 08 10:15:17 arm kernel: [<c08766cc>] (mmc_start_req) from [<c0887654>] (mmc_blk_issue_rw_rq+0xe0/0xb20)
      Dec 08 10:15:17 arm kernel: [<c0887654>] (mmc_blk_issue_rw_rq) from [<c088819c>] (mmc_blk_issue_rq+0x108/0x4fc)
      Dec 08 10:15:17 arm kernel: [<c088819c>] (mmc_blk_issue_rq) from [<c0889fc4>] (mmc_queue_thread+0xe4/0x1cc)
      Dec 08 10:15:17 arm kernel: [<c0889fc4>] (mmc_queue_thread) from [<c0064e90>] (kthread+0xfc/0x114)
      Dec 08 10:15:17 arm kernel: [<c0064e90>] (kthread) from [<c0010f20>] (ret_from_fork+0x14/0x34)
      Dec 08 10:15:17 arm kernel: Mem-Info:
      Dec 08 10:15:17 arm kernel: active_anon:30372 inactive_anon:52 isolated_anon:0
       active_file:41322 inactive_file:33153 isolated_file:0
       unevictable:0 dirty:32 writeback:1780 unstable:0
       slab_reclaimable:4584 slab_unreclaimable:1708
       mapped:2760 shmem:97 pagetables:371 bounce:0
       free:11306 free_pcp:166 free_cma:10470
      Dec 08 10:15:17 arm kernel: Normal free:45224kB min:2668kB low:3332kB high:4000kB active_anon:121488kB inactive_anon:208kB active_file:165288kB inactive_file:132612kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:522240kB managed:499836kB mlocked:0kB dirty:128kB writeback:7120kB mapped:11040kB shmem:388kB slab_reclaimable:18336kB slab_unreclaimable:6832kB kernel_stack:1200kB pagetables:1484kB unstable:0kB bounce:0kB free_pcp:664kB local_pcp:664kB free_cma:41880kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Dec 08 10:15:17 arm kernel: lowmem_reserve[]: 0 0 0
      Dec 08 10:15:17 arm kernel: Normal: 898*4kB (UMEC) 156*8kB (UMC) 72*16kB (C) 54*32kB (C) 26*64kB (C) 14*128kB (C) 3*256kB (C) 1*512kB (C) 0*1024kB 0*2048kB 0*4096kB 4*8192kB (C) = 45224kB
      Dec 08 10:15:17 arm kernel: 74578 total pagecache pages
      Dec 08 10:15:17 arm kernel: 0 pages in swap cache
      Dec 08 10:15:17 arm kernel: Swap cache stats: add 0, delete 0, find 0/0
      Dec 08 10:15:17 arm kernel: Free swap = 0kB
      Dec 08 10:15:17 arm kernel: Total swap = 0kB
      Dec 08 10:15:17 arm kernel: 130560 pages RAM
      Dec 08 10:15:17 arm kernel: 0 pages HighMem/MovableOnly
      Dec 08 10:15:17 arm kernel: 5601 pages reserved
      Dec 08 10:15:17 arm kernel: 12288 pages cma reserved
      Dec 08 10:15:17 arm kernel: edma 49000000.edma: edma_prep_slave_sg: Failed to allocate a descriptor
      Dec 08 10:15:17 arm kernel: omap_hsmmc 481d8000.mmc: prep_slave_sg() failed
      Dec 08 10:15:17 arm kernel: omap_hsmmc 481d8000.mmc: MMC start dma failure
      Dec 08 10:15:19 arm kernel: mmcblk1: unknown error -1 sending read/write command, card status 0x900
      Dec 08 10:15:19 arm kernel: blk_update_request: 310 callbacks suppressed
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649848
      Dec 08 10:15:19 arm kernel: buffer_io_error: 310 callbacks suppressed
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330207, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649856
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330208, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649864
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330209, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649872
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330210, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649880
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330211, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649888
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330212, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649896
      Dec 08 10:15:19 arm kernel: Buffer I/O error on dev mmcblk1p2, logical block 330213, lost async page write
      Dec 08 10:15:19 arm kernel: blk_update_request: I/O error, dev mmcblk1, sector 2649904
      

      Yikes! Ok, so the storage layer experienced some errors due to an out-of-memory condition. I suspect this happened because the Mender client's {{write()}}s into the partition block device were outpacing the speed at which the storage system was able to actually write the data to the underlying eMMC. For the record, my rootfs image size is 1755 MiB.

      Ideally, if the system were under memory pressure because the writeback page cache was filling up, the kernel would simply apply backpressure and block our userspace writes, but the kernel isn't always perfect, I guess.

      I checked on the other BBB units that experienced similar issues and saw the same messages in the Mender logs and in dmesg.

      Proposed Fixes

      The fact that Mender ignores the failure of the Sync() call is a bug. That's easy and should be fixed.

      The fact that we fill up the kernel's writeback cache to the point that it eventually causes an out-of-memory condition isn't strictly the Mender client's fault, but we need to ask ourselves – what does filling up the writeback page cache actually achieve? The page cache is intended to speed up access to frequently-accessed data, but in our case the goal is simply to get the rootfs-image bytes onto "disk" as reliably as possible. Currently, in the best case (if we manage not to cause an out-of-memory condition), we have to wait for the eMMC to catch up anyway when Sync() is called at the end of device.InstallUpdate(). At worst, we cause this out-of-memory condition and the update fails and has to be retried.

      I propose that we periodically Sync() writes to the block device as the installation progresses, e.g. forcing a Sync() operation for every "N" bytes written.

      This has the effect of both:

      • reducing memory pressure on the writeback page cache
      • detecting storage issues earlier in the process (these would ultimately cause update failure anyway – you might as well detect them sooner).

      We might consider allowing "N" to be set from the configuration file (under a name like DeviceWriteFlushIntervalBytes). A reasonable default would be, say, 4 MiB, but letting users increase or decrease the value might make sense for folks who have more or less RAM available on their systems.

      I've prepared a code modification that does the "sync every N bytes" by applying a straightforward wrapper around the io.Writer used in block_device.go.
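
      Roughly, the wrapper looks like the following (a simplified sketch rather than the exact code I'm proposing; the flushWriter name and its fields are illustrative):

      package example

      import "os"

      // flushWriter wraps the block-device *os.File and forces a Sync() once
      // every flushIntervalBytes bytes have been written, so dirty pages are
      // flushed to the eMMC as we go instead of piling up in the page cache.
      type flushWriter struct {
          f                  *os.File
          flushIntervalBytes int // e.g. the proposed 4 MiB default
          unsynced           int
      }

      func (w *flushWriter) Write(p []byte) (int, error) {
          n, err := w.f.Write(p)
          w.unsynced += n
          if err != nil {
              return n, err
          }
          if w.unsynced >= w.flushIntervalBytes {
              // Surfacing a sync failure here means a dying eMMC (or the OOM
              // condition above) aborts the deployment immediately instead of
              // only after the full image has supposedly been written.
              if err := w.f.Sync(); err != nil {
                  return n, err
              }
              w.unsynced = 0
          }
          return n, nil
      }

      The existing code path would then copy the image into this wrapper instead of into the raw *os.File, with no other changes needed.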

      Other Notes

      This type of corruption could be responsible for MEN-2061.

      MEN-1216 (if it had been implemented) would have caught this. For the record, I think MEN-1216 would be a useful feature – you could leave it off by default and only enable the extra check if configured to do so in the config file.


              People

              • Assignee: Kristian Amlie
              • Reporter: Jeremy Trimble