  2. CFE-3361

Lock of CF_CRITICAL_SECTION not working


    Details

    • Type: Bug
    • Status: Done
    • Priority: Higher
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.16.0, 3.15.3
    • Component/s: None
    • Labels:
      None
    • Story Points:
      8

      Description

      We are using CFEngine with LMDB and run multiple cf-agent processes concurrently. We ran into the following syslog message from CFEngine:

      CFEngine(agent) Could not move 'lock database backup' into place (rename: No such file or directory)

      After some debugging we are almost certain that this is caused by one cf-agent trying to delete a temporary file (of a DB backup) that has already been deleted by another cf-agent. In the function meant to protect against this (WaitForCriticalSection), we found indications of a possible race condition.
      There is no IPC locking between the calls to FindLockTime and WriteLock. A process switch between these calls can result in another cf-agent finding no lock in the DB and overwriting a previous lock. The problem might even be intensified by the functions called by WriteLock doing thread locking and starting other threads (VerifyThatDatabaseIsNotCorrupt_once) before the actual mdb_txn_begin.
      We added debug output to RemoveLock: we read the lock information from the DB before deleting the lock and print a line if the pid stored in the lock is different from that of the calling process. Here are some traces:

      May 06 18:15:26 somehost cf-agent[14262]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14262, lock pid: 14265
      May 06 18:15:26 somehost cf-agent[14265]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14265, lock pid: 0
      May 06 18:15:29 somehost cf-agent[14263]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14263, lock pid: 14269
      May 06 18:15:29 somehost cf-agent[14269]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14269, lock pid: 0
      May 06 18:15:31 somehost cf-agent[14267]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14267, lock pid: 0
      May 06 18:16:59 somehost cf-agent[14799]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14799, lock pid: 14808
      May 06 18:16:59 somehost cf-agent[14808]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14808, lock pid: 14799
      May 06 18:16:59 somehost cf-agent[14799]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14799, lock pid: 14808
      May 06 18:17:00 somehost cf-agent[14808]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14808, lock pid: 0
      May 06 18:17:02 somehost cf-agent[14801]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14801, lock pid: 14804
      May 06 18:17:02 somehost cf-agent[14804]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14804, lock pid: 0
      

      The traces show that, for example, pid 14262 deletes the lock that process 14265 wrote last (by overwriting the lock of 14262). When process 14265 then reads the lock information before its own release, it finds no lock (empty result, reported as pid 0).
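To make the suspected interleaving concrete, here is a minimal, deterministic sketch that replays it in memory. It is not CFEngine code: `LockSlot`, `SlotIsFree`, `SlotWrite`, `SlotRemove` and `ReplayRacySchedule` are hypothetical names that only stand in for the roles of FindLockTime, WriteLock and RemoveLock, and the real DB access is replaced by a plain struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the CF_CRITICAL_SECTION record; holder 0 means "no lock". */
typedef struct
{
    pid_t holder;
} LockSlot;

/* Check step (plays the role of FindLockTime): is a lock present? */
static bool SlotIsFree(const LockSlot *slot)
{
    assert(slot != NULL);
    return slot->holder == 0;
}

/* Write step (plays the role of WriteLock): stores the pid unconditionally. */
static void SlotWrite(LockSlot *slot, pid_t pid)
{
    assert(slot != NULL);
    slot->holder = pid;
}

/* Release step with the debug check: returns the pid found before clearing. */
static pid_t SlotRemove(LockSlot *slot)
{
    assert(slot != NULL);
    pid_t found = slot->holder;
    slot->holder = 0;
    return found;
}

typedef struct
{
    bool a_acquired;
    bool b_acquired;
    pid_t a_found_on_release;
    pid_t b_found_on_release;
} ReplayResult;

/* Replays one unlucky schedule: both agents check before either writes. */
ReplayResult ReplayRacySchedule(pid_t a, pid_t b)
{
    LockSlot slot = { 0 };
    ReplayResult r = { false, false, 0, 0 };

    bool a_saw_free = SlotIsFree(&slot); /* agent A checks: no lock */
    bool b_saw_free = SlotIsFree(&slot); /* process switch, agent B checks: still no lock */

    if (a_saw_free)
    {
        SlotWrite(&slot, a);
        r.a_acquired = true;
    }
    if (b_saw_free)
    {
        SlotWrite(&slot, b); /* overwrites the record written by A */
        r.b_acquired = true;
    }

    r.a_found_on_release = SlotRemove(&slot); /* A deletes B's record */
    r.b_found_on_release = SlotRemove(&slot); /* B finds nothing */
    return r;
}
```

Calling `ReplayRacySchedule(14262, 14265)` leaves both agents believing they hold the lock. The first release then finds pid 14265 and the second finds pid 0, which is the pattern seen in the first two trace lines.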

      Shouldn't the reading and setting of CF_CRITICAL_SECTION be protected, either by IPC locking (on Linux: sem_xxx, flock) or by doing both in a single LMDB transaction (mdb_txn_begin, mdb_get, check, mdb_put, mdb_txn_commit)? Alternatively, could a write with MDB_NOOVERWRITE be evaluated?
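As an illustration of the IPC locking option, the following is a rough sketch of a guard built on flock(). It assumes a dedicated guard file that every agent can open. The helper names (`GuardAcquire`, `GuardTryAcquire`, `GuardRelease`) and the idea of a separate guard file are assumptions made for this example and are not taken from the CFEngine sources.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/file.h>
#include <unistd.h>

/* Calls flock() again if it was interrupted by a signal. */
static int FlockRetry(int fd, int op)
{
    int rc;
    do
    {
        rc = flock(fd, op);
    } while (rc != 0 && errno == EINTR);
    return rc;
}

/* Opens (creating if needed) the guard file and blocks until an exclusive
 * lock is held. Returns the file descriptor, or -1 on error. */
int GuardAcquire(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
    {
        return -1;
    }
    if (FlockRetry(fd, LOCK_EX) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Non-blocking variant: returns false if the guard is currently held
 * through another open file description, or on error. */
bool GuardTryAcquire(const char *path, int *fd_out)
{
    assert(path != NULL && fd_out != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
    {
        return false;
    }
    if (FlockRetry(fd, LOCK_EX | LOCK_NB) != 0)
    {
        close(fd);
        return false;
    }
    *fd_out = fd;
    return true;
}

/* Drops the lock and closes the descriptor. */
void GuardRelease(int fd)
{
    if (fd >= 0)
    {
        FlockRetry(fd, LOCK_UN);
        close(fd);
    }
}
```

With such a guard, the check for an existing lock and the subsequent write would both run between `GuardAcquire` and `GuardRelease`, so that a process switch in between can no longer let a second agent see "no lock". The transactional alternative avoids the extra file, because LMDB allows only one write transaction at a time per environment and mdb_put with MDB_NOOVERWRITE returns MDB_KEYEXIST instead of overwriting an existing key.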

      Since CF_CRITICAL_SECTION is the lock that guards all other locks, this can have wider implications and cause corruption beyond the "file rename" problem we initially found.

        Attachments

        1. test.cf
          1 kB
        2. test_start_agents
          0.3 kB
        3. locks.c
          31 kB

            People

            • Assignee:
              vpodzime Vratislav Podzimek
              Reporter:
              groeth Guido Röth