  1. CFEngine Community
  2. CFE-3361

Lock of CF_CRITICAL_SECTION not working


    Details

    • Type: Bug
    • Status: Done
    • Priority: Higher
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.16.0, 3.15.3
    • Component/s: None
    • Labels:
      None
    • Story Points:
      8
    • Platform:
      SUSE
    • Found in version (details):
      3.12.2

      Description

      We run CFEngine with LMDB and have multiple cf-agent processes running concurrently. We ran into the following syslog message from CFEngine:

      CFEngine(agent) Could not move 'lock database backup' into place (rename: No such file or directory)

      After some debugging we are almost certain that this is caused by one cf-agent trying to delete a temporary file (of a DB backup) that has already been deleted by another cf-agent. In the function meant to protect against this (WaitForCriticalSection) we found indications of a possible race condition.
      There is no IPC locking between the calls to FindLockTime and WriteLock. A process switch between these calls can result in another cf-agent finding no lock in the DB and overwriting a previous lock. The problem may even be intensified by the functions called from WriteLock, which do thread locking and start other threads (VerifyThatDatabaseIsNotCorrupt_once) before the actual mdb_txn_begin.
      We added debug output to RemoveLock: we read the lock information from the DB before deleting the lock, and print a line if the pid stored in the lock differs from that of the current process. Here are some traces:

      May 06 18:15:26 somehost cf-agent[14262]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14262, lock pid: 14265
      May 06 18:15:26 somehost cf-agent[14265]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14265, lock pid: 0
      May 06 18:15:29 somehost cf-agent[14263]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14263, lock pid: 14269
      May 06 18:15:29 somehost cf-agent[14269]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14269, lock pid: 0
      May 06 18:15:31 somehost cf-agent[14267]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14267, lock pid: 0
      May 06 18:16:59 somehost cf-agent[14799]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14799, lock pid: 14808
      May 06 18:16:59 somehost cf-agent[14808]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14808, lock pid: 14799
      May 06 18:16:59 somehost cf-agent[14799]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14799, lock pid: 14808
      May 06 18:17:00 somehost cf-agent[14808]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14808, lock pid: 0
      May 06 18:17:02 somehost cf-agent[14801]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14801, lock pid: 14804
      May 06 18:17:02 somehost cf-agent[14804]: Release lock of id CF_CRITICAL_SECTION FAILED: my pid: 14804, lock pid: 0
      

      The traces show that, for example, pid 14262 deletes the lock that process 14265 wrote last (overwriting the lock of 14262). When process 14265 then reads the lock information before its own release, it finds no lock (empty return: pid 0).
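
      The interleaving behind the first two trace lines can be replayed deterministically. The following is only a sketch, not CFEngine code: ToyLockDb and the toy_* functions are hypothetical stand-ins for the lock database, FindLockTime, WriteLock and RemoveLock, and the "process switch" is simulated by ordering the calls by hand. With pids 14262 and 14265, the first release sees pid 14265 in the lock and the second sees pid 0.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical in-memory stand-in for the lock database: a single slot,
 * where pid 0 means "no lock stored". This is not CFEngine code. */
typedef struct { pid_t holder; } ToyLockDb;

/* Step 1 of the unprotected acquire: check whether a lock exists. */
static int toy_find_lock(const ToyLockDb *db)
{
    return db->holder != 0;
}

/* Step 2: store our pid unconditionally, like a plain overwrite. */
static void toy_write_lock(ToyLockDb *db, pid_t me)
{
    db->holder = me;
}

/* Release: read the stored pid before deleting, as in the debug output
 * described above. Returns the pid that was found in the lock. */
static pid_t toy_remove_lock(ToyLockDb *db, pid_t me)
{
    pid_t stored = db->holder;
    if (stored != me)
    {
        printf("Release lock FAILED: my pid: %ld, lock pid: %ld\n",
               (long) me, (long) stored);
    }
    db->holder = 0;
    return stored;
}

/* Replays one bad interleaving of two agents a and b.
 * seen[0] and seen[1] receive the pid each agent found on release. */
static void toy_replay_race(pid_t a, pid_t b, pid_t seen[2])
{
    ToyLockDb db = { 0 };

    int a_saw_lock = toy_find_lock(&db);  /* a: finds no lock            */
    int b_saw_lock = toy_find_lock(&db);  /* b: runs before a has written */

    if (!a_saw_lock) toy_write_lock(&db, a);
    if (!b_saw_lock) toy_write_lock(&db, b);  /* overwrites a's lock */

    seen[0] = toy_remove_lock(&db, a);    /* a deletes b's lock */
    seen[1] = toy_remove_lock(&db, b);    /* b finds nothing    */
}
```

      Both agents believe they hold the lock between the second write and the first release, which is the window in which the backup file can be handled twice.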

      Shouldn't the reading and setting of CF_CRITICAL_SECTION be protected by IPC locking (on Linux: sem_xxx, flock), or be done in a single LMDB transaction (mdb_txn_begin, mdb_get, check, mdb_put, mdb_txn_commit)? Alternatively, could a write with MDB_NOOVERWRITE be evaluated?
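
      As an illustration of the first option only, here is a minimal sketch of wrapping the check and the write in a flock() guard. It makes no claim about how CFEngine resolved the issue. ToyRecord, guard_open and guarded_try_acquire are hypothetical names, the guard file path is chosen by the caller, and the in-memory record merely stands in for the CF_CRITICAL_SECTION entry, which in reality lives in the shared LMDB file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the stored lock record; pid 0 means "free". */
typedef struct { pid_t holder; } ToyRecord;

/* Opens (creating it if needed) a file used only as a flock() target.
 * Returns a file descriptor, or -1 on error. */
static int guard_open(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Performs "find lock" and "write lock" as one unit under an exclusive
 * flock(). Returns 1 if the lock was acquired, 0 if another pid already
 * holds it, and -1 if the guard itself could not be taken. */
static int guarded_try_acquire(int guard_fd, ToyRecord *rec, pid_t me)
{
    if (flock(guard_fd, LOCK_EX) != 0)
    {
        return -1;
    }

    int acquired = 0;
    if (rec->holder == 0)       /* the check ...                      */
    {
        rec->holder = me;       /* ... and the write, with no gap     */
        acquired = 1;
    }

    flock(guard_fd, LOCK_UN);
    return acquired;
}
```

      Because flock() locks belong to the open file description, every agent has to open the guard file itself; a second exclusive request then blocks (or fails with LOCK_NB) until the first holder unlocks, so no other agent can run its check between the first agent's check and write.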

      Since CF_CRITICAL_SECTION is the lock of all locks, this can have multiple implications and cause corruption beyond the initially found "file rename" problem.

        Attachments

        1. locks.c
          31 kB
        2. test_start_agents
          0.3 kB
        3. test.cf
          1 kB

            People

            • Assignee:
              vpodzime Vratislav Podzimek
              Reporter:
              groeth Guido Röth