We are using CFEngine with LMDB and have multiple cf-agent processes running. We ran into the following syslog message from CFEngine:
CFEngine(agent) Could not move 'lock database backup' into place (rename: No such file or directory)
After some debugging we are almost certain that this is caused by one cf-agent trying to delete a temporary file (of a DB backup) that has already been deleted by another cf-agent. In the function meant to protect against this (WaitForCriticalSection) we found indications of a possible race condition.
There is no IPC locking between the calls to FindLockTime and WriteLock. A process switch between these two calls can result in another cf-agent finding no lock in the DB and overwriting a previous lock. The problem might even be made worse by the functions called from WriteLock, which take thread locks and start other threads (VerifyThatDatabaseIsNotCorrupt_once) before the actual mdb_txn_begin.
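The gap described above is a check-then-write race. As a minimal sketch of how the check and the write could be kept together under an IPC lock, the following uses flock(2) on a plain file that stores the owner pid. This is not CFEngine code: claim_critical_section, release_critical_section, open_locked and the file format are hypothetical names and assumptions made up for illustration, and the stale-lock timeout handling that WaitForCriticalSection would need is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the file, take an exclusive flock and read the
 * stored owner pid (0 if the file is empty). Returns the locked fd or -1. */
static int open_locked(const char *path, long *owner)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX) != 0) { close(fd); return -1; }

    char buf[32] = {0};
    ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);
    if (n < 0) { close(fd); return -1; }
    *owner = strtol(buf, NULL, 10);   /* empty file parses as 0 */
    return fd;
}

/* Returns 1 if pid now owns the section, 0 if another pid owns it, -1 on error.
 * The check and the write both happen while the flock is held. */
int claim_critical_section(const char *path, pid_t pid)
{
    assert(path != NULL && pid > 0);
    long owner = 0;
    int fd = open_locked(path, &owner);
    if (fd < 0) return -1;

    int result = 0;
    if (owner == 0 || owner == (long) pid) {
        char buf[32];
        int len = snprintf(buf, sizeof(buf), "%ld\n", (long) pid);
        int ok = ftruncate(fd, 0) == 0 &&
                 pwrite(fd, buf, (size_t) len, 0) == (ssize_t) len;
        result = ok ? 1 : -1;
    }
    close(fd);   /* closing the descriptor also drops the flock */
    return result;
}

/* Returns 1 if the section was released, 0 if pid is not the owner, -1 on error. */
int release_critical_section(const char *path, pid_t pid)
{
    assert(path != NULL && pid > 0);
    long owner = 0;
    int fd = open_locked(path, &owner);
    if (fd < 0) return -1;

    int result = 0;
    if (owner == (long) pid) {
        result = (ftruncate(fd, 0) == 0) ? 1 : -1;
    }
    close(fd);
    return result;
}
```

With this shape, a second process that calls claim_critical_section while the first one is between its read and its write simply blocks in flock until the first one is done, so it cannot see "no lock" and overwrite the record.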
We added debug output to RemoveLock: before deleting the lock, we read the lock information from the DB and print a line if the pid stored in the lock differs from the pid of the calling process. Here are some traces:
They show that, for example, pid 14262 deletes the lock that process 14265 wrote last (by overwriting the lock of 14262). When process 14265 then reads the lock information, it finds no lock (empty return: pid 0).
Shouldn't the reading and setting of CF_CRITICAL_SECTION be protected by IPC locking (Linux: sem_xxx, flock), or be done in a single LMDB transaction (mdb_txn_begin, mdb_get, check, mdb_put, mdb_txn_commit)? Alternatively, could a write with MDB_NOOVERWRITE be evaluated?
Since CF_CRITICAL_SECTION is the lock of all locks, this can have multiple implications and cause corruption beyond the "file rename" problem we initially found.