Details
-
Type: Bug
-
Status: Need more Info
-
Priority: High
-
Resolution: Unresolved
-
Affects Version/s: 3.6.5, 3.10.1
-
Fix Version/s: None
-
Component/s: cf-execd
-
Labels: None
-
Environment: Found on an AIX system, but it could be reproduced on any system with network latency
Description
I had an issue where the agent periodically stops running.
That agent is scheduled to run every two hours, at minute 0 (with a splay time of less than 10 minutes), but after a few runs (4 to 6) following agent startup, no runs happen for a few days.
If I restart the agent during that period, it starts running again, but it stops again after a few runs.
To understand what was happening, I ran cf-execd in verbose mode and found what was preventing runs from starting. The cf-execd preliminary checks took 15 seconds (mostly spent checking a network interface), which introduces a delay between the time cf-execd wakes up and the time it finally checks, at the end of its cycle, whether it has to run. For example, it starts at minute 00 second 50 and ends at minute 01 second 05, which is no longer in minute 00, so it does not start the agent.
Here is an extract of a working run (delay between interface 4 and 5):
2017-05-12T00:00:19DFT verbose: CFEngine Core 3.6.5
2017-05-12T00:00:19DFT verbose: Host name is: <my_hostname>
2017-05-12T00:00:19DFT verbose: Operating System Type is aix
2017-05-12T00:00:19DFT verbose: Operating System Release is 6.1
2017-05-12T00:00:19DFT verbose: Architecture = powerpc
2017-05-12T00:00:19DFT verbose: Using internal soft-class aix for host UP0TC022
2017-05-12T00:00:19DFT verbose: The time is now Fri May 12 00:00:19 2017
2017-05-12T00:00:19DFT verbose: Additional hard class defined as: 32_bit
2017-05-12T00:00:19DFT verbose: Additional hard class defined as: aix_6_1
2017-05-12T00:00:19DFT verbose: Additional hard class defined as: aix_powerpc
2017-05-12T00:00:19DFT verbose: Additional hard class defined as: aix_powerpc_6_1
2017-05-12T00:00:19DFT verbose: GNU autoconf class from compile time: compiled_on_aix5_3
2017-05-12T00:00:19DFT verbose: Address given by nameserver: xxx
2017-05-12T00:00:19DFT verbose: No interface exception file /var/rudder/cfengine-community/inputs/ignore_interfaces.rx
2017-05-12T00:00:19DFT verbose: Interface 1: en1
2017-05-12T00:00:19DFT verbose: Interface 2: en1
2017-05-12T00:00:19DFT verbose: IP address of host set to xxx
2017-05-12T00:00:19DFT verbose: Interface 3: en0
2017-05-12T00:00:19DFT verbose: Interface 4: en0
2017-05-12T00:00:34DFT verbose: Interface 5: lo0
2017-05-12T00:00:34DFT verbose: Interface 6: lo0
2017-05-12T00:00:34DFT verbose: Interface 7: lo0
2017-05-12T00:00:34DFT verbose: Trying to locate my IPv6 address
2017-05-12T00:00:34DFT verbose: Looking for environment from cf-monitord...
2017-05-12T00:00:34DFT verbose: Unable to detect environment from cf-monitord
2017-05-12T00:00:34DFT verbose: Found 16 processors
2017-05-12T00:00:34DFT verbose: Reference time set to 'Fri May 12 00:00:34 2017'
2017-05-12T00:00:34DFT verbose: Waking up the agent at Fri May 12 00:00:34 2017 ~ Hr00.Min00
2017-05-12T00:00:34DFT verbose: Sleeping for splaytime 554 seconds
Two hours later, the agent was not started:
2017-05-12T02:00:49DFT verbose: CFEngine Core 3.6.5
2017-05-12T02:00:49DFT verbose: Host name is: <_hostname>
2017-05-12T02:00:49DFT verbose: Operating System Type is aix
2017-05-12T02:00:49DFT verbose: Operating System Release is 6.1
2017-05-12T02:00:49DFT verbose: Architecture = powerpc
2017-05-12T02:00:49DFT verbose: Using internal soft-class aix for host UP0TC022
2017-05-12T02:00:49DFT verbose: The time is now Fri May 12 02:00:49 2017
2017-05-12T02:00:49DFT verbose: Additional hard class defined as: 32_bit
2017-05-12T02:00:49DFT verbose: Additional hard class defined as: aix_6_1
2017-05-12T02:00:49DFT verbose: Additional hard class defined as: aix_powerpc
2017-05-12T02:00:49DFT verbose: Additional hard class defined as: aix_powerpc_6_1
2017-05-12T02:00:49DFT verbose: GNU autoconf class from compile time: compiled_on_aix5_3
2017-05-12T02:00:49DFT verbose: Address given by nameserver: xxx
2017-05-12T02:00:49DFT verbose: No interface exception file /var/rudder/cfengine-community/inputs/ignore_interfaces.rx
2017-05-12T02:00:49DFT verbose: Interface 1: en1
2017-05-12T02:00:49DFT verbose: Interface 2: en1
2017-05-12T02:00:49DFT verbose: IP address of host set to xxx
2017-05-12T02:00:49DFT verbose: Interface 3: en0
2017-05-12T02:00:49DFT verbose: Interface 4: en0
2017-05-12T02:01:04DFT verbose: Interface 5: lo0
2017-05-12T02:01:04DFT verbose: Interface 6: lo0
2017-05-12T02:01:04DFT verbose: Interface 7: lo0
2017-05-12T02:01:04DFT verbose: Trying to locate my IPv6 address
2017-05-12T02:01:04DFT verbose: Looking for environment from cf-monitord...
2017-05-12T02:01:04DFT verbose: Unable to detect environment from cf-monitord
2017-05-12T02:01:04DFT verbose: Found 16 processors
2017-05-12T02:01:04DFT verbose: Reference time set to 'Fri May 12 02:01:04 2017'
2017-05-12T02:01:04DFT verbose: Nothing to do at Fri May 12 02:01:04 2017
2017-05-12T02:01:04DFT verbose: Sleeping for pulse time 60 seconds...
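For anyone who wants to check their own verbose output for the same pause, here is a rough sketch of a script that finds the largest gap between consecutive log entries. It is not part of CFEngine. The function name `largest_gap` and the regular expression are my own, and it assumes every entry starts with a timestamp in the format shown in the extracts (with a trailing zone suffix such as DFT that is simply skipped).

```python
import re
from datetime import datetime

# Assumed entry layout: "<ISO timestamp><zone suffix> <level>: <message>"
ENTRY = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s+\w+:\s+(.*)$")


def largest_gap(lines):
    """Return (seconds, message_before, message_after) for the longest pause."""
    previous = None  # (timestamp, message) of the last parsed entry
    best = (0.0, None, None)
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped entry
        when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        message = match.group(2)
        if previous is not None:
            gap = (when - previous[0]).total_seconds()
            if gap > best[0]:
                best = (gap, previous[1], message)
        previous = (when, message)
    return best


# Four entries copied from the second extract above.
sample = [
    "2017-05-12T02:00:49DFT verbose: Interface 3: en0",
    "2017-05-12T02:00:49DFT verbose: Interface 4: en0",
    "2017-05-12T02:01:04DFT verbose: Interface 5: lo0",
    "2017-05-12T02:01:04DFT verbose: Interface 6: lo0",
]

print(largest_gap(sample))
```

On these entries it reports a 15-second pause between "Interface 4: en0" and "Interface 5: lo0", which matches the delay visible in both extracts.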
That happens on only one agent out of my hundreds of agents, but I guess it could happen anywhere at any time, and it's quite important as it makes the agent unreliable and the problem quite hard to understand!
Maybe the check should be made against the cf-execd start time instead of the current time. What do you think, and would that be possible?
I worked around it by adding the slow interfaces to ignore_interfaces.rx.
(On a side note, cf-execd checks are made on 1 out of 2 runs. I guess it's because of classes that persist for one minute, exactly the time between two cf-execd runs. Should I open a bug for this?)
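To make the suggestion concrete, here is a small sketch of the two ways of evaluating the schedule. It is only an illustration of the timing, not cf-execd's actual code. The function names are hypothetical, and the schedule rule (even hour, minute 0) is my simplification of "every two hours, at minute 0".

```python
from datetime import datetime, timedelta


def schedule_matches(moment):
    """Simplified schedule: every two hours, at minute 0."""
    return moment.hour % 2 == 0 and moment.minute == 0


def should_run_current_time(start, check_duration):
    # Behaviour seen in the logs: the schedule is evaluated
    # once the preliminary checks have finished.
    return schedule_matches(start + check_duration)


def should_run_start_time(start, check_duration):
    # Suggested alternative: evaluate the schedule against the
    # moment cf-execd woke up, ignoring how long the checks took.
    return schedule_matches(start)


checks = timedelta(seconds=15)            # duration observed in the extracts
working = datetime(2017, 5, 12, 0, 0, 19)  # wake-up time of the working run
missed = datetime(2017, 5, 12, 2, 0, 49)   # wake-up time of the missed run

for start in (working, missed):
    print(
        start.time(),
        "current-time check:", should_run_current_time(start, checks),
        "start-time check:", should_run_start_time(start, checks),
    )
```

With the wake-up at 02:00:49, the current-time check lands on 02:01:04 and misses minute 00, while the start-time check would still trigger the run.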