CFEngine Community / CFE-2467

lmdb slowness: takes more than 2 minutes to populate cf_lastseen

    Details

    • Type: Bug
    • Status: Done
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: 3.9.1
    • Fix Version/s: None
    • Component/s: cf-serverd
    • Labels:
      None
    • Environment:
      RHEL 6.8

      Description

      We at LinkedIn are testing 3.9.1 (RPM downloaded from here).

      We noticed that lmdb performs poorly when the /var/cfengine/state directory is kept on an HDD.

      For example, cf-serverd takes more than 2 minutes to save connection info in the lastseen db (unfortunately, "cf-serverd -d" no longer prints timestamps):

      verbose: Obtained IP address of '172.20.77.211' on socket 330 from accept
       verbose: New connection (from 172.20.77.211, sd 330), spawning new thread...
          info: 172.20.77.211> Accepting connection
       verbose: 172.20.77.211> Setting socket timeout to 600 seconds.
       verbose: 172.20.77.211> Peeked nothing important in TCP stream, considering the protocol as TLS
         debug: 172.20.77.211> Peeked data: ....(...$..M..
         debug: 172.20.77.211> TLSVerifyCallback: no ssl->peer_cert
         debug: 172.20.77.211> TLSVerifyCallback: no conn_info->key
         debug: 172.20.77.211> This must be the initial TLS handshake, accepting
       verbose: 172.20.77.211> TLS version negotiated:  TLSv1.2; Cipher: AES256-GCM-SHA384,TLSv1/SSLv3
       verbose: 172.20.77.211> TLS session established, checking trust...
         debug: 172.20.77.211> TLSRecvLines(): CFE_v2 cf-agent 3.9.1.
         debug: 172.20.77.211> TLSRecvLines(): IDENTITY USERNAME=root.
       verbose: 172.20.77.211> Setting IDENTITY: USERNAME=root
       verbose: 172.20.77.211> Received public key compares equal to the one we have stored
       verbose: 172.20.77.211> MD5=f1898103601a5ff43620504d1cc02ed8: Client is TRUSTED, public key MATCHES stored one.
      
       <<<--- long pause
      
       verbose: 172.20.77.211> Remote peer terminated TLS session (SSL_read)
          info: 172.20.77.211> Closing connection, terminating thread
       verbose: Obtained IP address of '172.20.77.211' on socket 89 from accept
       verbose: New connection (from 172.20.77.211, sd 89), spawning new thread...
          info: 172.20.77.211> Accepting connection
       verbose: 172.20.77.211> Setting socket timeout to 600 seconds.
       verbose: 172.20.77.211> Peeked nothing important in TCP stream, considering the protocol as TLS
         debug: 172.20.77.211> Peeked data: ....(...$.....
         debug: 172.20.77.211> TLSVerifyCallback: no ssl->peer_cert
         debug: 172.20.77.211> TLSVerifyCallback: no conn_info->key
         debug: 172.20.77.211> This must be the initial TLS handshake, accepting
       verbose: 172.20.77.211> TLS version negotiated:  TLSv1.2; Cipher: AES256-GCM-SHA384,TLSv1/SSLv3
       verbose: 172.20.77.211> TLS session established, checking trust...
         debug: 172.20.77.211> TLSRecvLines(): CFE_v2 cf-agent 3.9.1.
         debug: 172.20.77.211> TLSRecvLines(): IDENTITY USERNAME=root.
       verbose: 172.20.77.211> Setting IDENTITY: USERNAME=root
       verbose: 172.20.77.211> Received public key compares equal to the one we have stored
       verbose: 172.20.77.211> MD5=f1898103601a5ff43620504d1cc02ed8: Client is TRUSTED, public key MATCHES stored one.
      
      <<<--- long pause
      
       verbose: 172.20.77.211> Remote peer terminated TLS session (SSL_read)
          info: 172.20.77.211> Closing connection, terminating thread
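
Since the debug output above carries no timestamps, the length of the pause has to be measured externally. The following is a minimal sketch of a line-stamping filter, not part of CFEngine; the function names (`stamp_lines`, `find_pauses`, `run_filter`) and the 5-second threshold are illustrative assumptions.

```python
import sys
import time
from typing import Callable, Iterable, Iterator, List, TextIO, Tuple


def stamp_lines(lines: Iterable[str],
                clock: Callable[[], float] = time.monotonic
                ) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_first_line, line) as each line arrives."""
    start = None
    for line in lines:
        now = clock()
        if start is None:
            start = now
        yield now - start, line.rstrip("\n")


def find_pauses(stamped: Iterable[Tuple[float, str]],
                threshold: float = 5.0) -> List[Tuple[float, str, str]]:
    """Return (gap_seconds, line_before, line_after) for gaps >= threshold."""
    pauses = []
    prev = None
    for elapsed, line in stamped:
        if prev is not None and elapsed - prev[0] >= threshold:
            pauses.append((elapsed - prev[0], prev[1], line))
        prev = (elapsed, line)
    return pauses


def run_filter(stream: TextIO, out: TextIO) -> None:
    """Copy stream to out, prefixing each line with its elapsed time."""
    for elapsed, text in stamp_lines(stream):
        out.write(f"{elapsed:9.3f}s {text}\n")
        out.flush()
```

Calling `run_filter(sys.stdin, sys.stdout)` at the end of the file turns it into a pipe filter for the daemon's debug output. One caveat: if the daemon buffers its output when writing to a pipe, the measured gaps reflect when lines were flushed rather than when they were logged.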
      

      From this part of the code, it looks like cf-serverd is supposed to do the following three things after the "Client is TRUSTED, public key MATCHES stored one" message:

      1. The user name root is set in the connection info.
      2. The connection info is saved in the lmdb lastseen db.
      3. The "OK WELCOME" string is sent to the client.
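
To make the ordering argument concrete, here is a toy model of the three steps. It is a sketch under stated assumptions, not the cf-serverd code: the steps are assumed to run strictly in sequence on the connection thread, the durations are made up, and the 30-second figure is the socket timeout shown in the client log in this report.

```python
from typing import List, Tuple

# Socket timeout taken from the client log in this report.
CLIENT_TIMEOUT_S = 30.0


def server_steps(save_lastseen_s: float) -> List[Tuple[str, float]]:
    """Post-trust steps in order, each with an assumed duration in seconds."""
    return [
        ("set username in connection info", 0.0),
        ("save connection info in lastseen db", save_lastseen_s),
        ("send OK WELCOME", 0.0),
    ]


def welcome_delay(steps: List[Tuple[str, float]]) -> float:
    """Seconds that pass before the welcome step starts (steps run in order)."""
    elapsed = 0.0
    for name, duration in steps:
        if name == "send OK WELCOME":
            return elapsed
        elapsed += duration
    raise ValueError("no welcome step in sequence")


def client_times_out(save_lastseen_s: float,
                     timeout_s: float = CLIENT_TIMEOUT_S) -> bool:
    """True if the client would give up before the welcome is sent."""
    return welcome_delay(server_steps(save_lastseen_s)) > timeout_s
```

In this model, `client_times_out(120.0)` is true and `client_times_out(0.01)` is false: any lastseen write slower than the client timeout blocks the welcome long enough for the client to hang up, regardless of how fast the TLS handshake itself was.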

      The client waits 30 seconds for the "OK WELCOME" string and then times out:

       verbose: File '/var/cfengine/network_transfer' copy_from '/export/content/cfengine/masterfiles/network_transfer'
       verbose: FindIdle: no existing connection to 'eat1-22164-mps01.corp.linkedin.com' is established.
       verbose: Connecting to host eat1-22164-mps01.corp.linkedin.com, port 5308 as address 172.20.65.22
       verbose: Waiting to connect...
       verbose: Setting socket timeout to 30 seconds.
       verbose: Connected to host eat1-22164-mps01.corp.linkedin.com address 172.20.65.22 port 5308 (socket descriptor 7)
       verbose: TLS version negotiated:  TLSv1.2; Cipher: AES256-GCM-SHA384,TLSv1/SSLv3
       verbose: TLS session established, checking trust...
       verbose: Received public key compares equal to the one we have stored
       verbose: Server is TRUSTED, received key 'MD5=99716406746bab24de39aaf03d398eda' MATCHES stored one.
      
         error: SSL_read: receive timeout
         error: Connection was hung up while receiving line:
         error: Connection was hung up during identification! (3)
       verbose: Connection to 172.20.65.22 is closed
          info: Unable to establish connection to 'eat1-22164-mps01.corp.linkedin.com'
       verbose: FindIdle: no existing connection to 'eat1-22164-mps02.corp.linkedin.com' is established.
       verbose: Connecting to host eat1-22164-mps02.corp.linkedin.com, port 5308 as address 172.20.65.25
      

      Now, if we put /var/cfengine/state into /dev/shm, i.e. use a ramdisk instead of the HDD, the connection works without any issue.
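
One way to check whether the storage itself explains the difference is to time small synchronous writes in both locations. The sketch below assumes the slowdown is dominated by flushes to disk (LMDB syncs on transaction commit by default unless opened with a no-sync flag); this report does not confirm that cause, and lock contention between threads would not show up here. The function names and the probe file prefix are hypothetical.

```python
import os
import tempfile
import time
from typing import Dict, Iterable


def time_synced_writes(directory: str, count: int = 100, size: int = 4096) -> float:
    """Write `count` blocks of `size` bytes, fsync after each; return seconds."""
    payload = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # wait until the block reaches stable storage
        return time.perf_counter() - start
    finally:
        # Always clean up the probe file, even if a write fails.
        os.close(fd)
        os.unlink(path)


def compare(directories: Iterable[str], count: int = 100) -> Dict[str, float]:
    """Return {directory: seconds} for each directory that exists."""
    return {d: time_synced_writes(d, count) for d in directories if os.path.isdir(d)}
```

For example, `compare(["/var/cfengine/state", "/dev/shm"])` returns the elapsed time for each location. It needs write permission in both directories and creates, then removes, one temporary file in each.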

      Could you please take a look and let me know if you need more data? For now we are using a ramdisk as a workaround (we have kept the lmdb files on one since 3.6.2). In the long run, though, it would be better if CFE could use a better NoSQL db, if lmdb cannot deliver the performance needed to operate at hyperscale.

            People

            • Assignee: Eystein Maloy Stenberg (a10003)
            • Reporter: Soumyadip Das Mahapatra (soumyadip)