CFEngine Community / CFE-2768

Function to read delimited file filtered by class expression

    Details

      Description

      As a policy writer, I would like to be able to read a line-based file (like CSV), filtered by class expressions, into a data container, where the first line of the delimited file contains the field names.

      • Return data container using column headings as keys instead of positional
      • Filter returned data-set by class expression (ifvarclass)
        • If you think you need classmatch(), use it to define a class, then reference that class in ifvarclass.
      • Lexically sort the data container by specified field name.
        • This is a really nice feature to have because when rendering a template based on this data that iterates over the key value pairs, the order will be retained, making diffs more readable.
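To make the class-expression filtering concrete, here is a minimal Python sketch of an evaluator for the operators used in the example data (`!`, `.` or `&` for AND, `|` or `||` for OR, and parentheses). The names `tokenize` and `class_expression_holds` are hypothetical, and the set of defined classes is passed in by hand; a real implementation would live in the agent and use its own class table.

```python
import re

# Tokens: class names, parentheses, and the operators ! . & | (|| is read as |).
_TOKEN = re.compile(r"\s*(\|\||[A-Za-z0-9_]+|[()!.&|])")


def tokenize(expression):
    """Split a class expression into a flat list of tokens."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos} in {expression!r}")
        token = match.group(1)
        tokens.append("|" if token == "||" else token)
        pos = match.end()
    return tokens


def class_expression_holds(expression, defined):
    """Return True if the expression holds for the given set of defined classes.

    Precedence, highest first: NOT (!), AND (. or &), OR (|).
    """
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError(f"unexpected end of expression {expression!r}")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() in (".", "&"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError(f"missing ')' in {expression!r}")
            return value
        if token in (")", "!", ".", "&", "|"):
            raise ValueError(f"unexpected {token!r} in {expression!r}")
        # "any" is always defined; everything else is looked up in the set.
        return token == "any" or token in defined

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"trailing tokens in {expression!r}")
    return result


if __name__ == "__main__":
    defined = {"linux", "debian", "production"}
    for expr in ("linux", "(debian|redhat).production", "redhat.!production"):
        print(expr, "->", class_expression_holds(expr, defined))
```

With linux, debian and production defined, the first two expressions from the example input hold and the third does not, so only the first five data rows would be kept.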

      Why some people prefer line-based files:

      • One row with bad data doesn't invalidate the entire file (as happens with JSON).
      • Easy for people to read and edit
      • Easy to work with in spreadsheets
      • Invalid data rows result in a warning and are discarded
      • Empty rows are silently ignored.
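The tolerant row handling described in the last two bullets could look roughly like the Python sketch below. The ticket does not define what makes a row invalid, so this sketch assumes "wrong number of fields"; the function name `parse_rows` and its parameters are hypothetical.

```python
import logging

log = logging.getLogger("delimited")


def parse_rows(lines, delimiter=",", expected_fields=3):
    """Return the well-formed rows; warn about and drop the rest.

    Assumption: a row is invalid when its field count differs from
    expected_fields. Empty or whitespace-only rows are skipped silently.
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # empty rows are silently ignored
        fields = [field.strip() for field in line.strip().split(delimiter)]
        if len(fields) != expected_fields:
            log.warning(
                "line %d: expected %d fields, got %d; row discarded",
                number, expected_fields, len(fields),
            )
            continue  # one bad row does not invalidate the file
        rows.append(fields)
    return rows


if __name__ == "__main__":
    sample = ["linux,kernel.sysrq,1 ", "", "broken,row", "linux,kernel.panic,10"]
    print(parse_rows(sample))
```

The point of the design is that a single malformed row costs one warning rather than the whole data set, which is the contrast with JSON drawn above.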

      Function prototype suggestions:

      • classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by")
      "filtered_data" data => classexpression_filterdata( "/tmp/data.csv", 1, ",", true, "sysctl_variable" );
      

      Example Input data

      Here is an example with delimiter ,.

      classexpression,sysctl_variable,sysctl_value
      linux,kernel.sysrq,1 
      linux,kernel.core_users_pid,1 
      linux,kernel.panic,10 
      (debian|redhat).production,vm.overcommit_memory,0
      (debian|redhat).production,vm.overcommit_ratio,99
      redhat.!production,net.core.rmem_default,1048576
      

      Here is an example with delimiter ;;.

      classexpression;;sysctl_variable;;sysctl_value
      linux;;kernel.sysrq;;1 
      linux;;kernel.core_users_pid;;1 
      linux;;kernel.panic;;10 
      (debian|redhat).production;;vm.overcommit_memory;;0
      (debian|redhat).production;;vm.overcommit_ratio;;99
      redhat.!production;;net.core.rmem_default;;1048576
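Both sample files above should split into the same three fields per row. The Python sketch below shows one way to support multi-character delimiters such as `;;`; the helper name `split_fields` and its `literal` switch are hypothetical. Whether the proposed function should treat DELIM as a literal string or as a regular expression is not settled in this ticket, so the sketch supports both.

```python
import re


def split_fields(line, delimiter, literal=True):
    """Split one row into trimmed fields.

    literal=True treats the delimiter as plain text (so "|" or "." are safe);
    literal=False treats it as a regular expression.
    """
    pattern = re.escape(delimiter) if literal else delimiter
    return [field.strip() for field in re.split(pattern, line.strip())]


if __name__ == "__main__":
    print(split_fields("linux,kernel.sysrq,1 ", ","))
    print(split_fields("linux;;kernel.sysrq;;1 ", ";;"))
```

Escaping matters in practice because class expressions themselves contain characters such as `|`, `.` and parentheses that are special in regular expressions.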
      

      On a host with the debian and production classes defined (and therefore linux), this should be translated into this data container:

      {
      "kernel.sysrq": "1",
      "kernel.core_users_pid": "1",
      "kernel.panic": "10",
      "vm.overcommit_memory": "0",
      "vm.overcommit_ratio": "99"
      }
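As a rough model of the translation from the delimited input to the container above, here is a Python sketch that follows the suggested prototype. It is not the proposed implementation: it takes lines instead of a file path, it assumes the class-expression column index is 1-based and that DELIM is a regular expression, and it delegates class evaluation to a caller-supplied `holds` predicate. `as_mapping` is a hypothetical helper that collapses the records into key/value form.

```python
import re


def classexpression_filterdata(lines, class_column, delimiter, has_heading, sort_by, holds):
    """Filter delimited rows by class expression and return sorted records.

    lines        -- iterable of text rows (stand-in for the data file)
    class_column -- 1-based index of the class-expression column (assumption)
    delimiter    -- regular expression separating the fields (assumption)
    has_heading  -- True if the first non-empty row holds the field names
    sort_by      -- field name to sort the records by, lexically
    holds        -- callable deciding whether a class expression is true
    """
    rows = [re.split(delimiter, line.strip()) for line in lines if line.strip()]
    if not rows:
        return []
    if has_heading:
        headings, rows = rows[0], rows[1:]
    else:
        # Without a heading row, fall back to positional names "1", "2", ...
        headings = [str(i + 1) for i in range(len(rows[0]))]
    index = class_column - 1
    records = []
    for fields in rows:
        if len(fields) != len(headings):
            continue  # malformed row: discarded
        if not holds(fields[index]):
            continue  # class expression is false on this host
        records.append({
            name: value
            for position, (name, value) in enumerate(zip(headings, fields))
            if position != index
        })
    records.sort(key=lambda record: record[sort_by])
    return records


def as_mapping(records, key_field, value_field):
    """Collapse records into a flat key/value container (later rows win)."""
    return {record[key_field]: record[value_field] for record in records}


if __name__ == "__main__":
    data = [
        "classexpression,sysctl_variable,sysctl_value",
        "linux,kernel.sysrq,1 ",
        "linux,kernel.core_users_pid,1 ",
        "linux,kernel.panic,10 ",
        "(debian|redhat).production,vm.overcommit_memory,0",
        "(debian|redhat).production,vm.overcommit_ratio,99",
        "redhat.!production,net.core.rmem_default,1048576",
    ]
    # Stand-in for real class evaluation on a debian production host.
    true_here = {"linux", "(debian|redhat).production"}
    records = classexpression_filterdata(
        data, 1, ",", True, "sysctl_variable", lambda expr: expr in true_here
    )
    print(as_mapping(records, "sysctl_variable", "sysctl_value"))
```

Because the records are sorted by `sysctl_variable` before being collapsed, the key order in the result is lexical rather than file order, which is the behaviour the "sort by" parameter is asking for.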
      

      Example template to be used with this data:
      NOTE: this example template structure does NOT match the suggested returned data format!

      {{#sysctl_data}}
      {{sysctl_variable}}={{sysctl_value}} 
      {{/sysctl_data}}
      

      For this template to work, the following data structure would be necessary:

      {
      "sysctl_data": [
      { "sysctl_variable": "kernel.sysrq",          "sysctl_value": "1" },
      { "sysctl_variable": "kernel.core_users_pid", "sysctl_value": "1" },
      { "sysctl_variable": "kernel.panic",          "sysctl_value": "10" },
      { "sysctl_variable": "vm.overcommit_memory",  "sysctl_value": "0" },
      { "sysctl_variable": "vm.overcommit_ratio",   "sysctl_value": "99" }
      ]
      }
      

      Example use cases:

      Data structure without parsing key value names into data container

      This example shows how data can be sharded, which can help with execution speed and may better align with different groups managing different aspects of the configuration.

      In the example, the data is sharded into defaults, datacenter, application, and security. Each shard can override the keys of the previous one, so that everyone gets sensible defaults and settings are adjusted for environmental factors (datacenter, application/role, etc.), but security has the final word and can set mandatory values.

      Sharding the data allows for separation of concerns. Global IT can control the data set for default settings, while facility and application admins can override it with settings appropriate for their location and application. The policy writer controls the model and the merge strategy, which determines which data is ultimately used to configure the system. Each sharded data set can leverage CFEngine class expressions to determine which data is loaded from the file.

      bundle agent main
      {
      classes:
      "el_7" expression => "any";
      "dc_7" expression => "any";
      "dc_group_4" expression => "any";
      
      vars:
      
      # loaded from sysctl-global-defaults.csv
      # classexpression,sysctl_variable,sysctl_value
      # "debian_9","kernel.something","DEB_9_DEFAULT"
      # "el_7","kernel.something","EL_7_DEFAULT"
      # #"LAST ENTRY WINS"
      # "linux","kernel.something","DEFAULT"
      # "any","kernel.sysrq","DEFAULT"
      # "any","kernel.core_users_pid","DEFAULT"
      # "el_7","vm.overcommit_memory","EL_7_DEFAULT"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"defaults" data => classexpression_filterdata( "sysctl-global-defaults.csv", 1, ",", true, "sysctl_variable" );
      "defaults" data =>
      '{
      "kernel.something": "DEFAULT",
      "kernel.sysrq": "DEFAULT",
      "kernel.core_users_pid": "DEFAULT",
      "vm.overcommit_memory": "EL_7_DEFAULT"
      }';
      
      # Loaded from sysctl-$(my.datacenter).csv
      # classexpression,sysctl_variable,sysctl_value
      # "dc_group_4","kernel.sysrq","MY DATACENTER"
      # "dc_group_4","vm.overcommit_ratio","MY DATA CENTER"
      # "dc_group_5","vm.overcommit_ratio","MY DC GROUP 5"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"datacenter" data => classexpression_filterdata( "sysctl-$(my.datacenter).csv", 1, ",", true, "sysctl_variable" );
      "datacenter" data =>
      '{
      "kernel.sysrq": "MY DATACENTER",
      "vm.overcommit_ratio": "MY DATA CENTER"
      }';
      
      # Loaded from sysctl-$(my.app).csv
      # classexpression:sysctl_variable:sysctl_value
      # "linux":"kernel.sysrq":"MYAPP"
      # "linux":"sys.ipv4_forward"      :     "MYAPP"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"application" data => classexpression_filterdata( "sysctl-$(my.app).csv", 1, "\s*:\s*", true, "sysctl_variable" );
      "application" data =>
      '{
      "kernel.sysrq": "MYAPP",
      "sys.ipv4_forward": "MYAPP"
      }';
      
      # Loaded from sysctl-security.csv
      # classexpression:sysctl_variable:sysctl_value
      # "!router":"sys.ipv4_forward"      :     "SECURITY SAYS ONLY ROUTERS"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"security" data => classexpression_filterdata( "sysctl-security.csv", 1, "\s*:\s*", true, "sysctl_variable" );
      "security" data =>
      '{
      "sys.ipv4_forward": "SECURITY SAYS ONLY ROUTERS"
      }
      ';
      
      reports:
      "Merged with picked data:$(const.n)$(with)"
      with => string_mustache('{{#-top-}}
      {{@}}={{.}}
      {{/-top-}}', mergedata( defaults, datacenter, application, security ));
      }
      

      Output:

      R: Merged with picked data:
      kernel.something=DEFAULT
      kernel.core_users_pid=DEFAULT
      vm.overcommit_memory=EL_7_DEFAULT
      vm.overcommit_ratio=MY DATA CENTER
      kernel.sysrq=MYAPP
      sys.ipv4_forward=SECURITY SAYS ONLY ROUTERS
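The layering that produces the output above (defaults, then datacenter, then application, then security, with later shards overriding earlier ones) can be modelled in a few lines of Python. This only mirrors the later-wins behaviour visible in the output; it says nothing about how mergedata() is implemented, and the key order of the Python result may differ from the order in the report. The name `merge_shards` is hypothetical.

```python
def merge_shards(*shards):
    """Merge key/value shards left to right; later shards override earlier ones."""
    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged


defaults = {
    "kernel.something": "DEFAULT",
    "kernel.sysrq": "DEFAULT",
    "kernel.core_users_pid": "DEFAULT",
    "vm.overcommit_memory": "EL_7_DEFAULT",
}
datacenter = {
    "kernel.sysrq": "MY DATACENTER",
    "vm.overcommit_ratio": "MY DATA CENTER",
}
application = {
    "kernel.sysrq": "MYAPP",
    "sys.ipv4_forward": "MYAPP",
}
security = {
    "sys.ipv4_forward": "SECURITY SAYS ONLY ROUTERS",
}

if __name__ == "__main__":
    for key, value in merge_shards(defaults, datacenter, application, security).items():
        print(f"{key}={value}")
```

Each key ends up with exactly one value, which is why the key/value form merges cleanly: kernel.sysrq is claimed by three shards but only the application value survives, and security overrides the application's sys.ipv4_forward.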
      

      Example where column headings are keys in the returned data

      These examples are identical to the ones above, except that the data is parsed into named key/value pairs based on the column headers, as originally requested. This form seems less desirable if the data is to be merged.

      bundle agent main
      {
      classes:
      "el_7" expression => "any";
      "dc_7" expression => "any";
      "dc_group_4" expression => "any";
      
      vars:
      
      # loaded from sysctl-global-defaults.csv
      # classexpression,sysctl_variable,sysctl_value
      # "debian_9","kernel.something","DEB_9_DEFAULT"
      # "el_7","kernel.something","EL_7_DEFAULT"
      # #"LAST ENTRY WINS"
      # "linux","kernel.something","DEFAULT"
      # "any","kernel.sysrq","DEFAULT"
      # "any","kernel.core_users_pid","DEFAULT"
      # "el_7","vm.overcommit_memory","EL_7_DEFAULT"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"defaults" data => classexpression_filterdata( "sysctl-global-defaults.csv", 1, ",", true, "sysctl_variable" );
      "defaults" data =>
      '[
      { "sysctl_variable": "kernel.something",      "sysctl_value": "DEFAULT" },
      { "sysctl_variable": "kernel.sysrq",          "sysctl_value": "DEFAULT" },
      { "sysctl_variable": "kernel.core_users_pid", "sysctl_value": "DEFAULT" },
      { "sysctl_variable": "vm.overcommit_memory",  "sysctl_value": "EL_7_DEFAULT" }
      ]';
      
      # Loaded from sysctl-$(my.datacenter).csv
      # classexpression,sysctl_variable,sysctl_value
      # "dc_group_4","kernel.sysrq","MY DATACENTER"
      # "dc_group_4","vm.overcommit_ratio","MY DATA CENTER"
      # "dc_group_5","vm.overcommit_ratio","MY DC GROUP 5"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"datacenter" data => classexpression_filterdata( "sysctl-$(my.datacenter).csv", 1, ",", true, "sysctl_variable" );
      "datacenter" data =>
      '[
      { "sysctl_variable": "kernel.sysrq",        "sysctl_value":"MY DATACENTER" },
      { "sysctl_variable": "vm.overcommit_ratio", "sysctl_value":"MY DATA CENTER" }
      ]';
      
      # Loaded from sysctl-$(my.app).csv
      # classexpression:sysctl_variable:sysctl_value
      # "linux":"kernel.sysrq":"MYAPP"
      # "linux":"sys.ipv4_forward"      :     "MYAPP"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"application" data => classexpression_filterdata( "sysctl-$(my.app).csv", 1, "\s*:\s*", true, "sysctl_variable" );
      "application" data =>
      '[
      { "sysctl_variable": "kernel.sysrq",     "sysctl_value": "MYAPP" },
      { "sysctl_variable": "sys.ipv4_forward", "sysctl_value": "MYAPP" }
      ]';
      
      # Loaded from sysctl-security.csv
      # classexpression:sysctl_variable:sysctl_value
      # "!router":"sys.ipv4_forward"      :     "SECURITY SAYS ONLY ROUTERS"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"security" data => classexpression_filterdata( "sysctl-security.csv", 1, "\s*:\s*", true, "sysctl_variable" );
      "security" data =>
      '[
      { "sysctl_variable": "sys.ipv4_forward", "sysctl_value": "SECURITY SAYS ONLY ROUTERS" }
      ]
      ';
      
      reports:
      "Merged with picked data:$(const.n)$(with)"
      with => string_mustache('{{#-top-}}
      {{sysctl_variable}}={{sysctl_value}}
      {{/-top-}}', mergedata( defaults, datacenter, application, security ));
      }
      

      Note how the keys are duplicated in the final data set.

      R: Merged with picked data:
      kernel.something=DEFAULT
      kernel.sysrq=DEFAULT
      kernel.core_users_pid=DEFAULT
      vm.overcommit_memory=EL_7_DEFAULT
      kernel.sysrq=MY DATACENTER
      vm.overcommit_ratio=MY DATA CENTER
      kernel.sysrq=MYAPP
      sys.ipv4_forward=MYAPP
      sys.ipv4_forward=SECURITY SAYS ONLY ROUTERS
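The duplication above follows from the record (array) form: merging arrays appends entries rather than overriding them by key. The Python sketch below contrasts plain concatenation with a key-aware merge. The helper name `merge_records` and the choice of `sysctl_variable` as the identifying field are assumptions for illustration, not part of the proposal.

```python
def merge_records(*shards, key="sysctl_variable"):
    """Merge lists of records, keeping only the last record seen for each key."""
    latest = {}
    for shard in shards:
        for record in shard:
            latest[record[key]] = record
    return list(latest.values())


def rec(variable, value):
    """Build one record in the column-heading form used in the example."""
    return {"sysctl_variable": variable, "sysctl_value": value}


defaults = [
    rec("kernel.something", "DEFAULT"),
    rec("kernel.sysrq", "DEFAULT"),
    rec("kernel.core_users_pid", "DEFAULT"),
    rec("vm.overcommit_memory", "EL_7_DEFAULT"),
]
datacenter = [
    rec("kernel.sysrq", "MY DATACENTER"),
    rec("vm.overcommit_ratio", "MY DATA CENTER"),
]
application = [rec("kernel.sysrq", "MYAPP"), rec("sys.ipv4_forward", "MYAPP")]
security = [rec("sys.ipv4_forward", "SECURITY SAYS ONLY ROUTERS")]

if __name__ == "__main__":
    concatenated = defaults + datacenter + application + security
    print(len(concatenated), "records when arrays are simply appended")
    for record in merge_records(defaults, datacenter, application, security):
        print(f"{record['sysctl_variable']}={record['sysctl_value']}")
```

Simple appending gives nine records, matching the nine output lines above, while the key-aware merge leaves six, one per sysctl variable. This is the extra step a policy writer would need if the function returned records instead of a key/value container.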
      
      

      For simple cases where no data merging is involved, it may be acceptable if the function that loads the data ensures that keys are unique in the returned data:

      bundle agent main
      {
      classes:
      "el_7" expression => "any";
      "dc_7" expression => "any";
      "dc_group_4" expression => "any";
      
      vars:
      
      # loaded from sysctl-global-defaults.csv
      # classexpression,sysctl_variable,sysctl_value
      # "debian_9","kernel.something","DEB_9_DEFAULT"
      # "el_7","kernel.something","EL_7_DEFAULT"
      # #"LAST ENTRY WINS"
      # "linux","kernel.something","DEFAULT"
      # "any","kernel.sysrq","DEFAULT"
      # "any","kernel.core_users_pid","DEFAULT"
      # "el_7","vm.overcommit_memory","EL_7_DEFAULT"
      # "dc_group_4","kernel.sysrq","MY DATACENTER"
      # "linux","kernel.sysrq","MYAPP"
      # "linux","sys.ipv4_forward","MYAPP"
      # "!router","sys.ipv4_forward","SECURITY SAYS ONLY ROUTERS"
      # loaded using classexpression_filterdata( "path to datafile", "Class expression Column/Key", "DELIM", "Has heading", "sort by" )
      #"data" data => classexpression_filterdata( "sysctl-global-defaults.csv", 1, ",", true, "sysctl_variable" );
      "data" data =>
      '[
      { "sysctl_variable": "kernel.something",      "sysctl_value": "DEFAULT" },
      { "sysctl_variable": "kernel.sysrq",          "sysctl_value": "MYAPP" },
      { "sysctl_variable": "kernel.core_users_pid", "sysctl_value": "DEFAULT" },
      { "sysctl_variable": "vm.overcommit_memory",  "sysctl_value": "EL_7_DEFAULT" },
      { "sysctl_variable": "sys.ipv4_forward",      "sysctl_value": "SECURITY SAYS ONLY ROUTERS" }
      ]';
      
      reports:
      "Merged with picked data:$(const.n)$(with)"
      with => string_mustache('{{#-top-}}
      {{sysctl_variable}}={{sysctl_value}}
      {{/-top-}}', data);
      }
      
      R: Merged with picked data:
      kernel.something=DEFAULT
      kernel.sysrq=MYAPP
      kernel.core_users_pid=DEFAULT
      vm.overcommit_memory=EL_7_DEFAULT
      sys.ipv4_forward=SECURITY SAYS ONLY ROUTERS
      
      

      Additional parameters that might be desirable

      • Ignore lines
        • Like we have today for many functions that parse data files.
        • Don't consider lines matching regular expression (like comments).
      • Additional filter
        • Let the function further restrict which data shall be allowed
        • Example: data file has linux = value x, debian = value y (later, so more specific). Function could load and say !debian to filter out the debian specific line.
        • Why?
          • Because people are crazy. The flexibility allows the policy writer to work around issues in the incoming data set.
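A rough Python sketch of how the two optional parameters might behave. The ticket leaves their exact semantics open, so the assumptions here are: the ignore pattern must match the whole line (anchored), and the additional filter is a callable that receives the row's class-expression text and decides whether the row is kept. The names `select_lines`, `ignore_pattern` and `keep_row` are hypothetical, and `holds` stands in for real class evaluation.

```python
import re


def select_lines(lines, holds, ignore_pattern=None, keep_row=None, delimiter=","):
    """Return the rows that survive the ignore pattern, class check and extra filter."""
    selected = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # empty rows are ignored
        if ignore_pattern is not None and re.fullmatch(ignore_pattern, text):
            continue  # e.g. comment lines
        expression = text.split(delimiter, 1)[0]
        if not holds(expression):
            continue  # class expression is false on this host
        if keep_row is not None and not keep_row(expression):
            continue  # policy writer explicitly rejects this row
        selected.append(text)
    return selected


if __name__ == "__main__":
    rows = [
        "# sysctl overrides",
        "linux,kernel.sysrq,value x",
        "debian,kernel.sysrq,value y",
    ]
    on_debian = lambda expr: expr in {"linux", "debian"}
    print(select_lines(rows, on_debian, ignore_pattern=r"\s*#.*"))
    print(select_lines(rows, on_debian, ignore_pattern=r"\s*#.*",
                       keep_row=lambda expr: expr != "debian"))
```

In the second call the debian-specific row is dropped even though the host is a debian host, so the more general linux value wins, which is the workaround scenario described in the bullet above.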

              People

              • Assignee:
                karlhto Karl Hole Totland
                Reporter:
                a10042 Nick Anderson