Automated End-to-End Encrypted Backups for Massively Scaled Elasticsearch via Puppet, AWS and Rsyslog

December 28, 2018
elasticsearch curator aws puppet rsyslog elasticsearch watcher

Background



When deciding which database to use, the two most important considerations are the data’s structure and retrieval complexity, because every database scales best against a particular type of data and query. For example, the ACID compliance around which SQL is built makes SQL databases well suited for transactional data, where highly filtered queries demand a high degree of integrity. NoSQL databases, on the other hand, are built around eventual consistency, relaxing ACID guarantees in order to increase speed. The lower integrity but higher speed of NoSQL databases makes them ideal for large hierarchical and document data sets, although the weaker transactional integrity compels most engineers to implement NoSQL solutions on less persistent or less important data. Because of this they rightly avoid backups, and I agree with this logic, except that I must also acknowledge that companies, often through no willful ignorance, sometimes use NoSQL databases for persistent, non-time-series data that really should be backed up. One such example is when the document store (search engine) Elasticsearch is used to store protected health information (PHI).



Introduction



Protected Health Information (PHI) is, loosely put, the medical and financial data that results from receiving medical care (e.g. medical records, bills, etc.), and it is subject to rigorous legal protection. PHI originators such as universities, hospitals and clinics create exabyte-sized data warehouses where the individual unit of PHI storage is a simple document like a medical record. The massive scale of PHI, as well as the vertical and key/value-pair nature of PHI documents, makes PHI an excellent big data analysis use case for a document store/search engine like Elasticsearch. However, PHI documents are persistent and non-time-series and thus could require backups (assuming that an ‘upstream’ PHI data source could not restore lost PHI easily). Because a PHI hack or leak could bankrupt an enterprise of any size, and given that there is no automagical and comprehensive backup and restore tool for (at least) on-prem Elasticsearch, an end-to-end automated, encrypted and HIPAA-compliant backup and restore solution is required. Furthermore, because this system handles the backup of sensitive data, we of course want to monitor and alert on the success and failure of backup jobs. What follows is a largely open source approach to handling such needs, leveraging Elasticsearch Curator and Watcher, AWS IAM and S3, Puppet and Rsyslog.



Diagrammatic Conceptual Overview of Automated End-to-End Encrypted Backups



[Diagram: E2E Elasticsearch Backup/Restore Solution]



Diagrammatic Conceptual Overview of Backup Alerts



[Diagram: mailx and Watcher alerting]



Caveats and Requirements



The Puppet code ahead was written on version 3.8.7, although what is posted in this article will likely work on all versions. There is too much code and logic for a single article, so only the core programmatic blocks are discussed; all of the code, however, is on my Github. The OS was CentOS 7, and the Elasticsearch versions were 5 and 6.

The main motivation for writing this article is that, as of 2018, I was unable to find any internet resource that lays out a conceptual approach for automating, and alerting on, Curator-based backups. The pieces are around, but nothing is complete, and significantly less so for more legacy Puppet versions. However, in the interest of transparency and courtesy, I’d recommend that anyone implementing automated Curator backups consider porting the logic I lay out ahead to an orchestrator like Ansible or Terraform, because these tools are written for modern SRE and infrastructure engineering paradigms such as immutable infrastructure and security automation.



Curator Backups: End-to-End Encryption



End-to-end encryption refers to any system that implements storage encryption-at-rest and network encryption-in-transit in all places. Curator achieves encryption-at-rest via Elastic’s repository-s3 plugin: with the repository’s server_side_encryption setting enabled, snapshot files and their related metadata are encrypted with AES-256 as they are stored in S3. S3 offers multiple encryption-at-rest options; this tutorial uses Amazon S3-Managed Keys (SSE-S3).

Encryption-in-transit with Curator and S3 is possible because Curator uses the Boto3 library, which relies on the Mozilla-curated Root CA bundle provided by the Python Certifi module (including the Amazon Root CA certificates) to validate S3’s TLS certificates.
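
If you want to confirm which CA bundle Certifi provides on a given master, a quick check (assuming the python interpreter used by Curator/Boto3 is on the PATH) is:

python -c 'import certifi; print(certifi.where())'

The command prints the path to the Mozilla-curated cacert.pem that TLS connections to S3 are validated against.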



Curator Backups: Automation



The two most important parts of the encrypted backup process (as opposed to the backup monitoring/alerting process) are controlling which nodes receive Curator and automating repo and snapshot creation for your environments. Curator is only installed on the masters. Controlling the Curator installation can be done in the Puppet language (e.g. if $::hostname =~ /host.(dev|qa|prod).domain.com/ { include ::curator }), in the Puppet console or in Hiera; for this project it is done via the Puppet console. Automating repo and snapshot creation for your environments is handled with Puppet code. Controlling which nodes receive Curator is one of the most important features of this project because Curator runs as root and has the power to delete indices if configured to do so. While I realize that this article is about backing up Elasticsearch, as opposed to deleting it, it is still prudent to be defensive against anything that has the potential power to blow clusters away. Automating repo metadata naming conventions per your environments automates communication with AWS S3, which is the most crucial communication process required for this backup solution. The diagram below illustrates the approach just described and is followed by a minimal sketch of the node-gating pattern, then a more specific, step-by-step breakdown of the most important programmatic parts of the backup approach:



[Diagram: Curator installation and repo control]
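
As a minimal sketch of the node-gating idea described above (the hostname pattern and the es-master naming are hypothetical; in this project the gating is actually done via Puppet console node groups), the pure Puppet-language variant looks something like this:

if $::fqdn =~ /^es-master-\d+\.(dev|qa|prod)\.domain\.com$/ {
  # Only Elasticsearch masters in known environments get Curator
  include ::curator
} else {
  notice("Skipping Curator on ${::fqdn}: not an Elasticsearch master in dev/qa/prod")
}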



Automating Encrypted Curator Backups: Installing Elasticsearch’s repository-s3 Plugin



In order to use Curator to send and retrieve snapshots to and from s3, Elasticsearch requires that you install its repository-s3 plugin. The repository-s3 plugin, however, requires a full cluster restart if being installed in an already running cluster. Ideally, this plugin would be baked into an Elasticsearch image and therein a cluster restart would be avoided, but if, as was my case, you’re installing it after cluster initialization then you’ll want to plan for a cluster restart.

Check your cluster’s plugins like so:



sudo /usr/share/elasticsearch/bin/elasticsearch-plugin list --verbose | grep s3



If the plugin is not installed the following block of Puppet code will deploy it for you:



exec {'repository-s3':
          command => '/usr/share/elasticsearch/bin/elasticsearch-plugin install --batch repository-s3',
          unless  => "/bin/test -f ${::elasticsearch::params::plugindir}/repository-s3/plugin-descriptor.properties",
          require => Class['some_class_that_should_likely_come_before'],
     }



--batch mode runs the plugin installation non-interactively, automatically granting the additional permissions the plugin requests (it normally prompts for confirmation). The unless attribute is a hack to check whether the plugin is already installed, because Puppet and Elasticsearch will complain if it is. Finally, the require attribute establishes order dependency, because of course Elasticsearch will need to be installed before the plugin install is attempted.
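
Once the plugin is installed and the cluster has been restarted, you can confirm it is active on every node via the _cat/plugins API; the -k/basic-auth style below mirrors the curl calls later in this article, and the credentials placeholder is of course an assumption:

curl -k -H 'Authorization: Basic <base64-encoded user:password>' https://$(hostname -f):9200/_cat/plugins | grep repository-s3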



Automating Encrypted Curator Backups: Installing the Curator RPM



Installing Curator on CentOS 7 with Puppet is a breeze. The code blocks below are so trivial I almost didn’t include them, but they do handle the case of an internally mirrored Elasticsearch yum repo, a chaining-arrow dependency flow that ensures configuration files aren’t deployed until Curator is installed, and running Curator via cron:



yumrepo { 'curator':
    enabled             => $manage_repo,
    baseurl             => $baseurl,
    gpgcheck            => 1,
    gpgkey              => $gpg_key_file_path,
    skip_if_unavailable => 0,
    require             => File[$gpg_key_file_path],
  } ~>

  package { 'elasticsearch-curator':
    ensure  => present,
  } ~>

  file {'/opt/elasticsearch-curator/curator.yml':
    ensure  => file,
    mode    => '0640',
    content => template("${module_name}/curator.yml.erb"),
  } ~>

  file {'/opt/elasticsearch-curator/snapshot.yml':
    ensure  => file,
    mode    => '0640',
    content => template("${module_name}/snapshot.yml.erb"),
  } ->

  file {'/etc/cron.d/snapshot_alert':
    ensure  => present,
    mode    => '0644',
    content => "${snapshot_actions_cron_schedule} ${snapshot_actions_cron_user} /usr/bin/curator --config /opt/elasticsearch-curator/curator.yml /opt/elasticsearch-curator/snapshot.yml || /bin/bash /usr/local/bin/snapshot_alert.sh\n",
  }



The yumrepo block configures a yum repo securely by forcing GPG signature checking, and since the elasticsearch-curator package is signed by Elastic, using this feature is highly recommended. If you’re not mirroring the Elasticsearch artifacts then this block isn’t needed.

I’m not a big fan of templating, but if you’re not using Hiera, then you’ll definitely want to templatize your curator.yml and snapshot.yml configuration files for programmability and use the content => template(...) attribute to deploy them via their respective file resources.
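
For orientation, here is a minimal sketch of what the two rendered files might contain; every value is illustrative (hosts, credentials, repository name and filters all come from your own templates and parameters), and the full set of supported keys is in the Curator documentation:

# curator.yml (client configuration; illustrative values)
client:
  hosts:
    - localhost
  port: 9200
  use_ssl: True
  ssl_no_validate: True
  http_auth: some_user:some_password
  timeout: 300

logging:
  loglevel: INFO
  logformat: default

# snapshot.yml (action file; illustrative values)
actions:
  1:
    action: snapshot
    description: "Snapshot all indices to the environment's S3-backed repo"
    options:
      repository: dev-snapshot-repo
      name: curator-%Y%m%d%H%M%S
      wait_for_completion: True
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '.*'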

The ~> symbols are chaining arrows, Puppet syntactic sugar meaning: apply the resource on the left first and, if that resource changes, refresh the resource on the right. The chaining arrows may be needed to ensure that the /opt/elasticsearch-curator/ dir, which is created by the installation of the elasticsearch-curator rpm, is present before any files associated with that directory, such as configuration files, are deployed into it.

The -> syntax at the end of the dependency chain means apply the resource on the left, then the resource on the right, but do not refresh the resource on the right if the resource on the left changes. This ensures that the cron job is only populated if Curator is installed and assumes that a change to the snapshot.yml file doesn’t necessitate a change to the cron job.

The content attribute creates a short-circuit evaluation between the Curator cron job and the snapshot_alert.sh script: cron attempts to execute Curator, and runs snapshot_alert.sh only if Curator exits with anything but a code of 0. The snapshot_alert.sh script will be covered in the alerting section.
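
Rendered, /etc/cron.d/snapshot_alert ends up as a single line along these lines, where the nightly schedule and the root user are illustrative stand-ins for ${snapshot_actions_cron_schedule} and ${snapshot_actions_cron_user}:

0 2 * * * root /usr/bin/curator --config /opt/elasticsearch-curator/curator.yml /opt/elasticsearch-curator/snapshot.yml || /bin/bash /usr/local/bin/snapshot_alert.sh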



Automating Encrypted Curator Backups: Creating an Encrypted Snapshot Repository

The code block below uses the Elasticsearch _snapshot API to create a repository in an Elasticsearch cluster to store encrypted cluster snapshots that will be securely transmitted to, and possibly retrieved from, AWS S3. It’s a curl-based PUT using HTTP Basic Auth to authenticate an Elasticsearch user to the cluster. In this PUT we send the required Elasticsearch API values via curl’s -d switch, and we are also required to specify an HTTP header declaring that the Content-Type of that data is application/json, which the Elasticsearch API requires for reasons outlined in this post. The values assigned to the command’s JSON keys are all variables, with the exception of "type": "s3"; the types and possible values for these variables can be reviewed here. The two most important values are ${_snapshot_repository} and ${_snapshot_bucket}, because they automate the creation of repos according to host environment, which is critical to the overall automation of this approach; programmatically handling the assignment of these two values is discussed in the next two subsections. The "server_side_encryption" value is variablized, but must be set to true so that snapshots are encrypted in S3 with AES256 and the solution actualizes end-to-end encryption. Finally, the unless attribute is used because Puppet’s exec command is notorious for breaking idempotency, as discussed in numerous places, including this post here.

exec { "create_${_snapshot_repository}_repo":
      path        => ['/bin', '/usr/bin' ],
      unless      => "curl -k -H 'Authorization: Basic ${::basic_auth_password}' https://$(hostname -f):9200/_cat/repositories | grep -q ${_snapshot_repository}",
      command     => "curl -k -XPUT -H 'Authorization: Basic ${::basic_auth_password}' https://$(hostname -f):9200/_snapshot/${_snapshot_repository} -H 'Content-Type: application/json' -d '{ \"type\": \"s3\", \"settings\": { \"bucket\": \"${_snapshot_bucket}\", \"compress\": \"${snapshot_bucket_compress}\", \"server_side_encryption\": \"${snapshot_bucket_server_side_encrypt}\", \"canned_acl\": \"${snapshot_bucket_canned_acl}\" } }'"
  }
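
After a Puppet run, a quick sanity check that the repository was registered and that the cluster can reach the bucket is the snapshot verify API (the repo name and credentials below are illustrative; same -k/basic-auth style as above):

curl -k -XPOST -H 'Authorization: Basic <base64-encoded user:password>' https://$(hostname -f):9200/_snapshot/dev-snapshot-repo/_verify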



Automating Encrypted Curator Backups: Controlling for Elasticsearch Repository Name



This block automates assigning the $_snapshot_repository value required in the repo-creation API call. In the case of a single cluster, an if with a case statement in its else branch will do. The if clause covers overriding at the Puppet console or Hiera level, and the case in the else branch assigns a $_snapshot_repository value based on the domain of the host on which the Curator manifest is being executed.



if $snapshot_repository {
    $_snapshot_repository = $snapshot_repository
} else {
    case $facts['domain'] {
      'dev.domain.com':  { $_snapshot_repository = 'dev-snapshot-repo' }
      'qa.domain.com':   { $_snapshot_repository = 'qa-snapshot-repo' }
      'prod.domain.com': { $_snapshot_repository = 'prod-snapshot-repo' }
      default:           { fail("domain must match \"^some message that covers all required domain cases$\"") }
  }
}



If you have multiple cluster use cases then a more complex case statement leveraging regexes helps. The block below is a bit hard to read, which is why declarative languages shouldn’t be used for scripting; nonetheless it has been tested against environments with thousands of AWS EC2 instances and works just fine:



if $snapshot_repository {
    $_snapshot_repository = $snapshot_repository
} else {
    case $facts['hostname'] {
      /^hostN.cluster.type.1.(dev|qa|prod).domain.com$/: { $_snapshot_repository = sprintf('%s-cluster-type-1-snapshot-repository', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      /^hostN.cluster.type.2.(dev|qa|prod).domain.com$/: { $_snapshot_repository = sprintf('%s-cluster-type-2-snapshot-repository', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      /^hostN.cluster.type.3.(dev|qa|prod).domain.com$/: { $_snapshot_repository = sprintf('%s-cluster-type-3-snapshot-repository', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      default: { fail("hostname must match \"^some message that covers all required hostname cases$\"") }
  }
}



The number of repo names to create is the product of the number of cluster types and the number of host environments; e.g. in the block above there are 3 cluster types (cluster.type.1, cluster.type.2 and cluster.type.3) and 3 host environments (dev|qa|prod), hence the number of required repo names is 9, i.e. dev-cluster-type-1-snapshot-repository, ..., prod-cluster-type-3-snapshot-repository. Puppet’s built-in function regsubst returns the host’s environment substring (dev, qa, etc.) to sprintf, which in turn uses that value to interpolate its format string '%s-cluster-type-N-snapshot-repository', thereby assigning an environment to the cluster type, e.g. dev-cluster-type-1-snapshot-repository, ..., prod-cluster-type-3-snapshot-repository. The '\1' parameter of the regsubst function is a backreference to the first match of the capture group in (dev|qa|prod).domain.com against the ::domain value. For example, if ::domain == dev.domain.com, then in the case of host1.cluster.type.1.dev.domain.com regsubst would return dev; if ::domain == qa.domain.com then qa would be returned. The default case is to fail if a hostname match does not occur; keep in mind this aborts Puppet catalog compilation, meaning that the Puppet code will not be executed at all, which provides added defense against creating a repo on a node that has somehow incorrectly entered the node group and had Curator installed.
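
If you want to sanity-check the regsubst/sprintf logic outside of the module, a throwaway manifest like the following (the hard-coded domain is hypothetical) can be run with puppet apply:

# sanity_check.pp -- verify the environment-extraction logic
$test_domain = 'dev.domain.com'
$env         = regsubst($test_domain, '(dev|qa|prod).domain.com', '\1')
# Prints a notice containing: dev-cluster-type-1-snapshot-repository
notice(sprintf('%s-cluster-type-1-snapshot-repository', $env))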



Automating Encrypted Curator Backups: Controlling for AWS S3 Bucket Name



The blocks below are identical to the ‘Controlling for Elasticsearch Repository Name’ blocks above, except in this case the S3 bucket name, i.e. the ${_snapshot_bucket} value consumed by the repo-creation exec, is being assigned:

# Single Cluster use case

if $snapshot_bucket {
    $_snapshot_bucket = $snapshot_bucket
} else {
    case $facts['domain'] {
      'dev.domain.com':  { $_snapshot_bucket = 'dev-s3-bucket' }
      'qa.domain.com':   { $_snapshot_bucket = 'qa-s3-bucket' }
      'prod.domain.com': { $_snapshot_bucket = 'prod-s3-bucket' }
      default:           { fail("domain must match \"^some message that covers all required domain cases$\"") }
  }
}

# Multiple cluster use cases

if $snapshot_bucket {
    $_snapshot_bucket = $snapshot_bucket
} else {
    case $facts['hostname'] {
      /^hostN.cluster.type.1.(dev|qa|prod).domain.com$/: { $_snapshot_bucket = sprintf('%s-cluster-type-1-s3-bucket', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      /^hostN.cluster.type.2.(dev|qa|prod).domain.com$/: { $_snapshot_bucket = sprintf('%s-cluster-type-2-s3-bucket', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      /^hostN.cluster.type.3.(dev|qa|prod).domain.com$/: { $_snapshot_bucket = sprintf('%s-cluster-type-3-s3-bucket', regsubst($::domain, '(dev|qa|prod).domain.com', '\1')) }
      default: { fail("hostname must match \"^some message that covers all required hostname cases$\"") }
  }
}