End-to-End Encrypted backups for massively scaled Elasticsearch via Puppet and Rsyslog

December 28, 2018
elasticsearch aws puppet rsyslog

Background

When deciding which database to use, the two most important considerations are the data’s structure and the complexity of retrieval, because every database scales best against a particular type of data and query. For example, SQL databases handle horizontal data and queries with lots of filtering better than NoSQL databases. For me, answering why is a bit elusive because a fair amount of searching did not turn up an obvious answer, but I would bet that the academically rigorous, mathematical origin of SQL is an important factor. NoSQL databases handle hierarchical, vertical, graph, document, key/value and hashmap data, as well as simpler queries, better than SQL databases. Again, for me, answering exactly why is not trivial, but I think it is safe to say that NoSQL query languages are written more for analysis and less for ACID compliance.

Somewhat, but only somewhat, tangentially, many engineers also feel that NoSQL data should be less persistent, more variable, less important, etc., and therefore that no backups should be needed. I am inclined to agree, but I must also acknowledge that companies, often through no willful ignorance, sometimes use NoSQL databases for persistent, non-time-series data that really should be backed up. One such example is when the document store (search engine) Elasticsearch is used to store protected health information (PHI).

Introduction

Protected Health Information (PHI) is, loosely put, the medical and financial data that results from receiving medical care (e.g. medical records, bills, etc.); it is subject to strict legal protection. PHI sources such as universities, hospitals and clinics create exabyte-sized data warehouses where the individual unit of PHI storage is a simple document like a medical record. The massive scale of PHI, as well as the vertical and key/value-pair nature of PHI documents, makes PHI an excellent big data analysis use case for a document store/search engine like Elasticsearch. However, PHI documents are persistent and non-time-series, and thus could require backups (assuming that an ‘upstream’ PHI data source could not easily restore lost PHI). Additionally, a PHI hack or leak could bankrupt or nearly bankrupt an enterprise of any size. Given that there is no automagical and comprehensive backup and restore tool for (at least) on-prem Elasticsearch, an end-to-end automated, encrypted and HIPAA-compliant backup and restore solution is required. What follows is one such (largely open-source) solution leveraging Elasticsearch Curator, AWS IAM and S3, Puppet and Rsyslog.
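To make the backup half of this a little more concrete, here is a minimal sketch (not the actual solution described in this post) of the two Elasticsearch snapshot API requests that a tool like Curator ultimately drives: registering an S3 snapshot repository and taking a snapshot into it. It assumes the repository-s3 plugin is installed; the bucket, repository, snapshot and index names are hypothetical placeholders. The sketch only builds the requests and prints them, so it does not touch a cluster. Note that `server_side_encryption` only covers encryption at rest on the S3 side; encryption in transit and key management are separate concerns.

```python
import json


def s3_repository_body(bucket, base_path):
    """Body for PUT /_snapshot/<repository> (repository-s3 plugin assumed)."""
    return {
        "type": "s3",
        "settings": {
            "bucket": bucket,        # hypothetical bucket name supplied by caller
            "base_path": base_path,  # key prefix inside the bucket
            # Ask S3 to encrypt snapshot files at rest.
            "server_side_encryption": True,
        },
    }


def snapshot_request(repository, snapshot, indices):
    """Return (method, path, body) for creating a snapshot of the given indices."""
    path = f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true"
    body = {
        "indices": ",".join(indices),
        # Back up only the listed indices, not cluster-wide state.
        "include_global_state": False,
    }
    return "PUT", path, body


if __name__ == "__main__":
    # All names below are made up for illustration.
    repo_body = s3_repository_body("example-phi-backups", "elasticsearch")
    print("PUT /_snapshot/phi_backups")
    print(json.dumps(repo_body, indent=2))

    method, path, body = snapshot_request(
        "phi_backups", "snap-2018-12-28", ["records-a", "records-b"]
    )
    print(method, path)
    print(json.dumps(body, indent=2))
```

The printed requests could be sent with any HTTP client, or the equivalent repository and snapshot could be expressed as a Curator action file and scheduled from there.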

Diagrammatic Conceptual Overview

E2E Elasticsearch Backup/Restore Solution

Simple ‘Flow-Chart’ Overview

       In progress :-)
