4 Different Ways to Ingest Data in AWS OpenSearch

  • Joanna Wallace
  • May 12, 2022

AWS OpenSearch is a project based on Elastic’s Elasticsearch and Kibana projects. Amazon created OpenSearch from the last open-source version of Elasticsearch (7.10) and offers it as a service within AWS. The key differences between the two are topics for another discussion, but the most significant point to note before running either distribution is the difference in licenses. Elasticsearch now runs under a dual-license model, while OpenSearch remains open-source. 

Like Elasticsearch, OpenSearch can store and analyze observability data, including logs, metrics, and traces. Elasticsearch primarily uses Logstash to load data, while OpenSearch users can choose from several services to ingest data into indices. Which service is best suited for OpenSearch ingestion depends on your use case and current setup. 

Ingestion Methods for AWS OpenSearch

Data can be written to OpenSearch using the OpenSearch client and a compute function such as AWS Lambda. To write to your cluster directly, the data must be clean and formatted according to your OpenSearch mapping definition. This requirement may not be ideal for writing observability data with formats other than JSON or CSV. 

Data must also be batched appropriately so as not to overwhelm your defined OpenSearch cluster. The cluster setup significantly impacts the cost of the OpenSearch service and should be configured as efficiently as possible. Each of the methods described below requires the cluster to be running before starting to stream data.
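The batching requirement above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it groups documents into newline-delimited bodies in the shape the OpenSearch _bulk API expects. The function name bulk_batches, the index name, and the batch size of 500 are assumptions chosen for the example; the right batch size depends on your cluster.

```python
import json
from typing import Iterable, Iterator


def bulk_batches(docs: Iterable[dict], index: str, batch_size: int = 500) -> Iterator[str]:
    """Yield newline-delimited _bulk request bodies holding at most batch_size documents.

    bulk_batches, index, and batch_size are hypothetical names for this sketch.
    """
    lines = []
    count = 0
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            # A bulk body must end with a trailing newline.
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Flush the final, partially filled batch.
        yield "\n".join(lines) + "\n"
```

Each yielded string can then be sent as the body of one bulk request, for example through the bulk method of an OpenSearch client library, so that the cluster receives a bounded number of documents per call.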

AWS allows users to stream data directly from other AWS services into an OpenSearch index without an intermediate step through a compute function. 

AWS Kinesis Firehose

AWS Kinesis is a streaming service that collects, processes, and analyzes data in real time. It scales itself up and down based on current requirements. AWS Kinesis Firehose builds on the Kinesis streaming service and also allows users to extract and transform data within the stream itself before outputting the data to another service. 

Firehose can also automatically write data to other AWS services like AWS S3 and AWS OpenSearch before outputting the streamed data. Firehose can also send data directly to third-party vendors working with AWS to provide observability services, like Coralogix. The Kinesis stream output data can be handled separately from the automatic writes to other services.

Firehose uses an AWS Lambda function to apply any requested changes to the streamed data. Developers can set up a custom Lambda to process streamed data or use one of the blueprint functions provided by AWS. These changes are not required but are helpful for observability data, which is often unstructured. Recording data in JSON format makes analytics simpler for any third-party tools you may utilize. Some tools like Coralogix’s log analytics platform also have built-in parsers that can be used if changing data at the Kinesis level is not ideal. 
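A custom transformation Lambda could look roughly like the sketch below. It follows the record contract Firehose uses for data transformation (each incoming record carries a recordId and base64-encoded data, and each returned record carries recordId, result, and data). The decision to wrap non-JSON lines in a "message" field is an assumption made for illustration, not something prescribed by AWS or by this article.

```python
import base64
import json


def lambda_handler(event, context):
    """Turn each Firehose record into a JSON document before delivery."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8").strip()
        try:
            doc = json.loads(raw)
            if not isinstance(doc, dict):
                # Valid JSON but not an object (e.g. a bare number): wrap it.
                doc = {"message": raw}
        except json.JSONDecodeError:
            # Unformatted log line: wrap it in a hypothetical "message" field.
            doc = {"message": raw}
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(doc).encode("utf-8")).decode("utf-8"),
            }
        )
    return {"records": output}
```

Records that cannot be repaired could instead be returned with a result of "ProcessingFailed" so that Firehose routes them to its error output rather than to the index.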

Kinesis Firehose is an automatically scalable service. If your platform requires large volumes of data to flow into OpenSearch, this is a wise choice for infrastructure. It will likely be more cost-effective than other AWS services, assuming none are already used. 

AWS CloudWatch

AWS CloudWatch is a service that collects logs from other AWS services and makes them available for display. Compute functions like AWS Lambda and AWS Fargate can send log data to CloudWatch for troubleshooting by DevOps teams. From CloudWatch, data can be sent to other services within AWS or to observability tools to help with troubleshooting. Logs are essential for observability but are most helpful when used in concert with metrics and traces. 

To send log data to OpenSearch, developers first need to set up a subscription. CloudWatch subscriptions consist of a log group, the receiving resource, and a subscription filter. Each log group must have its own subscription filter set up. A log group collects the logs from a single Lambda function or Fargate task, so different invocations of the same function do not need a new setup. The receiving resource is the service to which the logs will be sent; in this case, an OpenSearch cluster. Lastly, the subscription filter is a simple setup inside CloudWatch that determines which logs should be sent to the receiving service. You can add filter patterns so that only logs containing particular keywords or data are recorded in OpenSearch. 
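When a subscription delivers logs to a Lambda function on their way to the cluster, the payload arrives base64-encoded and gzip-compressed under the awslogs key. The sketch below unpacks that payload into one document per log event. The output field names ("@timestamp", "log_group", "log_stream") are assumptions for illustration and should match whatever your OpenSearch mapping defines.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> list:
    """Unpack a CloudWatch Logs subscription payload into flat documents."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # CloudWatch also sends control messages; only data messages carry logs.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        {
            "@timestamp": log_event["timestamp"],  # epoch milliseconds
            "message": log_event["message"],
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
        }
        for log_event in payload["logEvents"]
    ]
```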

Depending on how the filter is defined, this setup can write a large number of logs to OpenSearch. The cost of CloudWatch can be complex to calculate, but the more you write, the more it will cost, and the price can increase very quickly. Streaming data to another service will only increase the costs of running your platform. Before using this solution, determine if the cost is worth it compared to other solutions presented here. 

Logstash

Logstash is a data processing pipeline developed by Elastic. It can ingest, transform, and write data to an Elasticsearch or OpenSearch cluster. When using Elastic, the ELK stack includes Logstash automatically, but in AWS, Logstash is not automatically set up. AWS uses the open-source version of Logstash to feed data into OpenSearch. Logstash and its OpenSearch output plugin need to be installed and deployed on an EC2 server. Developers then configure Logstash to write directly to OpenSearch. AWS provides details on the configuration that sends data to AWS OpenSearch. 

Since Logstash requires the setup of a new server on AWS, it may not be a good production solution for AWS users. Using one of the other listed options may be less expensive, especially if any other listed services are already in use. It can also reduce the amount of engineering setup required. 

AWS Lambda

AWS Lambda is a serverless compute service that allows developers to quickly build custom functionality on the cloud. Developers can use Lambda functions with the OpenSearch client library to write data to an OpenSearch cluster. Writing to OpenSearch from Lambda makes it possible to write highly customized data to the cluster from many different services.

Many AWS services can trigger Lambdas, including DynamoDB streams, SQS, and Kinesis Firehose. Triggering a Lambda to write data directly also means developers can clean and customize data before it is written to OpenSearch. Having clean data means that observability tools can work more efficiently to detect anomalies in your platform. 

An everyday use case might be the need to update a log in OpenSearch with metadata whenever a DynamoDB entry is written or updated. Developers can configure a stream to trigger a Lambda on changes to DynamoDB, and this stream could send either new data alone or new and old data. A data model with pertinent metadata is formed from this streaming information, and Lambda can write it directly to OpenSearch for future analysis. 
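The DynamoDB use case above can be sketched as follows. A stream record carries its item data in DynamoDB's typed attribute format (for example {"S": "text"} or {"N": "3"}), so the sketch first converts that into plain values and then builds a document containing the new and, when present, old item images. The helper names and the document layout ("event", "keys", "new", "old") are hypothetical and chosen only for this example.

```python
def _plain(attr: dict):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    ((kind, value),) = attr.items()
    if kind == "N":
        # Numbers arrive as strings; prefer int and fall back to float.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if kind == "NULL":
        return None
    if kind == "M":
        return {key: _plain(item) for key, item in value.items()}
    if kind == "L":
        return [_plain(item) for item in value]
    # Strings, booleans, sets, and binary values are passed through as-is.
    return value


def stream_record_to_doc(record: dict) -> dict:
    """Build an OpenSearch-ready document from one DynamoDB stream record."""
    change = record["dynamodb"]
    doc = {
        "event": record["eventName"],  # INSERT, MODIFY, or REMOVE
        "keys": {key: _plain(item) for key, item in change["Keys"].items()},
    }
    # Which images are present depends on the stream view type.
    if "NewImage" in change:
        doc["new"] = {key: _plain(item) for key, item in change["NewImage"].items()}
    if "OldImage" in change:
        doc["old"] = {key: _plain(item) for key, item in change["OldImage"].items()}
    return doc
```

A Lambda handler would apply stream_record_to_doc to each entry in event["Records"] and write the resulting documents to the cluster.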

AWS IoT

AWS IoT is a service that allows developers to connect hardware IoT devices to the cloud. AWS IoT Core supports different messaging protocols like MQTT and HTTPS to publish data from the device and store it in various AWS cloud services. 

Once data is in AWS IoT, developers can configure rules to send the data to other services for processing and storage. The OpenSearch action will take MQTT messages from IoT and store them in an OpenSearch cluster.  
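As a rough sketch, a rule with an OpenSearch action can be described as a payload like the one below and then registered with AWS IoT. The function name, topic filter, endpoint, index, and role ARN are all placeholders assumed for illustration; the payload shape follows the topic rule format used by the AWS IoT API as best understood here, so check it against the current AWS documentation before relying on it.

```python
def opensearch_rule_payload(topic_filter: str, endpoint: str, index: str, role_arn: str) -> dict:
    """Build a topic rule payload that forwards matching MQTT messages to OpenSearch.

    All argument values are supplied by the caller; nothing here contacts AWS.
    """
    return {
        # Select every field of every message published to the topic filter.
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "openSearch": {
                    "roleArn": role_arn,  # role that lets IoT write to the domain
                    "endpoint": endpoint,
                    "index": index,
                    "type": "_doc",
                    # Substitution template evaluated by AWS IoT per message.
                    "id": "${newuuid()}",
                }
            }
        ],
    }


# Registering the rule would require AWS credentials and boto3, for example:
# boto3.client("iot").create_topic_rule(ruleName="example_rule", topicRulePayload=payload)
```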

Machine Learning and Observability

When putting logs into OpenSearch, the goal is to get better insights into how your SaaS platform functions. DevOps teams can catch errors, delays in processing, or unexpected behaviors with the correct setup in observability tools. Teams can also set up alerts to notify one another when it’s time to investigate errors. 

Setting up log analysis or machine learning in AWS OpenSearch is not a simple task, and there is no switch to turn on and gain insight into your platform. It takes significant engineering resources to use OpenSearch for observability with machine learning, and teams would need to build a custom solution. If this type of processing is critical for your platform, consider using an established system like Coralogix that can provide log analysis and alerting to inform you when your system is not performing at its best. 

Summary

AWS OpenSearch is an AWS-supported, open-source alternative to Elasticsearch. Being part of the AWS environment, it can be fed data by multiple AWS services like Kinesis Firehose and Lambda. Developers can use OpenSearch to store various data types through customized mappings, including observability data. DevOps teams can query logs using the associated OpenSearch Dashboards (the OpenSearch fork of Kibana) or AWS compute functions to help with troubleshooting and log analysis. For a fast setup of machine learning log analytics without needing specialized engineering resources, consider also utilizing the Coralogix platform to maintain your system. 
