Query Archive with AWS Athena

Coralogix can be configured to automatically and dynamically archive logs to an S3 bucket. This saves Coralogix customers money, but there are times when the archived data needs to be reindexed. Reindexing counts the reindexed logs against the daily quota, so customers often want to narrow their search down to the exact logs to be reindexed, or even query the logs outside of Coralogix altogether.

Since the logs reside in an S3 bucket owned by the customer, they can be queried with any tool or method that can access S3.
This post shows how to use AWS Athena to query these logs.

Note: To perform the operations described in this post, you’ll need access to an AWS console with the appropriate permissions.

Finding Your Archive Folder

First, you’ll need the name of the bucket used for Coralogix archiving. You can find it by going to TCO -> Archive.

Access the S3 bucket through the AWS console. It has a parent folder structure of ‘cx’ → ‘csv’ → ‘v2’ → team_id → dt → hr. Under each hour folder, the log files are divided into 10-minute chunks, and all hours are UTC. Each log file is in CSV format and contains log lines with a timestamp, metadata fields, and the log itself as text (JSON fields are not parsed into separate CSV entries).
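For example, a single archived log file might sit under a key like the following (the bucket name, team ID, date, and file name here are purely illustrative; check your own bucket for the exact layout). The partition folders are typically Hive-style (key=value), which is what lets Athena register them as partitions later on:

```
s3://my-coralogix-archive/cx/csv/v2/team_id=1234/dt=2021-01-26/hr=03/<file>.csv
```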

AWS Athena

Athena is an AWS service that makes it easy to query the log files using standard SQL syntax. Using Athena commands you will create a DB and a schema and then query the logs. 

Open the Athena service in your AWS account. As the notice at the top of the query editor indicates, you first need to set up a query result location in Amazon S3. Click the blue link and enter your S3 bucket name.

Once the configuration is complete, run the following commands in the Athena query editor:

Database creation

This command creates a database:

CREATE DATABASE <db_name>;

Example: CREATE DATABASE reindex_queries;

More information about the CREATE DATABASE command is available in the AWS documentation.

Table creation

This command creates a table called <table_name> within the database <db_name>. The table's entries are mapped to the Coralogix logs in the archive, so it is important not to change the order of the fields in the first section of the command. ‘text’ is the column that contains the log as it appears in the logs screen.

CREATE EXTERNAL TABLE IF NOT EXISTS <db_name>.<table_name> (
 `timestamp` string,
 `severity` string,
 `text` string,
 `applicationname` string,
 `subsystemname` string,
 `category` string,
 `classname` string,
 `computername` string,
 `methodname` string,
 `threadid` string,
 `ipaddress` string)
PARTITIONED BY (
 `team_id` string,
 `dt` string,
 `hr` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'separatorChar'=',')
STORED AS INPUTFORMAT
 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 's3://your-bucket-name/cx/csv/v2'
TBLPROPERTIES (
 'classification'='csv',
 'skip.header.line.count'='1'
);

Following a successful run of the ‘CREATE TABLE’ command, the new table appears on the left side. Click it to see its fields.
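To sanity-check the new table, you can preview a few rows directly from the query editor (substitute your own database and table names; note that the query returns no rows until partitions have been loaded, as described in the next section):

```sql
SELECT * FROM "<db_name>"."<table_name>" LIMIT 10;
```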


More information about the CREATE TABLE command is available in the AWS documentation.

Update partitions

When you need to query new data (from a time range you haven’t queried before), the table must be updated with the new dates/hours.
To do this, click “Load partitions” in the table’s menu.
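Clicking “Load partitions” runs Hive partition discovery. If you prefer the query editor, the same can be done in SQL: MSCK REPAIR TABLE scans the table’s S3 location for new Hive-style partition folders, while ALTER TABLE registers one partition explicitly (the team ID, date, and hour below are illustrative):

```sql
-- Discover and register all new partition folders under the table's location
MSCK REPAIR TABLE <db_name>.<table_name>;

-- Or register a single specific hour explicitly
ALTER TABLE <db_name>.<table_name> ADD IF NOT EXISTS
  PARTITION (team_id='1234', dt='2021-01-26', hr='03');
```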

SELECT query

Using a ‘SELECT’ statement, you can query the logs:

SELECT
   substr(timestamp, 1, 23) as timestamp,
   applicationname,
   subsystemname,
   severity,
   threadid,
   try(json_extract(text, '$.kubernetes.host')) as kubernetes_host,
   text
FROM
   "<db_name>"."<table_name>"
WHERE
   severity='Info' AND
   applicationname='prod' AND
   subsystemname!='abcde' AND
   team_id='XXX' AND
   dt = '2021-01-26' AND
   hr >= '03' AND 
   hr < '04' AND
   json_extract(text, '$.kubernetes.host') IS NOT NULL
LIMIT 100;

The above command returns the first 100 logs that fit the query. It retrieves the first 23 characters of the timestamp field along with the other listed fields. In addition, it extracts the kubernetes.host field from the text field (the log as it appears in Coralogix’s log screen) into a result column called kubernetes_host. The query parameters are in the WHERE section; ‘IS NOT NULL’ ensures that the returned logs have a value for that field.

Note that if the database is selected on the left side, specifying the database name in the query is not needed.

 

Example 2:

In this query we are searching for free text “throttler” in the log:

SELECT
   substr(timestamp, 1, 23) as timestamp,
   applicationname,
   subsystemname,
   severity,
   text
FROM
   "coralogix_table"
WHERE
   applicationname='ip_log' AND
   team_id='12345' AND
   dt = '2021-03-17' AND
   hr >= '21' AND 
   hr < '23' AND
   text LIKE '%throttler%'
LIMIT 100

The query above returns the first 100 records with a timestamp between 2021-03-17 21:00 and 23:00, application name ip_log, any subsystem name, any severity, and text containing the word “throttler”. The team_id 12345 is used only as a filter and is not displayed in the results.

Example 3:

SELECT
   try(json_extract(text, '$.clientIp')) as clientIp,
   try(json_extract(text, '$.userId')) as userId,
   COUNT(*) as count
FROM
   "<db_name>"."<table_name>"
WHERE
   team_id='xxxxx' AND
   dt = '2021-03-22' AND
   hr >= '14' AND 
   hr < '17' AND
   json_extract(text, '$.clientIp') IS NOT NULL AND
   json_extract(text, '$.userId') IS NOT NULL AND
   text LIKE '%503 service unavailable%' AND
   date_parse(substr(timestamp, 1, 19),'%Y-%m-%dT%H:%i:%s') BETWEEN TIMESTAMP '2021-03-22 14:30:00' AND TIMESTAMP '2021-03-22 16:10:00'
GROUP BY text
ORDER BY count DESC;

The query above returns the clientIp and userId fields together with a count of occurrences.

The count comes from COUNT(*) as count in the SELECT section combined with GROUP BY text, which groups identical log lines; the results are ordered by the count field (ASC and DESC determine ascending or descending order). The hr field only allows searching in 1-hour intervals, which was not precise enough here, so the timestamp field was added to the WHERE section. Since Athena’s date_parse expects a MySQL-style format string, substr takes the first 19 characters of the timestamp and date_parse maps an explicit format onto it, producing a TIMESTAMP that Athena can compare (note that %H, the 24-hour specifier, is required here rather than %h, which only accepts hours 01-12).
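As an aside, when the archived timestamp is in ISO 8601 form, Presto’s from_iso8601_timestamp can parse it without a manual format string. A minimal sketch, assuming the same timestamp shape as above (Athena’s session time zone is UTC, matching the archive):

```sql
SELECT
   from_iso8601_timestamp(substr(timestamp, 1, 19)) AS ts,
   text
FROM
   "<db_name>"."<table_name>"
WHERE
   team_id='xxxxx' AND
   dt = '2021-03-22'
LIMIT 10;
```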

Query with Save

This command is also known as CTAS (CREATE TABLE AS SELECT). It is a variation of the ‘CREATE TABLE’ command mentioned above.

It queries the data and saves the result in a specific S3 location (it also creates a table, which needs to be deleted afterwards):

CREATE TABLE reindex_queries.ctas
WITH (
 format='TEXTFILE',
 field_delimiter=',',
 external_location='s3://coralogix-athena-results-test/reindex-testing/query_1/'
) AS
SELECT
   substr(timestamp, 1, 23) as timestamp,
   try(json_extract(json_parse(text), '$.log.kind')) AS kind,
   applicationname,
   subsystemname,
   severity, 
   threadid, 
   text
FROM
   "coralogix_example"."coralogix_reindex_testing_1"
WHERE
   applicationname='auditing' AND
   team_id='XXX' AND
   dt = '2021-01-26' AND
   hr >= '03' AND 
   hr < '04' AND
   CAST(json_extract(text, '$.log.kind') AS varchar)='Event' AND
   text like '%Started%'
LIMIT 1000;

The first part of the command indicates that a new table called ‘ctas’ will be created in the database ‘reindex_queries’. The table is stored as text files with ‘,’ as the delimiter and is written to the named bucket location.
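When you no longer need the helper table, you can drop it from the query editor. Note that Athena’s DROP TABLE removes only the table metadata; the result files written to the external_location remain in S3 and must be deleted separately if you don’t want to keep them:

```sql
DROP TABLE IF EXISTS reindex_queries.ctas;
```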

 

Download CSV file with results

Athena puts the results in the S3 bucket location defined in its configuration, but you can also download the CSV file directly from the query editor.

Having the query results in CSV format in an S3 bucket allows you to either send them to Coralogix or open them with another tool (Excel, for example).

How to send CSV data to Coralogix

1. Click the following link: https://serverlessrepo.aws.amazon.com/applications/eu-central-1/597078901540/Coralogix-Athena-S3

2. Click the Deploy button.

3. Fill in the form:

  • S3BucketName – The name of the S3 bucket storing the CSV files.
  • ApplicationName – A mandatory metadata field that is sent with each log and helps to classify it.
  • CoralogixRegion – Possible values are Europe, US, or India. Choose Europe if your Coralogix account URL ends with .com, US if it ends with .us, and India if it ends with .in. This is a Coralogix parameter and does not relate to your AWS region.
  • PrivateKey – Can be found in your Coralogix account under Settings -> Send your logs. It is located in the upper left corner.
  • SubsystemName – A mandatory metadata field that is sent with each log and helps to classify it.
  • S3KeyPrefix and S3KeySuffix – Should be adjusted based on your configuration. S3KeyPrefix is a folder in your bucket (for example, the default Athena directory: Unsaved/) and S3KeySuffix is the file extension (by default .csv).

4. Tick the checkbox next to “I acknowledge that this app creates custom IAM roles and resource policies.” and click the Deploy button. That’s all!

Once a new file is uploaded to the bucket, it will be read by the application and the data will be sent to Coralogix.

 

Pricing

There is a cost associated with Athena usage, on the order of $5 per TB of data scanned. You can read more about Athena pricing here: https://aws.amazon.com/athena/pricing/.
