Coralogix can be configured to automatically and dynamically archive logs to an S3 bucket. This saves Coralogix customers money, but there are times when the data needs to be reindexed, and that operation counts the reindexed logs against the daily quota. Often, customers would like to narrow down the exact logs to be reindexed, or even query the logs outside of Coralogix altogether.
Since the logs reside on an S3 bucket owned by the customer, there are many ways to do this with any tool or method that can access S3.
This post will show how to use AWS Athena to query these logs.
Note: In order to perform the operations described in this post, you’ll need to have access to an AWS console with the correct permissions.
First, you’ll need the name of the bucket used for Coralogix archiving. You can find it by going to Data Flow → Setup Archive.
There are two types of archives, CSV and Parquet:
One of them or both can be active.
Access the S3 bucket through the AWS console. It has a parent folder structure of ‘cx’ → ‘csv’/’parquet’ → ‘v1’ (for Parquet) or ‘v2’ (for CSV) → team_id → dt → hr. Under each hour folder, the log files are divided into 10-minute chunks. All hours are UTC. Each log file is in CSV or Parquet format and includes log lines with the following fields:
src_obj__event_labels – JSON object with metadata fields. Example entry:
{
"applicationname":"my_application",
"subsystemname":"my_subsystem",
"category":"my_category",
"classname":"",
"computername":"my_hostname",
"methodname":"",
"threadid":"2080",
"ipaddress":""
}
src_obj__user_data – Log text
src_obj__event_metadata – JSON object with metadata fields. Example entry:
{
"timestamp":"2022-11-13T07:58:18.211763",
"timestampMicros":1668326298211763,
"severity":"Debug",
"logid":"43674027-6261-48aa-a79c-59f5d419e6c8",
"priorityclass":"medium",
"processingOutputTimestampNanos":1668326300213000000,
"processingOutputTimestampMicros":1668326300213000
}
team_id – Company ID
dt – date in the format: YYYY-MM-DD
hr – hour
Athena is an AWS service that makes it easy to query the log files using standard SQL syntax. Using Athena commands, you will create a database and a table and then query the logs.
Open Athena service in your AWS account. You will see the following screen:
As you can see at the top, you first need to set up a query result location in Amazon S3. Click the blue link and enter your S3 bucket name.
Once the configuration is completed, run the following commands in the Athena query editor.
This command creates a database:
CREATE DATABASE <db_name>;
Example: CREATE DATABASE reindex_queries;
More information about the CREATE DATABASE command – link.
This command creates a table called <table_name> within the database <db_name>. The table’s entries are mapped to the Coralogix logs in the archive, so it is important not to change the order of the fields in the first section of the command. ‘text’ is the column that contains the log as it appears in the logs screen.
CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS <some-db-name>.<some-table-name> (
`timestamp` string,
`severity` string,
`text` string,
`applicationname` string,
`subsystemname` string,
`category` string,
`classname` string,
`computername` string,
`methodname` string,
`threadid` string,
`ipaddress` string)
PARTITIONED BY (
`team_id` string,
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://your-bucket-name/cx/csv/v2'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1'
)
Parquet:
CREATE EXTERNAL TABLE IF NOT EXISTS <some-db-name>.<some-table-name> (
`event_labels__methodname` string,
`event_labels__classname` string,
`event_labels__category` string,
`src_obj__event_labels` string,
`event_labels__applicationname` string,
`event_metadata__logid` string,
`src_obj__user_data` string,
`event_labels__subsystemname` string,
`event_metadata__timestamp` bigint,
`event_labels__ipaddress` string,
`event_metadata__severity` string,
`event_labels__threadid` string,
`event_metadata__priorityclass` string,
`event_labels__computername` string,
`src_obj__event_metadata` string
)
PARTITIONED BY (
`team_id` string,
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://your-bucket-name/cx/parquet/v1'
Following a successful run of the ‘CREATE TABLE’ command, a new table will appear on the left side. Clicking it will show its entries.
More information about the CREATE TABLE command – link.
When you need to query new data (from a time period you haven’t queried before), the table’s partitions must be updated with the new dates and hours.
To do this, click “Load partitions”.
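Alternatively, the partition metadata can be refreshed from the query editor itself; a minimal sketch, assuming hypothetical database and table names in the style of the earlier examples:

```sql
-- Scan the table's S3 location and register any new team_id/dt/hr partitions
MSCK REPAIR TABLE reindex_queries.coralogix_table;

-- Verify which partitions are now visible to Athena
SHOW PARTITIONS reindex_queries.coralogix_table;
```

New partitions are only visible to queries after one of these steps (“Load partitions” in the console or the repair command) has been run.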
Using the ‘SELECT’ statement, you can query the logs.
CSV examples:
Example 1:
SELECT
substr(timestamp, 1, 23) as timestamp,
applicationname,
subsystemname,
severity,
threadid,
try(json_extract(text, '$.kubernetes.host')) as kubernetes_host,
text
FROM
"<db_name>"."<table_name>"
WHERE
severity='Info' AND
applicationname='prod' AND
subsystemname!='abcde' AND
team_id='XXX' AND
dt = '2021-01-26' AND
hr >= '03' AND
hr < '04' AND
json_extract(text, '$.kubernetes.host') IS NOT NULL
LIMIT 100;
The above command will bring the first 100 logs that fit the query. It retrieves the first 23 characters of the timestamp field, along with the other indicated fields. In addition, it extracts the kubernetes.host field from the text field (the log as it appears in Coralogix’s logs screen) into a result column called kubernetes_host. The query parameters are under the WHERE section; ‘IS NOT NULL’ ensures that the returned logs have a value for that field.
Note that if the database is chosen on the left side then specifying the database is not needed.
Example 2:
In this query we are searching for free text “throttler” in the log:
SELECT
substr(timestamp, 1, 23) as timestamp,
applicationname,
subsystemname,
severity,
text
FROM
"coralogix_table"
WHERE
applicationname='ip_log' AND
team_id='12345' AND
dt = '2021-03-17' AND
hr >= '21' AND
hr < '23' AND
text LIKE '%throttler%'
LIMIT 100
The query above returns the first 100 records with a timestamp between 2021-03-17 21:00 and 23:00, application name ip_log, any subsystem name, any severity, and text containing the word “throttler”. The team_id 12345 is used only as a filter and is not displayed in the results.
Example 3:
SELECT
try(json_extract(text, '$.clientIp')) as clientIp,
try(json_extract(text, '$.userId')) as userId,
COUNT(*) as count
FROM
"<db_name>"."<table_name>"
WHERE
team_id='xxxxx' AND
dt = '2021-03-22' AND
hr >= '14' AND
hr < '17' AND
json_extract(text, '$.clientIp') IS NOT NULL AND
json_extract(text, '$.userId') IS NOT NULL AND
text LIKE '%503 service unavailable%' AND
date_parse(substr(timestamp, 1, 19),'%Y-%m-%dT%H:%i:%s') BETWEEN TIMESTAMP '2021-03-22 14:30:00' AND TIMESTAMP '2021-03-22 16:10:00'
GROUP BY text
ORDER BY count DESC;
The query above returns clientIp and userId fields and a count of their combinations:
This is done by COUNT(*) as count in the SELECT section and by GROUP BY text. The results are ordered by the count field; ASC and DESC determine whether results are sorted in ascending or descending order. The hr field only allows filtering at a one-hour granularity, which was not precise enough here, so the timestamp field was added to the WHERE section. The substr command takes the first 19 characters of the timestamp, and date_parse converts them, using the format string '%Y-%m-%dT%H:%i:%s' (note %H for 24-hour time), into a timestamp value that Athena can compare.
Example 4:
SELECT
try(json_extract(text,'$.message')) AS message,
COUNT(*) as count
FROM
"<db_name>"."<table_name>"
WHERE
applicationname='prod'
AND team_id='xxxxx'
AND dt >= '2021-04-01'
AND dt <= '2021-05-28'
AND hr >= '00'
AND hr < '24'
AND CAST(json_extract(text,'$.message') AS varchar) LIKE '%reset%'
AND CAST(json_extract(text,'$.message') AS varchar) LIKE '%machine%'
GROUP BY text
ORDER BY count DESC;
This query is similar to the one above but shows additional ways of filtering. It limits the results by a date range and by text (“reset” and “machine”) that is part of the message field. It returns the message field and a count of its occurrences.
PARQUET examples:
Example 1:
SELECT * FROM "<db_name>"."<table_name>" LIMIT 10;
The above command returns the first 10 logs. You can use this query to test that you can reach the archive, and also to see the fields available for more detailed queries.
Example 2:
SELECT
try(json_extract(src_obj__event_labels, '$.applicationname')) as applicationname,
try(json_extract(src_obj__event_labels, '$.subsystemname')) as subsystemname,
src_obj__user_data as log,
try(json_extract(src_obj__event_metadata, '$.timestamp')) as time
FROM "<db_name>"."<table_name>"
WHERE
(JSON_EXTRACT_SCALAR(src_obj__event_labels, '$.applicationname'))='prod'
AND (JSON_EXTRACT_SCALAR(src_obj__event_labels, '$.subsystemname'))='nginx'
AND dt >= '2022-11-01'
AND dt <= '2022-11-05'
AND (JSON_EXTRACT_SCALAR(src_obj__event_metadata, '$.severity'))='Debug'
AND (JSON_EXTRACT_SCALAR(src_obj__user_data, '$.message')) LIKE '%Connected%';
The query above returns four columns: applicationname, subsystemname, log, and time. It selects debug logs whose message field contains the word “Connected”, where the application name is prod and the subsystem name is nginx, created between 2022-11-01 and 2022-11-05.
This command is known as CTAS (CREATE TABLE AS SELECT), a variation on the ‘CREATE TABLE’ command mentioned above. It queries data and saves the result in a specific location (it also creates a table, which needs to be deleted afterwards):
CREATE TABLE reindex_queries.ctas
WITH (
format='TEXTFILE',
field_delimiter=',',
external_location='s3://coralogix-athena-results-test/reindex-testing/query_1/'
) AS
SELECT
substr(timestamp, 1, 23) as timestamp,
try(json_extract(json_parse(text), '$.log.kind')) AS kind,
applicationname,
subsystemname,
severity,
threadid,
text
FROM
"coralogix_example"."coralogix_reindex_testing_1"
WHERE
applicationname='auditing' AND
team_id='XXX' AND
dt = '2021-01-26' AND
hr >= '03' AND
hr < '04' AND
CAST(json_extract(text, '$.log.kind') AS varchar)='Event' AND
text like '%Started%'
LIMIT 1000;
The first part of the command indicates that a new table called ‘ctas’ will be created in the database ‘reindex_queries’. The table will be stored as text files, with ‘,’ as the delimiter, and written to the named bucket location.
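Once the exported files are no longer needed, the helper table created by the CTAS command can be removed. Note that in Athena, DROP TABLE only deletes the table metadata; the result files remain in S3. A minimal sketch, using the table name from the example above:

```sql
-- Removes the 'ctas' table definition; the CSV files written to the
-- external_location stay in the S3 bucket until deleted separately.
DROP TABLE reindex_queries.ctas;
```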
Athena will put the result in the S3 bucket location defined in the command, and you can also download the CSV file:
Having the query results in CSV format in an S3 bucket allows you to either send them to Coralogix or access them with another tool (Excel, for example).
1. Click the following link: https://serverlessrepo.aws.amazon.com/applications/eu-central-1/597078901540/Coralogix-Athena-S3
2. Click the Deploy button.
3. Fill in the form:
4. Tick the checkbox next to “I acknowledge that this app creates custom IAM roles and resource policies.” and click the Deploy button. That’s all!
Once a new file is uploaded to the bucket, it will be read by the application and the data will be sent to Coralogix.
There is a cost associated with Athena usage, on the order of $5 per TB of data scanned. You can see more about Athena pricing here: https://aws.amazon.com/athena/pricing/.
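Since Athena bills by the amount of data scanned, the partition columns (team_id, dt, hr) are the main lever for keeping costs down: restricting them limits which S3 prefixes Athena reads. A minimal sketch, using hypothetical database, table, and filter values:

```sql
-- Only the files under the matching team_id/dt/hr prefixes are scanned,
-- not the whole archive.
SELECT COUNT(*) AS log_count
FROM "reindex_queries"."coralogix_table"
WHERE team_id = '12345'
  AND dt = '2022-11-13'
  AND hr = '07';
```

Queries that omit the partition predicates will scan every file under the table’s LOCATION, which can be orders of magnitude more expensive.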