Querying Your Archive Data Using Spark

The Coralogix archive generates CSV-formatted logs that can be reindexed or queried directly via Coralogix Archive Query.

This document describes how to query the archive data on your own using Spark on EMR and write the output to an S3 bucket.

The process has the following parts:

  •   Creating an EMR cluster with Spark and Jupyter Enterprise Gateway installed
  •   Creating a new Jupyter notebook that uses Spark as its backend
  •   Querying and saving the data using Spark commands:
    • Choosing the relevant S3 files to query
    • Running a set of Spark commands to query the data
    • Writing the filtered results to an S3 folder (logs can then be read back into Coralogix using our S3 reader)

Creating an EMR Cluster

  1. Log in to the AWS console and go to the EMR section.
  2. Choose “Create Cluster” and click “Go to advanced options”.
  3. Page 1:
    1. Release: emr-6.2.0
    2. Make sure that the following packages are checked:
      1. Hadoop 3.2.1
      2. Hive 3.1.2
      3. JupyterEnterpriseGateway 2.1.0
      4. Spark 3.0.1
    3. Check “Use for Hive table metadata”
    4. Check “Use for Spark table metadata”
    5. Choose “Cluster enters waiting state” in the “After last step completes” section
  4. Page 2:
    1. In the networking section:
      1. Choose your VPC
      2. Choose a subnet
    2. In the Cluster Nodes and Instances section:
      1. Master Nodes – Set the instance type and purchasing option if needed.
      2. Core Nodes – Set the number of core nodes in the cluster to 1, and their instance type if needed. These nodes hold HDFS data, which is rarely used by S3-to-S3 jobs, so there is usually no need for more than one node for such an ad hoc job.
      3. Task Nodes – These are the nodes that will perform the actual execution. Choose the desired count, instance type, and purchasing option (on-demand/spot). A running Spark job can recover from a spot termination, although this will impact the execution time.
      4. The default EBS disk size for each of the three instance groups is 64GB. To optimize costs, it can be reduced to 32GB or 16GB unless you plan on running complex queries that require joins, sorting, etc. on a large dataset.
    3. If you are planning to run complex queries (ones that require shuffling data), increase the size of the root volume in the EBS Root Volume section.
  5. Page 3:
    1. Choose a cluster name
  6. Page 4:
    1. Choose an EC2 key pair from your account (used for SSH access to the cluster)
  7. Click the Create Cluster button
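If you prefer to script the cluster creation rather than click through the console, an equivalent setup can be sketched with boto3's EMR client roughly as follows. This is a minimal sketch, not a definitive recipe: the region, cluster name, subnet ID, key-pair name, IAM roles, and instance types are placeholders you should adapt to your own account.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

# A minimal sketch of creating an equivalent cluster via the EMR API.
response = emr.run_job_flow(
    Name='archive-query-cluster',            # placeholder cluster name
    ReleaseLabel='emr-6.2.0',
    Applications=[
        {'Name': 'Hadoop'},
        {'Name': 'Hive'},
        {'Name': 'JupyterEnterpriseGateway'},
        {'Name': 'Spark'},
    ],
    # Use the AWS Glue Data Catalog for Hive and Spark table metadata
    Configurations=[
        {'Classification': 'hive-site',
         'Properties': {'hive.metastore.client.factory.class':
                        'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}},
        {'Classification': 'spark-hive-site',
         'Properties': {'hive.metastore.client.factory.class':
                        'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}},
    ],
    Instances={
        'Ec2SubnetId': 'subnet-xxxxxxxx',     # placeholder subnet
        'Ec2KeyName': 'your-key-pair',        # placeholder key pair
        'KeepJobFlowAliveWhenNoSteps': True,  # "Cluster enters waiting state"
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
            {'Name': 'Core', 'InstanceRole': 'CORE',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
            {'Name': 'Task', 'InstanceRole': 'TASK',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 2},
        ],
    },
    JobFlowRole='EMR_EC2_DefaultRole',        # default EMR EC2 instance profile
    ServiceRole='EMR_DefaultRole',            # default EMR service role
)
print(response['JobFlowId'])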

This will start creating the cluster, which can take several minutes. 

Check the Summary tab and wait for the “master”, “core” and “task” parts in the network and hardware section to change to “Running”. 

The cluster startup process can also be seen from the Hardware tab.
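If you would rather wait programmatically than refresh the console, boto3 also exposes a cluster_running waiter. A small sketch (the region and cluster ID below are placeholders; the ID is the one returned when the cluster was created):

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

# Block until the cluster reaches the RUNNING/WAITING state.
cluster_id = 'j-XXXXXXXXXXXXX'  # placeholder cluster ID
emr.get_waiter('cluster_running').wait(ClusterId=cluster_id)

# Print the current state as a sanity check.
state = emr.describe_cluster(ClusterId=cluster_id)['Cluster']['Status']['State']
print(state)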

Creating a new Jupyter Notebook

  1. Go to the EMR Notebooks screen
  2. Create a new notebook
    1. Choose a name for the notebook
    2. Choose the newly-created cluster using “Choose an existing cluster” and the “Choose” button.
    3. Click the “Create Notebook” button. It will take a minute or so to create it (wait until the “Open JupyterLab” button becomes enabled).
    4. Choose the PySpark kernel for the newly created notebook.
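As a quick sanity check that the notebook is attached to the cluster, you can run a first cell such as the one below. The spark and sc variables are provided by the PySpark kernel, and sc.list_packages() is an EMR notebook helper (the same family as the install_pypi_package call used later in this guide).

# Confirm the Spark session is live and list the Python packages available to the executors.
print(spark.version)
sc.list_packages()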

Querying and Storing the data

Once the notebook is open, you can start using it to run Spark commands. This section provides an example of how to read data from a set of dates/hours, filter it, and write the output to an S3 folder.

This is all done inside the Jupyter Notebook. The following sections explain each part.

Choosing the relevant files

Due to an issue with Spark, filenames including colons (:) are not properly detected when reading from S3. In order to query the data, the exact list of files needs to be prepared in advance.

For this reason, and because hours without data are not written to S3 in the Coralogix archive, we need to access S3 directly and get the relevant folder list.

Enter the following in the notebook (each part in a different cell). This will install boto3, the Python AWS client.

sc.install_pypi_package('boto3')

Then run the following:

import os

import boto3
s3 = boto3.resource('s3')

from pyspark.sql.types import *

# The current Coralogix CSV schema
cx_schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("severity", StringType(), True),
    StructField("text", StringType(), True),
    StructField("applicationName", StringType(), True),
    StructField("subsystemName", StringType(), True),
    StructField("category", StringType(), True),
    StructField("className", StringType(), True),
    StructField("computerName", StringType(), True),
    StructField("methodName", StringType(), True),
    StructField("threadId", StringType(), True),
    StructField("IPAddress", StringType(), True),
])

# A function to get all folders of a bucket + prefix
def get_folders(bucket_name, prefix):
    paginator = s3.meta.client.get_paginator("list_objects_v2")

    files = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # Collect every object key under the given prefix
        files += [x['Key'] for x in page['Contents']]
    # Reduce the keys to their containing "folders" and deduplicate
    folders = ["s3://{}/{}".format(bucket_name, os.path.dirname(f)) for f in files]
    return list(set(folders))

Choose the relevant data to query by providing a bucket name and a prefix:

folders = get_folders('YOUR BUCKET NAME','PREFIX')

Querying the data

Prepare the Spark dataframe that will read the files (only the metadata is read at this point):

df = spark.read.schema(cx_schema) \
    .option("inferSchema","false") \
    .option("header","true") \
    .option("quote","\"") \
    .option("escape","\"") \
    .csv(folders)
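Before filtering, it can be helpful to sanity-check that the folders were picked up as expected, for example:

# Optional sanity checks: confirm the schema and peek at a few rows.
df.printSchema()
df.select("timestamp", "severity", "applicationName").show(5, truncate=False)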

Filter the dataframe to the output you want:

out_df = df.filter("applicationName=='APPLICATION NAME TO FILTER'")
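Filters can also be combined. For example, to keep logs from a specific application and subsystem whose text contains a given substring (the values below are placeholders):

from pyspark.sql import functions as F

# Combine several conditions: application, subsystem, and a substring match on the log text.
out_df = df.filter(
    (F.col("applicationName") == "APPLICATION NAME TO FILTER") &
    (F.col("subsystemName") == "SUBSYSTEM NAME TO FILTER") &
    (F.col("text").contains("error"))
)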

Writing the results

Write the resulting dataframe to the output folder:

out_df.write.option("compression","gzip").option("header","true").option("quote","\"").option("escape","\"").csv("YOUR DESTINATION S3 BUCKET")

A folder will be created at the destination, containing multiple CSV files. The number of files depends on the number of Spark partitions (and hence executors) used to perform the write.
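If you need a smaller, predictable number of output files, one option is to coalesce the dataframe before writing; note that this reduces write parallelism and can slow the job down:

# Merge the output into (at most) 10 partitions before writing, producing fewer, larger files.
out_df.coalesce(10).write.option("compression","gzip").option("header","true").option("quote","\"").option("escape","\"").csv("YOUR DESTINATION S3 BUCKET")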

You can now choose whether to download the results or import them back to Coralogix. 

 
