This guide explains how to query your S3 Coralogix archive bucket (cx-data) using a third-party framework with the standard Apache Parquet reader provided by the relevant framework and required schema.
cx-data is stored in standard hive-like partitions, with the following partition fields:
Loading cx-data files in Pandas can be done using the read_parquet method:
import pandas as pd
# Notice that only the three payload columns are passed eventually to read_parquet()
cx_columns = [
df = pd.read_parquet('s3://.../myfile.parquet',columns = cx_columns)
# The dataframe `df` contains all the data needed for further processing.
Here is the output of df.info() showing the expected schema of the DataFrame:
To use cx-data directly in Athena, you’ll need to create an EXTERNAL table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS "my_table" (
PARTITIONED BY (`team_id` string, `dt` string, `hr` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
After creating the external table, table partitions should be added to limit the scope of queries, reducing costs and improving performance.
Table partitions can be added manually:
ALTER TABLE "my_table" ADD PARTITION (team_id='23333', dt='2022-05-30', hr='01');
# Additional ALTER TABLE ... ADD PARTITION statements according to which dates and hours are needed
Table partitions can also be added automatically:
MSCK REPAIR TABLE "my_table"
Note! This command scans all of the files in the table location, which may be a time-consuming task when the amount of partitions detected is large. It may be easier to add the relevant new partitions manually. One option to expedite automatic partitioning is to limit the scope of the data by specifying a specific team_id and dt in the table definition LOCATION (e.g. modify the CREATE EXTERNAL TABLE command to include the team_id=<team-id>/dt=<date>/ subpath). This restricts the partition auto-detection to the relevant dates.
After adding the relevant partitions, a query can be executed on the table. Accessing a specific field inside one of the payload fields should be done using the Athena function json_extract_scalar, as in the following example:
Our world-class customer success team is available 24/7 to walk you through your setup and answer any questions that may come up. Feel free to reach out to us via our in-app chat or by sending us an email at [email protected].