extract - Parse strings into objects
The extract keyword takes values out of a string and turns them into an object. It can use regular expressions with named capture groups, among other extraction types, to turn embedded strings into indexed, queryable data.
Syntax
(e|extract) <expression> into <keypath> using <extraction-type>(<extraction-params>) [datatypes keypath:datatype,keypath:datatype,...]
Using regular expressions
Regular expressions are defined in DataPrime using the regexp extraction type. It accepts one argument, e (short for expression), which contains the named capture groups needed to parse the string.
For example, a regular expression with a named capture group can parse the username out of the string "user Chris has logged in".
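To illustrate how named capture groups become object keys, here is a minimal Python sketch. It is not DataPrime syntax. The group name user, the helper extract_with_regexp, and the exact pattern are assumptions made for this illustration. Python writes named groups as (?P<name>...), and other regex engines may spell them differently.

```python
import re

# Hypothetical pattern: the group name "user" is an assumption for illustration.
LOGIN_PATTERN = re.compile(r"user (?P<user>\S+) has logged in")


def extract_with_regexp(text, pattern=LOGIN_PATTERN):
    """Return the named capture groups as a dict, or {} when nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else {}


extracted = extract_with_regexp("user Chris has logged in")
print(extracted)  # {'user': 'Chris'}
```

Each named group becomes one key in the resulting object, which is why the group names matter: they are the field names you will later query.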
Example - Extracting a username from a login message, using Regex
Consider the following document:
We want to extract the username from this string, so that we can search and query it directly. This can be done using the extract keyword with a regular expression.
This results in this log object:
Using Key-Value parsing
Some strings follow a consistent key-value format and simply need to be converted into a JSON structure. This is awkward to do with a regular expression, which is why extract also offers the kv parsing option.
The kv function takes two arguments:
pair_delimiter - The delimiter to expect between pairs. Default is a space.
key_delimiter - The delimiter to expect between a key and its value. Default is =.
Example - Converting query string parameters into a JSON object using Key-Value parsing
Consider the following document:
{
"domain":"https://www.coralogix.com",
"query_string":"a=b&b=c&c=d",
"path": "/home"
}
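We want to extract each query string parameter into its own object field, which will give us the ability to query on individual query string fields. We can do this using kv. Because the pairs in a query string are separated by & rather than the default space, pair_delimiter must be set to &.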
The resulting document looks like this:
{
"domain":"https://www.coralogix.com",
"query_string":"a=b&b=c&c=d",
"path": "/home",
"query_string_parameters":{
"a":"b",
"b":"c",
"c":"d"
}
}
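The key-value parsing shown above can be sketched in Python as follows. This is an illustration of the semantics under stated assumptions, not the engine's implementation. The helper name parse_kv is hypothetical, and skipping fragments that lack a key delimiter is an assumption, since that behavior is not described here.

```python
def parse_kv(text, pair_delimiter=" ", key_delimiter="="):
    """Split text into pairs, then split each pair into a key and a value."""
    result = {}
    for pair in text.split(pair_delimiter):
        key, sep, value = pair.partition(key_delimiter)
        if sep:  # assumption: fragments without a key delimiter are skipped
            result[key] = value
    return result


document = {
    "domain": "https://www.coralogix.com",
    "query_string": "a=b&b=c&c=d",
    "path": "/home",
}

# The query string uses "&" between pairs, so the default space is overridden.
document["query_string_parameters"] = parse_kv(
    document["query_string"], pair_delimiter="&"
)
print(document["query_string_parameters"])  # {'a': 'b', 'b': 'c', 'c': 'd'}
```

Note that every value comes out as a string. Nothing in the key-value format itself says whether a value is a number.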
Parse escaped JSON object
JSON objects that have been converted to strings are commonly escaped, so that they can exist as string fields inside a larger JSON object. The jsonobject function will unescape the stringified JSON and convert it into a fully parsed object.
The jsonobject function takes one argument:
max_unescape_count - Maximum number of escaping levels to unescape before parsing the JSON. Default is 1. When set to 1 or more, the engine will detect whether the value contains an escaped JSON string and unescape it until it is parsable or the maximum unescape count is exceeded.
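The role of max_unescape_count can be illustrated with a short Python sketch. It assumes that one escaping level means the JSON text was encoded as a JSON string one extra time. The engine's actual detection logic is not described here, so treat the helper parse_json_object as a hypothetical model and not as the real behavior.

```python
import json


def parse_json_object(value, max_unescape_count=1):
    """Parse value, unescaping at most max_unescape_count extra string layers."""
    parsed = json.loads(value)
    unescapes = 0
    # While the result is still a string, it holds another escaped layer.
    while isinstance(parsed, str) and unescapes < max_unescape_count:
        parsed = json.loads(parsed)
        unescapes += 1
    if not isinstance(parsed, dict):
        raise ValueError("value did not resolve to a JSON object")
    return parsed


raw = json.dumps({"user": "Chris"})  # plain JSON text
once = json.dumps(raw)               # escaped one level
twice = json.dumps(once)             # escaped two levels

print(parse_json_object(once))                         # {'user': 'Chris'}
print(parse_json_object(twice, max_unescape_count=2))  # {'user': 'Chris'}
```

In this model, a value escaped more deeply than max_unescape_count allows never resolves to an object, so the sketch raises a ValueError for it.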
Example - Parsing an escaped JSON object back into an object
Consider the following document:
Using the jsonobject function in conjunction with extract allows us to convert this into a queryable object.
The resulting document looks like this:
Providing type information
It is possible to provide datatype information as part of the extraction by using the datatypes clause. For example, adding datatypes my_field:number to an extraction would cause the extracted my_field keypath to be a number instead of a string.
For example, the following query will inform DataPrime that my_field should be interpreted as type number.
NOTE: You don't need to specify the parent object. Although the extraction goes into the field data, the keypath in the datatypes clause is only the name of the extracted field.
Extracted data always goes into a new keypath as an object, allowing further processing of the new keys inside that new object. For example:
Example - Ensuring duration is parsed as a number for further processing
Consider the following documents:
{
"msg": "query_type=fetch query_id=100 query_results_duration_ms=232"
},
{
"msg": "query_type=fetch query_id=200 query_results_duration_ms=1001"
}
We want to extract the results duration field as a number, so that we can perform a numerical filter. We can do this using the datatypes clause:
This will result in the following documents:
{
"msg": "query_type=fetch query_id=100 query_results_duration_ms=232",
"query_type": "fetch",
"query_id": "100", // Note the string type!
"query_results_duration_ms": 232
},
{
"msg": "query_type=fetch query_id=200 query_results_duration_ms=1001",
"query_type": "fetch",
"query_id": "200", // Note the string type!
"query_results_duration_ms": 1001
}
Because we provided type information, query_results_duration_ms can now be used in numerical filters, aggregations and more.
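To show why the type information matters, here is a Python sketch of key-value extraction with a per-key datatype, followed by a numerical filter. The helper names extract_kv_typed and to_number are hypothetical, and only the number type mentioned above is modeled. Keys without a declared type stay strings.

```python
def to_number(value):
    """Convert to int when possible, otherwise to float."""
    try:
        return int(value)
    except ValueError:
        return float(value)


CASTERS = {"number": to_number}  # only "number" is modeled in this sketch


def extract_kv_typed(text, datatypes=None, pair_delimiter=" ", key_delimiter="="):
    """Parse key-value pairs and cast the keys listed in datatypes."""
    datatypes = datatypes or {}
    extracted = {}
    for pair in text.split(pair_delimiter):
        key, sep, value = pair.partition(key_delimiter)
        if not sep:
            continue
        caster = CASTERS.get(datatypes.get(key), str)  # default: keep as string
        extracted[key] = caster(value)
    return extracted


messages = [
    "query_type=fetch query_id=100 query_results_duration_ms=232",
    "query_type=fetch query_id=200 query_results_duration_ms=1001",
]
rows = [
    extract_kv_typed(m, datatypes={"query_results_duration_ms": "number"})
    for m in messages
]

# A numerical comparison only works because the field is now a number.
slow = [r for r in rows if r["query_results_duration_ms"] > 1000]
print([r["query_id"] for r in slow])  # ['200']
```

Without the cast, the comparison would be between a string and an integer, which Python rejects with a TypeError. This is the same problem the datatypes clause avoids.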