# `extract`

## Description

The `extract` keyword parses a raw string and stores the values embedded in it as a structured object or array under a target keypath. It supports several extraction strategies (regular expressions, key-value pairs, escaped JSON, and delimited lists), so unstructured fields can be queried like any other structured data.

## Syntax

```dataprime
(e|extract) <expression> into <keypath> using <extraction-type>(<extraction-params>) [datatypes keypath:datatype,keypath:datatype,...]
```

## Extractor functions

Extractor functions define how `extract` parses the raw string. Each one handles a specific format: regular expressions (`regexp`, `multi_regexp`), key-value pairs (`kv`), escaped JSON (`jsonobject`), or delimited lists (`split`). You select the extractor in the `using` clause, and its result is written to the keypath named after `into`.

### 1. `regexp`

Parses data using **named capture groups** in a regular expression.

#### Input

```json
{
  "message": "user Chris has logged in"
}
```

#### Query

```dataprime
extract message into user_data using regexp(e=/user (?<user>.*) has logged in/)
```

#### Output

```json
{
  "message": "user Chris has logged in",
  "user_data": {
    "user": "Chris"
  }
}
```
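
To make the semantics concrete, the example above can be approximated in Python. This is only a sketch of the behavior shown, not how DataPrime implements `regexp`; the helper name `extract_regexp` is hypothetical, and leaving the event unchanged when nothing matches is an assumption.

```python
import re

def extract_regexp(event, source_key, target_key, pattern):
    """Hypothetical helper: store named capture groups as a nested object."""
    result = dict(event)
    match = re.search(pattern, event[source_key])
    if match:  # assumption: no match leaves the event unchanged
        result[target_key] = match.groupdict()
    return result

event = {"message": "user Chris has logged in"}
# Python writes named groups as (?P<name>...); the DataPrime query uses (?<name>...).
regexp_out = extract_regexp(
    event, "message", "user_data", r"user (?P<user>.*) has logged in"
)
print(regexp_out)
```

Each named group becomes one key in the target object, which is why unnamed groups would contribute nothing in this sketch.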

### 2. `multi_regexp`

Extracts **all matches** of a pattern into an **array**.

#### Input

```json
{
  "log": "user 1 did 2 things on 3 pages"
}
```

#### Query

```dataprime
extract log into numbers using multi_regexp(e=/\d+/)
```

#### Output

```json
{
  "log": "user 1 did 2 things on 3 pages",
  "numbers": ["1", "2", "3"]
}
```
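
A rough Python equivalent of the example above uses `re.findall`, which likewise returns every non-overlapping match as a list of strings. This is an illustrative sketch, not DataPrime's implementation; the variable names are made up for the example.

```python
import re

event = {"log": "user 1 did 2 things on 3 pages"}

# Collect every match of the pattern, in order of appearance.
multi_out = dict(event)
multi_out["numbers"] = re.findall(r"\d+", event["log"])
print(multi_out)
```

As in the documented output, the matches stay strings; nothing in this sketch converts them to numbers.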

### 3. `kv`

Parses a string of key-value pairs into an object.

#### Input

```json
{
  "query_string": "a=b&b=c&c=d"
}
```

#### Query

```dataprime
extract query_string into query_params using kv(pair_delimiter='&', key_delimiter='=')
```

#### Output

```json
{
  "query_string": "a=b&b=c&c=d",
  "query_params": {
    "a": "b",
    "b": "c",
    "c": "d"
  }
}
```
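
The same parsing can be sketched in Python to show what the two delimiters do. The function `extract_kv` is a hypothetical helper, and skipping fragments that lack a key delimiter is an assumption of this sketch rather than documented DataPrime behavior.

```python
def extract_kv(text, pair_delimiter, key_delimiter):
    """Hypothetical helper: split text into pairs, then each pair into key and value."""
    result = {}
    for pair in text.split(pair_delimiter):
        if key_delimiter not in pair:
            continue  # assumption: ignore fragments without a key delimiter
        key, value = pair.split(key_delimiter, 1)
        result[key] = value
    return result

event = {"query_string": "a=b&b=c&c=d"}
kv_out = dict(event)
kv_out["query_params"] = extract_kv(event["query_string"], "&", "=")
print(kv_out)
```

Splitting each pair only once keeps any later key delimiter inside the value, so `a=b=c` would yield the value `b=c` here.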

### 4. `jsonobject`

Unescapes and parses a stringified JSON object.

#### Input

```json
{
  "nested_json": "{\"key\": \"value\"}"
}
```

#### Query

```dataprime
extract nested_json into parsed_json using jsonobject()
```

#### Output

```json
{
  "nested_json": "{\"key\": \"value\"}",
  "parsed_json": {
    "key": "value"
  }
}
```
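
In Python terms, the example above behaves much like calling `json.loads` on the string field. The snippet below is a sketch of that idea only; it does not describe how DataPrime handles invalid JSON, and the variable names are illustrative.

```python
import json

# The field holds JSON text as a string, with the inner quotes escaped.
event = {"nested_json": "{\"key\": \"value\"}"}

json_out = dict(event)
json_out["parsed_json"] = json.loads(event["nested_json"])
print(json_out)
```

The original string field is kept alongside the parsed object, matching the documented output.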

### 5. `split`

Splits a string by a delimiter into an array of primitive values.

#### Input

```json
{
  "csv_codes": "10,20,30"
}
```

#### Query

```dataprime
extract csv_codes into codes using split(delimiter=',', element_datatype=number)
```

#### Output

```json
{
  "csv_codes": "10,20,30",
  "codes": [10, 20, 30]
}
```
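
The effect of `delimiter` together with `element_datatype=number` can be approximated in Python as a split followed by a per-element cast. The helper `to_number` is hypothetical, and raising an error on a non-numeric element is a choice made for this sketch; the source does not say what DataPrime does in that case.

```python
def to_number(value):
    """Hypothetical helper: parse an integer if possible, otherwise a float."""
    try:
        return int(value)
    except ValueError:
        return float(value)  # raises ValueError for non-numeric text

event = {"csv_codes": "10,20,30"}

split_out = dict(event)
split_out["codes"] = [to_number(item) for item in event["csv_codes"].split(",")]
print(split_out)
```

Without the cast step, the result would be the string array `["10", "20", "30"]`.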

## Using `datatypes` to annotate extracted fields

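Use the `datatypes` clause to annotate specific extracted fields with an explicit type. Annotated values are stored with that type instead of as strings, which enables numerical comparisons and aggregations. The example below calls `kv()` with no arguments; as its output shows, the pairs are then split on spaces and `=`.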

#### Input

```json
{
  "msg": "query_type=fetch query_id=100 query_results_duration_ms=232"
}
```

#### Query

```dataprime
extract msg into query_data using kv() datatypes query_results_duration_ms:number
```

#### Output

```json
{
  "msg": "query_type=fetch query_id=100 query_results_duration_ms=232",
  "query_data": {
    "query_type": "fetch",
    "query_id": "100",
    "query_results_duration_ms": 232
  }
}
```

> `query_results_duration_ms` is now a number, while `query_id` remains a string.

> **Note:** In `datatypes`, specify only the extracted field name (`query_results_duration_ms`), not the full keypath (`query_data.query_results_duration_ms`).
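
The selective typing shown in this section can be sketched in Python as a key-value parse followed by a cast applied only to the annotated fields. All names here (`to_number`, `apply_datatypes`, `CASTS`) are hypothetical, and splitting on spaces and `=` is inferred from the example output rather than from a documented default.

```python
def to_number(value):
    """Hypothetical helper: parse an integer if possible, otherwise a float."""
    try:
        return int(value)
    except ValueError:
        return float(value)

# Only the `number` type from the example is modeled in this sketch.
CASTS = {"number": to_number}

def apply_datatypes(extracted, datatypes):
    """Cast annotated fields; leave every other field as a string."""
    return {
        key: CASTS[datatypes[key]](value) if key in datatypes else value
        for key, value in extracted.items()
    }

msg = "query_type=fetch query_id=100 query_results_duration_ms=232"
pairs = dict(pair.split("=", 1) for pair in msg.split(" "))
query_data = apply_datatypes(pairs, {"query_results_duration_ms": "number"})
print(query_data)
```

The annotation keys are plain field names, which mirrors the rule that `datatypes` takes the extracted field name rather than the full keypath.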

## Summary

The `extract` keyword, combined with an extractor function, turns raw strings into structured data. Whether you are parsing JSON blobs, splitting CSV-like fields, or matching regex patterns, you describe the extraction declaratively in the query and get objects and arrays that are ready for filtering, analysis, and visualization.
