Our next-gen architecture is built to help you make sense of your ever-growing data. Watch a 4-min demo video!

Mastering Null Semantics: Translating SQL Expressions to OpenSearch DSL

  • Markus Appel, Software Engineer
  • September 11, 2024
Share article

Working at Coralogix, a leading full-stack observability platform, I recently faced an interesting challenge.

The team I am part of is building the DataPrime query language and query engine, used to easily query logs and other observability data on the platform, usually in the form of Parquet files on AWS S3. Inside the engine, our DataPrime queries are transformed into query plans with SQL-like expressions, for example in filters. For backwards compatibility, the query engine needs to also be able to query OpenSearch instead of Parquet files. As such, we were met with the need to translate these expressions into OpenSearch query DSL.

While OpenSearch does have some support for SQL, experimenting with it quickly revealed that it’s not particularly thorough. So, in this article I want to explain one specific aspect about the translation I was facing that was particularly interesting: null semantics.

The Basics

In SQL, a (valid) filter expression, e.g. in a WHERE clause, can have three results: TRUE, FALSE or NULL. Rows where the result is NULL will, of course, also be filtered out, just like FALSE, but NULL will behave very differently from both TRUE and FALSE within the expression.

Example:

CREATE TABLE test (
  first_name varchar(255),
  middle_name varchar(255),
  last_name varchar(255)
 );

INSERT INTO test (first_name, last_name)
VALUES ('Markus', 'Appel');


SELECT * FROM test WHERE first_name = 'Garry';
// 0 Result


SELECT * FROM test WHERE !(first_name = 'Garry');
// 1 Results


SELECT * FROM test WHERE middle_name = 'Garry';
// 0 Results


SELECT * FROM test WHERE !(midde_name = 'Garry');
// 0 Results !!!

The fourth SELECT will not return any results because for the one row, middle_name is NULL, and the = operator will return NULL if either side is NULL. And !NULL is also NULL.

This is called Three-valued logic, specifically the Kleene and Priest logics. The idea is quite simple: If you treat NULL is undetermined, you will always return NULL if the result cannot be determined either. 

So for example, FALSE && NULL is FALSE because it doesn’t matter what NULL is substituted for, the result will always be FALSE.

In contrast, FALSE || NULL will be NULL, because when substituting FALSE the result will be FALSE, when substituting TRUE the result will be TRUE.

Meanwhile, the OpenSearch DSL has no concept of expressions at all. It only knows filters, in other words operations that take a set of documents and return those documents that match.

Theory

At first glance, reducing TRUE, FALSE, and NULL to only TRUE and FALSE might sound like an impossible task.

However, the solution lies within the context in which an expression is used. Previously, I mentioned that a WHERE will interpret an expression returning NULL the same way as FALSE—by filtering out the row. So WHERE is actually also binary, it will either filter or not filter the row as well!

Formally, using pseudocode:

filter(expr) = expr == true
             = !(expr == false || expr == null)
             = !is_false_or_null(expr)

Now we’re getting somewhere. We just have to determine is_false_or_null for all expressions we want to support. Let’s start with the basics:

is_false_or_null(null)  = true
                        = { "match_all": {}}
is_false_or_null(true)  = false
                        = { "match_none": {}}
is_false_or_null(false) = true
                        = { "match_all": {}}
is_false_or_null(field) = !matches(field, true) || !exists(field)
                        = {
                           "bool": {
  	                     "should": [
                               {
        	                 "bool": {
           	                   "must_not": {
              	                     "matches": {
                 		       "field": "true"
              	                    }
           	                  }
        	                }
                              },
     	                      {
        	                "bool": {
           	                  "must_not": {
              	                   "exists": "field"
           	                 }
        	               }
     	                     }
                           ]
                         }
                       }

Due to how verbose they are, while being relatively easy to come up with, I will stop writing out the OpenSearch DSL from here on out.1

Next, we should take a look at the boolean operators: OR, AND, and NOT.

My approach there was to derive the logic from the truth tables on the Wikipedia article.

So, for example, take  is_false_or_null(expr1 && expr2). I’ve highlighted all results for which it will be true:


Coincidentally, that just corresponds to the union of A being FALSE or NULL and B being FALSE or NULL.


So, the result is:

is_false_or_null(expr1 && expr2) = is_false_or_null(expr1) || is_false_or_null(expr2)

In case of OR, the truth table looks a bit different:


Here, it’s not the union, but the intersection of A being FALSE or NULL and B being FALSE or NULL. So:

is_false_or_null(expr1 || expr2) = is_false_or_null(expr1) && is_false_or_null(expr2)

And last, we need to look at NOT.

Here, it’s obvious that NOT(A) is FALSE or NULL when A is TRUE or NULL:

is_false_or_null(!expr)= is_true_or_null(expr)

So now we need is_true_or_null too, but it’s very similar to implement:

is_true_or_null(null) = true
is_true_or_null(true) = true
is_true_or_null(false) = false
is_true_or_null(field) = matches(field, true) || !exists(field)

Even AND, OR and NOT are easy. Looking at the truth tables it’s clear that this time AND is the logical union, and OR the logical intersection.

is_true_or_null(expr1 && expr2) = is_true_or_null(expr1) && is_true_or_null(expr2)
is_true_or_null(expr1 || expr2) = is_true_or_null(expr1) || is_true_or_null(expr2)
is_true_or_null(!expr)          = is_false_or_null(expr)

Conclusion

With the above theory, the hardest part of the puzzle is solved.

The rest now consists of mere legwork of translating all concepts of the source language one wants to support. Of course, the capabilities of the ElasticSearch DSL will be exhausted rather quickly, as even simple numeric arithmetic doesn’t have any equivalent in it, meaning that rather quickly one will have to “fall back” to translating the expressions to script queries, which will have much worse performance but can cover virtually any use case. Furthermore, to avoid translating everything twice with just slight differences for is_true_or_null and is_false_or_null, it of course is best to have one is_value_or_null(value: bool, …) instead.

To avoid queries getting too large, and especially too deeply nested, we also chose an intermediate representation of the ElasticSearch DSL before turning it into JSON. 

This gives us the ability to comfortably perform a list of optimizations beforehand:

  • Apply De Morgan’s laws
  • Flatten nested bool clauses through associativity
  • Invert negated match_all and match_none
  • Remove match_all from inside of bool must and match_none from inside of bool with should
  • In bool, replace must with match_none with match_none
  • In bool, replace should with match_all with match_all
  • Combine script clauses inside of bool with must and should appropriately

1Note: be careful to set minimum_should_match to 1 when translating OR clauses to bool with should.

Observability and Security
that Scale with You.