Transforming a search query into an EF expression with Lucene

Josue Leon · March 9, 2022


Our task

Let’s say we’re a food delivery app running a special promotion for two groups: anyone whose first name is Bob, or anyone whose favorite food is steak. Ultimately, we want our API to return results from the following Lucene query:

"firstName:Bob OR favoriteFood:Steak"

To do so, we will implement an HTTP endpoint that searches a repository of people’s personal information and preferences. The endpoint accepts a query expression in Lucene’s query syntax, as seen above, transforms it into an Entity Framework expression, and uses that expression to filter a table of persons. We will use ASP.NET Core 6 to create a simple REST API containing this example endpoint. To keep it simple and fast, we will use an in-memory database (named Society) with a single table (Persons), which has the following structure:

(WARNING: The following example uses plain text sensitive data for illustrative purposes only. Sensitive data should be encrypted at rest and in transit. If you are building a real production application, consider using Basis Theory for storing this data more securely.)
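Since the later snippets read FirstName, LastName, Ssn, and FavoriteFood from each person, the entity behind the Persons table can be sketched roughly as follows. This is an assumption inferred from those snippets, not the repository’s actual class, and the Id key in particular is hypothetical.

```csharp
using System;

// Hypothetical entity for the Persons table, inferred from the properties
// that the later snippets filter on. The Id key is an assumption.
public class Person
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public string FirstName { get; set; } = string.Empty;
    public string LastName { get; set; } = string.Empty;
    public string Ssn { get; set; } = string.Empty; // plain text for illustration only
    public string FavoriteFood { get; set; } = string.Empty;
}
```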

We won’t go over all the details of setting up the web application and exposing the endpoint so that we can focus on the specific parts pertaining to parsing the query, but check the GitHub repository for the full example.

How to transform the query

Step 1. Configuring Lucene

To parse a query string with Lucene, we need to instantiate a QueryParser, which in turn uses an Analyzer. The analyzer we will use for this example is WhitespaceAnalyzer, which splits text on whitespace and leaves each token otherwise unchanged. Other analyzers are available depending on your needs. For example, StandardAnalyzer splits text on word boundaries, converts tokens to lowercase, and removes common stop words.

The second parameter of QueryParser’s constructor is the default field to search against. The query string provided to the search endpoint is composed of one or more terms, each consisting of a search field and a value separated by a colon. If a term omits the search field and includes only a value, Lucene searches against the default field. So if you want to search on a person’s social security number by default, you would pass ssn as the field parameter. For our example, we don’t want a default search field, so we use null.


public async Task<List<Person>> SearchPersons(SearchPersonsRequest request, CancellationToken cancellationToken)
{
    if (string.IsNullOrWhiteSpace(request.Query))
        return await _societyDbContext.Persons.AsNoTracking().ToListAsync(cancellationToken);
    
    using var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);

    var parser = new QueryParser(LuceneVersion.LUCENE_48, null, analyzer);
    
    …
    
}

Step 2. Parsing the query

Once we configure Lucene, we need to build the Entity Framework expression that will be used to filter the collection. In the next code snippet, parser.Parse(request.Query) generates a Lucene Query object, and we pass it to GetTerms to translate it into an expression. We then use this expression to filter the database collection.


public async Task<List<Person>> SearchPersons(SearchPersonsRequest request, CancellationToken cancellationToken)
{
    …
    
    var searchFilter = GetTerms(parser.Parse(request.Query));
    return await _societyDbContext.Persons.AsNoTracking().Where(searchFilter).ToListAsync(cancellationToken);
}

Step 3. Building the expression


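private Expression<Func<Person, bool>> GetTerms(Query query) =>
    query switch
    {
        // Leaf: a single field:value pair.
        TermQuery termQuery => CreateTermExpression(termQuery),
        // Range queries are out of scope for this example.
        TermRangeQuery => throw new InvalidOperationException(),
        // Multi-word value for a single field.
        PhraseQuery phraseQuery => ParsePhraseQuery(phraseQuery),
        // AND/OR combination of nested queries, handled recursively.
        BooleanQuery booleanQuery => ParseBooleanQuery(booleanQuery),
        _ => throw new InvalidOperationException()
    };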

This switch expression matches on the Query’s runtime type. When Lucene parses a query string, it produces a Query object that may be composed of nested queries, each of which can have a different type. To keep our example simple, we handle just three query types, but even this covers a wide range of use cases.

TermQuery is the most granular term: it represents a search field and a value, and you can think of it as a leaf in a tree. Some other types of terms are composed of TermQueries. PhraseQuery is also a granular term used to parse multi-word search values, such as a sentence, in which each word is mapped to a term and the PhraseQuery is composed of multiple terms.

Finally, BooleanQuery represents boolean combinations of queries, which enables creating conjunctive or disjunctive expressions using the AND or OR operators.

There are also some other query types, such as TermRangeQuery to handle search ranges if, for example, you want to search for a date range or between some quantities. But we are not going to cover those.

Now that we have the top-level function for building our expression, let’s dive into the functions that handle each type of query.

Step 4. Parsing subqueries

Let’s start with the simplest one: parsing a TermQuery.


private Expression<Func<Person, bool>> CreateTermExpression(TermQuery termQuery)
{
    var searchValue = termQuery.Term.Text;
    
    return termQuery.Term.Field switch
    {
        "firstName" => person => person.FirstName == searchValue,
        "lastName" => person => person.LastName == searchValue,
        "ssn" => person => person.Ssn == searchValue,
        "favoriteFood" => person => person.FavoriteFood == searchValue,
        _ => throw new InvalidOperationException()
    };
}

(Switch expressions were introduced in C# 8.0. Refer to Microsoft’s changelog for more info.)

This function receives the TermQuery object and returns an expression depending on the term’s field. These are just simple predicates, but depending on your use case you can create more complex predicates.
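As one illustration of a more complex predicate, the sketch below matches firstName by prefix instead of by equality. It is an assumption about what such a predicate could look like, not code from the repository: the Person record, the PredicateSketch class, and the sample data are all hypothetical, and an in-memory list stands in for the database table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical stand-in for the entity used in the article.
public record Person(string FirstName, string FavoriteFood);

public static class PredicateSketch
{
    // Builds a prefix-match predicate instead of an equality check.
    public static Expression<Func<Person, bool>> FirstNameStartsWith(string searchValue) =>
        person => person.FirstName.StartsWith(searchValue);

    public static List<string> Demo()
    {
        var people = new List<Person>
        {
            new("Bob", "Pizza"),
            new("Bobby", "Steak"),
            new("Alice", "Steak"),
        };

        // AsQueryable lets the expression run against an in-memory list,
        // much like Where(...) consumes it against a DbSet.
        return people.AsQueryable()
            .Where(FirstNameStartsWith("Bob"))
            .Select(person => person.FirstName)
            .ToList();
    }
}
```

With this sample data, PredicateSketch.Demo() returns Bob and Bobby. Whether a given predicate can be translated by a real database provider depends on the provider, so it is worth checking before relying on it.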

Now, let’s look into parsing a BooleanQuery.


private Expression<Func<Person, bool>> ParseBooleanQuery(BooleanQuery query)
{
    var startExpression = GetTerms(query.Clauses[0].Query);
    
    return query.Clauses.Skip(1).Fold(
        startExpression,
        (filter, clause) =>
        {
            var clauseExpression = GetTerms(clause.Query);
            return ConcatBooleanExpressions(clause, filter, clauseExpression);
        });
}

private Expression<Func<Person, bool>> ConcatBooleanExpressions(
    BooleanClause clause,
    Expression<Func<Person, bool>> first,
    Expression<Func<Person, bool>> second) =>
    clause.Occur switch
    {
        Occur.MUST => first.And(second),
        Occur.SHOULD => first.Or(second),
        _ => throw new InvalidOperationException()
    };

A BooleanQuery is composed of clauses, each of which wraps a Query that can be any of the types we discussed in Step 3. This means an AND clause can combine a TermQuery with a PhraseQuery, or any other combination, and the result can itself be a clause in another BooleanQuery, and so on, in a recursive tree structure. For example, (firstName:Bob AND lastName:Smith) OR favoriteFood:Steak parses into a BooleanQuery whose first clause is itself a BooleanQuery.

The first function, ParseBooleanQuery, receives the BooleanQuery object, builds the initial expression from the first boolean clause, and then iterates over the remaining clauses with the Fold function. Each step combines the accumulated expression with the predicate for the next clause using an AND or OR operator. The fold operation calls GetTerms to compute the next predicate, which may recursively call ParseBooleanQuery if the original query contains nested boolean queries.
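Fold is not part of System.Linq, so it presumably comes from a helper library or an extension method in the repository. The same left fold can be sketched with LINQ’s built-in Aggregate. To keep the sketch runnable without Lucene.NET, OccurSketch and ClauseSketch are hypothetical stand-ins for Lucene’s Occur and BooleanClause, and plain delegates over strings replace the expression trees.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for Lucene's Occur and BooleanClause.
public enum OccurSketch { Must, Should }

public record ClauseSketch(OccurSketch Occur, Func<string, bool> Predicate);

public static class FoldSketch
{
    // Left fold: seed with the first clause, then merge each remaining
    // clause into the accumulated filter according to its Occur value.
    public static Func<string, bool> Combine(IReadOnlyList<ClauseSketch> clauses) =>
        clauses.Skip(1).Aggregate(
            clauses[0].Predicate,
            (filter, clause) => Merge(filter, clause));

    private static Func<string, bool> Merge(Func<string, bool> filter, ClauseSketch clause) =>
        clause.Occur switch
        {
            OccurSketch.Must => value => filter(value) && clause.Predicate(value),
            OccurSketch.Should => value => filter(value) || clause.Predicate(value),
            _ => throw new InvalidOperationException()
        };
}
```

As in the article’s version, the Occur value of the first clause is never consulted. Only the clauses after it decide whether the merge is an AND or an OR.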

ConcatBooleanExpressions builds an expression from two predicates by calling the .And()/.Or() Expression extension functions, which you can find in the class ExpressionUtility in the GitHub repository. We know which extension function to call from the clause’s Occur property, which can be MUST (and), SHOULD (or), or MUST_NOT (not). For our example we support only MUST and SHOULD, so any other value throws an InvalidOperationException.
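For readers without the repository at hand, here is one possible shape for such And/Or helpers. It is a sketch under assumptions, not the repository’s ExpressionUtility, which may be implemented differently. The names ExpressionUtilitySketch, Combine, and ParameterReplacer are hypothetical. The key step is rewriting the second lambda to reuse the first lambda’s parameter, so the merged body refers to a single parameter.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionUtilitySketch
{
    public static Expression<Func<T, bool>> And<T>(
        this Expression<Func<T, bool>> first,
        Expression<Func<T, bool>> second) =>
        Combine(first, second, Expression.AndAlso);

    public static Expression<Func<T, bool>> Or<T>(
        this Expression<Func<T, bool>> first,
        Expression<Func<T, bool>> second) =>
        Combine(first, second, Expression.OrElse);

    private static Expression<Func<T, bool>> Combine<T>(
        Expression<Func<T, bool>> first,
        Expression<Func<T, bool>> second,
        Func<Expression, Expression, BinaryExpression> merge)
    {
        // Both bodies must refer to the same parameter instance, so the
        // second body is rewritten to use the first lambda's parameter.
        var parameter = first.Parameters[0];
        var secondBody = new ParameterReplacer(second.Parameters[0], parameter)
            .Visit(second.Body);

        return Expression.Lambda<Func<T, bool>>(merge(first.Body, secondBody), parameter);
    }

    private sealed class ParameterReplacer : ExpressionVisitor
    {
        private readonly ParameterExpression _from;
        private readonly ParameterExpression _to;

        public ParameterReplacer(ParameterExpression from, ParameterExpression to)
        {
            _from = from;
            _to = to;
        }

        protected override Expression VisitParameter(ParameterExpression node) =>
            node == _from ? _to : base.VisitParameter(node);
    }
}
```

Given two predicates over the same type, first.And(second) and first.Or(second) each return a single lambda. Because the result is still an expression tree rather than a compiled delegate, a query provider can inspect it instead of having to run it in memory.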

Finally, we will walk through how to parse a PhraseQuery.


private Expression<Func<Person, bool>> ParsePhraseQuery(PhraseQuery phraseQuery) =>
    phraseQuery.GetTerms()[0].Field switch
    {
        "favoriteFood" => person => person.FavoriteFood == BuildPhraseQueryValue(phraseQuery),
        _ => throw new InvalidOperationException()
    };
    
private string BuildPhraseQueryValue(PhraseQuery phraseQuery)
{
    var stringBuilder = new StringBuilder();
    foreach (var term in phraseQuery.GetTerms())
    {
        stringBuilder.Append($"{term.Text} ");
    }
    
    return stringBuilder.ToString().Trim();
}

When a quoted multi-word phrase is included in the search query, such as favoriteFood:"Hamburger with fries", Lucene parses it as a PhraseQuery that contains a term for each word in the phrase (split on whitespace by our analyzer). All of the terms in a PhraseQuery share the same field name. In the switch block, we match on the field name favoriteFood, since this is the only field that supports multiple words. We want to search our database for the original phrase, including the whitespace between words. To reconstruct this value, we iterate over all of the terms in the PhraseQuery and concatenate them with a space, which is what BuildPhraseQueryValue does.
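The same reconstruction can also be sketched with string.Join, which puts the separator only between items and so needs no trailing Trim. In this sketch, termTexts is a hypothetical stand-in for the text of each term in the PhraseQuery, so the code runs without Lucene.NET.

```csharp
using System.Collections.Generic;

public static class PhraseSketch
{
    // termTexts stands in for the Text value of each term in the phrase.
    public static string BuildPhraseValue(IEnumerable<string> termTexts) =>
        string.Join(" ", termTexts);
}
```

For the terms Hamburger, with, and fries, BuildPhraseValue returns Hamburger with fries.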

Step 5. Looking at the results

Congrats! We are now done with the code and can run the application and test our search endpoint. The project’s README has instructions for running it from an IDE or from the command line. Once it is running, the search endpoint is exposed at http://localhost:5082/search and accepts POST requests with a JSON body. Let’s test it using Postman.

First, let’s just call it with an empty query to get all the people:


Now, let's write our first query:
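A request body for this query could be built as in the sketch below. It assumes the endpoint binds a JSON property named query to the Query property of SearchPersonsRequest. The article does not spell out the exact body shape, so check the repository before relying on it. SearchPersonsRequestSketch and RequestSketch are hypothetical names.

```csharp
using System.Text.Json;

// Hypothetical mirror of the request type; only the Query property is
// confirmed by the article's snippets.
public record SearchPersonsRequestSketch(string Query);

public static class RequestSketch
{
    // Serializes with web defaults, which use camelCase property names.
    public static string BuildBody(string query) =>
        JsonSerializer.Serialize(
            new SearchPersonsRequestSketch(query),
            new JsonSerializerOptions(JsonSerializerDefaults.Web));
}
```

Calling RequestSketch.BuildBody("firstName:Bob OR favoriteFood:Steak") produces {"query":"firstName:Bob OR favoriteFood:Steak"}, which can be pasted into the request body in Postman.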

As you can see, we provided a query to filter by Bob as first name or Steak as favorite food, and we got back the two people matching it. In the same way, we can query the other fields we accounted for in the code, and build more complex queries such as the one shown below. Be careful with Lucene’s operators: they are case sensitive, so if you write or instead of OR, Lucene parses it as a search term rather than as the OR operator.

Wrapping up

In this tutorial, we built a simple API in ASP.NET Core 6 and exposed a search endpoint that filters a table of persons using a query from the request body. The query follows Lucene’s query syntax and is transformed into an Entity Framework expression under the hood. We hope this helps you get started writing your own search endpoints powered by Lucene and Entity Framework.

If you’re interested in searching encrypted data, we recently released Search Tokens for identifiable information (like employee identification numbers and social security numbers) and would love your feedback on querying other data types, like Bank, PCI or even generic data.