
Transforming a search query into an EF expression with Lucene

If you’re building an API whose search endpoint accepts Lucene query expressions and whose data is handled by Entity Framework, this blog post can help you get started.

Our task

Let’s say we’re a food delivery app running a special promotion for two groups: anyone whose first name is Bob, or anyone whose favorite food is steak. Ultimately, we want our API to return results from the following Lucene query:

<span class="code">"firstName:Bob OR favoriteFood:Steak"</span>

To do so, we will implement an HTTP endpoint that searches a repository containing personal information and preferences for a collection of people. This endpoint will accept a query expression following Lucene’s query syntax, as seen above, transform it into an Entity Framework expression, and query a table of persons filtered by that expression. We will use ASP.NET Core 6 and create a simple REST API to contain this example endpoint. To keep it simple and fast, we will use an in-memory database (named Society) which has only one table (Persons), with the following structure:
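As a rough sketch of that table, here is what the entity and context could look like. The property names <span class="code">FirstName</span>, <span class="code">LastName</span>, <span class="code">Ssn</span> and <span class="code">FavoriteFood</span> come from the code later in this post; the <span class="code">Id</span> key, the property types and the <span class="code">SocietyDbContext</span> class name are assumptions made for illustration, and the snippet assumes the Microsoft.EntityFrameworkCore.InMemory package is installed.

```csharp
using System;
using Microsoft.EntityFrameworkCore;

// Assumed shape of a row in the Persons table.
public class Person
{
    public Guid Id { get; set; } = Guid.NewGuid(); // hypothetical primary key
    public string FirstName { get; set; } = string.Empty;
    public string LastName { get; set; } = string.Empty;
    public string Ssn { get; set; } = string.Empty;
    public string FavoriteFood { get; set; } = string.Empty;
}

// Assumed context exposing the single Persons table.
public class SocietyDbContext : DbContext
{
    public SocietyDbContext(DbContextOptions<SocietyDbContext> options) : base(options) { }

    public DbSet<Person> Persons => Set<Person>();
}

// One possible registration in Program.cs (in-memory database named "Society"):
// builder.Services.AddDbContext<SocietyDbContext>(o => o.UseInMemoryDatabase("Society"));
```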


(WARNING: The following example uses plain text sensitive data for illustrative purposes only. Sensitive data should be encrypted at rest and in transit. If you are building a real production application, consider using Basis Theory for storing this data more securely.)

We won’t go over all the details of setting up the web application and exposing the endpoint so that we can focus on the specific parts pertaining to parsing the query, but check the GitHub repository for the full example. 

How to transform the query

Step 1. Configuring Lucene

To parse a query string with Lucene we need to instantiate a <span class="code">QueryParser</span>, which in turn uses an <span class="code">Analyzer</span>. The analyzer we will use for this example is <span class="code">WhitespaceAnalyzer</span>, which splits on whitespace and leaves each token otherwise unchanged. There are several other analyzers depending on your needs; for example, the <span class="code">StandardAnalyzer</span> converts your search values to lowercase and splits on word boundaries rather than only on whitespace.

The second parameter of the <span class="code">QueryParser</span> constructor is the default field to search. The query string provided to the search endpoint is composed of one or more terms. Each term consists of a search field and a value, separated by a colon. If a term omits the search field and includes only a value, Lucene searches the default field. So if you want to search by default on a person's social security number, you would instantiate the parser with <span class="code">ssn</span> as the field parameter. For our example, we don’t want a default search field, so we use <span class="code">null</span>.


public async Task<List<Person>> SearchPersons(SearchPersonsRequest request, CancellationToken cancellationToken)
{
    if (string.IsNullOrWhiteSpace(request.Query))
        return await _societyDbContext.Persons.AsNoTracking().ToListAsync(cancellationToken);
    
    using var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);

    var parser = new QueryParser(LuceneVersion.LUCENE_48, null, analyzer);
    
    …
    
}
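
To make the default field behavior concrete, here is a small sketch that is separate from the endpoint code. It parses a query with a configurable default field and reports which field the resulting term targets. The <span class="code">DefaultFieldDemo</span> class and its method are hypothetical names, and the snippet assumes the Lucene.NET 4.8 packages (including Lucene.Net.QueryParser) are referenced.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

public static class DefaultFieldDemo
{
    // Parses a single-term query and returns the field that term will search.
    public static string FieldOf(string queryText, string defaultField)
    {
        using var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, defaultField, analyzer);

        // A single "field:value" or bare "value" is expected to parse to a TermQuery.
        if (parser.Parse(queryText) is not TermQuery termQuery)
            throw new InvalidOperationException("Expected a single-term query.");

        return termQuery.Term.Field;
    }
}
```

With <span class="code">ssn</span> as the default field, a bare value such as <span class="code">078051120</span> should resolve to the <span class="code">ssn</span> field, while <span class="code">firstName:Bob</span> keeps its explicit field.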

Step 2. Parsing the query

Once we configure Lucene, we need to build the Entity Framework expression that will be used to filter the collection. In the next code snippet, <span class="code">parser.Parse(request.Query)</span> generates a Lucene <span class="code">Query</span> object, and we pass it to <span class="code">GetTerms</span>, which translates it into an expression. We then use this expression to filter the database collection.


public async Task<List<Person>> SearchPersons(SearchPersonsRequest request, CancellationToken cancellationToken)
{
    …
    
    var searchFilter = GetTerms(parser.Parse(request.Query));
    return await _societyDbContext.Persons.AsNoTracking().Where(searchFilter).ToListAsync(cancellationToken);
}

Step 3. Building the expression


private Expression<Func<Person, bool>> GetTerms(Query query) =>
    query switch
    {
        TermQuery termQuery => CreateTermExpression(termQuery),
        TermRangeQuery => throw new InvalidOperationException(),
        PhraseQuery phraseQuery => ParsePhraseQuery(phraseQuery),
        BooleanQuery booleanQuery => ParseBooleanQuery(booleanQuery),
        _ => throw new InvalidOperationException()
    };

This switch expression matches the Query’s type. When Lucene parses a query, it converts it into an object composed of different Terms, each of which can have a different query type. To keep it simple for our example, we are just going to handle three possible query types, but even this will cover a wide range of use cases.

<span class="code">TermQuery</span> is the most granular term: it represents a search field and a value, and you can think of it as a leaf in a tree. Some other types of terms are composed of <span class="code">TermQuery</span> objects. <span class="code">PhraseQuery</span> is also a granular term, used to parse multi-word search values such as a sentence: each word is mapped to a term, and the <span class="code">PhraseQuery</span> is composed of those terms.

Finally, <span class="code">BooleanQuery</span> represents boolean combinations of queries, which enables creating conjunctive or disjunctive expressions using the <span class="code">AND</span> or <span class="code">OR</span> operators. 

There are also some other query types, such as <span class="code">TermRangeQuery</span> to handle search ranges if, for example, you want to search for a date range or between some quantities. But we are not going to cover those. 
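
To see these query types concretely, the following sketch parses a query string and prints the resulting tree, one node per line. <span class="code">QueryTreePrinter</span> is a hypothetical helper written for this post, not part of Lucene, and it assumes the same Lucene.NET 4.8 packages as the rest of the example.

```csharp
using System.Linq;
using Lucene.Net.Analysis.Core;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

public static class QueryTreePrinter
{
    // Parses the query text and describes the resulting Query tree.
    public static string Describe(string queryText)
    {
        using var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, null, analyzer);
        return Describe(parser.Parse(queryText), 0);
    }

    private static string Describe(Query query, int depth)
    {
        var indent = new string(' ', depth * 2);
        return query switch
        {
            // Leaf: a single field and value.
            TermQuery termQuery =>
                $"{indent}TermQuery {termQuery.Term.Field}:{termQuery.Term.Text}",
            // Leaf: several terms that share one field.
            PhraseQuery phraseQuery =>
                $"{indent}PhraseQuery with {phraseQuery.GetTerms().Length} terms",
            // Branch: each clause carries an Occur flag and a nested Query.
            BooleanQuery booleanQuery =>
                $"{indent}BooleanQuery" + string.Concat(booleanQuery.Clauses.Select(
                    clause => $"\n{indent}  {clause.Occur}\n{Describe(clause.Query, depth + 2)}")),
            _ => $"{indent}{query.GetType().Name}"
        };
    }
}
```

Calling <span class="code">QueryTreePrinter.Describe("firstName:Bob OR favoriteFood:Steak")</span> should show a <span class="code">BooleanQuery</span> with one <span class="code">TermQuery</span> per clause.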

Now that we have the top-level function for building our expression, let’s dive into the functions that handle each type of query. 

Step 4. Parsing subqueries 

Let’s start with the simplest one: parsing a <span class="code">TermQuery</span>.


private Expression<Func<Person, bool>> CreateTermExpression(TermQuery termQuery)
{
    var searchValue = termQuery.Term.Text;
    
    return termQuery.Term.Field switch
    {
        "firstName" => person => person.FirstName == searchValue,
        "lastName" => person => person.LastName == searchValue,
        "ssn" => person => person.Ssn == searchValue,
        "favoriteFood" => person => person.FavoriteFood == searchValue,
        _ => throw new InvalidOperationException()
    };
}

(This switch expression was introduced in C# 8.0. Refer to Microsoft's changelog for more info.)

This function receives the <span class="code">TermQuery</span> object and returns an expression depending on the term’s field. These are just simple predicates, but depending on your use case you can create more complex predicates. 
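
As an illustration of a more complex predicate, the sketch below matches the first name case-insensitively and treats the favorite food value as a prefix. The <span class="code">Person</span> record here is a minimal stand-in for the real entity, and <span class="code">TermPredicates.Create</span> is a hypothetical name; whether a given predicate can be translated to SQL depends on your EF Core provider.

```csharp
using System;
using System.Linq.Expressions;

// Minimal stand-in for the Person entity, for illustration only.
public record Person(string FirstName, string FavoriteFood);

public static class TermPredicates
{
    public static Expression<Func<Person, bool>> Create(string field, string searchValue)
    {
        // Lowercase once, outside the expression, so the value is captured as a constant.
        var lowered = searchValue.ToLower();

        return field switch
        {
            // Case-insensitive equality on the first name.
            "firstName" => person => person.FirstName.ToLower() == lowered,
            // Prefix match: "Ham" matches "Hamburger with fries".
            "favoriteFood" => person => person.FavoriteFood.StartsWith(searchValue),
            _ => throw new InvalidOperationException($"Unsupported search field '{field}'.")
        };
    }
}
```

Compiling the expression lets us try it in memory: <span class="code">TermPredicates.Create("firstName", "bob").Compile()</span> accepts a person whose first name is Bob.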

Now, let’s look into parsing a <span class="code">BooleanQuery</span>.


private Expression<Func<Person, bool>> ParseBooleanQuery(BooleanQuery query)
{
    var startExpression = GetTerms(query.Clauses[0].Query);
    
    return query.Clauses.Skip(1).Fold(
        startExpression,
        (filter, clause) =>
        {
            var clauseExpression = GetTerms(clause.Query);
            return ConcatBooleanExpressions(clause, filter, clauseExpression);
        });
}

private Expression<Func<Person, bool>> ConcatBooleanExpressions(
    BooleanClause clause,
    Expression<Func<Person, bool>> first,
    Expression<Func<Person, bool>> second) =>
    clause.Occur switch
    {
        Occur.MUST => first.And(second),
        Occur.SHOULD => first.Or(second),
        _ => throw new InvalidOperationException()
    };

A <span class="code">BooleanQuery</span> is composed of clauses, each of which wraps a <span class="code">Query</span> that can be any of the types we discussed in Step 3. This means you can have an <span class="code">AND</span> clause that combines a <span class="code">TermQuery</span> with a <span class="code">PhraseQuery</span>, or any other combination, and nest that inside another clause with any other <span class="code">Query</span> type, and so on, in a recursive tree structure.

The first function, <span class="code">ParseBooleanQuery</span>, receives the <span class="code">BooleanQuery</span> object, builds the initial expression from the first boolean clause, and then iterates over remaining terms via the <span class="code">Fold</span> function, which updates the starting expression by concatenating the predicate for each clause using an <span class="code">AND</span> or <span class="code">OR</span> operator to combine predicates. The fold operation calls <span class="code">GetTerms</span> to compute the next predicate, and this may recursively call <span class="code">ParseBooleanQuery</span> if the original query contains nested boolean terms. 
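
Note that <span class="code">Fold</span> is not part of <span class="code">System.Linq</span>, so we assume it comes from a helper or library referenced by the project. The standard LINQ equivalent is <span class="code">Aggregate</span>. The sketch below shows the same left fold using plain delegates over integers in place of expressions over persons; <span class="code">OccurSketch</span> and <span class="code">FoldDemo</span> are hypothetical names standing in for Lucene's <span class="code">Occur</span> and our parsing code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for Lucene's Occur flag, for illustration only.
public enum OccurSketch { Must, Should }

public static class FoldDemo
{
    private static Func<int, bool> And(Func<int, bool> a, Func<int, bool> b) => v => a(v) && b(v);
    private static Func<int, bool> Or(Func<int, bool> a, Func<int, bool> b) => v => a(v) || b(v);

    // Starts from the first clause and folds the remaining ones into it, left to right.
    // As in ParseBooleanQuery, the Occur of the clause being added picks the operator.
    public static Func<int, bool> Combine(
        IReadOnlyList<(OccurSketch Occur, Func<int, bool> Predicate)> clauses) =>
        clauses.Skip(1).Aggregate(
            clauses[0].Predicate,
            (filter, clause) => clause.Occur == OccurSketch.Must
                ? And(filter, clause.Predicate)
                : Or(filter, clause.Predicate));
}
```

Like the original function, this sketch expects at least one clause and would throw on an empty list.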

<span class="code">ConcatBooleanExpressions</span> builds an expression from two predicates by calling the <span class="code">.And()</span> and <span class="code">.Or()</span> expression extension functions, which you can find in the class <span class="code">ExpressionUtility</span> in the GitHub repository. We know which extension function to call because of the clause’s <span class="code">Occur</span> property, which can be <span class="code">MUST</span> (and), <span class="code">SHOULD</span> (or) or <span class="code">MUST_NOT</span> (not). For our example we are just supporting <span class="code">MUST</span> and <span class="code">SHOULD</span>.
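
If you are curious how such extension functions can work, here is one common way to write them. This is a sketch of the general technique, not necessarily what the repository's <span class="code">ExpressionUtility</span> does: the two lambdas each have their own parameter, so we rewrite the second body to reuse the first lambda's parameter before joining the bodies. The class name <span class="code">ExpressionUtilitySketch</span> is hypothetical.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionUtilitySketch
{
    public static Expression<Func<T, bool>> And<T>(
        this Expression<Func<T, bool>> left, Expression<Func<T, bool>> right) =>
        Combine(left, right, Expression.AndAlso);

    public static Expression<Func<T, bool>> Or<T>(
        this Expression<Func<T, bool>> left, Expression<Func<T, bool>> right) =>
        Combine(left, right, Expression.OrElse);

    private static Expression<Func<T, bool>> Combine<T>(
        Expression<Func<T, bool>> left,
        Expression<Func<T, bool>> right,
        Func<Expression, Expression, BinaryExpression> merge)
    {
        // Reuse the left lambda's parameter in the right body so both sides
        // refer to the same input object.
        var parameter = left.Parameters[0];
        var rightBody = new ParameterReplacer(right.Parameters[0], parameter).Visit(right.Body);
        return Expression.Lambda<Func<T, bool>>(merge(left.Body, rightBody), parameter);
    }

    // Swaps one parameter for another everywhere it appears in an expression tree.
    private sealed class ParameterReplacer : ExpressionVisitor
    {
        private readonly ParameterExpression _from;
        private readonly ParameterExpression _to;

        public ParameterReplacer(ParameterExpression from, ParameterExpression to)
        {
            _from = from;
            _to = to;
        }

        protected override Expression VisitParameter(ParameterExpression node) =>
            node == _from ? _to : base.VisitParameter(node);
    }
}
```

Rewriting the parameter, rather than invoking the second lambda, keeps the combined tree in a form that query providers can typically translate.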

Finally, we will walk through how to parse a <span class="code">PhraseQuery</span>.


private Expression<Func<Person, bool>> ParsePhraseQuery(PhraseQuery phraseQuery) =>
    phraseQuery.GetTerms()[0].Field switch
    {
        "favoriteFood" => person => person.FavoriteFood == BuildPhraseQueryValue(phraseQuery),
        _ => throw new InvalidOperationException()
    };
    
private string BuildPhraseQueryValue(PhraseQuery phraseQuery)
{
    var stringBuilder = new StringBuilder();
    foreach (var term in phraseQuery.GetTerms())
    {
        stringBuilder.Append($"{term.Text} ");
    }
    
    return stringBuilder.ToString().Trim();
}

When a multi-word phrase is included in the search query, Lucene parses this as a <span class="code">PhraseQuery</span> that contains a term for each word in the phrase (separated by whitespace). The <span class="code">PhraseQuery</span> is composed of multiple terms, all of which have the same field name. In the switch block, we match on the field value <span class="code">favoriteFood</span>, since this is the only field that supports multiple words, such as <span class="code">Hamburger with fries</span>. We want to search our database for the original phrase that contained whitespace between words. To reconstruct this value for our expression, we iterate over all of the terms in the <span class="code">PhraseQuery</span> and concatenate them with a space, which is what we do in <span class="code">BuildPhraseQueryValue</span>.
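
To try the phrase handling on its own, the sketch below parses a quoted phrase and rebuilds the original text from its terms. It uses <span class="code">string.Join</span> as a compact alternative to the <span class="code">StringBuilder</span> loop; <span class="code">PhraseDemo</span> is a hypothetical helper, and the snippet assumes the same Lucene.NET 4.8 packages as before.

```csharp
using System;
using System.Linq;
using Lucene.Net.Analysis.Core;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

public static class PhraseDemo
{
    // Parses a quoted multi-word query and returns the phrase with single spaces between words.
    public static string ExtractPhrase(string queryText)
    {
        using var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, null, analyzer);

        if (parser.Parse(queryText) is not PhraseQuery phraseQuery)
            throw new InvalidOperationException("Expected a quoted multi-word phrase.");

        // Each word of the phrase is a separate term; join them back together.
        return string.Join(" ", phraseQuery.GetTerms().Select(term => term.Text));
    }
}
```

For the query <span class="code">favoriteFood:"Hamburger with fries"</span>, this should give back <span class="code">Hamburger with fries</span>.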

Step 5. Looking at the results

Congrats! We are now done with the code and can run the application and test our search endpoint. The project’s README has instructions for running it from an IDE or from the command line. Once it is running, the search endpoint is exposed at http://localhost:5082/search and accepts POST requests with a JSON body. Let’s test it using Postman.

First, let's just call it with an empty query to get all the people:

Now, let's write our first query:
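
If you prefer code over Postman, the same request can be sketched with <span class="code">HttpClient</span>. The URL comes from the example project; the JSON property name <span class="code">query</span> is an assumption based on the <span class="code">request.Query</span> property and default camel-case JSON binding, and <span class="code">SearchClientSketch</span> is a hypothetical name.

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public static class SearchClientSketch
{
    // Posts a Lucene query string to the search endpoint and returns the raw JSON response.
    public static async Task<string> SearchAsync(HttpClient client, string query)
    {
        // Assumed body shape: { "query": "firstName:Bob OR favoriteFood:Steak" }
        using var response = await client.PostAsJsonAsync(
            "http://localhost:5082/search", new { query });

        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

With the API running locally, <span class="code">await SearchClientSketch.SearchAsync(new HttpClient(), "firstName:Bob OR favoriteFood:Steak")</span> would send our promotion query.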

As you can see, we provided a query to filter by Bob as first name or Steak as favorite food, and we got back the two people matching it. In the same way, we can query the other fields we accounted for in the code and build more complex queries that combine several terms. Be careful: Lucene's operators are case sensitive, so if you write <span class="code">or</span> instead of <span class="code">OR</span>, Lucene parses it as a search term rather than as the operator.

Wrapping up

In this tutorial, we fired up a simple API in ASP.NET Core 6 and exposed a search endpoint that filters a table of persons using a query passed in the request body. The query follows Lucene's query syntax and is transformed into an Entity Framework expression under the hood. We hope this can help you get started in writing your own search endpoints powered by Lucene and Entity Framework.

If you’re interested in searching encrypted data, we recently released Search Tokens for identifiable information (like employee identification numbers and social security numbers) and would love your feedback on querying other data types, like Bank, PCI or even generic data.
