Query Analyzer#

A query analyzer is a common component of information retrieval systems that helps improve the relevance of retrieved results.

The analyzer takes as input a raw search query from the user (e.g., “cheap restaurants near me”) together with additional metadata, and translates the query into a more precise structured query.

The resulting structured query could look like:

{
    "text": null,
    "entity_type": "restaurant",
    "filters": [
        {
            "attribute": "price",
            "op": "<",
            "value": 100
        },
        {
            "attribute": "location",
            "op": "near",
            "value": "user_geo_location"
        }        
    ]
}
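To make the purpose of such a structured query concrete, here is a minimal sketch of how a downstream component might apply its filters to in-memory records. The `OPS` table, the `matches` helper, and the sample records are all hypothetical and not part of Kork or LangChain; the "near" filter is left out because it would need a geospatial lookup.

```python
import operator
from typing import Any, Dict, List

# Hypothetical mapping from operator symbols to Python comparisons.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
}


def matches(record: Dict[str, Any], filters: List[Dict[str, Any]]) -> bool:
    """Return True if the record satisfies every filter (implicit AND)."""
    for f in filters:
        if f["attribute"] not in record:
            return False
        compare = OPS[f["op"]]
        if not compare(record[f["attribute"]], f["value"]):
            return False
    return True


# Made-up records, for illustration only.
restaurants = [
    {"name": "Noodle Bar", "price": 40},
    {"name": "Chez Fancy", "price": 250},
]
structured_query = {
    "text": None,
    "entity_type": "restaurant",
    "filters": [{"attribute": "price", "op": "<", "value": 100}],
}

cheap = [r["name"] for r in restaurants if matches(r, structured_query["filters"])]
print(cheap)  # → ['Noodle Bar']
```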

To illustrate how Kork can be used, let’s re-implement the query analyzer that LangChain implements with its QueryConstructor chain.

import langchain
from langchain.llms import OpenAI
from typing import List, Any
from kork import CodeChain

Filters#

We’ll define some functions that can be used for query analysis as foreign functions.

Note

Representing each binary operator as its own function seemed to work better than a more generic representation that uses a single function:

def f_(attribute: str, op: Any, value: Any) -> Any:
    """Apply a filter on the given attribute with an operator (">", "<", "=", ...) and value."""
    ...

As an experiment, you could try to swap to the more generic representation and see if you can get it working well!
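One possible starting point for that experiment is sketched here. The body, the `ALLOWED_OPS` set, and the `op: str` annotation are assumptions made for illustration; only the function name and docstring come from the note.

```python
from typing import Any

# Operators accepted by this sketch. The set is an assumption that mirrors
# the per-operator functions defined in this section.
ALLOWED_OPS = {">", ">=", "=", "!=", "<=", "<", "in"}


def f_(attribute: str, op: str, value: Any) -> Any:
    """Apply a filter on the given attribute with an operator (">", "<", "=", ...) and value."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"Unsupported operator: {op!r}")
    # Same dictionary shape that the per-operator functions return.
    return {"attribute": attribute, "op": op, "value": value}


print(f_("price", "<", 100))  # → {'attribute': 'price', 'op': '<', 'value': 100}
```

With this variant the model has to produce the operator as a string argument, which is one plausible reason the per-operator functions were easier for it to use.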

def gt(attribute: str, value: Any) -> Any:
    """Filter to where attribute > value"""
    return {"attribute": attribute, "op": ">", "value": value}


def gte(attribute: str, value: Any) -> Any:
    """Filter to where attribute >= value"""
    return {"attribute": attribute, "op": ">=", "value": value}


def eq(attribute: str, value: Any) -> Any:
    """Filter to where attribute = value"""
    return {"attribute": attribute, "op": "=", "value": value}


def neq(attribute: str, value: Any) -> Any:
    """Filter to where attribute != value"""
    return {"attribute": attribute, "op": "!=", "value": value}


def lte(attribute: str, value: Any) -> Any:
    """Filter to where attribute <= value"""
    return {"attribute": attribute, "op": "<=", "value": value}


def lt(attribute: str, value: Any) -> Any:
    """Filter to where attribute < value"""
    return {"attribute": attribute, "op": "<", "value": value}


def and_(filters: List[Any]) -> Any:
    """Combine a list of filters using an AND operator."""
    return {
        "op": "and",
        "filters": [filters],
    }


def or_(filters: List[Any]) -> Any:
    """Combine a list of filters using an OR operator."""
    return {
        "op": "or_",
        "filters": [filters],
    }


def in_(attribute: str, value: List[Any]) -> Any:
    """Filter to where attribute is one of the given values"""
    return {"attribute": attribute, "op": "in", "value": value}


def request(query: str, filters: List[Any]) -> Any:
    return {
        "query": query,
        "filters": filters,
    }
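Because these are ordinary Python functions, they can be composed by hand to preview the dictionary a generated program would produce. The snippet below repeats four of the definitions from this section so that it runs on its own; the query itself is only an illustration.

```python
from typing import Any, List


# Copies of four functions from this section, repeated so the snippet is self-contained.
def eq(attribute: str, value: Any) -> Any:
    """Filter to where attribute = value"""
    return {"attribute": attribute, "op": "=", "value": value}


def gte(attribute: str, value: Any) -> Any:
    """Filter to where attribute >= value"""
    return {"attribute": attribute, "op": ">=", "value": value}


def or_(filters: List[Any]) -> Any:
    """Combine a list of filters using an OR operator."""
    return {"op": "or_", "filters": [filters]}


def request(query: str, filters: List[Any]) -> Any:
    return {"query": query, "filters": filters}


result = request(
    "toy models of cars",
    or_([eq("author", "john smith"), gte("pub_year", 2012)]),
)

print(result["query"])          # → toy models of cars
print(result["filters"]["op"])  # → or_
# As defined here, or_ wraps the list it receives in another list,
# so the individual filters sit one level further down.
print(len(result["filters"]["filters"]))   # → 1
print(result["filters"]["filters"][0][1])  # → {'attribute': 'pub_year', 'op': '>=', 'value': 2012}
```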

The prompt#

Let’s define a prompt to explain the task.

You can experiment with the prompt to see if you could improve it!

See the prompt in the LangChain query constructor chain for inspiration.

from langchain.prompts import PromptTemplate

template = """\
Your task is to analyze the user's query and translate it into a search request composed \
of a search string and a filter.

Here is a set of functions that you can rely on: 

{external_functions_block}

Here is a schema of the data being queried.

```TypeScript 

type schema = {{
  author: string // The author of the document
  pub_year: string // 4 digit format year representing year when doc was published
  price: number // how much it costs to buy the document
}}
```

Filter attributes must match the data schema. If the query seems to include other attributes, \
assume those are not filters, but part of the search string.
Pay attention to the doc string in the schema for each attribute. If the usage of the filter \
does not match the description in the comment, treat it as part of the search query.

Filters can be combined using `and_` and `or_`.

Please encapsulate the code in <code> and </code> tags.
"""
prompt_template = PromptTemplate(
    template=template, input_variables=["external_functions_block"]
)
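The doubled braces around the schema matter: a single `{name}` is a placeholder to fill in, while `{{` and `}}` come out as literal braces. The sketch below shows this with plain `str.format` on a shortened stand-in template, on the assumption that this mirrors how `PromptTemplate` handles braces in its default format; the placeholder text for the functions block is made up.

```python
# A shortened stand-in for the template above, for illustration only.
mini_template = """\
Here is a set of functions that you can rely on:

{external_functions_block}

type schema = {{
  price: number // how much it costs to buy the document
}}
"""

# Made-up placeholder text; the real block is generated from the foreign functions.
functions_block = "<function signatures go here>"

rendered = mini_template.format(external_functions_block=functions_block)
print(rendered)
```

In the printed text the placeholder is replaced and the schema appears with single braces.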

Examples#

Now, define some example programs.

from kork.parser import parse

examples = [
    (
        "documents published after 2020",
        'var result = request(null, gte("pub_year", 2020))',
    ),
    (
        "toy models of cars by john smith or ones that were published after 2012",
        'var result = request("toy models of cars", or_([eq("author", "john smith"), gte("pub_year", 2012)]))',
    ),
    (
        "share prices by john or oliver",
        'var result = request("share prices", in_("author", ["john", "oliver"]))',
    ),
]

examples_in_ast = [(query, parse(code)) for query, code in examples]
examples_in_ast[0]
('documents published after 2020',
 Program(stmts=(VarDecl(name='result', value=FunctionCall(name='request', args=(Literal(value=None), FunctionCall(name='gte', args=(Literal(value='pub_year'), Literal(value=2020)))))),)))

Let’s test it out#

llm = OpenAI(
    model_name="text-davinci-003",
    temperature=0,
    max_tokens=3000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
    verbose=True,
)

chain = CodeChain.from_defaults(
    llm=llm,
    examples=examples_in_ast,
    context=[gte, gt, eq, neq, lte, lt, in_, and_, or_, request],
    instruction_template=prompt_template,
    input_formatter=None,
)

langchain.verbose = False

queries = [
    "publications by mama bear published after 2013",
    "documents about florida from 2013 or docs written by mama bear",
    "smells like sunshine",
    "documents that discuss $5",
    "documents that discuss the $50 debt of the bank",
    "docs that cost more than $150",
]
results = []

for query in queries:
    results.append(chain(inputs={"query": query}))

from kork.display import as_html_dict, display_html_results

Examine the results below carefully! Not all the generated requests are correct. An empty errors column means that there were no runtime exceptions, not that the result is correct.

display_html_results(
    [as_html_dict(r) for r in results], columns=["query", "code", "result", "errors"]
)
Row 0
  query:  publications by mama bear published after 2013
  code:   var result = request("publications", and_([eq("author", "mama bear"), gte("pub_year", 2013)]))
  result: {'query': 'publications', 'filters': {'op': 'and', 'filters': [[{'attribute': 'author', 'op': '=', 'value': 'mama bear'}, {'attribute': 'pub_year', 'op': '>=', 'value': 2013}]]}}
  errors: []

Row 1
  query:  documents about florida from 2013 or docs written by mama bear
  code:   var result = request("documents about florida", or_([gte("pub_year", 2013), eq("author", "mama bear")]))
  result: {'query': 'documents about florida', 'filters': {'op': 'or_', 'filters': [[{'attribute': 'pub_year', 'op': '>=', 'value': 2013}, {'attribute': 'author', 'op': '=', 'value': 'mama bear'}]]}}
  errors: []

Row 2
  query:  smells like sunshine
  code:   var result = request("smells like sunshine", null)
  result: {'query': 'smells like sunshine', 'filters': None}
  errors: []

Row 3
  query:  documents that discuss $5
  code:   var result = request("documents that discuss $5", eq("price", 5))
  result: {'query': 'documents that discuss $5', 'filters': {'attribute': 'price', 'op': '=', 'value': 5}}
  errors: []

Row 4
  query:  documents that discuss the $50 debt of the bank
  code:   var result = request("documents that discuss the $50 debt of the bank", null)
  result: {'query': 'documents that discuss the $50 debt of the bank', 'filters': None}
  errors: []

Row 5
  query:  docs that cost more than $150
  code:   var result = request(null, gt("price", 150))
  result: {'query': None, 'filters': {'attribute': 'price', 'op': '>', 'value': 150}}
  errors: []
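Part of the manual review can be automated. The sketch below checks that every attribute used in a generated request exists in the schema from the prompt. The helper name `find_unknown_attributes` is hypothetical, and the check is structural only: it cannot judge whether a filter reflects the user's intent (row 3, for example, turns "$5" into a price filter).

```python
from typing import Any, List, Optional

# Attribute names taken from the schema in the prompt.
SCHEMA_ATTRIBUTES = {"author", "pub_year", "price"}


def find_unknown_attributes(filters: Optional[Any]) -> List[str]:
    """Collect filter attributes that are not part of the schema.

    Accepts None, a single filter dict, a (possibly nested) list of filters,
    or an and/or wrapper that keeps its children under the "filters" key.
    """
    if filters is None:
        return []
    if isinstance(filters, list):
        unknown: List[str] = []
        for item in filters:
            unknown.extend(find_unknown_attributes(item))
        return unknown
    if "attribute" in filters:
        attr = filters["attribute"]
        return [] if attr in SCHEMA_ATTRIBUTES else [attr]
    return find_unknown_attributes(filters.get("filters"))


# Result of the first query in the table, copied from the output.
row_0 = {
    "query": "publications",
    "filters": {
        "op": "and",
        "filters": [
            [
                {"attribute": "author", "op": "=", "value": "mama bear"},
                {"attribute": "pub_year", "op": ">=", "value": 2013},
            ]
        ],
    },
}
# A made-up request that filters on an attribute outside the schema.
made_up = {"query": None, "filters": {"attribute": "color", "op": "=", "value": "red"}}

print(find_unknown_attributes(row_0["filters"]))    # → []
print(find_unknown_attributes(made_up["filters"]))  # → ['color']
```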