Relevance Support

Easy relevance operation in SenseiDB server.

Introduction

Sensei now has integrated relevance support. The main idea is that users can create and tune a relevance model (ranking scheme) online, or store it under a name for later use, instead of taking the classic and more complicated route of hand-coding relevance logic. Because a model is supplied at request time, changing it may not require redeploying the service.

How it works

The relevance support functionality in SenseiDB helps users create and tune relevance models easily. The basic pipeline is as follows: the user sends a JSON or BQL request containing a simple relevance model expression to the server, and the server compiles the code and generates the corresponding relevance model. A relevance model may be given a name, so that later requests only need to refer to it by that name. Models are also cached on each node, so repeated requests do not pay the compilation cost again.

Users write the relevance model as a Java function body, typically a math expression (explained in the sections below). Sensei uses the Javassist package to perform the one-time compilation.
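The compile-once-then-reuse behavior described above can be pictured with a small sketch. This is not Sensei's actual implementation: `ModelCache`, `compile_fn`, and the key scheme are hypothetical names chosen for illustration, the compile step is a stand-in for the real Javassist compilation, and eviction is ignored.

```python
import hashlib


class ModelCache:
    """Hypothetical per-node cache of compiled relevance models."""

    def __init__(self, compile_fn):
        # compile_fn stands in for the real one-time compilation step.
        self._compile_fn = compile_fn
        self._models = {}
        self.compile_count = 0

    @staticmethod
    def _key(function_params, function_body):
        # Assumption: a model is identified by its signature and body.
        raw = ",".join(function_params) + "|" + function_body
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, function_params, function_body):
        key = self._key(function_params, function_body)
        if key not in self._models:
            # Compile only the first time this signature and body are seen.
            self._models[key] = self._compile_fn(function_params, function_body)
            self.compile_count += 1
        return self._models[key]


cache = ModelCache(lambda params, body: (tuple(params), body))
first = cache.get(["_INNER_SCORE"], "return _INNER_SCORE;")
second = cache.get(["_INNER_SCORE"], "return _INNER_SCORE;")
print(first is second, cache.compile_count)
```

The second lookup returns the object produced by the first one, so the stand-in compile step runs only once.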

Location of relevance component inside request JSON

Before using a relevance JSON request, we need to be familiar with the design of the Sensei JSON request API.

First of all, a Sensei JSON request is something like below:


{
  "query" : {
      // query sub-json inside the request json.
         "query_string":{       // the query type is "query_string"
               // the detailed info about query_string type query;
          }
  },
 
  // paging parameters
 
  "size" : 10,   // default to 10;
  "from" : 0,    //default to 0;
 
  "groupBy" : {
    "columns" : ["category"],
    "top" : 3
  },
 
 
  "filter" : {
      // filters json here.
  },
 
  "selections" : [
      //details of this part see the selections.json
  ],
 
 
  "facets":{
  // facet parameters
  },
 
  "facetInit":{
       // facet initialization parameters for runtime facets
  },
  "sort":[
      {"color":"desc"},       // sort by color in reverse order
      "_score"                     // secondary sort by relevance, reverse  parameter is ignored
  ],
  "fetchStored" : false,
  "termVectors" : [],
  "partitions":[1,2],
  "explain" : false,
  "routeParam": null
}

The request JSON above starts with a query sub-JSON. It contains a single key, the query type, whose value is the actual query JSON. The relevance part (relevance JSON) is located inside that specific query JSON; for example, a query of type "query_string" can carry a relevance part. A relevance JSON has up to three parts: "predefined-model", "model" (a runtime model), and "values". Exactly one of the two model references must be chosen. The "values" part provides the input values for whichever model is referenced. The valid combinations are therefore "predefined-model" plus an optional "values" part, or "model" plus an optional "values" part.
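The two valid combinations can be made explicit with a small client-side helper. This is only a sketch: `build_relevance` is a hypothetical name, not part of any Sensei client library, and it simply enforces the "exactly one model reference, optional values" rule described above.

```python
import json


def build_relevance(predefined_model=None, model=None, values=None):
    """Return a relevance dict with exactly one model reference."""
    if (predefined_model is None) == (model is None):
        raise ValueError("specify exactly one of predefined_model or model")
    relevance = {}
    if predefined_model is not None:
        relevance["predefined-model"] = predefined_model
    else:
        relevance["model"] = model
    if values is not None:
        relevance["values"] = values  # optional input values for the model
    return relevance


query = {
    "query_string": {
        "query": "",
        "relevance": build_relevance(
            predefined_model="model-name",
            values={"thisYear": 2001, "goodYear": [1996, 1997]},
        ),
    }
}
print(json.dumps({"query": query}, indent=2))
```

Passing both references, or neither, raises a `ValueError` before any request is sent.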

The query sub-JSON above, with a relevance part added, looks like this. Both model references are shown for illustration; a real request uses only one of them.



{
        "query_string":{  //query type specified here. this one is a query_string type query;
            "query":"",  //empty query term, which will search everything;
            "relevance":{
 
                // A predefined model is either the model preloaded during sensei startup, or saved runtime model with a name;
                "predefined-model": "model-name",
 
                // A runtime model definition below will be compiled at runtime (if no previous runtime model is cached).
                "model":{
                    "facets":{
                        "int":["year","mileage"],
                        "long":["groupid"]
                    },
                    "variables":{
                        "set_int":["goodYear"],
                        "int":["thisYear"]
                    },
                    "function_params":["_INNER_SCORE","thisYear","year","goodYear"],
                    
                    "function":" if(goodYear.contains(year)) return (float)Math.exp(10d);   if(year==thisYear) return 87f   ; return  _INNER_SCORE    ;",
                    // "save as" part below is optional, if specified, this runtime generated model will be saved with a name.
                    // After a runtime model is named, it will be convenient to use next time, we can just specify the model name.
                    // Attn: Even if we do not name a runtime model, the system will also automatically cache a certain amount of anonymous runtime models, so
                    // there is no extra compilation cost for second request with the same model function body and signature.
 
                    "save_as":{
                           "name":"RuntimeModelName",
                           "overwrite":true     //optional, default value is false;
                    }
                },
 
                // values part is used for either predefined model or a runtime model above, if these models require input values;
                "values":{
                    "thisYear":2001,
                    "goodYear":[1996,1997]
                }
            }
        }
    }
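To reason about what the runtime model above returns, its "function" string can be restated outside the server. The sketch below is a plain Python translation for illustration only; the name `score` and the snake_case parameter names are assumptions, and the Java cast to float is not reproduced (Python uses double precision throughout).

```python
import math


def score(inner_score, this_year, year, good_year):
    """Python restatement of the "function" string in the example."""
    if year in good_year:
        return math.exp(10.0)  # Java: (float)Math.exp(10d)
    if year == this_year:
        return 87.0
    return inner_score  # otherwise fall back to the inner score


# Same inputs as the "values" part of the example.
values = {"thisYear": 2001, "goodYear": {1996, 1997}}
for year in (1996, 2001, 2005):
    print(year, score(0.5, values["thisYear"], year, values["goodYear"]))
```

A document from a "good year" gets the largest score, a document from `thisYear` gets 87, and every other document keeps its inner score.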

How to use it: Grammar, Layout

Before we start, a working relevance JSON embedded in a Sensei JSON request looks like this:



{
        "query": {
            "query_string": {
                "query": "",
                "relevance":{
 
                    "model":{
                        "variables":{
                             "set_int":["goodYear"],
                             "int":["thisYear"],
                             "map_int_float":["mileageWeight"],
                             "map_int_string":["yearcolor"],
                             "map_string_float":["colorweight"],
                             "map_string_string":["categorycolor"]
                            },
                        "facets":{
                             "int":["year","mileage"],
                             "long":["groupid"],
                             "string":["color","category"]
                            },
                        "function_params":["_INNER_SCORE", "thisYear", "year","goodYear","mileageWeight","mileage","color", "yearcolor", "colorweight", "category", "categorycolor"],
                        "function":" if(categorycolor.containsKey(category) && categorycolor.get(category).equals(color))  return 10000f; if(colorweight.containsKey(color) ) return 200f + colorweight.getFloat(color); if(yearcolor.containsKey(year) && yearcolor.get(year).equals(color)) return 200f; if(mileageWeight.containsKey(mileage)) return 10000+mileageWeight.get(mileage); if(goodYear.contains(year)) return (float)Math.exp(2d);   if(year==thisYear) return 87f   ; return  _INNER_SCORE;"
                    },
 
                    "values":{
                         "goodYear":[1996,1997],
                         "thisYear":2001,
                         "mileageWeight":{"key":[11400,11000],"value":[777.9, 10.2]},
                        "yearcolor":{"key":[1998],"value":["red"]},
                        "colorweight":{"key":["red"],"value":[335.5]},
                        "categorycolor":{"key":["compact"],"value":["red"]}
                    }
                }
            }
        },
        "from": 0,
        "size": 6,
        "explain": false,
        "fetchStored": false,
        "sort":["_score"]
    }

The relevance JSON has two parts: the model part and the values part. The model part should be relatively static, while the values part provides the input that the static model requires. Requests will often carry different values parts but share the same model.
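The split between a static model and per-request values can be sketched on the client side as follows. The helper names `encode_map` and `build_request` are hypothetical, and the model is an abbreviated version of the one in the example above; the parallel "key"/"value" list encoding for map variables is taken from that example.

```python
import json


def encode_map(mapping):
    """Encode a dict as the parallel "key"/"value" lists used for map variables."""
    keys = list(mapping)
    return {"key": keys, "value": [mapping[k] for k in keys]}


# The model stays fixed across requests (abbreviated for illustration).
MODEL = {
    "variables": {"int": ["thisYear"], "map_string_float": ["colorweight"]},
    "facets": {"int": ["year"], "string": ["color"]},
    "function_params": ["_INNER_SCORE", "thisYear", "year", "colorweight", "color"],
    "function": (
        "if(colorweight.containsKey(color)) return 200f + colorweight.getFloat(color); "
        "if(year==thisYear) return 87f; return _INNER_SCORE;"
    ),
}


def build_request(values, size=6):
    """Wrap the shared model and the per-request values in a full request."""
    relevance = {"model": MODEL, "values": values}
    return {
        "query": {"query_string": {"query": "", "relevance": relevance}},
        "from": 0,
        "size": size,
        "sort": ["_score"],
    }


req_a = build_request({"thisYear": 2001, "colorweight": encode_map({"red": 335.5})})
req_b = build_request({"thisYear": 2002, "colorweight": encode_map({"blue": 12.0})})
print(json.dumps(req_a["query"]["query_string"]["relevance"]["values"]))
print(json.dumps(req_b["query"]["query_string"]["relevance"]["values"]))
```

Both requests carry the same model, so only the values differ between them.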

Inside the model part, we define four items: "variables", "facets", "function_params", and "function".

Supported variable types:

Supported facet types:

Attention !!!:

For more sample relevance JSONs, refer to the relevance test cases in the SenseiDB code, which cover many relevance use cases.

Some facility methods

We also plan to provide some facility methods for relevance models. These methods can be called directly inside the scoring function, for performance reasons. Currently supported:

BQL support

Relevance models can be created and used in BQL queries with the USING RELEVANCE MODEL clause. A BQL query using a relevance model looks like:

SELECT <select_list>
FROM <identifier>
WHERE <search_expr>
USING RELEVANCE MODEL <identifier> <prop_list> [<relevance_model>];

Compared with a regular SELECT statement, the only new thing here is the USING RELEVANCE MODEL clause. <identifier> within the USING RELEVANCE MODEL clause refers to the model name, and <prop_list> is used to pass relevance model parameters.

The last piece, <relevance_model>, is optional, and it takes the following format:

DEFINED AS <formal_parameters>
  BEGIN
    <model_block>
  END

When <relevance_model> is specified, the SELECT statement has a relevance model defined inline; otherwise, the SELECT statement uses a relevance model that was created earlier under the given name.

<model_block> is the relevance model function body, written in a subset of the Java language (as described above).

<formal_parameters> is defined exactly the same as a Java method parameter list, even though only a limited number of data types are supported.

<prop_list> is a comma-separated list of key-value pairs. Each pair has the form <key>:<value>, where the value part uses JSON (or Python) format to pass primitive data, lists, sets, or maps.
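As a rough illustration of the <key>:<value> layout, a property list can be rendered from a dictionary. `format_prop_list` is a hypothetical helper, not part of BQL or any Sensei client, and the surrounding parentheses are an assumption based on the complete SELECT example in this section.

```python
import json


def format_prop_list(props):
    """Render a dict as a BQL property list such as (key:"value", ...)."""
    # json.dumps covers primitives, lists, and string-keyed maps. Python sets
    # are not JSON-serializable and would need separate handling.
    pairs = [f"{key}:{json.dumps(value)}" for key, value in props.items()]
    return "(" + ", ".join(pairs) + ")"


clause = "USING RELEVANCE MODEL my_model " + format_prop_list(
    {"favoriteColor": "black", "favoriteTag": "cool"}
)
print(clause)
```

Using `json.dumps` for the value part keeps string quoting and list syntax consistent with the JSON format mentioned above.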

Here is a complete example of a SELECT statement using a relevance model:

SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
    BEGIN
      float boost = 1.0f;
      if (tags.contains(favoriteTag))
         boost += 0.5;
      if (color.equals(favoriteColor))
         boost += 1.2;
      return _INNER_SCORE * boost;
    END

Note that, in the relevance model function body, facet names like tags and color and internal variable names like _INNER_SCORE can be used directly. They do not have to be included in the parameter list.
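The difference between per-request parameters and per-document facets can be seen by restating the model body outside the server. The sketch below is a Python translation for illustration only: `my_model`, the `doc` dictionaries, and the sample documents are hypothetical, and it assumes the color comparison is made against the favoriteColor parameter.

```python
def my_model(doc, inner_score, favorite_color, favorite_tag):
    """Python restatement of the model body in the SELECT example."""
    boost = 1.0
    if favorite_tag in doc["tags"]:      # facet "tags", read from the document
        boost += 0.5
    if doc["color"] == favorite_color:   # facet "color", read from the document
        boost += 1.2
    return inner_score * boost


# Hypothetical documents; only the facets used by the model are shown.
docs = [
    {"id": 1, "color": "black", "tags": ["cool", "hybrid"]},
    {"id": 2, "color": "red", "tags": ["cool"]},
    {"id": 3, "color": "red", "tags": []},
]
ranked = sorted(docs, key=lambda d: my_model(d, 1.0, "black", "cool"), reverse=True)
print([d["id"] for d in ranked])
```

The parameters `favorite_color` and `favorite_tag` are the same for every document in a request, while the facet values come from each document being scored.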