Relevance Support
Easy relevance operation in SenseiDB server.
Introduction
Sensei now has integrated relevance support. The main idea is that users can create and tune a relevance model (ranking scheme) online, or store it under a name for later use, instead of hand-coding relevance logic in the traditional, more complicated way. This also means the service may not need to be redeployed when a model changes.
Things we will cover in this document:
- How it works?
How users interact with SenseiDB to create or tune relevance models.
- Location of relevance component in Sensei JSON request objects.
Where exactly the relevance component sits inside the request JSON object.
- How to use it: Grammar, Layout.
The simple grammar to write relevance models in JSON format.
- Some facility methods.
Optimized methods that can be useful when writing a customized relevance model.
- How to use relevance models in BQL.
BQL makes it even easier to create and use relevance models.
How it works?
The relevance support functionality in SenseiDB is implemented to help users create and tune relevance models easily. The basic pipeline is as follows: the user sends a JSON/BQL request containing a simple relevance model expression to the server, and the server compiles the code and generates the corresponding relevance model. A relevance model may have a name, so that next time the user only needs to refer to the model by its name. Models are also cached inside each node, so repeated requests with the same model do not pay the compilation cost again.
Users write a relevance model as a Java math function (explained in the following sections), and the server uses the Javassist package to do the one-time compilation.
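The compile-once-then-reuse behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: ModelCache, getOrCompile and the compiler function are hypothetical names, the key stands for either a model name or the function body plus signature, and the real Javassist compilation step is replaced by a stand-in function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: compile a model source once per key, then reuse it. */
final class ModelCache<M> {
    private final Map<String, M> cache = new HashMap<>();
    private final Function<String, M> compiler; // stand-in for the real compilation step
    private int compileCount = 0;

    ModelCache(Function<String, M> compiler) {
        this.compiler = compiler;
    }

    /** Returns the cached model for this key, compiling the source only on first use. */
    synchronized M getOrCompile(String key, String source) {
        M model = cache.get(key);
        if (model == null) {
            model = compiler.apply(source);
            compileCount++;
            cache.put(key, model);
        }
        return model;
    }

    /** Number of times the compiler was actually invoked. */
    synchronized int compileCount() {
        return compileCount;
    }
}
```

A second request with the same key returns the already compiled object, which is why only the first request pays the compilation cost.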
Location of relevance component inside request JSON
Before using a relevance JSON request, we need to be familiar with the Sensei JSON request API design.
First of all, a Sensei JSON request looks like this:
{
  "query" : {
    // query sub-json inside the request json.
    "query_string" : {
      // the query type is "query_string"
      // the detailed info about query_string type query;
    }
  },
  // paging parameters
  "size" : 10,   // default to 10;
  "from" : 0,    // default to 0;
  "groupBy" : {
    "columns" : ["category"],
    "top" : 3
  },
  "filter" : {
    // filters json here.
  },
  "selections" : [
    // details of this part see the selections.json
  ],
  "facets" : {
    // facet parameters
  },
  "facetInit" : {
    // facet initialization parameters for runtime facets
  },
  "sort" : [
    {"color" : "desc"},   // sort by color in reverse order
    "_score"              // secondary sort by relevance, reverse parameter is ignored
  ],
  "fetchStored" : false,
  "termVectors" : [],
  "partitions" : [1,2],
  "explain" : false,
  "routeParam" : null
}
Inside the request JSON there is a query sub-JSON, which is the first entry in the request above. The query sub-JSON contains a key, which is the query type, and its value is the actual query JSON. The relevance part (relevance JSON) is located inside that specific query JSON; for example, a query JSON of type "query_string" can carry a relevance part. A relevance JSON has up to three parts. It references a model in one of two ways, either a "predefined-model" part or a "model" part (a runtime model), and we need to choose one of the two. The "values" part provides the input values for whichever model is referenced. So the valid combinations are predefined-model plus an optional values part, or model plus an optional values part.
A query sub-JSON with a relevance part is laid out as shown below. Both kinds of model reference appear here for explanation; a real request chooses one of them.
{
  "query_string" : {   // query type specified here. this one is a query_string type query;
    "query" : "",      // empty query term, which will search everything;
    "relevance" : {
      // A predefined model is either a model preloaded during sensei startup, or a saved runtime model with a name;
      "predefined-model" : "model-name",

      // A runtime model definition below will be compiled at runtime (if no previous runtime model is cached).
      "model" : {
        "facets" : {
          "int" : ["year","mileage"],
          "long" : ["groupid"]
        },
        "variables" : {
          "set_int" : ["goodYear"],
          "int" : ["thisYear"]
        },
        "function_params" : ["_INNER_SCORE","thisYear","year","goodYear"],
        "function" : " if(goodYear.contains(year)) return (float)Math.exp(10d); if(year==thisYear) return 87f ; return _INNER_SCORE ;",

        // The "save_as" part below is optional; if specified, this runtime generated model will be saved with a name.
        // Once a runtime model is named, next time we can just specify the model name.
        // Attn: Even if we do not name a runtime model, the system automatically caches a certain number of anonymous runtime models, so
        // there is no extra compilation cost for a second request with the same model function body and signature.
        "save_as" : {
          "name" : "RuntimeModelName",
          "overwrite" : true   // optional, default value is false;
        }
      },

      // The values part is used by either the predefined model or the runtime model above, if the model requires input values;
      "values" : {
        "thisYear" : 2001,
        "goodYear" : [1996,1997]
      }
    }
  }
}
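To make the "function" string in the runtime model easier to read, here is roughly what it computes when written as an ordinary Java method. This is only a sketch: the class YearModel and the method signature are hypothetical, innerScore stands in for _INNER_SCORE, and a java.util.Set stands in for whatever set type the server actually binds to a set_int variable.

```java
import java.util.Set;

/** Hypothetical plain-Java rendering of the sample model function body. */
final class YearModel {
    static float score(float innerScore, int thisYear, int year, Set<Integer> goodYear) {
        // Documents from a "good" year get a large fixed score.
        if (goodYear.contains(year)) return (float) Math.exp(10d);
        // Documents from the current year get a smaller fixed score.
        if (year == thisYear) return 87f;
        // Everything else keeps the score of the original query.
        return innerScore;
    }
}
```

With the values shown in the request (thisYear 2001, goodYear 1996 and 1997), a document from 1996 takes the first branch, one from 2001 takes the second, and any other document keeps its inner score.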
How to use it: Grammar, Layout
Before we start, here is a working relevance JSON embedded in a Sensei JSON request:
{
  "query": {
    "query_string": {
      "query": "",
      "relevance": {
        "model": {
          "variables": {
            "set_int": ["goodYear"],
            "int": ["thisYear"],
            "map_int_float": ["mileageWeight"],
            "map_int_string": ["yearcolor"],
            "map_string_float": ["colorweight"],
            "map_string_string": ["categorycolor"]
          },
          "facets": {
            "int": ["year","mileage"],
            "long": ["groupid"],
            "string": ["color","category"]
          },
          "function_params": ["_INNER_SCORE", "thisYear", "year", "goodYear", "mileageWeight", "mileage", "color", "yearcolor", "colorweight", "category", "categorycolor"],
          "function": " if(categorycolor.containsKey(category) && categorycolor.get(category).equals(color)) return 10000f; if(colorweight.containsKey(color) ) return 200f + colorweight.getFloat(color); if(yearcolor.containsKey(year) && yearcolor.get(year).equals(color)) return 200f; if(mileageWeight.containsKey(mileage)) return 10000+mileageWeight.get(mileage); if(goodYear.contains(year)) return (float)Math.exp(2d); if(year==thisYear) return 87f ; return _INNER_SCORE;"
        },
        "values": {
          "goodYear": [1996,1997],
          "thisYear": 2001,
          "mileageWeight": {"key":[11400,11000], "value":[777.9, 10.2]},
          "yearcolor": {"key":[1998], "value":["red"]},
          "colorweight": {"key":["red"], "value":[335.5]},
          "categorycolor": {"key":["compact"], "value":["red"]}
        }
      }
    }
  },
  "from": 0,
  "size": 6,
  "explain": false,
  "fetchStored": false,
  "sort": ["_score"]
}
The relevance JSON has two parts: the model part and the values part. The model part should be relatively static, while the values part provides the inputs the model requires. Each request may carry a different values part while still using the same model.
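In the request above, map values such as mileageWeight are passed as two parallel arrays, "key" and "value". Assuming the entries are paired by position (which is how the example reads, not something this document states outright), turning such a pair of arrays into a map could look like the sketch below. MapValues and toMap are hypothetical names, and a java.util.Map is used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: pair up "key" and "value" arrays by position. */
final class MapValues {
    static Map<Integer, Float> toMap(int[] keys, float[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("key and value arrays must have the same length");
        }
        Map<Integer, Float> map = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], values[i]); // the i-th key maps to the i-th value
        }
        return map;
    }
}
```

For the sample mileageWeight value, this pairing would map 11400 to 777.9 and 11000 to 10.2.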
Inside the model part, we define 4 items:
- variables
User provided variables. The variable names and types are defined here, but the actual values have to be filled outside the model part, in the values part.
- facets
Define which facets/columns will be used in the relevance model. Each facet automatically becomes a variable with the same name as the facet.
- function_params
Define which parameters will be used in the function. All the parameters listed here have to be defined either in the variables part, or the facets part.
- function
The actual function body, written in Java. It must return a float value. Malicious classes cannot be used: a custom class loader prevents any class that is not on the whitelist from being loaded.
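The rule that every entry in function_params must be declared in the variables part or the facets part can be checked with a few lines of code. The sketch below is only an illustration: ParamCheck and undefinedParams are hypothetical names, and it assumes that the reserved names _INNER_SCORE and _NOW are accepted without a declaration, as the sample requests suggest for _INNER_SCORE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical check: which function_params are declared nowhere? */
final class ParamCheck {
    // Assumption: reserved names need no declaration in variables or facets.
    private static final Set<String> RESERVED =
            new HashSet<>(Arrays.asList("_INNER_SCORE", "_NOW"));

    static List<String> undefinedParams(List<String> functionParams,
                                        Set<String> variables,
                                        Set<String> facets) {
        List<String> missing = new ArrayList<>();
        for (String param : functionParams) {
            boolean declared = RESERVED.contains(param)
                    || variables.contains(param)
                    || facets.contains(param);
            if (!declared) {
                missing.add(param);
            }
        }
        return missing;
    }
}
```

An empty result means the parameter list is consistent with the model's declarations; any returned name points at a missing entry in variables or facets.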
Supported variable types:
- HashSet: set_int, set_float, set_string, set_double, set_long.
- HashMap: e.g., map_int_int, map_int_double, map_int_float, map_string_int, map_string_double, etc. (Two kinds of hashmap are currently supported: map_int_* and map_string_*.)
- Other plain types: int, double, float, long, bool, string.
- Normal facet: double, float, int, long, short, string.
- Multi-facet: mdouble, mfloat, mint, mlong, mshort, mstring.
- Weighted Multi-facet: wmdouble, wmfloat, wmint, wmlong, wmshort, wmstring.
Attention !!!:
- When defining variables, some reserved keywords cannot be used as names. Currently these are "_INNER_SCORE" and "_NOW". "_INNER_SCORE" is a float value holding the inner score from the original query. "_NOW" is a long value holding the current time in milliseconds.
- All the variables used in the function body have to be pre-defined in the variables or facets part.
- Inside the function body, the java.lang.Math class can be used for computation. Other classes such as java.lang.System or java.lang.Thread will not be loaded.
- In the Sensei request JSON, the sort parameter must be specified as "sort":["_score"].
- Explicit type casting is required. For instance, write java.lang.Math.exp(20d), since exp requires a double input.
- For hashmap usage, the get method must be replaced with a type-specific get method for the following maps:
- map_string_int ==> getInt(String)
- map_string_double ==> getDouble(String)
- map_string_float ==> getFloat(String)
- map_string_long ==> getLong(String)
- For facet usage, a normal facet can be treated directly as a primitive type or a String object. However, if it is defined as a multi-facet object or a weighted multi-facet object, it has some predefined methods:
- For a multi-facet object, we can use the contains() method. E.g., for an "mint" type multi-facet object a, we can call a.contains(int target) inside the relevance code; for an "mstring" type, it would be a.contains(String target).
- For a multi-facet object, we also have the containsAny(Set set) method. It accepts a predefined set variable from the values part of the relevance section in the request JSON, and checks whether any value of this facet is in that set.
- For a weighted multi-facet object, besides the multi-facet methods, we can also use the boolean hasWeight(int target) and int getWeight() methods. hasWeight has to be called first; if it returns true, getWeight can then be used to get the weight for that target.
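The hasWeight/getWeight calling order for weighted multi-facets can be illustrated with a small stand-in class. Everything below is hypothetical: WeightedInts only mimics the two-step protocol described above (look up first, then read the weight that was found) and is not the class SenseiDB actually binds to a wmint facet.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a weighted multi-value int facet. */
final class WeightedInts {
    private final Map<Integer, Integer> weights = new HashMap<>();
    private int lastWeight; // weight found by the most recent successful hasWeight call

    WeightedInts add(int value, int weight) {
        weights.put(value, weight);
        return this;
    }

    /** Step 1: look the target up and remember its weight if present. */
    boolean hasWeight(int target) {
        Integer weight = weights.get(target);
        if (weight == null) return false;
        lastWeight = weight;
        return true;
    }

    /** Step 2: only meaningful after hasWeight returned true. */
    int getWeight() {
        return lastWeight;
    }
}

final class WeightedScore {
    static float score(float innerScore, WeightedInts facet, int target) {
        // Call hasWeight first; read the weight only when it returned true.
        if (facet.hasWeight(target)) return innerScore + facet.getWeight();
        return innerScore;
    }
}
```

Inside a real model function, the same two calls would be made on the facet variable itself, in the same order.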
For more sample relevance JSONs, please refer to the relevance test cases in the SenseiDB code, which cover many relevance use cases.
Some facility methods
We also plan to provide some facility methods in the relevance model, that is, methods which can be called directly inside the scoring function for performance reasons. Currently supported:
- double exp(int a) --> faster version of java.lang.Math.exp(double a)
- double exp(float a) --> faster version of java.lang.Math.exp(double a)
- double exp(double a) --> faster version of java.lang.Math.exp(double a)
BQL support
Relevance models can be created and used in BQL queries with the USING RELEVANCE MODEL clause. A BQL query using a relevance model looks like:
SELECT <select_list> FROM <identifier> WHERE <search_expr> USING RELEVANCE MODEL <identifier> <prop_list> [<relevance_model>];
Compared with a regular SELECT statement, the only new thing here is the USING RELEVANCE MODEL clause. <identifier> within the USING RELEVANCE MODEL clause refers to the model name, and <prop_list> is used to pass relevance model parameters. The last piece, <relevance_model>, is optional, and it takes the following format:
DEFINED AS <formal_parameters> BEGIN <model_block> END
When <relevance_model> is specified, it means that the SELECT statement has a relevance model defined inline; otherwise, it means that the SELECT statement is using a relevance model that has been created before with the given name.
<model_block> is the relevance model function body, written in a subset of the Java language (as described above). <formal_parameters> is defined exactly the same as a Java method parameter list, even though only a limited number of data types are supported.
<prop_list> is a comma separated key-value-pair list. Each key-value pair has the form <key>:<value>, where the value part uses JSON (or Python) format to pass primitive data, lists, sets, or maps.
Here is a complete example of a SELECT statement using a relevance model:
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor, String favoriteTag)
BEGIN
  float boost = 1.0;
  if (tags.contains(favoriteTag))
    boost += 0.5;
  if (color.equals(favoriteColor))
    boost += 1.2;
  return _INNER_SCORE * boost;
END
Note that, in the relevance model function body, facet names like tags and color and internal variable names like _INNER_SCORE can be used directly. They do not have to be included in the parameter list.
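For readers who want to check the arithmetic of the inline model, its body can be written as an ordinary Java method. This is a sketch under stated assumptions: MyModel and its signature are hypothetical, innerScore stands in for _INNER_SCORE, tags and color are passed in explicitly because plain Java has no implicit facet variables, and the color check is assumed to compare against the favoriteColor parameter.

```java
import java.util.Set;

/** Hypothetical plain-Java rendering of the inline BQL relevance model. */
final class MyModel {
    static float score(float innerScore, Set<String> tags, String color,
                       String favoriteColor, String favoriteTag) {
        float boost = 1.0f;
        if (tags.contains(favoriteTag)) boost += 0.5f;  // document carries the favorite tag
        if (color.equals(favoriteColor)) boost += 1.2f; // document has the favorite color
        return innerScore * boost;
    }
}
```

A document that matches both the tag and the color has its inner score multiplied by roughly 2.7, while a document matching neither keeps its inner score unchanged.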