Data models are described in the schema.xml file. The XSD definition of this XML file can be found at http://javasoze.github.com/sensei/schema/sensei-schema.xsd.
The schema file is composed of two sections:
Table schema
Facet schema
A Sensei instance can be viewed as a giant table with many columns and many rows. The concept of such a table directly correlates to that of a table in a traditional RDBMS.
A table may have the following attributes:
uid (mandatory) - defines the name of the primary key field. This must be of type long.
delete-field (optional) - defines the field that would indicate a delete event (we will get back to this later).
skip-field (optional) - defines the field that would indicate a skipping event (we will get back to this later).
src-data-store (optional) - defines the format in which the source data is saved. Currently the only supported value is "lucene".
src-data-field (optional) - specifies the field name used to keep the original source data value. If this attribute is not specified, the default value is "src_data". If this field is not set by the data source filter, the string representation of the original data source is saved in this field by default. If part of the source data or a modified version of the source data is to be saved in the index, set this field to the value you prefer in the data source filter.
compress-src-data (optional) - defines whether the source data is compressed.
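As a rough sketch of how these attributes might appear together, the table element below sets each of them. The field names isDelete and isSkip and the value "true" for compress-src-data are assumptions chosen for illustration; "lucene" and "src_data" are the documented supported value and default.

```xml
<!-- Sketch only: isDelete, isSkip and compress-src-data="true" are illustrative assumptions -->
<table uid="id"
       delete-field="isDelete"
       skip-field="isSkip"
       src-data-store="lucene"
       src-data-field="src_data"
       compress-src-data="true">
  <column name="color" type="string" />
</table>
```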
A table is also composed of a set of columns. Each column has a name and a type. Below is the list of supported types:
string - a string value, e.g. "abc"
int - integer value
long - long value
short - short value
float - a floating point value
double - double value
char - a character
date - a date value, which must be accompanied by a format string to be used to parse a date string
text - a searchable text segment; standard Lucene indexing specifications can also be given here, e.g. index="ANALYZED", termvector="NO".
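The fragment below sketches a few of these types side by side. The column names are hypothetical, and the format attribute on the date column is an assumption about how the format string is supplied; the XSD linked above is the place to confirm the exact attribute name.

```xml
<!-- Sketch only: column names are hypothetical -->
<column name="groupId" type="long" />
<column name="rating" type="double" />
<!-- Assumption: the date format string is given via a "format" attribute -->
<column name="created" type="date" format="yyyy-MM-dd" />
<column name="body" type="text" index="ANALYZED" termvector="NO" />
```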
A column that is not of type "text" is considered a meta column. Any meta column can be specified to be either single (default) or multi. When a column is specified to be multi, e.g. multi="true", it means that, given a row, the column can have more than one value. A delimiter string can be provided to help the indexer parse the values (the default delimiter is ","). To specify a different delimiter, say ":", simply set delimiter=":".
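For instance, a multi-value column that splits its values on ":" could be declared roughly as follows; the column name tags is only an illustration.

```xml
<!-- A source value such as "cool:leather" would yield two values for this row -->
<column name="tags" type="string" multi="true" delimiter=":" />
```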
Here is an example of the table schema (see https://github.com/javasoze/sensei/blob/master/conf/schema.xml):
<table uid="id" delete-field="" skip-field="">
  <column name="color" type="string" />
  <column name="category" type="string" />
  <column name="city" type="string" />
  <column name="makemodel" type="string" />
  <column name="year" type="int" />
  <column name="price" type="float" />
  <column name="mileage" type="int" />
  <column name="tags" type="string" multi="true" delimiter="," />
  <column name="contents" type="text" index="ANALYZED" store="NO" termvector="NO" />
</table>
By default, data objects inserted into Sensei are JSON objects.
Example:
Given the following table definition:
<table uid="id">
  <column name="color" type="string" />
  <column name="year" type="int" />
  <column name="tag" type="string" multi="true" />
  <column name="description" type="text" index="ANALYZED" store="NO" />
</table>
The following example shows how a JSON object is mapped into the table:
JSON object
{ "id": 1, "color": "red", "year": 2000, "tag": "cool,leather", "description": "i love this car" }
Table view
id | color | year | tag | description |
---|---|---|---|---|
1 | red | 2000 | cool, leather | i love this car |
To delete a row from Sensei, simply insert a data object with the specified delete-field set to true.
Example:
Given the table schema:
<table uid="id" delete-field="isDelete"> ... </table>
The following JSON object would delete the row where id=5:
{ "id": 5, "isDelete": "true" }
In cases where runtime logic decides whether a data object should be skipped, the skip field can be useful.
Example:
Given the table schema:
<table uid="id" skip-field="isSkip"> ... </table>
The following JSON object would be skipped from indexing:
{ "id": 7, "isSkip": "true" }
The second section is the facet schema, which defines how columns can be queried.
If the table section defines how data is added into Sensei, then the facet section describes how that data can be queried.
The facet section is composed of a set of facet definitions.
A facet definition requires a name and a type.
Possible types:
simple: simplest facet, 1 row = 1 discrete value
path: hierarchical facet, e.g. a/b/c
range: range facet, used to support range queries
multi: 1 row = N discrete values
compact-multi: similar to multi, but possible values are limited to 32
histogram: similar to a range facet, but a histogram facet automatically calculates the distribution of facet values over a predefined series of equal-sized ranges. A histogram facet depends on another numeric facet and requires several mandatory parameters; see the section called “Parameters for Histogram Facets”.
custom: any user defined facet type
Example: https://github.com/javasoze/sensei/blob/master/conf/schema.xml
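A minimal sketch of what a facet section could look like is given below, with one facet for each of several types. The facet names assume columns called color, city, year and tags exist in the table section, and the enclosing facets element is an assumption; the linked schema.xml remains the authoritative example.

```xml
<!-- Sketch only: the <facets> wrapper and the column names are assumptions -->
<facets>
  <facet name="color" type="simple" />
  <facet name="city" type="path" />
  <facet name="year" type="range" />
  <facet name="tags" type="multi" />
</facets>
```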
The column attribute references a column name defined in the table section. By default, the value of the name attribute is used.
This can be useful if you want the facet name to differ from the column name, or if you want to define multiple facets on the same column.
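As an illustration, the two definitions below place two facets on a single column. The first relies on the default (the facet name doubles as the column name), while the second sets column explicitly; both facet names are hypothetical.

```xml
<!-- Uses the default: column is taken from name="year" -->
<facet name="year" type="simple" />
<!-- Hypothetical second facet over the same column, under a different name -->
<facet name="yearRange" type="range" column="year" />
```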
The depends attribute is a comma-delimited string listing the names of the facets that this facet depends on.
When the depends attribute is specified, Sensei guarantees that the facets it lists are loaded before this facet.
This is also how Composite Facets are constructed (another advanced topic).
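A small sketch of a dependency declaration follows. The facet priceByYear and its custom type are hypothetical; the point is only the comma-delimited depends value naming two other facets.

```xml
<facet name="price" type="range" />
<facet name="year" type="range" />
<!-- Hypothetical facet: "price" and "year" would be loaded before it -->
<facet name="priceByYear" type="custom" depends="price,year" />
```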
A facet can be configured via a list of parameters. Parameters are needed in some situations, for example:
For path facets, separator strings can be configured
For range facets, predefined ranges can be configured
The parameters can be specified via the params element, which contains a list of param elements. For each param, two attributes need to be specified: name and value.
How parameters are interpreted and used is dependent on the facet type.
Here is an example of a facet with a list of predefined ranges:
<facet name="year" type="range">
  <params>
    <param name="range" value="1993-1994"/>
    <param name="range" value="1995-1996"/>
    <param name="range" value="1997-1998"/>
    <param name="range" value="1999-2000"/>
    <param name="range" value="2001-2002"/>
  </params>
</facet>
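A path facet can be parameterized in the same way. In the sketch below, the parameter name separator is an assumption about how the separator string is passed, and the facet name city is illustrative.

```xml
<!-- Assumption: the separator string is supplied as a param named "separator" -->
<facet name="city" type="path">
  <params>
    <param name="separator" value="/"/>
  </params>
</facet>
```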
A histogram facet requires 5 parameters:
datatype: the data type. Only the following 5 numeric data types are allowed:
int
short
long
float
double
datahandler: this is the name of the facet that the histogram facet depends on. The values of this facet are used to generate the distribution information.
start: the minimum value of the facet.
end: the maximum value of the facet.
unit: the unit value used to divide facet values into ranges.
Here is an example configuration for a histogram facet over a facet called score:
<facet name="scoreHistogram" type="histogram">
  <params>
    <param name="datatype" value="int"/>
    <param name="datahandler" value="score"/>
    <param name="start" value="0"/>
    <param name="end" value="100"/>
    <param name="unit" value="10"/>
  </params>
</facet>
We understand we cannot possibly cover all use cases using a short list of predefined facet handlers. It is necessary to allow users to define their own customized facets for different reasons.
If a customized facet handler is required for a column (or multiple columns), you can set the facet type to "custom" and declare a bean for the facet handler in the file custom-facets.xml.
For example, if a customized facet called time is declared in schema.xml like this:
<facet name="time" type="custom" dynamic="false"/>
and the implementation of the facet handler is in class com.example.facets.TimeFacetHandler, then you should include the following line in the file custom-facets.xml: [1]
<bean id="time" class="com.example.facets.TimeFacetHandler"/>
The id of the bean should match the name of the facet.
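For context, a complete custom-facets.xml might look roughly like the sketch below. This assumes the file is a standard Spring beans definition file, which the bean element suggests but the text above does not state outright; the root element and namespace declarations are the usual Spring ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumption: custom-facets.xml is a standard Spring beans file -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">
  <!-- The bean id matches the facet name declared in schema.xml -->
  <bean id="time" class="com.example.facets.TimeFacetHandler"/>
</beans>
```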