Data Modeling

Data models are described in the schema.xml file. The XSD definition for this file is available at http://javasoze.github.com/sensei/schema/sensei-schema.xsd.

The schema file is composed of two sections:

  1. Table schema

  2. Facet schema

Table Schema

A Sensei instance can be viewed as a giant table with many columns and many rows. The concept of such a table corresponds directly to that of a table in a traditional RDBMS.

A table may have the following attributes:

  • uid (mandatory) - defines the name of the primary key field. This must be of type long.

  • delete-field (optional) - defines the field that would indicate a delete event (we will get back to this later).

  • skip-field (optional) - defines the field that would indicate a skipping event (we will get back to this later).

  • src-data-store (optional) - defines the format of how the source data is saved. Currently the only supported value is "lucene".

  • src-data-field (optional) - specifies the field name used to keep the original source data value.

    Note

    If this attribute is not specified, the default value is set to "src_data". If this field is not set by the data source filter, the string representation of the original data source is saved in this field by default. If part of the source data or a modified version of the source data is to be saved in the index, then you need to set this field using the value you prefer in the data source filter.

  • compress-src-data (optional) - defines if the source data is compressed.

A table is also composed of a set of columns. Each column has a name and a type. Below is the list of supported types:

  • string - value is a string, e.g. "abc"

  • int - integer value

  • long - long value

  • short - short value

  • float - a floating point value

  • double - double value

  • char - a character

  • date - a date value, which must be accompanied by a format string to be used to parse a date string

  • text - a searchable text segment. Standard Lucene indexing specifications can also be given here, e.g. index="ANALYZED", termvector="NO".

A column that is not of type "text" is considered a meta column. Any meta column can be specified to be either single (default) or multi. When a column is specified to be multi, e.g. multi="true", the column can have more than one value in a given row. A delimiter string can be provided to help the indexer parse the values (the default delimiter is ","). To specify a different delimiter, say ":", simply set delimiter=":".

Here is an example of the table schema (see https://github.com/javasoze/sensei/blob/master/conf/schema.xml):

  <table uid="id" delete-field="" skip-field="">
    <column name="color" type="string" />
    <column name="category" type="string" />
    <column name="city" type="string" />
    <column name="makemodel" type="string" />
    <column name="year" type="int" />
    <column name="price" type="float" />
    <column name="mileage" type="int" />
    <column name="tags" type="string" multi="true" delimiter=","/>
    <column name="contents" type="text" index="ANALYZED"
            store="NO" termvector="NO"/>
  </table>
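
As a sketch of the column options described above, the fragment below declares a multi-valued column that uses ":" as its delimiter, plus a date column. The column names labels and created are hypothetical, and the format attribute name on the date column is an assumption; check the XSD for the exact attribute.

  <table uid="id">
    <!-- a value such as "cool:leather" would be parsed into two values -->
    <column name="labels" type="string" multi="true" delimiter=":" />
    <!-- 'format' is an assumed attribute name for the date format string -->
    <column name="created" type="date" format="yyyy-MM-dd" />
  </table>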

JSON

By default, data objects inserted into Sensei are JSON objects.

Example:

Given the following table definition:

  <table uid="id">
    <column name="color" type="string" />
    <column name="year" type="int" />
    <column name="tag" type="string" multi="true" />
    <column name="description" type="text" index="ANALYZED" store="NO" />
  </table>

The following example shows how a JSON object is mapped into the table:

JSON object

  {
    "id": 1,
    "color": "red",
    "year": 2000,
    "tag": "cool,leather",
    "description": "i love this car"
  }

Table view

  id   color   year   tag             description
  1    red     2000   cool, leather   i love this car

Deletes

To delete a row from Sensei, simply insert a data object with the specified delete-field set to true.

Example:

Given the table schema:

  <table uid="id" delete-field="isDelete">
  ...
  </table>

The following JSON object would delete the row where id=5:

  {
    "id": 5,
    "isDelete": "true"
  }

Skips

When runtime logic decides whether a data object should be indexed, the skip field is useful: a data object with the specified skip-field set to true is not indexed.

Example:

Given the table schema:

  <table uid="id" skip-field="isSkip">
  ...
  </table>

The following JSON object would be skipped from indexing:

  {
    "id": 7,
    "isSkip": "true"
  }

Source JSON

In many cases, you may want to keep the original source data, from which the indexed fields are extracted, in the index. You can do this by setting the src-data-store and src-data-field attributes on the table.
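
A minimal sketch of a table that keeps the source data, using the attributes described under Table Schema. The values are illustrative: "lucene" and "src_data" come from the attribute descriptions, while "true" for compress-src-data is an assumed boolean value.

  <table uid="id" src-data-store="lucene" src-data-field="src_data"
         compress-src-data="true">
  ...
  </table>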

Facet Schema

The second section is the facet schema, which defines how columns can be queried.

If the table section defines how data is added into Sensei, then the facet section describes how that data can be queried.

The facet section is composed of a set of facet definitions.

A facet definition requires a name and a type.

Possible types:

  • simple: simplest facet, 1 row = 1 discrete value

  • path: hierarchical facet, e.g. a/b/c

  • range: range facet, used to support range queries

  • multi: 1 row = N discrete values

  • compact-multi: similar to multi, but possible values are limited to 32

  • histogram: similar to a range facet, but a histogram facet automatically calculates the distribution of facet values over a predefined series of equally sized ranges. (A histogram facet depends on another numeric facet and requires several mandatory parameters; see the section called “Parameters for Histogram Facets”.)

  • custom: any user defined facet type

Example: https://github.com/javasoze/sensei/blob/master/conf/schema.xml
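
For orientation, facet definitions for the car table shown earlier might look like the sketch below. The pairing of facet types with columns is an assumption made for illustration, not a copy of the linked file.

  <facet name="color" type="simple" />
  <facet name="city" type="path" />
  <facet name="year" type="range" />
  <facet name="tags" type="multi" />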

Optional Attributes

Column

The column attribute references the column names defined in the table section. By default, the value of the name attribute is used.

This can be useful if you want the facet name to differ from the column name, or if you want to define multiple facets on the same column.
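
For instance, two facets could be defined on the same year column as sketched below. The facet names yearRange and yearExact are hypothetical.

  <facet name="yearRange" type="range" column="year" />
  <facet name="yearExact" type="simple" column="year" />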

Depends

This is a comma-delimited string naming the set of facets that this facet depends on.

When the depends attribute is specified, Sensei guarantees that the facets it lists are loaded before this facet.

This is also how Composite Facets are constructed (another advanced topic).
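
A sketch of the attribute in use, with hypothetical facet names: the custom facet network would be loaded only after the color and tags facets it lists.

  <facet name="color" type="simple" />
  <facet name="tags" type="multi" />
  <facet name="network" type="custom" depends="color,tags" />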

Dynamic

Dynamic facets are useful when data layout is not known until query time.

Some examples:

  • Searcher's social network

  • Dynamic time ranges from when the search request is issued

This is another advanced topic to be discussed later.

Parameters

A facet can be configured via a list of parameters. Some facet types need parameters in certain situations, for example:

  • For path facets, separator strings can be configured

  • For range facets, predefined ranges can be configured

The parameters are specified via the params element, which contains a list of param elements. Each param requires two attributes: name and value.

How parameters are interpreted and used depends on the facet type.
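
As an illustration for path facets, the sketch below sets the separator string. The parameter name separator is an assumption; consult the XSD or the linked schema.xml for the name your version expects.

  <facet name="city" type="path">
    <params>
      <!-- 'separator' is an assumed parameter name -->
      <param name="separator" value="/" />
    </params>
  </facet>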

Here is an example of a facet with a list of predefined ranges:

  <facet name="year" type="range">
    <params>
      <param name="range" value="1993-1994"/>
      <param name="range" value="1995-1996"/>
      <param name="range" value="1997-1998"/>
      <param name="range" value="1999-2000"/>
      <param name="range" value="2001-2002"/>
    </params>
  </facet>

Parameters for Histogram Facets

A histogram facet requires 5 parameters:

  • datatype: the data type. Only the following 5 numeric data types are allowed:

    1. int

    2. short

    3. long

    4. float

    5. double

  • datahandler: this is the name of the facet that the histogram facet depends on. The values of this facet are used to generate the distribution information.

  • start: the minimum value of the facet.

  • end: the maximum value of the facet.

  • unit: the unit value used to divide facet values into ranges.

Here is an example configuration for a histogram facet over a facet called score:

  <facet name="scoreHistogram" type="histogram">
    <params>
      <param name="datatype" value="int"/>
      <param name="datahandler" value="score"/>
      <param name="start" value="0"/>
      <param name="end" value="100"/>
      <param name="unit" value="10"/>
    </params>
  </facet>
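
Because the histogram reads its values from the facet named in datahandler, a score column and a score facet need to exist as well. A sketch of those supporting definitions follows; the range type chosen for the score facet is an assumption, as the text only requires a numeric facet.

  <!-- in the table section -->
  <column name="score" type="int" />

  <!-- in the facet section -->
  <facet name="score" type="range" />

With start 0, end 100, and unit 10, the histogram would be expected to group values into ranges 10 wide.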

Customized Facets

We understand that a short list of predefined facet handlers cannot cover all use cases, so users can define their own customized facets.

If a customized facet handler is required for a column (or multiple columns), you can set the facet type to "custom" and declare a bean for the facet handler in the file custom-facets.xml.

For example, if a customized facet called time is declared in schema.xml like this:

  <facet name="time" type="custom" dynamic="false"/>

and the facet handler is implemented in the class com.example.facets.TimeFacetHandler, then you should include the following line in the file custom-facets.xml: [1]

  <bean id="time" class="com.example.facets.TimeFacetHandler"/>

The id of the bean should match the name of the facet.
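
Put together, a complete custom-facets.xml might look like the following sketch. It assumes the file is an ordinary Spring beans definition file; the beans wrapper and namespace declarations are not taken from the Sensei documentation.

  <?xml version="1.0" encoding="UTF-8"?>
  <beans xmlns="http://www.springframework.org/schema/beans"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.springframework.org/schema/beans
             http://www.springframework.org/schema/beans/spring-beans.xsd">
    <!-- the bean id matches the facet name declared in schema.xml -->
    <bean id="time" class="com.example.facets.TimeFacetHandler"/>
  </beans>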



[1] Here we assume that the time facet handler does not take any arguments.