Sensei configuration
Guide to configure your application.
Configuration files
A Sensei node is configured by a set of files. These files describe a Sensei node in terms of data models, server configuration, indexing tuning parameters, customizations, etc.
Sensei configuration is a directory containing a couple of files:
- sensei.properties - settings for a Sensei node or process
- schema.xml - data schema declaration
Sensei Properties
The Sensei properties file configures different parts of the system, from server settings to indexing gateways.
Schema
The Sensei schema describes the data model of a Sensei application. It is broken into two parts:
- table - describes how data is persisted; think of it as the indexing part of the schema.
- facets - describes how data is viewed or queried; think of it as the query part of the schema.
Below is an example schema for the Twitter feed:
<table uid="id_str">
  <column name="created-time" type="long" />
  <column name="authorname" type="string" />
  <column name="text" type="text" />
</table>
Data models are described in the schema.xml file. The XSD definition of this XML file can be found here.
The schema file is composed of two sections:
Table schema
Facet schema
A Sensei instance can be viewed as a giant table with many columns and many rows. The concept of such a table corresponds directly to that of a table in a traditional RDBMS.
A table may have the following attributes:
uid (mandatory) - defines the name of the primary key field. This must be of type long.
A table is also composed of a set of columns. Each column has a name and a type. Below is the list of supported types:
string - value is a string, e.g. "abc"
int - integer value
long - long value
short - short value
float - a floating point value
double - double value
char - a character
date - a date value, which must be accompanied by a format string used to parse a date string
text - a searchable text segment; standard Lucene indexing specifications can also be given here, e.g. index="ANALYZED", termvector="NO".
For number types, we don't currently support negative values. This is coming in a future release.
A column that is not of type "text" is considered a meta column. Any meta column can be specified to be either single (default) or multi. When a column is specified to be multi, e.g. multi="true", it means that, given a row, the column can have more than one value. A delimiter string can be provided to help the indexer parse the values (the default delimiter is ","). To specify a different delimiter, say ":", simply set delimiter=":".
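To make the multi-value behavior concrete, here is a minimal Java sketch of how a delimited value might be split into individual values. The class MultiValueSplitter is hypothetical and not part of Sensei; trimming whitespace and dropping empty tokens are assumptions, since the text does not say how the indexer treats them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sensei: illustrates splitting a multi column value.
public class MultiValueSplitter {

    // Splits raw on the literal delimiter; trimming and skipping empty tokens are assumptions.
    public static List<String> split(String raw, String delimiter) {
        List<String> values = new ArrayList<>();
        for (String token : raw.split(Pattern.quote(delimiter))) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("cool,leather", ","));  // default delimiter
        System.out.println(split("cool:leather", ":"));  // custom delimiter=":"
    }
}
```

Pattern.quote is used so that a delimiter such as "|" or "." is treated literally rather than as a regular expression.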
Here is an example of the table schema (see https://github.com/javasoze/sensei/blob/master/conf/schema.xml):
<table uid="id" delete-field="" skip-field="">
  <column name="color" type="string" />
  <column name="category" type="string" />
  <column name="city" type="string" />
  <column name="makemodel" type="string" />
  <column name="year" type="int" />
  <column name="price" type="float" />
  <column name="mileage" type="int" />
  <column name="tags" type="string" multi="true" delimiter=","/>
  <column name="contents" type="text" index="ANALYZED" store="NO" termvector="NO"/>
</table>
By default, data objects inserted into Sensei are JSON objects.
Example:
Given the following table definition:
<table uid="id">
  <column name="color" type="string" />
  <column name="year" type="int" />
  <column name="tag" type="string" multi="true" />
  <column name="description" type="text" index="ANALYZED" store="NO" />
</table>
The following table shows as an example how a JSON object is mapped into the table:
JSON object
{ "id": 1, "color": "red", "year": 2000, "tag": "cool,leather", "description": "i love this car" }
Table view
id | color | year | tag | description |
---|---|---|---|---|
1 | red | 2000 | cool, leather | i love this car |
To delete a row from Sensei, simply insert a data object with the specified delete-field set to true.
Example:
The following JSON object would delete the row where id=5:
{
  "_type": "delete",
  "_uid": 5
}
The second section is the facet schema, which defines how columns can be queried. If the table section defines how data is added into Sensei, the facet section describes how that data can be queried.
The facet section is composed of a set of facet definitions.
A facet definition requires a name and a type.
Possible types:
simple: simplest facet, 1 row = 1 discrete value
path: hierarchical facet, e.g. a/b/c
range: range facet, used to support range queries
multi: 1 row = N discrete values
compact-multi: similar to multi, but possible values are limited to 32
histogram: similar to a range facet, but a histogram facet automatically calculates the distribution of facet values over a predefined series of equally sized ranges. A histogram facet depends on another numeric facet and requires several mandatory parameters; see the section called “Parameters for Histogram Facets”.
timeRange: also similar to a range facet. It is a dynamic facet handler that searches for documents whose time column value lies within the specified range from now; see the section called “Parameters for TimeRange Facets”.
custom: any user defined facet type
Example: https://github.com/javasoze/sensei/blob/master/conf/schema.xml
The column attribute references the column names defined in the table section. By default, the value of the name attribute is used.
This can be useful if you want to name the facet name to be different from the defined column name, or if you want to have multiple facets defined on the same column.
The depends attribute is a comma-delimited string naming the set of facets that this facet depends on. When depends is specified, Sensei guarantees that those facets are loaded before this facet.
This is also how composite facets are constructed (an advanced topic).
A facet can be configured via a list of parameters. Parameters are needed for a facet under some situations, for example:
For path facets, separator strings can be configured
For range facets, predefined ranges can be configured
The parameters can be specified via the params element, which contains a list of param elements. For each param, two attributes need to be specified: name and value.
How parameters are interpreted and used is dependent on the facet type.
Here is an example of a facet with a list of predefined ranges:
<facet name="year" type="range">
  <params>
    <param name="range" value="1993-1994"/>
    <param name="range" value="1995-1996"/>
    <param name="range" value="1997-1998"/>
    <param name="range" value="1999-2000"/>
    <param name="range" value="2001-2002"/>
  </params>
</facet>
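As a rough illustration of what predefined ranges mean for a value, the following Java sketch finds the "low-high" range that contains a given year. RangeLookup is a hypothetical class, not Sensei code, and it assumes both bounds are inclusive and non-negative; the text above does not state how Sensei treats range boundaries.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Sensei: maps a value to one of the predefined ranges.
public class RangeLookup {

    // Returns the first "low-high" range containing value, or null if none matches.
    // Assumption: both bounds are inclusive and non-negative integers.
    public static String findRange(int value, List<String> ranges) {
        for (String range : ranges) {
            String[] bounds = range.split("-");
            int low = Integer.parseInt(bounds[0].trim());
            int high = Integer.parseInt(bounds[1].trim());
            if (value >= low && value <= high) {
                return range;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> ranges = Arrays.asList(
            "1993-1994", "1995-1996", "1997-1998", "1999-2000", "2001-2002");
        System.out.println(findRange(1996, ranges));
        System.out.println(findRange(2010, ranges));
    }
}
```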
A histogram facet requires five parameters:
datatype: the data type. Only the following five numeric data types are allowed: int, short, long, float, double.
datahandler: this is the name of the facet that the histogram facet depends on. The values of this facet are used to generate the distribution information.
start: the minimum value of the facet.
end: the maximum value of the facet.
unit: the unit value used to divide facet values into ranges.
Here is an example configuration for a histogram facet over a facet called score:
<facet name="scoreHistogram" type="histogram">
  <params>
    <param name="datatype" value="int"/>
    <param name="datahandler" value="score"/>
    <param name="start" value="0"/>
    <param name="end" value="100"/>
    <param name="unit" value="10"/>
  </params>
</facet>
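The start, end and unit parameters can be read as defining a series of equal-width buckets. The Java sketch below shows that arithmetic for int values. HistogramBuckets is a hypothetical class, and it assumes half-open buckets of the form [start + i*unit, start + (i+1)*unit); whether Sensei includes or excludes the end value is not stated here.

```java
// Hypothetical helper, not part of Sensei: bucket arithmetic for a histogram facet.
public class HistogramBuckets {

    // Number of equal-width buckets covering [start, end); a partial last bucket counts as one.
    public static int bucketCount(int start, int end, int unit) {
        if (unit <= 0 || end <= start) {
            throw new IllegalArgumentException("unit must be positive and end must exceed start");
        }
        return (end - start + unit - 1) / unit;
    }

    // Index of the bucket containing value, or -1 when value lies outside [start, end).
    public static int bucketIndex(int value, int start, int end, int unit) {
        if (value < start || value >= end) {
            return -1;
        }
        return (value - start) / unit;
    }

    public static void main(String[] args) {
        // Parameters taken from the scoreHistogram example: start=0, end=100, unit=10.
        System.out.println(bucketCount(0, 100, 10));
        System.out.println(bucketIndex(37, 0, 100, 10));
    }
}
```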
A TimeRange facet requires either the column or the depends parameter:
column: a reference to the column that holds the timestamp in milliseconds. Under the hood, Sensei creates another range facet handler for this column.
depends: this is the name of the facet that the TimeRange facet depends on. The values of this facet are used to evaluate whether the document should be matched.
param range: the range string has the format dddhhmmss (ddd: days (000-999), hh: hours (00-23), mm: minutes (00-59), ss: seconds (00-59)). It represents the time range used by the facet handler.
Here is an example configuration for a time facet over a facet called time. It matches the documents whose time column is not older than 12 hours:
<facet name="timeRange" type="dynamicTimeRange" depends="time" dynamic="true">
  <params>
    <param name="range" value="000120000" />
  </params>
</facet>
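The dddhhmmss range format can be decoded with simple substring arithmetic. The Java sketch below converts a range string to milliseconds and checks whether a timestamp falls inside the window. TimeRangeParser and its method names are hypothetical, not Sensei's own implementation.

```java
// Hypothetical helper, not part of Sensei: decodes the dddhhmmss range format.
public class TimeRangeParser {

    // Converts a 9-digit dddhhmmss string into a duration in milliseconds.
    public static long toMillis(String range) {
        if (range == null || !range.matches("\\d{9}")) {
            throw new IllegalArgumentException("Expected 9 digits in dddhhmmss format: " + range);
        }
        int days = Integer.parseInt(range.substring(0, 3));
        int hours = Integer.parseInt(range.substring(3, 5));
        int minutes = Integer.parseInt(range.substring(5, 7));
        int seconds = Integer.parseInt(range.substring(7, 9));
        if (hours > 23 || minutes > 59 || seconds > 59) {
            throw new IllegalArgumentException("Field out of bounds in: " + range);
        }
        long totalSeconds = ((days * 24L + hours) * 60L + minutes) * 60L + seconds;
        return totalSeconds * 1000L;
    }

    // True when timestampMillis is not older than the given range, measured back from nowMillis.
    public static boolean isWithinRange(long timestampMillis, long nowMillis, String range) {
        return nowMillis - timestampMillis <= toMillis(range);
    }

    public static void main(String[] args) {
        // "000120000" is the 12-hour range from the example above.
        System.out.println(toMillis("000120000"));
    }
}
```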
We understand we cannot possibly cover all use cases using a short list of predefined facet handlers. It is necessary to allow users to define their own customized facets for different reasons.
If a customized facet handler is required for a column (or multiple columns), you can set the facet type to "custom" and declare a corresponding handler property in the file sensei.properties.
For example, if a customized facet called time is declared in schema.xml like this:
<facet name="time" type="custom" dynamic="false"/>
and the implementation of the facet handler is in class com.example.facets.TimeFacetHandler, then you should include the following lines in the file sensei.properties:
my.custom.facets.time.class = com.example.facets.TimeFacetHandler
sensei.custom.facets.list=..., my.custom.facets.time
The property name of the bean must match the reference in sensei.custom.facets.list.
A Sensei node is configured via the sensei.properties file, which uses the format supported by Apache Commons Configuration (http://commons.apache.org/). This file consists of the following five parts:
server: port to listen on, rpc parameters, etc.
cluster: cluster manager, sharding, request routing, etc.
indexing: data interpretation, tokenization, indexer type, etc.
broker and client: e.g. entry into Sensei system
plugins: e.g. customized facet handlers
Below is the configuration file for the car demo (available from
# sensei node parameters
sensei.node.id=1
sensei.node.partitions=0,1
# sensei network server parameters
sensei.server.port=1234
sensei.server.requestThreadCorePoolSize=20
sensei.server.requestThreadMaxPoolSize=70
sensei.server.requestThreadKeepAliveTimeSecs=300
# sensei cluster parameters
sensei.cluster.name=sensei
sensei.cluster.url=localhost:2181
sensei.cluster.timeout=30000
# sensei indexing parameters
sensei.index.directory = index/cardata
sensei.index.batchSize = 10000
sensei.index.batchDelay = 300000
sensei.index.maxBatchSize = 10000
sensei.index.realtime = true
# indicator of freshness of data, in seconds, a number <=0 implies as fast as possible
sensei.index.freshness = 5
# index manager parameters
sensei.index.manager.default.maxpartition.id = 1
# gateway information
sensei.gateway.class=com.senseidb.gateway.file.LinedFileDataProviderBuilder
sensei.gateway.file.path = example/cars/data/cars.json
# index manager parameters
sensei.index.manager.default.maxpartition.id = 1
# broker properties
sensei.broker.port = 8080
sensei.broker.minThread = 50
sensei.broker.maxThread = 100
sensei.broker.maxWaittime = 2000
sensei.search.cluster.network.conn.timeout = 1000
sensei.search.cluster.network.write.timeout = 150
sensei.search.cluster.network.max.conn.per.node = 5
sensei.search.cluster.network.stale.timeout.mins = 10
sensei.search.cluster.network.stale.cleanup.freq.mins = 10
# custom router factory
# sensei.search.router.factory.class = com.senseidb.plugin.example.myRouterFactory
Let's take a brief look at how class properties are loaded into Sensei.
On startup, Sensei scans the configuration for all properties with the suffix '.class'. It tries to load each specified class from its classpath and instantiates it by calling the no-arg constructor. If the instantiated object implements the com.senseidb.plugin.SenseiPlugin interface, its init and start callback methods are called:
public interface SenseiPlugin {
  public void init(Map<String, String> config);
  public void start();
  public void stop();
}
The config properties are taken from sensei.properties, and they must have the same prefix as the corresponding key with the '.class' suffix:
# Custom implementation of the sensei load balancer strategy
sensei.search.router.factory.class=com.senseidb.plugin.example.MyCustomRouterFactory
sensei.search.router.factory.customType=round-robin
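The prefix rule described above can be sketched in plain Java: for each key ending in '.class', gather the other properties that share its prefix and pass them on with the prefix stripped. PluginConfigExtractor is a hypothetical class written for illustration. It is not Sensei's loader, and details such as how nested '.class' keys are handled are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of prefix-based plugin configuration; not Sensei's actual loader.
public class PluginConfigExtractor {

    private static final String CLASS_SUFFIX = ".class";

    // For every "<prefix>.class" key, collects the sibling "<prefix>.<name>" properties,
    // keyed by <name>. The result maps each prefix to the config its plugin would receive.
    public static Map<String, Map<String, String>> extract(Map<String, String> props) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String key : props.keySet()) {
            if (!key.endsWith(CLASS_SUFFIX)) {
                continue;
            }
            String prefix = key.substring(0, key.length() - CLASS_SUFFIX.length());
            Map<String, String> config = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : props.entrySet()) {
                String candidate = entry.getKey();
                if (candidate.startsWith(prefix + ".") && !candidate.equals(key)) {
                    config.put(candidate.substring(prefix.length() + 1), entry.getValue());
                }
            }
            result.put(prefix, config);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("sensei.search.router.factory.class",
            "com.senseidb.plugin.example.MyCustomRouterFactory");
        props.put("sensei.search.router.factory.customType", "round-robin");
        props.put("sensei.server.port", "1234");
        System.out.println(extract(props));
    }
}
```

With the router example, the plugin under prefix sensei.search.router.factory would see a config map containing only customType=round-robin; unrelated keys such as sensei.server.port are left out.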
Sometimes we need to provide an instance of a class that does not have a no-arg constructor. We can implement the SenseiPluginFactory interface for this.
In the config:
sensei.index.analyzer.class = com.senseidb.plugin.example.LuceneStandardAnalyzerFactory
sensei.index.analyzer.version = LUCENE_34
In the code:
public class LuceneStandardAnalyzerFactory implements SenseiPluginFactory<StandardAnalyzer> {
  @Override
  public StandardAnalyzer getBean(Map<String, String> initProperties, String fullPrefix,
      SenseiPluginRegistry pluginRegistry) {
    return new StandardAnalyzer(Version.valueOf(initProperties.get("version")));
  }
}
The custom facets may be defined in the sensei.properties and referenced from the schema defined in the schema.xml file:
In the schema.xml:
<facets>
  <facet name="groupid" type="custom" />
  <facet name="tags" type="custom" />
  <facet name="virtual_groupid" type="custom" />
  <facet name="virtual_groupid_fixed" type="custom" />
</facets>
In the Java code:
public class VirtualGroupIdFactory implements SenseiPluginFactory<List<FacetHandler>> {
  @Override
  public List<FacetHandler> getBean(Map initProperties, String fullPrefix,
      SenseiPluginRegistry pluginRegistry) {
    List<FacetHandler> ret = new ArrayList<FacetHandler>(2);
    ret.add(new VirtualSimpleFacetHandler("virtual_groupid",
        new PredefinedTermListFactory(Long.class, "00000000000000000000000000000000000"),
        null, Collections.EMPTY_SET));
    ret.add(new VirtualSimpleFacetHandler("virtual_groupid_fixed",
        new TermFixedLengthLongArrayListFactory(2),
        null, Collections.EMPTY_SET));
    return ret;
  }
}

public class SimpleFacetHandlerFactory implements SenseiPluginFactory<SimpleFacetHandler> {
  @Override
  public SimpleFacetHandler getBean(Map<String, String> initProperties, String fullPrefix,
      SenseiPluginRegistry pluginRegistry) {
    return new SimpleFacetHandler(initProperties.get("facetName"),
        initProperties.get("fieldName"), null, Collections.EMPTY_SET);
  }
}
As the last step, everything is wired together in sensei.properties:
my.custom_facets.virtual_groupids.class=com.senseidb.plugin.example.VirtualGroupIdFactory
my.custom_facets.groupid.class=com.senseidb.plugin.example.SimpleFacetHandlerFactory
my.custom_facets.groupid.facetName=groupid
my.custom_facets.groupid.fieldName=groupid
my.custom_facets.tags.class=com.senseidb.plugin.example.SimpleFacetHandlerFactory
my.custom_facets.tags.facetName=tags
my.custom_facets.tags.fieldName=tags
# beans might be referenced either by the simple name, e.g. 'tags', or by the full key, e.g. 'my.custom_facets.tags'
# Note that virtual_groupids references not a single FacetHandler but the list of facet handlers returned by the VirtualGroupIdFactory
sensei.custom.facets.list= virtual_groupids, my.custom_facets.tags, groupid
In the following sections, we are going to explain every configuration property in each part: what the property type is, whether it is required, what the default value is, and how it is used, etc.
- sensei.node.id
Type: int
Required: Yes
Default: None
This is the node ID of the Sensei node in a cluster.
- sensei.node.partitions
Type: String (comma separated integers or ranges)
Required: Yes
Default: None
This specifies the partition IDs that the Sensei server is going to handle. Partition IDs can be given as either integer numbers or ranges, separated by commas. For example, the following line denotes that the Sensei server has six partitions: 1, 4, 5, 6, 7, and 10.
sensei.node.partitions=1,4-7,10
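The partition syntax can be expanded into explicit IDs as in the following Java sketch. PartitionSpecParser is a hypothetical class for illustration and may differ from how Sensei parses the property; it assumes non-negative IDs and inclusive ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Sensei: expands a partition spec such as "1,4-7,10".
public class PartitionSpecParser {

    // Expands comma-separated integers and inclusive "from-to" ranges into a list of IDs.
    public static List<Integer> parse(String spec) {
        List<Integer> ids = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue;
            }
            int dash = token.indexOf('-');
            if (dash < 0) {
                ids.add(Integer.parseInt(token));
            } else {
                int from = Integer.parseInt(token.substring(0, dash).trim());
                int to = Integer.parseInt(token.substring(dash + 1).trim());
                if (from > to) {
                    throw new IllegalArgumentException("Invalid range: " + token);
                }
                for (int id = from; id <= to; id++) {
                    ids.add(id);
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,4-7,10"));
    }
}
```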
- sensei.server.port
Type: int
Required: Yes
Default: None
This is the Sensei server port number.
- sensei.server.requestThreadCorePoolSize
Type: int
Required: No
Default: 20
This is the core size of thread pool used to execute requests.
- sensei.server.requestThreadKeepAliveTimeSecs
Type: int
Required: No
Default: 300
This is the length of time in seconds to keep an idle request thread alive.
- sensei.server.requestThreadMaxPoolSize
Type: int
Required: No
Default: 70
This is the maximum size of thread pool used to execute requests.
- sensei.cluster.name
Type: String
Required: Yes
Default: None
This is the name of the Sensei server cluster.
- sensei.cluster.timeout
Type: int
Required: No
Default: 300000
This is the session timeout value, in milliseconds, that is passed to ZooKeeper.
- sensei.cluster.url
Type: String
Required: Yes
Default: None
This is the ZooKeeper URL for the Sensei cluster.
- sensei.index.analyzer.class
See sensei.index.analyzer.class in the section called “Plug-in Properties”.
- sensei.index.batchDelay
Type: int
Required: No
Default: 300000
This is the maximum time to wait in milliseconds before flushing index events to disk. The default value is 300000 (i.e. 5 minutes).
- sensei.index.batchSize
Type: int
Required: No
Default: 10000
This is the batch size that controls the pace of data event consumption on the back-end. It is the soft size limit of each event batch. If events come in too fast and the limit is already reached, the indexer blocks incoming events until the number of buffered events drops below this limit, after some of the events have been sent to the background data consumer.
- sensei.index.custom.class
See sensei.index.custom.class in the section called “Plug-in Properties”.
- sensei.index.directory
Type: String
Required: Yes
Default: None
This is the directory used to save the index.
- sensei.index.freshness
Type: long
Required: No
Default: 500
This controls the freshness of entries in the index reader cache.
- sensei.index.interpreter.class
See sensei.index.interpreter.class in the section called “Plug-in Properties”.
- sensei.index.manager
See sensei.index.manager in the section called “Plug-in Properties”.
- sensei.index.manager.default.batchSize
Type: int
Required: No
Default: 1
This is the batch size that controls when data events accumulated in the default index manager should be consumed by the data consumer. The default value is 1.
- sensei.index.manager.default.eventsPerMin
Type: int
Required: No
Default: 40000
This is the maximum number of data events that the indexer can consume per minute. If this threshold is exceeded, the indexer will pause for a short period of time before continuing to consume incoming data events.
This property is helpful in preventing the indexer from being overloaded. The default value is 40,000.
- sensei.index.manager.default.maxpartition.id
Type: int
Required: Yes, if the default indexing manager is chosen; No, otherwise.
Default: None
This is the maximum partition ID number served by this Sensei cluster if the default Sensei indexing manager is used.
Warning
This property is different from the total number of partitions in a Sensei cluster. For example, if a cluster contains 4 partitions, 0, 1, 2, and 3, then sensei.index.manager.default.maxpartition.id should be set to 3.
- sensei.index.manager.default.shardingStrategy.class
See sensei.index.manager.default.shardingStrategy.class in the section called “Plug-in Properties”.
- sensei.index.manager.default.type
See sensei.index.manager.default.type in the section called “Plug-in Properties”.
- sensei.index.maxBatchSize
Type: int
Required: No
Default: 10000
This is the maximum number of incoming data events that can be held by the indexer in a batch before they are flushed to disk. If this number is exceeded, the indexer will stop processing the data events for one minute.
- sensei.index.realtime
Type: boolean
Required: No
Default: true
This specifies whether the indexing mode is real-time or not.
- sensei.index.similarity.class
See sensei.index.similarity.class in the section called “Plug-in Properties”.
- sensei.broker.maxThread
Type: int
Required: No
Default: 50
This is the maximum size of thread pool used by a broker to execute requests.
- sensei.broker.maxWaittime
Type: int
Required: No
Default: 2000
This is the maximum idle time in milliseconds for a thread on a broker. Threads that are idle for longer than this period may be stopped.
- sensei.broker.minThread
Type: int
Required: No
Default: 20
This is the core size of thread pool used by the broker to execute requests.
- sensei.broker.port
Type: int
Required: Yes
Default: None
This is the port number of the Sensei broker.
- sensei.broker.webapp.path
Type: String
Required: Yes
Default: None
This is the resource base of the broker web application.
- sensei.search.cluster.zookeeper.url
Type: String
Required: Yes
Default: None
This is the ZooKeeper URL for the Sensei search cluster that a broker talks to.
- sensei.search.cluster.name
Type: String
Required: Yes
Default: None
This is the Sensei cluster name, i.e. the service name for the network clients and brokers.
- sensei.search.cluster.zookeeper.conn.timeout
Type: int
Required: No
Default: 10000
This is the ZooKeeper network client session timeout value in milliseconds.
- sensei.search.cluster.network.conn.timeout
Type: int
Required: No
Default: 1000
This is the maximum number of milliseconds to allow a connection attempt to take.
- sensei.search.cluster.network.write.timeout
Type: int
Required: No
Default: 150
This is the number of milliseconds a request can be queued for write before it is considered stale.
- sensei.search.cluster.network.max.conn.per.node
Type: int
Required: No
Default: 5
This is the maximum number of open connections to a node.
- sensei.search.cluster.network.stale.timeout.mins
Type: int
Required: No
Default: 10
This is the number of minutes to keep a request that is waiting for a response.
- sensei.search.cluster.network.stale.cleanup.freq.mins
Type: int
Required: No
Default: 10
This is the frequency to clean up stale requests.
- sensei.index.analyzer.class
Type: Class
Required: No
Default: ""
This specifies the class name of the analyzer plug-in for analyzing text. If not specified, org.apache.lucene.analysis.standard.StandardAnalyzer will be used.
- sensei.index.similarity.class
Type: Class
Required: No
Default: ""
This specifies the class name of the similarity plug-in for Lucene scoring. If not specified, org.apache.lucene.search.DefaultSimilarity is used.
- sensei.index.custom.class
Type: Class
Required: No
Default: ""
This specifies the class name of the custom indexing pipeline implementation. A custom indexing pipeline can be plugged into the indexing process to allow users to modify generated Lucene documents at the last step before they are indexed.
A custom indexing pipeline has to implement the interface com.senseidb.indexing.CustomIndexingPipeline.
- sensei.index.interpreter.class
Type: Class
Required: No
Default: ""
This specifies the bean ID of the interpreter of Zoie indexables. If not specified, com.senseidb.indexing.DefaultJsonSchemaInterpreter is used.
- sensei.index.manager.class
Type: Class
Required: No
Default: ""
This specifies the class name of the indexing manager object implementing com.senseidb.search.node.SenseiIndexingManager. If not specified, com.senseidb.indexing.DefaultStreamingIndexingManager is used.
- sensei.gateway.class
Type: Class
Required: Yes if sensei.index.manager.class is not specified, i.e. the default indexing manager is used.
Default: None
This specifies the type of gateway that will be used by the default indexing manager. The value identifies the name of the class extending com.senseidb.gateway.SenseiGateway.
Several built-in gateways are provided by Sensei, but you can always define your own based on your needs. Whether a built-in or a custom gateway is used, additional parameters can be specified under names with the prefix sensei.gateway.<gateway-type>.
Currently the following built-in gateway types are supported:
file: This type of gateway takes a regular text file as the input. Each line in the file contains a data entry in JSON format. Only one property needs to be set for this gateway type. See the section called “File Gateway Properties”.
kafka: This type of gateway takes Kafka messages as input. See the section called “Kafka Gateway Properties” for additional property information.
jms: This type of gateway takes JMS (Java Message Service) messages as input. The publish-and-subscribe messaging model is used by Sensei, so parameters like topic need to be provided. See the section called “JMS Gateway Properties” for additional property information.
jdbc: This type of gateway takes JDBC data as input. See the section called “JDBC Gateway Properties” for additional property information.
- sensei.index.manager.filter.class
Type: Class
Required: No
Default: None
This is the name of the class extending com.senseidb.indexing.DataSourceFilter. No matter what gateway type the indexing manager uses, a filter can be plugged in to convert the original source data to the JSON format defined by the table schema. If the input data is already in the right format, this filter is not needed.
- sensei.sharding.strategy.class
Type: Class
Required: No
Default: ""
This is the class name of the sharding strategy.
- sensei.search.router.factory.class
Type: Class
Required: No
Default: ""
This is the class name of the Sensei request router factory. This factory builds the load balancer that is used by Sensei brokers to route incoming requests to different Sensei nodes.
- sensei.version.comparator.class
Type: Class
Required: No
Default: ""
This specifies the class name of version comparator plug-in to be used by the indexer. If not specified, Zoie's default version comparator is used.
For the file gateway, the following property has to be specified:
- sensei.gateway.file.path
Type: String
Required: Yes
Default: None
This is the path to the input data file.
For the kafka gateway, the following properties should/can be specified:
- sensei.gateway.kafka.batchsize
Type: String
Required: Yes
Default: None
This is the batch size for each pull request.
- sensei.gateway.kafka.host
Type: String
Required: Yes
Default: None
This is the host name of the Kafka server.
- sensei.gateway.kafka.port
Type: int
Required: Yes
Default: None
This is the port number on which the Kafka server is listening for connections.
- sensei.gateway.kafka.timeout
Type: int
Required: Yes
Default: 10000
This is the socket timeout in milliseconds.
- sensei.gateway.kafka.topic
Type: String
Required: Yes
Default: None
The topic of the messages to be fetched.
For the jms gateway, the following properties should/can be specified:
- sensei.gateway.jms.clientId
Type: String
Required: Yes
Default: None
This is the client identifier used to connect to the JMS provider.
- sensei.gateway.jms.topic
Type: String
Required: Yes
Default: None
This is the topic name that the JMS client subscribes to.
- sensei.gateway.jms.topicFactory
Type: String
Required: Yes
Default: None
This is the bean ID of the proj.zoie.dataprovider.jms.TopicFactory object. This object is used to generate a topic object based on the given topic name.
- sensei.gateway.jms.connectionFactory.class
Type: Class
Required: Yes
Default: None
This is the class name of the javax.jms.TopicConnectionFactory object, which is used by the JMS client to create a javax.jms.TopicConnection object with the JMS provider.
For the jdbc gateway, the following properties should/can be specified:
- sensei.gateway.jdbc.adaptor
Type: Class
Required: Yes
Default: None
This is the bean ID of the com.senseidb.gateway.jdbc.SenseiJDBCAdaptor object. This object is used to build a proj.zoie.dataprovider.jdbc.PreparedStatementBuilder object, which is required by proj.zoie.dataprovider.jdbc.JDBCStreamDataProvider.
- sensei.gateway.jdbc.driver
Type: String
Required: Yes
Default: None
This is the class name of the JDBC driver that you want to use.
- sensei.gateway.jdbc.password
Type: String
Required: Yes
Default: None
This is the password for the user name that you use to connect to the database.
- sensei.gateway.jdbc.username
Type: String
Required: Yes
Default: None
This is the user name that you use to connect to the database.