Before data can be searched, it has to be indexed.
Depending on where your source data comes from, you can use different
data gateways to convert it to the format required by
Sensei. No matter which gateway is used, you need to define the data
model in the file schema.xml
(see the section called “Data Modeling” for more details), which is equivalent
to a table definition in an RDBMS.
If the amount of data is small and the run-time indexing and search rates are not high, a single Sensei node may be all you need. If the data is too large to fit on one machine, however, you have to split it into multiple shards and store them on a cluster of Sensei nodes, each serving one or more shards. When a user query comes in, Sensei searches all shards and merges their results for you automatically.
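The search-and-merge step described above can be sketched as follows. This is a minimal illustration in Python, not Sensei's actual implementation: the `Hit` tuple, the `search_shard` and `search_all` functions, and the in-memory shard layout are all hypothetical. Term frequency stands in for a real relevance score, and each shard is assumed to return its hits sorted by descending score.

```python
import heapq
from itertools import islice
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """A single search hit (hypothetical structure, for illustration only)."""
    score: float
    doc_id: int
    shard: int


def search_shard(shard_id: int, docs: Dict[int, str], term: str) -> List[Hit]:
    """Search one shard and return its hits sorted by descending score.

    The score is simply the term frequency, a stand-in for real relevance.
    """
    hits = [
        Hit(score=float(text.split().count(term)), doc_id=doc_id, shard=shard_id)
        for doc_id, text in docs.items()
        if term in text.split()
    ]
    return sorted(hits, key=lambda h: (-h.score, h.doc_id))


def search_all(shards: Dict[int, Dict[int, str]], term: str, top_k: int) -> List[Hit]:
    """Query every shard, then merge the sorted per-shard lists into one top-k list."""
    per_shard = [search_shard(sid, docs, term) for sid, docs in shards.items()]
    # heapq.merge lazily interleaves already-sorted inputs using the same sort key
    merged = heapq.merge(*per_shard, key=lambda h: (-h.score, h.doc_id))
    return list(islice(merged, top_k))


# Two shards, each holding a disjoint subset of the documents
SHARDS = {
    0: {1: "red car", 3: "blue car car"},
    1: {2: "car car car", 4: "green bike"},
}
RESULT = search_all(SHARDS, "car", top_k=2)
```

Because every shard returns its hits already sorted, the merge only needs to look at the head of each per-shard list, so the caller never has to re-sort the combined results.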
If the run-time indexing or search rate is high, network bandwidth, memory, or CPU may become the bottleneck on a Sensei server. In this case, sharding the index alone is not enough: you also have to replicate the index onto different nodes, so that multiple nodes share the indexing and search workload for the same shard(s).
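One way to picture replication is as a routing table from each shard to the nodes that hold a copy of it, with requests for a shard rotated across those nodes. The sketch below is illustrative only: the `ReplicaRouter` class, the node names, and the round-robin policy are assumptions, not Sensei's actual load-balancing logic.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class ReplicaRouter:
    """Round-robin selection among nodes serving the same shard (hypothetical)."""

    def __init__(self, replicas: Dict[int, List[str]]) -> None:
        for shard, nodes in replicas.items():
            if not nodes:
                raise ValueError(f"shard {shard} has no replica")
        # One endless iterator per shard, cycling over that shard's nodes
        self._cursors: Dict[int, Iterator[str]] = {
            shard: cycle(nodes) for shard, nodes in replicas.items()
        }

    def pick(self, shard: int) -> str:
        """Return the node that should handle the next request for this shard."""
        return next(self._cursors[shard])

    def plan(self) -> Dict[int, str]:
        """Choose one node per shard, so that a query still covers every shard."""
        return {shard: self.pick(shard) for shard in self._cursors}


# Shard 0 is replicated on node-a and node-b, shard 1 on node-c and node-d
router = ReplicaRouter({0: ["node-a", "node-b"], 1: ["node-c", "node-d"]})
first = router.plan()
second = router.plan()
```

Each call to `plan()` still touches every shard exactly once, but consecutive queries land on different replicas, which is how the workload for a shard gets spread over several nodes.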
In this chapter, we explain how to get the data indexed, and how to get the data sharded and replicated.
To get data indexed, the first thing to set up is the indexing
manager (see com.sensei.search.nodes.SenseiIndexingManager).
An indexing manager is responsible for:
Initializing the Zoie system(s): one Zoie system is needed for every shard of index on one Sensei node.
Building the data provider: a data provider needs to be built for the chosen data gateway.
Starting and shutting down the data gateway.
In most cases you can simply use the default indexing
manager provided by Sensei, com.sensei.indexing.api.DefaultStreamingIndexingManager.
However, you can always write your own version when needed.
The type of indexing manager is specified via the configuration
parameter sensei.index.manager, which is the bean
ID of the indexing manager object that you use. When
sensei.index.manager is not set, Sensei uses DefaultStreamingIndexingManager.
Once the indexing manager is selected, the next thing to set up is the data gateway. Which data gateway to use depends on how your original source data is stored. Sensei provides four types of built-in data gateways to cover the most common data sources (see Figure 1.2, “Sensei Data Gateway”); however, you can also write your own if needed.
The data gateway type is specified via the configuration parameter
sensei.index.manager.<indexing-manager-type>.type.
Here <indexing-manager-type> is the bean ID
of your indexing manager. If you use the default indexing manager,
DefaultStreamingIndexingManager, <indexing-manager-type> should be default.
Additional configuration parameters may be needed for the data gateway you choose. These configuration parameters are named with the following prefix:
sensei.index.manager.<indexing-manager-type>.<data-gateway-type>
For example, if you use the default indexing manager and the Kafka data gateway, the following configuration parameters need to be specified:
sensei.index.manager.default.type = kafka
sensei.index.manager.default.kafka.host = my-kafka-host
sensei.index.manager.default.kafka.port = 1234
sensei.index.manager.default.kafka.topic = log-data
sensei.index.manager.default.kafka.batchsize = 10000
sensei.index.manager.default.kafka.filter = my-kafka-filter
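To make the naming scheme concrete, the following sketch reads such a properties fragment and collects the gateway type and the gateway-specific settings by prefix. It is written in Python for illustration only; the `parse_properties` and `gateway_settings` helpers are hypothetical and are not part of Sensei. Falling back to `default` when `sensei.index.manager` is unset follows the behavior described in this chapter.

```python
from typing import Dict, Tuple

CONFIG_TEXT = """
sensei.index.manager.default.type = kafka
sensei.index.manager.default.kafka.host = my-kafka-host
sensei.index.manager.default.kafka.port = 1234
sensei.index.manager.default.kafka.topic = log-data
sensei.index.manager.default.kafka.batchsize = 10000
sensei.index.manager.default.kafka.filter = my-kafka-filter
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def gateway_settings(props: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the data gateway type and the settings found under its prefix."""
    # The indexing manager's bean ID; "default" when the parameter is not set
    manager = props.get("sensei.index.manager", "default")
    base = f"sensei.index.manager.{manager}"
    gateway = props[f"{base}.type"]
    prefix = f"{base}.{gateway}."
    # Strip the common prefix so only the gateway-local names remain
    settings = {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}
    return gateway, settings


GATEWAY, SETTINGS = gateway_settings(parse_properties(CONFIG_TEXT))
```

All values are kept as strings here; a real gateway would convert entries such as the port and batch size to numbers itself.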
In the rest of the document, we will use the default indexing manager in most of the examples, unless a different indexing manager type is specified explicitly.