Chapter 5. Indexing

Table of Contents

Data Acquisition
Index Sharding
Index Replication
Sensei Cluster Overview

Data must be indexed before it can be searched. Depending on where the source data comes from, you can use different data gateways to convert it to the format required by Sensei. Whichever gateway you use, you need to define the data model in the file schema.xml (see the section called “Data Modeling” for more details), which is equivalent to a table definition in an RDBMS.

If the amount of data is small and the run-time indexing and search rates are not high, a single Sensei node may be all you need. If the amount of data is too large to fit on one machine, however, you have to split it into multiple shards and store them on a cluster of Sensei nodes, each serving one or more shards. When a user query comes in, Sensei automatically searches all shards and merges their results.
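
The merge step can be pictured with a minimal sketch. This is not Sensei's actual merging code: the Hit type, the mergeTopK method, and the use of a single relevance score are hypothetical, chosen only to illustrate how per-shard results could be combined into one ranked list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ShardMergeSketch {
    /** Hypothetical hit type: a document id plus a relevance score. */
    static final class Hit {
        final long docId;
        final double score;

        Hit(long docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merges the hit lists of all shards and keeps the topK highest-scoring hits. */
    static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int topK) {
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> shardHits : perShardHits) {
            for (Hit hit : shardHits) {
                heap.offer(hit);
                if (heap.size() > topK) {
                    heap.poll(); // drop the current lowest score
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        // Present the survivors best-first.
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = new ArrayList<>();
        shards.add(Arrays.asList(new Hit(1L, 0.9), new Hit(2L, 0.4))); // shard 0
        shards.add(Arrays.asList(new Hit(3L, 0.7), new Hit(4L, 0.1))); // shard 1
        for (Hit hit : mergeTopK(shards, 3)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

The bounded heap keeps memory proportional to topK rather than to the total number of hits returned by all shards, which matters as the number of shards grows.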

If the run-time indexing or search rate is high, network bandwidth, memory, or CPU may become the bottleneck on a Sensei server. In this case, sharding the index alone is not enough: you also have to replicate the index onto different nodes so that multiple nodes share the indexing and search workload for the same shard(s).

In this chapter, we explain how to get the data indexed, and how to get the data sharded and replicated.

Data Acquisition

To get data indexed, the first thing to set up is the indexing manager (see com.sensei.search.nodes.SenseiIndexingManager). An indexing manager is responsible for:

  • Initializing the Zoie system(s): one Zoie system is needed for every shard of index on one Sensei node.

  • Building the data provider: a data provider needs to be built for the chosen data gateway.

  • Starting and shutting down the data gateway.
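
The three responsibilities above can be read as a small lifecycle. The sketch below is not the real SenseiIndexingManager interface: every class and method name in it is hypothetical, and the per-shard Zoie system is stood in for by a plain string. It only illustrates the order of operations (one engine per shard, then a data provider, then gateway start and shutdown).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexingManagerSketch {
    // Hypothetical stand-in: one indexing engine (a Zoie system in Sensei) per shard.
    private final Map<Integer, String> enginePerShard = new LinkedHashMap<>();
    private String gatewayType; // set once a data provider has been built
    private boolean running;

    /** Step 1: create one engine for every shard served by this node. */
    void initialize(int[] shardIds) {
        for (int shardId : shardIds) {
            enginePerShard.put(shardId, "engine-for-shard-" + shardId);
        }
    }

    /** Step 2: build the data provider for the chosen data gateway. */
    void buildDataProvider(String gatewayType) {
        this.gatewayType = gatewayType;
    }

    /** Step 3a: start consuming from the gateway; requires a data provider. */
    void start() {
        if (gatewayType == null) {
            throw new IllegalStateException("build the data provider first");
        }
        running = true;
    }

    /** Step 3b: stop consuming from the gateway. */
    void shutdown() {
        running = false;
    }

    int engineCount() {
        return enginePerShard.size();
    }

    boolean isRunning() {
        return running;
    }

    public static void main(String[] args) {
        IndexingManagerSketch manager = new IndexingManagerSketch();
        manager.initialize(new int[] {0, 1}); // this node serves shards 0 and 1
        manager.buildDataProvider("kafka");
        manager.start();
        System.out.println(manager.engineCount() + " engines, running=" + manager.isRunning());
        manager.shutdown();
    }
}
```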

In most cases you can simply use the default indexing manager provided by Sensei: com.sensei.indexing.api.DefaultStreamingIndexingManager. You can, however, write your own version when needed.

The type of indexing manager is specified via the configuration parameter sensei.index.manager, whose value is the bean ID of the indexing manager object that you use. When sensei.index.manager is not set, Sensei uses DefaultStreamingIndexingManager.

Once the indexing manager is selected, the next thing to set up is the data gateway. Which data gateway to use depends on how your original source data is stored. Sensei provides four types of built-in data gateways to cover the most common data sources (see Figure 1.2, “Sensei Data Gateway”), but you can also write your own if needed.

The data gateway type is specified via the configuration parameter sensei.index.manager.<indexing-manager-type>.type, where <indexing-manager-type> is the bean ID of your indexing manager. If you use the default indexing manager, DefaultStreamingIndexingManager, <indexing-manager-type> should be default.

Additional configuration parameters may be needed for the data gateway you choose. These configuration parameters are named with the following prefix:

  sensei.index.manager.<indexing-manager-type>.<data-gateway-type>

For example, if you use the default indexing manager and the Kafka data gateway, the following configuration parameters need to be specified:

  sensei.index.manager.default.type = kafka
  sensei.index.manager.default.kafka.host = my-kafka-host
  sensei.index.manager.default.kafka.port = 1234
  sensei.index.manager.default.kafka.topic = log-data
  sensei.index.manager.default.kafka.batchsize = 10000
  sensei.index.manager.default.kafka.filter = my-kafka-filter
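
To make the naming scheme concrete, the following sketch resolves the gateway type and collects the gateway-specific parameters from a java.util.Properties object. It is an illustration of the key layout only, not how Sensei itself reads its configuration; the class and method names are hypothetical, and falling back to the manager ID "default" when sensei.index.manager is unset is an assumption based on the description above.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

final class GatewayConfigSketch {
    static final String MANAGER_KEY = "sensei.index.manager";

    /** Returns the indexing manager ID; assumes "default" when the key is unset. */
    static String managerId(Properties props) {
        return props.getProperty(MANAGER_KEY, "default");
    }

    /** Reads sensei.index.manager.<indexing-manager-type>.type. */
    static String gatewayType(Properties props) {
        String key = MANAGER_KEY + "." + managerId(props) + ".type";
        String type = props.getProperty(key);
        if (type == null) {
            throw new IllegalArgumentException("missing configuration parameter: " + key);
        }
        return type;
    }

    /** Collects every parameter under the gateway prefix, with the prefix stripped. */
    static Map<String, String> gatewayParams(Properties props) {
        String prefix = MANAGER_KEY + "." + managerId(props) + "." + gatewayType(props) + ".";
        Map<String, String> params = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                params.put(key.substring(prefix.length()), props.getProperty(key));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("sensei.index.manager.default.type", "kafka");
        props.setProperty("sensei.index.manager.default.kafka.host", "my-kafka-host");
        props.setProperty("sensei.index.manager.default.kafka.port", "1234");
        props.setProperty("sensei.index.manager.default.kafka.topic", "log-data");
        System.out.println(gatewayType(props) + " " + gatewayParams(props));
    }
}
```

With the Kafka settings shown above, gatewayParams would yield the keys host, port, topic, batchsize, and filter.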

In the rest of the document, we will use the default indexing manager in most of the examples, unless a different indexing manager type is specified explicitly.