Index Sharding

Index Sharding
Prev	Chapter 5. Indexing	Next

Index sharding is needed when the amount of data to be indexed is too big for one single machine to handle. Most of the times sharding is required because the disk space on a single machine is not big enough, but limited memory or limited CPU power can also be the reason.

Index sharding is controlled by the following two configuration parameters:

sensei.index.manager.<indexing-manager-type>.maxpartition.id
sensei.node.partitions

The first parameter tells Sensei how many shards in total the index will be divided into, and the second parameter tells Sensei how many and what shards are handled by the current node.

For example, if you want to divide the entire index into 10 shards (shard 0, shard 1, ..., shard 9), and you want to put the first two shards onto the first Sensei node, then you just need to add the following two lines to your configuration file:

  sensei.index.manager.default.maxpartition.id = 9
  sensei.node.partitions = 0,1

How to split your data into different shards is up to the business logic of the application. Sensei allows you to provide a sharding strategy plug-in (see com.sensei.indexing.api.ShardingStrategy) to let the indexing manager know what data should belong to which shards.

A simple but common sharding strategy is implemented in com.sensei.indexing.api.ShardingStrategy: FieldModShardingStrategy. This is basically the round-robin style. To make this sharding strategy work, you need to specify the total number of shards and on which data field the data should be sharded.

If data that does not belong to any partition on a Sensei node is passed to the indexing manager, it is discarded.