You are here

Solr 6 sharding methods

When an index grows too large to be stored on a single search server, it can be distributed across multiple search servers. This is known as sharding. The distributed/sharded index can then be searched using Alfresco/Solr's distributed search capabilities.

Solr 6 can use any of these four different methods for routing documents and ACLs to shards.

  • ACL (ACL_ID) v1: This sharding method is available in Alfresco Search Services 1.0 and later versions. Nodes and access control lists are grouped by their ACL id. This places the nodes together with all the access control information required to determine the access to a node in the same shard. Both the nodes and access control information are sharded. The overall index size will be smaller than other methods as the ACL index information is not duplicated in every shard. Also, the ACL count is usually much smaller than the node count.

    This method is beneficial if you have lots of ACLs and the documents are evenly distributed over those ACLs.For example, if you have many Share sites, nodes and ACLs are assigned to shards randomly based on the ACL and the documents to which it applies.

    The node distribution may be uneven as it depends how many nodes share ACLs. This method replaces the previous ACL based sharding method used in Solr 4 and distributes ACLs over the shards. Each shard contains only the access control information for the nodes it contains.

    To use this method, when creating a shard add a new configuration property:
    shard.method=ACL_ID
  • ACL (ACL_ID) v2: This method is available in Alfresco Search Services 1.0 and later versions.
    This sharding method is the same as ACL ID v1 except that the murmur hash of the ACL ID is used in preference to its modulus. This gives better distribution of ACLs over shards. The distribution of documents over ACLs is not affected and can still be skewed.
    shard.method=ACL_ID
  • DBID (DB_ID): This method is available in Alfresco Search Services 1.0 and later versions and is the default sharding option in Solr 6. Nodes are evenly distributed over the shards at random based on the murmur hash of the DBID. The access control information is duplicated in each shard. The distribution of nodes over each shard is very even and shards grow at the same rate. Also, this is the fall back method if any other sharding information is unavailable.
    To use this method, when creating a shard add a new configuration property:
    shard.method=DB_ID
  • DBID range (DB_ID_RANGE): Alfresco Content Services 5.2.1 supports a new shard method based on DBID range. This routes documents within specific DBID ranges to specific shards. It adds new shards to the cluster without requiring a reindex.

    DBID range sharding is the only option to offer auto-scaling as opposed to defining your exact shard count at the start. All the other sharding methods require repartitioning in some way.

    For each shard, you specify the range of DBIDs to be included. As your repository grows you can add shards.

    Example 1: You may aim for shards of 20M nodes in size and expect it to get to 100M over five years. You could create the first shard for nodes 0-20M and set the maximum number of shards to 10. As you approach node 20M, you can create the next shard for nodes 20M-40M, and so on. Again, each shard has all the access control information.

    To use this method, when creating a shard add a new configuration property:
    shard.method=DB_ID_RANGE
    shard.range=0-20000000
    Example 2: If there are 100M (million) nodes and you want to split them into 10 shards with 10M nodes each. So, at the start you can specify:
    • 10 shards
    • all shards have all ACLs
    • specify a shard to include 0-10M - the range will be inclusive of the bottom value and exclusive of the top value.
    • the second shard will have 10M - 20M nodes, third shard will have 20M - 30M nodes, and so on.

    Date based queries may produce results from only a sub set of shards as DBID increases monotonically over time.

  • Date/Datetime (DATE): The date-based sharding assigns dates sequentially through shards based on the month.

    For example: If there are 12 shards, each month would be assigned sequentially to each shard, wrapping round and starting again for each year. The non-random assignment facilitates easier shard management - dropping shards or scaling out replication for some date range. Typical ageing strategies could be based on the created date or destruction date.

    Each shard contains copies of all the ACL information, so this information is replicated in each shard. However, if the property is not present on a node, sharding falls back to the DBID murmur hash to randomly distribute these nodes.

    To use this method, when creating a shard add the new configuration properties:
    shard.key=exif:dateTimeOriginal
    shard.method=DATE
    Months can be grouped together, for example, by quarter. Each quarter of data would be assigned sequentially through the available shards.
    shard.date.grouping=3
  • Metadata (PROPERTY): This method is available in Alfresco Search Services 1.0 and later versions. In this method, the value of some property is hashed and this hash is used to assign the node to a random shard. All nodes with the same property value will be assigned to the same shard. Each shard will duplicate all the ACL information.

    Only properties of type d:text, d:date and d:datetime can be used. For example, the recipient of an email, the creator of a node, some custom field set by a rule, or by the domain of an email recipient. The keys are randomly distributed over the shards using murmur hash.

    Each shard contains copies of all the ACL information, so this information is replicated in each shard. However, if the property is not present on a node, sharding falls back to the DBID murmur hash to randomly distribute these nodes.

    To use this method, when creating a shard add the new configuration properties:
    shard.key=cm:creator
    shard.method=PROPERTY

    It is possible to extract a part of the property value to use for sharding using a regular expression, for example, a year at the start of a string:

    shard.regex=^\d{4}
Note: The ACL (MOD_ACL_ID) sharding method is used only in Solr4.

Availability matrix

Index Engine ACL v1 DB ID Date/time Metadata ACL v2 DBID range
Alfresco Content Services 5.2.0 + Solr 4
Alfresco Content Services 5.2.0 + Alfresco Search Services 1.0
Alfresco Content Services 5.2.1 + Alfresco Search Services 1.1

Comparison Overview

Index Engine ACL v1 DB ID Date/time Metadata ACL v2 DBID range
All shards required
ACLs replicated on all shards
Can add shards as the index grows
Even shards
Falls back to DBID sharding
One shard gets new content
Possible Possible
Query may use one shard
Possible Possible
Possible
Has Admin advantages
Possible
Nodes can move shard

Sending feedback to the Alfresco documentation team

You don't appear to have JavaScript enabled in your browser. With JavaScript enabled, you can provide feedback to us using our simple form. Here are some instructions on how to enable JavaScript in your web browser.