
Indexing Made Easy: Sharded Bulk Indexing with SolrCloud and MapReduce PT. II

Every day, RiskIQ extracts and aggregates enormous volumes of telemetry data from the billions of HTTP requests performed by our web crawling network, but making that data valuable, applicable, and accessible to our customers is a challenge that never stops testing us.

After clearing the 2.1 billion document hurdle we discussed last week in our blog covering the challenges of sharded bulk indexing with SolrCloud and MapReduce, the next issue was lurking just over the horizon—and only became apparent as our daily delta indexing jobs began to run. After a few days of delta updates, our engineers began encountering inconsistent and incorrect result counts when performing queries.

Oddly, if the maximum allowed size of the query’s result set exceeded the total number of matching documents in the collection, the matching-document count (numFound) returned by Solr was consistent and appeared to be correct. However, if, for example, the result set was restricted to ten results (rows = 10) but the collection contained one hundred documents that met the query criteria, the total result count could be more than one hundred and could differ from one identical request to the next.
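To illustrate the symptom, here is a minimal SolrJ (6.x) sketch of the kind of check that exposed it; the ZooKeeper address, collection name, and query are placeholders rather than our actual values.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    // Issue the same restricted query twice and compare numFound. On a collection
    // where duplicates of a document are spread across shards, the counts can disagree.
    public class NumFoundCheck {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181")     // placeholder ZooKeeper ensemble
                    .build()) {
                client.setDefaultCollection("crawl_data");        // placeholder collection name

                SolrQuery query = new SolrQuery("host:example.com"); // placeholder query
                query.setRows(10);                                   // restrict the result set

                long first = client.query(query).getResults().getNumFound();
                long second = client.query(query).getResults().getNumFound();
                System.out.println("numFound: " + first + " vs " + second);
            }
        }
    }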

In moving to SolrCloud and multi-shard collections, we had not considered SolrCloud’s document routing strategies. In the past, when adding new documents to a single core (or a collection with a single shard) via HTTP, there was only one possible destination for our documents, so routing was never an issue. However, when we began creating multi-shard collections using the method outlined in part 1, SolrCloud created our collections using its “implicit” routing strategy.

Implicit routing is designed to let the client dictate which shard should receive each incoming document. We had not been attaching that information to our HTTP indexing requests and had no way of determining the correct destination shard at indexing time. In the absence of explicit direction, SolrCloud was routing our incoming documents to random shards within the collection. That is acceptable for brand-new documents, but it causes problems for updates: an updated document can land on a different shard than the original, so multiple versions of the same document can exist simultaneously in one collection, one in each shard.
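Had we stayed with implicit routing, each update request would have needed to name its destination shard explicitly. The sketch below shows roughly what that looks like in SolrJ using the _route_ parameter; the shard, field, and collection names are illustrative.

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    // With the "implicit" router, the client is responsible for choosing the shard.
    // The _route_ parameter names the destination shard on the update request;
    // without it, the document's placement is left to chance.
    public class ExplicitRoutingExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181")                       // placeholder ZooKeeper address
                    .build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "http://example.com/page");    // illustrative document
                doc.addField("host", "example.com");

                UpdateRequest update = new UpdateRequest();
                update.add(doc);
                update.setParam("_route_", "shard2");             // client-chosen destination shard
                update.process(client, "crawl_data");             // placeholder collection name
                client.commit("crawl_data");
            }
        }
    }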

The odd behavior that we were encountering at query time can be explained by the way Solr conducts queries on collections with multiple shards. Internally, SolrCloud will replicate an incoming query across each shard in the target collection. The results from each shard are combined before being returned to the client. If multiple documents with the same unique ID are present in the result set, Solr will keep the first such document and discard all duplicates. However, due to result size restrictions, there may be duplicate documents which meet the query criteria but are not included in the per-shard result sets, so those documents will not be de-duplicated and will be counted multiple times.

To facilitate more predictable indexing behavior, SolrCloud provides a “compositeId” routing strategy, which is configured by default when collections are created using the collections API. Under this strategy, each shard is assigned a range of 32-bit numbers. The unique id of each incoming document is hashed, and the document is routed to the shard with a range encompassing the hashed value. Thus, a document with a given ID will always be assigned to the same shard.
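To make the mechanics concrete, the sketch below hashes a couple of made-up document IDs with the same MurmurHash3 variant Solr applies to plain (non-composite) IDs and maps each hash onto one of a handful of equal-width ranges; the equal split is an approximation of the ranges Solr computes when it creates a collection.

    import org.apache.solr.common.util.Hash;

    // A document ID always hashes to the same 32-bit value, so with fixed shard
    // ranges it always lands on the same shard, no matter when it is (re)indexed.
    public class CompositeIdHashDemo {
        public static void main(String[] args) {
            int numShards = 4;                                    // illustrative shard count
            String[] ids = {"http://example.com/a", "http://example.com/b"};

            for (String id : ids) {
                int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);
                long offset = (long) hash - Integer.MIN_VALUE;        // shift into [0, 2^32)
                int shard = (int) (offset * numShards / (1L << 32));  // equal-width ranges
                System.out.printf("%s -> hash %d -> shard%d%n", id, hash, shard + 1);
            }
        }
    }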

We had identified the problem and its solution, but implementing the fix was not as simple as altering a configuration. Understandably, SolrCloud does not provide an API method to change the routing strategy of an existing collection. While routing strategies and shard hash ranges can be altered manually by editing SolrCloud’s clusterstate.json configuration file, this is not recommended, and we did not feel comfortable using that method in production.

The Solr MapReduce library does provide a MapReduce partitioner that is intended to work with compositeId routing, but it is designed to retrieve and use hash ranges from the shards of an existing collection on a live SolrCloud cluster. Since we recreated all of our indices each week and wanted to be able to alter the number of shards in a collection when generating a new version of an index, the included partitioner wouldn’t quite work. We made some alterations that enable the partitioner to create shard ranges based solely on the desired number of shards in our collection.
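The sketch below captures the shape of that change, assuming a Hadoop partitioner keyed by document ID; it is illustrative rather than our actual patch, and it reuses the same hash-to-range mapping shown earlier.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.solr.common.util.Hash;

    // Rather than asking a live collection for its hash ranges, derive them from
    // the desired shard count alone so that shards can be built entirely offline.
    public class OfflineShardPartitioner<V> extends Partitioner<Text, V> {

        @Override
        public int getPartition(Text docId, V value, int numPartitions) {
            // Hash the document ID the way compositeId routing does for plain IDs.
            String id = docId.toString();
            int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);

            // Split the 32-bit hash space into numPartitions equal ranges and return
            // the index of the range containing this hash. For exact parity, the split
            // must match the ranges Solr assigns when the collection is created.
            long offset = (long) hash - Integer.MIN_VALUE;           // 0 .. 2^32 - 1
            return (int) (offset * numPartitions / (1L << 32));
        }
    }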

Using our modified partitioner, we were able to successfully create shards composed of documents that were routed by their composite IDs. Publishing the new shards to SolrCloud and convincing Solr to assign them the proper hash ranges would be the final challenge. Several more rounds of trial, error, and cursing ensued. Eventually, a new workflow emerged:

  1. Create shards offline using the partitioner.
  2. Copy the shards to the destination SolrCloud nodes using rsync.
  3. Publish the configuration to Zookeeper as before.
  4. Create a new collection with the appropriate number of shards using the collections API.
  5. Iterate over the newly created, empty shards in the new collection. Delete all replicas for each shard using the collections API, then use the core admin API to create cores to replace the deleted replicas (see the sketch after this list). As before, the location of the appropriate pre-generated shard should be specified. Additionally, the correct shard id must be provided so that the hash range assigned by Solr will be the same range used by the partitioner. It is critical that these deletion and core creation operations be performed on one shard at a time, rather than deleting all replicas first and then attempting to create all of the new cores after that.
  6. Create an alias for the new collection.
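To make steps 5 and 6 more concrete, here is a rough sketch of the per-shard delete-and-recreate sequence expressed as plain HTTP calls against the collections and core admin APIs. The node address, collection name, replica names, and data paths are placeholders; real code would read the replica names from the live cluster state instead of guessing them, and this sketch assumes one replica per shard.

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Step 5: for each shard, drop the empty replica the collections API created,
    // then register a core on the node that already holds the pre-built index.
    // Step 6: point the serving alias at the freshly published collection.
    public class PublishShards {
        static void httpGet(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            if (conn.getResponseCode() != 200) {
                throw new IllegalStateException("Request failed: " + url);
            }
            conn.disconnect();
        }

        public static void main(String[] args) throws Exception {
            String solrBase = "http://solr-node1:8983/solr";      // placeholder node
            String collection = "crawl_data_v2";                  // placeholder collection
            int numShards = 4;

            for (int i = 1; i <= numShards; i++) {
                String shard = "shard" + i;

                // Delete the empty replica for this shard ("core_node" + i is a placeholder;
                // the real name comes from the cluster state).
                httpGet(solrBase + "/admin/collections?action=DELETEREPLICA"
                        + "&collection=" + collection
                        + "&shard=" + shard
                        + "&replica=core_node" + i);

                // Re-create the core, pointing dataDir at the rsynced index and, critically,
                // naming the shard so Solr assigns the hash range the partitioner used.
                httpGet(solrBase + "/admin/cores?action=CREATE"
                        + "&name=" + collection + "_" + shard + "_replica1"
                        + "&collection=" + collection
                        + "&shard=" + shard
                        + "&dataDir=/data/solr/" + collection + "/" + shard);
            }

            // Create (or move) the alias that clients query.
            httpGet(solrBase + "/admin/collections?action=CREATEALIAS"
                    + "&name=crawl_data&collections=" + collection);
        }
    }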

As mentioned earlier, Solr MapReduce is no longer maintained as part of the Lucene/Solr codebase in version 6.6.0 and beyond. Since that module has become an integral part of our indexing pipeline, we have extracted it from version 6.5.1 of the Solr codebase, which you can find here.

The intrepid engineering team here at RiskIQ will continue to update this library as we move to newer versions of Solr and, of course, tackle any more challenges that come our way.

