This contrib package provides a utility to build or update an index
using Map/Reduce.

A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
contains the main() method, which uses a Map/Reduce job to analyze documents
and update Lucene instances in parallel.

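As an illustrative sketch only (the contrib package's actual document routing is pluggable and may differ), a simple hash-based policy could assign each document to a shard like this:

```java
// Illustrative sketch: hash-based shard assignment, assuming documents
// are identified by a string ID. This is NOT the package's actual
// distribution policy, which is configurable; it only shows the idea of
// partitioning an index into a fixed number of shards.
public class ShardAssignment {

    // Map a document ID to one of numShards shards.
    static int shardFor(String docId, int numShards) {
        // Math.floorMod avoids a negative index when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
    }
}
```

Because the mapping is deterministic, repeated updates to the same document always reach the same shard.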
The Map phase of the Map/Reduce job formats, analyzes and parses the input
(in parallel), while the Reduce phase collects and applies the updates to
each Lucene instance (again in parallel). The updates are applied on the
local file system where a Reduce task runs and are then copied back to HDFS.
For example, if the updates cause a new Lucene segment to be created, the
new segment is created on the local file system first and then copied
back to HDFS.

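The local-then-copy pattern above can be sketched in plain Java. In this sketch an ordinary local directory stands in for HDFS; the real job copies files with Hadoop's org.apache.hadoop.fs.FileSystem API, not java.nio:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the "apply locally, then copy back" pattern. A plain local
// directory stands in for HDFS; the actual job uses Hadoop's FileSystem
// API. File and directory names here are illustrative only.
public class LocalThenCopy {

    // Copy one finished file from the task's local work area to the
    // shared store, replacing any previous copy.
    static Path copyBack(Path localFile, Path sharedDir) throws IOException {
        return Files.copy(localFile, sharedDir.resolve(localFile.getFileName()),
                          StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path local = Files.createTempDirectory("local-work");    // reduce task's local FS
        Path shared = Files.createTempDirectory("shared-store"); // stand-in for HDFS

        // 1. Apply the update on the local file system (fast local I/O).
        Path segment = local.resolve("segment_1");
        Files.write(segment, "new segment data".getBytes());

        // 2. Copy the finished segment back to the shared location.
        Path result = copyBack(segment, shared);
        System.out.println("copied to " + result);
    }
}
```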
When the Map/Reduce job completes, a "new version" of the index is ready
to be queried. It is important to note that the new version of the index
is not built from scratch. By leveraging Lucene's update algorithm, the
new version of each Lucene instance shares as many files as possible
with the previous version.

The main() method in UpdateIndex requires the following information for
updating the shards:
  - Input formatter. This specifies how to format the input documents.
  - Analysis. This defines the analyzer to use on the input. The analyzer
    determines whether a document is being inserted, updated, or deleted.
    For inserts or updates, the analyzer also converts each input document
    into a Lucene document.
  - Input paths. This provides the location(s) of updated documents,
    e.g., HDFS files or directories, or HBase tables.
  - Shard paths, or index path with the number of shards. Either specify
    the path for each shard, or specify an index path and the shards are
    the sub-directories of the index directory.
  - Output path. When the update to a shard is done, a message is put here.
  - Number of map tasks.

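For reference, an invocation might look like the following. The jar name and paths are placeholders, and the option names should be verified against the usage message printed by UpdateIndex for your build:

```shell
# Hedged sketch: run the index update job on a cluster. The -shards
# option and the -indexPath/-numShards pair are alternatives, as
# described above. Paths and the jar name are illustrative only.
hadoop jar hadoop-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex \
  -inputPaths  /user/me/docs \
  -outputPath  /user/me/update-status \
  -indexPath   /user/me/index \
  -numShards   4 \
  -numMapTasks 8 \
  -conf        conf/index-config.xml
```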
All of the information can be specified in a configuration file. All but
the first two can also be specified as command line options. Check out
conf/index-config.xml.template for other configurable parameters.

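The configuration file follows the standard Hadoop configuration XML format; a sketch is shown below. The property names here are illustrative placeholders only — take the actual names from conf/index-config.xml.template:

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch: property names below are placeholders, not the
     package's real parameter names. See conf/index-config.xml.template. -->
<configuration>
  <property>
    <name>example.index.num.shards</name>
    <value>4</value>
  </property>
  <property>
    <name>example.index.input.format</name>
    <value>org.apache.hadoop.mapred.TextInputFormat</value>
  </property>
</configuration>
```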
Note: Because of the parallel nature of Map/Reduce, the behaviour of
multiple inserts, deletes or updates to the same document is undefined.