This contrib package provides a utility to build or update an index
using Map/Reduce.

A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
contains the main() method which uses a Map/Reduce job to analyze documents
and update Lucene instances in parallel.

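How a given document is routed to a shard depends on the configured
distribution policy. As a rough sketch only (the class name and the
hash-modulo scheme below are illustrative assumptions, not the package's
actual policy), routing by document id might look like:

```java
// Illustrative sketch: map a document id to one of N shards.
// A hash-modulo scheme is assumed here for illustration; the contrib
// package's real shard assignment is determined by its configured
// distribution policy, which may differ.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Returns a shard index in [0, numShards); Math.floorMod keeps the
    // result non-negative even when hashCode() is negative.
    public int shardFor(String documentId) {
        return Math.floorMod(documentId.hashCode(), numShards);
    }
}
```

Under such a scheme the same document id always maps to the same shard,
which is what lets a later update or delete reach the Lucene instance
holding the earlier version of the document.
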
The Map phase of the Map/Reduce job formats, analyzes and parses the input
(in parallel), while the Reduce phase collects and applies the updates to
each Lucene instance (again in parallel). The updates are applied on the
local file system of the node where a Reduce task runs, and the results
are then copied back to HDFS. For example, if the updates cause a new
Lucene segment to be created, the new segment is created on the local
file system first, and then copied back to HDFS.

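The "build locally, then copy back" pattern can be sketched with plain
local directories standing in for HDFS (the class and method names below
are hypothetical; the real Reduce task works with Lucene segment files
and the Hadoop FileSystem API):

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the pattern described above: a Reduce task builds index files
// in a local working directory, then publishes them to the shard's
// directory. A local directory stands in for HDFS here, for illustration.
public class LocalThenPublish {
    public static void publish(Path localWorkDir, Path shardDir) throws IOException {
        Files.createDirectories(shardDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(localWorkDir)) {
            for (Path f : files) {
                // Copy each newly written file (e.g. a new Lucene segment)
                // into the shard directory, replacing any stale copy.
                Files.copy(f, shardDir.resolve(f.getFileName().toString()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```
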
When the Map/Reduce job completes, a "new version" of the index is ready
to be queried. It is important to note that the new version of the index
is not rebuilt from scratch. By leveraging Lucene's update algorithm, the
new version of each Lucene instance shares as many files as possible
with the previous version.

The main() method in UpdateIndex requires the following information for
updating the shards:
- Input formatter. This specifies how to format the input documents.
- Analysis. This defines the analyzer to use on the input. The analyzer
  determines whether a document is being inserted, updated, or deleted.
  For inserts or updates, the analyzer also converts each input document
  into a Lucene document.
- Input paths. This provides the location(s) of updated documents,
  e.g., HDFS files or directories, or HBase tables.
- Shard paths, or index path with the number of shards. Either specify
  the path for each shard, or specify an index path and the shards are
  the sub-directories of the index directory.
- Output path. When the update to a shard is done, a message is put here.
- Number of map tasks.

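When only an index path is given, the shards are taken to be the
sub-directories of that index directory. A minimal sketch of that
convention (class and method names here are illustrative, not part of
the package):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the "index path with number of shards" convention: each
// sub-directory of the index directory is treated as one shard.
public class ShardPaths {
    public static List<Path> shardsUnder(Path indexPath) throws IOException {
        List<Path> shards = new ArrayList<>();
        try (DirectoryStream<Path> entries =
                 Files.newDirectoryStream(indexPath, Files::isDirectory)) {
            for (Path p : entries) {
                shards.add(p);
            }
        }
        // Sort for a stable shard ordering across runs.
        shards.sort(Comparator.comparing(Path::toString));
        return shards;
    }
}
```
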
All of the information can be specified in a configuration file. All but
the first two can also be specified as command line options. Check out
conf/index-config.xml.template for other configurable parameters.

Note: Because of the parallel nature of Map/Reduce, the behaviour of
multiple inserts, deletes or updates to the same document is undefined.