source: proiecte/HadoopJUnit/hadoop-0.20.1/src/contrib/index/README @ 176

Last change on this file since 176 was 120, checked in by (none), 14 years ago

Added the mail files for the Hadoop JUNit Project

  • Property svn:executable set to *
File size: 2.3 KB
Line 
1This contrib package provides a utility to build or update an index
2using Map/Reduce.
3
4A distributed "index" is partitioned into "shards". Each shard corresponds
5to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
6contains the main() method which uses a Map/Reduce job to analyze documents
7and update Lucene instances in parallel.
8
9The Map phase of the Map/Reduce job formats, analyzes and parses the input
10(in parallel), while the Reduce phase collects and applies the updates to
11each Lucene instance (again in parallel). The updates are applied using the
12local file system where a Reduce task runs and then copied back to HDFS.
13For example, if the updates caused a new Lucene segment to be created, the
14new segment would be created on the local file system first, and then
15copied back to HDFS.
16
17When the Map/Reduce job completes, a "new version" of the index is ready
18to be queried. It is important to note that the new version of the index
19is not derived from scratch. By leveraging Lucene's update algorithm, the
20new version of each Lucene instance will share as many files as possible
21as the previous version.
22
23The main() method in UpdateIndex requires the following information for
24updating the shards:
25  - Input formatter. This specifies how to format the input documents.
26  - Analysis. This defines the analyzer to use on the input. The analyzer
27    determines whether a document is being inserted, updated, or deleted.
28    For inserts or updates, the analyzer also converts each input document
29    into a Lucene document.
30  - Input paths. This provides the location(s) of updated documents,
31    e.g., HDFS files or directories, or HBase tables.
32  - Shard paths, or index path with the number of shards. Either specify
33    the path for each shard, or specify an index path and the shards are
34    the sub-directories of the index directory.
35  - Output path. When the update to a shard is done, a message is put here.
36  - Number of map tasks.
37
38All of the information can be specified in a configuration file. All but
39the first two can also be specified as command line options. Check out
40conf/index-config.xml.template for other configurable parameters.
41
42Note: Because of the parallel nature of Map/Reduce, the behaviour of
43multiple inserts, deletes or updates to the same document is undefined.
Note: See TracBrowser for help on using the repository browser.