final. @param name resource to be added, the classpath is examined for a file with that name.]]> final. @param url url of the resource to be added, the local filesystem is examined directly to find the resource, without referring to the classpath.]]> final. @param file file-path of resource to be added, the local filesystem is examined directly to find the resource, without referring to the classpath.]]> final. @param in InputStream to deserialize the object from.]]> name property, null if no such property exists. Values are processed for variable expansion before being returned. @param name the property name. @return the value of the name property, or null if no such property exists.]]> name property, without doing variable expansion. @param name the property name. @return the value of the name property, or null if no such property exists.]]> value of the name property. @param name property name. @param value property value.]]> name property. If no such property exists, then defaultValue is returned. @param name property name. @param defaultValue default value. @return property value, or defaultValue if the property doesn't exist.]]> name property as an int. If no such property exists, or if the specified value is not a valid int, then defaultValue is returned. @param name property name. @param defaultValue default value. @return property value as an int, or defaultValue.]]> name property to an int. @param name property name. @param value int value of the property.]]> name property as a long. If no such property is specified, or if the specified value is not a valid long, then defaultValue is returned. @param name property name. @param defaultValue default value. @return property value as a long, or defaultValue.]]> name property to a long. @param name property name. @param value long value of the property.]]> name property as a float. If no such property is specified, or if the specified value is not a valid float, then defaultValue is returned. @param name property name. @param defaultValue default value. @return property value as a float, or defaultValue.]]> name property to a float. @param name property name. @param value property value.]]> name property as a boolean. If no such property is specified, or if the specified value is not a valid boolean, then defaultValue is returned. @param name property name. @param defaultValue default value. @return property value as a boolean, or defaultValue.]]> name property to a boolean. @param name property name. @param value boolean value of the property.]]> name property as a collection of Strings. If no such property is specified then empty collection is returned.

This is an optimized version of {@link #getStrings(String)} @param name property name. @return property value as a collection of Strings.]]> name property as an array of Strings. If no such property is specified then null is returned. @param name property name. @return property value as an array of Strings, or null.]]> name property as an array of Strings. If no such property is specified then default value is returned. @param name property name. @param defaultValue The default value @return property value as an array of Strings, or default value.]]> name property as as comma delimited values. @param name property name. @param values The values]]> name property as an array of Class. The value of the property specifies a list of comma separated class names. If no such property is specified, then defaultValue is returned. @param name the property name. @param defaultValue default value. @return property value as a Class[], or defaultValue.]]> name property as a Class. If no such property is specified, then defaultValue is returned. @param name the class name. @param defaultValue default value. @return property value as a Class, or defaultValue.]]> name property as a Class implementing the interface specified by xface. If no such property is specified, then defaultValue is returned. An exception is thrown if the returned class does not implement the named interface. @param name the class name. @param defaultValue default value. @param xface the interface implemented by the named class. @return property value as a Class, or defaultValue.]]> name property to the name of a theClass implementing the given interface xface. An exception is thrown if theClass does not implement the interface xface. @param name property name. @param theClass property value. @param xface the interface implemented by the named class.]]> dirsProp with the given path. If dirsProp contains multiple directories, then one is chosen based on path's hash code. If the selected directory does not exist, an attempt is made to create it. @param dirsProp directory in which to locate the file. @param path file-path. @return local file under the directory with the given path.]]> dirsProp with the given path. If dirsProp contains multiple directories, then one is chosen based on path's hash code. If the selected directory does not exist, an attempt is made to create it. @param dirsProp directory in which to locate the file. @param path file-path. @return local file under the directory with the given path.]]> name. @param name configuration resource name. @return an input stream attached to the resource.]]> name. @param name configuration resource name. @return a reader attached to the resource.]]> String key-value pairs in the configuration. @return an iterator over the entries.]]> true to set quiet-mode on, false to turn it off.]]> Resources

Configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or by a {@link Path}. If named by a String, then the classpath is examined for a file with that name. If named by a Path, then the local filesystem is examined directly, without referring to the classpath.

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:

  1. core-default.xml : Read-only defaults for hadoop.
  2. core-site.xml: Site-specific configuration for a given hadoop installation.
Applications may add additional resources, which are loaded subsequent to these resources in the order they are added.
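
For example, a minimal sketch of adding application resources on top of the defaults (the resource names below are hypothetical):

  Configuration conf = new Configuration();           // loads core-default.xml, then core-site.xml
  conf.addResource("my-app-site.xml");                // a String name is looked up on the classpath
  conf.addResource(new Path("/etc/myapp/extra.xml")); // a Path is read directly from the local filesystem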

Final Parameters

Configuration parameters may be declared final. Once a resource declares a value final, no subsequently-loaded resource can alter that value. For example, one might define a final parameter with:

  <property>
    <name>dfs.client.buffer.dir</name>
    <value>/tmp/hadoop/dfs/client</value>
    <final>true</final>
  </property>
Administrators typically define parameters as final in core-site.xml for values that user applications may not alter.

Variable Expansion

Value strings are first processed for variable expansion. The available properties are:

  1. Other properties defined in this Configuration; and, if a name is undefined here,
  2. Properties in {@link System#getProperties()}.

For example, if a configuration resource contains the following property definitions:

  <property>
    <name>basedir</name>
    <value>/user/${user.name}</value>
  </property>
  
  <property>
    <name>tempdir</name>
    <value>${basedir}/tmp</value>
  </property>
When conf.get("tempdir") is called, then ${basedir} will be resolved to another property in this Configuration, while ${user.name} would then ordinarily be resolved to the value of the System property with that name.]]>
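
As a sketch of the expansion above (assuming a resource containing the basedir/tempdir definitions has been added):

  Configuration conf = new Configuration();
  conf.addResource("my-resource.xml");   // hypothetical resource defining basedir and tempdir
  String tempdir = conf.get("tempdir");
  // ${basedir} is resolved from this Configuration, then ${user.name} from the
  // Java system properties, yielding e.g. /user/alice/tmp
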
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.

Applications specify the files, via urls (hdfs:// or http://) to be cached via the {@link org.apache.hadoop.mapred.JobConf}. The DistributedCache assumes that the files specified via hdfs:// urls are already present on the {@link FileSystem} at the path specified by the url.

The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may be optionally added to the classpath of the tasks, a rudimentary software distribution mechanism. Files have execution permissions. Optionally users can also direct it to symlink the distributed cache file(s) into the working directory of the task.

DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.

Here is an illustrative example on how to use the DistributedCache:

     // Setting up the cache for the application
     
     1. Copy the requisite files to the FileSystem:
     
     $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat  
     $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip  
     $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
     $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
     $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
     $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
     
     2. Setup the application's JobConf:
     
     JobConf job = new JobConf();
     DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), 
                                   job);
     DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
     DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
     
     3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
     or {@link org.apache.hadoop.mapred.Reducer}:
     
     public static class MapClass extends MapReduceBase  
     implements Mapper<K, V, K, V> {
     
       private Path[] localArchives;
       private Path[] localFiles;
       
       public void configure(JobConf job) {
         // Get the cached archives/files
         localArchives = DistributedCache.getLocalCacheArchives(job);
         localFiles = DistributedCache.getLocalCacheFiles(job);
       }
       
       public void map(K key, V value, 
                       OutputCollector<K, V> output, Reporter reporter) 
       throws IOException {
         // Use data from the cached archives/files here
         // ...
         // ...
         output.collect(k, v);
       }
     }
     
 

@see org.apache.hadoop.mapred.JobConf @see org.apache.hadoop.mapred.JobClient]]>
BufferedFSInputStream with the specified buffer size, and saves its argument, the input stream in, for later use. An internal buffer array of length size is created and stored in buf. @param in the underlying input stream. @param size the buffer size. @exception IllegalArgumentException if size <= 0.]]> setReplication of FileSystem @param src file name @param replication new replication @throws IOException @return true if successful; false if file does not exist or is a directory]]> fs.scheme.class whose value names the FileSystem class. The entire URI is passed to the FileSystem instance's initialize method.]]> Return all the files that match filePattern and are not checksum files. Results are sorted by their names.

A filename pattern is composed of regular characters and special pattern matching characters, which are:

?
Matches any single character.

*
Matches zero or more characters.

[abc]
Matches a single character from character set {a,b,c}.

[a-b]
Matches a single character from the character range {a...b}. Note that character a must be lexicographically less than or equal to character b.

[^a]
Matches a single character that is not from character set or range {a}. Note that the ^ character must occur immediately to the right of the opening bracket.

\c
Removes (escapes) any special meaning of character c.

{ab,cd}
Matches a string from the string set {ab, cd}

{ab,c{de,fh}}
Matches a string from the string set {ab, cde, cfh}
@param pathPattern a regular expression specifying a path pattern @return an array of paths that match the path pattern @throws IOException]]>
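
A minimal sketch of expanding a glob against a FileSystem (the path pattern below is made up for illustration):

  FileSystem fs = FileSystem.get(new Configuration());
  FileStatus[] matches = fs.globStatus(new Path("/logs/2009-*/part-[0-9]*"));
  if (matches != null) {
    for (FileStatus status : matches) {
      System.out.println(status.getPath());
    }
  }
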
All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem object. The Hadoop DFS is a multi-machine system that appears as a single disk. It's useful because of its fault tolerance and potentially very large capacity.

The local implementation is {@link LocalFileSystem} and the distributed implementation is DistributedFileSystem.]]> FilterFileSystem contains some other file system, which it uses as its basic file system, possibly transforming the data along the way or providing additional functionality. The class FilterFileSystem itself simply overrides all methods of FileSystem with versions that pass all requests to the contained file system. Subclasses of FilterFileSystem may further override some of these methods and may also provide additional methods and fields.]]> buf at offset and checksum into checksum. The method is used for implementing read, therefore, it should be optimized for sequential reading. @param pos chunkPos @param buf destination buffer @param offset offset in buf at which to store data @param len maximum number of bytes to read @return number of bytes read]]> -1 if the end of the stream is reached. @exception IOException if an I/O error occurs.]]> This method implements the general contract of the corresponding {@link InputStream#read(byte[], int, int) read} method of the {@link InputStream} class. As an additional convenience, it attempts to read as many bytes as possible by repeatedly invoking the read method of the underlying stream. This iterated read continues until one of the following conditions becomes true:

  • The specified number of bytes have been read,
  • The read method of the underlying stream returns -1, indicating end-of-file.
If the first read on the underlying stream returns -1 to indicate end-of-file then this method returns -1. Otherwise this method returns the number of bytes actually read. @param b destination buffer. @param off offset at which to start storing bytes. @param len maximum number of bytes to read. @return the number of bytes read, or -1 if the end of the stream has been reached. @exception IOException if an I/O error occurs. ChecksumException if any checksum error occurs]]>
n bytes of data from the input stream.

This method may skip more bytes than are remaining in the backing file. This produces no exception and the number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file. Attempting to read from the stream after skipping past the end will result in -1 indicating the end of the file.

If n is negative, no bytes are skipped. @param n the number of bytes to be skipped. @return the actual number of bytes skipped. @exception IOException if an I/O error occurs. ChecksumException if the chunk to skip to is corrupted]]> This method may seek past the end of the file. This produces no exception and an attempt to read from the stream will result in -1 indicating the end of the file. @param pos the position to seek to. @exception IOException if an I/O error occurs. ChecksumException if the chunk to seek to is corrupted]]> len bytes from stm @param stm an input stream @param buf destination buffer @param offset offset at which to store data @param len number of bytes to read @return actual number of bytes read @throws IOException if there is any IO error]]> len bytes from the specified byte array starting at offset off and generate a checksum for each data chunk.

This method stores bytes from the given array into this stream's buffer before it gets checksummed. The buffer gets checksummed and flushed to the underlying output stream when all data in a checksum chunk are in the buffer. If the buffer is empty and the requested length is at least as large as the next checksum chunk size, this method will checksum and write the chunk directly to the underlying output stream. Thus it avoids unnecessary data copying. @param b the data. @param off the start offset in the data. @param len the number of bytes to write. @exception IOException if an I/O error occurs.]]> true if and only if pathname should be included]]> trash feature. Files are moved to a user's trash directory, a subdirectory of their home directory named ".Trash". Files are initially moved to a current sub-directory of the trash directory. Within that sub-directory their original path is preserved. Periodically one may checkpoint the current trash and remove older checkpoints. (This design permits trash management without enumeration of the full trash content, without date support in the filesystem, and without clock synchronization.)]]> A {@link FileSystem} backed by an FTP client provided by Apache Commons Net.

]]>
(cause==null ? null : cause.toString()) (which typically contains the class and detail message of cause). @param cause the cause (which is saved for later retrieval by the {@link #getCause()} method). (A null value is permitted, and indicates that the cause is nonexistent or unknown.)]]> This class is a tool for migrating data from an older to a newer version of an S3 filesystem.

All files in the filesystem are migrated by re-writing the block metadata - no datafiles are touched.

]]>
Extracts AWS credentials from the filesystem URI or configuration.

]]>
A block-based {@link FileSystem} backed by Amazon S3.

@see NativeS3FileSystem]]>
If f is a file, this method will make a single call to S3. If f is a directory, this method will make a maximum of (n / 1000) + 2 calls to S3, where n is the total number of files and directories contained directly in f.

]]>
A {@link FileSystem} for reading and writing files stored on Amazon S3. Unlike {@link org.apache.hadoop.fs.s3.S3FileSystem} this implementation stores files on S3 in their native form so they can be read by other S3 tools.

@see org.apache.hadoop.fs.s3.S3FileSystem]]>
. @param name The name of the server @param port The port to use on the server @param findPort whether the server should start at the given port and increment by 1 until it finds a free port. @param conf Configuration]]> points to the log directory "/static/" -> points to common static files (src/webapps/static) "/" -> the jsp server code from (src/webapps/)]]> nth value.]]> nth value in the file.]]> public class IntArrayWritable extends ArrayWritable { public IntArrayWritable() { super(IntWritable.class); } } ]]> o is a ByteWritable with the same value.]]> This saves memory over creating a new DataInputStream and ByteArrayInputStream each time data is read.

Typical usage is something like the following:


 DataInputBuffer buffer = new DataInputBuffer();
 while (... loop condition ...) {
   byte[] data = ... get data ...;
   int dataLength = ... get data length ...;
   buffer.reset(data, dataLength);
   ... read buffer using DataInput methods ...
 }
 
]]>
This saves memory over creating a new DataOutputStream and ByteArrayOutputStream each time data is written.

Typical usage is something like the following:


 DataOutputBuffer buffer = new DataOutputBuffer();
 while (... loop condition ...) {
   buffer.reset();
   ... write buffer using DataOutput methods ...
   byte[] data = buffer.getData();
   int dataLength = buffer.getLength();
   ... write data to its ultimate destination ...
 }
 
]]>
the class of the item @param conf the configuration to store @param item the object to be stored @param keyName the name of the key to use @throws IOException : forwards Exceptions from the underlying {@link Serialization} classes.]]> the class of the item @param conf the configuration to use @param keyName the name of the key to use @param itemClass the class of the item @return restored object @throws IOException : forwards Exceptions from the underlying {@link Serialization} classes.]]> the class of the item @param conf the configuration to use @param items the objects to be stored @param keyName the name of the key to use @throws IndexOutOfBoundsException if the items array is empty @throws IOException : forwards Exceptions from the underlying {@link Serialization} classes.]]> the class of the item @param conf the configuration to use @param keyName the name of the key to use @param itemClass the class of the item @return restored object @throws IOException : forwards Exceptions from the underlying {@link Serialization} classes.]]> DefaultStringifier offers convenience methods to store/load objects to/from the configuration. @param the class of the objects to stringify]]> o is a DoubleWritable with the same value.]]> o is a FloatWritable with the same value.]]> When two sequence files, which have same Key type but different Value types, are mapped out to reduce, multiple Value types is not allowed. In this case, this class can help you wrap instances with different types.

Compared with ObjectWritable, this class is much more effective, because ObjectWritable will append the class declaration as a String into the output file in every Key-Value pair.

GenericWritable implements the {@link Configurable} interface, so that it will be configured by the framework. The configuration is passed to the wrapped objects implementing the {@link Configurable} interface before deserialization.

How to use it:
1. Write your own class, such as GenericObject, which extends GenericWritable.
2. Implement the abstract method getTypes(), which defines the classes that will be wrapped in GenericObject in the application. Attention: the classes defined in getTypes() must implement the Writable interface.

The code looks like this:
 public class GenericObject extends GenericWritable {
 
   private static Class[] CLASSES = {
               ClassType1.class, 
               ClassType2.class,
               ClassType3.class,
               };

   protected Class[] getTypes() {
       return CLASSES;
   }

 }
 
@since Nov 8, 2006]]>
This saves memory over creating a new InputStream and ByteArrayInputStream each time data is read.

Typical usage is something like the following:


 InputBuffer buffer = new InputBuffer();
 while (... loop condition ...) {
   byte[] data = ... get data ...;
   int dataLength = ... get data length ...;
   buffer.reset(data, dataLength);
   ... read buffer using InputStream methods ...
 }
 
@see DataInputBuffer @see DataOutput]]>
o is a IntWritable with the same value.]]> closes the input and output streams at the end. @param in InputStrem to read from @param out OutputStream to write to @param conf the Configuration object]]> ignore any {@link IOException} or null pointers. Must only be used for cleanup in exception handlers. @param log the log to record problems to at debug level. Can be null. @param closeables the objects to close]]> o is a LongWritable with the same value.]]> A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by {@link Writer#getIndexInterval()}.

The index file is read entirely into memory. Thus key implementations should try to keep themselves small.

Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with {@link SequenceFile.Sorter}.]]> key and val. Returns true if such a pair exists and false when at the end of the map]]> key or if it does not exist, at the first entry after the named key. - * @param key - key that we're trying to find - * @param val - data value if key is found - * @return - the key that was the closest match or null if eof.]]> key does not exist, return the first entry that falls just before the key. Otherwise, return the record that sorts just after. @return - the key that was the closest match or null if eof.]]> o is an MD5Hash whose digest contains the same values.]]> This saves memory over creating a new OutputStream and ByteArrayOutputStream each time data is written.

Typical usage is something like the following:


 OutputBuffer buffer = new OutputBuffer();
 while (... loop condition ...) {
   buffer.reset();
   ... write buffer using OutputStream methods ...
   byte[] data = buffer.getData();
   int dataLength = buffer.getLength();
   ... write data to its ultimate destination ...
 }
 
@see DataOutputBuffer @see InputBuffer]]>
A {@link Comparator} that operates directly on byte representations of objects.

@param @see DeserializerComparator]]>
SequenceFiles are flat files consisting of binary key/value pairs.

SequenceFile provides {@link Writer}, {@link Reader} and {@link Sorter} classes for writing, reading and sorting respectively.

There are three SequenceFile Writers based on the {@link CompressionType} used to compress key/value pairs:
  1. Writer : Uncompressed records.
  2. RecordCompressWriter : Record-compressed files, only compress values.
  3. BlockCompressWriter : Block-compressed files, both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

The actual compression algorithm used to compress key and/or values can be specified by using the appropriate {@link CompressionCodec}.

The recommended way is to use the static createWriter methods provided by the SequenceFile to choose the preferred format.

The {@link Reader} acts as the bridge and can read any of the above SequenceFile formats.
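
As a sketch of the recommended usage (the path, key type, and value type are chosen for illustration):

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path file = new Path("/tmp/data.seq");

  // Write a block-compressed SequenceFile
  SequenceFile.Writer writer = SequenceFile.createWriter(
      fs, conf, file, Text.class, IntWritable.class, SequenceFile.CompressionType.BLOCK);
  writer.append(new Text("key"), new IntWritable(1));
  writer.close();

  // Read it back; the Reader handles any of the three formats
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
  Text key = new Text();
  IntWritable value = new IntWritable();
  while (reader.next(key, value)) {
    // use key and value
  }
  reader.close();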

SequenceFile Formats

Essentially there are 3 different formats for SequenceFiles depending on the CompressionType specified. All of them share a common header described below.

  • version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName -key class
  • valueClassName - value class
  • compression - A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata - {@link Metadata} for this file.
  • sync - A sync marker to denote end of the header.
Uncompressed SequenceFile Format
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 bytes or so.
Record-Compressed SequenceFile Format
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync-marker every few 100 bytes or so.
Block-Compressed SequenceFile Format
  • Header
  • Record Block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker every few 100 bytes or so.

The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.

@see CompressionCodec]]>
key, skipping its value. True if another entry exists, and false at end of file.]]> key and val. Returns true if such a pair exists and false when at end of file]]> The position passed must be a position returned by {@link SequenceFile.Writer#getLength()} when writing this file. To seek to an arbitrary position, use {@link SequenceFile.Reader#sync(long)}.]]> SegmentDescriptor @param segments the list of SegmentDescriptors @param tmpDir the directory to write temporary files into @return RawKeyValueIterator @throws IOException]]> For best performance, applications should make sure that the {@link Writable#readFields(DataInput)} implementation of their keys is very efficient. In particular, it should avoid allocating memory.]]> This always returns a synchronized position. In other words, immediately after calling {@link SequenceFile.Reader#seek(long)} with a position returned by this method, {@link SequenceFile.Reader#next(Writable)} may be called. However the key may be earlier in the file than key last written when this method was called (e.g., with block-compression, it may be the first key in the block that was being written when this method was called).]]> key. Returns true if such a key exists and false when at the end of the set.]]> key. Returns key, or null if no match exists.]]> the class of the objects to stringify]]> position. Note that this method avoids using the converter or doing String instatiation @return the Unicode scalar value at position or -1 if the position is invalid or points to a trailing byte]]> what in the backing buffer, starting as position start. The starting position is measured in bytes and the return value is in terms of byte position in the buffer. The backing buffer is not converted to a string for this operation. @return byte position of the first occurence of the search string in the UTF-8 buffer or -1 if not found]]> o is a Text with the same contents.]]> replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.]]> replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException. @return ByteBuffer: bytes stores at ByteBuffer.array() and length is ByteBuffer.limit()]]> In addition, it provides methods for string traversal without converting the byte array to a string.

Also includes utilities for serializing/deserializing a string, coding/decoding a string, checking if a byte array contains valid UTF8 code, calculating the length of an encoded string.]]> o is a UTF8 with the same contents.]]> Also includes utilities for efficiently reading and writing UTF-8. @deprecated replaced by Text]]> This is useful when a class may evolve, so that instances written by the old version of the class may still be processed by the new version. To handle this situation, {@link #readFields(DataInput)} implementations should catch {@link VersionMismatchException}.]]> o is a VIntWritable with the same value.]]> o is a VLongWritable with the same value.]]> out. @param out DataOutput to serialize this object into. @throws IOException]]> in.

For efficiency, implementations should attempt to re-use storage in the existing object where possible.

@param in DataInput to deserialize this object from. @throws IOException]]>
Any key or value type in the Hadoop Map-Reduce framework implements this interface.

Implementations typically implement a static read(DataInput) method which constructs a new instance, calls {@link #readFields(DataInput)} and returns the instance.

Example:

     public class MyWritable implements Writable {
       // Some data     
       private int counter;
       private long timestamp;
       
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }
       
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }
       
       public static MyWritable read(DataInput in) throws IOException {
         MyWritable w = new MyWritable();
         w.readFields(in);
         return w;
       }
     }
 

]]>
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.

Example:

     public class MyWritableComparable implements WritableComparable {
       // Some data
       private int counter;
       private long timestamp;
       
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }
       
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }
       
       public int compareTo(MyWritableComparable w) {
         int thisValue = this.counter;
         int thatValue = w.counter;
         return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
       }
     }
 

]]>
The default implementation reads the data into two {@link WritableComparable}s (using {@link Writable#readFields(DataInput)}), then calls {@link #compare(WritableComparable,WritableComparable)}.]]> The default implementation uses the natural ordering, calling {@link Comparable#compareTo(Object)}.]]> This base implementation uses the natural ordering. To define alternate orderings, override {@link #compare(WritableComparable,WritableComparable)}.

One may optimize compare-intensive operations by overriding {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are provided to assist in optimized implementations of this method.]]> Enum type @param in DataInput to read from @param enumType Class type of Enum @return Enum represented by String read from DataInput @throws IOException]]> len number of bytes in input streamin @param in input stream @param len number of bytes to skip @throws IOException when skipped less number of bytes]]> CompressionCodec for which to get the Compressor @return Compressor for the given CompressionCodec from the pool or a new one]]> CompressionCodec for which to get the Decompressor @return Decompressor for the given CompressionCodec the pool or a new one]]> Compressor to be returned to the pool]]> Decompressor to be returned to the pool]]> Implementations are assumed to be buffered. This permits clients to reposition the underlying input stream then call {@link #resetState()}, without having to also synchronize client buffers.]]> true indicating that more input data is required. @param b Input data @param off Start offset @param len Length]]> true if the input data buffer is empty and #setInput() should be called in order to provide more input.]]> true if the end of the compressed data output stream has been reached.]]> true indicating that more input data is required. @param b Input data @param off Start offset @param len Length]]> true if the input data buffer is empty and #setInput() should be called in order to provide more input.]]> true if a preset dictionary is needed for decompression. @return true if a preset dictionary is needed for decompression]]> true if the end of the compressed data output stream has been reached.]]> FIXME: This array should be in a private or package private location, since it could be modified by malicious code.
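
A sketch of the byte-level WritableComparator optimization described above, reusing the MyWritableComparable example shown earlier (its write() method writes counter first, so the first four bytes of each serialized key are assumed to hold counter):

  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(MyWritableComparable.class);
    }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int thisCounter = readInt(b1, s1);   // counter is the first field written
      int thatCounter = readInt(b2, s2);
      return (thisCounter < thatCounter ? -1 : (thisCounter == thatCounter ? 0 : 1));
    }
  }

  static {   // register the optimized comparator for the key class
    WritableComparator.define(MyWritableComparable.class, new Comparator());
  }
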

]]>
This interface is public for historical purposes. You should have no need to use it.

]]>
Although BZip2 headers are marked with the magic "Bz" this constructor expects the next byte in the stream to be the first one after the magic. Thus callers have to skip the first two bytes. Otherwise this constructor will throw an exception.

@throws IOException if the stream content is malformed or an I/O error occurs. @throws NullPointerException if in == null]]>
The decompression requires large amounts of memory. Thus you should call the {@link #close() close()} method as soon as possible, to force CBZip2InputStream to release the allocated memory. See {@link CBZip2OutputStream CBZip2OutputStream} for information about memory usage.

CBZip2InputStream reads bytes from the compressed source stream via the single byte {@link java.io.InputStream#read() read()} method exclusively. Thus you should consider using a buffered source stream.

Instances of this class are not threadsafe.

]]>
CBZip2OutputStream with a blocksize of 900k.

Attention: The caller is responsible for writing the two BZip2 magic bytes "BZ" to the specified stream prior to calling this constructor.

@param out * the destination stream. @throws IOException if an I/O error occurs in the specified stream. @throws NullPointerException if out == null.]]>
CBZip2OutputStream with specified blocksize.

Attention: The caller is responsible for writing the two BZip2 magic bytes "BZ" to the specified stream prior to calling this constructor.

@param out the destination stream. @param blockSize the blockSize as 100k units. @throws IOException if an I/O error occurs in the specified stream. @throws IllegalArgumentException if (blockSize < 1) || (blockSize > 9). @throws NullPointerException if out == null. @see #MIN_BLOCKSIZE @see #MAX_BLOCKSIZE]]>
inputLength this method returns MAX_BLOCKSIZE always. @param inputLength The length of the data which will be compressed by CBZip2OutputStream.]]> == 1.]]> == 9.]]> If you are ever unlucky/improbable enough to get a stack overflow whilst sorting, increase the following constant and try again. In practice I have never seen the stack go above 27 elems, so the following limit seems very generous.

]]>
The compression requires large amounts of memory. Thus you should call the {@link #close() close()} method as soon as possible, to force CBZip2OutputStream to release the allocated memory.

You can shrink the amount of allocated memory and maybe raise the compression speed by choosing a lower blocksize, which in turn may cause a lower compression ratio. You can avoid unnecessary memory allocation by avoiding using a blocksize which is bigger than the size of the input.

You can compute the memory usage for compressing by the following formula:

 <code>400k + (9 * blocksize)</code>.
 

To get the memory required for decompression by {@link CBZip2InputStream CBZip2InputStream} use

 <code>65k + (5 * blocksize)</code>.
 
Memory usage by blocksize:

  Blocksize   Compression memory usage   Decompression memory usage
  100k        1300k                      565k
  200k        2200k                      1065k
  300k        3100k                      1565k
  400k        4000k                      2065k
  500k        4900k                      2565k
  600k        5800k                      3065k
  700k        6700k                      3565k
  800k        7600k                      4065k
  900k        8500k                      4565k

For decompression CBZip2InputStream allocates less memory if the bzipped input is smaller than one block.

Instances of this class are not threadsafe.

TODO: Update to BZip2 1.0.1

]]>
@return the total (non-negative) number of uncompressed bytes input so far]]> @return the total (non-negative) number of uncompressed bytes input so far]]> true if native-zlib is loaded & initialized and can be loaded for this job, else false]]>
  • "none" - No compression.
  • "lzo" - LZO compression.
  • "gz" - GZIP compression. ]]>
  • Block Compression.
  • Named meta data blocks.
  • Sorted or unsorted keys.
  • Seek by key or by file offset. The memory footprint of a TFile includes the following:
    • Some constant overhead of reading or writing a compressed block.
      • Each compressed block requires one compression/decompression codec for I/O.
      • Temporary space to buffer the key.
      • Temporary space to buffer the value (for TFile.Writer only). Values are chunk encoded, so that we buffer at most one chunk of user data. By default, the chunk buffer is 1MB. Reading chunked value does not require additional memory.
    • TFile index, which is proportional to the total number of Data Blocks. The total amount of memory needed to hold the index can be estimated as (56+AvgKeySize)*NumBlocks.
    • MetaBlock index, which is proportional to the total number of Meta Blocks. The total amount of memory needed to hold the index for Meta Blocks can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.

    The behavior of TFile can be customized by the following variables through Configuration:

    • tfile.io.chunk.size: Value chunk size. Integer (in bytes). Defaults to 1MB. Values shorter than the chunk size are guaranteed to have a known value length at read time (See {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
    • tfile.fs.output.buffer.size: Buffer size used for FSDataOutputStream. Integer (in bytes). Defaults to 256KB.
    • tfile.fs.input.buffer.size: Buffer size used for FSDataInputStream. Integer (in bytes). Defaults to 256KB.

    Suggestions on performance optimization.

    • Minimum block size. We recommend a setting of minimum block size between 256KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB.
    • The current implementation does not offer true multi-threading for reading. The implementation uses FSDataInputStream seek()+read(), which is shown to be much faster than positioned-read call in single thread mode. However, it also means that if multiple threads attempt to access the same TFile (using multiple scanners) simultaneously, the actual I/O is carried out sequentially even if they access different DFS blocks.
    • Compression codec. Use "none" if the data is not very compressible (by compressible, I mean a compression ratio of at least 2:1). Generally, use "lzo" as the starting point for experimenting. "gz" offers a slightly better compression ratio than "lzo" but requires 4x CPU to compress and 2x CPU to decompress, compared with "lzo".
    • File system buffering. If the underlying FSDataInputStream and FSDataOutputStream are already adequately buffered, or if applications read/write keys and values in large buffers, we can reduce the sizes of input/output buffering in the TFile layer by setting the configuration parameters "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
    Some design rationale behind TFile can be found at Hadoop-3315.]]> entry of the TFile. @param endKey End key of the scan. If null, scan up to the last entry of the TFile. @throws IOException]]> Use {@link Scanner#atEnd()} to test whether the cursor is at the end location of the scanner.

    Use {@link Scanner#advance()} to move the cursor to the next key-value pair (or end if none exists). Use seekTo methods ( {@link Scanner#seekTo(byte[])} or {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary location in the covered range (including backward seeking). Use {@link Scanner#rewind()} to seek back to the beginning of the scanner. Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.

    Actual keys and values may be obtained through {@link Scanner.Entry} object, which is obtained through {@link Scanner#entry()}.]]>
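
    A sketch of a scan over a whole TFile, assuming the Reader/Scanner API described above (fsdis is an FSDataInputStream for the file and fileLength its length):

      TFile.Reader reader = new TFile.Reader(fsdis, fileLength, conf);
      TFile.Reader.Scanner scanner = reader.createScanner();
      while (!scanner.atEnd()) {
        TFile.Reader.Scanner.Entry entry = scanner.entry();
        // read this entry's key and value through the Entry accessors
        scanner.advance();
      }
      scanner.close();
      reader.close();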

  • Algorithmic comparator: binary comparators that are language independent. Currently, only "memcmp" is supported.
  • Language-specific comparator: binary comparators that can only be constructed in a specific language. For Java, the syntax is "jclass:", followed by the class name of the RawComparator. Currently, we only support RawComparators that can be constructed through the default constructor (with no parameters). Parameterized RawComparators such as {@link WritableComparator} or {@link JavaSerializationComparator} may not be directly used. One should write a wrapper class that inherits from such classes and use its default constructor to perform proper initialization. @param conf The configuration object. @throws IOException]]> If an exception is thrown, the TFile will be in an inconsistent state. The only legitimate call after that would be close]]> Utils#writeVLong(out, n). @param out output stream @param n The integer to be encoded @throws IOException @see Utils#writeVLong(DataOutput, long)]]>
  • if n in [-32, 127): encode in one byte with the actual value. Otherwise,
  • if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52; byte[1]=n&0xff. Otherwise,
  • if n IN [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 - 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
  • if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112; byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
  • if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] = (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
  • if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] = (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff; byte[4]=(n>>8)&0xff; byte[5]=n&0xff
  • if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] = (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff; byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
  • if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] = (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff; byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff; byte[7]=n&0xff;
  • if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] = (n>>54)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff; byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff; byte[7]=(n>>8)&0xff; byte[8]=n&0xff; @param out output stream @param n the integer number @throws IOException]]> (int)Utils#readVLong(in). @param in input stream @return the decoded integer @throws IOException @see Utils#readVLong(DataInput)]]>
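
A sketch of a round trip through this variable-length encoding, using the Utils class assumed from the surrounding text together with DataOutputBuffer/DataInputBuffer:

  DataOutputBuffer out = new DataOutputBuffer();
  Utils.writeVLong(out, 123456789L);        // encoded in as few bytes as the rules above allow
  DataInputBuffer in = new DataInputBuffer();
  in.reset(out.getData(), out.getLength());
  long decoded = Utils.readVLong(in);       // decoded == 123456789L
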
  • if (FB >= -32), return (long)FB;
  • if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
  • if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 + NB[1]&0xff;
  • if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 + (NB[1]&0xff)<<8 + NB[2]&0xff;
  • if (FB in [-128, -121]), return interpret NB[FB+129] as a signed big-endian integer. @param in input stream @return the decoded long integer. @throws IOException]]> Type of the input key. @param list The list @param key The input key. @param cmp Comparator for the key. @return The index to the desired element if it exists; or list.size() otherwise.]]> Type of the input key. @param list The list @param key The input key. @param cmp Comparator for the key. @return The index to the desired element if it exists; or list.size() otherwise.]]> Type of the input key. @param list The list @param key The input key. @return The index to the desired element if it exists; or list.size() otherwise.]]> Type of the input key. @param list The list @param key The input key. @return The index to the desired element if it exists; or list.size() otherwise.]]> Keep trying a limited number of times, waiting a fixed time between attempts, and then fail by re-throwing the exception.

    ]]>
    Keep trying for a maximum time, waiting a fixed time between attempts, and then fail by re-throwing the exception.

    ]]>
    Keep trying a limited number of times, waiting a growing amount of time between attempts, and then fail by re-throwing the exception. The time between attempts is sleepTime multiplied by the number of tries so far.

    ]]>
    Keep trying a limited number of times, waiting a growing amount of time between attempts, and then fail by re-throwing the exception. The time between attempts is sleepTime multiplied by a random number in the range of [0, 2 to the number of retries)

    ]]>
    Set a default policy with some explicit handlers for specific exceptions.

    ]]>
    A retry policy for RemoteException Set a default policy with some explicit handlers for specific exceptions.

    ]]>
    Try once, and fail by re-throwing the exception. This corresponds to having no retry mechanism in place.

    ]]>
    Try once, and fail silently for void methods, or by re-throwing the exception for non-void methods.

    ]]>
    Keep trying forever.

    ]]>
    A collection of useful implementations of {@link RetryPolicy}.

    ]]>
    Determines whether the framework should retry a method for the given exception, and the number of retries that have been made for that operation so far.

    @param e The exception that caused the method to fail. @param retries The number of times the method has been retried. @return true if the method should be retried, false if the method should not be retried but shouldn't fail with an exception (only for void methods). @throws Exception The re-thrown exception e indicating that the method failed and should not be retried further.]]>
    Specifies a policy for retrying method failures. Implementations of this interface should be immutable.

    ]]>
    Create a proxy for an interface of an implementation class using the same retry policy for each method in the interface.

    @param iface the interface that the retry will implement @param implementation the instance whose methods should be retried @param retryPolicy the policy for retrying method call failures @return the retry proxy]]>
    Create a proxy for an interface of an implementation class using a set of retry policies specified by method name. If no retry policy is defined for a method then a default of {@link RetryPolicies#TRY_ONCE_THEN_FAIL} is used.

    @param iface the interface that the retry will implement @param implementation the instance whose methods should be retried @param methodNameToPolicyMap a map of method names to retry policies @return the retry proxy]]>
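
    A sketch of wrapping an implementation in a retry proxy (MyProtocol and myImpl are hypothetical; TimeUnit is java.util.concurrent.TimeUnit):

      RetryPolicy policy =
          RetryPolicies.retryUpToMaximumCountWithFixedSleep(5, 1, TimeUnit.SECONDS);
      MyProtocol proxy = (MyProtocol) RetryProxy.create(MyProtocol.class, myImpl, policy);
      proxy.doSomething();   // failures are retried up to five times, one second apart
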
    A factory for creating retry proxies.

    ]]>
    Prepare the deserializer for reading.

    ]]>
    Deserialize the next object from the underlying input stream. If the object t is non-null then this deserializer may set its internal state to the next object read from the input stream. Otherwise, if the object t is null a new deserialized object will be created.

    @return the deserialized object]]>
    Close the underlying input stream and clear up any resources.

    ]]>
    Provides a facility for deserializing objects of type from an {@link InputStream}.

    Deserializers are stateful, but must not buffer the input since other producers may read from the input between calls to {@link #deserialize(Object)}.

    @param ]]>
    A {@link RawComparator} that uses a {@link Deserializer} to deserialize the objects to be compared so that the standard {@link Comparator} can be used to compare them.

    One may optimize compare-intensive operations by using a custom implementation of {@link RawComparator} that operates directly on byte representations.

    @param ]]>
    An experimental {@link Serialization} for Java {@link Serializable} classes.

    @see JavaSerializationComparator]]>
    A {@link RawComparator} that uses a {@link JavaSerialization} {@link Deserializer} to deserialize objects that are then compared via their {@link Comparable} interfaces.

    @param @see JavaSerialization]]>
    Encapsulates a {@link Serializer}/{@link Deserializer} pair.

    @param ]]>
    Serializations are found by reading the io.serializations property from conf, which is a comma-delimited list of classnames.

    ]]>
    A factory for {@link Serialization}s.

    ]]>
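
    A sketch of obtaining a Serializer through the factory (MyRecord stands in for any class covered by io.serializations; out is some OutputStream):

      SerializationFactory factory = new SerializationFactory(conf);
      Serializer<MyRecord> serializer = factory.getSerializer(MyRecord.class);
      serializer.open(out);
      serializer.serialize(new MyRecord());
      serializer.close();
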
    Prepare the serializer for writing.

    ]]>
    Serialize t to the underlying output stream.

    ]]>
    Close the underlying output stream and clear up any resources.

    ]]>
    Provides a facility for serializing objects of type to an {@link OutputStream}.

    Serializers are stateful, but must not buffer the output since other producers may write to the output between calls to {@link #serialize(Object)}.

    @param ]]>
    param, to the IPC server running at address, returning the value. Throws exceptions if there are network problems or if the remote code threw an exception. @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]> param, to the IPC server running at address with the ticket credentials, returning the value. Throws exceptions if there are network problems or if the remote code threw an exception. @deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]> param, to the IPC server running at address which is servicing the protocol protocol, with the ticket credentials, returning the value. Throws exceptions if there are network problems or if the remote code threw an exception.]]> Unwraps any IOException. @param lookupTypes the desired exception class. @return IOException, which is either the lookupClass exception or this.]]> This unwraps any Throwable that has a constructor taking a String as a parameter. Otherwise it returns this. @return Throwable]]> protocol is a Java interface. All parameters and return types must be one of:
    • a primitive type, boolean, byte, char, short, int, long, float, double, or void; or
    • a {@link String}; or
    • a {@link Writable}; or
    • an array of the above types
    All methods in the protocol should throw only IOException. No field data of the protocol instance is transmitted.]]>
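
    A sketch of a conforming protocol interface (the name and method are invented for illustration):

      public interface PingProtocol extends VersionedProtocol {
        long versionID = 1L;
        // only primitives, Strings, Writables, or arrays of these may appear
        Text echo(Text message) throws IOException;
      }
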
    handlerCount determines the number of handler threads that will be used to process calls.]]>
    ,name=RpcActivityForPort" Many of the activity metrics are sampled and averaged on an interval which can be specified in the metrics config file.

    For the metrics that are sampled and averaged, one must specify a metrics context that does periodic update calls. Most metrics contexts do. The default Null metrics context however does NOT. So if you aren't using any other metrics context then you can turn on the viewing and averaging of sampled metrics by specifying the following two lines in the hadoop-metrics.properties file:

            rpc.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
            rpc.period=10
      

    Note that the metrics are collected regardless of the context used. The context with the update thread is used to average the data periodically. Implementation details: we use a dynamic MBean that gets the list of the metrics from the metrics registry passed as an argument to the constructor.]]> This class has a number of metrics variables that are publicly accessible; these variables (objects) have methods to update their values; for example:

    {@link #rpcQueueTime}.inc(time)]]> For the statistics that are sampled and averaged, one must specify a metrics context that does periodic update calls. Most do. The default Null metrics context however does NOT. So if you aren't using any other metrics context then you can turn on the viewing and averaging of sampled metrics by specifying the following two lines in the hadoop-metrics.properties file:

            rpc.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
            rpc.period=10
      

    Note that the metrics are collected regardless of the context used. The context with the update thread is used to average the data periodically]]> When constructing the instance, if the factory property contextName.class exists, its value is taken to be the name of the class to instantiate. Otherwise, the default is to create an instance of org.apache.hadoop.metrics.spi.NullContext, which is a dummy "no-op" context which will cause all metric data to be discarded. @param contextName the name of the context @return the named MetricsContext]]> When the instance is constructed, this method checks if the file hadoop-metrics.properties exists on the class path. If it exists, it must be in the format defined by java.util.Properties, and all the properties in the file are set as attributes on the newly created ContextFactory instance. @return the singleton ContextFactory instance]]> getFactory() method.]]> startMonitoring() again after calling this. @see #close()]]> recordName. Throws an exception if the metrics implementation is configured with a fixed set of record names and recordName is not in that set. @param recordName the name of the record @throws MetricsException if recordName conflicts with configuration data]]> A record name identifies the kind of data to be reported. For example, a program reporting statistics relating to the disks on a computer might use a record name "diskStats".

    A record has zero or more tags. A tag has a name and a value. To continue the example, the "diskStats" record might use a tag named "diskName" to identify a particular disk. Sometimes it is useful to have more than one tag, so there might also be a "diskType" with value "ide" or "scsi" or whatever.

    A record also has zero or more metrics. These are the named values that are to be reported to the metrics system. In the "diskStats" example, possible metric names would be "diskPercentFull", "diskPercentBusy", "kbReadPerSecond", etc.

    The general procedure for using a MetricsRecord is to fill in its tag and metric values, and then call update() to pass the record to the client library. Metric data is not immediately sent to the metrics system each time that update() is called. An internal table is maintained, identified by the record name. This table has columns corresponding to the tag and the metric names, and rows corresponding to each unique set of tag values. An update either modifies an existing row in the table, or adds a new row with a set of tag values that are different from all the other rows. Note that if there are no tags, then there can be at most one row in the table.
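
    A sketch of that procedure, reusing the "diskStats" example (the context name is made up):

      MetricsContext context = MetricsUtil.getContext("myContext");
      MetricsRecord diskStats = MetricsUtil.createRecord(context, "diskStats");
      diskStats.setTag("diskName", "sda1");
      diskStats.setMetric("diskPercentFull", 72);
      diskStats.update();   // queues the row; data is sent on the next timer period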

    Once a row is added to the table, its data will be sent to the metrics system on every timer period, whether or not it has been updated since the previous timer period. If this is inappropriate, for example if metrics were being reported by some transient object in an application, the remove() method can be used to remove the row and thus stop the data from being sent.
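
    The pattern described above can be sketched roughly as follows, using the MetricsUtil helper (a hedged illustration; the context name "myContext" and the tag/metric names are illustrative, borrowed from the "diskStats" example, not part of any shipped configuration):

      MetricsContext context = MetricsUtil.getContext("myContext");
      MetricsRecord diskStats = MetricsUtil.createRecord(context, "diskStats");
      diskStats.setTag("diskName", "sda1");
      diskStats.setMetric("diskPercentFull", 72);
      diskStats.update();    // stages the row; it is emitted on every timer period
      // ... later, when this disk should no longer be reported on:
      diskStats.remove();    // stops the row from being sent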

    Note that the update() method is atomic. This means that it is safe for different threads to be updating the same metric. More precisely, it is OK for different threads to call update() on MetricsRecord instances with the same set of tag names and tag values. Different threads should not use the same MetricsRecord instance at the same time.]]> MetricsContext.registerUpdater().]]> fileName attribute, if specified. Otherwise the data will be written to standard output.]]> This class is configured by setting ContextFactory attributes which in turn are usually configured through a properties file. All the attributes are prefixed by the contextName. For example, the properties file might contain:

     myContextName.fileName=/tmp/metrics.log
     myContextName.period=5
     
    ]]> contextName.tableName. The returned map consists of those attributes with the contextName and tableName stripped off.]]> recordName. Throws an exception if the metrics implementation is configured with a fixed set of record names and recordName is not in that set. @param recordName the name of the record @throws MetricsException if recordName conflicts with configuration data]]> This class implements the internal table of metric data, and the timer on which data is to be sent to the metrics system. Subclasses must override the abstract emitRecord method in order to transmit the data.

    ]]> update() and remove().]]> hostname or hostname:port. If the specs string is null, defaults to localhost:defaultPort. @return a list of InetSocketAddress objects.]]> ,name=<nameName>", where <serviceName> and <nameName> are the supplied parameters @param serviceName @param nameName @param theMbean - the MBean to register @return the name used to register the MBean]]> hadoop.rpc.socket.factory.class.<ClassName>. When no such parameter exists then fall back on the default socket factory as configured by hadoop.rpc.socket.factory.class.default. If this default socket factory is not configured, then fall back on the JVM default socket factory. @param conf the configuration @param clazz the class (usually a {@link VersionedProtocol}) @return a socket factory]]> hadoop.rpc.socket.factory.default @param conf the configuration @return the default socket factory as specified in the configuration or the JVM default socket factory if the configuration does not contain a default socket factory property.]]> <host>:<port> or <fs>://<host>:<port>/<path>]]> <host>:<port> or <fs>://<host>:<port>/<path>]]>
    From documentation for {@link #getInputStream(Socket, long)}:
    Returns InputStream for the socket. If the socket has an associated SocketChannel then it returns a {@link SocketInputStream} with the given timeout. If the socket does not have a channel, {@link Socket#getInputStream()} is returned. In the latter case, the timeout argument is ignored and the timeout set with {@link Socket#setSoTimeout(int)} applies for reads.

    Any socket created using socket factories returned by {@link #NetUtils}, must use this interface instead of {@link Socket#getInputStream()}. @see #getInputStream(Socket, long) @param socket @return InputStream for reading from the socket. @throws IOException]]>

    Any socket created using socket factories returned by {@link #NetUtils}, must use this interface instead of {@link Socket#getInputStream()}. @see Socket#getChannel() @param socket @param timeout timeout in milliseconds. This may not always apply. zero for waiting as long as necessary. @return InputStream for reading from the socket. @throws IOException]]>

    From documentation for {@link #getOutputStream(Socket, long)} :
    Returns OutputStream for the socket. If the socket has an associated SocketChannel then it returns a {@link SocketOutputStream} with the given timeout. If the socket does not have a channel, {@link Socket#getOutputStream()} is returned. In the latter case, the timeout argument is ignored and the write will wait until data is available.

    Any socket created using socket factories returned by {@link #NetUtils}, must use this interface instead of {@link Socket#getOutputStream()}. @see #getOutputStream(Socket, long) @param socket @return OutputStream for writing to the socket. @throws IOException]]>

    Any socket created using socket factories returned by {@link #NetUtils}, must use this interface instead of {@link Socket#getOutputStream()}. @see Socket#getChannel() @param socket @param timeout timeout in milliseconds. This may not always apply. zero for waiting as long as necessary. @return OutputStream for writing to the socket. @throws IOException]]>
    socket.connect(endpoint, timeout). If socket.getChannel() returns a non-null channel, connect is implemented using Hadoop's selectors. This is done mainly to avoid Sun's connect implementation from creating thread-local selectors, since Hadoop does not have control on when these are closed and could end up taking all the available file descriptors. @see java.net.Socket#connect(java.net.SocketAddress, int) @param socket @param endpoint @param timeout - timeout in milliseconds]]>
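
    Putting the pieces above together, a typical usage sketch looks like the following (hedged; the host name and port are hypothetical, and conf is assumed to be an existing Configuration):

      Socket socket = NetUtils.getDefaultSocketFactory(conf).createSocket();
      NetUtils.connect(socket, new InetSocketAddress("datanode.example.com", 50010), 20000);
      InputStream in = NetUtils.getInputStream(socket, 20000);    // timeout is honoured when a channel exists
      OutputStream out = NetUtils.getOutputStream(socket, 20000);
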
    node @param node a node @return true if node is already in the tree; false otherwise]]> scope if scope starts with ~, choose one from all the nodes except for the ones in scope; otherwise, choose one from scope @param scope range of nodes from which a node will be chosen @return the chosen node]]> scope but not in excludedNodes if scope starts with ~, return the number of nodes that are not in scope and excludedNodes; @param scope a path string that may start with ~ @param excludedNodes a list of nodes @return number of available nodes]]> reader. It linearly scans the array; if a local node is found, it is swapped with the first element of the array. If a local-rack node is found, it is swapped with the first element following the local node. If neither a local node nor a local-rack node is found, a random replica location is put at position 0. The rest of the nodes are left untouched.]]>
    Create a new input stream with the given timeout. If the timeout is zero, it will be treated as infinite timeout. The socket's channel will be configured to be non-blocking. @see SocketInputStream#SocketInputStream(ReadableByteChannel, long) @param socket should have a channel associated with it. @param timeout timeout in milliseconds; must not be negative. @throws IOException]]>

    Create a new input stream with the given timeout. If the timeout is zero, it will be treated as infinite timeout. The socket's channel will be configured to be non-blocking. @see SocketInputStream#SocketInputStream(ReadableByteChannel, long) @param socket should have a channel associated with it. @throws IOException]]>

    Create a new output stream with the given timeout. If the timeout is zero, it will be treated as infinite timeout. The socket's channel will be configured to be non-blocking. @see SocketOutputStream#SocketOutputStream(WritableByteChannel, long) @param socket should have a channel associated with it. @param timeout timeout in milliseconds; must not be negative. @throws IOException]]>
    = getCount(). @param newCapacity The new capacity in bytes.]]> Index idx = startVector(...); while (!idx.done()) { .... // read element of a vector idx.incr(); } ]]> This task takes the given record definition files and compiles them into java or c++ files. It is then up to the user to compile the generated files.

    The task requires the file or the nested fileset element to be specified. Optional attributes are language (set the output language, default is "java"), destdir (name of the destination directory for generated java/c++ code, default is ".") and failonerror (specifies error handling behavior. default is true).

    Usage

     <recordcc
           destdir="${basedir}/gensrc"
           language="java">
       <fileset include="**\/*.jr" />
     </recordcc>
     
    ]]>
    ]]> (cause==null ? null : cause.toString()) (which typically contains the class and detail message of cause). @param cause the cause (which is saved for later retrieval by the {@link #getCause()} method). (A null value is permitted, and indicates that the cause is nonexistent or unknown.)]]> Group with the given groupname. @param group group name]]> ugi. @param ugi user @return the {@link Subject} for the user identified by ugi]]> ugi as a comma separated string in conf as a property attr The String starts with the user name followed by the default group names, and other group names. @param conf configuration @param attr property name @param ugi a UnixUserGroupInformation]]> conf The object is expected to store with the property name attr as a comma separated string that starts with the user name followed by group names. If the property name is not defined, return null. It's assumed that there is only one UGI per user. If this user already has a UGI in the ugi map, return the ugi in the map. Otherwise, construct a UGI from the configuration, store it in the ugi map and return it. @param conf configuration @param attr property name @return a UnixUGI @throws LoginException if the stored string is ill-formatted.]]> User with the given username. @param user user name]]> (cause==null ? null : cause.toString()) (which typically contains the class and detail message of cause). @param cause the cause (which is saved for later retrieval by the {@link #getCause()} method). (A null value is permitted, and indicates that the cause is nonexistent or unknown.)]]> does not provide the stack trace for security purposes.]]> service as related to Service Level Authorization for Hadoop. Each service defines it's configuration key and also the necessary {@link Permission} required to access the service.]]> in]]> out.]]> reset is true, then resets the checksum. @return number of bytes written. Will be equal to getChecksumSize();]]> reset is true, then resets the checksum. @return number of bytes written. Will be equal to getChecksumSize();]]> GenericOptionsParser to parse only the generic Hadoop arguments. The array of string arguments other than the generic arguments can be obtained by {@link #getRemainingArgs()}. @param conf the Configuration to modify. @param args command-line arguments.]]> GenericOptionsParser to parse given options as well as generic Hadoop options. The resulting CommandLine object can be obtained by {@link #getCommandLine()}. @param conf the configuration to modify @param options options built by the caller @param args User-specified arguments]]> Strings containing the un-parsed arguments or empty array if commandLine was not defined.]]> CommandLine object to process the parsed arguments. Note: If the object is created with {@link #GenericOptionsParser(Configuration, String[])}, then returned object will only contain parsed generic options. @return CommandLine representing list of arguments parsed against Options descriptor.]]> GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standarad command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc.

    Generic Options

    The supported generic options are:

         -conf <configuration file>     specify a configuration file
         -D <property=value>            use value for given property
         -fs <local|namenode:port>      specify a namenode
         -jt <local|jobtracker:port>    specify a job tracker
         -files <comma separated list of files>    specify comma separated
                                files to be copied to the map reduce cluster
         -libjars <comma separated list of jars>   specify comma separated
                                jar files to include in the classpath.
         -archives <comma separated list of archives>    specify comma
                 separated archives to be unarchived on the compute machines.
    
     

    The general command line syntax is:

     bin/hadoop command [genericOptions] [commandOptions]
     

    Generic command line arguments might modify Configuration objects, given to constructors.

    The functionality is implemented using Commons CLI.

    Examples:

     $ bin/hadoop dfs -fs darwin:8020 -ls /data
     list /data directory in dfs with namenode darwin:8020
     
     $ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
     list /data directory in dfs with namenode darwin:8020
         
     $ bin/hadoop dfs -conf hadoop-site.xml -ls /data
     list /data directory in dfs with conf specified in hadoop-site.xml
         
     $ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
     submit a job to job tracker darwin:50020
         
     $ bin/hadoop job -jt darwin:50020 -submit job.xml
     submit a job to job tracker darwin:50020
         
     $ bin/hadoop job -jt local -submit job.xml
     submit a job to local runner
     
     $ bin/hadoop jar -libjars testlib.jar 
     -archives test.tgz -files file.txt inputjar args
     job submission with libjars, files and archives
     

    @see Tool @see ToolRunner]]>
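
    A minimal sketch of using GenericOptionsParser directly (most applications get this for free via ToolRunner); the driver class name is illustrative:

      public class MyDriver {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          GenericOptionsParser parser = new GenericOptionsParser(conf, args);
          // conf now reflects -conf, -D, -fs, -jt, -files, -libjars and -archives;
          // whatever is left over belongs to the application.
          String[] appArgs = parser.getRemainingArgs();
        }
      }
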
    Class<T>) of the argument of type T. @param The type of the argument @param t the object to get its class @return Class<T>]]> List<T> to an array of T[]. @param c the Class object of the items in the list @param list the list to convert]]> List<T> to an array of T[]. @param list the list to convert @throws ArrayIndexOutOfBoundsException if the list is empty. Use {@link #toArray(Class, List)} if the list may be empty.]]> io.file.buffer.size specified in the given Configuration. @param in input stream @param conf configuration @throws IOException]]> true if native-hadoop is loaded, else false]]> true if native hadoop libraries, if present, can be used for this job; false otherwise.]]> { pq.top().change(); pq.adjustTop(); } instead of
      { o = pq.pop(); o.change(); pq.push(o); }
     
    ]]>
    Clients and/or applications can use the provided Progressable to explicitly report progress to the Hadoop framework. This is especially important for operations which take a significant amount of time since, in lieu of the reported progress, the framework has to assume that an error has occurred and time-out the operation.

    ]]>
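
    For example, a long-running write loop might report liveness as sketched below (a hedged illustration; the method and chunk size are hypothetical):

      void writeLargeRecord(OutputStream out, byte[] data, Progressable progress) throws IOException {
        final int chunk = 64 * 1024;
        for (int off = 0; off < data.length; off += chunk) {
          out.write(data, off, Math.min(chunk, data.length - off));
          progress.progress();   // tell the framework the operation is still alive
        }
      }
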
    Class is to be obtained @return the correctly typed Class of the given object.]]> Hadoop Pipes or Hadoop Streaming. It also checks to ensure that we are running on a *nix platform else (e.g. in Cygwin/Windows) it returns null. @param conf configuration @return a String[] with the ulimit command arguments or null if we are running on a non *nix platform or if the limit is unspecified.]]> Shell interface. @param cmd shell command to execute. @return the output of the executed command.]]> Shell interface. @param env the map of environment key=value @param cmd shell command to execute. @return the output of the executed command.]]> Shell can be used to run unix commands like du or df. It also offers facilities to gate commands by time-intervals.]]> ShellCommandExecutor should be used in cases where the output of the command needs no explicit parsing and where the command, working directory and the environment remain unchanged. The output of the command is stored as-is and is expected to be small.]]> ArrayList of string values]]> charToEscape in the string with the escape char escapeChar @param str string @param escapeChar escape char @param charToEscape the char to be escaped @return an escaped string]]> charToEscape in the string with the escape char escapeChar @param str string @param escapeChar escape char @param charToEscape the escaped char @return an unescaped string]]> Tool is the standard for any Map-Reduce tool/application. The tool/application should delegate the handling of standard command-line options to {@link ToolRunner#run(Tool, String[])} and only handle its custom arguments.

    Here is how a typical Tool is implemented:

         public class MyApp extends Configured implements Tool {
         
           public int run(String[] args) throws Exception {
             // Configuration processed by ToolRunner
             Configuration conf = getConf();
             
             // Create a JobConf using the processed conf
             JobConf job = new JobConf(conf, MyApp.class);
             
             // Process custom command-line options
             Path in = new Path(args[1]);
             Path out = new Path(args[2]);
             
             // Specify various job-specific parameters     
             job.setJobName("my-app");
             job.setInputPath(in);
             job.setOutputPath(out);
             job.setMapperClass(MyApp.MyMapper.class);
             job.setReducerClass(MyApp.MyReducer.class);
    
         // Submit the job, then poll for progress until the job is complete
         JobClient.runJob(job);
         return 0;
           }
           
           public static void main(String[] args) throws Exception {
             // Let ToolRunner handle generic command-line options 
         int res = ToolRunner.run(new Configuration(), new MyApp(), args);
             
             System.exit(res);
           }
         }
     

    @see GenericOptionsParser @see ToolRunner]]>
    Tool by {@link Tool#run(String[])}, after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf. @param conf Configuration for the Tool. @param tool Tool to run. @param args command-line arguments to the tool. @return exit code of the {@link Tool#run(String[])} method.]]> Tool with its Configuration. Equivalent to run(tool.getConf(), tool, args). @param tool Tool to run. @param args command-line arguments to the tool. @return exit code of the {@link Tool#run(String[])} method.]]> ToolRunner can be used to run classes implementing Tool interface. It works in conjunction with {@link GenericOptionsParser} to parse the generic hadoop command line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.

    @see Tool @see GenericOptionsParser]]>
    this filter. @param nbHash The number of hash function to consider. @param hashType type of the hashing function (see {@link org.apache.hadoop.util.hash.Hash}).]]> Bloom filter, as defined by Bloom in 1970.

    The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
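
    As a rough usage sketch (the sizing parameters below are illustrative, not recommendations):

      BloomFilter filter = new BloomFilter(1024, 4, Hash.MURMUR_HASH);  // 1024-bit vector, 4 hash functions
      filter.add(new Key("/user/alice/part-00000".getBytes()));
      boolean probablyPresent = filter.membershipTest(new Key("/user/alice/part-00000".getBytes()));
      // membershipTest() may return a false positive, but never a false negative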

    Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Space/Time Trade-Offs in Hash Coding with Allowable Errors]]> this filter. @param nbHash The number of hash function to consider. @param hashType type of the hashing function (see {@link org.apache.hadoop.util.hash.Hash}).]]> this counting Bloom filter.

    Invariant: nothing happens if the specified key does not belong to this counter Bloom filter. @param key The key to remove.]]> key -> count map.

    NOTE: due to the bucket size of this filter, inserting the same key more than 15 times will cause an overflow at all filter positions associated with this key, and it will significantly increase the error rate for this and other keys. For this reason the filter can only be used to store small count values 0 <= N << 15. @param key key to be tested @return 0 if the key is not present. Otherwise, a positive value v will be returned such that v == count with probability equal to the error rate of this filter, and v > count otherwise. Additionally, if the filter experienced an underflow as a result of {@link #delete(Key)} operation, the return value may be lower than the count with the probability of the false negative rate of such filter.]]> counting Bloom filter, as defined by Fan et al. in a ToN 2000 paper.

    A counting Bloom filter is an improvement to a standard Bloom filter as it allows dynamic additions and deletions of set membership information. This is achieved through the use of a counting vector instead of a bit vector.

    Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Summary cache: a scalable wide-area web cache sharing protocol]]> Builds an empty Dynamic Bloom filter. @param vectorSize The number of bits in the vector. @param nbHash The number of hash function to consider. @param hashType type of the hashing function (see {@link org.apache.hadoop.util.hash.Hash}). @param nr The threshold for the maximum number of keys to record in a dynamic Bloom filter row.]]> dynamic Bloom filter, as defined in the INFOCOM 2006 paper.

    A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but each of the s rows is a standard Bloom filter. The creation process of a DBF is iterative. At the start, the DBF is a 1 * m bit matrix, i.e., it is composed of a single standard Bloom filter. It assumes that nr elements are recorded in the initial bit vector, where nr <= n (n is the cardinality of the set A to record in the filter).

    As the size of A grows during the execution of the application, several keys must be inserted in the DBF. When inserting a key into the DBF, one must first get an active Bloom filter in the matrix. A Bloom filter is active when the number of recorded keys, nr, is strictly less than the current cardinality of A, n. If an active Bloom filter is found, the key is inserted and nr is incremented by one. On the other hand, if there is no active Bloom filter, a new one is created (i.e., a new row is added to the matrix) according to the current size of A and the element is added in this new Bloom filter and the nr value of this new Bloom filter is set to one. A given key is said to belong to the DBF if the k positions are set to one in one of the matrix rows.

    Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see Theory and Network Applications of Dynamic Bloom Filters]]> this filter. @param nbHash The number of hash functions to consider. @param hashType type of the hashing function (see {@link Hash}).]]> this filter. @param key The key to add.]]> this filter. @param key The key to test. @return boolean True if the specified key belongs to this filter. False otherwise.]]> this filter and a specified filter.

    Invariant: The result is assigned to this filter. @param filter The filter to AND with.]]> this filter and a specified filter.

    Invariant: The result is assigned to this filter. @param filter The filter to OR with.]]> this filter and a specified filter.

    Invariant: The result is assigned to this filter. @param filter The filter to XOR with.]]> this filter.

    The result is assigned to this filter.]]> this filter. @param keys The list of keys.]]> this filter. @param keys The collection of keys.]]> this filter. @param keys The array of keys.]]> this filter.]]> A filter is a data structure which aims at offering a lossy summary of a set A. The key idea is to map entries of A (also called keys) into several positions in a vector through the use of several hash functions.

    Typically, a filter will be implemented as a Bloom filter (or a Bloom filter extension).

    It must be extended in order to define the real behavior. @see Key The general behavior of a key @see HashFunction A hash function]]> Builds a hash function that must obey to a given maximum number of returned values and a highest value. @param maxValue The maximum highest returned value. @param nbHash The number of resulting hashed values. @param hashType type of the hashing function (see {@link Hash}).]]> this hash function. A NOOP]]> Builds a key with a default weight. @param value The byte value of this key.]]> Builds a key with a specified weight. @param value The value of this key. @param weight The weight associated to this key.]]> this key.]]> this key.]]> this key with a specified value. @param weight The increment.]]> this key by one.]]> The idea is to randomly select a bit to reset.]]> The idea is to select the bit to reset that will generate the minimum number of false negative.]]> The idea is to select the bit to reset that will remove the maximum number of false positive.]]> The idea is to select the bit to reset that will, at the same time, remove the maximum number of false positve while minimizing the amount of false negative generated.]]> Originally created by European Commission One-Lab Project 034819.]]> this filter. @param nbHash The number of hash function to consider. @param hashType type of the hashing function (see {@link org.apache.hadoop.util.hash.Hash}).]]> this retouched Bloom filter.

    Invariant: if the false positive is null, nothing happens. @param key The false positive key to add.]]> this retouched Bloom filter. @param coll The collection of false positive.]]> this retouched Bloom filter. @param keys The list of false positive.]]> this retouched Bloom filter. @param keys The array of false positive.]]> this retouched Bloom filter. @param scheme The selective clearing scheme to apply.]]> retouched Bloom filter, as defined in the CoNEXT 2006 paper.

    It allows the removal of selected false positives at the cost of introducing random false negatives, and with the benefit of eliminating some random false positives at the same time.

    Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see RemoveScheme The different selective clearing algorithms @see Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives]]> length, and the provided seed value @param bytes input bytes @param length length of the valid bytes to consider @param initval seed value @return hash value]]> The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask. For example, if you need only 10 bits, do h = (h & hashmask(10)); In which case, the hash table should have hashsize(10) elements.

    If you are hashing n strings byte[][] k, do it like this: for (int i = 0, h = 0; i < n; ++i) h = hash( k[i], h);

    By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this code any way you wish, private, educational, or commercial. It's free.

    Use for hash table lookup, or anything where one collision in 2^^32 is acceptable. Do NOT use for cryptographic purposes.]]> lookup3.c, by Bob Jenkins, May 2006, Public Domain. You can use this free for any purpose. It's in the public domain. It has no warranty. @see lookup3.c @see Hash Functions (and how this function compares to others such as CRC, MD?, etc @see Has update on the Dr. Dobbs Article]]> The C version of MurmurHash 2.0 found at that site was ported to Java by Andrzej Bialecki (ab at getopt org).

    ]]>
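
    A hedged sketch of computing a hash through the pluggable Hash API (the 10-bit table size is just an example of the masking advice above):

      Hash hasher = Hash.getInstance(Hash.JENKINS_HASH);
      int h = hasher.hash("some key".getBytes());
      int bucket = h & ((1 << 10) - 1);   // mask down to 10 bits for a 1024-entry table
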
    JobTracker, as {@link JobTracker.State} @return the current state of the JobTracker.]]> JobTracker @return the size of heap memory used by the JobTracker]]> JobTracker @return the configured size of max heap memory that can be used by the JobTracker]]> ClusterStatus provides clients with information such as:
    1. Size of the cluster.
    2. Name of the trackers.
    3. Task capacity of the cluster.
    4. The number of currently running map & reduce tasks.
    5. State of the JobTracker.

    Clients can query for the latest ClusterStatus, via {@link JobClient#getClusterStatus()}.

    @see JobClient]]>
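
    For example (a hedged sketch; job is assumed to be an existing JobConf):

      JobClient client = new JobClient(job);
      ClusterStatus status = client.getClusterStatus();
      System.out.println("trackers: " + status.getTaskTrackers()
          + ", running maps: " + status.getMapTasks()
          + ", map capacity: " + status.getMaxMapTasks());
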
    Counters represent global counters, defined either by the Map-Reduce framework or applications. Each Counter can be of any {@link Enum} type.

    Counters are bunched into {@link Group}s, each comprising of counters from a particular Enum class. @deprecated Use {@link org.apache.hadoop.mapreduce.Counters} instead.]]> Group of counters, comprising of counters from a particular counter {@link Enum} class.

    Group handles localization of the class name and the counter names.

    ]]>
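
    Counters can be read back from a finished job roughly as follows (a hedged sketch; MyMapper.MyCounters is an application-defined enum such as the one used in the Mapper example further down):

      RunningJob running = JobClient.runJob(job);
      Counters counters = running.getCounters();
      long records = counters.getCounter(MyMapper.MyCounters.NUM_RECORDS);
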
    FileInputFormat implementations can override this and return false to ensure that individual input files are never split-up so that {@link Mapper}s process entire files. @param fs the file system that the file is on @param filename the file name to check @return is this file splitable?]]> FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of {@link #getSplits(JobConf, int)}. Subclasses of FileInputFormat can also override the {@link #isSplitable(FileSystem, Path)} method to ensure input-files are not split-up and are processed as a whole by {@link Mapper}s. @deprecated Use {@link org.apache.hadoop.mapreduce.lib.input.FileInputFormat} instead.]]> true if the job output should be compressed, false otherwise]]> Tasks' Side-Effect Files

    Note: The following is valid only if the {@link OutputCommitter} is {@link FileOutputCommitter}. If OutputCommitter is not a FileOutputCommitter, the task's temporary output directory is the same as {@link #getOutputPath(JobConf)}, i.e. ${mapred.output.dir}.

    Some applications need to create/write-to side-files, which differ from the actual job-outputs.

    In such cases there could be issues with 2 instances of the same TIP (running simultaneously e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per TIP.

    To get around this the Map-Reduce framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_temporary/_${taskid} (only) are promoted to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

    The application-writer can take advantage of this by creating any side-files required in ${mapred.work.output.dir} during execution of the task, i.e. via {@link #getWorkOutputPath(JobConf)}, and the framework will move them out similarly; thus the application does not have to pick unique paths per task-attempt.

    Note: the value of ${mapred.work.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_temporary/_{$taskid}, and this value is set by the map-reduce framework. So, just create any side-files in the path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce task to take advantage of this feature.

    The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.

    @return the {@link Path} to the task's temporary output directory for the map-reduce job.]]>
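
    A hedged sketch of creating such a side-file from within a task (the file name is illustrative; job is the task's JobConf):

      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      FileSystem fs = workDir.getFileSystem(job);
      FSDataOutputStream side = fs.create(new Path(workDir, "side-data.txt"));
      // ... write to the stream; on successful task commit the framework promotes the
      // file from ${mapred.work.output.dir} to ${mapred.output.dir}
      side.close();
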
    The generated name can be used to create custom files from within the different tasks for the job, the names for different tasks will not collide with each other.

    The given name is postfixed with the task type ('m' for maps, 'r' for reduces) and the task partition number. For example, given the name 'test' running on the first map of the job, the generated name will be 'test-m-00000'.

    @param conf the configuration for the job. @param name the name to make unique. @return a unique name across all tasks of the job.]]>
    The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.


    This method uses the {@link #getUniqueName} method to make the file name unique for the task.

    @param conf the configuration for the job. @param name the name for the file. @return a unique path across all tasks of the job.]]>
    Each {@link InputSplit} is then assigned to an individual {@link Mapper} for processing.

    Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple. @param job job configuration. @param numSplits the desired number of splits, a hint. @return an array of {@link InputSplit}s for the job.]]> It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.

    @param split the {@link InputSplit} @param job the job that this split belongs to @return a {@link RecordReader}]]>
    InputFormat describes the input-specification for a Map-Reduce job.

    The Map-Reduce framework relies on the InputFormat of the job to:

    1. Validate the input-specification of the job.
    2. Split-up the input file(s) into logical {@link InputSplit}s, each of which is then assigned to an individual {@link Mapper}.
    3. Provide the {@link RecordReader} implementation to be used to glean input records from the logical InputSplit for processing by the {@link Mapper}.

    The default behavior of file-based {@link InputFormat}s, typically sub-classes of {@link FileInputFormat}, is to split the input into logical {@link InputSplit}s based on the total size, in bytes, of the input files. However, the {@link FileSystem} blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.

    Clearly, logical splits based on input-size are insufficient for many applications since record boundaries are to be respected. In such cases, the application has to also implement a {@link RecordReader}, on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task. @see InputSplit @see RecordReader @see JobClient @see FileInputFormat @deprecated Use {@link org.apache.hadoop.mapreduce.InputFormat} instead.]]> InputSplit. @return the number of bytes in the input split. @throws IOException]]> InputSplit is located as an array of Strings. @throws IOException]]> InputSplit represents the data to be processed by an individual {@link Mapper}.

    Typically, it presents a byte-oriented view on the input and is the responsibility of {@link RecordReader} of the job to process this and present a record-oriented view. @see InputFormat @see RecordReader @deprecated Use {@link org.apache.hadoop.mapreduce.InputSplit} instead.]]> JobClient.]]> jobid doesn't correspond to any known job. @throws IOException]]> JobClient is the primary interface for the user-job to interact with the {@link JobTracker}. JobClient provides facilities to submit jobs, track their progress, access component-tasks' reports/logs, get the Map-Reduce cluster status information etc.

    The job submission process involves:

    1. Checking the input and output specifications of the job.
    2. Computing the {@link InputSplit}s for the job.
    3. Setup the requisite accounting information for the {@link DistributedCache} of the job, if necessary.
    4. Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
    5. Submitting the job to the JobTracker and optionally monitoring its status.

    Normally the user creates the application, describes various facets of the job via {@link JobConf} and then uses the JobClient to submit the job and monitor its progress.

    Here is an example on how to use JobClient:

         // Create a new JobConf
         JobConf job = new JobConf(new Configuration(), MyJob.class);
         
         // Specify various job-specific parameters     
         job.setJobName("myjob");
         
         job.setInputPath(new Path("in"));
         job.setOutputPath(new Path("out"));
         
         job.setMapperClass(MyJob.MyMapper.class);
         job.setReducerClass(MyJob.MyReducer.class);
    
         // Submit the job, then poll for progress until the job is complete
         JobClient.runJob(job);
     

    Job Control

    At times clients would chain map-reduce jobs to accomplish complex tasks which cannot be done via a single map-reduce job. This is fairly easy since the output of the job, typically, goes to distributed file-system and that can be used as the input for the next job.

    However, this also means that the onus on ensuring jobs are complete (success/failure) lies squarely on the clients. In such situations the various job-control options are:

    1. {@link #runJob(JobConf)} : submits the job and returns only after the job has completed.
    2. {@link #submitJob(JobConf)} : only submits the job, then poll the returned handle to the {@link RunningJob} to query status and make scheduling decisions.
    3. {@link JobConf#setJobEndNotificationURI(String)} : setup a notification on job-completion, thus avoiding polling.

    @see JobConf @see ClusterStatus @see Tool @see DistributedCache]]>
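
    The second job-control option above can be sketched as a simple polling loop (a hedged illustration; the sleep interval is arbitrary and the snippet assumes IOException and InterruptedException can propagate):

      JobClient client = new JobClient(job);
      RunningJob running = client.submitJob(job);
      while (!running.isComplete()) {
        Thread.sleep(5000);                 // poll every few seconds
      }
      if (!running.isSuccessful()) {
        System.err.println("Job failed!");
      }
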
    If the parameter {@code loadDefaults} is false, the new instance will not load resources from the default files. @param loadDefaults specifies whether to load from the default files]]> true if framework should keep the intermediate files for failed tasks, false otherwise.]]> true if the outputs of the maps are to be compressed, false otherwise.]]> This comparator should be provided if the equivalence rules for keys for sorting the intermediates are different from those for grouping keys before each call to {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.

    For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed in a single call to the reduce function if K1 and K2 compare as equal.

    Since {@link #setOutputKeyComparatorClass(Class)} can be used to control how keys are sorted, this can be used in conjunction to simulate secondary sort on values.

    Note: This is not a guarantee of the reduce sort being stable in any sense. (In any case, with the order of available map-outputs to the reduce being non-deterministic, it wouldn't make that much sense.)

    @param theClass the comparator class to be used for grouping keys. It should implement RawComparator. @see #setOutputKeyComparatorClass(Class)]]>
    combiner class used to combine map-outputs before being sent to the reducers. Typically the combiner is the same as the {@link Reducer} for the job i.e. {@link #getReducerClass()}. @return the user-defined combiner class used to combine map-outputs.]]> combiner class used to combine map-outputs before being sent to the reducers.

    The combiner is an application-specified aggregation operation, which can help cut down the amount of data transferred between the {@link Mapper} and the {@link Reducer}, leading to better performance.

    The framework may invoke the combiner 0, 1, or multiple times, in both the mapper and reducer tasks. In general, the combiner is called as the sort/merge result is written to disk. The combiner must:

    • be side-effect free
    • have the same input and output key types and the same input and output value types

    Typically the combiner is the same as the Reducer for the job, i.e. {@link #setReducerClass(Class)}.

    @param theClass the user-defined combiner class used to combine map-outputs.]]>
    true. @return true if speculative execution be used for this job, false otherwise.]]> true if speculative execution should be turned on, else false.]]> true. @return true if speculative execution be used for this job for map tasks, false otherwise.]]> true if speculative execution should be turned on for map tasks, else false.]]> true. @return true if speculative execution be used for reduce tasks for this job, false otherwise.]]> true if speculative execution should be turned on for reduce tasks, else false.]]> 1. @return the number of reduce tasks for this job.]]> Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of {@link InputSplit}s generated by the job's {@link InputFormat#getSplits(JobConf, int)}. A custom {@link InputFormat} is typically used to accurately control the number of map tasks for the job.

    How many maps?

    The number of maps is usually driven by the total size of the inputs i.e. total number of blocks of the input files.

    The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

    The default behavior of file-based {@link InputFormat}s is to split the input into logical {@link InputSplit}s based on the total size, in bytes, of input files. However, the {@link FileSystem} blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.

    Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is used to set it even higher.

    @param n the number of map tasks for this job. @see InputFormat#getSplits(JobConf, int) @see FileInputFormat @see FileSystem#getDefaultBlockSize() @see FileStatus#getBlockSize()]]>
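
    For instance, the 10 TB example above works out as follows (a hedged sketch; job is an existing JobConf):

      // 10 TB / 128 MB = (10 * 1024 * 1024) MB / 128 MB = 81,920 blocks, i.e. roughly 82,000 splits.
      job.setNumMapTasks(82000);   // a hint only; InputFormat#getSplits(JobConf, int) has the final say
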
    1. @return the number of reduce tasks for this job.]]> How many reduces?

    The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

    With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

    Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

    The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks, failures etc.

    Reducer NONE

    It is legal to set the number of reduce-tasks to zero.

    In this case the output of the map-tasks goes directly to the distributed file-system, to the path set by {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the framework doesn't sort the map-outputs before writing them out to HDFS.

    @param n the number of reduce tasks for this job.]]>
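
    The 0.95 rule of thumb might be applied roughly like this (a hedged sketch; the default of 2 reduce slots per node is an assumption):

      ClusterStatus cluster = new JobClient(job).getClusterStatus();
      int reduceSlotsPerNode = job.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
      job.setNumReduceTasks((int) (0.95 * cluster.getTaskTrackers() * reduceSlotsPerNode));
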
    mapred.map.max.attempts property. If this property is not already set, the default is 4 attempts. @return the max number of attempts per map task.]]> mapred.reduce.max.attempts property. If this property is not already set, the default is 4 attempts. @return the max number of attempts per reduce task.]]> noFailures, the tasktracker is blacklisted for this job. @param noFailures maximum no. of failures of a given job per tasktracker.]]> blacklisted for this job. @return the maximum no. of failures of a given job per tasktracker.]]> failed. Defaults to zero, i.e. any failed map-task results in the job being declared as {@link JobStatus#FAILED}. @return the maximum percentage of map tasks that can fail without the job being aborted.]]> failed. @param percent the maximum percentage of map tasks that can fail without the job being aborted.]]> failed. Defaults to zero, i.e. any failed reduce-task results in the job being declared as {@link JobStatus#FAILED}. @return the maximum percentage of reduce tasks that can fail without the job being aborted.]]> failed. @param percent the maximum percentage of reduce tasks that can fail without the job being aborted.]]> The debug script can aid debugging of failed map tasks. The script is given task's stdout, stderr, syslog, jobconf files as arguments.

    The debug command, run on the node where the map failed, is:

    $script $stdout $stderr $syslog $jobconf.

    The script file is distributed through {@link DistributedCache} APIs. The script needs to be symlinked.

    Here is an example on how to submit a script

     job.setMapDebugScript("./myscript");
     DistributedCache.createSymlink(job);
     DistributedCache.addCacheFile("/debug/scripts/myscript#myscript");
     

    @param mDbgScript the script name]]>
    The debug script can aid debugging of failed reduce tasks. The script is given task's stdout, stderr, syslog, jobconf files as arguments.

    The debug command, run on the node where the reduce failed, is:

    $script $stdout $stderr $syslog $jobconf.

    The script file is distributed through {@link DistributedCache} APIs. The script file needs to be symlinked

    Here is an example on how to submit a script

     job.setReduceDebugScript("./myscript");
     DistributedCache.createSymlink(job);
     DistributedCache.addCacheFile("/debug/scripts/myscript#myscript");
     

    @param rDbgScript the script name]]>
    null if it hasn't been set. @see #setJobEndNotificationURI(String)]]> The uri can contain 2 special parameters: $jobId and $jobStatus. Those, if present, are replaced by the job's identifier and completion-status respectively.

    This is typically used by application-writers to implement chaining of Map-Reduce jobs in an asynchronous manner.

    @param uri the job end notification uri @see JobStatus @see Job Completion and Chaining]]>
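
    For example (the notification endpoint below is hypothetical):

      job.setJobEndNotificationURI("http://workflow.example.com/notify?id=$jobId&status=$jobStatus");
      // the framework substitutes $jobId and $jobStatus before issuing the HTTP request
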
    When a job starts, a shared directory is created at location ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ . This directory is exposed to the users through job.local.dir . So, the tasks can use this space as scratch space and share files among them.

    This value is available as System property also. @return The localized job specific shared directory]]>
    mapred.task.maxvmem is split into mapred.job.map.memory.mb and mapred.job.reduce.memory.mb; each of the new keys is set to mapred.task.maxvmem / 1024, as the new values are in MB @return The maximum amount of memory any task of this job will use, in bytes. @see #setMaxVirtualMemoryForTask(long) @deprecated Use {@link #getMemoryForMapTask()} and {@link #getMemoryForReduceTask()}]]> mapred.task.maxvmem is split into mapred.job.map.memory.mb and mapred.job.reduce.memory.mb; each of the new keys is set to mapred.task.maxvmem / 1024, as the new values are in MB @param vmem Maximum amount of virtual memory in bytes any task of this job can use. @see #getMaxVirtualMemoryForTask() @deprecated Use {@link #setMemoryForMapTask(long mem)} and {@link #setMemoryForReduceTask(long mem)}]]> JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as-is described by JobConf, however:
    1. Some configuration parameters might have been marked as final by administrators and hence cannot be altered.
    2. While some job parameters are straight-forward to set (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly with the rest of the framework and/or the job-configuration and are relatively more complex for the user to control finely (e.g. {@link #setNumMapTasks(int)}).

    JobConf typically specifies the {@link Mapper}, combiner (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and {@link OutputFormat} implementations to be used etc.

    Optionally JobConf is used to specify other advanced facets of the job such as Comparators to be used, files to be put in the {@link DistributedCache}, whether or not intermediate and/or job outputs are to be compressed (and how), and debuggability via user-provided scripts ({@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)}) for doing post-processing on task logs, task's stdout, stderr, syslog, etc.

    Here is an example on how to configure a job via JobConf:

         // Create a new JobConf
         JobConf job = new JobConf(new Configuration(), MyJob.class);
         
         // Specify various job-specific parameters     
         job.setJobName("myjob");
         
         FileInputFormat.setInputPaths(job, new Path("in"));
         FileOutputFormat.setOutputPath(job, new Path("out"));
         
         job.setMapperClass(MyJob.MyMapper.class);
         job.setCombinerClass(MyJob.MyReducer.class);
         job.setReducerClass(MyJob.MyReducer.class);
         
         job.setInputFormat(SequenceFileInputFormat.class);
         job.setOutputFormat(SequenceFileOutputFormat.class);
     

    @see JobClient @see ClusterStatus @see Tool @see DistributedCache @deprecated Use {@link Configuration} instead]]>
    .]]> ]]> any job run on the jobtracker started at 200707121733, we would use :
     
     JobID.getTaskIDsPattern("200707121733", null);
     
    which will return :
     "job_200707121733_[0-9]*" 
    @param jtIdentifier jobTracker identifier, or null @param jobId job number, or null @return a regex pattern matching JobIDs]]>
    An example JobID is : job_200707121733_0003 , which represents the third job running at the jobtracker started at 200707121733.

    Applications should never construct or parse JobID strings, but rather use appropriate constructors or {@link #forName(String)} method. @see TaskID @see TaskAttemptID]]> "N/A" @return Scheduling information associated to particular Job Queue]]> zero. @param conf configuration for the JobTracker. @throws IOException]]> Output pairs need not be of the same types as input pairs. A given input pair may map to zero or many output pairs. Output pairs are collected with calls to {@link OutputCollector#collect(Object,Object)}.

    Applications can use the {@link Reporter} provided to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. The other way of avoiding this is to set mapred.task.timeout to a high-enough value (or even zero for no time-outs).

    @param key the input key. @param value the input value. @param output collects mapped keys and values. @param reporter facility to report progress.]]>
    Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

    The Hadoop Map-Reduce framework spawns one map task for each {@link InputSplit} generated by the {@link InputFormat} for the job. Mapper implementations can access the {@link JobConf} for the job via the {@link JobConfigurable#configure(JobConf)} and initialize themselves. Similarly they can use the {@link Closeable#close()} method for de-initialization.

    The framework then calls {@link #map(Object, Object, OutputCollector, Reporter)} for each key/value pair in the InputSplit for that task.

    All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a {@link Reducer} to determine the final output. Users can control the grouping by specifying a Comparator via {@link JobConf#setOutputKeyComparatorClass(Class)}.

    The grouped Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom {@link Partitioner}.

    Users can optionally specify a combiner, via {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

    The intermediate, grouped outputs are always stored in {@link SequenceFile}s. Applications can specify if and how the intermediate outputs are to be compressed and which {@link CompressionCodec}s are to be used via the JobConf.

    If the job has zero reduces then the output of the Mapper is directly written to the {@link FileSystem} without grouping by keys.

    Example:

         public class MyMapper<K extends WritableComparable, V extends Writable> 
         extends MapReduceBase implements Mapper<K, V, K, V> {
         
           static enum MyCounters { NUM_RECORDS }
           
           private String mapTaskId;
           private String inputFile;
           private int noRecords = 0;
           
           public void configure(JobConf job) {
             mapTaskId = job.get("mapred.task.id");
             inputFile = job.get("map.input.file");
           }
           
           public void map(K key, V val,
                           OutputCollector<K, V> output, Reporter reporter)
           throws IOException {
             // Process the <key, value> pair (assume this takes a while)
             // ...
             // ...
             
             // Let the framework know that we are alive, and kicking!
             // reporter.progress();
             
             // Process some more
             // ...
             // ...
             
             // Increment the no. of <key, value> pairs processed
             ++noRecords;
    
             // Increment counters
             reporter.incrCounter(NUM_RECORDS, 1);
            
             // Every 100 records update application-level status
             if ((noRecords%100) == 0) {
               reporter.setStatus(mapTaskId + " processed " + noRecords + 
                                  " from input-file: " + inputFile); 
             }
             
             // Output the result
             output.collect(key, val);
           }
         }
     

    Applications may write a custom {@link MapRunnable} to exert greater control on map processing e.g. multi-threaded Mappers etc.

    @see JobConf @see InputFormat @see Partitioner @see Reducer @see MapReduceBase @see MapRunnable @see SequenceFile @deprecated Use {@link org.apache.hadoop.mapreduce.Mapper} instead.]]>
    Provides default no-op implementations for a few methods, most non-trivial applications need to override some of them.

    ]]>
    <key, value> pairs.

    Mapping of input records to output records is complete when this method returns.

    @param input the {@link RecordReader} to read the input records. @param output the {@link OutputCollector} to collect the output records. @param reporter {@link Reporter} to report progress, status-updates etc. @throws IOException]]>
    Custom implementations of MapRunnable can exert greater control on map processing e.g. multi-threaded, asynchronous mappers etc.

    @see Mapper @deprecated Use {@link org.apache.hadoop.mapreduce.Mapper} instead.]]>
    nearly equal content length.
    Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)} to construct RecordReader's for MultiFileSplit's. @see MultiFileSplit @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileInputFormat} instead]]>
    MultiFileSplit can be used to implement {@link RecordReader}'s, with reading one record per file. @see FileSplit @see MultiFileInputFormat @deprecated Use {@link org.apache.hadoop.mapred.lib.CombineFileSplit} instead]]> <key, value> pairs output by {@link Mapper}s and {@link Reducer}s.

    OutputCollector is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer i.e. intermediate outputs or the output of the job.

    ]]>
    OutputCommitter describes the commit of task output for a Map-Reduce job.

    The Map-Reduce framework relies on the OutputCommitter of the job to:

    1. Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job.
    2. Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion.
    3. Setup the task temporary output.
    4. Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.
    5. Commit of the task output.
    6. Discard the task commit.
    @see FileOutputCommitter @see JobContext @see TaskAttemptContext @deprecated Use {@link org.apache.hadoop.mapreduce.OutputCommitter} instead.]]>
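    A minimal sketch of the contract above, using the deprecated org.apache.hadoop.mapred API, is a committer that does nothing; the class name NoOpOutputCommitter is made up, and such a committer only makes sense when the OutputFormat already writes to its final location:

     import java.io.IOException;

     import org.apache.hadoop.mapred.JobContext;
     import org.apache.hadoop.mapred.OutputCommitter;
     import org.apache.hadoop.mapred.TaskAttemptContext;

     // Hypothetical do-nothing committer covering the six responsibilities above.
     public class NoOpOutputCommitter extends OutputCommitter {

       public void setupJob(JobContext context) throws IOException { }          // 1. job setup

       public void cleanupJob(JobContext context) throws IOException { }        // 2. job cleanup

       public void setupTask(TaskAttemptContext context) throws IOException { } // 3. task setup

       public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
         return false;                                                          // 4. nothing to commit
       }

       public void commitTask(TaskAttemptContext context) throws IOException { } // 5. commit task output

       public void abortTask(TaskAttemptContext context) throws IOException { }  // 6. discard task output
     }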
    This validates the output specification for the job when the job is submitted. Typically it checks that the output does not already exist, throwing an exception when it does, so that output is not overwritten.

    @param ignored @param job job configuration. @throws IOException when output should not be attempted]]>
    OutputFormat describes the output-specification for a Map-Reduce job.

    The Map-Reduce framework relies on the OutputFormat of the job to:

    1. Validate the output-specification of the job, e.g. check that the output directory doesn't already exist.
    2. Provide the {@link RecordWriter} implementation to be used to write out the output files of the job. Output files are stored in a {@link FileSystem}.
    @see RecordWriter @see JobConf @deprecated Use {@link org.apache.hadoop.mapreduce.OutputFormat} instead.]]>
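    As an illustration of point 1, a hedged sketch of an OutputFormat that re-checks the output directory before running is shown below; StrictTextOutputFormat is a hypothetical name, and the stock {@link FileOutputFormat} already performs an equivalent check, so this only spells the step out:

     import java.io.IOException;

     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapred.FileOutputFormat;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.TextOutputFormat;

     // Hypothetical OutputFormat: reuses TextOutputFormat's RecordWriter and only
     // makes the output-spec validation step explicit.
     public class StrictTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

       @Override
       public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
         Path outDir = FileOutputFormat.getOutputPath(job);
         if (outDir == null) {
           throw new IOException("No output directory set for the job");
         }
         FileSystem fs = outDir.getFileSystem(job);
         if (fs.exists(outDir)) {
           // Refuse to run rather than overwrite existing output.
           throw new IOException("Output directory " + outDir + " already exists");
         }
       }
     }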
    Typically a hash function on all or a subset of the key.

    @param key the key to be partitioned. @param value the entry value. @param numPartitions the total number of partitions. @return the partition number for the key.]]>
    Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.

    @see Reducer @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.]]>
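    A minimal sketch of such a hash-based partitioner for this (deprecated) API follows; ModuloPartitioner is a hypothetical name and mirrors what the stock HashPartitioner does:

     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.Partitioner;

     // Hypothetical partitioner: derives the partition from the key's hashCode().
     public class ModuloPartitioner<K, V> implements Partitioner<K, V> {

       public void configure(JobConf job) { }

       public int getPartition(K key, V value, int numPartitions) {
         // Mask the sign bit so negative hash codes still map to a valid partition.
         return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
       }
     }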
    true if there exists a key/value, false otherwise. @throws IOException]]> RawKeyValueIterator is an iterator used to iterate over the raw keys and values during sort/merge of intermediate data.]]> 0.0 to 1.0. @throws IOException]]> RecordReader reads <key, value> pairs from an {@link InputSplit}.

    RecordReader, typically, converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view for the {@link Mapper} & {@link Reducer} tasks for processing. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.

    @see InputSplit @see InputFormat]]>
    RecordWriter to future operations. @param reporter facility to report progress. @throws IOException]]> RecordWriter writes the output <key, value> pairs to an output file.

    RecordWriter implementations write the job outputs to the {@link FileSystem}. @see OutputFormat]]> Reduces values for a given key.

    The framework calls this method for each <key, (list of values)> pair in the grouped inputs. Output values must be of the same type as input values. Input keys must not be altered. The framework will reuse the key and value objects that are passed into the reduce; therefore the application should clone the objects it wants to keep a copy of. In many cases, all values are combined into zero or one value.

    Output pairs are collected with calls to {@link OutputCollector#collect(Object,Object)}.

    Applications can use the {@link Reporter} provided to report progress or just indicate that they are alive. In scenarios where the application takes an insignificant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. The other way of avoiding this is to set mapred.task.timeout to a high-enough value (or even zero for no time-outs).

    @param key the key. @param values the list of values to reduce. @param output to collect keys and combined values. @param reporter facility to report progress.]]>
    The number of Reducers for the job is set by the user via {@link JobConf#setNumReduceTasks(int)}. Reducer implementations can access the {@link JobConf} for the job via the {@link JobConfigurable#configure(JobConf)} method and initialize themselves. Similarly they can use the {@link Closeable#close()} method for de-initialization.

    Reducer has 3 primary phases:

    1. Shuffle

      Reducer is given the grouped output of a {@link Mapper} as input. In this phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

    2. Sort

      The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage.

      The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

      SecondarySort

      If equivalence rules for keys while grouping the intermediates are different from those for grouping keys before reduction, then one may specify a Comparator via {@link JobConf#setOutputValueGroupingComparator(Class)}. Since {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

      For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:
      • Map Input Key: url
      • Map Input Value: document
      • Map Output Key: document checksum, url pagerank
      • Map Output Value: url
      • Partitioner: by checksum
      • OutputKeyComparator: by checksum and then decreasing pagerank
      • OutputValueGroupingComparator: by checksum
    3. Reduce

      In this phase the {@link #reduce(Object, Iterator, OutputCollector, Reporter)} method is called for each <key, (list of values)> pair in the grouped inputs.

      The output of the reduce task is typically written to the {@link FileSystem} via {@link OutputCollector#collect(Object, Object)}.

    The output of the Reducer is not re-sorted.

    Example:

         public class MyReducer<K extends WritableComparable, V extends Writable> 
         extends MapReduceBase implements Reducer<K, V, K, V> {
         
           static enum MyCounters { NUM_RECORDS }
            
           private String reduceTaskId;
           private int noKeys = 0;
           
           public void configure(JobConf job) {
             reduceTaskId = job.get("mapred.task.id");
           }
           
           public void reduce(K key, Iterator<V> values,
                              OutputCollector<K, V> output, 
                              Reporter reporter)
           throws IOException {
           
             // Process
             int noValues = 0;
             while (values.hasNext()) {
               V value = values.next();
               
               // Increment the no. of values for this key
               ++noValues;
               
               // Process the <key, value> pair (assume this takes a while)
               // ...
               // ...
               
               // Let the framework know that we are alive, and kicking!
               if ((noValues%10) == 0) {
                 reporter.progress();
               }
             
               // Process some more
               // ...
               // ...
               
               // Output the <key, value> 
               output.collect(key, value);
             }
             
             // Increment the no. of <key, list of values> pairs processed
             ++noKeys;
             
             // Increment counters
              reporter.incrCounter(MyCounters.NUM_RECORDS, 1);
             
             // Every 100 keys update application-level status
             if ((noKeys%100) == 0) {
               reporter.setStatus(reduceTaskId + " processed " + noKeys);
             }
           }
         }
     

    @see Mapper @see Partitioner @see Reporter @see MapReduceBase @deprecated Use {@link org.apache.hadoop.mapreduce.Reducer} instead.]]>
    Counter of the given group/name.]]> Counter of the given group/name.]]> Enum. @param amount A non-negative amount by which the counter is to be incremented.]]> InputSplit that the map is reading from. @throws UnsupportedOperationException if called outside a mapper]]> {@link Mapper} and {@link Reducer} can use the Reporter provided to report progress or just indicate that they are alive. In scenarios where the application takes an insignificant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task.

    Applications can also update {@link Counters} via the provided Reporter.

    @see Progressable @see Counters]]>
    progress of the job's map-tasks, as a float between 0.0 and 1.0. When all map tasks have completed, the function returns 1.0. @return the progress of the job's map-tasks. @throws IOException]]> progress of the job's reduce-tasks, as a float between 0.0 and 1.0. When all reduce tasks have completed, the function returns 1.0. @return the progress of the job's reduce-tasks. @throws IOException]]> progress of the job's cleanup-tasks, as a float between 0.0 and 1.0. When all cleanup tasks have completed, the function returns 1.0. @return the progress of the job's cleanup-tasks. @throws IOException]]> progress of the job's setup-tasks, as a float between 0.0 and 1.0. When all setup tasks have completed, the function returns 1.0. @return the progress of the job's setup-tasks. @throws IOException]]> true if the job is complete, else false. @throws IOException]]> true if the job succeeded, else false. @throws IOException]]> RunningJob is the user-interface to query for details on a running Map-Reduce job.

    Clients can get hold of RunningJob via the {@link JobClient} and then query the running-job for details such as name, configuration, progress etc.

    @see JobClient]]>
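    A hedged sketch of that usage pattern is shown below; the driver class name JobMonitor is made up, and the actual job set-up (mapper, input/output paths, etc.) is omitted:

     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.RunningJob;

     public class JobMonitor {
       public static void main(String[] args) throws Exception {
         JobConf conf = new JobConf();   // job set-up (mapper, paths, ...) omitted here
         JobClient jc = new JobClient(conf);
         RunningJob rj = jc.submitJob(conf);
         while (!rj.isComplete()) {
           // Poll the running job for progress every five seconds.
           System.out.printf("map %.0f%% reduce %.0f%%%n",
               rj.mapProgress() * 100, rj.reduceProgress() * 100);
           Thread.sleep(5000);
         }
         System.out.println(rj.isSuccessful() ? "Job succeeded" : "Job failed");
       }
     }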
    This allows the user to specify the key class to be different from the actual class ({@link BytesWritable}) used for writing

    @param conf the {@link JobConf} to modify @param theClass the SequenceFile output key class.]]>
    This allows the user to specify the value class to be different from the actual class ({@link BytesWritable}) used for writing

    @param conf the {@link JobConf} to modify @param theClass the SequenceFile output value class.]]>
    f. The filtering criteria is MD5(key) % f == 0.]]> f using the criteria record# % f == 0. For example, if the frequency is 10, one out of 10 records is returned.]]> true if auto increment {@link SkipBadRecords#COUNTER_MAP_PROCESSED_RECORDS}. false otherwise.]]> true if auto increment {@link SkipBadRecords#COUNTER_REDUCE_PROCESSED_GROUPS}. false otherwise.]]> Hadoop provides an optional mode of execution in which the bad records are detected and skipped in further attempts.

    This feature can be used when map/reduce tasks crash deterministically on certain input, due to bugs in the map/reduce function. The usual course would be to fix these bugs, but sometimes that is not possible; perhaps the bug is in third-party libraries for which the source code is not available. In that case the task never reaches completion even with multiple attempts, and the complete data for that task is lost.

    With this feature, only a small portion of data surrounding the bad record is lost, which may be acceptable for some user applications. See {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}.

    Skipping mode is turned on after a certain number of failures; see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}.

    In skipping mode, the map/reduce task maintains the record range currently being processed. Before giving the input to the map/reduce function, it sends this record range to the TaskTracker. If the task crashes, the TaskTracker knows the last reported range, and on further attempts that range is skipped.

    ]]>
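    A small configuration sketch, with illustrative thresholds only, might look like this (SkipModeConfig is a made-up helper name):

     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.SkipBadRecords;

     public class SkipModeConfig {
       // Illustrative settings only; tune the thresholds for the actual job.
       public static void configure(JobConf conf) {
         // Start skipping mode after two failed attempts of the same task.
         SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
         // Tolerate losing at most one record around a bad map input record.
         SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
         // Tolerate losing at most one group around a bad reduce input group.
         SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
       }
     }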
    all task attempt IDs of any jobtracker, in any job, of the first map task, we would use :
     
     TaskAttemptID.getTaskAttemptIDsPattern(null, null, true, 1, null);
     
    which will return :
     "attempt_[^_]*_[0-9]*_m_000001_[0-9]*" 
    @param jtIdentifier jobTracker identifier, or null @param jobId job number, or null @param isMap whether the tip is a map, or null @param taskId taskId number, or null @param attemptId the task attempt number, or null @return a regex pattern matching TaskAttemptIDs]]>
    An example TaskAttemptID is attempt_200707121733_0003_m_000005_0, which represents the zeroth task attempt for the fifth map task in the third job running at the jobtracker started at 200707121733.

    Applications should never construct or parse TaskAttemptID strings, but rather use appropriate constructors or {@link #forName(String)} method. @see JobID @see TaskID]]> the first map task of any jobtracker, of any job, we would use :

     
     TaskID.getTaskIDsPattern(null, null, true, 1);
     
    which will return :
     "task_[^_]*_[0-9]*_m_000001*" 
    @param jtIdentifier jobTracker identifier, or null @param jobId job number, or null @param isMap whether the tip is a map, or null @param taskId taskId number, or null @return a regex pattern matching TaskIDs]]>
    An example TaskID is task_200707121733_0003_m_000005, which represents the fifth map task in the third job running at the jobtracker started at 200707121733.

    Applications should never construct or parse TaskID strings, but rather use appropriate constructors or {@link #forName(String)} method. @see JobID @see TaskAttemptID]]> hadoop.log.dir.]]> true if the Job was added.]]> func ::= <ident>([<func>,]*<func>) func ::= tbl(<class>,"<path>") class ::= @see java.lang.Class#forName(java.lang.String) path ::= @see org.apache.hadoop.fs.Path#Path(java.lang.String) } Reads expression from the mapred.join.expr property and user-supplied join types from mapred.join.define.<ident> types. Paths supplied to tbl are given as input paths to the InputFormat class listed. @see #compose(java.lang.String, java.lang.Class, java.lang.String...)]]> ,

    ) }]]> <op>(tbl(<inf>,<p1>),tbl(<inf>,<p2>),...,tbl(<inf>,<pn>)) }]]> <op>(tbl(<inf>,<p1>),tbl(<inf>,<p2>),...,tbl(<inf>,<pn>)) }]]> mapred.join.define.<ident> to a classname. In the expression mapred.join.expr, the identifier will be assumed to be a ComposableRecordReader. mapred.join.keycomparator can be a classname used to compare keys in the join. @see JoinRecordReader @see MultiFilterRecordReader]]> ...... }]]> capacity children to position id in the parent reader. The id of a root CompositeRecordReader is -1 by convention, but relying on this is not recommended.]]> override(S1,S2,S3) will prefer values from S3 over S2, and values from S2 over S1 for all keys emitted from all sources.]]> [,,...,]]]> out. TupleWritable format: {@code <count><type1><type2>...<typen><obj1><obj2>...<objn> }]]> It has to be specified how key and values are passed from one element of the chain to the next, by value or by reference. If a Mapper leverages the assumed semantics that the key and values are not modified by the collector 'by value' must be used. If the Mapper does not expect this semantics, as an optimization to avoid serialization and deserialization 'by reference' can be used.

    For the added Mapper the configuration given for it, mapperConf, has precedence over the job's JobConf. This precedence is in effect when the task is running.

    IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by the addMapper for the last mapper in the chain.

    @param job job's JobConf to add the Mapper class. @param klass the Mapper class to add. @param inputKeyClass mapper input key class. @param inputValueClass mapper input value class. @param outputKeyClass mapper output key class. @param outputValueClass mapper output value class. @param byValue indicates if key/values should be passed by value to the next Mapper in the chain, if any. @param mapperConf a JobConf with the configuration for the Mapper class. It is recommended to use a JobConf without default values using the JobConf(boolean loadDefaults) constructor with FALSE.]]> If this method is overriden super.configure(...) should be invoked at the beginning of the overwriter method.]]> map(...) methods of the Mappers in the chain.]]> If this method is overriden super.close() should be invoked at the end of the overwriter method.]]> The Mapper classes are invoked in a chained (or piped) fashion, the output of the first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.

    The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed in a chain. This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task.

    Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code.

    Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.

    IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by the addMapper for the last mapper in the chain.

    ChainMapper usage pattern:

     ...
     conf.setJobName("chain");
     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);
     

     JobConf mapAConf = new JobConf(false);
     ...
     ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
                           Text.class, Text.class, true, mapAConf);

     JobConf mapBConf = new JobConf(false);
     ...
     ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
                           LongWritable.class, Text.class, false, mapBConf);

     JobConf reduceConf = new JobConf(false);
     ...
     ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
                             Text.class, Text.class, true, reduceConf);

     ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
                            LongWritable.class, Text.class, false, null);

     ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
                            LongWritable.class, LongWritable.class, true, null);

     FileInputFormat.setInputPaths(conf, inDir);
     FileOutputFormat.setOutputPath(conf, outDir);
     ...

     JobClient jc = new JobClient(conf);
     RunningJob job = jc.submitJob(conf);
     ...

    ]]>
    It has to be specified how key and values are passed from one element of the chain to the next, by value or by reference. If a Reducer leverages the assumed semantics that the key and values are not modified by the collector 'by value' must be used. If the Reducer does not expect this semantics, as an optimization to avoid serialization and deserialization 'by reference' can be used.

    For the added Reducer the configuration given for it, reducerConf, has precedence over the job's JobConf. This precedence is in effect when the task is running.

    IMPORTANT: There is no need to specify the output key/value classes for the ChainReducer, this is done by the setReducer or the addMapper for the last element in the chain. @param job job's JobConf to add the Reducer class. @param klass the Reducer class to add. @param inputKeyClass reducer input key class. @param inputValueClass reducer input value class. @param outputKeyClass reducer output key class. @param outputValueClass reducer output value class. @param byValue indicates if key/values should be passed by value to the next Mapper in the chain, if any. @param reducerConf a JobConf with the configuration for the Reducer class. It is recommended to use a JobConf without default values using the JobConf(boolean loadDefaults) constructor with FALSE.]]> It has to be specified how key and values are passed from one element of the chain to the next, by value or by reference. If a Mapper leverages the assumed semantics that the key and values are not modified by the collector 'by value' must be used. If the Mapper does not expect this semantics, as an optimization to avoid serialization and deserialization 'by reference' can be used.

    For the added Mapper the configuration given for it, mapperConf, has precedence over the job's JobConf. This precedence is in effect when the task is running.

    IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper, this is done by the addMapper for the last mapper in the chain . @param job chain job's JobConf to add the Mapper class. @param klass the Mapper class to add. @param inputKeyClass mapper input key class. @param inputValueClass mapper input value class. @param outputKeyClass mapper output key class. @param outputValueClass mapper output value class. @param byValue indicates if key/values should be passed by value to the next Mapper in the chain, if any. @param mapperConf a JobConf with the configuration for the Mapper class. It is recommended to use a JobConf without default values using the JobConf(boolean loadDefaults) constructor with FALSE.]]> If this method is overriden super.configure(...) should be invoked at the beginning of the overwriter method.]]> reduce(...) method of the Reducer with the map(...) methods of the Mappers in the chain.]]> If this method is overriden super.close() should be invoked at the end of the overwriter method.]]> For each record output by the Reducer, the Mapper classes are invoked in a chained (or piped) fashion, the output of the first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.

    The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed after the Reducer or in a chain. This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task.

    Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code.

    Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.

    IMPORTANT: There is no need to specify the output key/value classes for the ChainReducer; this is done by the setReducer or the addMapper for the last element in the chain.

    ChainReducer usage pattern:

     ...
     conf.setJobName("chain");
     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);
     

     JobConf mapAConf = new JobConf(false);
     ...
     ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
                           Text.class, Text.class, true, mapAConf);

     JobConf mapBConf = new JobConf(false);
     ...
     ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
                           LongWritable.class, Text.class, false, mapBConf);

     JobConf reduceConf = new JobConf(false);
     ...
     ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
                             Text.class, Text.class, true, reduceConf);

     ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
                            LongWritable.class, Text.class, false, null);

     ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
                            LongWritable.class, LongWritable.class, true, null);

     FileInputFormat.setInputPaths(conf, inDir);
     FileOutputFormat.setOutputPath(conf, outDir);
     ...

     JobClient jc = new JobClient(conf);
     RunningJob job = jc.submitJob(conf);
     ...

    ]]>
    RecordReader's for CombineFileSplit's. @see CombineFileSplit]]> th Path]]> th Path]]> th Path]]> CombineFileSplit can be used to implement {@link org.apache.hadoop.mapred.RecordReader}'s, with reading one record per file. @see org.apache.hadoop.mapred.FileSplit @see CombineFileInputFormat]]> all splits. @param freq The frequency with which records will be emitted.]]> all splits. This will read every split at the client, which is very expensive. @param freq Probability with which a key will be chosen. @param numSamples Total number of samples to obtain from all selected splits.]]> all splits. Takes the first numSamples / numSplits records from each split. @param numSamples Total number of samples to obtain from all selected splits.]]> true if the name output is multi, false if it is single. If the name output is not defined it returns false]]> @param conf job conf to add the named output @param namedOutput named output name, it has to be a word, letters and numbers only, cannot be the word 'part' as that is reserved for the default output. @param outputFormatClass OutputFormat class. @param keyClass key class @param valueClass value class]]> @param conf job conf to add the named output @param namedOutput named output name, it has to be a word, letters and numbers only, cannot be the word 'part' as that is reserved for the default output. @param outputFormatClass OutputFormat class. @param keyClass key class @param valueClass value class]]> By default these counters are disabled.

    MultipleOutputs supports counters; by default they are disabled. The counters group is the {@link MultipleOutputs} class name.

    The names of the counters are the same as the named outputs. For multi named outputs the name of the counter is the concatenation of the named output, an underscore '_', and the multiname. @param conf job conf to enable the counters for the named output. @param enabled indicates if the counters will be enabled or not.]]>
    By default these counters are disabled.

    MultipleOutputs supports counters; by default they are disabled. The counters group is the {@link MultipleOutputs} class name.

    The names of the counters are the same as the named outputs. For multi named outputs the name of the counter is the concatenation of the named output, an underscore '_', and the multiname. @param conf job conf for the named output. @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
    @param namedOutput the named output name @param reporter the reporter @return the output collector for the given named output @throws IOException thrown if output collector could not be created]]> @param namedOutput the named output name @param multiName the multi name part @param reporter the reporter @return the output collector for the given named output @throws IOException thrown if output collector could not be created]]> If overriden subclasses must invoke super.close() at the end of their close() @throws java.io.IOException thrown if any of the MultipleOutput files could not be closed properly.]]> OutputCollector passed to the map() and reduce() methods of the Mapper and Reducer implementations.

    Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.

    A named output can be a single file or a multi file. The latter is referred to as a multi named output.

    A multi named output is an unbounded set of files all sharing the same OutputFormat, key class and value class configuration.

    When named outputs are used within a Mapper implementation, key/values written to a named output are not part of the reduce phase; only key/values written to the job OutputCollector are part of the reduce phase.

    MultipleOutputs supports counters; by default they are disabled. The counters group is the {@link MultipleOutputs} class name.

    The names of the counters are the same as the named outputs. For multi named outputs the name of the counter is the concatenation of the named output, an underscore '_', and the multiname.

    Job configuration usage pattern is:

    
     JobConf conf = new JobConf();
    
     conf.setInputPath(inDir);
     FileOutputFormat.setOutputPath(conf, outDir);
    
     conf.setMapperClass(MOMap.class);
     conf.setReducerClass(MOReduce.class);
     ...
    
     // Defines additional single text based output 'text' for the job
     MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
     LongWritable.class, Text.class);
    
     // Defines additional multi sequencefile based output 'sequence' for the
     // job
     MultipleOutputs.addMultiNamedOutput(conf, "seq",
       SequenceFileOutputFormat.class,
       LongWritable.class, Text.class);
     ...
    
     JobClient jc = new JobClient();
     RunningJob job = jc.submitJob(conf);
    
     ...
     

    Job configuration usage pattern is:

    
     public class MOReduce implements
       Reducer<WritableComparable, Writable> {
     private MultipleOutputs mos;
    
     public void configure(JobConf conf) {
     ...
     mos = new MultipleOutputs(conf);
     }
    
     public void reduce(WritableComparable key, Iterator<Writable> values,
     OutputCollector output, Reporter reporter)
     throws IOException {
     ...
     mos.getCollector("text", reporter).collect(key, new Text("Hello"));
     mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
     mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
     ...
     }
    
     public void close() throws IOException {
     mos.close();
     ...
     }
    
     }
     
    ]]>
    It can be used instead of the default implementation, {@link org.apache.hadoop.mapred.MapRunner}, when the Map operation is not CPU bound, in order to improve throughput.

    Map implementations using this MapRunnable must be thread-safe.

    The Map-Reduce job has to be configured to use this MapRunnable class (using the JobConf.setMapRunnerClass method) and the number of threads the thread-pool can use with the mapred.map.multithreadedrunner.threads property; its default value is 10 threads.

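    A configuration sketch for the settings just described follows; the helper class name is made up, and the thread count of 20 is only an example for an I/O-bound map function:

     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

     public class MultithreadedRunnerConfig {
       public static void configure(JobConf conf) {
         // Run the job's Mapper through the multi-threaded runner.
         conf.setMapRunnerClass(MultithreadedMapRunner.class);
         // Use 20 threads instead of the default 10 (the Mapper must be thread-safe).
         conf.setInt("mapred.map.multithreadedrunner.threads", 20);
       }
     }
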
    ]]> pairs. Uses {@link StringTokenizer} to break text into tokens. @deprecated Use {@link org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper} instead.]]> total.order.partitioner.natural.order is not false, a trie of the first total.order.partitioner.max.trie.depth(2) + 1 bytes will be built. Otherwise, keys will be located using a binary search of the partition keyset using the {@link org.apache.hadoop.io.RawComparator} defined for this job. The input file must be sorted with the same comparator and contain {@link org.apache.hadoop.mapred.JobConf#getNumReduceTasks} - 1 keys.]]> R reduces, there are R-1 keys in the SequenceFile.]]> generateKeyValPairs(Object key, Object value); public void configure(JobConfjob); } The package also provides a base class, ValueAggregatorBaseDescriptor, implementing the above interface. The user can extend the base class and implement generateKeyValPairs accordingly. The primary work of generateKeyValPairs is to emit one or more key/value pairs based on the input key/value pair. The key in an output key/value pair encode two pieces of information: aggregation type and aggregation id. The value will be aggregated onto the aggregation id according the aggregation type. This class offers a function to generate a map/reduce job using Aggregate framework. The function takes the following parameters: input directory spec input format (text or sequence file) output directory a file specifying the user plugin class]]> The job can be configured using the static methods in this class, {@link DBInputFormat}, and {@link DBOutputFormat}.

    Alternatively, the properties can be set in the configuration with proper values. @see DBConfiguration#configureDB(JobConf, String, String, String, String) @see DBInputFormat#setInput(JobConf, Class, String, String) @see DBInputFormat#setInput(JobConf, Class, String, String, String, String...) @see DBOutputFormat#setOutput(JobConf, String, String...)]]> 20070101 AND length > 0)' @param orderBy the fieldNames in the orderBy clause. @param fieldNames The field names in the table @see #setInput(JobConf, Class, String, String)]]> DBInputFormat emits LongWritables containing the record number as key and DBWritables as value. The SQL query, and input class can be using one of the two setInput methods.]]> {@link DBOutputFormat} accepts <key,value> pairs, where key has a type extending DBWritable. Returned {@link RecordWriter} writes only the key to the database with a batch SQL query.]]> DBWritable. DBWritable, is similar to {@link Writable} except that the {@link #write(PreparedStatement)} method takes a {@link PreparedStatement}, and {@link #readFields(ResultSet)} takes a {@link ResultSet}.

    Implementations are responsible for writing the fields of the object to PreparedStatement, and reading the fields of the object from the ResultSet.

    Example:

    If we have the following table in the database :
     CREATE TABLE MyTable (
       counter        INTEGER NOT NULL,
       timestamp      BIGINT  NOT NULL
     );
     
    then we can read/write the tuples from/to the table with :

     public class MyWritable implements Writable, DBWritable {
       // Some data     
       private int counter;
       private long timestamp;
           
       //Writable#write() implementation
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }
           
       //Writable#readFields() implementation
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }
           
       public void write(PreparedStatement statement) throws SQLException {
         statement.setInt(1, counter);
         statement.setLong(2, timestamp);
       }
           
       public void readFields(ResultSet resultSet) throws SQLException {
         counter = resultSet.getInt(1);
         timestamp = resultSet.getLong(2);
       } 
     }
     

    ]]>
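    A hedged sketch of wiring the MyWritable class above into a job with DBConfiguration and DBInputFormat is shown below; the JDBC driver, connection URL and credentials are placeholders, and the helper class name is made up:

     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.lib.db.DBConfiguration;
     import org.apache.hadoop.mapred.lib.db.DBInputFormat;

     public class MyTableJobSetup {
       // Driver class, URL and credentials below are placeholders only.
       public static void configure(JobConf conf) {
         conf.setInputFormat(DBInputFormat.class);
         DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
             "jdbc:mysql://localhost/mydb", "user", "password");
         // Read MyWritable tuples (the example class above, assumed to be in the
         // same package) from MyTable, ordered by timestamp.
         DBInputFormat.setInput(conf, MyWritable.class, "MyTable",
             null /* conditions */, "timestamp", "counter", "timestamp");
       }
     }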
    Counters represent global counters, defined either by the Map-Reduce framework or applications. Each Counter is named by an {@link Enum} and has a long for the value.

    Counters are bunched into Groups, each comprising counters from a particular Enum class.]]> Each {@link InputSplit} is then assigned to an individual {@link Mapper} for processing.

    Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be a <input-file-path, start, offset> tuple. The InputFormat also creates the {@link RecordReader} to read the {@link InputSplit}. @param context job configuration. @return an array of {@link InputSplit}s for the job.]]> InputFormat describes the input-specification for a Map-Reduce job.

    The Map-Reduce framework relies on the InputFormat of the job to:

    1. Validate the input-specification of the job.
    2. Split-up the input file(s) into logical {@link InputSplit}s, each of which is then assigned to an individual {@link Mapper}.
    3. Provide the {@link RecordReader} implementation to be used to glean input records from the logical InputSplit for processing by the {@link Mapper}.

    The default behavior of file-based {@link InputFormat}s, typically sub-classes of {@link FileInputFormat}, is to split the input into logical {@link InputSplit}s based on the total size, in bytes, of the input files. However, the {@link FileSystem} blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.

    Clearly, logical splits based on input-size are insufficient for many applications since record boundaries are to be respected. In such cases, the application has to also implement a {@link RecordReader} on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task. @see InputSplit @see RecordReader @see FileInputFormat]]> InputSplit represents the data to be processed by an individual {@link Mapper}.

    Typically, it presents a byte-oriented view on the input and is the responsibility of {@link RecordReader} of the job to process this and present a record-oriented view. @see InputFormat @see RecordReader]]> InputFormat to use @throws IllegalStateException if the job is submitted]]> OutputFormat to use @throws IllegalStateException if the job is submitted]]> Mapper to use @throws IllegalStateException if the job is submitted]]> Reducer to use @throws IllegalStateException if the job is submitted]]> Partitioner to use @throws IllegalStateException if the job is submitted]]> progress of the job's map-tasks, as a float between 0.0 and 1.0. When all map tasks have completed, the function returns 1.0. @return the progress of the job's map-tasks. @throws IOException]]> progress of the job's reduce-tasks, as a float between 0.0 and 1.0. When all reduce tasks have completed, the function returns 1.0. @return the progress of the job's reduce-tasks. @throws IOException]]> true if the job is complete, else false. @throws IOException]]> true if the job succeeded, else false. @throws IOException]]> JobTracker is lost]]> 1. @return the number of reduce tasks for this job.]]> An example JobID is : job_200707121733_0003 , which represents the third job running at the jobtracker started at 200707121733.
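    To tie the Job setter methods referenced above together, a hedged driver sketch using the new org.apache.hadoop.mapreduce API follows; WordCountDriver is a made-up name, and it assumes the TokenCounterMapper and IntSumReducer example classes shown later in this document are compiled into the same package:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

     public class WordCountDriver {
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "word count");
         job.setJarByClass(WordCountDriver.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         // TokenCounterMapper and IntSumReducer are the example classes from this
         // document, assumed to be available in the same package as this driver.
         job.setMapperClass(TokenCounterMapper.class);
         job.setReducerClass(IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }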

    Applications should never construct or parse JobID strings, but rather use appropriate constructors or {@link #forName(String)} method. @see TaskID @see TaskAttemptID @see org.apache.hadoop.mapred.JobTracker#getNewJobId() @see org.apache.hadoop.mapred.JobTracker#getStartTime()]]> the key input type to the Mapper @param the value input type to the Mapper @param the key output type from the Mapper @param the value output type from the Mapper]]> Maps are the individual tasks which transform input records into a intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

    The Hadoop Map-Reduce framework spawns one map task for each {@link InputSplit} generated by the {@link InputFormat} for the job. Mapper implementations can access the {@link Configuration} for the job via the {@link JobContext#getConfiguration()}.

    The framework first calls {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by {@link #map(Object, Object, Context)} for each key/value pair in the InputSplit. Finally {@link #cleanup(Context)} is called.

    All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a {@link Reducer} to determine the final output. Users can control the sorting and grouping by specifying two key {@link RawComparator} classes.

    The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom {@link Partitioner}.

    Users can optionally specify a combiner, via {@link Job#setCombinerClass(Class)}, to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

    Applications can specify if and how the intermediate outputs are to be compressed and which {@link CompressionCodec}s are to be used via the Configuration.

    If the job has zero reduces then the output of the Mapper is directly written to the {@link OutputFormat} without sorting by keys.

    Example:

     public class TokenCounterMapper 
         extends Mapper<Object, Text, Text, IntWritable> {
        
       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();
       
       public void map(Object key, Text value, Context context) 
           throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
     }
     

    Applications may override the {@link #run(Context)} method to exert greater control over map processing, e.g. multi-threaded Mappers.

    @see InputFormat @see JobContext @see Partitioner @see Reducer]]>
    OutputCommitter describes the commit of task output for a Map-Reduce job.

    The Map-Reduce framework relies on the OutputCommitter of the job to:

    1. Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job.
    2. Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion.
    3. Setup the task temporary output.
    4. Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.
    5. Commit of the task output.
    6. Discard the task commit.
    @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter @see JobContext @see TaskAttemptContext]]>
    This validates the output specification for the job when the job is submitted. Typically it checks that the output does not already exist, throwing an exception when it does, so that output is not overwritten.

    @param context information about the job @throws IOException when output should not be attempted]]>
    OutputFormat describes the output-specification for a Map-Reduce job.

    The Map-Reduce framework relies on the OutputFormat of the job to:

    1. Validate the output-specification of the job, e.g. check that the output directory doesn't already exist.
    2. Provide the {@link RecordWriter} implementation to be used to write out the output files of the job. Output files are stored in a {@link FileSystem}.
    @see RecordWriter]]>
    Typically a hash function on all or a subset of the key.

    @param key the key to be partitioned. @param value the entry value. @param numPartitions the total number of partitions. @return the partition number for the key.]]>
    Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.

    @see Reducer]]>
    @param ]]> RecordWriter to future operations. @param context the context of the task @throws IOException]]> RecordWriter writes the output <key, value> pairs to an output file.

    RecordWriter implementations write the job outputs to the {@link FileSystem}. @see OutputFormat]]> the class of the input keys @param the class of the input values @param the class of the output keys @param the class of the output values]]> Reducer implementations can access the {@link Configuration} for the job via the {@link JobContext#getConfiguration()} method.

    Reducer has 3 primary phases:

    1. Shuffle

      The Reducer copies the sorted output from each {@link Mapper} using HTTP across the network.

    2. Sort

      The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

      The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

      SecondarySort

      To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via {@link Job#setGroupingComparatorClass(Class)}. The sort order is controlled by {@link Job#setSortComparatorClass(Class)}.

      For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:
      • Map Input Key: url
      • Map Input Value: document
      • Map Output Key: document checksum, url pagerank
      • Map Output Value: url
      • Partitioner: by checksum
      • OutputKeyComparator: by checksum and then decreasing pagerank
      • OutputValueGroupingComparator: by checksum
    3. Reduce

      In this phase the {@link #reduce(Object, Iterable, Context)} method is called for each <key, (collection of values)> in the sorted inputs.

      The output of the reduce task is typically written to a {@link RecordWriter} via {@link Context#write(Object, Object)}.

    The output of the Reducer is not re-sorted.

    Example:

     public class IntSumReducer<Key> extends Reducer<Key, IntWritable,
                                                     Key, IntWritable> {
       private IntWritable result = new IntWritable();
     
       public void reduce(Key key, Iterable<IntWritable> values, 
                          Context context) throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
       }
     }
     

    @see Mapper @see Partitioner]]>
    An example TaskAttemptID is attempt_200707121733_0003_m_000005_0, which represents the zeroth task attempt for the fifth map task in the third job running at the jobtracker started at 200707121733.

    Applications should never construct or parse TaskAttemptID strings, but rather use appropriate constructors or {@link #forName(String)} method. @see JobID @see TaskID]]> An example TaskID is task_200707121733_0003_m_000005, which represents the fifth map task in the third job running at the jobtracker started at 200707121733.

    Applications should never construct or parse TaskID strings , but rather use appropriate constructors or {@link #forName(String)} method. @see JobID @see TaskAttemptID]]> the input key type for the task @param the input value type for the task @param the output key type for the task @param the output value type for the task]]> FileInputFormat implementations can override this and return false to ensure that individual input files are never split-up so that {@link Mapper}s process entire files. @param context the job context @param filename the file name to check @return is this file splitable?]]> FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of {@link #getSplits(JobContext)}. Subclasses of FileInputFormat can also override the {@link #isSplitable(JobContext, Path)} method to ensure input-files are not split-up and are processed as a whole by {@link Mapper}s.]]> the map's input key type @param the map's input value type @param the map's output key type @param the map's output value type @param job the job @return the mapper class to run]]> the map input key type @param the map input value type @param the map output key type @param the map output value type @param job the job to modify @param cls the class to use as the mapper]]> It can be used instead of the default implementation, @link org.apache.hadoop.mapred.MapRunner, when the Map operation is not CPU bound in order to improve throughput.

    Mapper implementations using this MapRunnable must be thread-safe.

    The Map-Reduce job has to be configured with the mapper to use via {@link #setMapperClass(Configuration, Class)} and the number of threads the thread-pool can use with the {@link #getNumberOfThreads(Configuration)} method. The default value is 10 threads.

    ]]> true if the job output should be compressed, false otherwise]]> Tasks' Side-Effect Files

    Some applications need to create/write-to side-files, which differ from the actual job-outputs.

    In such cases there could be issues with 2 instances of the same TIP (running simultaneously e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per TIP.

    To get around this the Map-Reduce framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_temporary/_${taskid} (only) are promoted to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

    The application-writer can take advantage of this by creating any side-files required in a work directory during execution of the task, i.e. via {@link #getWorkOutputPath(TaskInputOutputContext)}, and the framework will move them out similarly; thus there is no need to pick unique paths per task-attempt.

    The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.

    @return the {@link Path} to the task's temporary output directory for the map-reduce job.]]>
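    A hedged sketch of creating such a side-file from within a task follows; the helper class and file name are made up, and it assumes the {@link #getWorkOutputPath(TaskInputOutputContext)} helper referenced above:

     import java.io.IOException;

     import org.apache.hadoop.fs.FSDataOutputStream;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapreduce.TaskInputOutputContext;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     // Hypothetical helper: create a side-file inside the task-attempt's work
     // directory; the framework promotes it with the regular outputs on commit.
     public class SideFiles {
       public static FSDataOutputStream createSideFile(
           TaskInputOutputContext<?, ?, ?, ?> context, String name)
           throws IOException, InterruptedException {
         Path workDir = FileOutputFormat.getWorkOutputPath(context);
         Path sideFile = new Path(workDir, name);
         FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
         return fs.create(sideFile);
       }
     }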
    The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.


    This method uses the {@link #getUniqueFile} method to make the file name unique for the task.

    @param context the context for the task. @param name the name for the file. @param extension the extension for the file. @return a unique path across all tasks of the job.]]>
    This tool supports archiving and analyzing (sort/grep) of log-files. It takes as input a) an input URI which will serve the URIs of the logs to be archived, b) an output directory (not mandatory), c) a directory on DFS to archive the logs, and d) the sort/grep patterns for analyzing the files and a separator for boundaries. Usage: Logalyzer -archive -archiveDir -analysis -logs -grep -sort -separator

    ]]>