Core | |
---|---|
org.apache.hadoop | |
org.apache.hadoop.conf | Configuration of system parameters. |
org.apache.hadoop.filecache | |
org.apache.hadoop.fs | An abstract file system API. |
org.apache.hadoop.fs.ftp | |
org.apache.hadoop.fs.kfs | A client for the Kosmos filesystem (KFS) |
org.apache.hadoop.fs.permission | |
org.apache.hadoop.fs.s3 | A distributed, block-based implementation of FileSystem that uses Amazon S3 as a backing store. |
org.apache.hadoop.fs.s3native | A distributed implementation of FileSystem for reading and writing files on Amazon S3. |
org.apache.hadoop.fs.shell | |
org.apache.hadoop.http | |
org.apache.hadoop.io | Generic I/O code for use when reading and writing data to the network, to databases, and to files. |
org.apache.hadoop.io.compress | |
org.apache.hadoop.io.compress.bzip2 | |
org.apache.hadoop.io.compress.zlib | |
org.apache.hadoop.io.file.tfile | |
org.apache.hadoop.io.retry | A mechanism for selectively retrying methods that throw exceptions under certain circumstances. |
org.apache.hadoop.io.serializer | This package provides a mechanism for using different serialization frameworks in Hadoop. |
org.apache.hadoop.ipc | Tools to help define network clients and servers. |
org.apache.hadoop.ipc.metrics | |
org.apache.hadoop.log | |
org.apache.hadoop.mapred | A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner. |
org.apache.hadoop.mapred.jobcontrol | Utilities for managing dependent jobs. |
org.apache.hadoop.mapred.join | Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. |
org.apache.hadoop.mapred.lib | Library of generally useful mappers, reducers, and partitioners. |
org.apache.hadoop.mapred.lib.aggregate | Classes for performing various counting and aggregations. |
org.apache.hadoop.mapred.lib.db | org.apache.hadoop.mapred.lib.db Package |
org.apache.hadoop.mapred.pipes | Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. |
org.apache.hadoop.mapred.tools | |
org.apache.hadoop.mapreduce | |
org.apache.hadoop.mapreduce.lib.input | |
org.apache.hadoop.mapreduce.lib.map | |
org.apache.hadoop.mapreduce.lib.output | |
org.apache.hadoop.mapreduce.lib.partition | |
org.apache.hadoop.mapreduce.lib.reduce | |
org.apache.hadoop.metrics | This package defines an API for reporting performance metric information. |
org.apache.hadoop.metrics.file | Implementation of the metrics package that writes the metrics to a file. |
org.apache.hadoop.metrics.ganglia | Implementation of the metrics package that sends metric data to Ganglia. |
org.apache.hadoop.metrics.jvm | |
org.apache.hadoop.metrics.spi | The Service Provider Interface for the Metrics API. |
org.apache.hadoop.metrics.util | |
org.apache.hadoop.net | Network-related classes. |
org.apache.hadoop.record | Hadoop record I/O contains classes and a record description language translator for simplifying serialization and deserialization of records in a language-neutral manner. |
org.apache.hadoop.record.compiler | This package contains classes needed for code generation from the hadoop record compiler. |
org.apache.hadoop.record.compiler.ant | |
org.apache.hadoop.record.compiler.generated | This package contains code generated by JavaCC from the Hadoop record syntax file rcc.jj. |
org.apache.hadoop.record.meta | |
org.apache.hadoop.security | |
org.apache.hadoop.security.authorize | |
org.apache.hadoop.util | Common utilities. |
org.apache.hadoop.util.bloom | |
org.apache.hadoop.util.hash | |
Examples | |
---|---|
org.apache.hadoop.examples | Hadoop example code. |
org.apache.hadoop.examples.dancing | This package is a distributed implementation of Knuth's dancing links algorithm that can run under Hadoop. |
org.apache.hadoop.examples.terasort | This package consists of 3 map/reduce applications for Hadoop to compete in the annual terabyte sort competition. |
contrib: Streaming | |
---|---|
org.apache.hadoop.streaming | Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables (e.g. |
contrib: DataJoin | |
---|---|
org.apache.hadoop.contrib.utils.join | |
contrib: Index | |
---|---|
org.apache.hadoop.contrib.index.example | |
org.apache.hadoop.contrib.index.lucene | |
org.apache.hadoop.contrib.index.main | |
org.apache.hadoop.contrib.index.mapred | |
contrib: FailMon | |
---|---|
org.apache.hadoop.contrib.failmon | |
Hadoop is a distributed computing platform.
Hadoop primarily consists of the Hadoop Distributed FileSystem (HDFS) and an implementation of the Map-Reduce programming paradigm.
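As a rough illustration of the Map-Reduce paradigm mentioned here, the following plain-Java sketch counts words in memory: a map step emits (word, 1) pairs, the pairs are grouped by key, and a reduce step sums each group. It does not use the Hadoop API; the class and method names (MapReduceSketch, map, reduce, run) are hypothetical and chosen only for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the paradigm; not Hadoop code.
public class MapReduceSketch {

    // Map step: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce step: combine all values emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real cluster the map and reduce steps run on different nodes and the grouping happens over the network; here everything runs in one process so that the data flow is easy to follow.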
Hadoop is a software framework that lets one easily write and run applications that process vast amounts of data. Here's what makes Hadoop especially useful:
If your platform does not have the required software listed above, you will have to install it.
For example, on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
On Windows, if you did not install the required software when you installed cygwin, start the cygwin installer and select the packages:
First, you need to get a copy of the Hadoop code.
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME.
Try the following command:
bin/hadoop
This will display the documentation for the Hadoop command script.
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
This will display counts for each match of the regular expression.
Note that input is specified as a directory containing input files and that output is also specified as a directory where parts are written.
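To make the result of the grep example concrete, here is a minimal single-process sketch in plain Java that reads every regular file in an input directory and counts how often each distinct match of a regular expression occurs. It is not the code of the Hadoop example; the class name LocalGrepSketch and its method are hypothetical, and the results are ordered alphabetically by matched text simply because a TreeMap is used.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical local stand-in for the grep example; not Hadoop code.
public class LocalGrepSketch {

    // Count occurrences of each distinct regex match across all regular files in inputDir.
    static Map<String, Integer> countMatches(Path inputDir, String regex) throws IOException {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories
                }
                // Assumes the files are UTF-8 text.
                for (String line : Files.readAllLines(file)) {
                    Matcher m = pattern.matcher(line);
                    while (m.find()) {
                        counts.merge(m.group(), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the directory name and pattern used in this section.
        Path inputDir = Paths.get(args.length > 0 ? args[0] : "input");
        String regex = args.length > 1 ? args[1] : "dfs[a-z.]+";
        countMatches(inputDir, regex)
            .forEach((match, n) -> System.out.println(n + "\t" + match));
    }
}
```

Like the Hadoop job, the sketch takes a directory rather than a single file as input; unlike the job, it prints its result instead of writing part files to an output directory.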
JobTracker (MapReduce master) host and port. This is specified with the configuration property mapred.job.tracker.
(We also set the HDFS replication level to 1 in order to reduce warnings when running on a single node.)
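The sketch below shows one way a host and port pair such as the value of mapred.job.tracker could be split and validated in plain Java. The value localhost:9001 is only an assumed example, the class name JobTrackerAddressSketch is hypothetical, and this is not how Hadoop itself parses the property.

```java
import java.net.InetSocketAddress;

// Hypothetical helper for illustration; not part of Hadoop.
public class JobTrackerAddressSketch {

    // Split "host:port" at the last colon and validate both parts.
    static InetSocketAddress parse(String value) {
        int colon = value.lastIndexOf(':');
        if (colon <= 0 || colon == value.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + value);
        }
        String host = value.substring(0, colon);
        int port = Integer.parseInt(value.substring(colon + 1));
        // createUnresolved avoids a DNS lookup and rejects ports outside 0..65535.
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        // "localhost:9001" is an assumed example value.
        InetSocketAddress addr = parse(args.length > 0 ? args[0] : "localhost:9001");
        System.out.println(addr.getHostString() + " " + addr.getPort());
    }
}
```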
Now check that the command ssh localhost does not require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
The Hadoop daemons are started with the following command:
bin/start-all.sh
Daemon log output is written to the logs/ directory.
Input files are copied into the distributed filesystem as follows:
bin/hadoop fs -put input input
Jobs are run as before, but the output must be copied to the local filesystem before it can be examined:
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
When you're done, stop the daemons with:
bin/stop-all.sh
Fully distributed operation is just like the pseudo-distributed operation described above, except that you must specify:
Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
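As an illustration of the one-entry-per-line format, this plain-Java sketch reads a slaves file into a list of hostnames or IP addresses. Skipping blank lines and treating text after a # as a comment are assumptions made for this example, and the class name SlavesFileSketch is hypothetical; Hadoop's own scripts read the file themselves.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a one-host-per-line file; not part of Hadoop.
public class SlavesFileSketch {

    // Return the non-empty entries of the file, one per line, in file order.
    static List<String> readSlaves(Path file) throws IOException {
        List<String> hosts = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            // Assumption: anything after '#' is a comment.
            int hash = line.indexOf('#');
            String host = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "conf/slaves");
        for (String host : readSlaves(file)) {
            System.out.println(host);
        }
    }
}
```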