Getting Started With Hadoop On Demand (HOD)
===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
* Torque:
  (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
  Currently HOD supports Torque out of the box. We assume that you are
  familiar with configuring Torque. You can get information about this
  from the following link:
  http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
  HOD requires version 2.5.1 of Python.

The following components can optionally be installed for better
functionality from HOD:
* Twisted Python: This can be used to improve the scalability of HOD.
  (http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the
  cluster. However, it can also use a pre-installed version of Hadoop,
  if it is available on all nodes in the cluster.
  (http://hadoop.apache.org/core)
  HOD currently supports Hadoop 0.15 and above.

NOTE: HOD configuration requires these components to be installed at the
same location on all nodes in the cluster. Configuration is also simpler
if the same location is used on the submit nodes.

2. Resource Manager Configuration Pre-requisites:
=================================================

To use HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
  compute nodes, and PBS client tools on all compute nodes and submit
  nodes.
* Create a queue for submitting jobs on the pbs_server.
* Assign a name to all nodes in the cluster by setting a 'node
  property' on each of them.
  This can be done by using the 'qmgr' command. For example:
  qmgr -c "set node node properties=cluster-name"
* Ensure that jobs can be submitted to the nodes. This can be done by
  using the 'qsub' command. For example:
  echo "sleep 30" | qsub -l nodes=3
* More information about setting up Torque can be found in the
  documentation at:
  http://www.clusterresources.com/pages/products/torque-resource-manager.php

3. Setting up HOD:
==================

* HOD is available under the 'contrib' section of Hadoop under the root
  directory 'hod'.
* Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied should be
  the same on all the nodes.
* On the node from which you want to run HOD, edit the file hodrc,
  which can be found in the <install dir>/conf directory. This file
  contains the minimal set of values required for running HOD.
* Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined in more than one place in the file.

  * ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.5.x.
  * ${CLUSTER_NAME}: Name of the cluster, as specified in the
    'node property' mentioned in the resource manager configuration.
  * ${HADOOP_HOME}: Location of the Hadoop installation on the compute
    and submit nodes.
  * ${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.
  * ${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.

* The following environment variables *may* need to be set depending on
  your environment. These variables must be defined where you run the
  HOD client, and also be specified in the HOD configuration file as the
  value of the key resource_manager.env-vars. Multiple variables can be
  specified as a comma-separated list of key=value pairs.

  * HOD_PYTHON_HOME: If you install Python to a non-default location
    on the compute nodes or submit nodes, then this variable must be
    defined to point to the python executable in the non-standard
    location.
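
The comma-separated list format for resource_manager.env-vars described
above can be sketched with two small helpers. This is an illustration in
Python; the function names are our own and not part of HOD:

```python
# Build and parse the comma-separated key=value list used as the value
# of the resource_manager.env-vars configuration key. Illustrative only.

def build_env_vars(env):
    # {'HOD_PYTHON_HOME': '/opt/python-2.5.1/bin/python'} ->
    # 'HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python'
    return ",".join("%s=%s" % (k, v) for k, v in sorted(env.items()))

def parse_env_vars(value):
    # Inverse of build_env_vars: 'k1=v1,k2=v2' -> {'k1': 'v1', 'k2': 'v2'}
    pairs = (item.split("=", 1) for item in value.split(",") if item)
    return dict(pairs)

print(build_env_vars({"HOD_PYTHON_HOME": "/opt/python-2.5.1/bin/python"}))
# HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python
```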


NOTE:

You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.


4. Running HOD:
===============

4.1 Overview:
-------------

A typical HOD session involves at least three steps: allocate, run
Hadoop jobs, deallocate.

4.1.1 Operation allocate
------------------------

The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate \
                                                cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must
exist before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.

For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
                                                        ~/hadoop-cluster 10"

HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"

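The lookup order described above (an explicit -c file wins, otherwise
$HOD_CONF_DIR/hodrc) can be sketched as follows. This is our own
illustration, not HOD's actual code:

```python
import os

def resolve_hodrc(cli_config=None, environ=os.environ):
    # An explicit -c argument takes precedence.
    if cli_config:
        return cli_config
    # Otherwise fall back to $HOD_CONF_DIR/hodrc, as described above.
    conf_dir = environ.get("HOD_CONF_DIR")
    if conf_dir:
        return os.path.join(conf_dir, "hodrc")
    raise ValueError("no -c option given and HOD_CONF_DIR is not set")

print(resolve_hodrc("~/hod-config/hodrc"))
# ~/hod-config/hodrc
print(resolve_hodrc(None, {"HOD_CONF_DIR": "/home/user/hod-config"}))
# /home/user/hod-config/hodrc
```
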
4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------

Now one can run Hadoop jobs using the allocated cluster in the usual manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount example on
the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar \
       /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output

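The cluster_dir/hadoop-site.xml consumed by --config above is an ordinary
Hadoop configuration file, so the allocated JobTracker and NameNode
addresses can be read out of it. A minimal sketch, using a hand-written
sample document (real files contain many more properties, and the host
names and ports below are invented):

```python
import xml.etree.ElementTree as ET

# Hand-written sample resembling a generated hadoop-site.xml.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>node1:50000</value></property>
  <property><name>fs.default.name</name><value>hdfs://node2:50001</value></property>
</configuration>"""

def site_properties(xml_text):
    # Collect every <property> as a name -> value mapping.
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

props = site_properties(SAMPLE)
print(props["mapred.job.tracker"])   # node1:50000  (JobTracker)
print(props["fs.default.name"])      # hdfs://node2:50001  (NameNode)
```
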
4.1.3 Operation deallocate
--------------------------

The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the cluster:

  $ hod -o "deallocate ~/hadoop-cluster"

4.2 Command Line Options
------------------------

This section covers the major command line options available via the hod
command:

--help
Prints the help message describing the basic options.

--verbose-help
All configuration options provided in the hodrc file can be passed on the
command line, using the syntax --section_name.option_name[=value]. When
provided this way, the value given on the command line overrides the option
in hodrc. The verbose-help command lists all the available options in the
hodrc file. This is also a convenient way to see the meaning of the
configuration options.
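
The override rule can be sketched as follows. The parsing and merging
below are a simplified illustration of our own, and mapping a value-less
option to "true" is an assumption, not documented HOD behaviour:

```python
def parse_overrides(argv):
    # Collect --section_name.option_name[=value] arguments. A missing
    # value is mapped to "true" here (an assumption for illustration).
    overrides = {}
    for arg in argv:
        if arg.startswith("--") and "." in arg:
            key, _, value = arg[2:].partition("=")
            section, _, option = key.partition(".")
            overrides[(section, option)] = value or "true"
    return overrides

# Command-line values win over those read from hodrc.
from_hodrc = {("hod", "debug"): "1"}
merged = dict(from_hodrc)
merged.update(parse_overrides(["--hod.debug=4"]))
print(merged[("hod", "debug")])  # 4
```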

-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, alleviating the need to
specify the configuration file in each HOD command.

-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD.
4 is the most verbose.

-o "help"
Lists the operations available in the operation mode.

-o "allocate cluster_dir number_of_nodes"
Allocates a cluster on the given number of cluster nodes, and stores the
allocation information in cluster_dir for use with subsequent hadoop commands.
Note that the cluster_dir must exist before running the command.

-o "list"
Lists the clusters allocated by this user. Information provided includes the
Torque job id corresponding to the cluster, the cluster directory where the
allocation information is stored, and whether the Map/Reduce daemon is still
active or not.

-o "info cluster_dir"
Lists information about the cluster whose allocation information is stored in
the specified cluster directory.

-o "deallocate cluster_dir"
Deallocates the cluster whose allocation information is stored in the
specified cluster directory.

-t hadoop_tarball
Provisions Hadoop from the given tar.gz file. This option is only applicable
to the allocate operation. For better distribution performance it is
recommended that the Hadoop tarball contain only the libraries and binaries,
and not the source or documentation.

-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.

-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from where jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
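
How such key=value pairs might be rendered into hadoop-site.xml
<property> entries can be sketched as follows. The exact formatting of
HOD's generated file may differ; the property names below are ordinary
Hadoop parameters chosen for illustration:

```python
def to_hadoop_site(pairs):
    # Render key=value pairs as hadoop-site.xml <property> entries.
    props = "\n".join(
        "  <property><name>%s</name><value>%s</value></property>" % (k, v)
        for k, v in sorted(pairs.items()))
    return "<configuration>\n%s\n</configuration>" % props

# e.g. pairs collected from -Mmapred.reduce.tasks=10 -Mio.sort.mb=100
print(to_hadoop_site({"mapred.reduce.tasks": "10", "io.sort.mb": "100"}))
```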