Getting Started With Hadoop On Demand (HOD)
===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
*  Torque:
   (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
   Currently HOD supports Torque out of the box. We assume that you are
   familiar with configuring Torque. You can get information about this
   from the following link:
   http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
*  Python (http://www.python.org/)
   We require version 2.5.1 of Python.

The following components can optionally be installed for better
functionality from HOD:
*  Twisted Python: This can be used to improve the scalability of HOD.
   (http://twistedmatrix.com/trac/)
*  Hadoop: HOD can automatically distribute Hadoop to all nodes in the
   cluster. However, it can also use a pre-installed version of Hadoop,
   if it is available on all nodes in the cluster.
   (http://hadoop.apache.org/core)
   HOD currently supports Hadoop 0.15 and above.

NOTE: HOD configuration requires the location of installs of these
components to be the same on all nodes in the cluster. It will also
make the configuration simpler to have the same location on the submit
nodes.

2. Resource Manager Configuration Pre-requisites:
=================================================

For using HOD with Torque:
*  Install Torque components: pbs_server on a head node, pbs_moms on all
   compute nodes, and PBS client tools on all compute nodes and submit
   nodes.
*  Create a queue for submitting jobs on the pbs_server.
*  Specify a name for all nodes in the cluster, by setting a 'node
   property' on all the nodes.
   This can be done by using the 'qmgr' command. For example:
     qmgr -c "set node node properties=cluster-name"
*  Ensure that jobs can be submitted to the nodes. This can be done by
   using the 'qsub' command. For example:
     echo "sleep 30" | qsub -l nodes=3
*  More information about setting up Torque can be found in the
   documentation under:
   http://www.clusterresources.com/pages/products/torque-resource-manager.php
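The Torque setup from the steps above can be sanity-checked with the standard
PBS client tools; for example (assuming the client tools are on your PATH):

```
# List all compute nodes with their state; each node should show the
# 'properties' attribute carrying the cluster name set via qmgr.
pbsnodes -a

# Print the server configuration, including the queue created for HOD.
qmgr -c "print server"
```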

3. Setting up HOD:
==================

*  HOD is available under the 'contrib' section of Hadoop under the root
   directory 'hod'.
*  Distribute the files under this directory to all the nodes in the
   cluster. Note that the location where the files are copied should be
   the same on all the nodes.
*  On the node from where you want to run hod, edit the file hodrc,
   which can be found in the <install dir>/conf directory. This file
   contains the minimal set of values required for running hod.
*  Specify values suitable to your environment for the following
   variables defined in the configuration file. Note that some of these
   variables are defined at more than one place in the file.

   *  ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
      1.5.x.
   *  ${CLUSTER_NAME}: Name of the cluster, as specified in the
      'node property' mentioned in the resource manager configuration.
   *  ${HADOOP_HOME}: Location of the Hadoop installation on the compute
      and submit nodes.
   *  ${RM_QUEUE}: Queue configured for submitting jobs in the resource
      manager configuration.
   *  ${RM_HOME}: Location of the resource manager installation on the
      compute and submit nodes.
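As an illustration, the relevant parts of a filled-in hodrc might look like
the sketch below. The section and key names follow the sample configuration
shipped with HOD, but verify them against your own copy; all paths and the
queue name are placeholders:

```
[hod]
java-home            = /usr/java/jdk1.5.0

[resource_manager]
id                   = torque
batch-home           = /usr/local/torque
queue                = hodq

[gridservice-mapred]
pkgs                 = /usr/local/hadoop

[gridservice-hdfs]
pkgs                 = /usr/local/hadoop
```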

*  The following environment variables *may* need to be set depending on
   your environment. These variables must be defined where you run the
   HOD client, and also be specified in the HOD configuration file as the
   value of the key resource_manager.env-vars. Multiple variables can be
   specified as a comma-separated list of key=value pairs.

   *  HOD_PYTHON_HOME: If you install Python to a non-default location
      on the compute nodes or submit nodes, then this variable must be
      defined to point to the Python executable in the non-standard
      location.
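For example, if Python 2.5.1 lives under a non-default prefix on the compute
or submit nodes (the path below is purely a placeholder), the variable would
be set where the HOD client runs and also named in the configuration:

```shell
# On the node where the HOD client runs (placeholder path):
export HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python

# The same key=value pair must appear in hodrc as the value of
# resource_manager.env-vars, e.g.:
#   env-vars = HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python
```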


NOTE:

You can also review other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.


4. Running HOD:
===============

4.1 Overview:
-------------

A typical HOD session involves at least three steps: allocate,
run Hadoop jobs, deallocate.

4.1.1 Operation allocate
------------------------

The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate \
                                        cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must exist
before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.

For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
                                                        ~/hadoop-cluster 10"
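On success, the generated ~/hadoop-cluster/hadoop-site.xml points Hadoop
clients at the allocated daemons. A sketch of what it might contain follows;
the host names and ports are purely illustrative, since the actual values
depend on which nodes the resource manager hands out:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Address of the allocated cluster's JobTracker (illustrative) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>node042.example.com:54311</value>
  </property>
  <!-- Address of the allocated cluster's NameNode (illustrative) -->
  <property>
    <name>fs.default.name</name>
    <value>node041.example.com:54310</value>
  </property>
</configuration>
```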

HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"

4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------

Now, one can run Hadoop jobs using the allocated cluster in the usual manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount example on
the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar \
      /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output

4.1.3 Operation deallocate
--------------------------

The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the cluster:

  $ hod -o "deallocate ~/hadoop-cluster"

4.2 Command Line Options
------------------------

This section covers the major command line options available via the hod
command:

--help
Prints the help message listing the basic options.

--verbose-help
All configuration options provided in the hodrc file can be passed on the
command line, using the syntax --section_name.option_name[=value]. When
provided this way, the value provided on the command line overrides the option
provided in hodrc. The verbose-help command lists all the available options in
the hodrc file. This is also a convenient way to see the meaning of the
configuration options.

-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, alleviating the need to
specify the configuration file in each HOD command.

-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD. 4 is
most verbose.

-o "help"
Lists the operations available in the operation mode.

-o "allocate cluster_dir number_of_nodes"
Allocates a cluster on the given number of cluster nodes, and stores the
allocation information in cluster_dir for use with subsequent hadoop commands.
Note that the cluster_dir must exist before running the command.

-o "list"
Lists the clusters allocated by this user. Information provided includes the
Torque job id corresponding to the cluster, the cluster directory where the
allocation information is stored, and whether the Map/Reduce daemon is still
active or not.

-o "info cluster_dir"
Lists information about the cluster whose allocation information is stored in
the specified cluster directory.

-o "deallocate cluster_dir"
Deallocates the cluster whose allocation information is stored in the
specified cluster directory.

-t hadoop_tarball
Provisions Hadoop from the given tar.gz file. This option is only applicable
to the allocate operation. For better distribution performance it is
recommended that the Hadoop tarball contain only the libraries and binaries,
and not the source or documentation.

-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.

-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from where jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
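As an illustration, these option groups can be combined on a single allocate
command. The parameter names below are standard Hadoop configuration keys of
that era, but the values chosen are purely illustrative:

```
# Allocate 10 nodes, setting the number of reduce tasks for the
# provisioned Map/Reduce daemons (-M), the block size for the HDFS
# daemons (-H), and a client-side sort buffer size on the submit
# node (-C).
hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz \
    -Mmapred.reduce.tasks=10 \
    -Hdfs.block.size=134217728 \
    -Cio.sort.mb=100 \
    -o "allocate ~/hadoop-cluster 10"
```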
---|