<?xml version="1.0"?>

<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
          "http://forrest.apache.org/dtd/document-v20.dtd">


<document>

<header>
<title>
HOD Administrator Guide
</title>
</header>

<body>
<section>
<title>Overview</title>
<p>Hadoop On Demand (HOD) is a system for provisioning and
managing independent Hadoop Map/Reduce and Hadoop Distributed File System (HDFS)
instances on a shared cluster
of nodes. HOD is a tool that makes it easy for administrators and users to
quickly set up and use Hadoop. HOD is also a very useful tool for Hadoop developers
and testers who need to share a physical cluster for testing their own Hadoop
versions.
</p>

<p>HOD relies on a resource manager (RM) to allocate the nodes on which it
runs Hadoop instances. At present it works with the <a href="ext:hod/torque">Torque
resource manager</a>.
</p>

<p>
The basic system architecture of HOD includes these components:</p>
<ul>
  <li>A resource manager (possibly together with a scheduler)</li>
  <li>Various HOD components</li>
  <li>Hadoop Map/Reduce and HDFS daemons</li>
</ul>

<p>
HOD provisions and maintains Hadoop Map/Reduce and, optionally, HDFS instances
through interaction with the above components on a given cluster of nodes. A cluster of
nodes can be thought of as comprising two sets of nodes:</p>
<ul>
  <li>Submit nodes: Users use the HOD client on these nodes to allocate clusters, and then
  use the Hadoop client to submit Hadoop jobs.</li>
  <li>Compute nodes: Using the resource manager, HOD components run on these nodes to
  provision the Hadoop daemons. Hadoop jobs then run on them.</li>
</ul>

<p>
Here is a brief description of the sequence of operations in allocating a cluster and
running jobs on it.
</p>

<ul>
  <li>The user uses the HOD client on the submit node to allocate a desired number of
  cluster nodes and to provision Hadoop on them.</li>
  <li>The HOD client uses a resource manager interface (qsub, in Torque) to submit a HOD
  process, called the RingMaster, as a resource manager job, requesting the user's desired number
  of nodes. This job is submitted to the central server of the resource manager (pbs_server, in Torque).</li>
  <li>On the compute nodes, the resource manager slave daemons (pbs_moms, in Torque) accept
  and run jobs assigned to them by the central server (pbs_server, in Torque). The RingMaster
  process is started on one of the compute nodes (the mother superior, in Torque).</li>
  <li>The RingMaster then uses another resource manager interface (pbsdsh, in Torque) to run
  the second HOD component, HodRing, as distributed tasks on each of the allocated compute
  nodes.</li>
  <li>The HodRings, after initializing, communicate with the RingMaster to get Hadoop commands,
  and run them accordingly. Once the Hadoop commands are started, they register with the RingMaster,
  supplying information about the daemons.</li>
  <li>All the configuration files needed by the Hadoop instances are generated by HOD itself,
  some of them from options given by the user in the HOD configuration file.</li>
  <li>The HOD client keeps communicating with the RingMaster to find out the location of the
  JobTracker and HDFS daemons.</li>
</ul>

<p>This guide shows you how to get started using HOD, reviews various HOD features and command line options, and provides detailed troubleshooting help.</p>

</section>

<section>
<title>Pre-requisites</title>
<p>To use HOD, your system should include the following hardware and software
components.</p>
<p>Operating System: HOD is currently tested on RHEL4.<br/>
Nodes: HOD requires a minimum of three nodes configured through a resource manager.<br/></p>

<p>Software</p>
<p>The following components must be installed on ALL nodes before using HOD:</p>
<ul>
  <li><a href="ext:hod/torque">Torque: Resource manager</a></li>
  <li><a href="ext:hod/python">Python</a>: HOD requires version 2.5.1 of Python.</li>
</ul>

<p>The following components are optional and can be installed to obtain better
functionality from HOD:</p>
<ul>
  <li><a href="ext:hod/twisted-python">Twisted Python</a>: This can be
  used to improve the scalability of HOD. If this module is detected to be
  installed, HOD uses it; otherwise it falls back to default modules.</li>
  <li><a href="ext:site">Hadoop</a>: HOD can automatically
  distribute Hadoop to all nodes in the cluster. However, it can also use a
  pre-installed version of Hadoop, if it is available on all nodes in the cluster.
  HOD currently supports Hadoop 0.15 and above.</li>
</ul>

<p>NOTE: HOD requires these components to be installed at the same
location on all nodes in the cluster. Configuration is also simpler if
they are installed at the same location on the submit nodes.
</p>
</section>

<section>
<title>Resource Manager</title>
<p>Currently HOD works with the Torque resource manager, which it uses for node
allocation and job submission. Torque is an open source resource manager from
<a href="ext:hod/cluster-resources">Cluster Resources</a>, a community effort
based on the PBS project. It provides control over batch jobs and distributed compute nodes. Torque is
freely available for download from <a href="ext:hod/torque-download">here</a>.
</p>

<p>All documentation related to Torque can be found under
the section TORQUE Resource Manager <a
href="ext:hod/torque-docs">here</a>. You can
get wiki documentation from <a
href="ext:hod/torque-wiki">here</a>.
Users may wish to subscribe to TORQUE's mailing list, or view the archive of questions and
comments, <a
href="ext:hod/torque-mailing-list">here</a>.
</p>

<p>To use HOD with Torque:</p>
<ul>
  <li>Install Torque components: pbs_server on one node (the head node), pbs_mom on all
  compute nodes, and PBS client tools on all compute nodes and submit
  nodes. Perform at least a basic configuration so that the Torque system is up and
  running, that is, pbs_server knows which machines to talk to. Look <a
  href="ext:hod/torque-basic-config">here</a>
  for basic configuration.

  For advanced configuration, see <a
  href="ext:hod/torque-advanced-config">here</a>.</li>
  <li>Create a queue for submitting jobs on the pbs_server. The name of the queue is the
  same as the HOD configuration parameter resource-manager.queue. The HOD client uses this queue to
  submit the RingMaster process as a Torque job.</li>
  <li>Specify a cluster name as a property for all nodes in the cluster.
  This can be done by using the qmgr command. For example:
  <code>qmgr -c "set node node properties=cluster-name"</code>. The name of the cluster is the same as
  the HOD configuration parameter hod.cluster.</li>
  <li>Make sure that jobs can be submitted to the nodes. This can be done by
  using the qsub command. For example:
  <code>echo "sleep 30" | qsub -l nodes=3</code></li>
</ul>

</section>

<section>
<title>Installing HOD</title>

<p>Once the resource manager is set up, you can obtain and
install HOD.</p>
<ul>
  <li>If you are getting HOD from the Hadoop tarball, it is available under the
  'contrib' section of Hadoop, under the root directory 'hod'.</li>
  <li>If you are building from source, you can run 'ant tar' from the Hadoop root
  directory to generate the Hadoop tarball, and then get HOD from there,
  as described above.</li>
  <li>Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied should be
  the same on all the nodes.</li>
  <li>Note that building Hadoop sets the appropriate permissions on all the
  required HOD script files.</li>
</ul>
</section>

<section>
<title>Configuring HOD</title>

<p>You can configure HOD once it is installed. The minimal configuration needed
to run HOD is described below. More advanced configuration options are discussed
in the HOD Configuration Guide.</p>
<section>
<title>Minimal Configuration</title>
<p>To get started using HOD, the following minimal configuration is
required:</p>
<ul>
  <li>On the node from where you want to run HOD, edit the file hodrc
  located in the &lt;install dir&gt;/conf directory. This file
  contains the minimal set of values required to run HOD.</li>
  <li>
  <p>Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined at more than one place in the file.</p>

  <ul>
    <li>${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.6.x and above.</li>
    <li>${CLUSTER_NAME}: Name of the cluster, as specified in the
    'node property' mentioned in the resource manager configuration.</li>
    <li>${HADOOP_HOME}: Location of the Hadoop installation on the compute and
    submit nodes.</li>
    <li>${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.</li>
    <li>${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.</li>
  </ul>
  </li>

  <li>
  <p>The following environment variables may need to be set depending on
  your environment. These variables must be defined where you run the
  HOD client, and must also be specified in the HOD configuration file as the
  value of the key resource_manager.env-vars. Multiple variables can be
  specified as a comma-separated list of key=value pairs.</p>

  <ul>
    <li>HOD_PYTHON_HOME: If you install Python to a non-default location
    on the compute nodes or submit nodes, then this variable must be
    defined to point to the Python executable in the non-standard
    location.</li>
  </ul>
  </li>
</ul>
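<p>To make the edits above concrete, here is a sketch of the relevant fragments of a hodrc
with the variables substituted by example values. Section and option names follow the sample
hodrc shipped with HOD, but exact options and defaults vary between HOD versions, and all
paths shown are placeholders for your environment.</p>

```ini
[hod]
java-home        = /usr/java/jdk1.6.0
cluster          = cluster-name

[resource_manager]
id               = torque
queue            = hod-queue
batch-home       = /usr/local/torque
# Only needed if Python lives in a non-default location on the nodes:
# env-vars       = HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python

[gridservice-mapred]
pkgs             = /usr/local/hadoop

[gridservice-hdfs]
pkgs             = /usr/local/hadoop
```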
</section>

<section>
<title>Advanced Configuration</title>
<p>You can review and modify other configuration options to suit
your specific needs. Refer to the <a href="hod_config_guide.html">HOD Configuration
Guide</a> for more information.</p>
</section>
</section>

<section>
<title>Running HOD</title>
<p>You can run HOD once it is configured. Refer to the <a
href="hod_user_guide.html">HOD User Guide</a> for more information.</p>
</section>

<section>
<title>Supporting Tools and Utilities</title>
<p>This section describes supporting tools and utilities that can be used to
manage HOD deployments.</p>

<section>
<title>logcondense.py - Manage Log Files</title>
<p>As mentioned in the
<a href="hod_user_guide.html#Collecting+and+Viewing+Hadoop+Logs">HOD User Guide</a>,
HOD can be configured to upload
Hadoop logs to a statically configured HDFS. Over time, the number of logs uploaded
to HDFS could increase. logcondense.py is a tool that helps
administrators remove log files uploaded to HDFS.</p>
<section>
<title>Running logcondense.py</title>
<p>logcondense.py is available under the hod_install_location/support folder. You can either
run it using Python, for example, <em>python logcondense.py</em>, or give execute permissions
to the file and run it directly as <em>logcondense.py</em>. If permissions are enabled,
logcondense.py needs to be run by a user who has sufficient permissions to remove files
from the locations in HDFS where log files are uploaded. For example, as mentioned in the
<a href="hod_config_guide.html#3.7+hodring+options">HOD Configuration Guide</a>, the logs could
be configured to go under the user's home directory in HDFS. In that case, the user
running logcondense.py should have superuser privileges to remove the files from under
all user home directories.</p>
</section>
<section>
<title>Command Line Options for logcondense.py</title>
<p>The following command line options are supported for logcondense.py.</p>
<table>
<tr>
<td>Short Option</td>
<td>Long Option</td>
<td>Meaning</td>
<td>Example</td>
</tr>
<tr>
<td>-p</td>
<td>--package</td>
<td>Complete path to the hadoop script. The version of Hadoop must be the same as the
one running HDFS.</td>
<td>/usr/bin/hadoop</td>
</tr>
<tr>
<td>-d</td>
<td>--days</td>
<td>Delete log files older than the specified number of days</td>
<td>7</td>
</tr>
<tr>
<td>-c</td>
<td>--config</td>
<td>Path to the Hadoop configuration directory, under which hadoop-site.xml resides.
The hadoop-site.xml must point to the HDFS NameNode from which logs are to be removed.</td>
<td>/home/foo/hadoop/conf</td>
</tr>
<tr>
<td>-l</td>
<td>--logs</td>
<td>An HDFS path; this must be the same HDFS path as specified for the log-destination-uri,
as mentioned in the <a href="hod_config_guide.html#3.7+hodring+options">HOD Configuration Guide</a>,
without the hdfs:// URI string</td>
<td>/user</td>
</tr>
<tr>
<td>-n</td>
<td>--dynamicdfs</td>
<td>If true, logcondense.py deletes HDFS logs
in addition to Map/Reduce logs. Otherwise, it deletes only Map/Reduce logs, which is also the
default if this option is not specified. This option is useful when
dynamic HDFS installations
are being provisioned by HOD, and the static HDFS installation is used only to collect
logs - a scenario that may be common in test clusters.</td>
<td>false</td>
</tr>
</table>
<p>So, for example, to delete all log files older than 7 days, using a hadoop-site.xml stored in
~/hadoop-conf and the Hadoop installation under ~/hadoop-0.17.0, you could run:</p>
<p><em>python logcondense.py -p ~/hadoop-0.17.0/bin/hadoop -d 7 -c ~/hadoop-conf -l /user</em></p>
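<p>The age check that the -d option drives can be illustrated with a short Python sketch.
This is not the actual logcondense.py implementation, which lists and deletes files through
the hadoop client rather than the local filesystem; the function name and the
(path, mtime) listing format here are our own.</p>

```python
import time

def find_old_logs(listing, days, now=None):
    """Return the paths from `listing` (an iterable of (path, mtime)
    pairs, with mtime in seconds since the epoch) that are older than
    `days` days and are therefore candidates for deletion."""
    if now is None:
        now = time.time()
    cutoff = now - days * 24 * 60 * 60
    return [path for path, mtime in listing if mtime < cutoff]
```

<p>For example, with now=100000 and days=1 the cutoff is 13600, so an entry with
mtime 0 is selected for deletion while one with mtime 90000 is kept.</p>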
</section>
</section>
<section>
<title>checklimits.sh - Monitor Resource Limits</title>
<p>checklimits.sh is a HOD tool specific to the Torque/Maui environment
(<a href="ext:hod/maui">Maui Cluster Scheduler</a> is an open source job
scheduler for clusters and supercomputers, from Cluster Resources). The
checklimits.sh script
updates the Torque comment field when newly submitted jobs exceed
the user limits set up in the Maui scheduler. Using qstat, it does one pass
over the Torque job list to determine queued or unfinished jobs, runs the Maui
tool checkjob on each job to see if user limits are violated, and then
runs Torque's qalter utility to update the job attribute 'comment'. Currently
it sets the comment to <em>User-limits exceeded. Requested:([0-9]*)
Used:([0-9]*) MaxLimit:([0-9]*)</em> for those jobs that violate limits.
HOD then uses this comment field to behave appropriately depending on
the type of violation.</p>
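<p>Since HOD keys its behaviour off this comment, the format is effectively a small
contract. A minimal sketch of parsing it, using the exact pattern quoted above
(the helper name is ours, not part of HOD):</p>

```python
import re

# The comment checklimits.sh writes into jobs that exceed user limits.
LIMIT_COMMENT = re.compile(
    r"User-limits exceeded\. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)")

def parse_limit_comment(comment):
    """Return (requested, used, max_limit) as integers if the comment
    marks a limit violation, or None for any other comment."""
    match = LIMIT_COMMENT.search(comment)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```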
<section>
<title>Running checklimits.sh</title>
<p>checklimits.sh is available under the hod_install_location/support
folder. This shell script can be run directly as <em>sh
checklimits.sh</em> or as <em>./checklimits.sh</em> after enabling
execute permissions. The Torque and Maui binaries should be available
on the machine where the tool is run, and should be in the path
of the shell script process. To update the
comment field of jobs from different users, this tool must be run with
Torque administrative privileges. The tool must be run repeatedly at
regular intervals, for example via cron, so that the comments of jobs
violating constraints stay up to date. Please note that the resource manager
and scheduler commands used in this script can be expensive, so
it is better not to run it inside a tight loop without sleeping.</p>
</section>
</section>

<section>
<title>verify-account - Script to verify an account under which
jobs are submitted</title>
<p>Production systems use accounting packages to charge users for using
shared compute resources. HOD supports a parameter,
<em>resource_manager.pbs-account</em>, to allow users to identify the
account under which they would like to submit jobs. It may be necessary
to verify that this account is a valid one configured in an accounting
system. The <em>hod-install-dir/bin/verify-account</em> script
provides a mechanism to plug in a custom script that can do this
verification.</p>

<section>
<title>Integrating the verify-account script with HOD</title>
<p>HOD runs the <em>verify-account</em> script, passing in the
<em>resource_manager.pbs-account</em> value as an argument, before
allocating a cluster. Sites can write a script that verifies this
account against their accounting systems. Returning a non-zero exit
code from this script causes HOD to fail the allocation. Also, in
case of an error, HOD prints the output of the script to the user, so
any descriptive error message can be passed to the user from the
script in this manner.</p>
<p>The default script that comes with the HOD installation does not
do any validation, and returns a zero exit code.</p>
<p>If the verify-account script is not found, HOD treats
verification as disabled and continues with the allocation as is.</p>
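<p>As an illustration of the contract described above, here is a sketch of a
site-specific verify-account replacement in Python. The account names are invented
placeholders; a real script would query the site's accounting package instead.</p>

```python
# Placeholder accounts; a real script would consult the accounting system.
VALID_ACCOUNTS = {"research", "production"}

def verify(account):
    """Return 0 if the account is valid, non-zero otherwise.

    HOD invokes the script with the resource_manager.pbs-account value
    as its argument; a non-zero exit code fails the allocation, and
    anything printed here is shown to the user."""
    if account in VALID_ACCOUNTS:
        return 0
    print("Account '%s' is not known to the accounting system" % account)
    return 1
```

<p>A complete script would read the account from sys.argv[1] and call
sys.exit(verify(account)) so that the exit code reaches HOD.</p>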
</section>
</section>

</section>

</body>
</document>