<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.8">
<meta name="Forrest-skin-name" content="pelt">
<title>
      HOD User Guide
    </title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<!--+
    |breadtrail
    +-->
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/core/">Core</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<!--+
    |header
    +-->
<div class="header">
<!--+
    |start group logo
    +-->
<div class="grouplogo">
<a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
</div>
<!--+
    |end group logo
    +-->
<!--+
    |start Project Logo
    +-->
<div class="projectlogo">
<a href="http://hadoop.apache.org/core/"><img class="logoImage" alt="Hadoop" src="images/core-logo.gif" title="Scalable Computing Platform"></a>
</div>
<!--+
    |end Project Logo
    +-->
<!--+
    |start Search
    +-->
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="hadoop.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp; 
                    <input name="Search" value="Search" type="submit">
</form>
</div>
<!--+
    |end search
    +-->
<!--+
    |start Tabs
    +-->
<ul id="tabs">
<li>
<a class="unselected" href="http://hadoop.apache.org/core/">Project</a>
</li>
<li>
<a class="unselected" href="http://wiki.apache.org/hadoop">Wiki</a>
</li>
<li class="current">
<a class="selected" href="index.html">Hadoop 0.20 Documentation</a>
</li>
</ul>
<!--+
    |end Tabs
    +-->
</div>
</div>
<div id="main">
<div id="publishedStrip">
<!--+
    |start Subtabs
    +-->
<div id="level2tabs"></div>
<!--+
    |end Endtabs
    +-->
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
//  --></script>
</div>
<!--+
    |breadtrail
    +-->
<div class="breadtrail">

             &nbsp;
           </div>
<!--+
    |start Menu, mainarea
    +-->
<!--+
    |start Menu
    +-->
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Getting Started</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">Overview</a>
</div>
<div class="menuitem">
<a href="quickstart.html">Quick Start</a>
</div>
<div class="menuitem">
<a href="cluster_setup.html">Cluster Setup</a>
</div>
<div class="menuitem">
<a href="mapred_tutorial.html">Map/Reduce Tutorial</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Programming Guides</div>
<div id="menu_1.2" class="menuitemgroup">
<div class="menuitem">
<a href="commands_manual.html">Commands</a>
</div>
<div class="menuitem">
<a href="distcp.html">DistCp</a>
</div>
<div class="menuitem">
<a href="native_libraries.html">Native Libraries</a>
</div>
<div class="menuitem">
<a href="streaming.html">Streaming</a>
</div>
<div class="menuitem">
<a href="fair_scheduler.html">Fair Scheduler</a>
</div>
<div class="menuitem">
<a href="capacity_scheduler.html">Capacity Scheduler</a>
</div>
<div class="menuitem">
<a href="service_level_auth.html">Service Level Authorization</a>
</div>
<div class="menuitem">
<a href="vaidya.html">Vaidya</a>
</div>
<div class="menuitem">
<a href="hadoop_archives.html">Archives</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">HDFS</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="hdfs_user_guide.html">User Guide</a>
</div>
<div class="menuitem">
<a href="hdfs_design.html">Architecture</a>
</div>
<div class="menuitem">
<a href="hdfs_shell.html">File System Shell Guide</a>
</div>
<div class="menuitem">
<a href="hdfs_permissions_guide.html">Permissions Guide</a>
</div>
<div class="menuitem">
<a href="hdfs_quota_admin_guide.html">Quotas Guide</a>
</div>
<div class="menuitem">
<a href="SLG_user_guide.html">Synthetic Load Generator Guide</a>
</div>
<div class="menuitem">
<a href="libhdfs.html">C API libhdfs</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.4', 'skin/')" id="menu_selected_1.4Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">HOD</div>
<div id="menu_selected_1.4" class="selectedmenuitemgroup" style="display: block;">
<div class="menupage">
<div class="menupagetitle">User Guide</div>
</div>
<div class="menuitem">
<a href="hod_admin_guide.html">Admin Guide</a>
</div>
<div class="menuitem">
<a href="hod_config_guide.html">Config Guide</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.5', 'skin/')" id="menu_1.5Title" class="menutitle">Miscellaneous</div>
<div id="menu_1.5" class="menuitemgroup">
<div class="menuitem">
<a href="api/index.html">API Docs</a>
</div>
<div class="menuitem">
<a href="jdiff/changes.html">API Changes</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/hadoop/">Wiki</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="releasenotes.html">Release Notes</a>
</div>
<div class="menuitem">
<a href="changes.html">Change Log</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<!--+
  |alternative credits
  +-->
<div id="credit2"></div>
</div>
<!--+
    |end Menu
    +-->
<!--+
    |start content
    +-->
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="hod_user_guide.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
        PDF</a>
</div>
<h1>
      HOD User Guide
    </h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Introduction-N1000C"> Introduction </a>
</li>
<li>
<a href="#Getting+Started+Using+HOD"> Getting Started Using HOD </a>
<ul class="minitoc">
<li>
<a href="#A+typical+HOD+session">A typical HOD session</a>
</li>
<li>
<a href="#Running+hadoop+scripts+using+HOD">Running hadoop scripts using HOD</a>
</li>
</ul>
</li>
<li>
<a href="#HOD+Features"> HOD Features </a>
<ul class="minitoc">
<li>
<a href="#Provisioning+and+Managing+Hadoop+Clusters"> Provisioning and Managing Hadoop Clusters </a>
</li>
<li>
<a href="#Using+a+tarball+to+distribute+Hadoop"> Using a tarball to distribute Hadoop </a>
</li>
<li>
<a href="#Using+an+external+HDFS"> Using an external HDFS </a>
</li>
<li>
<a href="#Options+for+Configuring+Hadoop"> Options for Configuring Hadoop </a>
</li>
<li>
<a href="#Viewing+Hadoop+Web-UIs"> Viewing Hadoop Web-UIs </a>
</li>
<li>
<a href="#Collecting+and+Viewing+Hadoop+Logs"> Collecting and Viewing Hadoop Logs </a>
</li>
<li>
<a href="#Auto-deallocation+of+Idle+Clusters"> Auto-deallocation of Idle Clusters </a>
</li>
<li>
<a href="#Specifying+Additional+Job+Attributes"> Specifying Additional Job Attributes </a>
</li>
<li>
<a href="#Capturing+HOD+exit+codes+in+Torque"> Capturing HOD exit codes in Torque </a>
</li>
<li>
<a href="#Command+Line"> Command Line</a>
</li>
<li>
<a href="#Options+Configuring+HOD"> Options Configuring HOD </a>
</li>
</ul>
</li>
<li>
<a href="#Troubleshooting-N10579"> Troubleshooting </a>
<ul class="minitoc">
<li>
<a href="#Hangs+During+Allocation">hod Hangs During Allocation </a>
</li>
<li>
<a href="#Hangs+During+Deallocation">hod Hangs During Deallocation </a>
</li>
<li>
<a href="#Fails+With+an+Error+Code+and+Error+Message">hod Fails With an Error Code and Error Message </a>
</li>
<li>
<a href="#Hadoop+DFSClient+Warns+with+a%0A++NotReplicatedYetException">Hadoop DFSClient Warns with a
  NotReplicatedYetException</a>
</li>
<li>
<a href="#Hadoop+Jobs+Not+Running+on+a+Successfully+Allocated+Cluster"> Hadoop Jobs Not Running on a Successfully Allocated Cluster </a>
</li>
<li>
<a href="#My+Hadoop+Job+Got+Killed"> My Hadoop Job Got Killed </a>
</li>
<li>
<a href="#Hadoop+Job+Fails+with+Message%3A+%27Job+tracker+still+initializing%27"> Hadoop Job Fails with Message: 'Job tracker still initializing' </a>
</li>
<li>
<a href="#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque"> The Exit Codes For HOD Are Not Getting Into Torque </a>
</li>
<li>
<a href="#The+Hadoop+Logs+are+Not+Uploaded+to+HDFS"> The Hadoop Logs are Not Uploaded to HDFS </a>
</li>
<li>
<a href="#Locating+Ringmaster+Logs"> Locating Ringmaster Logs </a>
</li>
<li>
<a href="#Locating+Hodring+Logs"> Locating Hodring Logs </a>
</li>
</ul>
</li>
</ul>
</div>
 
<a name="N1000C"></a><a name="Introduction-N1000C"></a>
<h2 class="h3"> Introduction </h2>
<div class="section">
<a name="Introduction" id="Introduction"></a>
<p>Hadoop On Demand (HOD) is a system for provisioning virtual Hadoop clusters over a large physical cluster. It uses the Torque resource manager to do node allocation. On the allocated nodes, it can start Hadoop Map/Reduce and HDFS daemons. It automatically generates the appropriate configuration files (hadoop-site.xml) for the Hadoop daemons and client. HOD also has the capability to distribute Hadoop to the nodes in the virtual cluster that it allocates. In short, HOD makes it easy for administrators and users to quickly set up and use Hadoop. It is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.</p>
<p>HOD supports Hadoop from version 0.15 onwards.</p>
<p>This guide shows you how to get started using HOD, reviews various HOD features and command line options, and provides detailed troubleshooting help.</p>
</div>
 
<a name="N1001E"></a><a name="Getting+Started+Using+HOD"></a>
<h2 class="h3"> Getting Started Using HOD </h2>
<div class="section">
<a name="Getting_Started_Using_HOD_0_4" id="Getting_Started_Using_HOD_0_4"></a>
<p>This section provides a step-by-step introduction to using HOD for the most basic operations. Before following these steps, it is assumed that HOD and its dependent hardware and software components are set up and configured correctly. This step is generally performed by the system administrators of the cluster.</p>
<p>The HOD user interface is a command line utility called <span class="codefrag">hod</span>. It is driven by a configuration file that is typically set up for users by system administrators. Users can override this configuration when using <span class="codefrag">hod</span>, as described later in this documentation. The configuration file can be specified in two ways when using <span class="codefrag">hod</span>, as described below: </p>
<ul>
   
<li> Specify it on the command line, using the -c option, for example: <span class="codefrag">hod &lt;operation&gt; &lt;required-args&gt; -c path-to-the-configuration-file [other-options]</span>
</li>
   
<li> Set the <em>HOD_CONF_DIR</em> environment variable in the environment where <span class="codefrag">hod</span> will be run. It should point to a directory on the local file system containing a file called <em>hodrc</em>. Note that this is analogous to the <em>HADOOP_CONF_DIR</em> and <em>hadoop-site.xml</em> file for Hadoop. If no configuration file is specified on the command line, <span class="codefrag">hod</span> looks for the <em>HOD_CONF_DIR</em> environment variable and a <em>hodrc</em> file under it.</li>
   
</ul>
<p>In the examples listed below, we do not explicitly specify the configuration option, assuming it is correctly specified.</p>
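<p>For instance, under the environment variable method, a session might begin as follows (the paths shown are illustrative):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ export HOD_CONF_DIR=~/hod-conf-dir</span>
<br>
<span class="codefrag">$ hod allocate -d ~/hod-clusters/test -n 5</span></td>
</tr>
</table>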
<a name="N1005B"></a><a name="A+typical+HOD+session"></a>
<h3 class="h4">A typical HOD session</h3>
<a name="HOD_Session" id="HOD_Session"></a>
<p>A typical HOD session involves at least three steps: allocate, run Hadoop jobs, deallocate. To do this, perform the following steps.</p>
<p>
<strong> Create a Cluster Directory </strong>
</p>
<a name="Create_a_Cluster_Directory" id="Create_a_Cluster_Directory"></a>
<p>The <em>cluster directory</em> is a directory on the local file system where <span class="codefrag">hod</span> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Pass this directory to the <span class="codefrag">hod</span> operations as stated below. If the cluster directory passed doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option. </p>
<p>
<strong> Operation <em>allocate</em></strong>
</p>
<a name="Operation_allocate" id="Operation_allocate"></a>
<p>The <em>allocate</em> operation is used to allocate a set of nodes and install and provision Hadoop on them. It has the following syntax. Note that it requires a cluster_dir (-d, --hod.clusterdir) and the number of nodes (-n, --hod.nodecount) to be allocated:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes [OPTIONS]</span></td>
       
</tr>
     
   
</table>
<p>If the command completes successfully, then <span class="codefrag">cluster_dir/hadoop-site.xml</span> will be generated and will contain information about the allocated cluster. It will also print out the information about the Hadoop web UIs.</p>
<p>An example run of this command produces the following output. Note in this example that <span class="codefrag">~/hod-clusters/test</span> is the cluster directory, and we are allocating 5 nodes:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
   
<tr>
     
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d ~/hod-clusters/test -n 5</span>
<br>
     
<span class="codefrag">INFO - HDFS UI on http://foo1.bar.com:53422</span>
<br>
     
<span class="codefrag">INFO - Mapred UI on http://foo2.bar.com:55380</span>
<br>
</td>
     
</tr>
   
</table>
<p>
<strong> Running Hadoop jobs using the allocated cluster </strong>
</p>
<a name="Running_Hadoop_jobs_using_the_al" id="Running_Hadoop_jobs_using_the_al"></a>
<p>Now, one can run Hadoop jobs using the allocated cluster in the usual manner. This assumes that variables like <em>JAVA_HOME</em> and the path to the Hadoop installation are set up correctly:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hadoop --config cluster_dir hadoop_command hadoop_command_args</span></td>
       
</tr>
     
   
</table>
<p>or</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ export HADOOP_CONF_DIR=cluster_dir</span> 
<br>
             
<span class="codefrag">$ hadoop hadoop_command hadoop_command_args</span></td>
       
</tr>
     
   
</table>
<p>Continuing our example, the following command will run a wordcount example on the allocated cluster:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hadoop --config ~/hod-clusters/test jar /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output</span></td>
</tr>
</table>
<p>or</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
   
<td colspan="1" rowspan="1"><span class="codefrag">$ export HADOOP_CONF_DIR=~/hod-clusters/test</span>
<br>
   
<span class="codefrag">$ hadoop jar /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output</span></td>
   
</tr>
 
</table>
<p>
<strong> Operation <em>deallocate</em></strong>
</p>
<a name="Operation_deallocate" id="Operation_deallocate"></a>
<p>The <em>deallocate</em> operation is used to release an allocated cluster. When finished with a cluster, deallocate must be run so that the nodes become free for others to use. The <em>deallocate</em> operation has the following syntax. Note that it requires the cluster_dir (-d, --hod.clusterdir) argument:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod deallocate -d cluster_dir</span></td>
       
</tr>
     
   
</table>
<p>Continuing our example, the following command will deallocate the cluster:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod deallocate -d ~/hod-clusters/test</span></td>
</tr>
</table>
<p>As can be seen, HOD allows users to allocate a cluster and use it flexibly for running Hadoop jobs. For example, users can run multiple jobs in parallel on the same cluster by running hadoop from multiple shells pointing to the same configuration.</p>
<a name="N1012A"></a><a name="Running+hadoop+scripts+using+HOD"></a>
<h3 class="h4">Running hadoop scripts using HOD</h3>
<a name="HOD_Script_Mode" id="HOD_Script_Mode"></a>
<p>The HOD <em>script operation</em> combines the operations of allocating, using and deallocating a cluster into a single operation. This is very useful for users who want to run a script of hadoop jobs and let HOD handle the cleanup automatically once the script completes. In order to run hadoop scripts using <span class="codefrag">hod</span>, do the following:</p>
<p>
<strong> Create a script file </strong>
</p>
<a name="Create_a_script_file" id="Create_a_script_file"></a>
<p>This will be a regular shell script that will typically contain hadoop commands, such as:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hadoop jar jar_file options</span></td>
 
</tr>
</table>
<p>However, the user can add any valid commands as part of the script. HOD executes this script with <em>HADOOP_CONF_DIR</em> automatically set to point to the allocated cluster, so users do not need to set it themselves. Users do, however, need to specify a cluster directory, just as when using the allocate operation.</p>
<p>
<strong> Running the script </strong>
</p>
<a name="Running_the_script" id="Running_the_script"></a>
<p>The syntax for the <em>script operation</em> is as follows. Note that it requires a cluster directory (-d, --hod.clusterdir), the number of nodes (-n, --hod.nodecount) and a script file (-s, --hod.script):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod script -d cluster_directory -n number_of_nodes -s script_file</span></td>
       
</tr>
     
   
</table>
<p>Note that HOD will deallocate the cluster as soon as the script completes; this means that the script must not exit until the hadoop jobs themselves have completed. Users must take care of this while writing the script. </p>
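<p>For example, a minimal script file might look like the following sketch (the jar and paths are placeholders). A foreground <span class="codefrag">hadoop</span> command keeps the script alive until the job finishes; backgrounded jobs must be waited on explicitly:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">#!/bin/sh</span>
<br>
<span class="codefrag"># runs in the foreground, so the script exits only after the job completes</span>
<br>
<span class="codefrag">hadoop jar /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input /path/to/output</span>
<br>
<span class="codefrag"># if jobs are launched in the background, wait for all of them before exiting:</span>
<br>
<span class="codefrag"># hadoop jar another_job.jar options &amp;</span>
<br>
<span class="codefrag"># wait</span></td>
</tr>
</table>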
</div>
 
<a name="N1016F"></a><a name="HOD+Features"></a>
<h2 class="h3"> HOD Features </h2>
<div class="section">
<a name="HOD_0_4_Features" id="HOD_0_4_Features"></a><a name="N10177"></a><a name="Provisioning+and+Managing+Hadoop+Clusters"></a>
<h3 class="h4"> Provisioning and Managing Hadoop Clusters </h3>
<a name="Provisioning_and_Managing_Hadoop" id="Provisioning_and_Managing_Hadoop"></a>
<p>The primary feature of HOD is to provision Hadoop Map/Reduce and HDFS clusters. This is described above in the Getting Started section. Also, as long as nodes are available, and organizational policies allow, a user can use HOD to allocate multiple Map/Reduce clusters simultaneously. The user needs to specify a different path for the <span class="codefrag">cluster_dir</span> parameter mentioned above for each cluster allocated. HOD provides the <em>list</em> and the <em>info</em> operations to enable managing multiple clusters.</p>
<p>
<strong> Operation <em>list</em></strong>
</p>
<a name="Operation_list" id="Operation_list"></a>
<p>The list operation lists all the clusters allocated so far by a user. For each cluster, it shows the cluster directory where the hadoop-site.xml is stored, and its status with respect to connectivity with the JobTracker and/or HDFS. The list operation has the following syntax:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod list</span></td>
       
</tr>
     
   
</table>
<p>
<strong> Operation <em>info</em></strong>
</p>
<a name="Operation_info" id="Operation_info"></a>
<p>The info operation shows information about a given cluster. The information shown includes the Torque job id, and locations of the important daemons like the HOD Ringmaster process, and the Hadoop JobTracker and NameNode daemons. The info operation has the following syntax. Note that it requires a cluster directory (-d, --hod.clusterdir):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
     
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod info -d cluster_dir</span></td>
       
</tr>
     
   
</table>
<p>The <span class="codefrag">cluster_dir</span> should be a valid cluster directory specified in an earlier <em>allocate</em> operation.</p>
<a name="N101C2"></a><a name="Using+a+tarball+to+distribute+Hadoop"></a>
<h3 class="h4"> Using a tarball to distribute Hadoop </h3>
<a name="Using_a_tarball_to_distribute_Ha" id="Using_a_tarball_to_distribute_Ha"></a>
<p>When provisioning Hadoop, HOD can use either a pre-installed Hadoop on the cluster nodes or distribute and install a Hadoop tarball as part of the provisioning operation. If the tarball option is used, there is no need for a pre-installed Hadoop on the cluster nodes. This is especially useful in a development/QE environment where individual developers may have different versions of Hadoop to test on a shared cluster. </p>
<p>In order to use a pre-installed Hadoop, you must specify, in the hodrc, the <span class="codefrag">pkgs</span> option in the <span class="codefrag">gridservice-hdfs</span> and <span class="codefrag">gridservice-mapred</span> sections. This must point to the path where Hadoop is installed on all nodes of the cluster.</p>
<p>The syntax for specifying a tarball is as follows:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -t hadoop_tarball_location</span></td>
       
</tr>
   
</table>
<p>For example, the following command allocates Hadoop provided by the tarball <span class="codefrag">~/share/hadoop.tar.gz</span>:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d ~/hadoop-cluster -n 10 -t ~/share/hadoop.tar.gz</span></td>
</tr>
</table>
<p>Similarly, when using hod script, the syntax is as follows:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod script -d cluster_directory -s script_file -n number_of_nodes -t hadoop_tarball_location</span></td>
       
</tr>
   
</table>
<p>The hadoop_tarball specified in the syntax above should point to a path on a shared file system that is accessible from all the compute nodes. Currently, HOD only supports NFS mounted file systems.</p>
<p>
<em>Note:</em>
</p>
<ul>
   
<li> For better distribution performance it is recommended that the Hadoop tarball contain only the libraries and binaries, and not the source or documentation.</li>
   
<li> When you want to run jobs against a cluster allocated using the tarball, you must use a compatible version of hadoop to submit your jobs. The best would be to untar and use the version that is present in the tarball itself.</li>
   
<li> You need to make sure that there are no Hadoop configuration files, hadoop-env.sh and hadoop-site.xml, present in the conf directory of the tarred distribution. The presence of these files with incorrect values could cause cluster allocation to fail.</li>
 
</ul>
<a name="N10218"></a><a name="Using+an+external+HDFS"></a>
<h3 class="h4"> Using an external HDFS </h3>
<a name="Using_an_external_HDFS" id="Using_an_external_HDFS"></a>
<p>In typical Hadoop clusters provisioned by HOD, HDFS is already set up statically (without using HOD). This allows data to persist in HDFS after the HOD-provisioned clusters are deallocated. To use a statically configured HDFS, your hodrc must point to an external HDFS. Specifically, set the following options to the correct values in the section <span class="codefrag">gridservice-hdfs</span> of the hodrc:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1">external = true</td>
</tr>
<tr>
<td colspan="1" rowspan="1">host = Hostname of the HDFS NameNode</td>
</tr>
<tr>
<td colspan="1" rowspan="1">fs_port = Port number of the HDFS NameNode</td>
</tr>
<tr>
<td colspan="1" rowspan="1">info_port = Port number of the HDFS NameNode web UI</td>
</tr>
</table>
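<p>Put together, the <span class="codefrag">gridservice-hdfs</span> section of a hodrc pointing to an external HDFS might look like the following sketch; the hostname and port numbers are placeholders for your NameNode's actual values:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">[gridservice-hdfs]</span>
<br>
<span class="codefrag">external = true</span>
<br>
<span class="codefrag">host = namenode.foo.com</span>
<br>
<span class="codefrag">fs_port = 50040</span>
<br>
<span class="codefrag">info_port = 50070</span></td>
</tr>
</table>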
<p>
<em>Note:</em> You can also enable this option from the command line. That is, to use a static HDFS, run: <br>
   
</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes --gridservice-hdfs.external</span></td>
       
</tr>
   
</table>
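<p>Since any hodrc option can also be supplied using the long option format described later in this document, the NameNode location can be given on the command line as well, for example (the hostname and port are placeholders):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes --gridservice-hdfs.external --gridservice-hdfs.host=namenode.foo.com --gridservice-hdfs.fs_port=50040</span></td>
</tr>
</table>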
<p>HOD can be used to provision an HDFS cluster as well as a Map/Reduce cluster, if required. To do so, set the following option in the section <span class="codefrag">gridservice-hdfs</span> of the hodrc:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1">external = false</td>
</tr>
</table>
<a name="N1025C"></a><a name="Options+for+Configuring+Hadoop"></a>
<h3 class="h4"> Options for Configuring Hadoop </h3>
<a name="Options_for_Configuring_Hadoop" id="Options_for_Configuring_Hadoop"></a>
<p>HOD provides a very convenient mechanism to configure both the Hadoop daemons that it provisions and also the hadoop-site.xml that it generates on the client side. This is done by specifying Hadoop configuration parameters in either the HOD configuration file, or from the command line when allocating clusters.</p>
<p>
<strong> Configuring Hadoop Daemons </strong>
</p>
<a name="Configuring_Hadoop_Daemons" id="Configuring_Hadoop_Daemons"></a>
<p>For configuring the Hadoop daemons, you can do the following:</p>
<p>For Map/Reduce, specify the options as a comma separated list of key-value pairs to the <span class="codefrag">server-params</span> option in the <span class="codefrag">gridservice-mapred</span> section. Likewise, for a dynamically provisioned HDFS cluster, specify the options in the <span class="codefrag">server-params</span> option in the <span class="codefrag">gridservice-hdfs</span> section. If these parameters should be marked as <em>final</em>, then include them in the <span class="codefrag">final-server-params</span> option of the appropriate section.</p>
<p>For example:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">server-params = mapred.reduce.parallel.copies=20,io.sort.factor=100,io.sort.mb=128,io.file.buffer.size=131072</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">final-server-params = mapred.child.java.opts=-Xmx512m,dfs.block.size=134217728,fs.inmemory.size.mb=128</span></td>
 
</tr>
</table>
<p>To provide the options from the command line, you can use the following syntax:</p>
<p>For configuring the Map/Reduce daemons, use:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -Mmapred.reduce.parallel.copies=20 -Mio.sort.factor=100</span></td>
       
</tr>
   
</table>
<p>In the example above, the <em>mapred.reduce.parallel.copies</em> parameter and the <em>io.sort.factor</em> parameter will be appended to the other <span class="codefrag">server-params</span>, or, if they already exist in <span class="codefrag">server-params</span>, will override them. To specify these as <em>final</em> parameters, you can use:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -Fmapred.reduce.parallel.copies=20 -Fio.sort.factor=100</span></td>
       
</tr>
   
</table>
<p>However, note that final parameters cannot be overridden from the command line; they can only be appended if not already specified.</p>
<p>Similar options exist for configuring dynamically provisioned HDFS daemons. To do so, replace -M with -H and -F with -S.</p>
<p>
<strong> Configuring Hadoop Job Submission (Client) Programs </strong>
</p>
<a name="Configuring_Hadoop_Job_Submissio" id="Configuring_Hadoop_Job_Submissio"></a>
<p>As mentioned above, if the allocation operation completes successfully then <span class="codefrag">cluster_dir/hadoop-site.xml</span> will be generated and will contain information about the allocated cluster's JobTracker and NameNode. This configuration is used when submitting jobs to the cluster. HOD provides an option to include additional Hadoop configuration parameters into this file. The syntax for doing so is as follows:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -Cmapred.userlog.limit.kb=200 -Cmapred.child.java.opts=-Xmx512m</span></td>
       
</tr>
   
</table>
<p>In this example, the <em>mapred.userlog.limit.kb</em> and <em>mapred.child.java.opts</em> options will be included into the hadoop-site.xml that is generated by HOD.</p>
<a name="N102EE"></a><a name="Viewing+Hadoop+Web-UIs"></a>
<h3 class="h4"> Viewing Hadoop Web-UIs </h3>
<a name="Viewing_Hadoop_Web_UIs" id="Viewing_Hadoop_Web_UIs"></a>
<p>The HOD allocation operation prints the JobTracker and NameNode web UI URLs. For example:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d ~/hadoop-cluster -n 10 -c ~/hod-conf-dir/hodrc</span>
<br>
   
<span class="codefrag">INFO - HDFS UI on http://host242.foo.com:55391</span>
<br>
   
<span class="codefrag">INFO - Mapred UI on http://host521.foo.com:54874</span>
    </td>
</tr>
</table>
<p>The same information is also available via the <em>info</em> operation described above.</p>
<a name="N10310"></a><a name="Collecting+and+Viewing+Hadoop+Logs"></a>
<h3 class="h4"> Collecting and Viewing Hadoop Logs </h3>
<a name="Collecting_and_Viewing_Hadoop_Lo" id="Collecting_and_Viewing_Hadoop_Lo"></a>
<p>To get the Hadoop logs of the daemons running on one of the allocated nodes: </p>
<ul>
   
<li> Log into the node of interest. If you want to look at the logs of the JobTracker or NameNode, then you can find the node running these by using the <em>list</em> and <em>info</em> operations mentioned above.</li>
   
<li> Get the process information of the daemon of interest (for example, <span class="codefrag">ps ux | grep TaskTracker</span>)</li>
   
<li> In the process information, search for the value of the variable <span class="codefrag">-Dhadoop.log.dir</span>. Typically this will be a descendant directory of the <span class="codefrag">hodring.temp-dir</span> value from the hod configuration file (see the sketch after this list).</li>
   
<li> Change to the <span class="codefrag">hadoop.log.dir</span> directory to view daemon and user logs.</li>
 
</ul>
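<p>For example, the sequence of steps might look like the following sketch; the process output and log directory shown are hypothetical, so use the value actually reported on your node:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ ps ux | grep TaskTracker</span>
<br>
<span class="codefrag">... java -Dhadoop.log.dir=/tmp/hod/user/logs ... org.apache.hadoop.mapred.TaskTracker ...</span>
<br>
<span class="codefrag">$ cd /tmp/hod/user/logs</span></td>
</tr>
</table>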
<p>HOD also provides a mechanism to collect logs when a cluster is being deallocated and persist them to a file system or an externally configured HDFS. This allows these logs to be viewed after the jobs have completed and the nodes are released. To do so, configure the log-destination-uri to a URI as follows:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">log-destination-uri = hdfs://host123:45678/user/hod/logs</span> or</td>
</tr>
   
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">log-destination-uri = file://path/to/store/log/files</span></td>
</tr>
   
</table>
<p>Under the root directory specified in the URI, HOD will create a path user_name/torque_jobid and store gzipped log files for each node that was part of the job.</p>
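<p>For example, with <span class="codefrag">log-destination-uri = hdfs://host123:45678/user/hod/logs</span>, the logs for a hypothetical user <span class="codefrag">foo</span> whose Torque job id is <span class="codefrag">123.bar.com</span> would be stored under <span class="codefrag">hdfs://host123:45678/user/hod/logs/foo/123.bar.com</span>.</p>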
<p>Note that to store the files to HDFS, you may need to configure the <span class="codefrag">hodring.pkgs</span> option with the Hadoop version that matches the HDFS mentioned. If not, HOD will try to use the Hadoop version that it is using to provision the Hadoop cluster itself.</p>
<a name="N10359"></a><a name="Auto-deallocation+of+Idle+Clusters"></a>
<h3 class="h4"> Auto-deallocation of Idle Clusters </h3>
<a name="Auto_deallocation_of_Idle_Cluste" id="Auto_deallocation_of_Idle_Cluste"></a>
<p>HOD automatically deallocates clusters that are not running Hadoop jobs for a given period of time. Each HOD allocation includes a monitoring facility that constantly checks for running Hadoop jobs. If it detects no running Hadoop jobs for a given period, it will automatically deallocate its own cluster and thus free up nodes which are not being used effectively.</p>
<p>
<em>Note:</em> Although the cluster is deallocated, the <em>cluster directory</em> is not cleaned up automatically. The user must deallocate this cluster through the regular <em>deallocate</em> operation to clean it up.</p>
<a name="N1036F"></a><a name="Specifying+Additional+Job+Attributes"></a>
<h3 class="h4"> Specifying Additional Job Attributes </h3>
<a name="Specifying_Additional_Job_Attrib" id="Specifying_Additional_Job_Attrib"></a>
<p>HOD allows the user to specify a wallclock time and a name (or title) for a Torque job. </p>
<p>The wallclock time is the estimated amount of time for which the Torque job will be valid. After this time has expired, Torque will automatically delete the job and free up the nodes. Specifying the wallclock time can also help the job scheduler to better schedule jobs, and help improve utilization of cluster resources.</p>
<p>To specify the wallclock time, use the following syntax:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -l time_in_seconds</span></td>
       
</tr>
   
</table>
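<p>For example, since the time is specified in seconds, the following requests a cluster of 5 nodes for one hour:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d ~/hod-clusters/test -n 5 -l 3600</span></td>
</tr>
</table>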
<p>The name or title of a Torque job helps in user-friendly identification of the job. The string specified here will show up in all information where Torque job attributes are displayed, including the <span class="codefrag">qstat</span> command.</p>
<p>To specify the name or title, use the following syntax:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
       
<tr>
         
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -N name_of_job</span></td>
       
</tr>
   
</table>
<p>
<em>Note:</em> Due to a restriction in the underlying Torque resource manager, names that do not start with an alphabetic character or that contain a space will cause the job to fail. The failure message points to the problem being in the specified job name.</p>
<a name="N103A6"></a><a name="Capturing+HOD+exit+codes+in+Torque"></a>
<h3 class="h4"> Capturing HOD exit codes in Torque </h3>
<a name="Capturing_HOD_exit_codes_in_Torq" id="Capturing_HOD_exit_codes_in_Torq"></a>
<p>HOD exit codes are captured in the Torque exit_status field. This helps users and system administrators to distinguish successful runs from unsuccessful runs of HOD. The exit codes are 0 if allocation succeeded and all hadoop jobs ran on the allocated cluster correctly. They are non-zero if allocation failed or some of the hadoop jobs failed on the allocated cluster. The possible exit codes are listed in the table below. <em>Note: Hadoop job status is captured only if the version of Hadoop used is 0.16 or above.</em>
</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
   
     
<tr>
       
<td colspan="1" rowspan="1"> Exit Code </td>
        <td colspan="1" rowspan="1"> Meaning </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 6 </td>
        <td colspan="1" rowspan="1"> Ringmaster failure </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 7 </td>
        <td colspan="1" rowspan="1"> HDFS failure </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 8 </td>
        <td colspan="1" rowspan="1"> Job tracker failure </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 10 </td>
        <td colspan="1" rowspan="1"> Cluster dead </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 12 </td>
        <td colspan="1" rowspan="1"> Cluster already allocated </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 13 </td>
        <td colspan="1" rowspan="1"> HDFS dead </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 14 </td>
        <td colspan="1" rowspan="1"> Mapred dead </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 16 </td>
        <td colspan="1" rowspan="1"> All Map/Reduce jobs that ran on the cluster failed. Refer to hadoop logs for more details. </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 17 </td>
        <td colspan="1" rowspan="1"> Some of the Map/Reduce jobs that ran on the cluster failed. Refer to hadoop logs for more details. </td>
     
</tr>
   
 
</table>
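<p>As a sketch of how this can be used: after a job finishes, the captured code can typically be inspected through Torque's <span class="codefrag">qstat -f</span> output (whether completed jobs remain visible depends on how your Torque server is configured):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ qstat -f torque_job_id | grep exit_status</span></td>
</tr>
</table>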
<a name="N10438"></a><a name="Command+Line"></a>
<h3 class="h4"> Command Line</h3>
<a name="Command_Line" id="Command_Line"></a>
<p>The HOD command line has the following general syntax:<br>
     
<em>hod &lt;operation&gt; [ARGS] [OPTIONS]<br>
</em>
      Allowed operations are 'allocate', 'deallocate', 'info', 'list', 'script' and 'help'. For help on a particular operation, run <span class="codefrag">hod help &lt;operation&gt;</span>. To see the possible options, run <span class="codefrag">hod help options</span>.
</p>
<p>
<em>allocate</em>
<br>
     
<em>Usage : hod allocate -d cluster_dir -n number_of_nodes [OPTIONS]</em>
<br>
        Allocates a cluster on the given number of cluster nodes, and stores the allocation information in cluster_dir for use with subsequent <span class="codefrag">hadoop</span> commands. Note that if the <span class="codefrag">cluster_dir</span> does not already exist, HOD will automatically try to create it.</p>
<p>
<em>list</em>
<br>
     
<em>Usage : hod list [OPTIONS]</em>
<br>
       Lists the clusters allocated by this user. Information provided includes the Torque job id corresponding to the cluster, the cluster directory where the allocation information is stored, and whether the Map/Reduce daemon is still active or not.</p>
<p>
<em>info</em>
<br>
     
<em>Usage : hod info -d cluster_dir [OPTIONS]</em>
<br>
        Lists information about the cluster whose allocation information is stored in the specified cluster directory.</p>
<p>
<em>deallocate</em>
<br>
     
<em>Usage : hod deallocate -d cluster_dir [OPTIONS]</em>
<br>
        Deallocates the cluster whose allocation information is stored in the specified cluster directory.</p>
<p>
<em>script</em>
<br>
     
<em>Usage : hod script -s script_file -d cluster_directory -n number_of_nodes [OPTIONS]</em>
<br>
        Runs a hadoop script using the HOD <em>script</em> operation. Provisions Hadoop on a given number of nodes, executes the given script from the submitting node, and deallocates the cluster when the script completes.</p>
<p>
<em>help</em>
<br>
     
<em>Usage : hod help [operation | 'options']</em>
<br>
       When no argument is specified, <span class="codefrag">hod help</span> gives the usage and basic options, and is equivalent to <span class="codefrag">hod --help</span> (see below). When 'options' is given as the argument, hod displays only the basic options that hod takes. When an operation is specified, it displays the usage and description corresponding to that particular operation. For example, to know about the allocate operation, one can run <span class="codefrag">hod help allocate</span>.
</p>
<p>Besides the operations, HOD can take the following command line options.</p>
<p>
<em>--help</em>
<br>
        Prints out the help message to see the usage and basic options.</p>
<p>
<em>--verbose-help</em>
<br>
        All configuration options provided in the hodrc file can be passed on the command line, using the syntax <span class="codefrag">--section_name.option_name[=value]</span>. When provided this way, the value provided on command line overrides the option provided in hodrc. The verbose-help command lists all the available options in the hodrc file. This is also a nice way to see the meaning of the configuration options.</p>
<p>See the <a href="#Options_Configuring_HOD">next section</a> for a description of the most important hod configuration options. For basic options, one can run <span class="codefrag">hod help options</span>; for all options possible in the hod configuration, one can see <span class="codefrag">hod --verbose-help</span>. See the <a href="hod_config_guide.html">config guide</a> for a description of all options.</p>
<a name="N104BF"></a><a name="Options+Configuring+HOD"></a>
<h3 class="h4"> Options Configuring HOD </h3>
<a name="Options_Configuring_HOD" id="Options_Configuring_HOD"></a>
<p>As described above, HOD is configured using a configuration file that is usually set up by system administrators. This is an INI-style configuration file that is divided into sections, with options inside each section. Each section relates to one of the HOD processes: client, ringmaster, hodring, mapreduce or hdfs. The options inside a section comprise an option name and a value. </p>
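<p>Schematically, a hodrc is laid out as in the following sketch. The options shown are examples drawn from elsewhere in this guide, not a complete file, and the temp-dir path is a placeholder:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">[hod]</span>
<br>
<span class="codefrag">...</span>
<br>
<span class="codefrag">[ringmaster]</span>
<br>
<span class="codefrag">...</span>
<br>
<span class="codefrag">[hodring]</span>
<br>
<span class="codefrag">temp-dir = /tmp/hod</span>
<br>
<span class="codefrag">[gridservice-mapred]</span>
<br>
<span class="codefrag">server-params = mapred.reduce.parallel.copies=20</span>
<br>
<span class="codefrag">[gridservice-hdfs]</span>
<br>
<span class="codefrag">external = false</span></td>
</tr>
</table>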
<p>Users can override the default configuration in two ways: </p>
<ul>
   
<li> Users can supply their own configuration file to HOD in each of the commands, using the <span class="codefrag">-c</span> option</li>
   
<li> Users can supply specific configuration options to HOD. Options provided on the command line <em>override</em> the values provided in the configuration file being used.</li>
 
</ul>
<p>This section describes some of the most commonly used configuration options. These commonly used options are provided with a <em>short</em> option for convenience of specification. All other options can be specified using a <em>long</em> option that is also described below.</p>
<p>
<em>-c config_file</em>
<br>
    Provides the configuration file to use. Can be used with all other options of HOD. Alternatively, the <span class="codefrag">HOD_CONF_DIR</span> environment variable can be defined to specify a directory that contains a file named <span class="codefrag">hodrc</span>, alleviating the need to specify the configuration file in each HOD command.</p>
<p>
<em>-d cluster_dir</em>
<br>
        This is required for most of the hod operations. As described <a href="#Create_a_Cluster_Directory">here</a>, the <em>cluster directory</em> is a directory on the local file system where <span class="codefrag">hod</span> will generate the Hadoop configuration, <em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Pass it to the <span class="codefrag">hod</span> operations as an argument to -d or --hod.clusterdir. If it doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option.</p>
<p>
<em>-n number_of_nodes</em>
<br>
  This is required for the hod <em>allocate</em> operation and for the <em>script</em> operation. It denotes the number of nodes to be allocated.</p>
<p>
<em>-s script-file</em>
<br>
   Required when using the <em>script</em> operation; specifies the script file to execute.</p>
<p>
<em>-b 1|2|3|4</em>
<br>
    Enables the given debug level. Can be used with all other options of HOD. 4 is the most verbose.</p>
<p>
<em>-t hadoop_tarball</em>
<br>
    Provisions Hadoop from the given tar.gz file. This option is only applicable to the <em>allocate</em> operation. For better distribution performance it is strongly recommended that the Hadoop tarball is created <em>after</em> removing the source or documentation.</p>
<p>
<em>-N job-name</em>
<br>
    The name to give to the resource manager job that HOD uses underneath. For example, in the case of Torque, this translates to the <span class="codefrag">qsub -N</span> option, and can be seen as the job name using the <span class="codefrag">qstat</span> command.</p>
<p>
<em>-l wall-clock-time</em>
<br>
    The amount of time for which the user expects to have work on the allocated cluster. This is passed to the resource manager underneath HOD, and can be used in more efficient scheduling and utilization of the cluster. Note that in the case of Torque, the cluster is automatically deallocated after this time expires.</p>
<p>
<em>-j java-home</em>
<br>
    Path to be set as the JAVA_HOME environment variable. This is used in the <em>script</em> operation. HOD sets the JAVA_HOME environment variable to this value and launches the user script in that environment.</p>
<p>
<em>-A account-string</em>
<br>
    Accounting information to pass to underlying resource manager.</p>
<p>
<em>-Q queue-name</em>
<br>
    Name of the queue in the underlying resource manager to which the job must be submitted.</p>
<p>
<em>-Mkey1=value1 -Mkey2=value2</em>
<br>
    Provides configuration parameters for the provisioned Map/Reduce daemons (JobTracker and TaskTrackers). A hadoop-site.xml is generated with these values on the cluster nodes. <br>
   
<em>Note:</em> Values that contain any of the following characters: space, comma, equals sign, semicolon, must be escaped with a '\' character and enclosed within quotes. You can escape a '\' with a '\' too. </p>
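<p>For example (a sketch; the exact quoting also depends on your shell), a value containing a space could be passed as:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">$ hod allocate -d cluster_dir -n number_of_nodes -M"mapred.child.java.opts=-Xmx512m\ -Xms256m"</span></td>
</tr>
</table>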
<p>
<em>-Hkey1=value1 -Hkey2=value2</em>
<br>
    Provides configuration parameters for the provisioned HDFS daemons (NameNode and DataNodes). A hadoop-site.xml is generated with these values on the cluster nodes. <br>
   
<em>Note:</em> Values that contain any of the following characters: space, comma, equals sign, semicolon, must be escaped with a '\' character and enclosed within quotes. You can escape a '\' with a '\' too. </p>
<p>
<em>-Ckey1=value1 -Ckey2=value2</em>
<br>
    Provides configuration parameters for the client from where jobs can be submitted. A hadoop-site.xml is generated with these values on the submit node. <br>
   
<em>Note:</em> Values that contain any of the following characters: space, comma, equals sign, semicolon, must be escaped with a '\' character and enclosed within quotes. You can escape a '\' with a '\' too. </p>
<p>
<em>--section-name.option-name=value</em>
<br>
    This is the method to provide options using the <em>long</em> format. For example, you could say <em>--hod.script-wait-time=20</em>
</p>
</div>
       
<a name="N10579"></a><a name="Troubleshooting-N10579"></a>
<h2 class="h3"> Troubleshooting </h2>
<div class="section">
<a name="Troubleshooting" id="Troubleshooting"></a>
<p>The following section identifies some of the most likely error conditions users can run into when using HOD, and ways to troubleshoot them.</p>
<a name="N10584"></a><a name="Hangs+During+Allocation"></a>
<h3 class="h4">hod Hangs During Allocation </h3>
<a name="_hod_Hangs_During_Allocation" id="_hod_Hangs_During_Allocation"></a><a name="hod_Hangs_During_Allocation" id="hod_Hangs_During_Allocation"></a>
<p>
<em>Possible Cause:</em> One of the HOD or Hadoop components has failed to come up. In such a case, the <span class="codefrag">hod</span> command will return after a few minutes (typically 2-3 minutes) with an error code of either 7 or 8 as defined in the Error Codes section. Refer to that section for further details. </p>
<p>
<em>Possible Cause:</em> A large allocation was fired with a tarball. Sometimes, due to load on the network or on the allocated nodes, the tarball distribution can be significantly slow and take a few minutes to complete. Wait for completion. Also check that the tarball does not contain the Hadoop sources or documentation.</p>
<p>
<em>Possible Cause:</em> A Torque related problem. If the cause is Torque related, the <span class="codefrag">hod</span> command will not return for more than 5 minutes. Running <span class="codefrag">hod</span> in debug mode may show the <span class="codefrag">qstat</span> command being executed repeatedly. Executing the <span class="codefrag">qstat</span> command from a separate shell may show that the job is in the <span class="codefrag">Q</span> (Queued) state. This usually indicates a problem with Torque. Possible causes could include some nodes being down, or new nodes added that Torque is not aware of. Generally, system administrator help is needed to resolve this problem.</p>
<a name="N105B1"></a><a name="Hangs+During+Deallocation"></a>
<h3 class="h4">hod Hangs During Deallocation </h3>
<a name="_hod_Hangs_During_Deallocation" id="_hod_Hangs_During_Deallocation"></a><a name="hod_Hangs_During_Deallocation" id="hod_Hangs_During_Deallocation"></a>
<p>
<em>Possible Cause:</em> A Torque related problem, usually load on the Torque server, or a very large allocation. Generally, waiting for the command to complete is the only option.</p>
<a name="N105C2"></a><a name="Fails+With+an+Error+Code+and+Error+Message"></a>
<h3 class="h4">hod Fails With an Error Code and Error Message </h3>
<a name="hod_Fails_With_an_error_code_and" id="hod_Fails_With_an_error_code_and"></a><a name="_hod_Fails_With_an_error_code_an" id="_hod_Fails_With_an_error_code_an"></a>
<p>If the exit code of the <span class="codefrag">hod</span> command is not <span class="codefrag">0</span>, then refer to the following table of error exit codes to determine why the code may have occurred and how to debug the situation.</p>
<p>
<strong> Error Codes </strong>
</p>
<a name="Error_Codes" id="Error_Codes"></a>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
   
     
<tr>
       
<th colspan="1" rowspan="1">Error Code</th>
        <th colspan="1" rowspan="1">Meaning</th>
        <th colspan="1" rowspan="1">Possible Causes and Remedial Actions</th>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 1 </td>
        <td colspan="1" rowspan="1"> Configuration error </td>
        <td colspan="1" rowspan="1"> Incorrect configuration values specified in hodrc, or other errors related to HOD configuration. The error messages in this case must be sufficient to debug and fix the problem. </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 2 </td>
        <td colspan="1" rowspan="1"> Invalid operation </td>
        <td colspan="1" rowspan="1"> Do <span class="codefrag">hod help</span> for the list of valid operations. </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 3 </td>
        <td colspan="1" rowspan="1"> Invalid operation arguments </td>
        <td colspan="1" rowspan="1"> Do <span class="codefrag">hod help operation</span> for listing the usage of a particular operation.</td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 4 </td>
        <td colspan="1" rowspan="1"> Scheduler failure </td>
        <td colspan="1" rowspan="1"> 1. Requested more resources than available. Run <span class="codefrag">checknodes cluster_name</span> to see if enough nodes are available. <br>
          2. Requested resources exceed resource manager limits. <br>
          3. Torque is misconfigured, the path to Torque binaries is misconfigured, or other Torque problems. Contact system administrator. </td>
     
</tr>
     
<tr>
       
<td colspan="1" rowspan="1"> 5 </td>
        <td colspan="1" rowspan="1"> Job execution failure </td>
<td colspan="1" rowspan="1"> 1. The Torque job was deleted from outside. Execute the Torque <span class="codefrag">qstat</span> command to see if you have any jobs in the <span class="codefrag">R</span> (Running) state. If none exist, try re-executing HOD. <br>
          2. Torque problems such as the server momentarily going down, or becoming unresponsive. Contact system administrator. <br>
          3. The system administrator might have configured account verification, and an invalid account is specified. Contact system administrator.</td>
     
</tr>
<tr>
<td colspan="1" rowspan="1"> 6 </td>
        <td colspan="1" rowspan="1"> Ringmaster failure </td>
        <td colspan="1" rowspan="1"> HOD prints the message "Cluster could not be allocated because of the following errors on the ringmaster host &lt;hostname&gt;". The actual error message may indicate one of the following:<br>
          1. Invalid configuration on the node running the ringmaster, specified by the hostname in the error message.<br>
          2. Invalid configuration in the <span class="codefrag">ringmaster</span> section.<br>
          3. Invalid <span class="codefrag">pkgs</span> option in the <span class="codefrag">gridservice-mapred</span> or <span class="codefrag">gridservice-hdfs</span> section.<br>
          4. An invalid Hadoop tarball, or a tarball which has bundled an invalid configuration file in the conf directory.<br>
          5. A Hadoop version mismatch between MapReduce and an external HDFS.<br>
          The Torque <span class="codefrag">qstat</span> command will most likely show a job in the <span class="codefrag">C</span> (Completed) state. <br>
          You can log in to the ringmaster host given in the HOD failure message and debug the problem with the help of the error message. If the error message does not give complete information, the ringmaster logs should help you find the root cause of the problem. Refer to the section <em>Locating Ringmaster Logs</em> below for more information. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 7 </td>
        <td colspan="1" rowspan="1"> HDFS failure </td>
        <td colspan="1" rowspan="1"> When HOD fails to allocate due to HDFS failures (or JobTracker failures, error code 8, see below), it prints the failure message "Hodring at &lt;hostname&gt; failed with following errors:" followed by the actual error message, which may indicate one of the following:<br>
          1. A problem in starting the Hadoop clusters. Usually the actual cause in the error message indicates the problem on the hostname mentioned. Also review the Hadoop-related configuration in the HOD configuration files, and look at the Hadoop logs using the information in the <em>Collecting and Viewing Hadoop Logs</em> section above. <br>
          2. Invalid configuration on the node running the hodring, specified by the hostname in the error message. <br>
          3. Invalid configuration in the <span class="codefrag">hodring</span> section of hodrc. <span class="codefrag">ssh</span> to the hostname specified in the error message and grep for <span class="codefrag">ERROR</span> or <span class="codefrag">CRITICAL</span> in the hodring logs. Refer to the section <em>Locating Hodring Logs</em> below for more information. <br>
          4. An invalid tarball, that is, one which is not packaged correctly. <br>
          5. Inability to communicate with an externally configured HDFS.<br>
          When such an HDFS or JobTracker failure occurs, you can log in to the host named in the HOD failure message and debug the problem. While fixing the problem, also review other messages in the ringmaster log to see which other machines might have had problems bringing up the JobTracker or NameNode, apart from the hostname reported in the failure message. Other machines can also have problems because HOD continues trying to launch Hadoop daemons on multiple machines, one after another, depending on the value of the configuration variable <a href="hod_config_guide.html#3.4+ringmaster+options">ringmaster.max-master-failures</a>. Refer to the section <em>Locating Ringmaster Logs</em> below for more about ringmaster logs.
          </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 8 </td>
        <td colspan="1" rowspan="1"> Job tracker failure </td>
        <td colspan="1" rowspan="1"> Similar to the causes in the <em>HDFS failure</em> case. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 10 </td>
        <td colspan="1" rowspan="1"> Cluster dead </td>
        <td colspan="1" rowspan="1"> 1. The cluster was auto-deallocated because it was idle for a long time. <br>
          2. The cluster was auto-deallocated because the wallclock time specified by the system administrator or user was exceeded. <br>
          3. Cannot communicate with the JobTracker and HDFS NameNode that were successfully allocated. Deallocate the cluster and allocate again. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 12 </td>
        <td colspan="1" rowspan="1"> Cluster already allocated </td>
        <td colspan="1" rowspan="1"> The cluster directory specified has been used in a previous allocate operation and is not yet deallocated. Specify a different directory, or deallocate the previous allocation first. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 13 </td>
        <td colspan="1" rowspan="1"> HDFS dead </td>
        <td colspan="1" rowspan="1"> Cannot communicate with the HDFS NameNode; the NameNode went down. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 14 </td>
        <td colspan="1" rowspan="1"> Mapred dead </td>
        <td colspan="1" rowspan="1"> 1. The cluster was auto-deallocated because it was idle for a long time. <br>
          2. The cluster was auto-deallocated because the wallclock time specified by the system administrator or user was exceeded. <br>
          3. Cannot communicate with the Map/Reduce JobTracker; the JobTracker node went down. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> 15 </td>
        <td colspan="1" rowspan="1"> Cluster not allocated </td>
        <td colspan="1" rowspan="1"> An operation which requires an allocated cluster was given a cluster directory with no state information. </td>
</tr>
<tr>
<td colspan="1" rowspan="1"> Any non-zero exit code </td>
        <td colspan="1" rowspan="1"> HOD script error </td>
        <td colspan="1" rowspan="1"> If the hod script option was used, the exit code is likely from the script. Unfortunately, this could clash with the exit codes of the hod command itself. To help users differentiate the two, hod writes the script's exit code to a file called script.exitcode in the cluster directory if the script returned an exit code. You can cat this file to determine the script's exit code; if the file does not exist, the code is a hod command exit code. See the example after this table.</td>
</tr>
</table>
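<p>For instance, a minimal shell check can distinguish the two cases. This is a sketch assuming a bash-like shell; the cluster directory and script name are illustrative:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod script -d ~/hod-clusters/test -n 4 -s my-job.sh; echo "exit code: $?"</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">test -f ~/hod-clusters/test/script.exitcode &amp;&amp; cat ~/hod-clusters/test/script.exitcode</span></td>
</tr>
</table>
<p>If script.exitcode exists, the reported code came from your script; otherwise it is a hod error code from the table above.</p>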
<a name="N10757"></a><a name="Hadoop+DFSClient+Warns+with+a%0A++NotReplicatedYetException"></a>
<h3 class="h4">Hadoop DFSClient Warns with a
  NotReplicatedYetException</h3>
<p>Sometimes, when you try to upload a file to HDFS immediately after
  allocating a HOD cluster, DFSClient warns with a NotReplicatedYetException. It
  usually shows a message like this: </p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">WARN
  hdfs.DFSClient: NotReplicatedYetException sleeping &lt;filename&gt; retries
  left 3</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">08/01/25 16:31:40 INFO hdfs.DFSClient:
  org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
  &lt;filename&gt; could only be replicated to 0 nodes, instead of
  1</span></td>
</tr>
</table>
<p> This scenario arises when you try to upload a file
  to HDFS while the DataNodes are still in the process of contacting the
  NameNode. It can be resolved by waiting for some time before uploading a new
  file to HDFS, so that enough DataNodes start and contact the
  NameNode. A hypothetical check is sketched below.</p>
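<p>As an illustration, assuming the allocated cluster's Hadoop configuration was generated in the cluster directory (the path is illustrative, and the grep pattern assumes the report format of this Hadoop version), you can check how many DataNodes have registered before uploading:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hadoop --config ~/hod-clusters/test dfsadmin -report | grep -i "datanodes available"</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hadoop --config ~/hod-clusters/test dfs -put input.txt input.txt</span></td>
</tr>
</table>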
<a name="N1076F"></a><a name="Hadoop+Jobs+Not+Running+on+a+Successfully+Allocated+Cluster"></a>
<h3 class="h4"> Hadoop Jobs Not Running on a Successfully Allocated Cluster </h3>
<a name="Hadoop_Jobs_Not_Running_on_a_Suc" id="Hadoop_Jobs_Not_Running_on_a_Suc"></a>
<p>This scenario generally occurs when a cluster is allocated and left inactive for some time, and Hadoop jobs are then attempted on it. The jobs fail with the following exception:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">08/01/25 16:31:40 INFO ipc.Client: Retrying connect to server: foo.bar.com/1.1.1.1:53567. Already tried 1 time(s).</span></td>
</tr>
</table>
<p>
<em>Possible Cause:</em> No Hadoop jobs were run for a significant period of time. The cluster would therefore have been deallocated, as described in the section <em>Auto-deallocation of Idle Clusters</em>. Deallocate the cluster and allocate it again, as shown below.</p>
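<p>For example (the cluster directory and node count are illustrative):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod deallocate -d ~/hod-clusters/test</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod allocate -d ~/hod-clusters/test -n 4</span></td>
</tr>
</table>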
<p>
<em>Possible Cause:</em> The wallclock limit specified by the Torque administrator, or with the <span class="codefrag">-l</span> option described in the section <em>Specifying Additional Job Attributes</em>, was exceeded since allocation time. The cluster would therefore have been released. Deallocate the cluster and allocate it again.</p>
<p>
<em>Possible Cause:</em> There is a version mismatch between the version of Hadoop used for provisioning (typically via the tarball option) and the external HDFS. Ensure compatible versions are being used.</p>
<p>
<em>Possible Cause:</em> There is a version mismatch between the version of the Hadoop client being used to submit jobs and the Hadoop used for provisioning (typically via the tarball option). Ensure compatible versions are being used.</p>
<p>
<em>Possible Cause:</em> You used one of the options for specifying Hadoop configuration, <span class="codefrag">-M</span> or <span class="codefrag">-H</span>, with special characters such as space or comma that were not escaped correctly. Refer to the section <em>Options Configuring HOD</em> for how to specify such options correctly. A hypothetical sketch follows.</p>
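<p>A sketch, assuming the backslash escaping described in <em>Options Configuring HOD</em>; the property names, values, and cluster directory are illustrative:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod allocate -d ~/hod-clusters/test -n 4 -M mapred.child.java.opts=-Xmx512m</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod allocate -d ~/hod-clusters/test -n 4 -H dfs.data.dir=/disk1/data\,/disk2/data</span></td>
</tr>
</table>
<p>In the second command the comma inside the value is escaped so that it is not treated as an option separator.</p>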
<a name="N107AA"></a><a name="My+Hadoop+Job+Got+Killed"></a>
<h3 class="h4"> My Hadoop Job Got Killed </h3>
<a name="My_Hadoop_Job_Got_Killed" id="My_Hadoop_Job_Got_Killed"></a>
<p>
<em>Possible Cause:</em> The wallclock limit specified by the Torque administrator, or with the <span class="codefrag">-l</span> option described in the section <em>Specifying Additional Job Attributes</em>, was exceeded since allocation time. The cluster would therefore have been released. Deallocate the cluster and allocate it again, this time with a larger wallclock time.</p>
<p>
<em>Possible Cause:</em> Problems with the JobTracker node. Refer to the section <em>Collecting and Viewing Hadoop Logs</em> for more information.</p>
<a name="N107C5"></a><a name="Hadoop+Job+Fails+with+Message%3A+%27Job+tracker+still+initializing%27"></a>
<h3 class="h4"> Hadoop Job Fails with Message: 'Job tracker still initializing' </h3>
<a name="Hadoop_Job_Fails_with_Message_Jo" id="Hadoop_Job_Fails_with_Message_Jo"></a>
<p>
<em>Possible Cause:</em> The Hadoop job was run as part of the HOD script command, and it started before the JobTracker could come up fully. Allocate the cluster using a larger value for the configuration option <span class="codefrag">--hod.script-wait-time</span>. A value of 120 should typically work, though such a large value is usually unnecessary.</p>
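<p>For example (the cluster directory, node count, and script name are illustrative):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod script -d ~/hod-clusters/test -n 4 -s my-hadoop-job.sh --hod.script-wait-time 120</span></td>
</tr>
</table>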
<a name="N107D5"></a><a name="The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque"></a>
<h3 class="h4"> The Exit Codes For HOD Are Not Getting Into Torque </h3>
<a name="The_Exit_Codes_For_HOD_Are_Not_G" id="The_Exit_Codes_For_HOD_Are_Not_G"></a>
<p>
<em>Possible Cause:</em> This functionality requires Hadoop version 0.16, and the version of Hadoop being used does not match. Use the required version of Hadoop.</p>
<p>
<em>Possible Cause:</em> The deallocation was done without using the <span class="codefrag">hod</span> command, for example by directly using <span class="codefrag">qdel</span>. When the cluster is deallocated in this manner, the HOD processes are terminated by signals, so the exit code is based on the signal number rather than the exit code of the program.</p>
<a name="N107ED"></a><a name="The+Hadoop+Logs+are+Not+Uploaded+to+HDFS"></a>
<h3 class="h4"> The Hadoop Logs are Not Uploaded to HDFS </h3>
<a name="The_Hadoop_Logs_are_Not_Uploaded" id="The_Hadoop_Logs_are_Not_Uploaded"></a>
<p>
<em>Possible Cause:</em> There is a version mismatch between the version of Hadoop being used for uploading the logs and the external HDFS. Ensure that the correct version is specified in the <span class="codefrag">hodring.pkgs</span> option, as in the fragment below.</p>
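<p>A hypothetical hodrc fragment; the installation path is illustrative and must point to a Hadoop installation whose version matches the external HDFS:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">[hodring]</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">pkgs = /usr/local/hadoop-0.20.1</span></td>
</tr>
</table>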
<a name="N107FD"></a><a name="Locating+Ringmaster+Logs"></a>
<h3 class="h4"> Locating Ringmaster Logs </h3>
<a name="Locating_Ringmaster_Logs" id="Locating_Ringmaster_Logs"></a>
<p>To locate the ringmaster logs, follow these steps (a command sketch appears after the list): </p>
<ul>
<li> Execute hod in debug mode using the -b option. This will print the Torque job id for the current run.</li>
<li> Execute <span class="codefrag">qstat -f torque_job_id</span> and look up the value of the <span class="codefrag">exec_host</span> parameter in the output. The first host in this list is the ringmaster node.</li>
<li> Log in to this node.</li>
<li> The ringmaster log location is specified by the <span class="codefrag">ringmaster.log-dir</span> option in the hodrc. The name of the log file will be <span class="codefrag">username.torque_job_id/ringmaster-main.log</span>.</li>
<li> If you don't get enough information, you may want to set the ringmaster debug level to 4. This can be done by passing <span class="codefrag">--ringmaster.debug 4</span> to the hod command line.</li>
</ul>
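<p>A sketch of these steps, assuming a bash-like shell; the Torque job id, hostname, and log directory are illustrative and depend on your site's hodrc:</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">hod allocate -d ~/hod-clusters/test -n 4 -b 4</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">qstat -f 12345 | grep exec_host</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">ssh ringmaster-node</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">less /var/log/hod/username.12345/ringmaster-main.log</span></td>
</tr>
</table>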
<a name="N10829"></a><a name="Locating+Hodring+Logs"></a>
<h3 class="h4"> Locating Hodring Logs </h3>
<a name="Locating_Hodring_Logs" id="Locating_Hodring_Logs"></a>
<p>To locate hodring logs, follow the steps below (a command sketch appears after the list): </p>
<ul>
<li> Execute hod in debug mode using the -b option. This will print the Torque job id for the current run.</li>
<li> Execute <span class="codefrag">qstat -f torque_job_id</span> and look up the value of the <span class="codefrag">exec_host</span> parameter in the output. All nodes in this list should have a hodring on them.</li>
<li> Log in to any of these nodes.</li>
<li> The hodring log location is specified by the <span class="codefrag">hodring.log-dir</span> option in the hodrc. The name of the log file will be <span class="codefrag">username.torque_job_id/hodring-main.log</span>.</li>
<li> If you don't get enough information, you may want to set the hodring debug level to 4. This can be done by passing <span class="codefrag">--hodring.debug 4</span> to the hod command line.</li>
</ul>
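<p>The same sketch as for the ringmaster logs applies, substituting any host from <span class="codefrag">exec_host</span> and the hodring log path (again, the job id and directory are illustrative):</p>
<table class="ForrestTable" cellspacing="1" cellpadding="4">
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">ssh any-exec-host-node</span></td>
</tr>
<tr>
<td colspan="1" rowspan="1"><span class="codefrag">grep -E "ERROR|CRITICAL" /var/log/hod/username.12345/hodring-main.log</span></td>
</tr>
</table>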
</div>

</div>
<!--+
    |end content
    +-->
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<!--+
    |start bottomstrip
    +-->
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
//  --></script>
</div>
<div class="copyright">
        Copyright &copy;
         2008 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
<!--+
    |end bottomstrip
    +-->
</div>
</body>
</html>