Getting Started With Hadoop On Demand (HOD)
===========================================

1. Pre-requisites:
==================

Hardware:
HOD requires a minimum of 3 nodes configured through a resource manager.

Software:
The following components are assumed to be installed before using HOD:
* Torque:
  (http://www.clusterresources.com/pages/products/torque-resource-manager.php)
  Currently HOD supports Torque out of the box. We assume that you are
  familiar with configuring Torque. You can get more information from
  the following link:
  http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki
* Python (http://www.python.org/)
  We require version 2.5.1 of Python.

The following components can optionally be installed to get better
functionality from HOD:
* Twisted Python: This can be used to improve the scalability of HOD
  (http://twistedmatrix.com/trac/)
* Hadoop: HOD can automatically distribute Hadoop to all nodes in the
  cluster. However, it can also use a pre-installed version of Hadoop,
  if it is available on all nodes in the cluster.
  (http://hadoop.apache.org/core)
  HOD currently supports Hadoop 0.15 and above.

NOTE: HOD configuration requires these components to be installed at the
same location on all nodes in the cluster. Configuration is also simpler
if they are installed at the same location on the submit nodes.

2. Resource Manager Configuration Pre-requisites:
=================================================

For using HOD with Torque:
* Install Torque components: pbs_server on a head node, pbs_moms on all
  compute nodes, and PBS client tools on all compute nodes and submit
  nodes.
* Create a queue for submitting jobs on the pbs_server.
* Specify a name for all nodes in the cluster by setting a 'node
  property' on each of them.
  This can be done by using the 'qmgr' command. For example:
     qmgr -c "set node <node name> properties=cluster-name"
* Ensure that jobs can be submitted to the nodes. This can be done by
  using the 'qsub' command. For example:
     echo "sleep 30" | qsub -l nodes=3
* More information about setting up Torque can be found in the
  documentation under:
  http://www.clusterresources.com/pages/products/torque-resource-manager.php

3. Setting up HOD:
==================

* HOD is available under the 'contrib' section of Hadoop, in the
  directory 'hod'.
* Distribute the files under this directory to all the nodes in the
  cluster. Note that the location where the files are copied must be
  the same on all the nodes.
* On the node from which you want to run hod, edit the file hodrc,
  which can be found in the <install dir>/conf directory. This file
  contains the minimal set of values required for running hod.
* Specify values suitable to your environment for the following
  variables defined in the configuration file. Note that some of these
  variables are defined in more than one place in the file.

  * ${JAVA_HOME}: Location of Java for Hadoop. Hadoop supports Sun JDK
    1.5.x.
  * ${CLUSTER_NAME}: Name of the cluster, as specified in the 'node
    property' mentioned in the resource manager configuration.
  * ${HADOOP_HOME}: Location of the Hadoop installation on the compute
    and submit nodes.
  * ${RM_QUEUE}: Queue configured for submitting jobs in the resource
    manager configuration.
  * ${RM_HOME}: Location of the resource manager installation on the
    compute and submit nodes.

* The following environment variables *may* need to be set depending on
  your environment. These variables must be defined where you run the
  HOD client, and also be specified in the HOD configuration file as the
  value of the key resource_manager.env-vars. Multiple variables can be
  specified as a comma-separated list of key=value pairs.

  * HOD_PYTHON_HOME: If you install Python to a non-default location
    on the compute nodes or submit nodes, then this variable must be
    defined to point to the Python executable in the non-standard
    location.
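For instance, if Python 2.5.1 were built under a prefix such as
/opt/python-2.5.1 (a hypothetical path used here purely for illustration),
the variable could be set as follows; the same name must then appear in the
resource_manager.env-vars key mentioned above:

```shell
# Hypothetical install location; adjust to wherever Python 2.5.1 actually
# lives on your compute and submit nodes:
export HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python

# The same variable must also be listed in hodrc, e.g.:
#   resource_manager.env-vars = HOD_PYTHON_HOME=/opt/python-2.5.1/bin/python
echo "$HOD_PYTHON_HOME"
```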

NOTE:

You can also review the other configuration options in the file and
modify them to suit your needs. Refer to the file config.txt for
information about the HOD configuration.


4. Running HOD:
===============

4.1 Overview:
-------------

A typical HOD session involves at least three steps: allocate,
run Hadoop jobs, deallocate.

4.1.1 Operation allocate
------------------------

The allocate operation is used to allocate a set of nodes and install and
provision Hadoop on them. It has the following syntax:

  hod -c config_file -t hadoop_tarball_location -o "allocate \
                                        cluster_dir number_of_nodes"

The hadoop_tarball_location must be a location on a shared file system
accessible from all nodes in the cluster. Note that the cluster_dir must
exist before running the command. If the command completes successfully,
cluster_dir/hadoop-site.xml will be generated and will contain information
about the allocated cluster's JobTracker and NameNode.

For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes,
storing the generated Hadoop configuration in a directory named
~/hadoop-cluster:

  $ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o "allocate \
                                                    ~/hadoop-cluster 10"

HOD also supports an environment variable called HOD_CONF_DIR. If this is
defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc.
Defining this allows the above command to also be run as follows:

  $ export HOD_CONF_DIR=~/hod-config
  $ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"

4.1.2 Running Hadoop jobs using the allocated cluster
-----------------------------------------------------

Now, one can run Hadoop jobs using the allocated cluster in the usual manner:

  hadoop --config cluster_dir hadoop_command hadoop_command_args

Continuing our example, the following command will run a wordcount example on
the allocated cluster:

  $ hadoop --config ~/hadoop-cluster jar \
      /path/to/hadoop/hadoop-examples.jar wordcount /path/to/input \
      /path/to/output
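To see which JobTracker and NameNode a cluster was given, you can inspect the
generated cluster_dir/hadoop-site.xml directly. A minimal sketch follows; the
file below is a hand-made stand-in with hypothetical hostnames and ports,
since the real one is produced by the allocate operation:

```shell
# Stand-in for the hadoop-site.xml that "hod ... allocate" generates;
# the hostname and ports here are hypothetical.
CLUSTER_DIR="$(mktemp -d)"
cat > "$CLUSTER_DIR/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node001.example.com:54311</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>node001.example.com:54310</value>
  </property>
</configuration>
EOF
# Show the JobTracker address recorded for the cluster:
grep -A 1 'mapred.job.tracker' "$CLUSTER_DIR/hadoop-site.xml"
```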

4.1.3 Operation deallocate
--------------------------

The deallocate operation is used to release an allocated cluster. When
finished with a cluster, deallocate must be run so that the nodes become free
for others to use. The deallocate operation has the following syntax:

  hod -o "deallocate cluster_dir"

Continuing our example, the following command will deallocate the cluster:

  $ hod -o "deallocate ~/hadoop-cluster"

4.2 Command Line Options
------------------------

This section covers the major command line options available via the hod
command:

--help
Prints the help message showing the basic options.

--verbose-help
All configuration options provided in the hodrc file can also be passed on
the command line, using the syntax --section_name.option_name[=value]. When
provided this way, the value on the command line overrides the option in
hodrc. The verbose-help command lists all the available options in the hodrc
file. This is also a convenient way to see the meaning of the configuration
options.

-c config_file
Provides the configuration file to use. Can be used with all other options of
HOD. Alternatively, the HOD_CONF_DIR environment variable can be defined to
specify a directory that contains a file named hodrc, removing the need to
specify the configuration file in each HOD command.

-b 1|2|3|4
Enables the given debug level. Can be used with all other options of HOD.
4 is the most verbose.

-o "help"
Lists the operations available in the operation mode.

-o "allocate cluster_dir number_of_nodes"
Allocates a cluster on the given number of cluster nodes, and stores the
allocation information in cluster_dir for use with subsequent hadoop commands.
Note that the cluster_dir must exist before running the command.

-o "list"
Lists the clusters allocated by this user. The information provided includes
the Torque job id corresponding to the cluster, the cluster directory where
the allocation information is stored, and whether the Map/Reduce daemon is
still active or not.

-o "info cluster_dir"
Lists information about the cluster whose allocation information is stored in
the specified cluster directory.

-o "deallocate cluster_dir"
Deallocates the cluster whose allocation information is stored in the
specified cluster directory.

-t hadoop_tarball
Provisions Hadoop from the given tar.gz file. This option is only applicable
to the allocate operation. For better distribution performance it is
recommended that the Hadoop tarball contain only the libraries and binaries,
and not the source or documentation.
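One way to produce such a slimmed tarball is to repackage an unpacked release
while excluding its src and docs trees. The sketch below demonstrates the
idea on a dummy directory layout; the release name hadoop-0.15.0 and all
paths are placeholders, so substitute your actual unpacked Hadoop directory:

```shell
# Build a slim tarball by excluding source and documentation.
# A dummy hadoop-0.15.0 layout stands in for a real unpacked release.
TMP="$(mktemp -d)"
mkdir -p "$TMP"/hadoop-0.15.0/lib "$TMP"/hadoop-0.15.0/bin \
         "$TMP"/hadoop-0.15.0/src "$TMP"/hadoop-0.15.0/docs
tar -C "$TMP" \
    --exclude 'hadoop-0.15.0/src' \
    --exclude 'hadoop-0.15.0/docs' \
    -czf "$TMP/hadoop.tar.gz" hadoop-0.15.0
# List the archive contents: lib/ and bin/ are kept, src/ and docs/ are not.
tar -tzf "$TMP/hadoop.tar.gz"
```

The resulting $TMP/hadoop.tar.gz is the kind of file you would pass to the
-t option.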

-Mkey1=value1 -Mkey2=value2
Provides configuration parameters for the provisioned Map/Reduce daemons
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these
values on the cluster nodes.

-Hkey1=value1 -Hkey2=value2
Provides configuration parameters for the provisioned HDFS daemons (NameNode
and DataNodes). A hadoop-site.xml is generated with these values on the
cluster nodes.

-Ckey1=value1 -Ckey2=value2
Provides configuration parameters for the client from which jobs can be
submitted. A hadoop-site.xml is generated with these values on the submit
node.
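Putting the -t, -M, -H and -C options together, a combined allocate
invocation might look like the following. Since actually running it needs a
live Torque/HOD setup, this sketch only assembles and prints the command; the
keys and values shown (mapred.reduce.tasks, dfs.replication, io.sort.mb) are
illustrative Hadoop settings, not recommendations:

```shell
# Sketch only: we echo the command rather than execute it, since it
# requires a configured HOD/Torque cluster. Keys/values are illustrative.
echo 'hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz' \
     '-Mmapred.reduce.tasks=2 -Hdfs.replication=2 -Cio.sort.mb=100' \
     '-o "allocate ~/hadoop-cluster 10"'
```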