****************** FailMon Quick Start Guide ***********************

This document is a guide to quickly setting up and running FailMon.
For more information and details, please see the FailMon User Manual.

***** Building FailMon *****

Normally, FailMon lies under <hadoop-dir>/src/contrib/failmon, where
<hadoop-dir> is the Hadoop project root folder. To compile it, one
can either run ant for the whole Hadoop project, i.e.:

$ cd <hadoop-dir>
$ ant

or run ant only for FailMon:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant

Either of the above will compile FailMon and place all class files
under <hadoop-dir>/build/contrib/failmon/classes.

By invoking:

$ cd <hadoop-dir>/src/contrib/failmon
$ ant tar

FailMon is packaged as a standalone jar application in
<hadoop-dir>/src/contrib/failmon/failmon.tar.gz.


***** Deploying FailMon *****

There are two ways FailMon can be deployed in a cluster:

a) Within Hadoop, in which case the whole Hadoop package is uploaded
to the cluster nodes. In that case, nothing else needs to be done on
individual nodes.

b) Independently of the Hadoop deployment, i.e., by uploading
failmon.tar.gz to all nodes and uncompressing it. In that case, the
bin/failmon.sh script needs to be edited: the environment variable
HADOOPDIR should point to the root directory of the Hadoop
distribution. Also, the location of the Hadoop configuration files
should be indicated by the property 'hadoop.conf.path' in the file
conf/failmon.properties. Note that these files refer to the HDFS in
which we want to store the FailMon data (which can potentially be
different from the one on the cluster we are monitoring).
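For example, the two edits in option (b) might look as follows; the
paths shown are hypothetical examples, so substitute the actual
location of your Hadoop distribution:

```
# In bin/failmon.sh: point HADOOPDIR at the Hadoop root (example path).
HADOOPDIR=/opt/hadoop

# In conf/failmon.properties: point to the Hadoop configuration files
# of the HDFS that will store the FailMon data.
hadoop.conf.path = /opt/hadoop/conf
```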

We assume that, either way, FailMon is placed in the same directory on
all nodes, which is typical for most clusters. If this is not
feasible, one should create on all nodes of the cluster an identical
symbolic link that points to the FailMon directory of each node.
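As a minimal sketch of the symbolic-link approach, run on each node
(all paths here are hypothetical, using /tmp as a stand-in for the
real per-node directories):

```shell
# Stand-in for the node-local directory where FailMon actually resides
# on this particular node.
mkdir -p /tmp/data/local/failmon

# Create the uniform path that scripts on every node will use;
# -sfn replaces any stale link from a previous run.
ln -sfn /tmp/data/local/failmon /tmp/failmon

# All nodes can now refer to the same path.
readlink /tmp/failmon
```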

One should also edit the conf/failmon.properties file on each node to
set its own property values. However, the default values are expected
to serve most practical cases. Refer to the FailMon User Manual for
details about the various properties and configuration parameters.


***** Running FailMon *****

In order to run FailMon using a node to do the ad-hoc scheduling of
monitoring jobs, one needs to edit the hosts.list file to specify the
list of machine hostnames on which FailMon is to be run. Also, in the
file conf/global.config, the username used to connect to the machines
has to be specified in the property 'ssh.username' (passwordless SSH
is assumed). The path to the FailMon folder has to be specified as
well, in the property 'failmon.dir' (it is assumed to be the same on
all machines in the cluster). Then one only needs to invoke:

$ cd <hadoop-dir>
$ bin/scheduler.py

to start the system.
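Put together, the two files might look like this; the hostnames and
values below are illustrative examples only:

```
# hosts.list: one hostname per line (example names).
node01.example.com
node02.example.com

# conf/global.config (property names from this guide; values are examples):
ssh.username = hadoop
failmon.dir = /opt/failmon
```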


***** Merging HDFS files *****

To merge the files created on HDFS by FailMon, the following command
can be used:

$ cd <hadoop-dir>
$ bin/failmon.sh --mergeFiles

This will concatenate all files in the HDFS folder (pointed to by the
'hdfs.upload.dir' property in the conf/failmon.properties file) into a
single file, which will be placed in the same folder. The location of
the Hadoop configuration files should also be indicated by the
property 'hadoop.conf.path' in the file conf/failmon.properties. Note
that these files refer to the HDFS in which we have stored the FailMon
data (which can potentially be different from the one on the cluster
we are monitoring). Also, the scheduler.py script can be set up to
merge the HDFS files when their number surpasses a configurable limit
(see the conf/global.config file).
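For instance, the upload directory could be configured as follows;
the HDFS path is a hypothetical example:

```
# conf/failmon.properties: HDFS directory where FailMon uploads its
# data files (example path; the merged file is placed here as well).
hdfs.upload.dir = /failmon
```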

Please refer to the FailMon User Manual for more details.