### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure environment variables
  1.2 Configure the job mixture
  1.3 Generate test data
2 Running

* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
               hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads

2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.

3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.

4) API text sort (java, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
               hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce

5) Jobs with combiner (word count jobs)

A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs these jobs based on the
XML file and puts them under the control of a JobControl object, which
submits the jobs to the cluster and monitors their progress until all
jobs complete.

Note (jobs 1-3): Since the input data are compressed, each mapper outputs
many more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

In the src/benchmarks/gridmix directory, run "ant". gridmix.jar will be
created in the build subdirectory. Copy gridmix.jar to the gridmix
directory.

1.1 Configure environment variables

One must modify gridmix-env-2 to set the following variables:

HADOOP_HOME      The hadoop install location
HADOOP_VERSION   The exact hadoop version to be used, e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR  The directory containing the hadoop-site.xml for the
                 cluster to be used
USE_REAL_DATA    If set to true, a large data-set will be created and used
                 by the benchmark

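For reference, a gridmix-env-2 fragment might look like the following. All
values here are placeholders, not defaults shipped with the benchmark;
substitute the ones for your cluster.

```shell
# Example settings only -- every path and version below is hypothetical.
export HADOOP_HOME=/opt/hadoop
export HADOOP_VERSION=hadoop-0.18.2-dev
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export USE_REAL_DATA=true
```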
1.2 Configure the job mixture

A default gridmix_config.xml file is provided. One may adjust the number
of jobs of the various types and sizes as necessary, change the number of
reducers for each job, and specify whether to compress the output data of
a map/reduce job. Note that one can specify multiple values in the
numOfJobs and numOfReduces fields, like:
<property>
  <name>javaSort.smallJobs.numOfJobs</name>
  <value>8,2</value>
  <description></description>
</property>

<property>
  <name>javaSort.smallJobs.numOfReduces</name>
  <value>15,70</value>
  <description></description>
</property>

The above spec means that there will be 8 small java sort jobs with 15
reducers each and 2 small java sort jobs with 70 reducers each.

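To illustrate how the paired comma-separated values line up (the real
parsing is done by the gridmix Java driver; this is only a sketch of the
pairing rule described above):

```shell
# The i-th entry in numOfJobs pairs with the i-th entry in numOfReduces.
numOfJobs="8,2"
numOfReduces="15,70"
IFS=',' read -ra jobs    <<< "$numOfJobs"
IFS=',' read -ra reduces <<< "$numOfReduces"
for i in "${!jobs[@]}"; do
  echo "${jobs[$i]} job(s) with ${reduces[$i]} reducer(s) each"
done
# prints:
#   8 job(s) with 15 reducer(s) each
#   2 job(s) with 70 reducer(s) each
```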
1.3 Generate test data

Test data is generated using the generateGridmix2Data.sh script:

  ./generateGridmix2Data.sh

One may modify the structure and size of the data generated by the script.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the
size of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated,
block-compressed data is typical.

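As a quick sanity check of the sizing arithmetic, the job descriptions
above pair 500GB of compressed input with 2TB uncompressed, which is where
the 4x ratio comes from. The variable names below match those in
generateGridmix2Data.sh, but the values are taken from the job
descriptions, not necessarily the script's defaults:

```shell
# Illustrative sizes from the job descriptions (500GB compressed, 2TB raw).
COMPRESSED_DATA_BYTES=$((500 * 1024 * 1024 * 1024))
UNCOMPRESSED_DATA_BYTES=$((2 * 1024 * 1024 * 1024 * 1024))
echo "approx. compression ratio: $((UNCOMPRESSED_DATA_BYTES / COMPRESSED_DATA_BYTES))x"
# prints: approx. compression ratio: 4x
```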
* 2 Running

Set HADOOP_CONF_DIR to the directory containing the hadoop-site.xml for
the cluster to be used, then run:

  ./rungridmix_2

The script creates start.out to record the start time and, when the
benchmark finishes, end.out to record the end time.
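The two marker files can be used to compute the benchmark's wall-clock
time. A minimal sketch, assuming start.out and end.out each contain a
single `date`-style timestamp (check your copy of rungridmix_2 for the
exact format) and that GNU date is available for the -d option:

```shell
# Compute elapsed seconds between two timestamp files.
# Assumption: each file holds one `date`-style timestamp; requires GNU date.
elapsed_seconds() {
  local start end
  start=$(date -d "$(cat "$1")" +%s)
  end=$(date -d "$(cat "$2")" +%s)
  echo $((end - start))
}
```

For example, `elapsed_seconds start.out end.out` prints the run time in
seconds.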