### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure environment variables
  1.2 Configure the job mixture
  1.3 Generate test data
2 Running
  2.0 General
  2.1 Non-Hod cluster
  2.2 Hod
    2.2.0 Static cluster
    2.2.1 Hod cluster

* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
   hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads

2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
   hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.

3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
   hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.

4) API text sort (java, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
   hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce

5) Jobs with combiner (word count jobs)

A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs these jobs based on the XML
file and puts them under the control of a JobControl object, which submits
the jobs to the cluster and monitors their progress until all jobs complete.

Notes (1-3): Since the input data are compressed, each mapper outputs many
more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

In the src/benchmarks/gridmix directory, run "ant". gridmix.jar will be
created in the build subdirectory; copy gridmix.jar to the gridmix
directory.
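The build steps can be run as a short command sequence. This sketch assumes
you start at the root of a Hadoop source tree and have ant on your PATH;
the paths are taken from this README.

```shell
# Run from the root of the Hadoop source tree (assumed layout).
cd src/benchmarks/gridmix
ant                      # builds gridmix.jar into the build subdirectory
cp build/gridmix.jar .   # place the jar next to the run scripts
```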

1.1 Configure environment variables

One must modify gridmix-env-2 to set the following variables:

HADOOP_HOME       The Hadoop install location
HADOOP_VERSION    The exact Hadoop version to be used, e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR   The directory containing the hadoop-site.xml for the
                  cluster to be used
USE_REAL_DATA     If set to true, a large data-set will be created and used
                  by the benchmark

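A gridmix-env-2 fragment setting these variables might look like the
following. All of the path values here are hypothetical placeholders;
substitute the locations for your own installation.

```shell
# Hypothetical values -- substitute the paths for your cluster.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_VERSION=hadoop-0.18.2-dev       # version string from this README
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf    # must contain hadoop-site.xml
export USE_REAL_DATA=true                     # generate and use the full dataset
```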
1.2 Configure the job mixture

A default gridmix_config.xml file is provided. One may make appropriate
changes as necessary to the number of jobs of various types and sizes. One
can also change the number of reducers of each job, and specify whether to
compress the output data of a map/reduce job. Note that one can specify
multiple comma-separated values in the numOfJobs field and numOfReduces
field, like:

  <property>
    <name>javaSort.smallJobs.numOfJobs</name>
    <value>8,2</value>
    <description></description>
  </property>

  <property>
    <name>javaSort.smallJobs.numOfReduces</name>
    <value>15,70</value>
    <description></description>
  </property>

The above spec means that we will have 8 small java sort jobs with 15
reducers and 2 small java sort jobs with 70 reducers.

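The positional pairing of the two comma-separated lists can be sketched in
shell. `pair_mix` is a hypothetical helper for illustration only; the real
pairing happens inside the gridmix Java driver.

```shell
# Pair the i-th numOfJobs value with the i-th numOfReduces value.
pair_mix() {
  jobs=$1
  reduces=$2
  i=1
  for n in $(echo "$jobs" | tr ',' ' '); do
    r=$(echo "$reduces" | cut -d',' -f"$i")   # i-th reducer count
    echo "$n jobs with $r reducers each"
    i=$((i + 1))
  done
}

pair_mix "8,2" "15,70"
# prints:
#   8 jobs with 15 reducers each
#   2 jobs with 70 reducers each
```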
1.3 Generate test data

Test data is generated using the generateGridmix2Data.sh script:

  ./generateGridmix2Data.sh

One may modify the structure and size of the data generated here.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.

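For example, the size variables inside generateGridmix2Data.sh could be
overridden as below. The 500GB and 200GB figures are hypothetical values
chosen for illustration, not the script's actual defaults.

```shell
GB=$((1024 * 1024 * 1024))             # bytes per gigabyte
COMPRESSED_DATA_BYTES=$((500 * GB))    # compressed SequenceFile input
UNCOMPRESSED_DATA_BYTES=$((500 * GB))  # plain-text input
INDIRECT_DATA_BYTES=$((200 * GB))      # indirect input
```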
* 2 Running

You need to set HADOOP_CONF_DIR to the directory where hadoop-site.xml
exists. Then you just need to type:

  ./rungridmix_2

It will create start.out to record the start time, and at the end, it will
create end.out to record the end time.
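The timestamp bookkeeping performed by rungridmix_2 amounts to the sketch
below; the job-submission step itself is elided, and only the start.out and
end.out file names come from this README.

```shell
date > start.out    # record the start time before submitting jobs
# ... submit the gridmix jobs here (elided) ...
date > end.out      # record the end time after all jobs complete
```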