1### "Gridmix" Benchmark ###
2
3Contents:
4
50 Overview
61 Getting Started
7  1.0 Build
8  1.1 Configure
9  1.2 Generate test data
102 Running
11  2.0 General
12  2.1 Non-Hod cluster
13  2.2 Hod
14    2.2.0 Static cluster
15    2.2.1 Hod cluster
16
17
* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5 words, 100 words)
                 hadoop-env: FIXCOMPSEQ
     Compute1:   keep 10% map, 40% reduce
     Compute2:   keep 100% map, 77% reduce
                 Input from Compute1
     Compute3:   keep 116% map, 91% reduce
                 Input from Compute2
     Motivation: Many user workloads are implemented as pipelined map/reduce
                 jobs, including Pig workloads

2) Large sort of variable key/value size
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
                 hadoop-env: VARCOMPSEQ
     Compute:    keep 100% map, 100% reduce
     Motivation: Processing large, compressed datasets is common.

3) Reference select
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
                 hadoop-env: VARCOMPSEQ
     Compute:    keep 0.2% map, 5% reduce
                 1 Reducer
     Motivation: Sampling from a large reference dataset is common.

4) API text sort (java, streaming)
     Input:      500GB uncompressed Text
                 (k,v) = (1-10 words, 0-200 words)
                 hadoop-env: VARINFLTEXT
     Compute:    keep 100% map, 100% reduce
     Motivation: This benchmark should exercise each of the APIs to
                 map/reduce

5) Jobs with combiner (word count jobs)

A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs those jobs based on the XML
file and puts them under the control of a JobControl object. The JobControl
object then submits the jobs to the cluster and monitors their progress until
all jobs complete.


Note (jobs 1-3): Since the input data are compressed, each mapper outputs
many more bytes than it reads in, typically causing map output spills.



* 1 Getting Started

1.0 Build

In the src/benchmarks/gridmix2 dir, type "ant".
gridmix.jar will be created in the build subdir.
Copy gridmix.jar to the gridmix2 dir.

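For example, the whole build step amounts to something like this (a sketch;
the checkout path is illustrative):

        cd /path/to/hadoop/src/benchmarks/gridmix2
        ant                      # produces build/gridmix.jar
        cp build/gridmix.jar .   # place the jar in the gridmix2 dir
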
1.1 Configure environment variables

One must modify gridmix-env-2 to set the following variables:

HADOOP_HOME     The hadoop install location
HADOOP_VERSION  The exact hadoop version to be used, e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR The dir containing the hadoop-site.xml for the cluster to be used.
USE_REAL_DATA   If set to true, a large data set will be generated and used
                by the benchmark.

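A minimal sketch of these settings in gridmix-env-2 (all paths and the
version string are illustrative; substitute your own):

        export HADOOP_HOME=/usr/local/hadoop
        export HADOOP_VERSION=hadoop-0.20.1
        export HADOOP_CONF_DIR=${HADOOP_HOME}/conf  # must contain hadoop-site.xml
        export USE_REAL_DATA=true                   # full ~2.5TB dataset
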
1.2 Configure the job mixture

A default gridmix_config.xml file is provided.
One may make changes as necessary to the number of jobs of various types and
sizes. One can also change the number of reducers for each job, and specify
whether to compress the output data of a map/reduce job.
Note that one can specify multiple values in the numOfJobs and numOfReduces
fields, like:

<property>
  <name>javaSort.smallJobs.numOfJobs</name>
  <value>8,2</value>
  <description></description>
</property>

<property>
  <name>javaSort.smallJobs.numOfReduces</name>
  <value>15,70</value>
  <description></description>
</property>

The above spec means that we will have 8 small java sort jobs with 15
reducers each, and 2 small java sort jobs with 70 reducers each.

1.3 Generate test data

Test data is generated using the generateGridmix2Data.sh script.
        ./generateGridmix2Data.sh
One may modify the structure and size of the data generated here.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated,
block-compressed data is typical.

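For a trial run, one might shrink the generated data by editing those size
variables in generateGridmix2Data.sh before running it (a sketch; the values
are illustrative and assume the variables are plain byte counts, as their
names suggest):

        COMPRESSED_DATA_BYTES=$((10 * 1024 * 1024 * 1024))    # 10GB instead of 500GB
        UNCOMPRESSED_DATA_BYTES=$((40 * 1024 * 1024 * 1024))  # 40GB instead of 2TB
        INDIRECT_DATA_BYTES=$((5 * 1024 * 1024 * 1024))       # 5GB
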
* 2 Running

You need to set HADOOP_CONF_DIR to the directory containing the
hadoop-site.xml for the target cluster. Then type
        ./rungridmix_2
It will create start.out to record the start time and, at the end, end.out
to record the end time.

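Putting it together, a complete run looks roughly like this (the conf path
is illustrative):

        export HADOOP_CONF_DIR=/path/to/cluster/conf  # contains hadoop-site.xml
        ./rungridmix_2
        cat start.out end.out   # compare the two timestamps for elapsed time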