1### "Gridmix" Benchmark ###
2
3Contents:
4
50 Overview
61 Getting Started
7  1.0 Build
8  1.1 Configure
9  1.2 Generate test data
102 Running
11  2.0 General
12  2.1 Non-Hod cluster
13  2.2 Hod
14    2.2.0 Static cluster
15    2.2.1 Hod cluster
16
17
* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5 words, 100 words)
                 hadoop-env: FIXCOMPSEQ
     Compute1:   keep 10% map, 40% reduce
     Compute2:   keep 100% map, 77% reduce
                 Input from Compute1
     Compute3:   keep 116% map, 91% reduce
                 Input from Compute2
     Motivation: Many user workloads are implemented as pipelined map/reduce
                 jobs, including Pig workloads

2) Large sort of variable key/value size
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
                 hadoop-env: VARCOMPSEQ
     Compute:    keep 100% map, 100% reduce
     Motivation: Processing large, compressed datasets is common.

3) Reference select
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
                 hadoop-env: VARCOMPSEQ
     Compute:    keep 0.2% map, 5% reduce
                 1 Reducer
     Motivation: Sampling from a large, reference dataset is common.

4) Indirect Read
     Input:      500GB compressed (2TB uncompressed) Text
                 (k,v) = (5 words, 20 words)
                 hadoop-env: FIXCOMPTEXT
     Compute:    keep 50% map, 100% reduce. Each map reads 1 input file,
                 adding additional input files from the output of the
                 previous iteration for 10 iterations
     Motivation: User jobs in the wild will often take input data without
                 consulting the framework. This simulates an iterative job
                 whose input data is all "indirect," i.e. given to the
                 framework sans locality metadata.

5) API text sort (java, pipes, streaming)
     Input:      500GB uncompressed Text
                 (k,v) = (1-10 words, 0-200 words)
                 hadoop-env: VARINFLTEXT
     Compute:    keep 100% map, 100% reduce
     Motivation: This benchmark should exercise each of the APIs to
                 map/reduce

Each of these jobs may be run individually or, using the scripts provided,
as a simulation of user activity sized to run in approximately 4 hours on a
480-500 node cluster using Hadoop 0.15.0. The benchmark runs a mix of small,
medium, and large jobs simultaneously, submitting each at fixed intervals.

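For example, a single job in the style of the reference select above can be
submitted by hand through the loadgen driver (GenericMRLoadGenerator). This
is a hedged sketch, not one of the shipped scripts: which jar carries
loadgen depends on the build, and the output path is a placeholder.

  > $HADOOP_HOME/bin/hadoop jar $APP_JAR loadgen \
        -keepmap 0.2 -keepred 5 \
        -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
        -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
        -outKey org.apache.hadoop.io.Text \
        -outValue org.apache.hadoop.io.Text \
        -indir $VARCOMPSEQ -outdir $GRID_MIX_DATA/ref-select-out
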
Note (jobs 1-4): since the input data are compressed, each mapper outputs
many more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

1) Compile the examples, including the C++ sources:
  > ant -Dcompile.c++=yes examples
2) Copy the pipes-sort example to a location in the default filesystem
   (usually HDFS, default /gridmix/programs):
  > $HADOOP_HOME/bin/hadoop dfs -mkdir $GRID_MIX_PROG
  > $HADOOP_HOME/bin/hadoop dfs -put build/c++-examples/$PLATFORM_STR/bin/pipes-sort $GRID_MIX_PROG
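
$PLATFORM_STR above names the platform subdirectory created by the C++
build. If your environment does not set it, export a value by hand; the one
below is only a typical example (check build/c++-examples for the directory
actually produced):

  > export PLATFORM_STR=Linux-amd64-64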

1.1 Configure

One must modify hadoop-env to supply the following information:

HADOOP_HOME     The hadoop install location
GRID_MIX_HOME   The location of these scripts
APP_JAR         The location of the hadoop example jar
GRID_MIX_DATA   The location of the datasets for these benchmarks
GRID_MIX_PROG   The location of the pipes-sort example

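A minimal sketch of such an edit, with every path an illustrative
placeholder rather than a shipped default:

  export HADOOP_HOME=/usr/local/hadoop-0.20.1
  export GRID_MIX_HOME=$HADOOP_HOME/src/benchmarks/gridmix
  export APP_JAR=$HADOOP_HOME/hadoop-0.20.1-examples.jar
  export GRID_MIX_DATA=/gridmix/data
  export GRID_MIX_PROG=/gridmix/programs
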
Reasonable defaults are provided for all but HADOOP_HOME. The datasets used
by each of the respective benchmarks are recorded in the Input::hadoop-env
comment in section 0 and their location may be changed in hadoop-env. Note
that each job expects particular input data and the parameters given to it
must be changed in each script if a different InputFormat, key type, or
value type is desired.

Note that the NUM_OF_REDUCERS_FOR_*_JOB properties should be sized to the
cluster on which the benchmarks will be run. The default assumes a large
(450-500 node) cluster.

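On a smaller cluster, scale them down before submitting anything. The names
below follow the NUM_OF_REDUCERS_FOR_*_JOB pattern with assumed per-class
suffixes, and the values are purely illustrative:

  export NUM_OF_REDUCERS_FOR_SMALL_JOB=10
  export NUM_OF_REDUCERS_FOR_MEDIUM_JOB=20
  export NUM_OF_REDUCERS_FOR_LARGE_JOB=40
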
1.2 Generate test data

Test data is generated using the generateData.sh script. While one may
modify the structure and size of the data generated here, note that many of
the scripts, particularly for medium and small sized jobs, rely not only on
specific InputFormats and key/value types, but also on a particular
structure to the input data. Changing these values will likely be necessary
to run on small and medium-sized clusters, but any modifications must be
informed by an explicit familiarity with the underlying scripts.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.

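With hadoop-env configured, generation is a single invocation. To fit a
smaller cluster, override the sizes in hadoop-env first; the values below
are illustrative only:

  COMPRESSED_DATA_BYTES=$((50 * 1024 * 1024 * 1024))      # 50GB, illustrative
  UNCOMPRESSED_DATA_BYTES=$((50 * 1024 * 1024 * 1024))    # 50GB, illustrative
  INDIRECT_DATA_BYTES=$((5 * 1024 * 1024 * 1024))         # 5GB, illustrative

  > ./generateData.sh
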
* 2 Running

2.0 General

The submissionScripts directory contains the high-level scripts submitting
sized jobs for the gridmix benchmark. Each submits $NUM_OF_*_JOBS_PER_CLASS
instances as specified in the gridmix-env script, where an instance is an
invocation of a script as in $JOBTYPE/$JOBTYPE.$CLASS (e.g.
javasort/text-sort.large). Each instance may submit one or more map/reduce
jobs.

There is a backoff script, submissionScripts/sleep_if_too_busy, that can be
modified to define throttling criteria. By default, it simply counts running
java processes.

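A sketch of such a throttle, in the spirit of the shipped script rather
than a copy of it; the process-count threshold is illustrative:

  # Block until fewer than 10 client-side java processes are running.
  while [ "$(ps -ef | grep -c '[j]ava')" -ge 10 ]; do
      sleep 30
  done
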
2.1 Non-Hod cluster

The submissionScripts/allToSameCluster script will invoke each of the other
submission scripts for the gridmix benchmark. Depending on how your cluster
manages job submission, these scripts may require modification. The details
are very context-dependent.

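On a cluster where direct submission works, running the full mix is then:

  > ./submissionScripts/allToSameCluster
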
2.2 Hod

Note that there are options in hadoop-env that control jobs submitted
through Hod. One may specify the location of a config (HOD_CONFIG), the
number of nodes to allocate for classes of jobs, and any additional options
one wants to apply. The default includes an example for supplying a Hadoop
tarball for testing platform changes (see Hod documentation).

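A sketch of such settings; HOD_CONFIG is named above, but the option-string
variable and the tarball flag are assumptions to check against hadoop-env
and the Hod documentation:

  HOD_CONFIG=/etc/hod/hodrc                        # path illustrative
  HOD_OPTIONS="-t /path/to/hadoop-0.20.1.tar.gz"   # hypothetical: test a tarball
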
2.2.0 Static cluster

> hod --hod.script=submissionScripts/allToSameCluster -m 500

2.2.1 Hod-allocated cluster

> ./submissionScripts/allThroughHod