### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure
  1.2 Generate test data
2 Running
  2.0 General
  2.1 Non-Hod cluster
  2.2 Hod
    2.2.0 Static cluster
    2.2.1 Hod-allocated cluster


* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs (an invocation sketch follows the list):

1) Three stage map/reduce job
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5 words, 100 words)
     hadoop-env: FIXCOMPSEQ
     Compute1:   keep 10% map, 40% reduce
     Compute2:   keep 100% map, 77% reduce
                 Input from Compute1
     Compute3:   keep 116% map, 91% reduce
                 Input from Compute2
     Motivation: Many user workloads are implemented as pipelined map/reduce
                 jobs, including Pig workloads

2) Large sort of variable key/value size
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
     hadoop-env: VARCOMPSEQ
     Compute:    keep 100% map, 100% reduce
     Motivation: Processing large, compressed datasets is common.

3) Reference select
     Input:      500GB compressed (2TB uncompressed) SequenceFile
                 (k,v) = (5-10 words, 100-10000 words)
     hadoop-env: VARCOMPSEQ
     Compute:    keep 0.2% map, 5% reduce
                 1 Reducer
     Motivation: Sampling from a large reference dataset is common.

4) Indirect Read
     Input:      500GB compressed (2TB uncompressed) Text
                 (k,v) = (5 words, 20 words)
     hadoop-env: FIXCOMPTEXT
     Compute:    keep 50% map, 100% reduce
                 Each map reads 1 input file, adding additional input files
                 from the output of the previous iteration for 10 iterations
     Motivation: User jobs in the wild will often take input data without
                 consulting the framework. This simulates an iterative job
                 whose input data is all "indirect," i.e. given to the
                 framework sans locality metadata.

5) API text sort (java, pipes, streaming)
     Input:      500GB uncompressed Text
                 (k,v) = (1-10 words, 0-200 words)
     hadoop-env: VARINFLTEXT
     Compute:    keep 100% map, 100% reduce
     Motivation: This benchmark should exercise each of the APIs to
                 map/reduce

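The "keep N% map, M% reduce" stages above are synthetic load jobs: each map
and reduce emits the stated percentage of the records it consumes (possibly
more than 100%). As a rough sketch, a single such stage could be driven by
the loadgen tool shipped with Hadoop; the flags and paths below are
illustrative (modeled on the reference-select job), and the per-job scripts
in this package remain authoritative:

  > ${HADOOP_HOME}/bin/hadoop jar ${APP_JAR} loadgen \
      -keepmap 0.2 -keepred 5 \
      -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
      -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
      -outKey org.apache.hadoop.io.Text \
      -outValue org.apache.hadoop.io.Text \
      -indir ${VARCOMPSEQ} -outdir ${GRID_MIX_DATA}/refselect.out -r 1
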
Each of these jobs may be run individually or, using the scripts provided,
as a simulation of user activity sized to run in approximately 4 hours on a
480-500 node cluster using Hadoop 0.15.0. The benchmark runs a mix of small,
medium, and large jobs simultaneously, submitting each at fixed intervals.

Notes(1-4): Because the input data are compressed, each map task writes far
more bytes than it reads, typically causing map output spills.


* 1 Getting Started

1.0 Build

1) Compile the examples, including the C++ sources:
   > ant -Dcompile.c++=yes examples
2) Copy the pipes-sort example to a location in the default filesystem
   (usually HDFS, default /gridmix/programs):
   > $HADOOP_HOME/bin/hadoop dfs -mkdir $GRID_MIX_PROG
   > $HADOOP_HOME/bin/hadoop dfs -put build/c++-examples/$PLATFORM_STR/bin/pipes-sort $GRID_MIX_PROG
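
$PLATFORM_STR names the platform directory created by the C++ build,
conventionally of the form <os name>-<arch>-<data model>. The value below is
only an example; use whatever directory actually appears under
build/c++-examples:

  > export PLATFORM_STR=Linux-amd64-64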

1.1 Configure

One must modify hadoop-env to supply the following information:

HADOOP_HOME     The hadoop install location
GRID_MIX_HOME   The location of these scripts
APP_JAR         The location of the hadoop example jar
GRID_MIX_DATA   The location of the datasets for these benchmarks
GRID_MIX_PROG   The location of the pipes-sort example

Reasonable defaults are provided for all but HADOOP_HOME. The datasets used
by each of the respective benchmarks are recorded in the Input::hadoop-env
comment in section 0, and their location may be changed in hadoop-env. Note
that each job expects particular input data, and the parameters given to it
must be changed in each script if a different InputFormat, key type, or
value type is desired.

Note that the NUM_OF_REDUCERS_FOR_*_JOB properties should be sized to the
cluster on which the benchmarks will be run. The default assumes a large
(450-500 node) cluster.
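
A configured hadoop-env might contain entries along the following lines.
All values are illustrative, and the NUM_OF_REDUCERS_FOR_*_JOB names are
spelled out here as hypothetical small/medium/large variants:

  export HADOOP_HOME=/usr/local/hadoop-0.15.0
  export GRID_MIX_HOME=${HADOOP_HOME}/src/benchmarks/gridmix
  export APP_JAR=${HADOOP_HOME}/hadoop-0.15.0-examples.jar
  export GRID_MIX_DATA=/gridmix/data
  export GRID_MIX_PROG=/gridmix/programs
  export NUM_OF_REDUCERS_FOR_SMALL_JOB=15     # size these to your cluster
  export NUM_OF_REDUCERS_FOR_MEDIUM_JOB=170
  export NUM_OF_REDUCERS_FOR_LARGE_JOB=370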

1.2 Generate test data

Test data is generated using the generateData.sh script. While one may
modify the structure and size of the data generated here, note that many of
the scripts, particularly for medium and small sized jobs, rely not only on
specific InputFormats and key/value types, but also on a particular
structure to the input data. Changing these values will likely be necessary
to run on small and medium-sized clusters, but any modifications must be
informed by an explicit familiarity with the underlying scripts.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated,
block-compressed data is typical.
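
For a smaller cluster one might, for example, shrink the generated datasets
before running the generator. The sizes below are illustrative; the 4x
compression ratio noted above still applies to the block-compressed data:

  # In hadoop-env (illustrative reduced sizes):
  export COMPRESSED_DATA_BYTES=$((200 * 1024 * 1024 * 1024))    # compressed SequenceFile inputs
  export UNCOMPRESSED_DATA_BYTES=$((50 * 1024 * 1024 * 1024))   # uncompressed Text input
  export INDIRECT_DATA_BYTES=$((5 * 1024 * 1024 * 1024))        # indirect-read input

  > ./generateData.sh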

* 2 Running

2.0 General

The submissionScripts directory contains the high-level scripts that submit
the sized jobs for the gridmix benchmark. Each submits $NUM_OF_*_JOBS_PER_CLASS
instances as specified in the gridmix-env script, where an instance is an
invocation of a script as in $JOBTYPE/$JOBTYPE.$CLASS (e.g.
javasort/text-sort.large). Each instance may submit one or more map/reduce
jobs.

There is a backoff script, submissionScripts/sleep_if_too_busy, that can be
modified to define throttling criteria. By default, it simply counts running
java processes.
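
As a minimal sketch of that idea (the real script's threshold and test may
differ, and MAX_JAVA is a hypothetical knob):

  #!/bin/sh
  # Back off while too many client-side java processes are running.
  MAX_JAVA=10
  while [ "$(ps ax | grep -c '[j]ava')" -gt "$MAX_JAVA" ]; do
    sleep 30   # wait for earlier submissions to drain
  done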

2.1 Non-Hod cluster

The submissionScripts/allToSameCluster script will invoke each of the other
submission scripts for the gridmix benchmark. Depending on how your cluster
manages job submission, these scripts may require modification. The details
are very context-dependent.
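
On a cluster that accepts direct submission, the full mix can then be
started from the gridmix directory once hadoop-env is configured:

  > ./submissionScripts/allToSameCluster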

2.2 Hod

Note that there are options in hadoop-env that control jobs submitted through
Hod. One may specify the location of a config (HOD_CONFIG), the number of
nodes to allocate for classes of jobs, and any additional options one wants
to apply. The default includes an example of supplying a Hadoop tarball for
testing platform changes (see the Hod documentation).
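
For example, the Hod-related entries in hadoop-env might look like the
following; only HOD_CONFIG is named above, so treat the HOD_OPTIONS name,
the tarball flag, and both paths as illustrative:

  export HOD_CONFIG=/etc/hod/hodrc
  export HOD_OPTIONS="-t /path/to/hadoop-0.15.0.tar.gz"   # tarball for testing platform changes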

2.2.0 Static cluster

> hod --hod.script=submissionScripts/allToSameCluster -m 500

2.2.1 Hod-allocated cluster

> ./submissionScripts/allThroughHod