### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure environment variables
  1.2 Configure the job mixture
  1.3 Generate test data
2 Running

* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
               hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads

2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.

3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.

4) API text sort (java, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
               hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce

5) Jobs with combiner (word count jobs)

A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs these jobs based on the
XML file and puts them under the control of a JobControl object, which
submits the jobs to the cluster and monitors their progress until all
jobs complete.

Note (jobs 1-3): Since the input data are compressed, each mapper outputs
many more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

In the src/benchmarks/gridmix directory, run "ant". gridmix.jar will be
created in the build subdirectory. Copy gridmix.jar to the gridmix
directory.

1.1 Configure environment variables

One must modify gridmix-env-2 to set the following variables:

HADOOP_HOME      The hadoop install location
HADOOP_VERSION   The exact hadoop version to be used, e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR  The directory containing the hadoop-site.xml for the
                 cluster to be used
USE_REAL_DATA    If set to true, a large data-set will be created and used
                 by the benchmark

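For reference, a gridmix-env-2 fragment might look like the following. All
values here are placeholders, not defaults shipped with the benchmark;
substitute the ones for your cluster.

```shell
# Example settings only -- every path and version below is hypothetical.
export HADOOP_HOME=/opt/hadoop
export HADOOP_VERSION=hadoop-0.18.2-dev
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export USE_REAL_DATA=true
```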
1.2 Configure the job mixture

A default gridmix_config.xml file is provided. One may adjust the number
of jobs of the various types and sizes as necessary, change the number of
reducers for each job, and specify whether to compress the output data of
a map/reduce job. Note that one can specify multiple values in the
numOfJobs and numOfReduces fields, like:
<property>
  <name>javaSort.smallJobs.numOfJobs</name>
  <value>8,2</value>
  <description></description>
</property>

<property>
  <name>javaSort.smallJobs.numOfReduces</name>
  <value>15,70</value>
  <description></description>
</property>

The above spec means that there will be 8 small java sort jobs with 15
reducers each and 2 small java sort jobs with 70 reducers each.

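To illustrate how the paired comma-separated values line up (the real
parsing is done by the gridmix Java driver; this is only a sketch of the
pairing rule described above):

```shell
# The i-th entry in numOfJobs pairs with the i-th entry in numOfReduces.
numOfJobs="8,2"
numOfReduces="15,70"
IFS=',' read -ra jobs    <<< "$numOfJobs"
IFS=',' read -ra reduces <<< "$numOfReduces"
for i in "${!jobs[@]}"; do
  echo "${jobs[$i]} job(s) with ${reduces[$i]} reducer(s) each"
done
# prints:
#   8 job(s) with 15 reducer(s) each
#   2 job(s) with 70 reducer(s) each
```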
1.3 Generate test data

Test data is generated using the generateGridmix2Data.sh script:

  ./generateGridmix2Data.sh

One may modify the structure and size of the data generated by the script.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the
size of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated,
block-compressed data is typical.

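As a quick sanity check of the sizing arithmetic, the job descriptions
above pair 500GB of compressed input with 2TB uncompressed, which is where
the 4x ratio comes from. The variable names below match those in
generateGridmix2Data.sh, but the values are taken from the job
descriptions, not necessarily the script's defaults:

```shell
# Illustrative sizes from the job descriptions (500GB compressed, 2TB raw).
COMPRESSED_DATA_BYTES=$((500 * 1024 * 1024 * 1024))
UNCOMPRESSED_DATA_BYTES=$((2 * 1024 * 1024 * 1024 * 1024))
echo "approx. compression ratio: $((UNCOMPRESSED_DATA_BYTES / COMPRESSED_DATA_BYTES))x"
# prints: approx. compression ratio: 4x
```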
* 2 Running

Set HADOOP_CONF_DIR to the directory containing the hadoop-site.xml for
the cluster to be used, then run:

  ./rungridmix_2

The script creates start.out to record the start time and, when the
benchmark finishes, end.out to record the end time.
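The two marker files can be used to compute the benchmark's wall-clock
time. A minimal sketch, assuming start.out and end.out each contain a
single `date`-style timestamp (check your copy of rungridmix_2 for the
exact format) and that GNU date is available for the -d option:

```shell
# Compute elapsed seconds between two timestamp files.
# Assumption: each file holds one `date`-style timestamp; requires GNU date.
elapsed_seconds() {
  local start end
  start=$(date -d "$(cat "$1")" +%s)
  end=$(date -d "$(cat "$2")" +%s)
  echo $((end - start))
}
```

For example, `elapsed_seconds start.out end.out` prints the run time in
seconds.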