### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure environment variables
  1.2 Configure the job mixture
  1.3 Generate test data
2 Running
  2.0 General
  2.1 Non-Hod cluster
  2.2 Hod
    2.2.0 Static cluster
    2.2.1 Hod cluster

* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
   hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads

2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
   hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.

3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
   hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.

4) API text sort (java, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
   hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce

5) Jobs with combiner (word count jobs)

A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs these jobs based on the XML
file and puts them under the control of a JobControl object, which submits
the jobs to the cluster and monitors their progress until all jobs complete.

Notes (1-3): Since the input data are compressed, each mapper outputs many
more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

In the src/benchmarks/gridmix directory, run "ant". gridmix.jar will be
created in the build subdirectory; copy gridmix.jar to the gridmix
directory.
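The build steps can be run as a short command sequence. This sketch assumes
you start at the root of a Hadoop source tree and have ant on your PATH;
the paths are taken from this README.

```shell
# Run from the root of the Hadoop source tree (assumed layout).
cd src/benchmarks/gridmix
ant                      # builds gridmix.jar into the build subdirectory
cp build/gridmix.jar .   # place the jar next to the run scripts
```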

1.1 Configure environment variables

One must modify gridmix-env-2 to set the following variables:

HADOOP_HOME       The Hadoop install location
HADOOP_VERSION    The exact Hadoop version to be used, e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR   The directory containing the hadoop-site.xml for the
                  cluster to be used
USE_REAL_DATA     If set to true, a large data-set will be created and used
                  by the benchmark

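A gridmix-env-2 fragment setting these variables might look like the
following. All of the path values here are hypothetical placeholders;
substitute the locations for your own installation.

```shell
# Hypothetical values -- substitute the paths for your cluster.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_VERSION=hadoop-0.18.2-dev       # version string from this README
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf    # must contain hadoop-site.xml
export USE_REAL_DATA=true                     # generate and use the full dataset
```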
1.2 Configure the job mixture

A default gridmix_config.xml file is provided. One may make appropriate
changes as necessary to the number of jobs of various types and sizes. One
can also change the number of reducers of each job, and specify whether to
compress the output data of a map/reduce job. Note that one can specify
multiple comma-separated values in the numOfJobs field and numOfReduces
field, like:

  <property>
    <name>javaSort.smallJobs.numOfJobs</name>
    <value>8,2</value>
    <description></description>
  </property>

  <property>
    <name>javaSort.smallJobs.numOfReduces</name>
    <value>15,70</value>
    <description></description>
  </property>

The above spec means that we will have 8 small java sort jobs with 15
reducers and 2 small java sort jobs with 70 reducers.

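The positional pairing of the two comma-separated lists can be sketched in
shell. `pair_mix` is a hypothetical helper for illustration only; the real
pairing happens inside the gridmix Java driver.

```shell
# Pair the i-th numOfJobs value with the i-th numOfReduces value.
pair_mix() {
  jobs=$1
  reduces=$2
  i=1
  for n in $(echo "$jobs" | tr ',' ' '); do
    r=$(echo "$reduces" | cut -d',' -f"$i")   # i-th reducer count
    echo "$n jobs with $r reducers each"
    i=$((i + 1))
  done
}

pair_mix "8,2" "15,70"
# prints:
#   8 jobs with 15 reducers each
#   2 jobs with 70 reducers each
```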
1.3 Generate test data

Test data is generated using the generateGridmix2Data.sh script:

  ./generateGridmix2Data.sh

One may modify the structure and size of the data generated here.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.

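For example, the size variables inside generateGridmix2Data.sh could be
overridden as below. The 500GB and 200GB figures are hypothetical values
chosen for illustration, not the script's actual defaults.

```shell
GB=$((1024 * 1024 * 1024))             # bytes per gigabyte
COMPRESSED_DATA_BYTES=$((500 * GB))    # compressed SequenceFile input
UNCOMPRESSED_DATA_BYTES=$((500 * GB))  # plain-text input
INDIRECT_DATA_BYTES=$((200 * GB))      # indirect input
```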
* 2 Running

You need to set HADOOP_CONF_DIR to the directory where hadoop-site.xml
exists. Then you just need to type:

  ./rungridmix_2

It will create start.out to record the start time, and at the end, it will
create end.out to record the end time.
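The timestamp bookkeeping performed by rungridmix_2 amounts to the sketch
below; the job-submission step itself is elided, and only the start.out and
end.out file names come from this README.

```shell
date > start.out    # record the start time before submitting jobs
# ... submit the gridmix jobs here (elided) ...
date > end.out      # record the end time after all jobs complete
```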