### "Gridmix" Benchmark ###

Contents:

0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure
  1.2 Generate test data
2 Running
  2.0 General
  2.1 Non-Hod cluster
  2.2 Hod
    2.2.0 Static cluster
    2.2.1 Hod cluster

* 0 Overview

The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:

1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
               hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads

2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.

3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.

4) Indirect Read
   Input:      500GB compressed (2TB uncompressed) Text
               (k,v) = (5 words, 20 words)
               hadoop-env: FIXCOMPTEXT
   Compute:    keep 50% map, 100% reduce
               Each map reads 1 input file, adding additional input files
               from the output of the previous iteration for 10 iterations
   Motivation: User jobs in the wild will often take input data without
               consulting the framework. This simulates an iterative job
               whose input data is all "indirect," i.e. given to the
               framework sans locality metadata.

5) API text sort (java, pipes, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
               hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce

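As an illustration of the streaming variant of job 5, an identity sort can
be expressed as a streaming job with cat as both mapper and reducer. The
streaming jar path and the output directory below are assumptions for a
hypothetical layout, not values taken from the scripts:

```shell
# Hypothetical streaming invocation of the text sort (job 5).
# The streaming jar path and output path are assumptions.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/hadoop-streaming.jar \
    -input $GRID_MIX_DATA/VARINFLTEXT \
    -output gridmix/streamsort-output \
    -mapper cat \
    -reducer cat
```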
Each of these jobs may be run individually or, using the scripts provided,
as a simulation of user activity sized to run in approximately 4 hours on a
480-500 node cluster using Hadoop 0.15.0. The benchmark runs a mix of small,
medium, and large jobs simultaneously, submitting each at fixed intervals.

Notes(1-4): Since the input data are compressed, each mapper outputs many
more bytes than it reads in, typically causing map output spills.


* 1 Getting Started

1.0 Build

1) Compile the examples, including the C++ sources:
   > ant -Dcompile.c++=yes examples
2) Copy the pipe sort example to a location in the default filesystem
   (usually HDFS, default /gridmix/programs):
   > $HADOOP_HOME/bin/hadoop dfs -mkdir $GRID_MIX_PROG
   > $HADOOP_HOME/bin/hadoop dfs -put \
         build/c++-examples/$PLATFORM_STR/bin/pipes-sort $GRID_MIX_PROG

1.1 Configure

One must modify hadoop-env to supply the following information:

HADOOP_HOME     The hadoop install location
GRID_MIX_HOME   The location of these scripts
APP_JAR         The location of the hadoop example jar
GRID_MIX_DATA   The location of the datasets for these benchmarks
GRID_MIX_PROG   The location of the pipe-sort example

Reasonable defaults are provided for all but HADOOP_HOME. The datasets used
by each of the respective benchmarks are recorded in the Input::hadoop-env
comment in section 0 and their location may be changed in hadoop-env. Note
that each job expects particular input data and the parameters given to it
must be changed in each script if a different InputFormat, key type, or
value type is desired.

Note that NUM_OF_REDUCERS_FOR_*_JOB properties should be sized to the
cluster on which the benchmarks will be run. The default assumes a large
(450-500 node) cluster.

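As a sketch, the edits to the environment script might look like the
following. Every value here is illustrative for a hypothetical install;
only GRID_MIX_PROG's /gridmix/programs default is stated in this document:

```shell
# Illustrative settings only; all paths are assumptions, not the
# script's actual defaults.
export HADOOP_HOME=/opt/hadoop-0.15.0
export GRID_MIX_HOME=$HADOOP_HOME/src/benchmarks/gridmix
export APP_JAR=$HADOOP_HOME/hadoop-0.15.0-examples.jar
export GRID_MIX_DATA=/gridmix/data
export GRID_MIX_PROG=/gridmix/programs
```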
1.2 Generate test data

Test data is generated using the generateData.sh script. While one may
modify the structure and size of the data generated here, note that many of
the scripts (particularly for medium and small-sized jobs) rely not only on
specific InputFormats and key/value types, but also on a particular
structure to the input data. Changing these values will likely be necessary
to run on small and medium-sized clusters, but any modifications must be
informed by an explicit familiarity with the underlying scripts.

It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.

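To make the space requirement concrete, a rough sizing check under the 4x
compression ratio noted above. The variable name mirrors the one mentioned
in this section; the arithmetic is the only claim being made:

```shell
# Rough sizing check, assuming the ~4x block-compression ratio noted above.
# 500GB of compressed input corresponds to roughly 2TB uncompressed.
COMPRESSED_DATA_BYTES=$((500 * 1024 * 1024 * 1024))
UNCOMPRESSED_EQUIVALENT=$((COMPRESSED_DATA_BYTES * 4))
echo "uncompressed equivalent: $UNCOMPRESSED_EQUIVALENT bytes (~2TB)"
```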
* 2 Running

2.0 General

The submissionScripts directory contains the high-level scripts submitting
sized jobs for the gridmix benchmark. Each submits $NUM_OF_*_JOBS_PER_CLASS
instances as specified in the gridmix-env script, where an instance is an
invocation of a script as in $JOBTYPE/$JOBTYPE.$CLASS (e.g.
javasort/text-sort.large). Each instance may submit one or more map/reduce
jobs.

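A minimal sketch of what such a driver loop does, with an echo standing in
for the actual script invocation. The count and the job name are
placeholders, not values from gridmix-env:

```shell
# Hypothetical driver loop following the $JOBTYPE/$JOBTYPE.$CLASS
# convention. The count and names below are placeholder assumptions.
NUM_OF_SMALL_JOBS_PER_CLASS=3
JOBTYPE=javasort
CLASS=small
i=1
while [ "$i" -le "$NUM_OF_SMALL_JOBS_PER_CLASS" ]; do
  # A real run would invoke $JOBTYPE/text-sort.$CLASS here.
  echo "instance $i: $JOBTYPE/text-sort.$CLASS"
  i=$((i + 1))
done
```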
There is a backoff script, submissionScripts/sleep_if_too_busy, that can be
modified to define throttling criteria. By default, it simply counts running
java processes.

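A sketch of such a throttling check, assuming the default criterion of
counting running java processes. The threshold and the exact ps/awk
matching are assumptions, not the script's actual contents:

```shell
# Hypothetical throttling check in the spirit of sleep_if_too_busy.
# The threshold value and process matching are assumptions.
MAX_JAVA_PROCS=20

count_java() {
  # Count processes whose command name is exactly "java".
  ps -e 2>/dev/null | awk '$NF == "java"' | wc -l
}

while [ "$(count_java)" -gt "$MAX_JAVA_PROCS" ]; do
  sleep 5   # back off until enough running jobs drain
done
```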
2.1 Non-Hod cluster

The submissionScripts/allToSameCluster script will invoke each of the other
submission scripts for the gridmix benchmark. Depending on how your cluster
manages job submission, these scripts may require modification. The details
are very context-dependent.

2.2 Hod

Note that there are options in hadoop-env that control jobs submitted
through Hod. One may specify the location of a config (HOD_CONFIG), the
number of nodes to allocate for classes of jobs, and any additional options
one wants to apply. The default includes an example for supplying a Hadoop
tarball for testing platform changes (see Hod documentation).

2.2.0 Static Cluster

> hod --hod.script=submissionScripts/allToSameCluster -m 500

2.2.1 Hod-allocated cluster

> ./submissionScripts/allThroughHod