Changes between Version 13 and Version 14 of PDAD_Performance


Timestamp: Jan 15, 2010, 8:43:17 PM
Author: claudiu.gheorghe
Comment: --

  • PDAD_Performance

As the main idea of a distributed system is to use commodity hardware, we used at most four computers from ED202, running Ubuntu 8.04, Hadoop 0.20.1 and Pig 0.5.0. They were interconnected through a Gigabit switch, so as to get the most out of the infrastructure.

First, we have to overview the Hadoop architecture in order to explain the setup decisions. There are four types of entities in a Hadoop cluster: !NameNode, !JobTracker, !DataNode and !TaskTracker. The !NameNode and the !DataNodes belong to the Hadoop distributed filesystem (HDFS): the !NameNode is its central point and keeps the structure of the file system, while the !DataNodes store the actual data blocks. The !JobTracker and the !TaskTrackers belong to Map-Reduce job execution: the !JobTracker is its central point and keeps track of the jobs, while the !TaskTrackers execute the individual tasks. The Hadoop cluster setup guide recommends using separate machines for the master roles (!NameNode and !JobTracker) and running the slaves with both !DataNode and !TaskTracker capabilities.

The first test scenario uses two nodes. Because our cluster was not that big, we not only aggregated the master capabilities onto a single machine, but in this scenario the master machine also acts as a slave. In the second scenario we used three slaves with normal capabilities and a master with only !NameNode and !JobTracker capabilities.
[[Image(testing.png)]]
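For clarity, here is a minimal, illustrative sketch of where the central daemons live in the second scenario. The host names are made up; in a real Hadoop 0.20 deployment these values would normally go into core-site.xml and mapred-site.xml, with the slave machines listed in conf/slaves.

{{{
#!java
import org.apache.hadoop.conf.Configuration;

public class ClusterLayout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // NameNode: the central point of HDFS (hypothetical host name).
        conf.set("fs.default.name", "hdfs://master:9000");

        // JobTracker: the central point of Map-Reduce (hypothetical host name).
        conf.set("mapred.job.tracker", "master:9001");

        // The three slaves (DataNode + TaskTracker) would be listed,
        // one per line, in conf/slaves.
    }
}
}}}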
== Parameters ==

=== Hadoop ===

The '''!MapReduce''' framework from Hadoop offers a very large number of parameters that can be configured when running a job.
One can set the number of maps, which is usually driven by the total size of the inputs; the right level of parallelism seems to be around 10-100 maps per node. Because task setup takes a while, one should make sure it is worth it, so each map should run for at least a minute. Hadoop configures this parameter dynamically.
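As a rough sketch of that rule of thumb (the numbers below are assumptions, not measurements): with the default 64 MB HDFS block size, a hypothetical 10 GB input would be split into about 160 maps. In the old org.apache.hadoop.mapred API, setNumMapTasks() is only a hint, since the InputFormat ultimately decides the split count.

{{{
#!java
import org.apache.hadoop.mapred.JobConf;

public class MapCountHint {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapCountHint.class);

        // Hypothetical input: 10 GB split into 64 MB blocks -> ~160 map tasks.
        long inputBytes = 10L * 1024 * 1024 * 1024;
        long blockBytes = 64L * 1024 * 1024;   // default dfs.block.size in 0.20
        int expectedMaps = (int) (inputBytes / blockBytes);

        // Only a hint: the InputFormat still decides the actual number of splits.
        conf.setNumMapTasks(expectedMaps);
    }
}
}}}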
     
Also, one can set the number of reduce tasks, which we paid more attention to. The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes finish their first round of reduces and launch a second wave, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. Running the jobs as they were resulted in just one reducer, which obviously had no load balancing at all. Having this in mind, we set the number of reducers to 5, as appropriate for our testing infrastructure mentioned above.
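A minimal sketch of how the reducer count can be pinned from a job driver, assuming the old org.apache.hadoop.mapred API and the default of 2 reduce slots per !TaskTracker:

{{{
#!java
import org.apache.hadoop.mapred.JobConf;

public class ReducerCount {
    public static void main(String[] args) {
        JobConf conf = new JobConf(ReducerCount.class);

        // Rule of thumb: 0.95 * (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
        // With 3 slaves and the default of 2 reduce slots per TaskTracker,
        // 0.95 * 3 * 2 = 5.7, which we rounded down to 5.
        conf.setNumReduceTasks(5);
    }
}
}}}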
Another interesting parameter is the replication factor in HDFS, which should be lower than the number of !DataNodes, so as not to use too much space, but sufficient to allow parallelism by placing tasks where the data is. We set it to 2 both when we had 2 and when we had 4 nodes in the cluster.
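If one wanted to set this from a job's configuration instead of hdfs-site.xml, a sketch would look like the following; it only affects files written by that client.

{{{
#!java
import org.apache.hadoop.mapred.JobConf;

public class ReplicationFactor {
    public static void main(String[] args) {
        JobConf conf = new JobConf(ReplicationFactor.class);

        // dfs.replication is usually set cluster-wide in hdfs-site.xml;
        // here it is lowered to 2, i.e. below the number of DataNodes.
        conf.setInt("dfs.replication", 2);
    }
}
}}}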
There are no parameters to configure for Pig. Although this gives no headaches, it is not a performance-friendly solution.