Changes between Version 6 and Version 7 of PDAD_Performance


Timestamp:
Jan 14, 2010, 3:13:22 PM (14 years ago)
Author:
cristina.basescu
Comment:

--

  • PDAD_Performance

    v6 v7  
    1818=== Hadoop ===
    1919
    20 The MapReduce Framework from Hadoop offers a very large number of parameters to be configured for running a job. One can set the number of maps, which is usually driven by the total size of the inputs; the right level of parallelism seems to be around 10-100 maps per node. Because task setup takes a while, each map should run for at least a minute to be worth the setup cost. Hadoop dynamically configures this parameter. Also, one can set the number of reduce tasks, a thing we should pay more attention to.
     20The MapReduce Framework from Hadoop offers a very large number of parameters to be configured for running a job.
     21
     22One can set the number of maps, which is usually driven by the total size of the inputs; the right level of parallelism seems to be around 10-100 maps per node. Because task setup takes a while, each map should run for at least a minute to be worth the setup cost. Hadoop dynamically configures this parameter.
     23
     24One can also set the number of reduce tasks, a parameter we paid more attention to. The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which does a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. Running the jobs unchanged resulted in just one reducer, which obviously had no load balancing at all. With this in mind, we set the number of reducers to 5 as appropriate for our testing infrastructure mentioned above.
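
The rule of thumb above is simple arithmetic, and a minimal sketch of it follows. The function name `suggested_reduces`, rounding down to a whole number of tasks, and the example values (4 nodes, 2 reduce slots per node) are assumptions made for illustration only; they are not the settings or the method used for the tests on this page.

```python
def suggested_reduces(nodes, reduce_slots_per_node, factor=0.95):
    """Apply the rule of thumb: factor * (nodes * reduce slots per node).

    reduce_slots_per_node stands in for the value of
    mapred.tasktracker.reduce.tasks.maximum on each node.
    Rounding down (and never returning less than 1) is an assumption
    of this sketch, not something the rule itself prescribes.
    """
    if nodes < 1 or reduce_slots_per_node < 1:
        raise ValueError("nodes and reduce_slots_per_node must be >= 1")
    return max(1, int(factor * nodes * reduce_slots_per_node))


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, 2 reduce slots per node.
    for factor in (0.95, 1.75):
        print(factor, suggested_reduces(4, 2, factor))
```

The 0.95 factor keeps the count just below the number of available reduce slots, so every reduce can start in a single wave; the 1.75 factor deliberately asks for more tasks than slots, so faster nodes pick up a second wave.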
     25
     26Another interesting parameter is the replication factor in HDFS, which should be lower than the number of DataNodes so as not to use too much space, but high enough to allow parallelism while placing jobs where the data is. We set it to 2 both when we had 2 nodes and when we had 4 nodes in the cluster.
     27
     28There are no parameters to configure for Pig. Although this gives no headaches, it isn't a performance-friendly solution.
    2129
    2230=== MPI ===
    2331
     32A major bottleneck in MPI is communication, so our main concern was message sizes: not too large, but also not too small and not too frequent. Overlapping I/O and computation is also of high importance. Moreover, MPI-2 has some interesting features, like dynamic creation of processes and parallel I/O.
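
One common way to avoid many small, frequent messages is to group records into batches before sending them. The sketch below shows only the grouping step in plain Python, with no MPI calls; the helper name `batch_records` and the `max_bytes` limit are hypothetical and are not taken from the code measured on this page.

```python
def batch_records(records, max_bytes):
    """Group byte-string records into batches of at most max_bytes each.

    Each returned batch is a list of records meant to travel in a single
    message. A record larger than max_bytes is placed alone in its own
    batch (this sketch does not split records).
    """
    if max_bytes < 1:
        raise ValueError("max_bytes must be >= 1")
    batches, current, size = [], [], 0
    for rec in records:
        # Close the current batch if adding this record would overflow it.
        if current and size + len(rec) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(batch_records([b"aa", b"bb", b"cc"], max_bytes=4))
```

Choosing `max_bytes` is the trade-off described above: a larger limit means fewer messages, while a smaller one keeps each message cheap to buffer and lets receivers start working sooner.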
    2433
    25 
    26 
     34=== Results ===
    2735
    2836== Comparison ==