Changes between Version 16 and Version 17 of PDAD_Applications


Timestamp: Jan 15, 2010, 8:18:21 PM
Author: claudiu.gheorghe

Making design choices for !MapReduce is an intricate task. On the one hand, there are many decisions to make, such as whether or not to use a Combiner (which cuts down the amount of data transferred from the Mapper to the Reducer), a Partitioner (which partitions the output of the mappers per reducer), a !CompressionCodec (which compresses the intermediate outputs from mappers to reducers), or a Comparator that performs a secondary sort before the reduce phase. On the other hand, complex combinations of these extra features can push the development time beyond what the gain is worth.

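To make these options concrete, here is a minimal sketch of how they would be wired into a Hadoop (0.20-style) job configuration. The FaultMapper, FaultReducer, DomainPartitioner and DurationComparator class names are hypothetical placeholders, and the compression properties use the old configuration key names; none of this is the actual project code.

{{{
#!java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class JobSetupSketch {
    public static Job configure(Configuration conf) throws Exception {
        // Compress the intermediate map output (old 0.20-era property names).
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "fault-analysis");
        // The classes below are hypothetical placeholders for this sketch.
        job.setMapperClass(FaultMapper.class);                 // parses the input records
        job.setCombinerClass(FaultReducer.class);              // map-side pre-aggregation
        job.setPartitionerClass(DomainPartitioner.class);      // decides which reducer gets each key
        job.setSortComparatorClass(DurationComparator.class);  // secondary sort before the reduce phase
        job.setReducerClass(FaultReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}
}}}
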
a. Finding the most frequent fault reasons requires only simple counting. In the Mapper I just parse the input and extract the right column, obtaining the code of the fault reason. The output of the Mapper is a pair <fault_code, 1>. The Reducer simply sums the counts per code and emits the fault code as a String. The only problem was dealing with input format errors, because the fault reason codes were not all correct and most of them were NULL values.

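A minimal sketch of this counting job, assuming comma-separated input and a hypothetical column index for the fault reason; the NULL filtering mirrors the input problem mentioned above.

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FaultReasonCount {
    // Emits <fault_code, 1> for every well-formed record.
    public static class FaultMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");       // assumed CSV input
            if (cols.length < 5) return;                        // malformed record
            String faultCode = cols[4].trim();                  // assumed fault-reason column
            if (faultCode.isEmpty() || faultCode.equalsIgnoreCase("NULL")) return;
            ctx.write(new Text(faultCode), ONE);
        }
    }

    // Sums the counts for each fault code and emits <fault_code, total>.
    public static class FaultReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text faultCode, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(faultCode, new IntWritable(sum));
        }
    }
}
}}}
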
b. We tried two approaches here. The first one is to have the Mapper compute each job's duration, giving every pair the same key, so that the Reducer sums up all these values and computes the mean. Unfortunately, no Combiner can be specified here, as the Reducer would no longer know how many elements the Mappers generated.

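A rough sketch of this first approach, assuming comma-separated input and a hypothetical column index for the job duration; note that no Combiner is set.

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanDuration {
    // Every duration is emitted under the same key, so one reduce call sees all of them.
    public static class DurationMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private static final Text ALL = new Text("all");
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");          // assumed CSV input
            if (cols.length <= 2) return;
            try {
                ctx.write(ALL, new DoubleWritable(Double.parseDouble(cols[2]))); // assumed duration column
            } catch (NumberFormatException e) {
                // skip malformed records
            }
        }
    }

    // Sums and counts all durations, then emits their mean.
    public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> durations, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable d : durations) { sum += d.get(); count++; }
            if (count > 0) ctx.write(key, new DoubleWritable(sum / count));
        }
    }
}
}}}
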
The second approach is to compute the mean over chunks with a fixed number of elements, and then take the mean of all these chunk means as the result. Although this is well suited to specifying a Combiner, it gives only an approximate value of the mean, depending on the distribution of values in each of the chunks. In this case, the Mapper generates a new key after every chunk of duration pairs, the Combiner computes the mean of each chunk and outputs the chunk means under the same key, and the Reducer computes the mean just as it did in the previous approach.

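For the second approach only the map side and the Combiner change; the Reducer can stay the same as in the previous sketch. A rough sketch, with an arbitrary CHUNK_SIZE and the same input assumptions as before:

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ChunkedMeanDuration {
    private static final int CHUNK_SIZE = 1000;   // arbitrary chunk size

    // Emits durations under a key that changes after every CHUNK_SIZE records.
    public static class ChunkMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private long seen = 0;
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");          // assumed CSV input
            if (cols.length <= 2) return;
            try {
                double duration = Double.parseDouble(cols[2]);    // assumed duration column
                ctx.write(new Text("chunk-" + (seen++ / CHUNK_SIZE)), new DoubleWritable(duration));
            } catch (NumberFormatException e) {
                // skip malformed records
            }
        }
    }

    // Averages one chunk and re-emits the chunk mean under a single shared key,
    // so the final Reducer computes the (approximate) mean of the chunk means.
    public static class ChunkCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private static final Text ALL = new Text("all");
        @Override
        protected void reduce(Text chunk, Iterable<DoubleWritable> durations, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable d : durations) { sum += d.get(); count++; }
            if (count > 0) ctx.write(ALL, new DoubleWritable(sum / count));
        }
    }
}
}}}

Since Hadoop does not guarantee that the Combiner actually runs, this remains a best-effort approximation, as noted above.
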
c. The solution is very similar to a.; I only needed to count the component types instead.

d. The Mapper classifies the inputs and emits keys of the form 'duration classification - reason' with a value of 1, while the Reducer counts the values and outputs only the combinations that exceed 1000. The Combiner does essentially the same thing as the Reducer.

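A minimal sketch of this job, with hypothetical column indices and classification thresholds. In this sketch a Combiner (not shown) would only sum the counts, leaving the 1000 threshold to be applied globally in the Reducer, so that no combination is dropped before its total count is known.

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DurationReasonCount {
    // Emits a composite 'duration classification - reason' key with a count of 1.
    public static class ClassifyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");              // assumed CSV input
            if (cols.length < 5) return;
            try {
                String durationClass = classify(Double.parseDouble(cols[2])); // assumed duration column
                String reason = cols[4];                                       // assumed reason column
                ctx.write(new Text(durationClass + " - " + reason), ONE);
            } catch (NumberFormatException e) {
                // skip malformed records
            }
        }
        private String classify(double duration) {
            // hypothetical classification thresholds
            if (duration < 60) return "short";
            if (duration < 3600) return "medium";
            return "long";
        }
    }

    // Sums the counts and outputs only the combinations occurring more than 1000 times.
    public static class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (sum > 1000) ctx.write(key, new IntWritable(sum));
        }
    }
}
}}}
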
e. In the Mapper we emit a composite value made up of the fault code and a count of 1, so the output of the Mapper looks like <fault_domain, <fault_code, 1>>. The composite values are grouped by fault_domain, so we compute the sum for each fault_code using a HashMap<fault_code, sum>. The Reducer therefore emits multiple values to the output, one for each fault_code found.

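A rough sketch of this job, encoding the composite <fault_code, 1> value as a plain Text of the form "fault_code,1" (a custom Writable would also work); the column indices are assumptions.

{{{
#!java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FaultsPerDomain {
    // Emits <fault_domain, "fault_code,1"> for every record.
    public static class DomainMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");   // assumed CSV input
            if (cols.length < 6) return;
            String faultDomain = cols[3];                  // assumed fault_domain column
            String faultCode = cols[4];                    // assumed fault_code column
            ctx.write(new Text(faultDomain), new Text(faultCode + ",1"));
        }
    }

    // Sums per fault_code in a HashMap and emits one pair per code found in the domain.
    public static class DomainReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text faultDomain, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            Map<String, Integer> sums = new HashMap<String, Integer>();
            for (Text value : values) {
                String[] parts = value.toString().split(",");
                int count = Integer.parseInt(parts[1]);
                Integer old = sums.get(parts[0]);
                sums.put(parts[0], old == null ? count : old + count);
            }
            for (Map.Entry<String, Integer> entry : sums.entrySet()) {
                ctx.write(new Text(faultDomain + " - " + entry.getKey()),
                          new IntWritable(entry.getValue()));
            }
        }
    }
}
}}}
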
f. This is a more interesting application, as it requires a natural join between the node table and the event_trace table on the platform_id and node_id fields. In order to do that, we concatenate the two join columns. The idea is that the Mapper reads both files and, for each platform_id;node_id key, emits values that are either 1 (meaning a failure on that node) or the node's location. So after the map phase we should have, for every platform_id;node_id key, many values of 1 and also a location. These pairs then reach the Combiner; however, depending on which of the Mappers found the location, some Combiners may receive only values of 1, so the best effort here is to output the same key with the 1 values added together. If the Combiner finds a location among the values, it outputs that pair unchanged. In the third phase, all the pairs having the same key reach the Reducer, which sums the numeric values and outputs them with the location it finds among the values as the key. However, since multiple nodes can share the same location, each Reducer may output more than one numeric value for the same location, and these still have to be summed up. That is why we need a '''second map-reduce job''', with an identity Mapper and a Reducer that simply sums the values having the same key.
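
A rough sketch of the Combiner and Reducer of the first job, assuming the Mapper emits, for every platform_id;node_id key, either the Text value "1" or the node's location tagged with a hypothetical "LOC:" prefix; the driver and the second (identity Mapper plus summing Reducer) job are omitted.

{{{
#!java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FailuresPerLocationJob1 {
    // Best-effort Combiner: adds up the "1" values for a node key and forwards any
    // location value untouched, since the location may have been read by another Mapper.
    public static class NodeCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text nodeKey, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long failures = 0;
            for (Text value : values) {
                if (value.toString().startsWith("LOC:")) {
                    ctx.write(nodeKey, value);                    // forward the location unchanged
                } else {
                    failures += Long.parseLong(value.toString()); // accumulate failure counts
                }
            }
            if (failures > 0) ctx.write(nodeKey, new Text(Long.toString(failures)));
        }
    }

    // Reducer: sums the numeric values for the node and re-keys the count by the
    // location found among the values; the second job then sums the counts per location.
    public static class LocationReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text nodeKey, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long failures = 0;
            String location = null;
            for (Text value : values) {
                if (value.toString().startsWith("LOC:")) {
                    location = value.toString().substring(4);
                } else {
                    failures += Long.parseLong(value.toString());
                }
            }
            if (location != null) ctx.write(new Text(location), new LongWritable(failures));
        }
    }
}
}}}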