Changes between Version 17 and Version 18 of PDAD_Applications

Jan 15, 2010, 8:23:03 PM



  • PDAD_Applications

f. This is a more interesting application, as it requires a natural join between the node table and the event_trace table on the platform_id and node_id fields. To do that, we concatenate the two join columns. The idea is that the Mapper reads both files and emits, for each platform_id;node_id key, values that are either 1 (one per failure on that node) or the node's location. So after the map phase, each platform_id;node_id key should have many values of 1 and also a location. These pairs then reach the Combiner; however, depending on which Mapper found the location, some Combiners may receive only values of 1, so the best they can do is output the same key after adding up the 1 values. If a Combiner finds a location among the values, it outputs that pair unchanged. In the third phase, all pairs with the same key reach the Reducer, which sums the numeric values and outputs the location it finds among the values as the new key. However, since multiple nodes can share a location, each Reducer may output more than one numeric value for the same location, and these still have to be summed up. That is why we need a '''second map-reduce job''', with an identity Mapper and a Reducer that simply sums the values having the same key.
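The two-job flow described above can be sketched as a small in-memory simulation. This is an illustrative Python sketch, not our actual Hadoop code; the sample rows, field values, and site names are made-up assumptions:

```python
from collections import defaultdict

# Hypothetical node table rows: (platform_id, node_id, location)
nodes = [(1, 'n1', 'siteA'), (1, 'n2', 'siteA'), (2, 'n1', 'siteB')]
# Hypothetical event_trace rows: (platform_id, node_id), one row per failure
events = [(1, 'n1'), (1, 'n1'), (1, 'n2'), (2, 'n1'), (2, 'n1')]

# --- Job 1, map phase: emit (platform_id;node_id, value) pairs ---
pairs = []
for p, n, loc in nodes:
    pairs.append((f"{p};{n}", loc))   # the location, from the node table
for p, n in events:
    pairs.append((f"{p};{n}", 1))     # a 1 per failure, from event_trace

# --- Job 1, reduce phase: sum the 1s, pick out the location as the new key ---
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

job1_out = []
for k, values in grouped.items():
    count = sum(v for v in values if isinstance(v, int))
    loc = next(v for v in values if isinstance(v, str))
    job1_out.append((loc, count))     # one (location, count) per node

# --- Job 2: identity map, then sum the counts for each location ---
per_location = defaultdict(int)
for loc, count in job1_out:
    per_location[loc] += count

print(dict(per_location))  # {'siteA': 3, 'siteB': 2}
```

The Combiner step is omitted here since it only pre-aggregates what the Reducer would sum anyway; the second job is what merges the per-node counts that land on the same location.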
TODO Claudiu for his apps
== Pig ==
Contrary to !MapReduce, writing code in !PigLatin is as straightforward as it gets. There is no need to worry about ''how'' things are done; one just has to specify ''what'' needs to be done. Describing this seems more natural in !PigLatin; for example, the b. program simply computes an average. As a general idea, one groups by what one would have used as the key in !MapReduce, and a big advantage is that a join is much easier to do.
Because the code is more explicit, here is how one of our Pig applications looks:
{{{
A = load '$inputDir/' as (a, b, c, d, e, f, g, h, end_reason: int);
B = filter A by end_reason is not null;
C = group B by end_reason;
D = foreach C generate group, COUNT($1);
E = order D by $1;
STORE E INTO '$outputDir' USING PigStorage();
}}}
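The script's dataflow (filter out null end_reason values, count the occurrences of each value, order by the count) can be sketched in plain Python; the sample end_reason values below are made up for illustration:

```python
from collections import Counter

# Hypothetical end_reason column values; None plays the role of a null field.
end_reasons = [1, 2, 1, None, 3, 1, 2]

counts = Counter(r for r in end_reasons if r is not None)  # filter + group + count
result = sorted(counts.items(), key=lambda kv: kv[1])      # order by the count
print(result)  # [(3, 1), (2, 2), (1, 3)]
```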
== MPI ==