Version 21 (modified by cristina.basescu, 14 years ago)

PDAD: Parallel Data Analysis Diff

Team members: Cristina Basescu - cristina.basescu, Claudiu-Dan Gheorghe - claudiu.gheorghe
Project description: compare data analysis performed using (a) Hadoop's MapReduce, (b) Hadoop's Pig, and (c) MPI

Oct 25 - install the Hadoop framework and get familiar with MapReduce and Pig; run examples
Nov 2 - project roadmap
Nov 22 - ideas for data analysis applications to implement
Dec 3 - decide on two data analysis applications; start implementation
Jan 9 - finish testing

Image filters and metadata processing - in this scenario, people upload pictures to a website and want to apply a filter (such as blur, sharpen, or emboss) to them, while the company wants to compute statistics on the pictures' metadata, such as camera type, shutter speed, ambient light level, and whether the flash was used. This is a typical map-reduce application, especially for the metadata phase: map jobs extract the required metadata and key it, for example, by producer, and reduce jobs count the number of occurrences of each key. For the filter phase, the map job applies the filter, while the reduce job is an identity (pass-through) one.
- TODO: find source for downloading data
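The metadata phase described above can be sketched in a few lines of plain Python. This is only an illustration of the map/shuffle/reduce flow, not the project's Hadoop code: the `camera_make` field, the sample records, and the `run_job` helper (an in-memory stand-in for Hadoop's shuffle) are all assumptions made for the example.

```python
from collections import defaultdict


def map_metadata(record):
    """Map step: emit (producer, 1) for one picture's metadata record."""
    # "camera_make" is a hypothetical field name used only in this sketch.
    producer = record.get("camera_make", "unknown")
    yield producer, 1


def reduce_count(key, values):
    """Reduce step: sum the occurrences emitted for one key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """In-memory stand-in for the shuffle phase between map and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up sample records; the last one has no producer information.
pictures = [
    {"camera_make": "Canon", "flash": True},
    {"camera_make": "Nikon", "flash": False},
    {"camera_make": "Canon", "flash": False},
    {"flash": True},
]
counts = run_job(pictures, map_metadata, reduce_count)
```

Counting by another attribute (for example flash usage) would only require a different mapper; the reducer stays the same.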

Inverted index for e-mails - email servers generate huge amounts of text. Just like web documents, email messages can be classified by their content, and an inverted index would be useful for finding the relevant email containers for a given query. This could serve as an internal application for email service owners, to find information about their users or for advertising, but it is not appropriate as a public application because it would violate users' privacy.
- http://cfdr.usenix.org/
- http://fta.inria.fr/apache2-default/pmwiki/
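As a rough illustration of the inverted-index idea, the sketch below maps each email to (word, email id) pairs and reduces them into sorted posting lists. The message ids, the sample bodies, the simple tokenizer, and the `search` helper (which intersects posting lists for an all-terms query) are assumptions for the example and are not taken from the project.

```python
import re
from collections import defaultdict


def map_email(email_id, body):
    """Map step: emit (word, email_id) once per distinct word in the body."""
    for word in set(re.findall(r"[a-z0-9']+", body.lower())):
        yield word, email_id


def reduce_postings(word, email_ids):
    """Reduce step: build the sorted posting list for one word."""
    return word, sorted(set(email_ids))


def build_index(emails):
    """Group the mapper output by word, then reduce each group."""
    groups = defaultdict(list)
    for email_id, body in emails.items():
        for word, eid in map_email(email_id, body):
            groups[word].append(eid)
    return dict(reduce_postings(w, ids) for w, ids in groups.items())


def search(index, query):
    """Return the ids of emails that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Made-up messages, keyed by hypothetical message ids.
emails = {
    "msg-1": "Meeting moved to Friday",
    "msg-2": "Friday lunch plans",
    "msg-3": "Project meeting notes",
}
index = build_index(emails)
hits = search(index, "meeting friday")
```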

Weather Analysis
- There are lots of data sets freely available on the Internet.
- Issue - what, more precisely, could we analyse? Maybe Emil can suggest an interesting analysis.
- TODO: complete description

[available/unavailable probably refers to the type of resource that generated the fault (e.g. CPU availability 60%, or an unavailable resource)]

which fault reason appears most often in events (event_trace.event_end_reason) - claudiu
what is the average duration of events - cristina
- MapReduce DONE
- Pig DONE
which component appears most often in fault events (component.component_type code) - claudiu
with events split into categories by duration, which fault cause is the most common in each category (event_trace.event_end_reason) - cristina [-> changed to listing, for each category, the number of jobs terminated by each frequent cause (>1000 failed)]
- MapReduce DONE
- Pig DONE
for each category in the event_trace.event_end_reason code ranges, which of the event_trace.event_end_reason code definitions appears most often (the number of times each one appears) - claudiu
in which geographic location are the nodes that record the most failures (node_location, looked up via the node_id from event_trace) - cristina
- MapReduce DONE
- Pig DONE
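
The three queries marked DONE above (average event duration, end reasons per duration category, and failures per node location) could look roughly like the following in plain Python. This is a sketch under assumptions, not the project's MapReduce or Pig code: the record keys `start` and `end`, the duration limits, the category labels, and the location names are all made up for the example; only `node_id` and `event_end_reason` come from the notes above.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def average_duration(events):
    """Average of (end - start) over all events; 0.0 for an empty trace."""
    total, count = 0.0, 0
    for event in events:
        total += event["end"] - event["start"]
        count += 1
    return total / count if count else 0.0


def reasons_by_duration_category(events, limits=(60, 3600)):
    """Count end reasons inside each duration category.

    The limits (in seconds) and the labels are hypothetical choices.
    """
    labels = ("short", "medium", "long")
    counts = defaultdict(Counter)
    for event in events:
        duration = event["end"] - event["start"]
        label = labels[bisect_right(limits, duration)]
        counts[label][event["event_end_reason"]] += 1
    return counts


def failures_per_location(events, node_locations):
    """Join events with node locations on node_id and count per location."""
    per_location = Counter()
    for event in events:
        location = node_locations.get(event["node_id"], "unknown")
        per_location[location] += 1
    return per_location


# Made-up trace records and node locations.
events = [
    {"node_id": 1, "start": 0, "end": 30, "event_end_reason": 1},
    {"node_id": 1, "start": 100, "end": 400, "event_end_reason": 2},
    {"node_id": 2, "start": 0, "end": 7200, "event_end_reason": 2},
    {"node_id": 3, "start": 50, "end": 120, "event_end_reason": 2},
]
node_locations = {1: "site-A", 2: "site-B", 3: "site-A"}

avg = average_duration(events)
by_category = reasons_by_duration_category(events)
by_location = failures_per_location(events, node_locations)
```

In a real MapReduce job the average would typically be computed from partial (sum, count) pairs so that the work can be combined across mappers, and the location lookup would be a join between the event and node tables.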