Version 16 (modified by cristina.basescu, 14 years ago)


PDAD: Parallel Data Analysis Diff

  • Team members: Cristina Basescu - cristina.basescu, Claudiu-Dan Gheorghe - claudiu.gheorghe
  • Project description: compare data analysis performed using (a) Hadoop's MapReduce, (b) Hadoop's Pig, and (c) MPI

Technologies and Languages

Project Activity

  • Oct 25 - install the Hadoop framework and get familiar with MapReduce and Pig; run examples
  • Nov 2 - project roadmap
  • Nov 22 - ideas for data analysis applications to implement
  • Dec 3 - decide on two data analysis applications; start implementation
  • Jan 9 - finish testing

Proposed Data Analysis Applications

  • Image filters and metadata processing - a scenario where people upload pictures to a website and want to apply a filter (such as blur, sharpen, or emboss) to them, while the company wants statistics on the pictures' metadata, such as camera type, shutter speed, ambient light levels, and whether the flash was used. This is a typical map-reduce application, especially for the metadata phase: the map jobs extract the necessary metadata and group it, for example, by producer, and the reduce jobs count the number of occurrences. For the filter phase, the map job applies the filter, while the reduce job is an identity (pass-through) one.
    • TODO: find source for downloading data
  • Inverted index for e-mails - email servers generate huge amounts of text. Just like web documents, email messages can be classified by their content, and an inverted index would help find the emails relevant to a given query. This can be useful as an internal application, used by email service owners to find information about their users or for advertising, but it is not appropriate as a public application because it would violate users' privacy.
  • Semantic web - Recommendation system
    • TODO Add description
  • Weather Analysis
    • There are lots of data sets freely available on the Internet.
    • Issue - what exactly could we analyse? Maybe Emil can suggest an interesting analysis.
    • TODO Complete description
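The metadata-counting and inverted-index ideas above can be sketched as map and reduce steps in plain Python. This is only an in-memory illustration of the pattern, not Hadoop code: the record layout (`producer`, `id`, `body`), the helper `run_map_reduce`, and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def map_metadata(photo):
    """Map step: emit (producer, 1) for one photo's metadata record."""
    # "producer" is a hypothetical metadata key (camera manufacturer).
    yield photo.get("producer", "unknown"), 1


def map_email(email):
    """Map step: emit (word, email id) once per distinct word in the body."""
    for word in set(email["body"].lower().split()):
        yield word, email["id"]


def run_map_reduce(records, mapper, reducer):
    """Apply the mapper, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


# Made-up sample inputs.
photos = [
    {"producer": "Canon", "flash": True},
    {"producer": "Nikon", "flash": False},
    {"producer": "Canon", "flash": False},
]
emails = [
    {"id": 1, "body": "Meeting about Hadoop cluster"},
    {"id": 2, "body": "Hadoop install notes"},
]

# Reduce by summing the 1s: number of photos per producer.
producer_counts = run_map_reduce(photos, map_metadata, sum)

# Reduce by sorting the ids: word -> list of emails containing it.
inverted_index = run_map_reduce(emails, map_email, sorted)
```

In a real Hadoop job the grouping done by `run_map_reduce` would be handled by the framework's shuffle phase, and the image-filter phase would use a mapper that transforms each picture with a pass-through reducer.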

Next meeting

  • testing infrastructure?
  • decide on the application(s)
  • 'play' on cluster
  • what is the project submission deadline?


[available/unavailable probably refers to the type of resource that generated the fault (e.g. CPU availability 60%, or an unavailable resource)]

  • which fault reason appears most often in the events (event_trace.event_end_reason) - claudiu
  • what is the average duration of the events - cristina
  • which component appears most often in fault events (component.component_type code) - claudiu
  • with the events split into categories by duration, what is the most common fault cause in each category (event_trace.event_end_reason) - cristina

-> changed to listing, for each category, the number of jobs terminated by each cause

  • for each category in the event_trace.event_end_reason code ranges, which of the event_trace.event_end_reason code definitions appears most often (the number of times each one appears) - claudiu
  • in which geographic location are the nodes that record the most failures (node_location, looked up by the node_id from event_trace) - cristina
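As a rough illustration of the questions above, the following Python sketch computes a few of them over a tiny in-memory sample. Only `event_end_reason`, `node_id`, and `node_location` come from the notes; the timestamp fields `event_start_time` and `event_stop_time`, the 100-unit duration threshold, the reason codes, and the location names are assumptions invented for the example.

```python
from collections import Counter, defaultdict

# Made-up sample events; the timestamp field names are assumed.
events = [
    {"node_id": "n1", "event_start_time": 0, "event_stop_time": 50, "event_end_reason": 1001},
    {"node_id": "n1", "event_start_time": 10, "event_stop_time": 310, "event_end_reason": 1001},
    {"node_id": "n2", "event_start_time": 0, "event_stop_time": 40, "event_end_reason": 2003},
    {"node_id": "n1", "event_start_time": 20, "event_stop_time": 220, "event_end_reason": 1001},
]

# Hypothetical node_id -> node_location lookup.
node_location = {"n1": "site-A", "n2": "site-B"}


def duration(event):
    """Event duration, in whatever unit the timestamps use."""
    return event["event_stop_time"] - event["event_start_time"]


def duration_category(event, threshold=100):
    """Split events into two duration categories; the threshold is arbitrary."""
    return "short" if duration(event) < threshold else "long"


# Which fault reason appears most often?
most_common_reason = Counter(e["event_end_reason"] for e in events).most_common(1)[0][0]

# What is the average event duration?
average_duration = sum(duration(e) for e in events) / len(events)

# For each duration category, how many events ended for each reason?
reasons_per_category = defaultdict(Counter)
for e in events:
    reasons_per_category[duration_category(e)][e["event_end_reason"]] += 1

# Which location records the most failures (joined through node_id)?
failures_per_location = Counter(node_location[e["node_id"]] for e in events)
```

Each of these is a group-and-count or group-and-average operation, so the same logic should map onto a MapReduce job (key = reason, category, or location) or onto a Pig GROUP followed by COUNT or AVG.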