
Introduction

This section defines the problem, that is, the reasons this project exists, and presents its goals.

Motivation

However, performance is not the only issue. Some applications are intrinsically suited to distributed processing, to the point where it is the only option, for instance when the data to be processed reaches gigabytes or even terabytes. Moreover, any application that must eventually scale is a candidate for distributed computing.

Goals

Identifying that the distributed computing approach suits an application is just the first step. First of all, one should not forget the underlying infrastructure the application will run on, as a poorly configured one will eventually lead to unsatisfactory results, regardless of how the problem is solved. Assuming the infrastructure is sound, an important decision is which framework, paradigm, programming language or library suits the application being developed. This choice affects not only performance, but also development time, how easily the model and the API are understood, parameter tuning and configuration time.

Hadoop Framework

Apache Hadoop is an open-source implementation of the popular MapReduce paradigm introduced by Google for large-scale data processing. Hadoop is written in Java, but it also supports specifying jobs in other languages, such as Python.
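
As a rough illustration of the paradigm itself, and not of Hadoop's own API, the sketch below runs the map, shuffle and reduce steps of a word count inside a single Python process. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. On a real cluster the framework distributes the map and reduce calls across machines and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; a MapReduce framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially (hypothetical single-process driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data everywhere"]))
```

The programmer only writes the map and reduce functions. Partitioning the input, grouping by key and recovering from failures are left to the framework, which is where much of the appeal of the model comes from.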

Hadoop provides a solid infrastructure consisting of a distributed filesystem and a centralized, fault-tolerant cluster architecture. Even though the framework is not yet very mature, several sub-projects have appeared that build on Hadoop's MapReduce and target different types of applications. We found the Pig project interesting: it offers a high-level scripting language, very similar to SQL, for analyzing huge data sets.

We are curious about two things: is this language really that good? And is it really that hard to design and implement data analysis problems in MapReduce?

MPI

The Message Passing Interface (MPI) is an open standard for implementing distributed applications, proposed by a broadly based committee of vendors, implementors, and users.

Last modified on Feb 19, 2010, 7:00:25 PM