
Introduction

This section defines the problem, explains the reasons for this project to exist, and presents its goals.

Motivation

The IT world has reached a turning point in its evolution: whether we look at the shift to parallel computing or at the emerging cloud computing model, it has become obvious that we have hit a performance ceiling when relying on a single, ever faster and more powerful machine. Hardware is still making progress, but it can no longer carry the mainstream forward by itself. If we want to continue the trend of rapidly improving information technology, we have to think in terms of the power of the group.

However, performance is not the only issue. Some applications are intrinsically suited to distributed processing, sometimes to the point where it is the only viable choice, for instance data processing when the volume reaches gigabytes or even terabytes. Moreover, any application that must eventually scale is a candidate for distributed computing.

Goals

Recognizing that a distributed computing approach suits an application is only the first step. One should not forget the underlying infrastructure the application will run on, as a poorly configured one will eventually lead to unsatisfactory results, regardless of how well the problem itself is solved. Assuming the infrastructure is sound, an important decision is which framework, paradigm, programming language, or library is suitable for the application being developed. This choice affects not only performance, but also development time, how easy the model and its API are to understand, and the time spent tuning parameters and configuring the system.

Hadoop Framework

Apache Hadoop is an open-source implementation of the MapReduce paradigm introduced by Google for large-scale data processing. Hadoop is written in Java, but it also supports specifying jobs in other languages, Python for example, through its streaming interface. Hadoop provides a solid infrastructure that includes a distributed filesystem (HDFS) and a centralized, fault-tolerant cluster architecture.
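
To give a feel for the programming model, below is the canonical word-count job written against Hadoop's Java MapReduce API. This is a sketch based on the standard Hadoop example; exact class names and signatures vary slightly between Hadoop versions.

{{{#!java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}

The framework splits the input across mappers, groups the emitted pairs by key, and feeds each group to a reducer, so the final output contains one (word, total) pair per distinct word; the programmer never writes any communication or scheduling code.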

MPI

The Message Passing Interface (MPI) is an open standard for implementing distributed applications, proposed by a broadly based committee of vendors, implementors, and users. An MPI program runs as a set of cooperating processes that communicate by exchanging explicit messages.
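
As a minimal sketch of the message-passing model, the fragment below sends one integer from process 0 to process 1. To stay in Java it assumes MPJ Express, one Java implementation of the mpiJava API (the capitalized method names such as Rank() and Send() follow that binding; in practice the C and C++ bindings are more common):

{{{#!java
import mpi.MPI;

public class PingExample {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);                    // start the MPI runtime
        int rank = MPI.COMM_WORLD.Rank();  // id of this process
        int size = MPI.COMM_WORLD.Size();  // total number of processes

        int[] buf = new int[1];
        if (rank == 0 && size > 1) {
            buf[0] = 42;
            // send one int from rank 0 to rank 1, message tag 0
            MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 1, 0);
        } else if (rank == 1) {
            // receive one int from rank 0, message tag 0
            MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0);
            System.out.println("rank 1 received " + buf[0]);
        }

        MPI.Finalize();                    // shut down the MPI runtime
    }
}
}}}

Unlike Hadoop, where the framework handles data movement, here every message is explicit: with MPJ Express installed, such a program would typically be launched on several processes with the library's mpjrun launcher, and each process decides what to send and receive based on its rank.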