Version 5 (modified by cristina.basescu, 14 years ago)



In this section we define the problem, that is, the reasons for this project to exist, and present its goals.


The IT world has reached a turning point in its evolution: whether we consider the shift to parallel computing or the emerging cloud computing model, it is clear that we have hit a performance ceiling when relying on a single, ever-faster, ever more powerful machine. Hardware is still making progress, but it can no longer carry mainstream computing by itself. If we want to continue the trend of rapidly improving information technology, we have to think in terms of 'the power of the group'.

However, performance is not the only issue. Some applications are intrinsically suited to distributed processing, sometimes to the point where it is the only choice, for instance processing data sets whose size exceeds gigabytes or even terabytes. Moreover, any application that must eventually scale is a candidate for distributed computing.


Identifying that a distributed computing approach suits an application is just the first step. First of all, one should not forget the underlying infrastructure the application will run on, as a poorly configured infrastructure will eventually lead to unsatisfactory results, regardless of how well the problem itself is solved. Assuming the infrastructure is sound, an important decision is which framework, paradigm, programming language, or library is suitable for the application being developed. This choice affects not only performance, but also development time, how easy the model and its API are to understand, parameter tuning, and configuration time.
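To illustrate how the choice of paradigm shapes a solution, the following is a minimal sketch of the map/reduce model, the paradigm that the Hadoop framework discussed below implements. It is written in plain Python rather than against any real framework API, and all function names here are our own, chosen for illustration only:

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in one input chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group intermediate values by key, as a framework would do
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# In a real deployment, each chunk would be mapped on a different node;
# here we simply iterate over the chunks in one process.
chunks = ["to be or not to be", "to scale or not to scale"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(pairs))
```

The point of the model is that `map_phase` and `reduce_phase` contain only application logic, while partitioning, distribution, and fault tolerance are left to the framework, which is exactly the trade-off one weighs when picking a paradigm.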

Hadoop Framework