Monday 29 May 2017

BIG DATA and HADOOP Introduction

The term BIG DATA refers to all the data that is being generated across the globe at an unprecedented rate. This data could be either structure or unstructured. Data drives the modern organizations of the world and hence making sense of this data and unrevealing the various patterns and revealing unseen connections within the vast sea of data becomes critical and a hugely rewarding endeavor indeed. There is a need to convert Big Data into business Intelligence that enterprises can readily deploy. Better data leads to better decision making and an improved way to strategize for organizations regardless of their size, geography, market share, customer segmentation and such other categorizations. Hadoop is the platform of choice for working with extremely large volumes of data. The most successful enterprises of tomorrow will be the one that can make sense of all that data at extremely high volumes and speeds in order to capture newer markets and customer base.

Big Data is typically characterized by 3 V's

Volume: Data is collected from a variety of sources, which may include business transactions, social media and even information from sensors by organizations. Storing these data would have been a problem in the past. But new technologies (such as Hadoop) have made this an easy task.

Velocity: Velocity typically indicates the rapid speed at which data is transferred or received. This can be understood just by visualizing the amount of data in terms of comments, likes, video uploads and tags, that is handled by the social networking sites like Facebook in just one hour.

Variety: Data can be present in any type of formats. It can be in structured format that is numeric data in traditional databases to unstructured documents such as text, video, email, audio, data from financial transactions.


Advantages Of BIG DATA

Big Data is really critical to our life and its emerging as one of the most important technologies in modern world.
  • Using the information kept in the social network like Facebook, the marketing agencies are learning about the responsive for their campaigns, promotions and other advertising mediums.
  • Using the information in the social media like preferences and product perception of their consumers, product companies and retail organizations are planning their production.
  • Using the Data regarding the previous medical history of patients, hospitals are providing better and quick service.

BIG DATA Challenges

The major challenges which are associated with Big Data
  • Capturing Data
  • Storing Data
  • Curation
  • Searching Data
  • Sharing Data
  • Transferring Data
  • Analysis of the previously stored Data
  • Presentation

HADOOP

Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop frameworke application works in an environment that provides distributed storage and computationn across clusters of computers. Hadoop is designed to scale up from server to thousands of machines, each offering local computation and storage.

It contains two modules

MapReduce: It can be understood as a parallel programming model used for processing large amounts of any type of data whether it is structured, semi-structured or unstructured on large clusters of commodity hardware.


HDFS: Hadoop Distributed File System is also a part of Hadoop framework, which is used to store and process the datasets. It provides a fault tolerant file system to run on commodity hardware.


Advantages of Hadoop

  • Hadoop framework allows the user to quickly write and test distrbuted systems. It is efficient and it automatic distributes the data and work across the machines and in turn, utilizes the underlying parallelism of CPU cores.
  • Hadoop does not rely on hardware to provide fault tolerance and high availability, rather Hadoop library itself has been designed to detect and handle failures at the application layer.
  • Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
  • Hadoop is that apart from being open source, it is compatible on all platforms since it is Java based.

No comments:

Post a Comment