Paper Title
An Improved Partitioning Mechanism For Optimizing Massive Data Analysis Using Map Reduce
Abstract
Big data is a popular term for data sets that are too large and complex to handle with conventional tools. Traditional databases
cannot be used to process such data, which may be structured or unstructured. With the growth of big data, many companies and users
have started to move their data to cloud storage in order to simplify data management and reduce data maintenance costs. In most
companies the data is too big or moves too fast, and it exceeds current processing capacity. Despite these problems,
big data can help companies improve operations and make faster and more intelligent decisions. MapReduce is
a programming model and an associated implementation for processing and generating large data sets with
a parallel, distributed algorithm on a cluster. The MapReduce model has two parts: the first is Map
and the second is Reduce. The Map function allows different nodes of the distributed cluster to distribute their
work, and the Reduce function is designed to reduce the cluster's results into one final output. The problems of unbalanced load
caused by data skew (i.e., data is generated in varying amounts) can be avoided by using data sampling. Data
sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points
in order to identify patterns and trends in the larger data set being examined. Load balancing is used to optimize resource use,
maximize throughput, minimize response time, and avoid overloading any single resource. How evenly the partitioner
distributes the data depends on how large and representative the sample is and on how well the
samples are analyzed by the partitioning mechanism. This project proposes an improved partitioning algorithm that overcomes load imbalance, reduces memory consumption,
and improves the partitioning mechanism.
Index Terms— Big Data, Hadoop, HDFS, MapReduce, Data sampling, Partitioning.
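
The idea of using a data sample to balance partition load can be illustrated with a minimal sketch. It is not the algorithm proposed in this paper. It only shows generic sample-based range partitioning, assuming integer keys and a simple random sample. The function names (choose_boundaries, partition_for, partition_loads) are hypothetical and chosen for illustration.

```python
import bisect
import random
from collections import Counter


def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split keys at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


def partition_loads(keys, boundaries):
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(k, boundaries) for k in keys)
    return [counts.get(p, 0) for p in range(len(boundaries) + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed synthetic keys (assumption: a Pareto distribution stands in for data skew).
    keys = [int(rng.paretovariate(1.5) * 10) for _ in range(100_000)]

    # Naive equal-width key ranges: most keys fall into the first partition.
    lo, hi = min(keys), max(keys)
    naive = [lo + (hi - lo) * i // 4 for i in range(1, 4)]
    print("equal-width loads: ", partition_loads(keys, naive))

    # Boundaries taken from a small random sample follow the key distribution.
    sample = rng.sample(keys, 1_000)
    sampled = choose_boundaries(sample, 4)
    print("sample-based loads:", partition_loads(keys, sampled))
```

In this sketch the balance of the result depends on the size and representativeness of the sample, which mirrors the dependence described in the abstract. Heavily repeated keys can still overload one partition, because a single key is never split across partitions.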