Paper Title
Reducing Data Shuffling and Improving Map Reduce Performance using Enhanced Data Locality
Abstract
MapReduce is among the most widely used parallel-processing solutions for handling big data, and Hadoop is an open-source software implementation of the MapReduce concept. The shuffling phase in MapReduce adds performance overhead. In this work, we focus on improving the performance of MapReduce by reducing the overhead caused by the shuffling phase. Improving data locality reduces the network overhead of the shuffling phase. We achieve this by pre-partitioning data according to query similarity, computed with TF-IDF and cosine similarity, and by grouping related queries using the K-means clustering algorithm. We then provide HDFS with the related data and control where data are stored, so that related data files are collocated on the same nodes. In this paper, we present our new approach, called the enhance locality and reduce shuffling (ELRS) technique, which speeds up MapReduce processing and thus reduces execution time. Experimental results from our implementation of the ELRS technique show an improvement in data locality of around 27.2% and a reduction in the execution time of Hadoop jobs of around 40.1%.
Keywords – Map Reduce, Hadoop, Shuffling, Locality, Similarity, Big Data.
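As a rough illustration of the query-grouping step named in the abstract, the following Python sketch computes TF-IDF vectors for a few query strings, compares them by cosine similarity, and groups them with a small K-means loop. It is a minimal sketch under stated assumptions, not the ELRS implementation: the sample queries, the smoothed IDF formula, the seeding of centroids with the first k vectors, and the `node-<cluster>` placement names are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(queries):
    """Return one sparse TF-IDF vector (dict: term -> weight) per query."""
    docs = [Counter(q.lower().split()) for q in queries]
    n = len(docs)
    # Document frequency: number of queries that contain each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        # Smoothed IDF (an assumption), so terms found in every query keep a weight.
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans_cosine(vectors, k, iterations=10):
    """K-means variant: assign each vector to its most cosine-similar centroid."""
    # Deterministic seeding (an assumption): the first k vectors are the centroids.
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            mean = Counter()
            for v in members:
                for term, weight in v.items():
                    mean[term] += weight / len(members)
            centroids[c] = dict(mean)
    return labels


# Hypothetical queries; in practice these would come from the job workload.
queries = [
    "select total sales by region",
    "select user clicks by page",
    "select total sales by product",
    "select user clicks by session",
]
labels = kmeans_cosine(tfidf_vectors(queries), k=2)

# Hypothetical placement: data for queries in one cluster goes to one node.
placement = {query: f"node-{label}" for query, label in zip(queries, labels)}
for query, node in placement.items():
    print(node, "<-", query)
```

With these sample queries, the two sales queries fall into one cluster and the two clicks queries into the other, so each pair would be assigned to the same hypothetical node. Actual placement of files in HDFS depends on its block placement policy, which this sketch does not model.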