Research Paper: Parallel Computing Solutions – Hadoop MapReduce


This was a research paper that we submitted to ICPADS 2012, an IEEE conference on parallel and distributed computing. It describes a MapReduce-based solution to the maze traversal problem, which is applicable to many practical problems.

(Authors: Shantanu Deo, Sricharan R, Pradeep Thammaneni, Mujtaba Ahmed)

(Status: The paper was rejected for publication on technical grounds.)

Abstract—MapReduce is a popular model for developing scalable parallel programs that run on a cluster of commodity computers. Although it is traditionally used for processing Big Data and other data-heavy workloads, it is a challenge to adapt this programming platform to computationally intensive problems that run large instruction sets over a small data set (Multiple Instruction, Small Data – MISD). This document describes a parallel approach to finding all the possible paths in a maze using Hadoop MapReduce. Parallelism is achieved by splitting the inputs and distributing them over the nodes in the cluster. The simulation programs we developed showed that a route validation/rejection approach was faster and more reliable than a sequential route-building approach.

Keywords– Parallel Processing, Hadoop, MapReduce, Number Systems, Maze Traversal, Input Splitting, Virtual Machine Computing, Multidimensional Graph, Multiple Instruction and Small Data

I. INTRODUCTION
Mazes, typically a set of interconnected roads with many branches and confusing pathways, can be conceived of as a connected graph in which the entrance to the maze leads to all the possible exits and dead ends. Solving a maze is equivalent to traversing the equivalent graph from the node corresponding to the maze entrance to the node corresponding to the maze exit. Many methods have been developed for solving a maze, such as Dijkstra's algorithm, Tarry's algorithm, breadth-first search, depth-first search, etc.
While these have been investigated and tested for sequential computing on a single machine, they have not been fully adapted to a parallel programming platform [1]. The following work implements a breadth-first search and a depth-first search on Apache's Hadoop MapReduce framework and compares their performance.

Parallel computing poses many challenges, such as, but not limited to, concurrency in a program (how to make different processors work simultaneously), data localization (making the data available at the right place) and task scheduling at the right granularity (dividing the tasks into the best sizes and executing them in parallel). The MapReduce framework proposed by Google [2] simplifies the development of parallel programs by providing a skeletal framework in which the programmer represents the task as a series of operations on a large data set distributed over many nodes in a cluster. Each chunk of data, known as a block, is typically 64 to 256 megabytes in size. This enables the framework to break up huge data sets on the order of terabytes and perform simple operations on them using the processing power of hundreds or even thousands of computing nodes in the cluster. If the data set is smaller than the block size of the framework's file system, as in our case, the default methodology for parallelism fails and the whole task runs on a single node. To perform complex instructions on a smaller data set, parallelism has to be achieved by adjusting the way the data is split among the different computing nodes. This is done by manually specifying how MapReduce should split the data and making it available on every node.

II. MAZE REPRESENTATION
The mazes dealt with in this paper are basically rectangular 2-D mazes. An example of the representation we have used is shown below.

Figure 1. An example maze text file

In this representation, the x's represent the walls in the maze and the o's represent the available paths to traverse; 'i' represents the entrance and 'e' represents the exit of the maze. It is possible to represent any two-dimensional maze as a square maze (i.e. a maze where a maximum of four paths meet at a point): any point at which more than four paths meet can be changed into a combination of two or more connected points, each having no more than four paths meeting at it, without changing the connectedness of the maze. This representation is easy to visualize, and we can see the maze itself.

III. NODE AND ROUTE REPRESENTATION
For a square maze, each point (henceforth referred to as a maze node or simply a node) has no more than four paths meeting at it. Each path connects the node to another node in the maze; these nodes are called the neighbors of the first node. In this representation, each node therefore has no more than four neighbors (above, below, left and right): we treat each 'o' as a node and the surrounding 'o's as its neighbors. We assign an integer identifier to each node in the order in which it is encountered in the file. A Java class is created to hold all the details of a node, and all the nodes are stored in an array. A route is represented as a string containing the identifiers of the nodes in the order in which they were visited, delimited by hyphens ('-').
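As a concrete illustration, the numbering and route-string scheme can be sketched in a few lines of plain Java. This is a minimal single-machine sketch, not the paper's actual Hadoop code; the class name, method names and the small maze below are our own.

```java
import java.util.Arrays;

// Sketch of the representation above: every non-wall cell becomes a node,
// numbered in the order it appears in the file; a route is a
// hyphen-delimited string of node identifiers. The maze is hypothetical.
public class MazeParser {
    static final char[][] MAZE = {
        {'x','x','x','x'},
        {'x','o','o','x'},
        {'i','o','x','e'},
        {'x','x','x','x'}
    };

    // Number the nodes in file order; -1 marks a wall cell.
    static int[][] numberNodes() {
        int[][] id = new int[MAZE.length][MAZE[0].length];
        for (int[] row : id) Arrays.fill(row, -1);
        int next = 0;
        for (int r = 0; r < MAZE.length; r++)
            for (int c = 0; c < MAZE[0].length; c++)
                if (MAZE[r][c] != 'x') id[r][c] = next++;
        return id;
    }

    // Build the route entrance (2,0) -> (2,1) -> (1,1) as an id string.
    static String buildRoute() {
        int[][] id = numberNodes();
        return id[2][0] + "-" + id[2][1] + "-" + id[1][1];
    }

    public static void main(String[] args) {
        System.out.println(buildRoute());   // prints 2-3-0
    }
}
```

Here the entrance 'i' happens to receive identifier 2 because two 'o' cells precede it in file order, so the three-node route prints as "2-3-0".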
IV. APPROACH 1: BREADTH-FIRST SEARCH (Later Discarded)
A. Preprocessing:
The maze text-file is read from the hadoop file-system. The nodes are given their identifiers and their neighbors are estimated.
B. Solution:
Here, we adapt the traditional breadth-first search to our problem. The first mapper receives the entrance and emits all untraversed neighbors with their routes. The reducer checks whether each route is complete (a dead end or the exit) or incomplete (the last node still has untraversed neighbors) and marks it appropriately.
Each subsequent mapper takes a route and appends to it the neighbors of its last node [4]. This goes on until all the routes have reached completion. Parallelism is achieved by sending a desired number of input routes to each mapper. The disadvantage of this method is the sequential scheduling of MapReduce jobs: since each input to a job has to be in a text file, the hard-disk reads and writes cause significant delay. The intermediate routes can also be very many in number, due to the exponential growth of the problem before convergence, which can crash the file system, as was observed in our trials.

Figure 2. Flowchart for breadth-first search

C. Implementation (pseudo code):
1. add entrance to route
2. write route to input file 0
3. let i = 0
4. while all routes are not complete, do for i in steps of 1
   a. open file i
   b. check all routes
   c. if all routes are not finished
      i) map phase with input file i
      ii) reduce phase with output in file (i+1)

Map phase:
1. for each route in the input file, do
   a. for each neighbor of the last node in the route that is not already present in the route, do
      i) add the neighbor to the route
      ii) emit the route with status 'unfinished'
   b. if the last node is the exit, emit the route with status 'finished'

Reduce phase:
1. for each route from the map phase, do
   a. emit the route and its status into the output file
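The iterative expand-and-check loop above can be simulated on a single machine. The sketch below is our own illustration, not the Hadoop job itself: each pass of the while loop plays the role of one MapReduce iteration, the adjacency list is a hypothetical small graph, and dead-end routes are simply dropped rather than marked.

```java
import java.util.*;

// Single-machine simulation of the breadth-first route expansion:
// every unfinished route is extended by the untraversed neighbors of
// its last node; routes ending at the exit are collected as finished.
public class BfsSim {
    // Hypothetical adjacency list: node id -> neighbor ids.
    // 0 is the entrance, 2 is a dead end, 4 is the exit.
    static final Map<Integer, int[]> NEIGHBORS = Map.of(
        0, new int[]{1},
        1, new int[]{0, 2, 3},
        2, new int[]{1},
        3, new int[]{1, 4},
        4, new int[]{3}
    );

    static List<String> findRoutes(int entrance, int exit) {
        List<String> unfinished = new ArrayList<>(List.of(Integer.toString(entrance)));
        List<String> finished = new ArrayList<>();
        while (!unfinished.isEmpty()) {          // one pass per MapReduce job
            List<String> next = new ArrayList<>();
            for (String route : unfinished) {
                String[] ids = route.split("-");
                int last = Integer.parseInt(ids[ids.length - 1]);
                if (last == exit) { finished.add(route); continue; }
                for (int n : NEIGHBORS.get(last))
                    if (!Arrays.asList(ids).contains(Integer.toString(n)))
                        next.add(route + "-" + n);   // extend the route
            }
            unfinished = next;   // in Hadoop: written to the next job's input file
        }
        return finished;
    }

    public static void main(String[] args) {
        System.out.println(findRoutes(0, 4));   // prints [0-1-3-4]
    }
}
```

Note how the route list can grow on every pass before it converges; in the real cluster each pass also pays a full job-scheduling and file read/write cost, which is the bottleneck described above.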

Figure 3. Example maze text file whose routes grow exponentially.

Table 1. Node representation
Figure 4. Progression chart showing how routes grow exponentially

V. APPROACH 2: DEPTH-FIRST SEARCH (Later Finalized)
A. Preprocessing:
The text file is read and the number of nodes is determined. Here, we adapt the traditional depth-first search to our problem.
B. Solution:
The challenge in the case of a depth-wise search was to parallelize the search. In the Hadoop framework, it is not possible for a node in the cluster to communicate with another node while working on a task, so a node cannot know what every other node is working on. The challenge is to develop a way in which each route can be sent to an appropriate mapper for validation.

To address this issue, we represent each route as a sequence of quaternary digits, with {0, 1, 2, 3} representing the neighbor that is to the {top, bottom, right, left} of the last traversed node in the route. If a route is written as 02112133, it means that we start from the entrance, go up, go right, go down, go down, go right, go down, go left and go left. This way, we achieve a one-to-one mapping from the set of routes to the set of quaternary numbers, and each mapper gets a range of quaternary numbers to check. Splitting the input in this way is not supported by Hadoop's default input format [3]; to perform an input split where the split size is smaller than the Hadoop file-system block size, we had to write a custom InputFormat class. Every unique valid route that reaches an exit is emitted. This gives us flexibility in deciding which mapper gets which set of numbers to check, and since each number is unique and none is repeated, each possible combination can be checked independently of the others. Every mapper can thus work independently and in parallel. This approach is faster than the breadth-wise search because hard-disk operations are performed only when the program is initialized and at the end, when all the outputs have been generated.

The difference in achieving parallelism, compared to the method described in section IV, is that here each mapper receives a range of quaternary numbers to check rather than a set of partially built routes.
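The digit-to-direction decoding can be sketched in a few lines of plain Java (our own helper class, not the paper's mapper code), using the {0, 1, 2, 3} = {top, bottom, right, left} convention from the text:

```java
// Sketch: decode a quaternary route string into direction names,
// with digit 0 = up, 1 = down, 2 = right, 3 = left.
public class QuaternaryRoute {
    static final String[] DIRS = {"up", "down", "right", "left"};

    static String decode(String quaternary) {
        StringBuilder sb = new StringBuilder("start");
        for (char d : quaternary.toCharArray())
            sb.append("-").append(DIRS[d - '0']);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The example route from the text.
        System.out.println(decode("02112133"));
        // prints start-up-right-down-down-right-down-left-left
    }
}
```

Decoding the example 02112133 reproduces exactly the walk described in the text: up, right, down, down, right, down, left, left.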

Figure 5. Flowchart for depth-first search
Figure 6. Example representation of directions in quaternary digits

C. Implementation (pseudo code):
1. numNodes = get the number of nodes in the maze
2. numSplits = get the number of mappers required
3. for i = 0 to numSplits-1, do
   a. set the input for mapper i as {i*(4^numNodes)/numSplits, (i+1)*(4^numNodes)/numSplits}
4. map phase
5. reduce phase
6. end

Map phase:
1. for each route_number in the range, do
   a. for each route digit in route_number, do
      i) if route.lastNode.neighbor[digit] is not already present in the route, add lastNode.neighbor[digit] to the route
      ii) if the exit is encountered, emit the route with status 'finished'
   b. get the next route_number

Reduce phase:
1. for each output record from the map phase, do
   a. write the record to the output file
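Step 3 of the driver pseudocode — dividing the full quaternary space 0 … 4^numNodes − 1 evenly among the mappers — can be sketched with BigInteger arithmetic. This is our own illustrative helper, not the custom InputFormat class itself:

```java
import java.math.BigInteger;

// Sketch of the input-split computation: mapper i is assigned the range
// { i * 4^numNodes / numSplits , (i+1) * 4^numNodes / numSplits }.
public class SplitRanges {
    static BigInteger[] rangeFor(int i, int numNodes, int numSplits) {
        BigInteger total = BigInteger.valueOf(4).pow(numNodes);
        BigInteger lo = total.multiply(BigInteger.valueOf(i))
                             .divide(BigInteger.valueOf(numSplits));
        BigInteger hi = total.multiply(BigInteger.valueOf(i + 1))
                             .divide(BigInteger.valueOf(numSplits));
        return new BigInteger[]{lo, hi};
    }

    public static void main(String[] args) {
        // 9 nodes, 4 mappers: 4^9 = 262144 route numbers, 65536 per mapper.
        for (int i = 0; i < 4; i++) {
            BigInteger[] r = rangeFor(i, 9, 4);
            System.out.println("mapper " + i + ": [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

BigInteger is used because 4^numNodes overflows a long for even modestly sized mazes; the ranges are contiguous and non-overlapping, which is what lets every mapper work without coordinating with the others.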
For the maze in figure 3, when a mapper checks a set of directions in quaternary notation from 000000000000000000 to 002200000000000000, the routes are checked in the following manner.

1. first digit 0: go up from the entrance: (3,1) → (2,1)
2. second digit 0: go up: (3,1) → (2,1) → (1,1)
3. third digit 0: go up: not possible; get the next direction
4. next direction: 001200000000000000
5. third digit 1: go down: back to (2,1), already visited; get the next direction
6. next direction: 002200000000000000
7. third digit 2: go right: (3,1) → (2,1) → (1,1) → (1,2)
8. fourth digit 2: go right: (3,1) → (2,1) → (1,1) → (1,2) → (1,3)
9. fifth digit 0: go up: not possible; get the next direction
10. next direction: 002210000000000000, which is out of the range of the current mapper; exit the mapper.

VI. RESULTS
We tested both the BFS and DFS codes on a cluster of two virtual machines, using Cloudera's CDH3 VM with 2 GB of memory and 5 GB of hard disk on each node.

For the input shown in figure 3, the breadth-first search ran for 2 minutes and 43 seconds, taking 10 iterations to find all the exits, while the depth-first search ran for 28 seconds with 4 mappers. For the maze in figure 7, the breadth-first search ran for 3 minutes and 32 seconds and the depth-first search for 21 seconds with 4 mappers.

Figure 7. Example maze text file with only one route to the exit.


In this case (figure 7), we see that there is only one accessible route to the exit, namely (4,1) → (4,2) → (4,3) → (4,4) → (4,5) → (4,6) → (4,7) → (3,7) → (2,7) → (2,8) → (2,9). The corresponding directions in quaternary number notation are 2222220022.
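This encoding can be checked mechanically: replaying the digits of 2222220022 as moves from the entrance (4,1) should land on the exit (2,9). A small sketch (our own helper, using the {0, 1, 2, 3} = {up, down, right, left} convention from section V):

```java
import java.util.Arrays;

// Sketch: replay a quaternary route string as (row, col) moves,
// with 0 = up, 1 = down, 2 = right, 3 = left.
public class RouteCheck {
    static int[] replay(int row, int col, String route) {
        for (char d : route.toCharArray()) {
            switch (d) {
                case '0': row--; break;  // up decreases the row index
                case '1': row++; break;  // down
                case '2': col++; break;  // right
                case '3': col--; break;  // left
            }
        }
        return new int[]{row, col};
    }

    public static void main(String[] args) {
        int[] end = replay(4, 1, "2222220022");
        System.out.println("(" + end[0] + "," + end[1] + ")");  // prints (2,9)
    }
}
```

Six rights take (4,1) to (4,7), two ups reach (2,7) and two more rights reach the exit at (2,9), matching the route listed above.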

Figure 8. Example maze text file with two routes to the exit


In the case of figure 8, there are 2 possible routes: (4,1) → (4,2) → (4,3) → (4,4) → (4,5) → (4,6) → (4,7) → (3,7) → (2,7) → (2,8) → (2,9) and (4,1) → (4,2) → (4,3) → (4,4) → (4,5) → (4,6) → (4,7) → (5,7) → (5,8) → (5,9) → (4,9) → (3,9) → (2,9). The corresponding directions in quaternary number notation are 2222220022 and 222222122000.
VII. CONCLUSION AND FUTURE WORK
In this work, we explored a depth-first search and a breadth-first search on a distributed network with Hadoop MapReduce. The performance was benchmarked on a two-node VM cluster, with each instance running on a separate machine. The main goal was to find a way of listing all the exits to a maze. The performance comparison gives insight into the inherent shortcomings of the Hadoop MapReduce model, especially in chaining MapReduce jobs. Listed below are some future optimizations we hope to explore.

A. Alternate programming models:
We hope to test out and compare maze solvers on different parallel computing platforms like HPCC.

B. Larger Clusters:

The benchmarking can be repeated on larger clusters, which should demonstrate better parallel performance.

C. Application to Social Networking:
We hope to adapt this methodology to find different ways in which people are connected on social networks like Facebook and LinkedIn to help like-minded people meet and exchange ideas.


