MapReduce – Unwinding Map

In the last discussion on MapReduce, we covered the algorithm Hadoop uses for data processing with MapReduce.
In this blog, we will discuss the MAP phase of MapReduce and its functionality.

Unwinding Map

We will explain this in detail, with an example.

Example:

The Scenario:

  • We have a 7-node cluster where 1 node is the Name Node (NN) and the remaining 6 nodes are Data Nodes (DN). Of these 6 nodes, one is already down and unavailable for processing. Hence, only 5 nodes are available to take part as DNs.
  • We have configured the replication factor to 2.
  • We have the default block size configured to 64 MB via the dfs.block.size property (a configuration sketch follows this list).
  • We will execute a WordCount program to find the count of each word that appears in our data file.
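
For reference, here is a minimal sketch of how those two settings could be applied from a Java client; the same values are more commonly placed in hdfs-site.xml. The class name is illustrative, and the property names assume Hadoop 2.x or later, where dfs.blocksize supersedes the older dfs.block.size alias.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: apply the block size and replication factor from this scenario.
public class ClusterConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB blocks
        conf.setInt("dfs.replication", 2);                // replication factor 2
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size in effect: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
    }
}
```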

The Data (sample):

"With the use of distributed parallel processing design Hadoop overcome these scenarios. This can be configured using low cost commodity machines. This technique relies on horizontal scaling of hardware. Horizontal scaling can be achieved by adding new commodity machines without any limitations. The best part of this type of hardware is fault tolerant and can accommodate one or more machine failure without much performance hit. By harnessing the true capability of distributed parallel processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very low granular level but also speeds up its processing on decentralized data to the maximum. By enabling Hadoop an organization, with huge data, can get its analysis within minute for which it was waiting for hours in traditional ETL with very less data."
  • For this example, assume the size of this content is 612 MB.

Explanation of Data (Sample):

Now that we have defined our scenario, let's proceed to the processing part.
First of all, the data is split into chunks (packets) of 64 MB each by the NN. Because the data size is 612 MB, the last packet will be smaller than 64 MB. Let's see the distribution of these packets (with their contents and sizes) done by the NN for further processing. The content of each packet given here is hypothetical (just for the sake of the example).
Total size: 612 MB
Packet size: 64 MB
Details of packets with their sizes:

PKT-1: “With the use of distributed parallel processing design Hadoop overcome” — 64 MB
PKT-2: “these scenarios. This can be configured using low cost commodity machines. This” — 64 MB
PKT-3: “technique relies on horizontal scaling of hardware. Horizontal scaling can be” — 64 MB
PKT-4: “achieved by adding new commodity machines without any limitations. The best part of this” — 64 MB
PKT-5: “type of hardware is fault tolerant and can accommodate one or more machine failure without” — 64 MB
PKT-6: “much performance hit. By harnessing the true capability of distributed parallel” — 64 MB
PKT-7: “processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very” — 64 MB
PKT-8: “low granular level but also speeds up its processing on decentralized data to the maximum.” — 64 MB
PKT-9: “By enabling Hadoop an organization, with huge data, can get its analysis within minute for” — 64 MB
PKT-10: “which it was waiting for hours in traditional ETL with very less data.” — 36 MB

Summary of data explanation:
No. of packets: 10 (64 MB × 9 = 576 MB, plus 1 × 36 MB = 36 MB; total 612 MB).
Hence, a total of 10 packets are created, of which 9 are 64 MB and 1 is 36 MB. Given the replication factor of 2, there will be 20 packets in total (9 × 2 = 18 and 1 × 2 = 2) to distribute among the DNs, where 10 are for processing and the remaining 10 maintain the replication factor. It is the NN that decides which packet goes to which DN. Let's see the distribution of the packets on the DNs.
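
The same arithmetic as a runnable sketch (the class name is illustrative):

```java
// Packet and replica arithmetic for a 612 MB file with 64 MB packets
// and replication factor 2. Mirrors the summary above.
public class PacketMath {
    public static void main(String[] args) {
        long totalMb = 612, packetMb = 64;
        int replication = 2;
        long fullPackets = totalMb / packetMb;              // 9 packets of 64 MB
        long tailMb = totalMb % packetMb;                   // one 36 MB tail packet
        long packets = fullPackets + (tailMb > 0 ? 1 : 0);  // 10 packets in total
        long withReplicas = packets * replication;          // 20 packets on the DNs
        System.out.printf("packets=%d (tail=%d MB), with replicas=%d%n",
                packets, tailMb, withReplicas);
    }
}
```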

For each node below, the packets assigned for processing are listed first, followed by the packets held to maintain the replication factor.
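
HDFS's actual placement policy is rack-aware and more involved; purely as a toy illustration of "the NN decides which packet goes to which DN" (the round-robin choice below is an assumption for this example, not the real algorithm):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: for each packet, the NN picks one DN to process it
// and a different DN to hold the redundant copy (replication factor 2).
public class ToyPlacement {
    public static void main(String[] args) {
        // DN-2 is down, so only five live nodes take part.
        List<String> live = Arrays.asList("DN-1", "DN-3", "DN-4", "DN-5", "DN-6");
        for (int p = 1; p <= 10; p++) {
            String primary = live.get((p - 1) % live.size());
            String replica = live.get(p % live.size()); // always a different node
            System.out.printf("PKT-%d -> %s (processing), %s (redundancy)%n",
                    p, primary, replica);
        }
    }
}
```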

Packets distribution on nodes:

DN-1:

DN-1 is assigned packets 1 & 6 for processing; packets 5 & 10 are held for redundancy.

PKT-1: With the use of distributed parallel processing design Hadoop overcome
PKT-6: much performance hit. By harnessing the true capability of distributed parallel
PKT-5: type of hardware is fault tolerant and can accommodate one or more machine failure without
PKT-10: which it was waiting for hours in traditional ETL with very less data.

DN-2:
Node not available.

DN-3:

DN-3 is assigned packets 4 & 9 for processing; packets 2 & 7 are held for redundancy.

PKT-4: achieved by adding new commodity machines without any limitations. The best part of this
PKT-9: By enabling Hadoop an organization, with huge data, can get its analysis within minute for
PKT-2: these scenarios. This can be configured using low cost commodity machines. This
PKT-7: processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very

DN-4:

DN-4 is assigned packets 3 & 8 for processing; packets 1 & 6 are held for redundancy.

PKT-3: technique relies on horizontal scaling of hardware. Horizontal scaling can be
PKT-8: low granular level but also speeds up its processing on decentralized data to the maximum.
PKT-1: With the use of distributed parallel processing design Hadoop overcome
PKT-6: much performance hit. By harnessing the true capability of distributed parallel

DN-5:

DN-5 is assigned packets 2 & 7 for processing; packets 4 & 9 are held for redundancy.

PKT-2: these scenarios. This can be configured using low cost commodity machines. This
PKT-7: processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very
PKT-4: achieved by adding new commodity machines without any limitations. The best part of this
PKT-9: By enabling Hadoop an organization, with huge data, can get its analysis within minute for

DN-6:

DN-6 is assigned packets 5 & 10 for processing; packets 3 & 8 are held for redundancy.

PKT-5: type of hardware is fault tolerant and can accommodate one or more machine failure without
PKT-10: which it was waiting for hours in traditional ETL with very less data.
PKT-3: technique relies on horizontal scaling of hardware. Horizontal scaling can be
PKT-8: low granular level but also speeds up its processing on decentralized data to the maximum.

Once these packets are distributed, each DN starts its processing.

MAP’s role:

Here is where MAP comes into the picture.

On each DN, MAP starts its execution by splitting the text it received into individual words. By doing this, it produces a list of words as output [L <K>] on each DN.

[L <K>] :

List of words as per our submitted text file:
DN-1: “With”, “the”, “use”, “of”, “parallel”, “processing”, “design”, “Hadoop”, “overcome”, “distributed”, “much”, “performance”, “.”, “hit”, “By”, “harnessing”, “the”, “true”, “capability”, “of”, “distributed”, “parallel”
DN-2: No assignment is done, as this node is dead.
DN-3: “achieved”, “by”, “adding”, “new”, “commodity”, “machines”, “without”, “any”, “the”, “best”, “part”, “of”, “this”, “limitations”, “By”, “enabling”, “Hadoop”, “an”, “organization”, “with”, “,”, “huge”, “data”, “,”, “can”, “get”, “its”, “analysis”, “within”, “minute”, “for”
DN-4: “technique”, “relies”, “on”, “horizontal”, “scaling”, “of”, “hardware”, “.”, “Horizontal”, “scaling”, “can”, “be”, “low”, “granular”, “level”, “but”, “also”, “speeds”, “up”, “its”, “processing”, “on”, “decentralized”, “data”, “to”, “the”, “maximum”
DN-5: “these”, “scenarios”, “.”, “This”, “can”, “be”, “configured”, “using”, “low”, “cost”, “commodity”, “machines”, “.”, “This”, “processing”, “Hadoop”, “not”, “only”, “enables”, “organization”, “to”, “decentralize”, “its”, “data”, “and”, “allow”, “them”, “to”, “capture”, “data”, “at”, “very”
DN-6: “type”, “of”, “hardware”, “is”, “fault”, “tolerant”, “and”, “can”, “accommodate”, “one”, “or”, “more”, “machine”, “failure”, “without”, “which”, “it”, “was”, “waiting”, “for”, “hours”, “in”, “traditional”, “ETL”, “with”, “very”, “less”, “data”

[L <K>] with word count:

During this split, MAP not only prepares the list of words [L<K>] on each DN but also marks a count of occurrence (1) against each word.

This split and count is done without knowing which word comes next; each occurrence is counted in isolation, since MAP is not aware of the surrounding words.
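
A minimal sketch of the mapper this describes, essentially the canonical Hadoop WordCount mapper: it tokenizes its input split and emits (word, 1) for every token, never consulting neighboring words.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in the input split. Each occurrence is
// counted independently -- the mapper never looks at the previous or next
// word, exactly as described above.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // mark 1 against each occurrence
        }
    }
}
```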

Work done by MAP:

Let's see the work done by MAP on all nodes.

Map of DN-1:

NODE – 1 [L (K, V)]

PKT-1 (K)   | V | PKT-6 (K)   | V
With        | 1 | much        | 1
the         | 1 | performance | 1
use         | 1 | .           | 1
of          | 1 | hit         | 1
parallel    | 1 | By          | 1
processing  | 1 | harnessing  | 1
design      | 1 | the         | 1
Hadoop      | 1 | true        | 1
overcome    | 1 | capability  | 1
distributed | 1 | of          | 1
            |   | distributed | 1
            |   | parallel    | 1

Map of DN-2:

NODE – 2 [L (K, V)] — no map output, as this node is not available.

Map of DN -3:

NODE – 3 [ L (K, V)]

PKT-4(K)

V

PKT-9(K)

V

achieved

1

1

by

1

By

1

adding

1

enabling

1

new

1

Hadoop

1

commodity

1

an

1

machines

1

organization

1

without

1

with

1

any

1

,

1

the

1

huge

1

best

1

data

1

part

1

,

1

of

1

can

1

this

1

get

1

limitations

its

1

analysis

1

within

1

minute

1

for

1

Map of DN-4:

NODE – 4 [L (K, V)]

PKT-3 (K)  | V | PKT-8 (K)     | V
technique  | 1 | low           | 1
relies     | 1 | granular      | 1
on         | 1 | level         | 1
horizontal | 1 | but           | 1
scaling    | 1 | also          | 1
of         | 1 | speeds        | 1
hardware   | 1 | up            | 1
.          | 1 | its           | 1
Horizontal | 1 | processing    | 1
scaling    | 1 | on            | 1
can        | 1 | decentralized | 1
be         | 1 | data          | 1
           |   | to            | 1
           |   | the           | 1
           |   | maximum       | 1

Map of DN-5:

NODE – 5 [L (K, V)]

PKT-2 (K)  | V | PKT-7 (K)    | V
these      | 1 | processing   | 1
scenarios  | 1 | Hadoop       | 1
.          | 1 | not          | 1
This       | 1 | only         | 1
can        | 1 | enables      | 1
be         | 1 | organization | 1
configured | 1 | to           | 1
using      | 1 | decentralize | 1
low        | 1 | its          | 1
cost       | 1 | data         | 1
commodity  | 1 | and          | 1
machines   | 1 | allow        | 1
.          | 1 | them         | 1
This       | 1 | to           | 1
           |   | capture      | 1
           |   | data         | 1
           |   | at           | 1
           |   | very         | 1

Map of DN-6:

NODE – 6 [L (K, V)]

PKT-5 (K)   | V | PKT-10 (K)  | V
type        | 1 | which       | 1
of          | 1 | it          | 1
hardware    | 1 | was         | 1
is          | 1 | waiting     | 1
fault       | 1 | for         | 1
tolerant    | 1 | hours       | 1
and         | 1 | in          | 1
can         | 1 | traditional | 1
accommodate | 1 | ETL         | 1
one         | 1 | with        | 1
or          | 1 | very        | 1
more        | 1 | less        | 1
machine     | 1 | data        | 1
failure     | 1 |             |
without     | 1 |             |

As we can see, MAP marks a 1 against each occurrence without looking at the previous or next word, or checking whether that word has already been counted.
As soon as the splitting and counting of the list of words [L <K>] is complete, it is converted into a list of words with counts [L<K,V>].
Once this counting is completed by MAP, the output is saved on the local disk of each DN. Until MAP finishes its task on all nodes, the next step is not kicked off.
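
For completeness, a minimal driver sketch showing how such a job is wired together (class names are illustrative; WordCountMapper is the mapper sketched earlier). Hadoop itself takes care of spilling map output to each node's local disk and of holding back the next phase until all map tasks have finished.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver for the WordCount job described in this post.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // the mapper sketched above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Map output goes to each DN's local disk; the shuffle/sort phase
        // begins only after all map tasks complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```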
Coming up next ….. SORT.
