MapReduce – Unwinding Map

In the last discussion on MapReduce, we covered the algorithm Hadoop uses for data processing with MapReduce.
In this blog, we will discuss the MAP phase of MapReduce and its functionality.

Unwinding Map

We will explain this in detail, with an example.

Example:

The Scenario:

  • We have a 7-node cluster where 1 node is the Name Node (NN) and the remaining 6 nodes are Data Nodes (DN). Of these 6 nodes, one is already down and unavailable for processing. Hence, only 5 nodes are available to take part as DNs.
  • We have configured the replication factor to 2.
  • We have the default block size configured to 64 MB via the dfs.block.size property (a configuration sketch follows this list).
  • We will execute a WordCount program to find the count of each word that appears in our data file.
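
For reference, here is a minimal sketch of how those two settings could be applied from a Java client; the same values are more commonly placed in hdfs-site.xml. The class name is illustrative, and the property names assume Hadoop 2.x or later, where dfs.blocksize supersedes the older dfs.block.size alias.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: apply the block size and replication factor from this scenario.
public class ClusterConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB blocks
        conf.setInt("dfs.replication", 2);                // replication factor 2
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size in effect: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
    }
}
```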

The Data (sample):

"With the use of distributed parallel processing design Hadoop overcome these scenarios. This can be configured using low cost commodity machines. This technique relies on horizontal scaling of hardware. Horizontal scaling can be achieved by adding new commodity machines without any limitations. The best part of this type of hardware is fault tolerant and can accommodate one or more machine failure without much performance hit. By harnessing the true capability of distributed parallel processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very low granular level but also speeds up its processing on decentralized data to the maximum. By enabling Hadoop an organization, with huge data, can get its analysis within minute for which it was waiting for hours in traditional ETL with very less data."
  • For this example, assume the size of this content is 612 MB.

Explanation of Data (Sample):

Now that we have defined our scenario, let's proceed to the processing part.
First of all, the data is split into chunks (packets) of 64 MB each by the NN. Because the data size is 612 MB, the last packet will be smaller than 64 MB. Let's see the distribution of these packets (with their contents and sizes) done by the NN for further processing. The content of each packet given here is hypothetical (just for the sake of the example).
Total size: 612 MB
Packet size: 64 MB
Details of packets with their sizes:

PKT-1: “With the use of distributed parallel processing design Hadoop overcome” — 64 MB
PKT-2: “these scenarios. This can be configured using low cost commodity machines. This” — 64 MB
PKT-3: “technique relies on horizontal scaling of hardware. Horizontal scaling can be” — 64 MB
PKT-4: “achieved by adding new commodity machines without any limitations. The best part of this” — 64 MB
PKT-5: “type of hardware is fault tolerant and can accommodate one or more machine failure without” — 64 MB
PKT-6: “much performance hit. By harnessing the true capability of distributed parallel” — 64 MB
PKT-7: “processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very” — 64 MB
PKT-8: “low granular level but also speeds up its processing on decentralized data to the maximum.” — 64 MB
PKT-9: “By enabling Hadoop an organization, with huge data, can get its analysis within minute for” — 64 MB
PKT-10: “which it was waiting for hours in traditional ETL with very less data.” — 36 MB

Summary of data explanation:
No. of packets: 10 (64 MB × 9 = 576 MB, plus 1 × 36 MB = 36 MB; total 612 MB).
Hence, a total of 10 packets are created, of which 9 are 64 MB and 1 is 36 MB. Given the replication factor of 2, there will be 20 packets in total (9 × 2 = 18 and 1 × 2 = 2) to distribute among the DNs, where 10 are for processing and the remaining 10 maintain the replication factor. It is the NN that decides which packet goes to which DN. Let's see the distribution of the packets on the DNs.
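
The same arithmetic as a runnable sketch (the class name is illustrative):

```java
// Packet and replica arithmetic for a 612 MB file with 64 MB packets
// and replication factor 2. Mirrors the summary above.
public class PacketMath {
    public static void main(String[] args) {
        long totalMb = 612, packetMb = 64;
        int replication = 2;
        long fullPackets = totalMb / packetMb;              // 9 packets of 64 MB
        long tailMb = totalMb % packetMb;                   // one 36 MB tail packet
        long packets = fullPackets + (tailMb > 0 ? 1 : 0);  // 10 packets in total
        long withReplicas = packets * replication;          // 20 packets on the DNs
        System.out.printf("packets=%d (tail=%d MB), with replicas=%d%n",
                packets, tailMb, withReplicas);
    }
}
```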

For each node below, the packets assigned for processing are listed first, followed by the packets held to maintain the replication factor.
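
HDFS's actual placement policy is rack-aware and more involved; purely as a toy illustration of "the NN decides which packet goes to which DN" (the round-robin choice below is an assumption for this example, not the real algorithm):

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: for each packet, the NN picks one DN to process it
// and a different DN to hold the redundant copy (replication factor 2).
public class ToyPlacement {
    public static void main(String[] args) {
        // DN-2 is down, so only five live nodes take part.
        List<String> live = Arrays.asList("DN-1", "DN-3", "DN-4", "DN-5", "DN-6");
        for (int p = 1; p <= 10; p++) {
            String primary = live.get((p - 1) % live.size());
            String replica = live.get(p % live.size()); // always a different node
            System.out.printf("PKT-%d -> %s (processing), %s (redundancy)%n",
                    p, primary, replica);
        }
    }
}
```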

Packets distribution on nodes:

DN-1:

DN-1 is assigned packets 1 & 6 for processing; packets 5 & 10 are held for redundancy.

PKT-1: With the use of distributed parallel processing design Hadoop overcome
PKT-6: much performance hit. By harnessing the true capability of distributed parallel
PKT-5: type of hardware is fault tolerant and can accommodate one or more machine failure without
PKT-10: which it was waiting for hours in traditional ETL with very less data.

DN-2:
Node not available.

DN-3:

DN-3 is assigned packets 4 & 9 for processing; packets 2 & 7 are held for redundancy.

PKT-4: achieved by adding new commodity machines without any limitations. The best part of this
PKT-9: By enabling Hadoop an organization, with huge data, can get its analysis within minute for
PKT-2: these scenarios. This can be configured using low cost commodity machines. This
PKT-7: processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very

DN-4:

DN-4 is assigned packets 3 & 8 for processing; packets 1 & 6 are held for redundancy.

PKT-3: technique relies on horizontal scaling of hardware. Horizontal scaling can be
PKT-8: low granular level but also speeds up its processing on decentralized data to the maximum.
PKT-1: With the use of distributed parallel processing design Hadoop overcome
PKT-6: much performance hit. By harnessing the true capability of distributed parallel

DN-5:

DN-5 is assigned packets 2 & 7 for processing; packets 4 & 9 are held for redundancy.

PKT-2: these scenarios. This can be configured using low cost commodity machines. This
PKT-7: processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very
PKT-4: achieved by adding new commodity machines without any limitations. The best part of this
PKT-9: By enabling Hadoop an organization, with huge data, can get its analysis within minute for

DN-6:

DN-6 is assigned packets 5 & 10 for processing; packets 3 & 8 are held for redundancy.

PKT-5: type of hardware is fault tolerant and can accommodate one or more machine failure without
PKT-10: which it was waiting for hours in traditional ETL with very less data.
PKT-3: technique relies on horizontal scaling of hardware. Horizontal scaling can be
PKT-8: low granular level but also speeds up its processing on decentralized data to the maximum.

Once these packets are distributed, each DN starts its processing.

MAP’s role:

Here is where MAP comes into the picture.

On each DN, MAP starts its execution by splitting the text it received into individual words. By doing this, it produces a list of words as output [L <K>] on each DN.

[L <K>] :

List of words as per our submitted text file:
DN-1: “With”, “the”, “use”, “of”, “parallel”, “processing”, “design”, “Hadoop”, “overcome”, “distributed”, “much”, “performance”, “.”, “hit”, “By”, “harnessing”, “the”, “true”, “capability”, “of”, “distributed”, “parallel”
DN-2: No assignment is done, as this node is dead.
DN-3: “achieved”, “by”, “adding”, “new”, “commodity”, “machines”, “without”, “any”, “the”, “best”, “part”, “of”, “this”, “limitations”, “By”, “enabling”, “Hadoop”, “an”, “organization”, “with”, “,”, “huge”, “data”, “,”, “can”, “get”, “its”, “analysis”, “within”, “minute”, “for”
DN-4: “technique”, “relies”, “on”, “horizontal”, “scaling”, “of”, “hardware”, “.”, “Horizontal”, “scaling”, “can”, “be”, “low”, “granular”, “level”, “but”, “also”, “speeds”, “up”, “its”, “processing”, “on”, “decentralized”, “data”, “to”, “the”, “maximum”
DN-5: “these”, “scenarios”, “.”, “This”, “can”, “be”, “configured”, “using”, “low”, “cost”, “commodity”, “machines”, “.”, “This”, “processing”, “Hadoop”, “not”, “only”, “enables”, “organization”, “to”, “decentralize”, “its”, “data”, “and”, “allow”, “them”, “to”, “capture”, “data”, “at”, “very”
DN-6: “type”, “of”, “hardware”, “is”, “fault”, “tolerant”, “and”, “can”, “accommodate”, “one”, “or”, “more”, “machine”, “failure”, “without”, “which”, “it”, “was”, “waiting”, “for”, “hours”, “in”, “traditional”, “ETL”, “with”, “very”, “less”, “data”

[L <K>] with word count:

During this split, MAP not only prepares the list of words [L<K>] on each DN but also marks a count of occurrence (1) against each word.

This split and count is done without knowing which word comes next; each occurrence is counted in isolation, since MAP is not aware of the surrounding words.
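
A minimal sketch of the mapper this describes, essentially the canonical Hadoop WordCount mapper: it tokenizes its input split and emits (word, 1) for every token, never consulting neighboring words.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in the input split. Each occurrence is
// counted independently -- the mapper never looks at the previous or next
// word, exactly as described above.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // mark 1 against each occurrence
        }
    }
}
```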

Work done by MAP:

Let's see the work done by MAP on all nodes.

Map of DN-1:

NODE – 1 [L (K, V)]

PKT-1 (K)   | V | PKT-6 (K)   | V
With        | 1 | much        | 1
the         | 1 | performance | 1
use         | 1 | .           | 1
of          | 1 | hit         | 1
parallel    | 1 | By          | 1
processing  | 1 | harnessing  | 1
design      | 1 | the         | 1
Hadoop      | 1 | true        | 1
overcome    | 1 | capability  | 1
distributed | 1 | of          | 1
            |   | distributed | 1
            |   | parallel    | 1

Map of DN-2:

NODE – 2 [L (K, V)] — no map output, as this node is not available.

Map of DN -3:

NODE – 3 [ L (K, V)]

PKT-4(K)

V

PKT-9(K)

V

achieved

1

1

by

1

By

1

adding

1

enabling

1

new

1

Hadoop

1

commodity

1

an

1

machines

1

organization

1

without

1

with

1

any

1

,

1

the

1

huge

1

best

1

data

1

part

1

,

1

of

1

can

1

this

1

get

1

limitations

its

1

analysis

1

within

1

minute

1

for

1

Map of DN-4:

NODE – 4 [L (K, V)]

PKT-3 (K)  | V | PKT-8 (K)     | V
technique  | 1 | low           | 1
relies     | 1 | granular      | 1
on         | 1 | level         | 1
horizontal | 1 | but           | 1
scaling    | 1 | also          | 1
of         | 1 | speeds        | 1
hardware   | 1 | up            | 1
.          | 1 | its           | 1
Horizontal | 1 | processing    | 1
scaling    | 1 | on            | 1
can        | 1 | decentralized | 1
be         | 1 | data          | 1
           |   | to            | 1
           |   | the           | 1
           |   | maximum       | 1

Map of DN-5:

NODE – 5 [L (K, V)]

PKT-2 (K)  | V | PKT-7 (K)    | V
these      | 1 | processing   | 1
scenarios  | 1 | Hadoop       | 1
.          | 1 | not          | 1
This       | 1 | only         | 1
can        | 1 | enables      | 1
be         | 1 | organization | 1
configured | 1 | to           | 1
using      | 1 | decentralize | 1
low        | 1 | its          | 1
cost       | 1 | data         | 1
commodity  | 1 | and          | 1
machines   | 1 | allow        | 1
.          | 1 | them         | 1
This       | 1 | to           | 1
           |   | capture      | 1
           |   | data         | 1
           |   | at           | 1
           |   | very         | 1

Map of DN-6:

NODE – 6 [L (K, V)]

PKT-5 (K)   | V | PKT-10 (K)  | V
type        | 1 | which       | 1
of          | 1 | it          | 1
hardware    | 1 | was         | 1
is          | 1 | waiting     | 1
fault       | 1 | for         | 1
tolerant    | 1 | hours       | 1
and         | 1 | in          | 1
can         | 1 | traditional | 1
accommodate | 1 | ETL         | 1
one         | 1 | with        | 1
or          | 1 | very        | 1
more        | 1 | less        | 1
machine     | 1 | data        | 1
failure     | 1 |             |
without     | 1 |             |

As we can see, MAP marks a 1 against each occurrence without looking at the previous or next word, or checking whether that word has already been counted.
As soon as the splitting and counting of the list of words [L <K>] is complete, it is converted into a list of words with counts [L<K,V>].
Once this counting is completed by MAP, the output is saved on the local disk of each DN. Until MAP finishes its task on all nodes, the next step is not kicked off.
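
For completeness, a minimal driver sketch showing how such a job is wired together (class names are illustrative; WordCountMapper is the mapper sketched earlier). Hadoop itself takes care of spilling map output to each node's local disk and of holding back the next phase until all map tasks have finished.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver for the WordCount job described in this post.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // the mapper sketched above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Map output goes to each DN's local disk; the shuffle/sort phase
        // begins only after all map tasks complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```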
Coming up next ….. SORT.
