Unwinding Map
Example:
The Scenario:
- We have a 7-node cluster where 1 node is the Name Node (NN) and the remaining 6 nodes are Data Nodes (DN). Of these 6 nodes, one is already down and is not available for our processing purpose. Hence, only 5 nodes are available to take part as DNs.
- We have configured the replication factor to 2.
- We have the default block size configured to 64 MB using the configuration property dfs.block.size.
- We will be executing the WordCount program to find the count of each word that appears in our data file (a driver sketch follows this list).
- The content of the data file is taken from my previous blog “Magic of Hadoop”.
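As a concrete reference point, here is a minimal driver sketch that mirrors these assumptions. The class names (WordCountDriver, WordCountMapper) are hypothetical, and in practice the block size and replication factor would normally be set in hdfs-site.xml rather than in the driver; newer Hadoop releases also use the property name dfs.blocksize instead of dfs.block.size.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for the WordCount scenario described above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The scenario's settings; normally these live in hdfs-site.xml,
        // they are set here only to mirror the assumptions of this example.
        conf.set("dfs.block.size", String.valueOf(64 * 1024 * 1024)); // 64 MB blocks
        conf.set("dfs.replication", "2");                             // replication factor 2

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class); // mapper sketched later in this post
        // reducer intentionally omitted here; this post focuses on the map phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: the data file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```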
The Data (sample):
"With the use of distributed parallel processing design Hadoop overcome these scenarios. This can be configured using low cost commodity machines. This technique relies on horizontal scaling of hardware. Horizontal scaling can be achieved by adding new commodity machines without any limitations. The best part of this type of hardware is fault tolerant and can accommodate one or more machine failure without much performance hit. By harnessing the true capability of distributed parallel processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very low granular level but also speeds up its processing on decentralized data to the maximum. By enabling Hadoop an organization, with huge data, can get its analysis within minute for which it was waiting for hours in traditional ETL with very less data."
- Assuming the size of this content is 612 MB.
Explanation of Data (Sample):
Total Size : 612 MB
Packet Size : 64 MB
Details of packets with their sizes:
PKT-1: “With the use of distributed parallel processing design Hadoop overcome” (64 MB)
PKT-2: “these scenarios. This can be configured using low cost commodity machines. This” (64 MB)
PKT-3: “technique relies on horizontal scaling of hardware. Horizontal scaling can be” (64 MB)
PKT-4: “achieved by adding new commodity machines without any limitations. The best part of this” (64 MB)
PKT-5: “type of hardware is fault tolerant and can accommodate one or more machine failure without” (64 MB)
PKT-6: “much performance hit.By harnessing the true capability of distributed parallel” (64 MB)
PKT-7: “processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very” (64 MB)
PKT-8: “low granular level but also speeds up its processing on decentralized data to the maximum.” (64 MB)
PKT-9: “By enabling Hadoop an organization, with huge data, can get its analysis within minute for” (64 MB)
PKT-10: “which it was waiting for hours in traditional ETL with very less data.” (36 MB)
Summary of data explanation:
Hence, a total of 10 packets would be created, of which 9 packets would be 64 MB each and 1 packet would be 36 MB. Given the replication factor of 2, there will be a total of 20 packets (9*2 = 18 and 1*2 = 2, 18 + 2 = 20) distributed among the DNs, where 10 would be used for processing and the remaining 10 would maintain the redundancy. It is the NN that decides which packet goes to which DN; a short arithmetic sketch follows, and then the distribution of the packets on the DNs.
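The arithmetic behind these numbers can be sketched as follows (the file size, block size, and replication factor are the assumptions stated above):

```java
// Illustrative arithmetic only: how many blocks ("packets") and copies the file yields.
public class BlockCountSketch {
    public static void main(String[] args) {
        long fileSizeMb  = 612; // assumed size of the data file
        long blockSizeMb = 64;  // dfs.block.size
        int  replication = 2;   // configured replication factor

        long blocks      = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceil(612 / 64) = 10
        long lastBlockMb = fileSizeMb - (blocks - 1) * blockSizeMb;      // 612 - 9 * 64 = 36
        long totalCopies = blocks * replication;                         // 10 * 2 = 20

        System.out.println(blocks + " blocks, last block " + lastBlockMb
                + " MB, " + totalCopies + " copies in total across the DNs");
    }
}
```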
Packets distribution on nodes:
DN-1:
DN-1 is assigned packets 1 & 6 for processing, and packets 5 & 10 for redundancy.
PKT-1 | With the use of distributed parallel processing design Hadoop overcome |
PKT-6 | much performance hit.By harnessing the true capability of distributed parallel |
PKT-5 | type of hardware is fault tolerant and can accommodate one or more machine failure without |
PKT-10 | which it was waiting for hours in traditional ETL with very less data. |
DN-2:
Node not available.
DN-3:
DN-3 is assigned packets 4 & 9 for processing, and packets 2 & 7 for redundancy.
PKT-4 | achieved by adding new commodity machines without any limitations. The best part of this |
PKT-9 | By enabling Hadoop an organization, with huge data, can get its analysis within minute for |
PKT-2 | these scenarios. This can be configured using low cost commodity machines. This |
PKT-7 | processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very |
DN-4:
DN-4 is assigned packets 3 & 8 for processing, while packets 1 & 6 are assigned for redundancy.
PKT-3 | technique relies on horizontal scaling of hardware. Horizontal scaling can be |
PKT-8 | low granular level but also speeds up its processing on decentralized data to the maximum. |
PKT-1 | With the use of distributed parallel processing design Hadoop overcome |
PKT-6 | much performance hit.By harnessing the true capability of distributed parallel |
DN-5:
DN-5 is assigned packets 2 & 7 for processing, and packets 4 & 9 for redundancy.
PKT-2 | these scenarios. This can be configured using low cost commodity machines. This |
PKT-7 | processing Hadoop not only enables organization to decentralize its data and allow them to capture data at very |
PKT-4 | achieved by adding new commodity machines without any limitations. The best part of this |
PKT-9 | By enabling Hadoop an organization, with huge data, can get its analysis within minute for |
DN-6:
DN-6 is assigned packets 5 & 10 for processing, while packets 3 & 8 are assigned for redundancy.
PKT-5 | type of hardware is fault tolerant and can accommodate one or more machine failure without |
PKT-10 | which it was waiting for hours in traditional ETL with very less data. |
PKT-3 | technique relies on horizontal scaling of hardware. Horizontal scaling can be |
PKT-8 | low granular level but also speeds up its processing on decentralized data to the maximum. |
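The exact placement shown above is the NN's decision. The sketch below is only a toy illustration of that idea, assuming a simple round-robin over the live DNs; the real NameNode uses a rack-aware placement policy, and this toy rule will not reproduce the exact layout in the tables above, only the property that a packet's two copies never sit on the same DN.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy placement sketch: not HDFS's real policy, just the idea that the NN
// spreads each packet's two copies over different live DataNodes.
public class PlacementSketch {
    public static void main(String[] args) {
        List<String> liveNodes = Arrays.asList("DN-1", "DN-3", "DN-4", "DN-5", "DN-6"); // DN-2 is down
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String dn : liveNodes) {
            assignment.put(dn, new ArrayList<>());
        }

        int packets = 10;
        for (int p = 1; p <= packets; p++) {
            String primary = liveNodes.get((p - 1) % liveNodes.size()); // copy used for processing
            String replica = liveNodes.get(p % liveNodes.size());       // copy kept for redundancy
            assignment.get(primary).add("PKT-" + p + " (processing)");
            assignment.get(replica).add("PKT-" + p + " (redundancy)");
        }

        assignment.forEach((dn, pkts) -> System.out.println(dn + " -> " + pkts));
    }
}
```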
Once these packets are distributed, each DN starts its processing.
MAP’s role:
Here comes MAP into the picture.
On each DN, the map task starts executing: it splits the text it received into words. By doing this, it gets a list of words as output [L <K>] on each DN.
[L <K>]:
List of words per DN, as per our submitted text file:
DN-1: “With”, “the”, “use”, “of”, “parallel”, “processing”, “design”, “Hadoop”, “overcome”, “distributed”, “much”, “performance”, “.”, “hit”, “By”, “harnessing”, “the”, “true”, “capability”, “of”, “distributed”, “parallel”
DN-2: No assignment is done, as this node is dead.
DN-3: “achieved”, “by”, “adding”, “new”, “commodity”, “machines”, “without”, “any”, “the”, “best”, “part”, “of”, “this”, “limitations”, “By”, “enabling”, “Hadoop”, “an”, “organization”, “with”, “,”, “huge”, “data”, “,”, “can”, “get”, “its”, “analysis”, “within”, “minute”, “for”
DN-4: “technique”, “relies”, “on”, “horizontal”, “scaling”, “of”, “hardware”, “.”, “Horizontal”, “scaling”, “can”, “be”, “low”, “granular”, “level”, “but”, “also”, “speeds”, “up”, “its”, “processing”, “on”, “decentralized”, “data”, “to”, “the”, “maximum”
DN-5: “these”, “scenarios”, “.”, “This”, “can”, “be”, “configured”, “using”, “low”, “cost”, “commodity”, “machines”, “.”, “This”, “processing”, “Hadoop”, “not”, “only”, “enables”, “organization”, “to”, “decentralize”, “its”, “data”, “and”, “allow”, “them”, “to”, “capture”, “data”, “at”, “very”
DN-6: “type”, “of”, “hardware”, “is”, “fault”, “tolerant”, “and”, “can”, “accommodate”, “one”, “or”, “more”, “machine”, “failure”, “without”, “which”, “it”, “was”, “waiting”, “for”, “hours”, “in”, “traditional”, “ETL”, “with”, “very”, “less”, “data”
[L <K>] with word count:
During this split, it not only prepares the list of words [L <K>] on each DN but also marks a count of occurrences against each word.
This split and count is done without any knowledge of which word will come next; the map task simply cannot know what word it will see next.
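In code, this per-word emission is what the map function does. Below is a minimal WordCount mapper sketch; the class name WordCountMapper and the use of StringTokenizer are illustrative assumptions, and the actual program behind this example may differ in detail.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call receives a chunk of the split assigned to its DN, tokenizes it,
// and emits (word, 1) for every token, without knowing which word comes next.
// These (K, V) pairs are exactly the [L (K, V)] lists shown per node below.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // V is always 1 at the map stage
        }
    }
}
```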
Work done by MAP:
Let's see the work done by MAP on all the nodes.
Map of DN-1:
NODE – 1 [ L (K, V) ]
| PKT-1 (K) | V | PKT-6 (K) | V |
| With | 1 | much | 1 |
| the | 1 | performance | 1 |
| use | 1 | . | 1 |
| of | 1 | hit | 1 |
| parallel | 1 | By | 1 |
| processing | 1 | harnessing | 1 |
| design | 1 | the | 1 |
| Hadoop | 1 | true | 1 |
| overcome | 1 | capability | 1 |
| distributed | 1 | of | 1 |
| | | distributed | 1 |
| | | parallel | 1 |
NODE – 2 [ L (K, V) ]
Node not available, hence no map output.
NODE – 3 [ L (K, V) ]
| PKT-4 (K) | V | PKT-9 (K) | V |
| achieved | 1 | By | 1 |
| by | 1 | enabling | 1 |
| adding | 1 | Hadoop | 1 |
| new | 1 | an | 1 |
| commodity | 1 | organization | 1 |
| machines | 1 | with | 1 |
| without | 1 | , | 1 |
| any | 1 | huge | 1 |
| the | 1 | data | 1 |
| best | 1 | , | 1 |
| part | 1 | can | 1 |
| of | 1 | get | 1 |
| this | 1 | its | 1 |
| limitations | 1 | analysis | 1 |
| | | within | 1 |
| | | minute | 1 |
| | | for | 1 |
NODE – 4 [ L (K, V) ]
| PKT-3 (K) | V | PKT-8 (K) | V |
| technique | 1 | low | 1 |
| relies | 1 | granular | 1 |
| on | 1 | level | 1 |
| horizontal | 1 | but | 1 |
| scaling | 1 | also | 1 |
| of | 1 | speeds | 1 |
| hardware | 1 | up | 1 |
| . | 1 | its | 1 |
| Horizontal | 1 | processing | 1 |
| scaling | 1 | on | 1 |
| can | 1 | decentralized | 1 |
| be | 1 | data | 1 |
| | | to | 1 |
| | | the | 1 |
| | | maximum | 1 |
NODE – 5 [ L (K, V) ]
| PKT-2 (K) | V | PKT-7 (K) | V |
| these | 1 | processing | 1 |
| scenarios | 1 | Hadoop | 1 |
| . | 1 | not | 1 |
| This | 1 | only | 1 |
| can | 1 | enables | 1 |
| be | 1 | organization | 1 |
| configured | 1 | to | 1 |
| using | 1 | decentralize | 1 |
| low | 1 | its | 1 |
| cost | 1 | data | 1 |
| commodity | 1 | and | 1 |
| machines | 1 | allow | 1 |
| . | 1 | them | 1 |
| This | 1 | to | 1 |
| | | capture | 1 |
| | | data | 1 |
| | | at | 1 |
| | | very | 1 |
NODE – 6 [ L (K, V) ]
| PKT-5 (K) | V | PKT-10 (K) | V |
| type | 1 | which | 1 |
| of | 1 | it | 1 |
| hardware | 1 | was | 1 |
| is | 1 | waiting | 1 |
| fault | 1 | for | 1 |
| tolerant | 1 | hours | 1 |
| and | 1 | in | 1 |
| can | 1 | traditional | 1 |
| accommodate | 1 | ETL | 1 |
| one | 1 | with | 1 |
| or | 1 | very | 1 |
| more | 1 | less | 1 |
| machine | 1 | data | 1 |
| failure | 1 | | |
| without | 1 | | |