Hadoop uses the concept of parallelism

Dhruv Upadhyay
2 min read · Sep 6, 2022


What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
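
As a concrete illustration of those simple programming models, here is a minimal sketch of running the WordCount example that ships with Hadoop; the HDFS paths /input and /output and the file words.txt are hypothetical placeholders.

    # Stage some text in HDFS
    hdfs dfs -mkdir -p /input
    hdfs dfs -put words.txt /input/
    # Run the bundled WordCount MapReduce job across the cluster
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
    # Each reducer writes a part file with the aggregated word counts
    hdfs dfs -cat /output/part-r-00000

The map tasks run in parallel on whichever datanodes hold the input blocks, which is exactly the kind of parallelism demonstrated below.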

To prove parallelism

I used AWS, with its CloudWatch monitoring service providing the graphical results; if you prefer, you can verify the same from the registered logs.

So, I launched 4 instances: one to act as the master node, named “client”, and the other 3 as datanodes, named “Slave”, “Slave2”, and “Slave3”.
Hadoop was already set up and the datanodes were live.
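Before the test, the cluster state can be confirmed from the terminal; a quick sketch, assuming the standard Hadoop CLI is on the PATH of each node:

    # On each node: jps lists the running Hadoop daemons
    jps    # the master shows NameNode; each slave shows DataNode
    # On the master: report the live datanodes and their capacity
    hdfs dfsadmin -report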
To show parallelism, I created a text file of around 621 MB; heavier files are preferred, as they produce a clearer graphical representation.
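
A quick sketch of one way to generate and upload such a file (the names bigfile.txt and /bigfile.txt are hypothetical):

    # Generate roughly 621 MB of repeated text locally
    yes "hadoop parallelism demo" | head -c 621M > bigfile.txt
    # Upload it to HDFS, where it is split into blocks
    hdfs dfs -put bigfile.txt /bigfile.txt

With the default HDFS block size of 128 MB, a 621 MB file is split into five blocks, and HDFS distributes those blocks across the datanodes; it is this distribution that produces the parallel transfer measured below.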

The same is demonstrated in the above video… Sorry for the quality of the video.

Result

[Images 1–4: CloudWatch graphs of network traffic and CPU utilization for the four instances]

Analyzing the pictures above, especially 1 and 3, we can see that parallelism was achieved: the data transfer on Slave and Slave1 was initiated at the same instant, though not at a uniform rate, and was followed by Slave2.
The CPU-utilization graphs of the different nodes show the same pattern.
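
The block placement behind those graphs can also be verified from the command line; a quick sketch, assuming the file was uploaded as /bigfile.txt:

    # List each block of the file and the datanodes holding its replicas
    hdfs fsck /bigfile.txt -files -blocks -locations

Seeing a block's replicas spread across the Slave nodes confirms that the write was distributed across the cluster rather than sent to a single machine.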
