When I wrote about Big Data, I mentioned that Big Data is a problem and Hadoop is a solution to it.
Let me start this post with the Big Data problems:
- Storing a colossal amount of data.
- Storing heterogeneous data.
- Accessing and processing the data quickly.
Here comes Hadoop…!!
- Hadoop is a framework to address “Big Data”.
- It is based on a master-slave architecture.
- Slaves are called Data Nodes, and the number of Data Nodes can be scaled out.
- Hadoop allows you to store “Big Data” in a distributed environment (Data Nodes).
- Storing data in a distributed environment helps increase processing speed.
Components of Hadoop
HDFS (Hadoop Distributed File System)
- It is the storage unit of the Hadoop framework.
- It has many Data Nodes and a single Master Node (the NameNode).
- It can store any amount of data, i.e. Big Data, in a distributed way (multiple Data Nodes and a single Master Node).
- Data is stored on Data Nodes in blocks, and you can specify the block size. For example, if you configure a block size of 128 MB, a 512 MB file is stored in 4 blocks.
- Data blocks are stored on different Data Nodes.
- The default replication factor is 3, so each block is stored on 3 Data Nodes.
- Data Nodes can be added when needed.
- Heterogeneous data (structured, unstructured or semi-structured) can be stored on HDFS.
- There is no schema validation before the data is loaded.
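The block and replication arithmetic above can be sketched in a few lines of Python. This is only a toy model, not the HDFS API: the function names `split_into_blocks` and `place_replicas`, the node names `dn1` to `dn5`, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy).

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left of the file.
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct Data Nodes.

    Simplified round-robin placement (an assumption for this sketch).
    """
    if replication > len(data_nodes):
        raise ValueError("not enough Data Nodes for the replication factor")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


# 512 MB of data with a 128 MB block size, as in the example above.
blocks = split_into_blocks(512 * MB, 128 * MB)
# Hypothetical cluster of five Data Nodes.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4", "dn5"])

print(len(blocks))
for block_id, nodes in placement.items():
    print(block_id, nodes)
```

With a replication factor of 3, losing one Data Node still leaves two copies of every block it held, which is why the blocks of one file are spread over different nodes.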
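YARN (Yet Another Resource Negotiator)
- It is the processing unit of Hadoop: it manages cluster resources and schedules processing jobs such as MapReduce.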
- It helps to process data faster because “we move processing to data and not data to processing”.
- In YARN, the processing logic is sent to the various slave nodes, and the data is then processed in parallel across those nodes.
- The processed results are sent to the master node, where they are merged, and the response is sent back to the client.
- This addresses the third Big Data problem: accessing and processing data with a traditional database is slow.
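To make “move processing to data” a bit more concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API: threads in a single process stand in for slave nodes, and `node_data`, `count_words` and `run_job` are hypothetical names chosen for this example. The same word-count logic runs on each node's local data, and only the small partial results travel back to be merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each key stands for a slave node,
# each value for the lines stored locally on that node.
node_data = {
    "node1": ["big data is big", "hadoop stores data"],
    "node2": ["hadoop processes data"],
    "node3": ["data nodes scale"],
}


def count_words(lines):
    """Processing logic that is 'sent' to a node: count words in its local lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_job(data_by_node):
    """Run count_words on every node's data in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(data_by_node)) as pool:
        partial_results = list(pool.map(count_words, data_by_node.values()))
    # Merge step: this is what the master side does with the partial results.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


result = run_job(node_data)
print(result.most_common(3))
```

The point of the design is that each node returns only a small table of counts rather than shipping its raw data across the network, so the merge step on the master stays cheap.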