Hadoop Architecture – Hadoop Distributed File System (HDFS)

The Hadoop ecosystem consists of the Hadoop Distributed File System (HDFS) and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie, which together support working with big data on Hadoop. HDFS is the Java-based distributed storage layer of Hadoop; it organizes and manages pools of big data spread across large clusters of commodity servers. HDFS has two main components: the NameNode and the DataNode.

1. NameNode

NameNode is a daemon that manages and maintains all the DataNodes (slave nodes). It records the metadata for all blocks stored in the cluster, including information such as size, location, source and hierarchy. It also records every change made to that metadata.

2. DataNode

DataNode is a slave daemon that runs on each slave machine. The DataNodes act as the storage devices of the cluster and are responsible for serving read and write requests from users.

Note:

Java is the native language of HDFS, so the DataNode and NameNode can be deployed on any machine with Java installed. In a typical deployment, one dedicated machine runs the NameNode and all the other nodes in the cluster act as DataNodes. The NameNode holds metadata such as the location of blocks on the DataNodes. The NameNode also arbitrates resources among the competing DataNodes.
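To make the split between metadata and storage concrete, here is a minimal sketch of the kind of bookkeeping a NameNode keeps. It is a toy in-memory model, not the Hadoop API: the class name NameNodeSketch, its methods, the file path and the block and DataNode names are all hypothetical and chosen only for illustration.

```python
class NameNodeSketch:
    """Toy in-memory model of NameNode metadata (hypothetical, not the Hadoop API)."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, block_ids):
        # Record which blocks make up a file; no file data is stored here.
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, datanode, block_id):
        # A DataNode tells the NameNode that it holds a copy of a block.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # Answer a client's question: where are the blocks of this file?
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


namenode = NameNodeSketch()
namenode.add_file("/logs/app.log", ["blk_1", "blk_2"])
namenode.report_block("datanode-a", "blk_1")
namenode.report_block("datanode-b", "blk_1")
namenode.report_block("datanode-c", "blk_2")
print(namenode.locate("/logs/app.log"))
# [('blk_1', ['datanode-a', 'datanode-b']), ('blk_2', ['datanode-c'])]
```

The point of the sketch is that the NameNode only answers "which DataNodes hold which blocks"; the block contents themselves live on the DataNodes.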

Block

In HDFS terms, a block is the smallest unit of storage: the minimum contiguous storage allocated to a file. The default block size in HDFS is 128 MB; it is configurable, and larger values such as 256 MB are also used.
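As a rough illustration of how a file maps onto blocks, the sketch below splits a file size into block-sized pieces. The function name split_into_blocks is hypothetical, and the 128 MB block size is assumed as the default; the last block holds only the remaining data rather than being padded to a full block.

```python
BLOCK_SIZE_MB = 128  # assumed default block size


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size would use."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block only holds what is left over.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(300))  # [128, 128, 44]
```

With these assumptions, a 300 MB file needs three blocks, two full ones and one of 44 MB.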

Replication

HDFS provides a reliable way to store huge data sets in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, and it is configurable. If we store a file of 128 MB in HDFS using the default configuration, it will occupy 384 MB (3 × 128 MB) of space, because each block is replicated three times. Each replica resides on a different DataNode.
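The storage arithmetic above can be written as a one-line calculation. This is only a sketch: the function name raw_storage_mb is hypothetical, and it ignores any overhead beyond the replicas themselves.

```python
def raw_storage_mb(file_size_mb, replication_factor=3):
    """Raw cluster space (in MB) used by a file once every block is replicated."""
    return file_size_mb * replication_factor


print(raw_storage_mb(128))     # 384
print(raw_storage_mb(128, 2))  # 256
```

Lowering the replication factor saves space at the cost of fewer copies to fall back on when a DataNode fails.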

Rack Awareness

The NameNode ensures that the replicas of a block are not all kept on a single rack. It follows a built-in rack awareness algorithm to reduce latency and provide fault tolerance.
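The sketch below illustrates one commonly described placement for three replicas: the first on the writer's own node, the second on a node in a different rack, and the third on another node in that same remote rack. It is a simplified illustration under those assumptions, not Hadoop's actual implementation; the function name place_replicas and the rack and DataNode names are hypothetical.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of DataNode names on that rack.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Second replica: any node on a rack other than the writer's rack.
    remote_racks = [r for r in topology if r != local_rack and topology[r]]
    if not remote_racks:
        raise ValueError("this sketch needs at least two racks")
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Third replica: a different node on the same remote rack if possible,
    # otherwise any node that has not been used yet.
    candidates = [n for n in topology[remote_rack] if n != second]
    if not candidates:
        candidates = [n for n in rack_of if n not in (writer_node, second)]
    if not candidates:
        raise ValueError("not enough DataNodes for three replicas")
    third = rng.choice(candidates)

    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(topology, "dn1")
print(replicas)  # first entry is always 'dn1'; the other two are 'dn3' and 'dn4' in some order
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two of the copies on one rack limits the traffic that has to cross between racks.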

For details, see the official Hadoop website.

Nub8 Hadoop Consulting Services

Nub8 offers strong Hadoop consulting services, including Hadoop ecosystem technology selection, customization, development and implementation covering Hive, Spark, Pig, Sqoop, Flume, Oozie, MapReduce, HDFS, Kafka and more. Nub8 consultants also provide consulting services for integrating Hadoop with NoSQL databases such as MongoDB, Cassandra and HBase, and with relational databases such as Oracle Database and Microsoft SQL Server.