Tuesday, January 31, 2017

Apache Hadoop HDFS: Knowing the Basics


Hadoop HDFS (Hadoop Distributed File System) is a distributed Java-based file system for storing large volumes of data. It is designed:
  • To be a scalable, fault-tolerant, distributed storage system 
  • To be the data management layer of Apache Hadoop
    • Hadoop (data management layer) = HDFS + YARN
      • YARN provides the resource management 
      • HDFS provides the distributed storage for big data
    • HDFS works closely with a wide variety of concurrent data access applications, coordinated by YARN.
  • To span large clusters of commodity servers
    • HDFS will “just work” under a variety of physical and systemic circumstances.
    • HDFS cluster = NameNode + DataNodes
This article uses Apache Hadoop HDFS as shipped in the Hortonworks Data Platform (HDP) version 2.4.2. The discussion of the HDFS High Availability (HA) feature is based on [2].


HDFS Cluster


An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable.

You can follow the instructions here to format and start HDFS on the Hortonworks Data Platform. Applications can access HDFS in several ways. Natively, HDFS provides a FileSystem Java API, and a C language wrapper for this Java API is also available. In addition, a web browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol. For more information, read here.[3,10,11]

NameNode


High-level summary of the NameNode. It:
  • Provides high availability (HA) using redundant NameNodes[2]
    • NameNode (active)
    • NameNode (standby)
  • Maintains the following two metadata files (or checkpoint files):
    • fsimage file
      • Holds the entire file system namespace,[12] including the mapping of blocks to files and file system properties
    • editlog file
      • Holds every change that occurs to the filesystem metadata

NameNode Web UI

To smoke test your NameNode server, you can use the following URL[7,11]
http://$namenode.full.hostname:50070
to determine if you can reach the NameNode server with the browser. If successful, you can also select the Utilities menu to "browse the file system".
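
The same smoke test can be scripted. The following is a minimal sketch, assuming the Web UI listens on port 50070 as in the URL above; the function names are illustrative and not part of any Hadoop tooling.

```python
import urllib.error
import urllib.request


def namenode_ui_url(hostname: str, port: int = 50070) -> str:
    """Build the NameNode Web UI URL; 50070 is the port used in this article."""
    return f"http://{hostname}:{port}"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to the URL gets a successful response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

For example, is_reachable(namenode_ui_url("namenode.example.com")) returns True only when the server answers; the hostname here is a placeholder for your NameNode's full hostname.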

High Availability

The HDFS High Availability feature (distinct from HDFS Federation, a separate feature) addresses the SPOF problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode if a machine crashes, or a graceful administrator-initiated failover for planned maintenance.

If the IDs of your NameNodes are nn1 and nn2, you can get their service state using the following commands:[3]

$ sudo -u hdfs hdfs haadmin -getServiceState nn1
active
$ sudo -u hdfs hdfs haadmin -getServiceState nn2
standby
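
If you need to find the active NameNode from a script, the command above can be wrapped as follows. This is a sketch only: it assumes the hdfs binary is on the PATH and that the calling user is allowed to run haadmin (the transcript above uses sudo -u hdfs, which is omitted here). The helper names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, Optional


def run_haadmin(namenode_id: str) -> str:
    """Run `hdfs haadmin -getServiceState <id>` and return its raw output."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", namenode_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_active_namenode(
    namenode_ids: Iterable[str],
    query: Callable[[str], str] = run_haadmin,
) -> Optional[str]:
    """Return the first NameNode ID whose reported state is "active"."""
    for namenode_id in namenode_ids:
        state = query(namenode_id).strip().lower()
        if state == "active":
            return namenode_id
    return None
```

Calling find_active_namenode(["nn1", "nn2"]) on a cluster in the state shown above would return "nn1". The query parameter exists so the lookup can be tested without a running cluster.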

Metadata Files

When the NameNode starts up, it reads the FsImage and EditLog files from disk, merges all the transactions present in the EditLog into the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage.
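
The merge described above can be pictured with a toy model. This is a simplified sketch, not the NameNode's implementation: the namespace is a plain dictionary, and the operation names ("create", "append", "delete") are invented for illustration rather than taken from the real EditLog format.

```python
from typing import Dict, List, Tuple

# A toy namespace: path -> list of block IDs. Real FsImage files are binary
# and hold far more (permissions, replication factor, quotas, and so on).
Namespace = Dict[str, List[int]]
# A toy edit: (operation, path, block IDs). These operation names are
# made up for this sketch and are not the real EditLog opcodes.
Edit = Tuple[str, str, List[int]]


def apply_edits(fsimage: Namespace, editlog: List[Edit]) -> Namespace:
    """Replay every logged edit, in order, on a copy of the namespace."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, blocks in editlog:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "append":
            merged.setdefault(path, []).extend(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged


def checkpoint(fsimage: Namespace, editlog: List[Edit]) -> Tuple[Namespace, List[Edit]]:
    """Merge the edits into a new image and return it with an empty log."""
    new_fsimage = apply_edits(fsimage, editlog)
    return new_fsimage, []  # the old edits are now part of the image


fsimage = {"/data/a.txt": [1, 2]}
editlog = [
    ("create", "/data/b.txt", [3]),
    ("append", "/data/a.txt", [4]),
    ("delete", "/data/b.txt", []),
]
new_fsimage, new_editlog = checkpoint(fsimage, editlog)
print(new_fsimage)  # {'/data/a.txt': [1, 2, 4]}
print(new_editlog)  # []
```

The point of the design is that the EditLog stays small and append-only during normal operation, while the full namespace is rewritten only at checkpoint time.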

Metadata files are stored at: 
  • ${dfs.namenode.name.dir}/edits
  • ${dfs.namenode.name.dir}/fsimage
where the dfs.namenode.name.dir property can be configured in hdfs-site.xml.[8]
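
To check which directory a given configuration points to, you can read the property straight out of the XML. The sketch below assumes the usual Hadoop layout of name/value pairs inside property elements; the sample file content and the directory path in it are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_hadoop_property(xml_text: str, name: str) -> Optional[str]:
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative configuration; the directory path is a made-up example.
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/hdfs/namenode</value>
  </property>
</configuration>
"""

name_dir = get_hadoop_property(SAMPLE_HDFS_SITE, "dfs.namenode.name.dir")
print(name_dir)  # /hadoop/hdfs/namenode
```

On a real node you would pass the contents of your own hdfs-site.xml instead of the sample string.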


DataNode


High-level summary of the DataNode:[4]
  • Scalable Storage
    • HDFS cluster storage scales horizontally with the addition of DataNodes
  • Minimal data motion
    • Hadoop moves compute processes to the data on HDFS and not the other way around. 
      • Processing tasks can occur on the physical node where the data resides, which significantly reduces network I/O and provides very high aggregate bandwidth.
  • Data Disk Failure: Heartbeats and replication
    • Each DataNode sends a Heartbeat message to the NameNode periodically.
      • If the NameNode detects that a DataNode has stopped sending Heartbeat messages, it marks that DataNode as dead and stops forwarding new I/O requests to it.
    • The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: 
      • a DataNode may become unavailable
      • a replica may become corrupted
      • a hard disk on a DataNode may fail
      • the replication factor of a file may be increased
  • Data Rebalancing
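    • The HDFS architecture allows data to be moved from one DataNode to another when the free space on a DataNode falls below a certain threshold. In practice this is usually triggered by an administrator running the balancer tool (hdfs balancer) rather than happening fully automatically.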
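  • Data Integrity: Checksums
    • When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
    • When a client later retrieves the file contents, it verifies that the data received from each DataNode matches the stored checksum.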
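
The heartbeat, re-replication, and checksum ideas in the list above can be sketched in a few lines. This is a toy model under stated assumptions, not HDFS code: the 30-second timeout and the 512-byte chunk size are arbitrary values chosen for the example, zlib.crc32 stands in for whatever checksum the cluster actually uses, and all function names are hypothetical.

```python
import zlib
from typing import Dict, List, Set

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical timeout, not an HDFS default
BYTES_PER_CHECKSUM = 512    # illustrative chunk size for this sketch


def dead_datanodes(last_heartbeat: Dict[str, float], now: float) -> Set[str]:
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return {
        node for node, seen in last_heartbeat.items()
        if now - seen > HEARTBEAT_TIMEOUT_S
    }


def under_replicated(
    block_locations: Dict[str, Set[str]],
    dead: Set[str],
    replication_factor: int,
) -> Dict[str, int]:
    """Map each under-replicated block to the number of replicas it lacks."""
    missing = {}
    for block, nodes in block_locations.items():
        live = len(nodes - dead)  # replicas on dead nodes do not count
        if live < replication_factor:
            missing[block] = replication_factor - live
    return missing


def chunk_checksums(data: bytes) -> List[int]:
    """Compute a CRC32 for each fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]


heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
dead = dead_datanodes(heartbeats, now=100.0)
print(dead)  # {'dn3'}

blocks = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
print(under_replicated(blocks, dead, replication_factor=2))  # {'blk_2': 1}

stored = chunk_checksums(b"x" * 1000)
received = chunk_checksums(b"x" * 999 + b"y")  # last byte corrupted
print(stored[0] == received[0], stored[1] == received[1])  # True False
```

In the example, dn3 has been silent for 60 seconds and is treated as dead, which leaves blk_2 with one live replica and makes it a candidate for re-replication. The checksum comparison shows how a reader can tell which chunk of the data was corrupted.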

References

  1. Hadoop Distributed File System (HDFS)
  2. HDFS High Availability Using the Quorum Journal Manager
  3. HDFS Commands Guide (Apache Hadoop) 
    • All HDFS commands are invoked by the bin/hdfs script and can be grouped into:
      • User commands
      • Administrator commands
      • Debug commands
  4. HDFS Architecture (Apache Hadoop) 
  5. Apache Hadoop
  6. HDFS Federation (Hortonworks)
    • In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. 
  7. HDFS Ports (Hortonworks)
  8. Apache Ambari: Knowing the Basics (Xml and More)
  9. hdfs-default.xml (2.7.1)
  10. FileSystem Shell - Apache™ Hadoop
  11. Hadoop NameNode Web Interface
  12. Namespace (HDFS)
    • Consists of directories, files and blocks.
    • It supports all the namespace related file system operations such as create, delete, modify and list files and directories.
  13. Hadoop DistCp Guide
    • Copy file or directories recursively
  14. All Cloud-related articles on Xml and More
