Hadoop NameNode and Checkpoint
Management of data is the core task of storage and database systems. Furthermore, less attention is on metadata, esp. in the traditional systems. Even in the design of parallel or distributed systems, architectures often assume that metadata can fit in the memory of a single node. Maybe Hadoop is the most famous system. LinkedIn is representative of the adoption of a large-scale Hadoop system. Evidence can be found in its blogs, such as THE ENDLESS PURSUIT OF SCALE AT LINKEDIN1.
“The largest Hadoop cluster at LinkedIn — which is probably the largest one in the world, certainly at a commercial entity if not a security agency for a major nation — has 10,000 nodes and weighed in at 500PB of storage capacity in 2020”. “The single HDFS NameNode for this monster can serve remote procedure calls with an average latency of under ten milliseconds, which is pretty good considering that this HDFS cluster has over 1.6 billion objects (that metric counts directories, files, and blocks together).” Yes, the cluster has 10,000 nodes, 500 PB of storage and 1.6 billion objects. And the single NameNode consumes about 380 GB of memory. How did LinkedIn go through the journey2?
“HDFS started running out of gas in 2016, and LinkedIn created Dynamometer to measure HDFS NameSpace performance and to see how it will perform as more load is applied to an actual system3.” “YARN started running into trouble in 2019 as LinkedIn started pushing scale hard, and the company created DynoYARN to simulate load on YARN clusters at any size and with any workload. This DynoYARN modeler has been open-sourced as of 4, too.” It is reported that LinkedIn stores 1 exabyte of data across all Hadoop clusters, while the largest 10,000-node cluster stores 500 PB of data. QJM is used as the Journal Service. The optimizations include Java tuning, such as the ration of Young and Tenured generation, global read-write lock in non-fair mode, Dynamometer to test the workload before deployment. File less than 512MB is treated as small, logging directory contains many small files (less than 1MB on average), so moved to a dedicated cluster with the help of a homebrew fs driver, FailoverFS. Observer node is the implementation of standby5. Fast Path tailing transaction to reduce the gap between Observer and Active. LastSeenStateId and LastWrittenStateId to help read your own writes. New HDFS API FileSystem.msync() to solve third-party communication. msync() is similar to HDFS hsync(). The latter guarantees that the data is available for other clients to read, while msync() provides the same guarantee for metadata. Now, the NameNodes perform 150K ops/sec on average, peaking at 250K ops/sec, with an average latency of 1-2 msec.
The data structure of NameNode is a combination of Java structures, not B+-like data structures used in database systems. And the checkpoint is a full dump of the memory, but it can be implemented in the standby servers called CheckpointNode6. Simplifying the process of this kind of functionality with the redundant node is popular in the primary/standby or multiple-replica deployment of systems. If there are no standby/replica nodes, we can offload the checkpoint to the storage system or some purpose-built facility. A full checkpoint will take a longer time when the footprint of the metadata grows. Incremental checkpoints may help, but the algorithm can be complex, and the merge of checkpoints can also be slower.
Footnotes
1 https://www.nextplatform.com/2021/09/15/the-endless-pursuit-of-scale-at-linkedin/
2 https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr
3 https://engineering.linkedin.com/blog/2018/02/dynamometer--scale-testing-hdfs-on-minimal-hardware-with-maximum
4 https://engineering.linkedin.com/blog/2021/scaling-linkedin-s-hadoop-yarn-cluster-beyond-10-000-nodes
5 https://issues.apache.org/jira/browse/HDFS-12943
6 https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Checkpoint_Node
Hadoop
]