Correctly configuring HDFS preventing data loss

Published: December 2016 - Themes: Big Data, data storage system, HDFS.

HDFS is the distributed storage system powering big data industry behemoths such as Spark, HBase and Hadoop. By bringing redundancy to the application layer, HDFS greatly simplifies dealing with failing disks and nodes. Because of this promise of consistency and redundancy, HDFS is often touted as a safe data storage system.

Imagine my surprise upon discovering our HDFS cluster lost several acknowledged blocks after a power outage. Although an HDFS acknowledgement guarantees the block is present on three DataNodes within your cluster, those blocks don’t have to be persisted to disk. If those three nodes are taken out of the cluster, that data is irreparably lost.

By setting dfs.datanode.synconclose to true in your DataNode configuration, HDFS will acknowledge the write after the block has been synced to disk. Unfortunately the DataNodes wait until the entire datablock has been received before writing it to disk. This introduces a long wait and effectively cuts throughput of our HDFS cluster down by a third.

Datablocks in HDFS are never to change. This promise of immutability means we should be able to start syncing to disk without waiting for the entire block. Thankfully HDFS allows us to do exactly that by setting dfs.datanode.sync.behind.writes to true. With this feature enabled, most of the incurred performance impact has been alleviated.

In short: dfs.datanode.synconclose and dfs.datanode.sync.behind.writes are two DateNode parameters that should be enabled on all HDFS clusters.