Druid vs Hadoop Distributed File System Comparison

(32)

Sergey Lobko-Lobanovsky
Chief Executive Officer, Geomotiv
Published: May 4, 2019

These days, it’s hard to find industries that are not driven by verifiable data. Fast Big Data processing is extremely important for companies that depend on loads of data to compete. Problems related to Big Data have induced the urge to develop high-reliability systems. These systems ingest and read massive loads of data, and provide other useful data-related functionality.

Hadoop’s complex ecosystem has long since become a comprehensive answer to Big Data problems. The framework excels in storing large amounts of data. It also provides quick access to them through its Hadoop Distributed File System (HDFS).

From this perspective, Druid as a massive data store is close to HDFS. However, it has functional differences that result from its special architecture and design.

Let’s see what features stand behind the two solutions for Big Data management.

Type of Architecture

HDFS

Image provided by https://hadoop.apache.org

HDFS is implemented as a distributed file system. Here huge amounts of data are stored across multiple machines in a cluster. This special design allows data to be spread across thousands of computers each offering local (and complex) computation and storage facilities. The Hadoop architecture is hierarchical and contains main and secondary nodes. The architecture transmits the data and distributes it across the ecosystem in a master-slave mode.

Druid

Image provided by http://druid.io

Druid is a database with a distributed columnar architecture that comprises various types of nodes. It forms a cluster where each node is optimized to serve a particular function. Apart from this, Druid requires that some external dependencies work together within the cluster. Druid’s major components can be configured independently. Such a system design proves efficient in providing enhanced flexibility and control over the cluster.

Fault Tolerance

HDFS

When data flows across the cluster, the system divides the data into blocks. It further stores them in the cluster nodes. HDFS replicates each block in different nodes. Thus, if a computer goes down, the data remains available for retrieval from another machine.

Thus, HDFS provides enhanced fault tolerance through replication. However, Namenode becoming unavailable can damage the entire cluster. This critical issue can be resolved by using several Namenodes or storing them in a separate machine. The availability of various options that help to prevent data loss makes HDFS an extremely reliable file system. 

Druid

Druid communication failures have minimal impact on system performance thanks to its shared-nothing architecture. When the system receives data, it replicates it and places it in deep storage. In case one component becomes suddenly unavailable, the queries are easy to recover. That's because Druid’s major components are scaled and configured independently.

Druid’s reliability against sudden data loss is the key feature that makes it preferable when comparing to HDFS. It’s master-less architecture allows data to be retrieved from one of many nodes that are not so dependent on each other as we observe in HDFS.

Indexing

Full scan through mountains of data is a pretty common need in both technologies. However, each system treats this issue differently using different search modes.

Hadoop

Due to its nature, Hadoop DFS doesn’t use indexing to retrieve relevant files. HDFS requires external systems like Apache Hive to initiate indexed-search operations. However, a large data set can be split into smaller files before dumping to the storage system. This enables the user to refer to a particular file rather than search through all the records. Thus it helps to speed up the search process. 

Druid

Fast filtering and searching are done across multiple columns with the help of compressed bitmap indexing. The system also uses multiple indexes concurrently to retrieve the data from the database.

Types of Data Sets

When it comes to Big Data loads, system processing speed becomes as important as its storage capabilities. There are critical differences in the way both technologies handle various types of data sets.

HDFS

HDFS is ready to process large files with no system speed or performance loss. A large data set being written into the system gets split into smaller parts, whereby each node receives a portion of data to store. The process falls into a sequence of actions as the data flows node-by-node.

However, HDFS has limitations with smaller files due to its primary feature, i.e. streaming access to large files. So, numerous objects that are smaller than 128MB are likely to cause Namenode overload. These objects will consequently lead to overall system slow-down.

HDFS is ready to process large files with no system speed or performance loss.

Druid

As opposed to HDFS, Druid is good at writing smaller records easily accessible from each node. Each column in the storage is optimized for a certain type of data. It helps to facilitate serial data processing. The system can read and ingest the data simultaneously across the entire cluster. It is also possible to write the data to a certain part in the cluster.

Analytics

The ability to deliver analytics is a crucial when assessing Big Data storage system functionality. For instance, if you want to derive insights from large loads of files, the need for “simple, interactive data applications that anyone could use” becomes urgent, as the creator of Hadoop states. Let’s see how the two systems deliver data analytics.

HDFS

Being part of Apache Hadoop ecosystem, HDFS combines with Hadoop MapReduce and Spark to deliver extensive analytics and fulfill other Big Data-related tasks. HDFS is also compatible with the systems that are able to run sophisticated analysis. To name a few, Apache Hive, Impala, Pig, etc.

Druid

Powered by OLAP, Druid is able to deliver “slice-and-dice” analytics of large data sets. Its capabilities to ingest loads of data in real-time, together with a high-performance time-series database, makes time series data sets easy and quick to process.

Druid provides fast analytical queries, at high concurrency, on event-driven data. It can instantaneously ingest streaming data and provide sub-second queries to power interactive UIs.

Choose Your Answer to Data-Related Problems

The major component of Apache Hadoop, HDFS offers a scalable solution for storing and accessing large data files. The file system can scale to hundreds of nodes. Thus it can offer sufficient space for the records. Yet despite its storage capabilities, HDFS has certain limitations inherent in its architecture.

Druid with its “deep storage” functionality guarantees efficient data ingestion. It allows to have access to interactive analysis of raw data, and process the data. Managing the events as-they-occur in message buses like Kafka, or data lakes built on HDFS, Druid delivers actionable results that improve decision-making.

The data storage system ensures a high level of resistance to failures. It allows data to be preserved when sudden crashes occur.

Due to Hadoop’s specific architecture and its ability to split large volumes of data for fast distribution across the nodes, HDFS becomes the choice for storing huge amounts of data. However, it does have issues with accessing smaller blocks. In this case the Druid technology is a great fit. It's better in processing smaller portions of data at a high rate.

Conclusion

With an ever-increasing demand for the data to become more accessible and readable for both customers and internal users, the urge to introduce more analytics-friendly systems is on the rise. Your choice between HDFS and Druid depends on how exactly you’re planning to use your data and how fast you need to derive it.

References
SHARE THIS ARTICLE

Blog

Recommended Reading

The global Big Data market is growing year by year. Data-driven decisions have...

We’re already two months into 2021, which is an excellent time to discuss...

The global Big Data market is growing year by year. Data-driven decisions have...

Nowadays, use cases of Big data are numerous. Some of them deal with...

The Spark framework is an open-code general-purpose computational engine that is read...

Apache Kafka is a crucial component of Apache Software Foundation, which aims to...

01
/
05