Big Data is the new normal in data centers today, the inevitable result of the fact that so much of what we do and what we buy is now digitally recorded, and so many of the products we use are leaving their own “digital footprint” (known as the “Internet of Things / IoT”). The cornerstone technology of the Big Data era is Hadoop, which is now a common and compelling component of the modern data architecture. The question these days is not so much whether to embrace Hadoop but rather which distribution to choose. The three most popular and viable distributions come from Cloudera, Hortonworks and MapR Technologies. Their respective products are CDH (Cloudera Distribution of Apache Hadoop), HDP (Hortonworks Data Platform) and MapR. This paper will look at the differences between CDH, HDP and MapR, with a focus on:
- The Companies behind them
- Management/Administration Tools
- SQL-on-Hadoop Offerings
- Performance Benchmarks
THE THREE CONTENDERS: CLOUDERA, HORTONWORKS AND MAPR
The table below shows some key facts and figures related to each company.
Table 1: Company profiles
Of the three companies, only Hortonworks is traded publicly (as of 12/12/14; NASDAQ: HDP). So valuations, revenues and other financial measures are harder to ascertain for Cloudera and MapR.
Cloudera’s valuation is based on a $740M investment by Intel in March 2014 for an 18% stake in the company. Hortonworks valuation is based on it’s stock price of $27 on 12/31/14, which happens to be equivalent to their valuation in July 2014 when HP invested $50M for just under 5% of the company. When Hortonworks went public in December 2014, they raised $110M; on top of $248M they had raised privately before their Initial Public Offering (IPO), which totals the $358M in the table above. That’s one third of what Cloudera has raised but twice what MapR has. I haven’t found any information that would allow me to determine a valuation for MapR, though Google Capital and others made an $80M investment in MapR (for an undisclosed equity stake) in June 2014.
Cloudera announced that for the twelve months ending 1/31/15 “Preliminary unaudited total revenue surpassed $100 million.” Hortonworks’ $46M revenue is what they reported for the year ending 12/31/14. MapR’s figure of $42M is Wikibon’s estimate of 2014 revenue. Price/Earnings or P/E is a common financial measure for comparing companies, but since none of the three companies have yet to earn any profits, I’ve used Price/Sales in the table above. For comparison, Oracle typically trades at around 5.1X Sales; RedHat at around 7.8X Sales. So 41X for Cloudera and 24X for Hortonworks, while not quite off the scale, are exceedingly high.
The last two rows in the table above show how many employees each company has on the Project Management Committee and as additional Committers on the Apache Hadoop project. This shows their level of involvement in and commitment to Hadoop being sustained and enhanced by the open source community. From the first four rows of Table 1, it is clear that Cloudera has the lead in terms of being first to market, raising the most money, having the highest valuation, and selling the most software. Hortonworks, on the other hand, is the leading proponent of Hadoop as a vibrant and innovative open source project. This is true for the core Hadoop project and its most essential sub-projects like Hadoop Common, Hadoop MapReduce and Hadoop YARN (the Project Lead for each of these is employed by Hortonworks). As for other widely used projects within the Hadoop ecosystem, Hortonworks also has the lead for projects related to data access like Hive and Pig, while Cloudera has the lead for projects related to data movement like Sqoop and Flume.
All three vendors have comprehensive tools for configuring, managing and monitoring your Hadoop cluster. In fact, all three received scores of 5 (on a scale of 0 to 5) for “Setup, management, and monitoring tools” in Forrester’s report on “Big Data Hadoop Solutions, Q1 2014”. The main difference between the three is that Hortonworks offers a completely open source, completely free tool (Apache Ambari) while Cloudera and MapR offer their own proprietary tools (Cloudera Manager and MapR Control System, respectively). While free versions of these tools do come with the free versions of Cloudera’s and MapR’s distribution (Cloudera Express and MapR Community Edition, respectively), the tools’ advanced features only come with the paid editions of their distribution (Cloudera Enterprise and MapR Enterprise Edition, respectively).
That’s sort of like having a car but only getting satellite radio when you pay a subscription. Although with Cloudera Manager and MapR Control System, it’s more like having the navigation system, Bluetooth connectivity, and the airbags enabled only when you pay a subscription. You can get from place to place just fine without these extras but, in certain cases, it sure would be nice to have the use of these “advanced features”. When you drive Ambari off the lot, on the other hand, you’re free to use any and all available features.
The advanced features of Cloudera Manager, which are only enabled by subscription, include:
- Quota Management for setting/tracking user and group-based quotas/usage.
- Configuration History / Rollbacks for tracking all actions and configuration changes, with the ability to roll back to previous states.
- Rolling Updates for staging service updates and restarts to portions of the cluster sequentially to minimize downtime during cluster upgrades/updates.
- AD Kerberos and LDAP/SAML Integration
- SNMP Support for sending Hadoop-specific events/alerts to global monitoring tools as SNMP traps.
- Scheduled Diagnostics for sending a snapshot of the cluster state to Cloudera support for optimization and issue resolution.
- Automated Backup and Disaster Recovery for configuring/managing snapshotting and replication workflows for HDFS, Hive and HBase.
The advanced features of MapR Control System (MCS), which are only enabled by subscription, include:
- Advanced Multi-Tenancy with control over job placement and data placement.
- Consistent Point-In-Time Snapshots for hot backups and to recover data from deletions or corruptions due to user or application error.
- Disaster Recovery through remote replicas created with block level, differential mirroring with multiple topology configurations.
Apache Ambari has a number of advanced features (which are always free and enabled), such as:
- Configuration versioning and history provides visibility, auditing and coordinated control over configuration changes, and management of all services and components deployed on your Hadoop Cluster (rollback will be supported in the next release of Ambari).
- Views Framework provides plug-in UI capabilities to surface custom visualization, management and monitoring features in the Ambari Web console, for example the Tez View, which comes with Ambari 2.0 and gives you visibility into all the jobs on your cluster, allowing you to quickly identify which jobs consume the most resources and which are the best candidates to optimize.
- Blueprints provide declarative definitions of a cluster, which allows you to specify a Stack, the Component layout and the configurations to materialize a Hadoop cluster instance (via a REST API) without the need for any user interaction.
Ambari leverages other open source tools that may already be in use within your data center, such as Ganglia for metrics collection and Nagios for system alerting (e.g. sending emails when a node goes down, remaining disk space is low, etc). Furthermore, Apache Ambari provides APIs to integrate with existing management systems including Microsoft System Center and Teradata ViewPoint.
The SQL language is what virtually every programmer and every tool uses to define, manipulate and query data. This has been true since the advent of relational database management systems (RDBMS) 35 years ago. The ability to use SQL to access data in Hadoop was therefore a critical development. Hive was the first tool to provide SQL access to data in Hadoop through a declarative abstraction layer, the Hive Query Language (QL), and a data dictionary (metadata), the Hive metastore. Hive was originally developed at Facebook and was contributed to the open source community / Apache Software Foundation in 2008 as a subproject of Hadoop. Hive graduated to top-level status in September 2010. Hive was originally designed to use the MapReduce processing framework in the background and, therefore, it is still seen more as a batch-oriented tool than an interactive one.
To address the need for an interactive SQL-on-Hadoop capability, Cloudera developed Impala. Impala is based on Dremel, a real-time, distributed query and analysis technology developed by Google. It uses the same metadata that Hive uses but provides direct access to data in HDFS or HBase through a specialized distributed query engine. Impala streams query results whenever they’re available rather than all at once upon query completion. While Impala offers significant benefits in terms of interactivity and speed, which will be clearly demonstrated in the next section, it also comes with certain limitations. For example, Impala is not fault-tolerant (queries must be restarted if a node fails) and, therefore, may not be suitable for long-running queries. In addition, Impala requires the working set of a query to fit into the aggregate physical memory of the cluster it’s running on and, therefore, may not be suitable for multi-terabyte datasets. Version 2.0.0 of Impala, which comes with CDH 5.2.0, has a “Spill to Disk” option that may avoid this particular limitation. Lastly, User-Defined Functions (UDFs) can only be written in Java or C++.
Consistent with its commitment to develop and support only open source software, Hortonworks has stayed with Hive as its SQL-on-Hadoop offering and has worked to make it orders of magnitude faster with innovations such as Tez. Tez was introduced in Hive 0.13 / HDP 2.1 (April 2014) as part of the “Stinger Initiative”. It provides performance improvements for Hive by assembling many tasks into a single MapReduce job rather than many by using Directed Acyclic Graphs (DAGs). From Hive 0.10 (released in January 2013) to Hive 0.13 (April 2014), performance improved an average of 52X on 50 TPC-DS Queries(the total time to run all 50 queries decreased from 187.2 hours to 9.3 hours). Hive 0.14, which was released in November 2014 and comes with HDP 2.2, has support for INSERT, UPDATE and DELETE statements via ACID transactions. Hive 0.14 also includes the initial version of a Cost-Base Optimizer (CBO), which has been named Calcite (f.k.a. Optiq). As we’ll see in the next section, Hive is still slower than its SQL-on-Hadoop alternatives, in part because it writes intermediate results to disk (unlike Impala, which streams data between stages of a query, or Spark SQL, which holds data in memory).
Like Cloudera with Impala, MapR is building its own interactive SQL-on-Hadoop tool with Drill. Like Impala, Drill is also based on Google’s Dremel. Drill began in August 2012 as an incubator project under the Apache Software Foundation and graduated to top-level status in December 2014. MapR employs 13 of the 17 committers on the project. It uses the same metadata that Hive and Impala use (Hive metastore). What’s unique about Drill is that it doesn’t need metadata as schemas can be discovered on the fly (as opposed to RDBMS schema on write or Hive/Impala schema on read) by taking advantage of self-describing data such as that in XML, JSON, BSON, Avro, or Parquet files.
A fourth option that none of the three companies are touting as their primary SQL-on-Hadoop offering but all have included in their distributions is Spark SQL (f.k.a. “Shark”). Spark is another implementation of the DAG approach (like Tez). A significant innovation that Spark offers is Resilient Distributed Datasets (RDDs), an abstraction that makes it possible to work with distributed data in memory. Spark is a top-level project under the Apache Software Foundation. It was originally developed at the UC Berkeley AMPLab, became an incubator project in June 2013, and graduated to top-level status in February 2014. Spark currently has 35 committers from 14 different organizations (the most active being Databricks with 12 committers, UC Berkeley with 7, and Yahoo! with 4). CDH 5.3, HDP 2.2 and MapR 4.1 all include Spark 1.2 (MapR 4.1 also includes Impala 1.4.1). Furthermore, most major tool vendors have native Spark SQL connectors, including MicroStrategy, Pentaho, QlikView, Tableau, Talend, etc. In addition to HDFS, Spark can run against HBase, MongoDB, Cassandra, JSON, and Text files. Spark not only provides database access (with Spark SQL), but also has built-in libraries for continuous data processing (with Spark Streaming), machine learning (with MLlib), and graphical analytics (with GraphX). While Spark and Spark SQL are still relatively new to the market, they have been rapidly enhanced and embraced, and have the advantage of vendor neutrality, not being owned or invented by any of the three companies, while being endorsed by them all. In my opinion, this gives Spark SQL the best chance of becoming the predominant – if not the standard – SQL-on-Hadoop tool.
In August 2014, the Transaction Processing Performance Council (www.tpc.org) announced the TPC Express Benchmark HS (TPCx-HS). According to the TPC, this benchmark was developed “to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics.” Simply stated, the TPCx-HS benchmark measures the time it takes a Hadoop cluster to load and sort a given dataset. Datasets can have a Scale Factor (SF) of 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB or 10000TB. The workload consists of the following modules:
- HSGen – generates the data at a particular Scale Factor (based on TeraGen)
- HSDataCheck – checks the compliance of the dataset and replication
- HSSort – sorts the data into a total order (based on TeraSort)
- HSValidate – validates the output is sorted (based on TeraValidate)
The first TPCx-HS result was published by MapR (with Cisco as its hardware partner) in January 2015. Running MapR M5 Edition 4.0.1 on RHEL 6.4 on a 16-node cluster, their results for sort throughput (higher is better) and price-performance (lower is better) were:
- 5.07 HSph and $121,231.76/HSph @1TB Scale Factor
- 5.10 HSph and $120,518.63/HSph @3TB Scale Factor
- 5.77 HSph and $106,524.27/HSph @10TB Scale Factor
Cloudera responded in March 2015 (with Dell as its hardware partner). Running CDH 5.3.0 on Suse SLES 11 SP3 on a 32-node cluster, their results were:
- 19.15 HSph and $48,426.85/HSph @30TB Scale Factor (without virtualization)
- 20.76 HSph and $49,110.55/HSph @30TB Scale Factor (with virtualization)
As of April 2015, Hortonworks has yet to publish a TPCx-HS result.
A handful of benchmarks have been performed that measure analytical workloads, which are more complex than simple sorts. These benchmarks mimic real-world data warehouse / business intelligence systems and focus on the relative performance of the different SQL-on-Hadoop offerings. Some of these are derived from the TPC’s existing benchmarks for decision support systems, TPC-DS and TPC-H, which have been widely embraced by relational database and data appliance vendors for decades. Two such benchmarks are presented below, along with a third that is not based on a TPC benchmark. All three are fairly recent, having been published in 2014.
In August 2014, three IBM Researchers published a paper in Proceedings of the VLDB Endowment (Volume 7, No. 12) that compares Impala to Hive. They used the 22 queries specified in the TPC-H Benchmark but left out the refresh streams that INSERT new orders into the ORDERS and LINEITEM tables, then DELETE old orders from them. They ran this workload against a 1TB database / Scale Factor of 1,000 (a 3TB database / Scale Factor of 3,000 would have exceeded Impala’s limitation that requires the workload’s working set to fit in the cluster’s aggregate memory).
Results: Compared to Hive on MapReduce:
- Impala is on average 3.3X faster with compression
- Impala is on average 3.6X faster without compression
Compared to Hive on Tez:
- Impala is on average 2.1X faster with compression
- Impala is on average 2.3X faster without compression
As a side-note to the performance figures above, it’s interesting to see that compression generally helped with Hive and ORC files, at times dramatically (>20%). With Impala and Parquet files, on the other hand, compression hurt performance more often than not and never improved performance significantly (>20%).
Next, they used a subset of the TPC-DS Benchmark, consisting of 20 queries that access a single fact table (STORE_SALES) and 6 dimension tables (the full TPC-DS benchmark involves 99 queries that access 7 fact tables and 17 dimension tables). They ran this workload against a 3TB database / Scale Factor of 3,000and found:
- Impala is on average 8.2X faster than Hive on MapReduce
- Impala is on average 4.3X faster than Hive on Tez
Environment: Hive 0.12 on Hadoop 2.0.0 / CDH 4.5.0 for Hive on MapReduce; Hive 0.13 on Tez 0.3.0 and Apache Hadoop 2.3.0 for Hive on Tez; Impala 1.2.2. The cluster consisted of 21 nodes, each with 96GB of RAM connected through a 10Gbit Ethernet switch. Data is stored in Hive using Optimized Row Columnar (ORC) File format and in Impala using Parquet columnar format – both with and without snappy compression.
Cloudera published a benchmark in September 2014 shortly after the release of Impala 1.4.0 that compares the performance of Impala versus Hive on Tez versus Spark SQL. It uses a subset of the TPC-DS Benchmark, consisting of 8 “Interactive” queries, 6 “Reporting” queries, and 5 “Analytics” queries, for a total of 19 queries that access a single fact table (STORE_SALES) and 9 dimension tables. They ran this workload against a 15TB database / Scale Factor of 15,000, which is not one of the Scale Factors allowed by TPC. They could have used either 10TB / 10,000 SF or 30TB / 30,000 SF, which are two of the seven Scale Factors that are officially recognized by TPC. They also could have run all 99 of the TPC-DS queries or even the first 19 of the 99 queries, for example, but instead picked and chose 19 of the queries for the single-user test, then just 8 of those for the multi-user test. I have to assume Cloudera chose that particular Scale Factor / database size and those particular queries because they yielded the best comparative results for Impala. This is why I’m generally wary of benchmark results that are published by any given product’s vendor unless they comply with the full benchmark specifications and are independently audited.
Results: With a single-user workload that runs the 19 queries:
- Impala is on average 7.8X faster than Hive on Tez
- Impala is on average 3.3X faster than Spark SQL
With 10 concurrent users running just the 8 Interactive queries:
- Impala is on average 18.3X faster than Hive on Tez
- Impala is on average 10.6X faster than Spark SQL
Environment: Impala 1.4.0, Hive 0.13 on Tez and Spark SQL 1.1. 21 node cluster, each with 2 processors, 12 cores, 12 disk drives at 932GB each, and 64GB of RAM. Data is stored in Impala using Parquet columnar format, in Hive using Optimized Row Columnar (ORC) File format, and in Spark SQL using Parquet columnar format – all with snappy compression.
Another Big Data Benchmark was performed by the UC Berkeley AMPLab in February 2014. It compares the performance of Impala 1.2.3 versus Hive 0.12 on MapReduce versus Hive 0.12 on Tez 0.2.0 versus Spark SQL 0.8.1 on Amazon EC2 clusters with small, medium and large datasets. Based on the paper “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. (from Brown University, M.I.T., etc.), it uses the following tables (with data from CommonCrawl.org, which contains petabytes of data collected over 7 years of web crawling):
Against which the following queries are run:
- SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;
- SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X);
- SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(`1980-01-01′) AND Date(`X’) GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1;
Impala was clearly built for speed and Cloudera is the current leader in SQL-on-Hadoop performance over Hortonworks with Hive and MapR with Drill. Impala’s lead over Spark SQL, however, is less clear. Spark SQL has the advantage with larger datasets and also with simple range scans regardless of number of rows queried. Hive, while currently slower than Impala, does support a fuller set of SQL commands, including INSERT, UPDATE and DELETE with ACID compliance. It also supports windowing functions and rollup (not yet supported by Impala), UDFs written in any language, and provides fault tolerance. Furthermore, Hive has a brand new Cost-Based Optimizer (Calcite) that, over time, should deliver performance improvements, while also reducing the risk of novice Hive developers dragging down performance with poorly written queries. One advantage Drill holds over Impala, Hive and Spark SQL is its ability to discover a datafile’s schema on the fly without needing the Hive metastore or other metadata.
All three vendors tout the advantages of open source software. After all, Hadoop was essentially born in the open source community (with seminal contributions from Google and Yahoo!) and has thrived in it. Each vendor contributes code and direction to many of the open source projects in the Hadoop ecosystem. Hortonworks is by far the most active vendor in this regard, Cloudera next most, and MapR the least active in the open source community. Hortonworks is also the leader in terms of keeping all of its software “in the open”, including its tools for management / administration (Ambari) and SQL-on-Hadoop (Hive). Cloudera and MapR may claim that their management / administration tools are also free but their advanced features are only enabled by paid subscription, or that their SQL-on-Hadoop tools are also open source but virtually all of the code is written by the respective vendor. Furthermore, MapR has made proprietary extensions to one of the core Hadoop projects, HDFS (Hadoop Distributed File System). So in terms of their commitment and contributions to Hadoop and its ecosystem as open source projects, Hortonworks is the leader, followed by Cloudera, followed by MapR.
Ultimately, the choice comes down to your particular priorities and what you’re willing to pay for with your subscription. For example, if MapR’s customizations to HDFS are important to you because they provide better or faster integration with your NAS, then perhaps MapR is right for you. If Hortonworks’ leading involvement in the Apache Hadoop ecosystem is important to you because you believe in the principles of community driven innovation and the promise of better support from the company that contributes more of the code, and in avoiding vendor lock-in due to dependence on their proprietary extensions, then perhaps HDP is right for you. If Cloudera’s Impala is important to you because you need the fastest SQL-on-Hadoop tool – at least for queries that primarily access a single large table and can afford to be restarted if a node fails, then perhaps CDH is right for you.