In the previous part of this series, I tried to provide alternative options to deliver modern BI Solutions that meet the criteria.

In this article, we will evaluate Hadoop only solution alternatives, to meet all BI requirements: high concurrency, high performance (sub-second queries), portability & interoperability, fast ingestion, support & acceptance in the market with skillset availability, low cost, low footprint (no data duplication), no data engineering to maintain the solution.

Usually, Hadoop is used for historical data and ETL offloading. It's because of EDW deployments alone on-the-premise cannot scale well beyond 100TB for the increasing storage requirements and the decreasing performance. 

Besides, traditional EDW solutions may not manage new data types such as JSON, etc. So, Hadoop is quite a valid option for on-prem deployments.

Hadoop has several further derivatives which might be used to replace EDW functionality (to serve data marts to end users):

  • SQL-on-Hadoop — Hive LLAP, Impala, Presto, IBM BigSQL
  • Virtualized in-memory File System — Alluxio
  • OLAP-on-Hadoop — Druid, AtScale, Jethro

SQL-on-Hadoop

Once the required data model is created on Hadoop usually in star/snowflake schema, it can be served to BI tools (such as Micro Strategy or Tableau) through SQL access which may offer good query response times small data sets and low-to-medium concurrent users.

There are several SQL-on-Hadoop options which may be suitable for end-users for BI purposes such as Presto, Hive LLAP, IBM Big SQL:

  • Hive LLAP (in-memory MPP solutions optimised for ORC) uses the Hive execution engine which is based on Map Reduce either on disk/memory with Tez. The latency in the Map Reduce is mostly attributed to YARN container startup costs (which is reduced to a minimum with latest LLAP versions)
  • Presto is designed to federate queries over several different data sources (NoSQL, HDFS, DBMS, S3, Kafka, Kudu) using an approach other than Map Reduce.
  • IBM Big SQL (paid product), an alternative SQL-on-Hadoop, claims to deliver 3 times better output compared to Presto and Spark with support of concurrency levels around 100 and medium latency in query response times.

Hive LLAP against Presto — X-axis (response time in secs), Y-axis (Concurrency)

The above benchmark (done by Hortonworks) is used to assess the Star Schema which is the typical schema used in OLAP workloads and interactive business intelligence queries based on TPC-H35 on 50 GB data.

The Hive LLAP 3.0.1 performance is much better than Presto on big data sizes and higher concurrencies. IBM Big SQL may offer better performance on higher concurrency and bigger data sets.

So why are SQL-on-Hadoop engines not doing well enough and suffering under higher concurrencies?

Most SQL-on-Hadoop engines generally skip Map Reduce, instead of query data directly from HDFS. All of these tools share the same basic architecture, known as MPP full-scan. They have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still some latency overhead (e.g. SerDe time) associated with pulling data from the underlying storage layer into the computation layer.

SQL-on-Hadoop engines with their full-scan architecture will perform redundant scan work by reading the entire column of each query every time. Optimisations such as partitioning on a column(date, city, country) will not improve this as at any time a different column will be filtered/sorted.

Several workarounds exist such as denormalized star schema to prevent joins. But that does not allow incremental/daily data appending and the need to rebuild the entire data set each time. Also, pre-aggregation, multiple data partitioning on replicated data are maintenance intensive and only short term workarounds.

The latest versions of Hive LLAP and Presto can query small data-sets (below 50 GB) with up to a few concurrent users without failures and query response times under some minutes, however as the data set and concurrency gets higher, they will not be suitable for interactive BI queries requiring sub-second response times. IBM BigSQL is a mature but paid product and may deliver 3 times better performance compared to Hive LLAP.

By default, SQL-on-Hadoop is a good fit for ETL and limited ad-hoc/interactive queries run by end-users such as data science workloads.

If SQL-on-Hadoop is evaluated against the criteria set above, it will satisfy most of the criteria except high concurrency, high performance for interactive BI workloads.

Virtualized in-memory File System

In-Memory Virtualized Data Access solutions would be a complementary approach to SQL-on-Hadoop with the help of Virtual File System, such as Alluxio.

Alluxio (paid product) offers the Virtual File System where the data on HDFS, S3 is accessed through Alluxio as shown below. Hive can be configured to read all or partial data through Alluxio’s in-memory cache (determined based on partitions)

Virtual File System can help improve the performance up to 5 times approximately.

When SQL-on-Hadoop & Virtual File System together is evaluated against the criteria set above, it will satisfy most requirements except for the fact that the concurrency is expected to still remain below 50+ for sub second level response times on limited data sizes (up to a few Terabytes)

Conclusion

As a rule of thumb SQL-on-Hadoop is only good for interactive analytics with low concurrent workloads even if it is used  together with Virtual in-memory File System (which has limitations to fit data into-memory). Hence we need to continue our search for other components that might be suitable to serve enterprise BI workloads such as OLAP-on-Hadoop, etc...

Next, in the series is OLAP-on-Hadoop. Stay tuned!!!