What is wrong with the BI architecture below, which is deployed in most organizations today?
If these data types (structured and semi-structured data: JSON, XML, CSV, TSV) can be managed by the EDW alone or by the Hadoop data platform alone, why do both exist in this architecture?
Usually, Hadoop is used for historical data and ETL offloading. This is because on-premises EDW deployments do not scale well beyond 100 TB: storage requirements keep growing while performance degrades. Besides, traditional EDW solutions may not handle newer data types such as JSON.
It takes significant IT effort to make traditional BI solutions work. The labor-intensive, time-consuming tasks of moving data from the data lake to the BI environment, securing data in both environments, and creating data subsets and schemas to analyze the data are costly and delay the time to insight.
In this article series, I will explore the most effective options for a modern BI architecture that can still meet enterprise requirements for self-service analytics with interactive workloads (50+ concurrent users) and provide cost-effective, scalable storage and processing.
In short, the above architecture has several flaws, as listed below:
- Data Duplication: The data is stored multiple times. Hadoop ingests the data or prepares the model to be served, and then syncs it to the EDW semantic layer.
- High Cost: The cost of maintaining both the EDW and the Hadoop platform is quite high.
- Duplicate Components: The Hadoop platform is used for its scalable data storage and batch/stream processing capabilities, whereas the EDW can do the same with more investment or by moving to the cloud.
Moreover, Interactive BI mandates the following non-functional requirements:
- Concurrency: A typical BI dashboard may need to serve 50+ concurrent users.
- Performance: BI users expect their dashboards to respond in no more than 5–10 seconds.
- Selectivity: A user is typically interested in a relatively small subset of the data and would use filters to identify it. And each user will likely be interested in a different subset.
- Ad-hoc & Agile: With self-service BI dashboards new queries will be created frequently.
- Complex Data Manipulation: Multidimensional Analysis requires the joining of tables, sorting of data, large aggregations and other expensive operations.
- Data Engineering: No additional data engineering should be required to maintain the self-service BI solution.
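The concurrency and performance requirements above can be turned into a simple load check. The following is a minimal sketch, not a benchmark tool: `run_query` is a hypothetical stand-in for whatever call submits a dashboard query to the engine under test, and the 50-user and 10-second defaults are simply the targets listed above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(user_id: int) -> None:
    """Hypothetical placeholder: replace with a real query against the engine under test."""
    time.sleep(0.01)  # simulate a fast query so the sketch runs anywhere


def timed_call(query_fn, user_id: int) -> float:
    """Return the wall-clock seconds one simulated user waits for a response."""
    start = time.perf_counter()
    query_fn(user_id)
    return time.perf_counter() - start


def check_interactive_sla(query_fn, users: int = 50, max_seconds: float = 10.0) -> dict:
    """Run one query per simulated user concurrently and compare the slowest to the budget."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(lambda u: timed_call(query_fn, u), range(users)))
    worst = max(latencies)
    return {"users": users, "worst_seconds": worst, "meets_sla": worst <= max_seconds}


if __name__ == "__main__":
    print(check_interactive_sla(run_query))
```

A real check would also vary the filters per user, since each user is expected to look at a different subset of the data.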
I would define some criteria for evaluating the options for a modern BI platform (one that processes structured and semi-structured data: JSON, XML, CSV, TSV), such as:
- high concurrency
- low footprint (no data duplication)
- no data engineering to maintain solution
- high performance (sub-second queries)
- low cost
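One way to keep the evaluation consistent across options is to record each criterion as pass or fail and list what is still missing. The sketch below is only an illustration: the names `CRITERIA` and `unmet_criteria` are hypothetical, and the example candidate uses placeholder values rather than real measurements.

```python
CRITERIA = (
    "high concurrency",
    "low footprint",
    "no data engineering",
    "high performance",
    "low cost",
)


def unmet_criteria(assessment: dict) -> list:
    """Return the criteria a candidate fails or has not been assessed on, in list order."""
    unknown = set(assessment) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    # A criterion missing from the assessment counts as unmet.
    return [c for c in CRITERIA if not assessment.get(c, False)]


if __name__ == "__main__":
    # Illustrative placeholder values only, not an evaluation of any real product.
    example_candidate = {"high concurrency": True, "low cost": True}
    print(unmet_criteria(example_candidate))
```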
Based on the above criteria, I would like to explore the following options next:
1) Hadoop Only Solutions:
- Remove EDW from the BI Architecture. Ingest all data into the Hadoop Platform, process the data and expose the data products directly to BI Tools through SQL-on-Hadoop (Hive LLAP, Impala, Presto, IBM BigSQL) Solutions.
- Remove EDW from the BI Architecture. Ingest all data into the Hadoop Platform, process the data and expose the data products directly to BI Tools through emerging OLAP-on-Hadoop (Druid, AtScale, Jethro) Solutions.
- Remove EDW from the BI Architecture. Ingest all data into the Hadoop Platform, process the data and expose the data products directly to BI Tools through Virtualized File System/Data Caching Layers (Alluxio, Apache Arrow).
2) EDW Only Solutions:
- Remove Hadoop from the BI Architecture. Ingest all data into a cloud EDW offered as an insights-as-a-service platform (Snowflake, BigQuery, Redshift), which can work with these data types (JSON, XML, CSV, TSV), and expose the data products directly to BI Tools.
3) HTAP — Neither Hadoop nor EDW:
- Replace both Hadoop and EDW with one HTAP Database (SAP HANA, Splice Machine, Kudu) that can handle both OLTP and OLAP workloads for structured/semi-structured data (JSON, XML, CSV, TSV).
4) Other Alternatives:
- Replace both Hadoop and EDW with separate storage and processing components, using Databricks, Dremio, etc.
In the next part of this series, each option will be explored further...