What is wrong with the below BI Architecture deployed in most organizations today?

Even if the data types (structured and semi-structured data: JSON, XML, CSV, TSV) can be managed only by EDW which can handle big data or only by the Distributed Storage/Processing platform (such as Hadoop), why do they both exist in this architecture? Or why is it not something else?

It takes significant IT effort to make traditional BI solutions work, and the labor-intensive, time-consuming tasks of moving data from the data lake to the BI environment, securing the data in both environments, and creating data subsets and schemas to analyze the data is costly and delays the time to insight.

The above architecture has several flaws as listed below;

  1. Data Duplication — the data is stored multiple times. Hadoop ingests or prepares the model to be served and syncs it to EDW semantic layer.
  2. High Cost — The cost of maintaining both EDW and Hadoop platform is quite high.
  3. Duplicate Components — Hadoop platform is used for its scalable data storage and batch/stream processing capabilities whereas EDW can do the same with more investment or by moving to the cloud.
  4. Batch Analytics — The data is ingested into Data Lake and processed in batch. There is no Operational Data Store to serve operational dashboards/reports and the trend is real-time analytics.

Mostly all technologies, tools or platforms have limitations. They are fine-tuned for a few purposes to do well. In this article series, I will try to come up with technology selection criteria and try to evaluate options.

Let's define the modern BI Platform Requirements:

  • Concurrency: Most enterprises demands 50+ Concurrent Users for Reporting/Dashboards.
  • Performance: BI users expect their dashboards to respond in no more than 5–10 seconds.
  • Selectivity: A user is typically interested in a relatively small subset of the data and would use filters to identify it. And each user, will likely be interested in a different subset.
  • Ad-hoc & Agile: With self-service BI dashboards new queries will be created frequently.
  • Complex Data Manipulation: Multidimensional Analysis requires the joining of tables, sorting of data, large aggregations, and other expensive operations.
  • Data Engineering: No additional data engineering will be required to maintain the self-service BI.
  • Fast Data Processing: Most data requirements are batch. But more and more real-time use-cases (IOT real-time analytics, etc) are being demanded.
  • Vendor-lockin: The platform usually locked to a vendor and/or to infrastructure.

To summarize, the criteria to evaluate a modern BI platform (processing structured and semi-structured data — JSON, XML, CSV, TSV) following criteria is considered in the order of importance:

  • high concurrency
  • high performance (sub-second queries)
  • portability & interoperability
  • fast ingestion
  • support & acceptance in the market with skillset availability
  • low cost
  • low footprint (no data duplication)
  • no data engineering to maintain solution

Technology Options (other than hybrid data platform with Hadoop + EDW) we will evaluate are:

  1. Hadoop only option
  2. EDW only option
  3. HTAP In-Memory (Hybrid OLTP + OLAP) option
  4. Semantic Graph Lake (Data lake + Graph OLAP MPP) option.  The Data Fabric approach can be used in replace of Hadoop and EDW for certain use-cases using Graph Marts.

In the next part of this series, each option will be explored further...