Admittingly, the concept of an Enterprise Data Warehouse (EDW) is a complex one. To some, an EDW is simply a DB with lots of data and, these days, even more due to big data. Others use the terms OLAP and EDW almost synonymously due to the fact that, typically, analytic queries are running on that big, huge, outrageous amount of data. To a third group, an EDW is a concept of architecting, managing and governing data from multiple sources, i.e. from multiple, isolated contexts and, thus, from multiple notions of consistency. So, while some think of an EDW as a pure technical problem (lots of data, performance, scalability, …), others look at it as a semantical challenge (single version of the truth, compliance, …). This is why some think HANA must be the answer (i.e. to the technical problem), while others favour approaches like BW (Business Warehouse). This blog will try to explain why both are right and why SAP is on the path to create what we label the HANA EDW.
There is two fundamental challenge categories to data warehousing:
- One is everything around processing large amounts of data, i.e. bulk loads, analytic querying, table partitioning, scalability, performance etc.
- The other is around the processes and the data models inside the data warehouse, i.e. questions around
- What happens if a column, table or any other objects is added, changed or removed?
- What’s the impact of those changes on the (loading, archiving, housekeeping, …) processes or other, related data models and their underlying queries?
- Who has uploaded something at what moment and why are the results now different?
- What’s the relationship between a “customer” in table A and the “partner” in table B? Are they the same? Or are they partially overlapping? Are these assumptions guaranteed by the system (e.g. through data quality processes)?
- Has the data been loaded correctly? Have there been any alerts?
- When has the last upload been performed from which source?
- Are my KPIs (like margin) consistent across all my models (and their underlying queries)?
The Data Warehousing Quadrant
1. and 2. are basically orthogonal dimensions of the data warehousing problem. In fact, many customers present us their challenges along those categories. See figure 1 for an example. Based on this, figure 2 shows a Data Warehousing Quadrant and categorises systems along those two dimensions (or “challenge categories” as labeled above). The increasing pressure to analyse the business and its underlying processes pushes along both dimensions – see orange arrows in figure 2:
- More granular data is loaded and analysed.
- More data is available through e-commerce, sensors and other sources.
- More scenarios are analysed.
- More scenarios (data models) are combined and integrated with each other. That requires much more effort asserting uniform master data, security, consistency, compliance and data quality.
The HANA EDW
From an SAP perspective, the two challenges and their associated pushes (i.e. the orange arrows in figure 2) are handled by HANA (mainly addressing category 1. = “data volume” in figure 2) and a managed approach implemented via BW (mainly addressing category 2. = “number of data models, sources” in figure 2). Migrating classic BWs from a classic RDBMS to BW-on-HANA mostly addresses the upwards push along category 1. But what about when you are sitting on the left hand side of the quadrant, i.e. you are running a hand-crafted, SQL-oriented data mart, data warehouse or very large data warehouse (VLDW)? First of all, some (BW) customers run such systems along their BW(s) as BW seems to be an overkill given that there are only a small number of tables or they have a scenario that requires manual optimisations or whatever. Secondly, some of those customers have started “small” (e.g. only with a few tables, data sources, processes etc) but run into more and more challenges over times as the number of tables, sources, processes and requirements (security, compliance etc) grows. In fact, I’ve been asked multiple times by customers who have started with a dedicated HANA data mart about how to add features and functions (that they know from BW) to that data mart. Basically, this is the horizontal arrow in figure 2 pushing towards the right hand side.
The good news for those customers is now that they can do both inside one HANA instance: BW-on-HANA, a SQL-oriented DW (or data mart) and a combination of both! This setup has been labeled the HANA EDW and comprises BW and the native, SQL minded approach. This is depicted in figure 3. The interesting part here are not the extremes – i.e. BW-on-HANA and the “SQL EDW” – but the area inbetween. This area is often referred to as the mixed scenario. BW 7.3 (on HANA) has seen some features in this respect but BW 7.4 will provide a lot more. The idea is that a customer can start at either end – i.e. on the BW side or on the “SQL EDW” side – and move into the other direction as required:
- BW data can be exposed as HANA views and can be accessed via native HANA interfaces like SQL, MDX, … See here for details.
- HANA views can be exposed as an infoprovider inside BW. See “bridge 1.” in here.
- BW 7.4 will see a revised acquisition layer called the Open ODS which is based on fields rather than infoobjects. This will allow a gradual integration of tables, views and other tabular structures into a managed data warehousing architecture like the LSA which is therefore re-labeled as LSA++.
- BW 7.4 will have many new modeling elements, like a revised composite provider which can incorporate native HANA views.
- In BW 7.4, even the modeling tools are planned to gradually move into Eclipse mingeling smoothely with other Eclipse based tooling like HANA studio, HANA modeler, ABAP workbench, Design Studio etc. – relational and BW modeling out of the same and modern environment. See here for more details.
These and other features will allow the smooth transitions into both directions as indicated in figure 3. Exactly those smooth transitions are a competitive differentiator which will be hard to beat as they are typically managed by copying data between systems, maintaining semantics twice, running two separate systems etc. BW will continue to be at the heart of this strategy.
|DB size (TB)||1.||3.2||4.8||5.5|
|DB growth per month (GB)||1.||80||145||200|
|Average number of users logged into the system||1.||150||200||400|
|Number of queries per day||1.||4000||5000||8000+|
|Number of Infoproviders (for reporting)||2.||395||1002||842|
|Number of DSOs||2.||263||292||320|
Figure 1: This is an example of how a BW customer explains the challenge in his data warehouse. Others add information on process chains, BEx queries, transformations and other (TLOGO) objects.
Figure 2: The Data Warehousing Quadrant
Figure 3: The HANA EDW
PS: Due to popular demand I provide the PPTX version of the figures above.
PPS: In the meantime, I’ve published another blog A Flavour of the HANA EDW which provides an example of how this translates into planned product features.