The Heidelberg Laureate Forum 2017

From Sep 24 to 29, the 5th edition of the Heidelberg Laureate Forum (HLF) took place, mainly on the premises of the University of Heidelberg. It intends to bring together laureates in mathematics and computer science with young researchers. The program roughly looks like this: during the mornings, the laureates give presentations on their field, their experience, their opinion on where things are heading, interesting research questions etc. During breaks, workshops, poster sessions, social and other events, there is plenty of room to pick up one or the other aspect from those presentations and have a fruitful discussion with some of the leading brains in the field. One of my favourite moments was a coincidental lunchtime chat with Sir Michael Atiyah on a variety of topics; it is exactly that variety that makes the HLF so fascinating.

Material from the HLF is available on the internet:

I could not attend all the presentations this year. But from those that I could attend, I recommend the following 3; this is an admittedly subjective selection, as all presentations had impressive content:

Finally, on the last day of the HLF, some laureates paid a visit to SAP. This was similar to last year’s visit, with a tour through SAP’s inspiration pavilion and a demo of SAP’s digital boardroom.

This blog has also been published on Linkedin. You can follow me on Twitter via @tfxz.


SAP Products and the DW Quadrant

In a recent blog, I introduced the Data Warehousing Quadrant, a problem description for a data platform that is used for analytic purposes. Such a platform is traditionally called a data warehouse (DW), but labels such as data mart, big data platform, data hub etc. are also used. In this blog, I will map some of the SAP products into that quadrant, which will hopefully yield a more consistent picture of the SAP strategy.

To recap: the DW quadrant has two dimensions. One indicates the challenges regarding data volume, performance, query and loading throughput and the like. The other one shows the complexity of the modeling on top of the data layer(s). A good proxy for the complexity is the number of tables, views, data sources, load processes, transformations etc. Big numbers indicate many dependencies between all those objects and, thus, high efforts when things get changed, removed or added. But it is not only the effort: there is also a higher risk of accidentally changing, for example, the semantics of a KPI. Figure 1 shows the space outlined by the two dimensions. The space is then divided into four subcategories: the data marts, the very large data warehouses (VLDWs), the enterprise data warehouses (EDWs) and the big data warehouses (BDWs).

Figure 1: The DW quadrant.

Now, there are several SAP products that are relevant to the problem space outlined by the DW quadrant. Some observers (customers, analysts, partners, colleagues) would like SAP to provide a single answer or a single product for that problem space. Fundamentally, that answer is HANA. However, HANA is a modern RDBMS; a DW requires tooling on top. So, more is required than just HANA. Figure 2 assigns SAP products / bundles to the respective subquadrants. The assignment is meant as a “flexible rule of thumb” rather than a hard one. For example, BW/4HANA can play a role in more than just the EDW subquadrant; we will discuss this below. Still, it becomes clear where the sweet spots or focus areas of the respective products are.

Figure 2: SAP products assigned to subquadrants.

From a technical and architectural perspective, there are many relationships between those SAP products. For example, operational analytics in S/4 heavily leverages the BW embedded inside S/4. Another example is BW/4HANA’s ability to combine with any SQL object, like SQL-accessible tables, views, procedures / scripts. This allows smooth transitions or extensions of an existing system into one or the other direction of the quadrant. Figure 3 indicates such transition and extension options:

  1. Data Mart → VLDW: This is probably the most straightforward path as HANA has all the capabilities for scale-up and scale-out to move along the performance dimension. All products listed in the data mart subquadrant can be extended using SQL-based modeling.

  2. Data Mart → EDW: S/4 uses BW’s analytic engine to report on CDS objects. Similarly, BW/4HANA can consume CDS views either via the query or in many cases also for extraction purposes. Native HANA data marts combine with BW/4HANA similarly to the HANA SQL DW (see 3.).

  3. VLDW ⇆ EDW: Here again, I refer you to the blog describing how BW/4HANA can combine with native SQL. This allows BW/4HANA to be complemented with native SQL modeling and vice versa!

  4. VLDW or EDW → BDW: Modern data warehouses incorporate unstructured and semi-structured data that gets preprocessed in distributed file or NoSQL systems connected to a traditional (structured), RDBMS-based data warehouse. The HANA platform and BW/4HANA will address such scenarios. Watch out for announcements around SAPPHIRE NOW 😀

Figure 3: Transition and extension options.

The possibility to evolve an existing system – located somewhere in the space of the DW quadrant – to address new and/or additional scenarios, i.e. to move along one or both dimensions, is an extremely important and valuable asset. Data warehouses do not remain static; they are permanently evolving. This means that investments are secure and so is the ROI.

This blog has also been published here. You can follow me on Twitter via @tfxz.

The Data Warehousing Quadrant

A good understanding or a good description of a problem is a prerequisite to finding a solution. This blog presents such a problem description, namely for a data platform that is used for analytic purposes. Traditionally, this is called a data warehouse (DW), but labels such as data mart, big data platform, data hub etc. are also used in this context. I’ve named this problem description the Data Warehousing Quadrant. An initial version has been shown in this blog. Since then, I’ve used it in many meetings with customers, partners, analysts, colleagues and students. It has the nice effect that it makes people think about their own data platform (problem) as they try to locate where they are and where they want to go. This is extremely helpful as it triggers the right dialog. Only if you work on the right questions will you find the right answers. Or, put the other way round: if you start with the wrong questions – a situation that occurs far more often than you’d expect – then you are unlikely to find the right answers.

The Data Warehousing Quadrant (Fig. 1) has two problem dimensions that are independent from each other:

  1. Data Volume: This is a technical dimension which comprises all sorts of challenges caused by data volume and/or significant performance requirements such as: query performance, ETL or ELT performance, throughput, high number of users, huge data volumes, load balancing etc. This dimension is reflected on the vertical axis in fig. 1.

  2. Model Complexity: This reflects the challenges triggered by the semantics, the data models, the transformation and load processes in the system. The more data sources are connected to the DW, the more data models, tables and processes exist. So, the number of tables, views and connected sources is probably a good proxy for the complexity of the modeling inside the DW. Why is this complexity relevant? The lower it is, the less governance is required in the system. The more tables, models and processes there are, the more dependencies between all those objects exist and the more difficult it becomes to manage those dependencies whenever something (like a column of a table) needs to be added, changed or removed. This is the day-to-day management of the “life” of a DW system. This dimension is reflected on the horizontal axis in fig. 1.
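To make the dependency argument concrete, here is a minimal sketch (all object names are invented for illustration) of why the sheer number of objects is a good complexity proxy: a change to one table ripples through every view, load process and query built on top of it, and all of them have to be checked.

```python
from collections import deque

# Toy dependency graph: each object (table, view, load process, query)
# lists the objects built directly on top of it. Names are hypothetical.
DEPENDENTS = {
    "sales_table":      ["sales_view", "load_sales"],
    "sales_view":       ["revenue_kpi_view"],
    "load_sales":       [],
    "revenue_kpi_view": ["dashboard_query"],
    "dashboard_query":  [],
}

def impacted(obj: str) -> set:
    """All downstream objects that must be checked when `obj` changes."""
    seen, queue = set(), deque([obj])
    while queue:
        for dep in DEPENDENTS.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(impacted("sales_table"))
# changing the base table touches every object layered on top of it
```

With only five objects this is trivial; with thousands of tables, views and load processes, exactly this kind of impact analysis becomes the day-to-day work described above.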

Figure 1: The DW quadrant.

Now, these two dimensions create a space that can be divided into four (sub-) quadrants which we discuss in the following:

Bottom-Left: Data Marts

Here, the typical scenarios are, for example,

  • a departmental data mart, e.g. a marketing department sets up a small, maybe even open-source-based RDBMS system and creates a few tables that help to track a marketing campaign. Those tables hold data on customers that were approached, their reactions or answers to questionnaires, addresses etc. SQL or other views allow some basic evaluations. After a few weeks, the marketing campaign ends, hardly any or no data gets added, and the data, the underlying tables and views slowly “die” as they are not used anymore. Probably, one or two colleagues are sufficient to handle the system, both setting it up and creating the tables and views. They know the data model intimately, data volume is manageable and change management is hardly relevant as the data model is either simple (thus changes are simple) or has a limited lifespan (≈ the duration of the marketing campaign).

  • An operational data mart. This can also be the data that is managed via a certain operational application as you find them e.g. in an ERP, CRM or SRM system. Here, tables, data are given and data consistency is managed by the related application. There is no requirement to involve additional data from other sources as the nature of the analyses is limited to the data sitting in that system. Typically, data volumes and number of relevant tables are limited and do not constitute a real challenge.
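As a toy illustration of such a departmental data mart (all names and numbers are invented), two tables plus one view already cover the kind of basic evaluations described above:

```python
import sqlite3

# A hypothetical marketing-campaign data mart in miniature: two tables
# and one view suffice for the basic evaluations.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE contact  (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE response (contact_id INTEGER REFERENCES contact(id),
                       answered   INTEGER);     -- 1 = replied, 0 = no reply
CREATE VIEW response_rate AS
    SELECT c.city, AVG(r.answered) AS rate
    FROM contact c JOIN response r ON r.contact_id = c.id
    GROUP BY c.city;
""")
con.executemany("INSERT INTO contact VALUES (?, ?, ?)",
                [(1, "A", "Heidelberg"), (2, "B", "Heidelberg"), (3, "C", "Walldorf")])
con.executemany("INSERT INTO response VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
for row in con.execute("SELECT * FROM response_rate ORDER BY city"):
    print(row)          # ('Heidelberg', 0.5) then ('Walldorf', 1.0)
```

One or two people can build and fully understand such a system; change management is a non-issue at this scale, which is exactly what distinguishes the bottom-left quadrant.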

Top-Left: Very Large Data Warehouses (VLDWs)

Here, a typical situation is a small number of business processes – each one supported via an operational RDBMS – with at least one of them producing huge amounts of data. Imagine the sales orders submitted via Amazon’s website: this article cites 426 items ordered per second on Cyber Monday in 2013. The model complexity is comparatively low as only a few business processes, and thus tables (that describe those processes), are involved. However, the major challenges originate in the sheer volume of data produced by at least one of those processes. Consequently, topics such as DB partitioning, indexing, other tuning, scale-out and parallel processing are dominant, while managing the data models or their lifecycles is fairly straightforward.
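The dominance of volume-side techniques can be sketched with a toy version of range partitioning (data and partitioning scheme are hypothetical): a date filter only touches the partitions that can contain matching rows, which is what keeps scans on huge tables manageable.

```python
from datetime import date

# Toy range partitioning of an orders table by month (data invented):
# a query filtering on order date only has to scan matching partitions.
partitions = {                 # partition key -> rows of (order_id, order_date)
    "2013-11": [(1, date(2013, 11, 30))],
    "2013-12": [(2, date(2013, 12, 2)), (3, date(2013, 12, 2))],
}

def scan(first: date, last: date):
    """Return the partitions touched and the qualifying rows."""
    lo, hi = first.strftime("%Y-%m"), last.strftime("%Y-%m")
    touched = [p for p in partitions if lo <= p <= hi]      # partition pruning
    rows = [r for p in touched for r in partitions[p] if first <= r[1] <= last]
    return touched, rows

touched, rows = scan(date(2013, 12, 1), date(2013, 12, 31))
print(touched)   # only the December partition is scanned
print(rows)
```

A real RDBMS does this pruning inside the optimizer, of course; the point is that the work is on the physical-design side, not in the (simple) data model.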

Bottom-Right: Enterprise Data Warehouses (EDWs)

When we talk about enterprises, we look at a whole bunch of underlying business processes: financial, HR, CRM, supply chain, orders, deliveries, billing etc. Each of these processes is typically supported by some operational system which has a related DB in which it stores the data describing the ongoing activities within the respective process. There are natural dependencies and relationships between those processes – e.g. there has to be an order before something is delivered or billed – so it makes sense for business analysts to explore and analyse those business processes not only in isolation but also to look at those dependencies and overlaps. Everyone understands that orders might be hampered if the supply chain is not running well. In order to underline this with facts, the data from the supply chain and the order systems need to be related and combined to see the mutual impacts.

Data warehouses that cover a large set of business processes within an enterprise are therefore called enterprise data warehouses (EDWs). Their characteristic is the large set of data sources (reflecting the business processes) which, in turn, translates into a large number of (relational) tables. A lot of work is required to cleanse and harmonise data in those tables. In addition, the dependencies between the business processes and its underlying data are reflected in the semantic modeling on top of those tables. Overall, a lot of knowledge and IP goes into building up an EDW. This makes it sometimes expensive but, also, extremely valuable.

An EDW does not remain static. It gets changed and adjusted, new sources get added, some models get refined. Changes in the day-to-day business – e.g. changes in a company’s org structure – translate into changes in the EDW. This, by the way, applies to the other DWs mentioned above, too. However, the lifecycle is more prominent with EDWs than in the other cases. In other words: here, the challenges of the model complexity dimension dominate the life of an EDW.

Top-Right: Big Data Warehouses (BDWs)

Finally, there is the top-right quadrant, which becomes relevant with the advent of big data. Please be aware that “big data” not only refers to data volumes but also to incorporating types of data that have not been used that much so far. Examples are

  • videos + images,
  • free text from email or social networks,
  • complex log and sensor data.

This requires additional technologies, which currently surge in the wider environment of Hadoop, Spark and the like. Those infrastructures are used to complement traditional DWs to form BDWs, aka modern data warehouses, aka big data hubs (BDHs). Basically, those BDWs see challenges from both dimensions, data volume and modeling complexity. The latter is augmented by the fact that models might span various processing and data layers, e.g. Hadoop + RDBMS.

How To Use The DW Quadrant?

Now, how can the DW quadrant help? I have introduced it to various customers and analysts and it made them think. They always start mapping their respective problems or perspectives to the space outlined by the quadrant. It is useful to explain and express a situation and potential plans of how to evolve a system. Here are two examples:

SAP addresses those two dimensions – or the forces that push along them – via various products: SAP HANA and VORA for the data volume and performance challenges, while BW/4HANA and tooling for BDH help along the complexity dimension. Obviously, the combination of those products is then well suited to address the big data warehouse cases.

An additional aspect is that no system is static but evolves over time. In terms of the DW quadrant, this means that you might start bottom-left as a data mart and then grow into one or the other or both dimensions. These dynamics can force you to change tooling and technologies. E.g. you might start as a data mart using an open source RDBMS (MySQL et al.) and Emacs (for editing SQL). Over time, data volumes grow – which might require a switch to a more scalable and advanced commercial RDBMS product – and/or sources and models are added, which requires a development environment for models with a repository, SQL-generating graphical editors etc. PowerDesigner or BW/4HANA are examples of the latter.

This blog can also be found on Linkedin. You can follow me on Twitter via @tfxz.

Quality of Sensor Data: A Study Of Webcams


Fig 1: Noisy GPS data: allegedly running across a lake.

For a while, I’ve been wondering what the data quality of sensor data is. Naively – and many conversations that I had on this went along that route – it can be assumed that sensors always send correct data unless they fail completely. A first counter-example that many of us can relate to is GPS, e.g. integrated into a smartphone. See the figure to the right which visualises part of a running route and shows me allegedly running across a lake.

Now, sensor does not equal sensor, i.e. it is not appropriate to generalise about “sensors”. The quality of measurements and data varies a lot with the actual measure (e.g. temperature), the environment, the connectivity of the sensor, the assumed precision and many more factors.

In this blog, I analyse a fairly simple, yet real-world setup, namely that of 3 webcams that take images every 30 minutes and send them via the FTP protocol to an FTP server. The setup is documented in the following figure that you can read from right to left in the following way:

  1. There are 3 webcams, each connected to a router via WLAN.
  2. The router is linked to an IP provider via a long-range WIFI connection based on microwave technology.
  3. Then there is a standard link via the internet from IP provider to IP provider.
  4. A router connects to the second IP provider.
  5. The FTP server is connected to that router.

Fig 2: The connection between the webcams on the right and the FTP server on the left.

So, once an image is captured, it travels from 1. to 5. I have been running this setup for a number of years now. During that time, I’ve incorporated a number of reliability measures like rebooting the cameras and the (right-hand) router once per day. From experience, steps 1. and 2. are the most vulnerable in this setup: both long-range WIFI and WLAN are liable to a number of failure modes. In my specific setup, there is no physical obstacle or frequency-polluted environment. However, weather conditions, like widely varying humidity and temperature, are the most likely source of distortion.
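The camera-side logic of steps 1. to 5. can be sketched as follows; host, credentials and timings are placeholders, not my actual configuration:

```python
import ftplib
import pathlib
import time

# Sketch: upload one captured image via FTP and retry a few times, since
# the WLAN and long-range WIFI hops (steps 1 and 2) are the most
# failure-prone. Host, credentials and retry policy are hypothetical.
def upload(path: str, host: str = "ftp.example.org", user: str = "cam",
           password: str = "secret", retries: int = 3, delay: float = 30.0) -> bool:
    for attempt in range(1, retries + 1):
        try:
            with ftplib.FTP(host, user, password, timeout=60) as ftp, \
                 open(path, "rb") as image:
                ftp.storbinary(f"STOR {pathlib.Path(path).name}", image)
            return True            # transmitted (though possibly distorted!)
        except ftplib.all_errors as err:
            print(f"attempt {attempt}/{retries} failed: {err}")
            time.sleep(delay)
    return False                   # gave up: leaves no trace on the server
```

Note that a successful `storbinary` only means the bytes arrived at the server; as the experiment below shows, “transmitted” and “transmitted correctly” are not the same thing.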

So, what is the experiment and what are the results? I’ve been looking at the image data sent over the course of approx. 3 months. In total, around 8000 images were transmitted. I counted the successful (fig 3) vs the unsuccessful (fig 4) transmissions. I did not track the images that completely failed to be transmitted, i.e. that did not reach the FTP server at all and therefore did not leave any trace. 5.3% of the images were distorted (as in fig 4), i.e. roughly every 19th image failed to be transmitted correctly. In addition, that rate was not constant (e.g. per week): there were times of heavy failures and times of no failures.
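The reported numbers can be reproduced with a few lines:

```python
# Reproducing the arithmetic of the experiment:
# ~8000 transmitted images, 5.3% of them distorted.
total = 8000
distorted = round(total * 0.053)    # ~424 distorted images
print(distorted)                    # 424
print(round(total / distorted))     # i.e. roughly every 19th image
```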


Fig 3: Successfully transmitted image.


Fig 4: Distorted image.

This is an initial and simple analysis, but one that matches real-world conditions and setups pretty well and is therefore not an artificial simulation. In the future, I might refine the analysis, e.g. by counting non-transmissions too or by correlating the quality with temperature, humidity or other potential influencing factors.

You can follow me on Twitter via @tfxz.

What To Take Away From The #HLF16

From 18-23 Sep 2016, I had the privilege to join approx. 200 young researchers and 21 laureates of the most prestigious awards in Mathematics and Computer Science at the 4th Heidelberg Laureate Forum (HLF). Similar to the Lindau Nobel Meetings, it intends to bring together and seed the exchange between the past and future generations of researchers.

I thought it worthwhile to write up the experience from someone with my background and perspective, which is the following: I am a member of the HLF foundation council and could therefore join this illustrious event. I hold a master’s degree from the Karlsruhe Institute of Technology (KIT) and a PhD from Edinburgh University, both in CS. For almost 18 years, I have been working for SAP in development, most of the time confronted with high-performance problems of CS in general and DBMSs in particular. So, generation- and career-wise, I’m somewhere in between the two focus groups of the HLF.

The Setup

… was excellent: there were lectures by the laureates during the morning and breakout events (panels, workshops, poster sessions and the like) in the afternoon. In between, there were ample breaks for people to mingle, talk to each other, exchange ideas, make contacts and/or friends – basically everything to nourish creativity, curiosity and inspiration. Many years ago, a friend commented: “At conferences, breaks are as important as the scheduled events; presentations are there only to seed topics for lively discussions during the breaks.” I think the HLF is an excellent implementation of that notion.

Meeting People

Over the course of the week, I talked to many attendees, both young researchers and laureates. Topics circled – as in the presentations – around the past and the future: lessons learned on past problems in order to tackle the coming ones. Sir Andrew Wiles did a great job in his lecture on how people tried to prove Fermat’s Last Theorem until, after more than 300 years, a new approach was triggered which he finally brought to a successful end. Similarly, Barbara Liskov chose to talk about the lessons learned in the early 1970s, when early versions of modularity and data abstraction made it into modern programming languages, which finally led to object orientation.

On the Wednesday morning of the “HLF week”, a group of attendees visited SAP. The young researchers learned about opportunities at SAP while the laureates were exposed to a demo of SAP’s digital boardroom. On this occasion, too, good questions and discussions came up.

The Presentations

The lectures given by the laureates were mostly excellent. I’ve decided to pick 3 which I personally enjoyed most, 1 from each discipline:

  • Mathematics: Sir Michael Atiyah is 87 years old. He took advantage of the fact that as a laureate you don’t have to prove anything to anyone and, thus, can take a few steps back and look at the research in your area from a distance. His lecture on “The Soluble and the Insoluble” discusses this as “both a philosophical question, and a practical one, which depends on what one is trying to achieve and the means, time and money available. The explosion in computer technology keeps changing the goal posts.”

  • Computer Science: Raj Reddy picked “Too Much Information and Too Little Time” as his topic. Amongst other things, he pointed to cognitive science and argued that modern software needs to take human limitations (making errors, forgetting, impatience, “going for least effort” etc.) and strengths (tolerance of ambiguity, imprecision and errors; rich experience and knowledge; natural language) far more into account.

  • Physics: Brian Schmidt gave the Lindau Lecture on the “State of the Universe”. I am no expert in astrophysics but, still, there were a lot of fascinating facts and “aha effects” in it. It is amazing what science can do.

There is a lot more material from this and previous events. If you are interested in finding out more details, you might want to look at the HLF website, where you will also find more video recordings of the laureates’ lectures.

This has been cross-published on Linkedin. You can follow me on Twitter via @tfxz.

3 Turing Award And 1 Nobel Prize Winners Visiting SAP

On Wednesday 21 Sep, SAP hosted a group of visitors participating in the 4th Heidelberg Laureate Forum (HLF), a meeting of young researchers and winners of the most prestigious awards in Mathematics and Computer Science – the Abel Prize, the Fields Medal, the Nevanlinna Prize and the ACM Turing Award – all tantamount to a Nobel Prize in their respective disciplines. Amongst the visitors were


From left to right: K. Bliznak (SAP), Mrs O. Ioannidi, J. Sifakis, Sir T. Hoare, B. Schmidt, V. Cerf, T. Zurek (SAP)

This visit took around 2.5 hrs and comprised a tour through SAP’s inspiration pavilion and a demo of SAP’s digital boardroom. The 4 laureates showed huge interest in both, while still being critical here and there: at the big data wall, Vint Cerf noted the omission of some of the advances in the internet in the 1970s and 1980s. Brian Schmidt commented: “Vint, you are too biased!”. The digital boardroom demo triggered a good set of questions that went beyond the pure visualisation on the 3 screens and extended to questions on how to incorporate data sitting in non-SAP or legacy systems, on who composes the dashboards, and on how that might be applicable in their respective areas. They even speculated on how SAP might create upsell opportunities. It was a lively exchange of ideas and, overall, a bit different from the common visits by SAP customers.


Vint Cerf commenting on the Digital Boardroom

If you are interested in finding out more details on the HLF, you might want to look at the HLF website, where you will also find video recordings of the laureates’ lectures.

This blog has also been published on SciLogs. You can follow me on Twitter via @tfxz.

Measuring Query Performance Considered Harmful

Recently, I’ve been part of or have listened to a number of discussions around performance, benchmarks and the like. What surprises me is that performance is frequently considered one-sidedly, i.e. there is a sole focus on the absolute value: “query returns in 0.345 sec”. But what is the expense of achieving this result? This is why I decided to add this blog to the list of considered harmful articles that have some tradition in computer science.

To elaborate on my point, consider the following 3 scenarios behind the query mentioned above:

  • Scenario 1: The best performance is achieved if the query result has been pre-calculated and sits in some cache, waiting to be displayed on screen. Here, the only time spent is that for displaying / rendering the query result on the end user’s screen. However, this has to be done for all queries, potentially an outrageously large number. Tuning costs: potentially prohibitive.
  • Scenario 2: The performance is as given by the data design. No efforts are spent on improving the performance. Tuning costs: zero.
  • Scenario 3: A trade-off decision is made, and partial results that are frequently reused by queries are pre-calculated. Such partial results include all forms of indexes like B-trees, bitmaps, materialised views etc. This lies in between scenarios 1 and 2: it does not reach the extreme of scenario 1 but is more than doing nothing as in scenario 2. Scenario 3 is what happens in reality. Tuning costs: as much as you want to afford.

In scenario 1, the end user will be delighted as this is as good as it can get for him/her. Now, in order to create such a situation, someone (a DBA) needs to anticipate those queries. Good candidates are queries that are executed on a regular basis, e.g. every morning, showing the respective top-5 critical customers for sales executives. In that case, it is realistic and feasible to prepare for such a situation, and it is reasonable to pre-calculate a query result if the underlying query complexity requires it. However, if queries cannot be anticipated (like in data science or ad-hoc analyses) or if there are simply a lot of them (1000s of employees, each with a slightly different perspective – filter, projection – on the data), then pre-calculation becomes either very expensive or unrealistic. In contrast, scenario 2 represents the other extreme, namely that nothing is pre-calculated and the performance is “as is”. Scenario 3 reflects reality as it trades off query performance against tuning costs. Figure 1 positions the 3 scenarios along those 2 dimensions.
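Scenario 3 can be illustrated in miniature with a query cache (function name and the “expensive query”, simulated by a short sleep, are hypothetical): the first execution pays the full cost, repeated executions are served from the cache, and queries that are never repeated simply do not profit.

```python
import functools
import time

# Scenario 3 in miniature: cache only results that get reused;
# ad-hoc queries pay the full cost.
@functools.lru_cache(maxsize=128)
def top5_customers(region: str) -> str:
    time.sleep(0.01)              # stands in for an expensive aggregation
    return f"top-5 critical customers for {region}"

t0 = time.perf_counter(); top5_customers("EMEA")   # cold: full cost
t1 = time.perf_counter(); top5_customers("EMEA")   # warm: served from cache
t2 = time.perf_counter()
print(f"cold: {t1 - t0:.4f}s, warm: {t2 - t1:.6f}s")
print(top5_customers.cache_info())                 # 1 hit, 1 miss
```

The cache size (here 128 entries) is exactly the “as much as you want to afford” knob: a bigger cache moves you towards scenario 1, a size of zero is scenario 2.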


Figure 1: Trade-off between query performance and tuning costs for the 3 scenarios.

Benchmarks that measure only (query) performance bear the risk that results tend towards scenario 1, thereby potentially implying enormous tuning costs. This is why the most prominent DB benchmarks – the ones published by the Transaction Processing Performance Council (TPC) – have gradually incorporated pricing (i.e. costs) into the benchmark metrics, e.g.

Naturally, pricing is also a very flexible component and, in reality, subject to customer-supplier negotiations. However, the TPC has defined pricing as well as possible in order to address this issue. A different approach has been taken by SAP’s benchmark council, which removed price completely from the publications in the context of SAP’s application benchmarks – see section 4.2.8 in here.
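The effect of adding price to the metric can be shown with two entirely hypothetical systems, in the spirit of TPC’s price/performance figures (such as price per tpmC in TPC-C): the pure performance ranking inverts once cost enters the picture.

```python
# Two entirely hypothetical systems: the fastest one is not the one
# with the best price/performance.
systems = {
    "heavily tuned": {"tpm": 1_000_000, "price_usd": 5_000_000},
    "untuned":       {"tpm":   400_000, "price_usd":   500_000},
}
for name, s in systems.items():
    print(f"{name}: {s['price_usd'] / s['tpm']:.2f} USD per tpm")
# fastest system: "heavily tuned"; best price/performance: "untuned"
```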

As pictured in figure 1, performance (in general!) needs to be considered as a trade-off against (tuning, hardware and other) costs. Therefore, looking only at a pure performance measure (like query runtime, transaction throughput, INSERT rates etc.) does not necessarily represent what can be achieved in reality, as it might (!) go along with prohibitive costs or effort.

This blog is also available on Linkedin Pulse. You can follow me on Twitter via @tfxz.