For many years, it has been best practice to use well-structured data (typically originating in an operational RDBMS) with well-defined and well-known semantics (derived from the application feeding that operational RDBMS) to analyse business and other processes. These analyses convert data into intelligence, i.e. knowledge that allows us to make better, more informed decisions.
Knowledge is the final result: answers to certain questions. Those questions constitute the step prior to gaining insight. Given the right set of well-defined questions (or queries), it is typically straightforward to find the answers (or compute the query results). So business analysts have long been striving to find the “golden queries”, i.e. the right questions to ask. This has led to approaches like data exploration, outlier analysis, data mining, and machine learning. Fundamentally, these are expected to provide methods for finding good questions that lead to useful results (insights). Nowadays, finding good questions is considered a fine art, and data scientists are supposed to be at the forefront of that effort.
It is important to understand how knowledge and intelligence are acquired in order to understand what the technical architectures supporting that process need to look like. For many years, we have built OLTP, OLAP, and data warehouse systems to collect and analyse data. Those systems continue to be relevant. However, they need to be complemented by new infrastructures like Hadoop that can cater for data that has no clear structure, initially no obvious value or purpose, little semantics, and that frequently comes, above all, in huge volumes. While we have more or less learned to manage the data volumes, we still need to tackle the many unknowns in big data. Forrester therefore states:
Of Gartner’s “3Vs” of big data (volume, velocity, variety), the variety of data sources is seen by our clients as both the greatest challenge and the greatest opportunity.
In fact, this also explains why a file system like HDFS, combined with processing infrastructures like MapReduce or Spark, is such a good fit: it can handle that variety better than a traditional RDBMS.
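The property at work here is often called “schema-on-read”: raw records are stored as-is, and structure is only imposed when a question is asked, whereas an RDBMS demands one fixed schema up front. The following sketch illustrates the idea in plain Python with hypothetical JSON records of deliberately different shapes.

```python
# Schema-on-read in miniature: heterogeneous records are kept verbatim,
# and each query imposes only the structure *it* needs at read time.
import json

raw_records = [
    '{"user": "alice", "clicks": 12}',
    '{"user": "bob", "clicks": 7, "referrer": "newsletter"}',
    '{"sensor": "s-17", "temp_c": 21.4}',  # an entirely different shape
]

def query_clicks(lines):
    """Keep only the records that carry the fields this question needs."""
    rows = (json.loads(line) for line in lines)
    return [(r["user"], r["clicks"])
            for r in rows if "user" in r and "clicks" in r]

print(query_clicks(raw_records))  # → [('alice', 12), ('bob', 7)]
```

A relational schema would have forced a choice between these record shapes before loading; here the sensor record is simply skipped by this query and remains available, untouched, for a different question later.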
At SAP, we are currently working on ways to integrate Hadoop not only technically (e.g. accessing Hadoop from SQL or allowing Hadoop to access HANA) but also to provide and define ways for our customers to treat Hadoop (and “big data”) as an extension of their existing data management setups. This requires understanding how customers work with data: from identifying areas where they expect valuable questions, to finding those questions (queries), to then deriving the results.