How Real-Time BI Requirements are Evolving into the Mainstream

On December 2nd, 2014, I published a blog post announcing the preview of the first ever REST APIs for Power BI. These first APIs could only do four things. They could list all the datasets that a user had. They could create a new dataset from a JSON schema. They could add rows to a table in a dataset. Finally, they could delete all the rows in a dataset table. Four simple tasks, along with a whole bunch of security and API infrastructure, could enable so many different scenarios. The most common scenario these APIs were used for was real-time dashboarding. When new rows were pushed into a table using the API, any dashboard visuals which used that table would automatically update to reflect the changes. If updates were pushed rapidly, the dashboards would come alive, showing a constant stream of new results.
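As a rough illustration of those four operations, the sketch below builds (but does not send) the corresponding HTTP requests. The paths follow the Power BI REST API as later published under v1.0, which may differ from the 2014 preview. The `PageViews` table, its columns, the dataset id and the helper function names are all hypothetical, and a real call would also need an Azure AD bearer token in the `Authorization` header.

```python
import json

# Assumed base URL: the v1.0 Power BI REST endpoint (the 2014 preview may have differed).
BASE = "https://api.powerbi.com/v1.0/myorg"


def list_datasets():
    """Request that lists the datasets the signed-in user has."""
    return ("GET", f"{BASE}/datasets", None)


def create_dataset(name, table, columns):
    """Request that creates a dataset from a JSON schema.

    `columns` is a list of (column name, data type) pairs.
    """
    schema = {
        "name": name,
        "tables": [
            {
                "name": table,
                "columns": [{"name": c, "dataType": t} for c, t in columns],
            }
        ],
    }
    return ("POST", f"{BASE}/datasets", json.dumps(schema))


def add_rows(dataset_id, table, rows):
    """Request that pushes new rows into one table of a dataset."""
    url = f"{BASE}/datasets/{dataset_id}/tables/{table}/rows"
    return ("POST", url, json.dumps({"rows": rows}))


def delete_rows(dataset_id, table):
    """Request that deletes every row in one table of a dataset."""
    url = f"{BASE}/datasets/{dataset_id}/tables/{table}/rows"
    return ("DELETE", url, None)


# Hypothetical usage: define a table, then push a single event into it.
create_request = create_dataset(
    "WebTraffic", "PageViews", [("page", "String"), ("views", "Int64")]
)
push_request = add_rows("my-dataset-id", "PageViews", [{"page": "/home", "views": 3}])
```

Each helper returns a `(method, url, body)` tuple that could be handed to any HTTP client. Calling `add_rows` repeatedly as events arrive is what makes the dashboard visuals built on that table update.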

These APIs were first conceived while the rest of the Power BI service was being conceived. There were certain limitations, not just within the yet-to-be-built Power BI service but also within the platforms that facilitated working with real-time data. Azure SQL Data Warehouse, RedShift, BigQuery and Snowflake were still in their early days. Spark’s popularity was rapidly growing but was still not mainstream. These platforms and several others attempted to deal with the two main problems that come with real-time data: handling rapid ingestion of incoming data and querying large volumes of data. Real-time data on its own is not the same as big data. In many ways, it is the opposite. Each set of events is generally very small and easy to work with, query and store. Real-time data becomes big data very quickly as the constant stream of events adds up to unmanageable sizes. For this reason, streaming platforms like Storm and Azure Stream Analytics attempt to work with this data while it is still in flight and still small.

Working with the data while it is still small enables you to land the data in a shape that can be more easily consumed later for various scenarios like alerting, AI/ML, operational reporting, business reporting and others. The catch with this approach is that different shapes of data work better for different scenarios. The amount of high-quality detailed data needed for business reporting is often much larger than what is needed for alerting or operational reporting, and it would add cost and latency to those scenarios, making them infeasible. The solution to this problem is to output the data in multiple different shapes so that each scenario can be achieved efficiently. For operational reporting, this meant that you would output a dataset specific to each visual, and the different BI tools available at the time would simply render the visual with that data. For business reporting, you would take a different approach by transforming the data in flight before landing it in a data warehouse where more traditional BI tools could access it. The transformations didn’t end there. In order to keep the data volumes down and avoid crossing into big data territory, event data would be rolled up again and again in batches, coarsening the granularity of the data from minutes to hours to days and finally to months. These different approaches and vastly different BI tools meant that different skills were required to analyze the same data. Even worse, the two scenarios were very disconnected and lived in different tools, which made moving from one to the other very difficult.
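The repeated roll-up described above can be sketched in a few lines. This is only an illustration of the idea, assuming events have already been counted per minute; the `truncate` and `roll_up` helpers and the sample data are hypothetical and not taken from any particular streaming or warehouse product.

```python
from collections import defaultdict
from datetime import datetime


def truncate(ts, level):
    """Snap a timestamp down to the start of its hour, day or month."""
    if level == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if level == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if level == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown level: {level}")


def roll_up(rows, level):
    """Sum (timestamp, count) pairs into coarser time buckets."""
    totals = defaultdict(int)
    for ts, count in rows:
        totals[truncate(ts, level)] += count
    return sorted(totals.items())


# Hypothetical per-minute event counts.
minute_rows = [
    (datetime(2019, 11, 4, 9, 5), 2),
    (datetime(2019, 11, 4, 9, 40), 3),
    (datetime(2019, 11, 4, 10, 1), 1),
    (datetime(2019, 11, 5, 8, 0), 4),
]

# Each batch consumes the output of the previous one, so the detail is lost along the way.
hourly = roll_up(minute_rows, "hour")
daily = roll_up(hourly, "day")
monthly = roll_up(daily, "month")
```

Chaining works here because sums can be re-aggregated from coarser sums. Measures such as averages or distinct counts cannot be rebuilt this way, which is one reason the rolled-up shape limits what business reporting can ask later.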

When designing the APIs for Power BI, we were not immune to these same limitations. However, we wanted to do better by achieving both reporting scenarios with the same BI tool. We also wanted to make sure that it was the same tool and not just two completely different tools under one product name. This is why, for real-time, you would still create a Power BI dataset with the API, but you could then use any Power BI feature on top of this dataset just as if that dataset had been built in the Power BI Desktop for a non-real-time scenario. After the first release of Power BI, the rest of the Power BI service and especially the Power BI Desktop began to rapidly evolve. While this caused some differences in the capabilities of a dataset authored in the REST API vs. the Desktop, the underlying technology remained in sync. Having everything in one tool did not solve all the problems that I mentioned above, but it did give users some new ways to tackle them. It often still meant outputting a different shape of data for operational reporting and for more detailed business reporting, as illustrated in this once-famous example of how-old.net. The how-old approach can be used to give a user viewing a dashboard the illusion that all the data is together in one place.

This was five years ago and, as usual, things have not stayed the same.

Spark’s popularity continued to rise until it began to stall out in 2017. During the same time, massively parallelized data warehousing was on the rise. These new platforms meant that big data no longer had to mean slow data, and they made working with large volumes of data much more mainstream.

It was not just technology that was changing. When I started building BI solutions, our data was updated anywhere from once to a few times a day. No matter how often that data was updated, most of our users only looked at that data once a day or once a week. For a long time, this was common with many of the customers that I encountered. Many of these businesses were born in the physical world, where data and processes just moved more slowly. Nowadays, more businesses are being born purely online, where new data is available instantly. While traditional operational real-time reporting scenarios still exist within these businesses, even the more mainstream BI scenarios are being pulled into the realm of real-time in the form of low latency reporting.

In low latency reporting, latency refers to the time taken from when data is generated to when it shows up in a report. Unlike operational reporting or alerting, data does not need to be available sub-second. A business user is not (usually) staring at a spreadsheet all day watching as new data continuously flows in. However, they will be checking these reports multiple times throughout the day. These are not high-level reports specially designed for real-time data, like the ones we see for operational reporting. Rather, they are reports which leverage the full fidelity of a dataset. Picture account managers being able to see the most up-to-date activities of their customers before reaching out to them. Online retailers can gain insights in real-time about what is being sold on their site and immediately take action. The reports that we used to look at once a week will now be consulted throughout the day.

Reporting over the entirety of the data with low latency will blur the lines between operational reporting, normally done by a handful of users, and business reporting, which can be consumed by every employee in an organization. The business needs are there, and the modern data warehouse platforms are starting to make this achievable. Power BI has always been able to work by directly querying data at the source. It has added a new feature that lets you configure a report page to refresh automatically in order to see the latest results. It has also released a feature which enables you to speed up direct queries by pre-caching some data at an aggregate level. Microsoft just announced Azure Synapse Analytics, which adds the best of Spark to Azure SQL Data Warehouse (Microsoft’s massively parallel data warehouse platform) and integrates deeply with Power BI, with a goal of reducing the number of steps and the time required to go from raw data to BI reports.
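The idea of pre-caching data at an aggregate level can be shown with a toy example. This is a conceptual sketch only and makes no claim about how Power BI implements the feature: a query is answered from a small pre-built aggregate when the aggregate contains the requested column, and otherwise falls through to the detail rows (standing in for a direct query against the source). All names and data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows, standing in for the table at the source.
DETAIL = [
    {"customer": "A", "product": "x", "region": "east", "amount": 10},
    {"customer": "B", "product": "x", "region": "west", "amount": 5},
    {"customer": "A", "product": "y", "region": "east", "amount": 7},
]


def build_aggregate(detail, group_cols, measure):
    """Pre-compute the summed measure for every combination of group_cols."""
    totals = defaultdict(int)
    for row in detail:
        totals[tuple(row[c] for c in group_cols)] += row[measure]
    return dict(totals)


def total_by(column, agg, agg_cols, detail, measure):
    """Sum the measure per value of `column`, preferring the cached aggregate."""
    result = defaultdict(int)
    if column in agg_cols:
        # Cache hit: fold the small aggregate down to the requested column.
        position = agg_cols.index(column)
        for key, value in agg.items():
            result[key[position]] += value
        return "aggregate", dict(result)
    # Cache miss: scan the detail rows, as a direct query would.
    for row in detail:
        result[row[column]] += row[measure]
    return "detail", dict(result)


AGG_COLS = ("product", "region")
AGG = build_aggregate(DETAIL, AGG_COLS, "amount")

by_product = total_by("product", AGG, AGG_COLS, DETAIL, "amount")
by_customer = total_by("customer", AGG, AGG_COLS, DETAIL, "amount")
```

Here the per-product question is served from the cache, while the per-customer question has to go to the detail, because `customer` was not one of the columns the aggregate was built on. The trade-off is that the aggregate must be kept fresh, which matters as latency expectations tighten.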

Nothing changes overnight, but going forward, the types of real-time reports that dance around as new data is updated every second will remain important to operational reporting and to reports displayed on big monitors in hallways. Low latency reporting will become the more common scenario as more businesses demand that all users are able to gain insights over the entire set of data in near real-time. These reasonable demands will not only stress the limits of modern data platforms but will also force changes to long-standing BI development practices. Technology and people are being forced to evolve to a low latency world.