Building Trust with Data Status Pages

by Nick Freund - January 31, 2024

No matter what measures you take to harden your data stack against issues, the fact is that things sometimes break. A column mapping changes, a Python script no longer returns the right results, or a dashboard suddenly does not properly query data. 

While the best data teams are great at monitoring and correcting these issues, no team is immune. Even with data observability tools and other measures in place, issues can still slip through the cracks and end up being exposed to end users.

The end users you support might understand this, especially if they have some data background themselves. But more likely, as is too often the case, they will flood your team with inbound questions like, “Does this data look right?” or “Should I trust this?”

At worst, data users might just go ahead and use a broken asset as-is, and come to the wrong decision because of it.

You have likely already invested in tooling to monitor and alert your team of data quality incidents. But your data observability efforts should not end there. Closing the data quality loop with end users is critically important – it provides people with necessary clarity about the status of your data, and helps you avoid the worst case scenario of end users acting on bad data.

Status pages act as an important buffer against this nightmare scenario, where your team attempts to fix something while simultaneously being barraged with messages from end users. By leveraging your existing investments in data quality, you can use data status pages to build a culture of trust around your data.

Bad Data, Bad Outcomes

If part of your ETL pipeline breaks and users suddenly see only partial results, an end user’s decision to act on those results will have ripple effects throughout the business and team:

For the business

  • You might lose revenue as you chase the wrong prospects or marketing strategies
  • You may incur a huge opportunity cost as the correct approach is abandoned 

For your team

  • There may be personal consequences for the decision maker, as the mistake could lead to meaningful business problems
  • Your team may lose the trust of those who rely on it

In many ways, the loss of trust in your data and team does the most long-term damage to the business, as it erodes efforts to drive decision making with data. If business users no longer trust the data they need to support their decisions, they may abandon it entirely, or adopt it in only the most haphazard or perfunctory way. And once trust is gone, it is nearly impossible to get back.

Many data teams have identified this and are proactively investing in incident prevention. To date, this has generally meant a focus on the first step: implementing effective monitoring and change management to make sure things don’t break in the first place.

But what about the last mile of communicating data downtime? While the recent revolution in analytics engineering has catalyzed data quality initiatives, there have historically been few innovations in how to communicate upstream issues to data consumers, or in how to prevent those users from relying on bad data during decision making.

How Data Teams Have Communicated Outages

Today, data teams have a few straightforward methods of letting their users know about issues, or fielding questions about “weird looking data.” The most common are Slack posts in data-specific channels, or email threads with impacted users.

However, there are a few issues with leveraging these traditional communication methods for data downtime incidents.

First, it can be difficult to triangulate where a problem is actually occurring when an issue crops up in a downstream asset. Data observability tools do help here, but the feedback loop between a user reporting an issue and finding the root cause can be long.

Second, straightforward communication techniques don’t live in context with the affected assets. It is very easy for end users to miss or ignore the message when it does not display at the exact moment when they are looking at the data.

Third, there is no system of record available for business users to see historical incidents. Tools like Slack and email do not provide a comprehensive record of previous issues. Creating transparency around incidents that have occurred in the past builds trust by demonstrating that users will be notified when outages occur, as well as by giving insight into the proportion of uptime for your data assets.
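The “proportion of uptime” mentioned above falls out naturally once incidents are logged with start and end times. As a minimal sketch – assuming a hypothetical incident log of `(start, end)` interval pairs, not any particular tool’s schema:

```python
from datetime import datetime

def uptime_percentage(incidents, window_start, window_end):
    """Percentage of the window during which an asset was healthy, given
    incident (start, end) intervals. Overlaps are not double-counted."""
    window = (window_end - window_start).total_seconds()
    # Clip incidents to the window and sort by start time.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in incidents
        if end > window_start and start < window_end
    )
    downtime, cursor = 0.0, window_start
    for start, end in clipped:
        start = max(start, cursor)  # skip any portion already counted
        if end > start:
            downtime += (end - start).total_seconds()
            cursor = end
    return 100.0 * (1 - downtime / window)

# A 30-day window with 36 hours of recorded downtime (36 of 720 hours).
jan = lambda d, h=0: datetime(2024, 1, d, h)
incidents = [(jan(5), jan(6)), (jan(10), jan(10, 12))]
print(uptime_percentage(incidents, jan(1), jan(31)))  # 95.0
```

Even a back-of-the-envelope number like this gives business users a concrete sense of how reliable an asset has been, rather than leaving them to guess from scattered Slack threads.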

How Status Pages Drive Trust

In the worst case scenario we discussed a couple sections ago, one of the most likely outcomes is that the end user who made a bad decision will no longer trust the data or the data team. We are all human, and when we are burned – especially in a way that may impact our careers and standing at work – it is understandable to want to avoid that situation ever happening again. Data status pages can help avoid this outcome.

Status pages provide a window into data quality and freshness, are available to everyone, and are updated automatically. Crucially, since they appear next to a data asset, there will rarely be a situation where an identified issue will not be successfully communicated to the end user. They will know to approach the asset with caution, or ask the data team for help in resolving the issue.

There is another reason status pages lead to a culture of trust in your data. Because status pages acknowledge, rather than hide from, the reality of data quality issues, end users will know the team supporting them cares about quality and maintaining the data inputs used to make decisions. Users generally have a higher level of trust in assets with a status page that shows the all-clear, and which clearly communicates any historical downtime trends.

To take an analogy from software, think about the support interactions you have had where a vendor will not acknowledge there is a bug, or sports a status page with 100.00% uptime for every feature. Do you trust that platform to acknowledge, correct, and keep you informed of key issues that could impact your business? Probably not. The same goes for your organization’s data.

Software support teams also have a number of ways to push this information out to you. Today, it is very common for statuses to be posted directly within the app, on a dedicated website, and through services like Intercom. The breadth and context of this communication is something data teams would benefit from emulating.

Deeper Integration → Better Communicated Data Quality 

Even the worst, most manual version of a status page is still better than nothing when it comes to creating trust with stakeholders. 

As a starting point, a scrappy team might simply mark up the asset’s title or description to warn users of current data issues. Obviously, this type of manual approach doesn’t scale when data teams support hundreds or even thousands of data assets and users. Most other manual solutions – from posting in Slack, to maintaining docs in your intranet – suffer from the same limitations.

The best version of a data status page is one integrated with the tools you already use to monitor the health of your data, such as dbt Cloud or Monte Carlo. Most mature data teams are already investing in these data observability solutions, so why not make this work for end users as well?

At Workstream.io, we have seen firsthand how effective this can be. Our dbt Cloud integration takes any asset that is configured as an exposure in dbt, and traces the sources and models that are necessary for that asset to function. It then builds a status page for each relevant asset automatically, alerting end users in real time if there are any issues with quality or freshness. The status page also shows the precise model or source that is failing checks, providing instant insight into how data teams can fix the problem, and context about the scope of the issue for end users who are trying to make decisions.
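For teams that want to experiment with this lineage-tracing idea themselves, dbt’s manifest.json artifact exposes a `parent_map` of each node’s upstream dependencies, and run_results.json records per-node statuses. The sketch below walks that map from an exposure to surface failing upstream nodes; the example dictionaries are illustrative, not Workstream’s actual implementation:

```python
from collections import deque

def upstream_status(exposure_id, parent_map, statuses):
    """Walk a dbt-style parent_map (unique_id -> list of parent unique_ids)
    starting from an exposure, collecting every upstream model or source
    whose last run status is not passing."""
    failing, seen = [], set()
    queue = deque(parent_map.get(exposure_id, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        # Nodes absent from run_results (e.g. sources without checks) are
        # treated as healthy here; a real integration might flag them.
        if statuses.get(node, "success") not in ("success", "pass"):
            failing.append(node)
        queue.extend(parent_map.get(node, []))
    return (not failing, failing)

# Illustrative lineage: a dashboard exposure fed by one model chain.
parent_map = {
    "exposure.proj.revenue_dashboard": ["model.proj.fct_revenue"],
    "model.proj.fct_revenue": ["model.proj.stg_orders", "source.proj.app.orders"],
}
statuses = {"model.proj.fct_revenue": "success", "model.proj.stg_orders": "error"}

ok, failing = upstream_status("exposure.proj.revenue_dashboard", parent_map, statuses)
# ok -> False; failing -> ["model.proj.stg_orders"]
```

The returned list of failing nodes is exactly what a status page needs: an all-clear flag for end users, plus the specific upstream culprit for the data team.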

Instead of a manual system, integrated status pages automate all of the heavy lifting required to notify and communicate with end users about data downtime. And by integrating status pages not just with your data observability tools, but also with the actual BI tools where data is being consumed, end users can also be updated right in context with the data, rather than having to hunt for an explanation elsewhere.

Conclusion

The impediments to adopting status pages may be more of a mindset shift than a technological hurdle for data teams: after all, it can seem scary to put data issues out in the open in such a transparent way. 

However, this level of transparency prevents the worst case outcomes we have discussed, minimizes busy work associated with data downtime, and promotes a culture of trust and accountability. Trust and accountability form a two-way street between data teams and the users they support, and the benefits only compound over time.
