Why You Should Automate Data Incident Management

Nick Freund
February 27, 2024

How to make on-call data support suck less

Nothing sucks more than being the unlucky person on the data team stuck in the on-call rotation for the week. 

If you’ve ever found yourself in this position, you might feel a bit like Katniss Everdeen after the reaping. Or perhaps Tessie Hutchinson, the unlucky loser of The Lottery, who is ceremoniously stoned to death by her family and neighbors to ensure a good harvest. 

Tragically, what data teams experience around on-call is not a dystopian fantasy, but rather the reality of success. It is the mature teams, those who have been wildly successful in making data available, and in making everyone care about it, that are forced to implement an on-call support model. 

Oddly, these are also the same teams that have most readily embraced the modern data stack, including data observability and transformation solutions to ensure reliable, trustworthy data. 

These teams, pulled into the dregs of on-call servitude, must find more efficient ways to respond to alerts when data issues are flagged. They need to automate the manual labor of diagnosing root causes and business impact, coordinating incident resolution, and scheduling follow-up tasks. And finally, these teams need a new approach to reduce the number of persistent inbound questions from stakeholders, such as “Does this dashboard look right?”

It’s the Process, stupid

The historical processes for on-call data triage are particularly tragic because they epitomize the annoying busy work that prevents data teams from focusing on the “important stuff.”

"Something to be careful about on a data team is notification fatigue. It’s really important to define a triage process for failures. We had a Slack channel the team would monitor, our data pipelines would output a message if something failed, and then we defined what you would do if things failed. Red means critical, yellow is in process, and green means done. There are platforms you can do that in… maybe take a look at Workstream."

– Brent Brewington, Data Engineer at Aimpoint Digital, formerly Home Depot, on Season 2, Episode 13 of Data Knowledge Pioneers
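
The red, yellow, and green convention Brent describes can be written down as a small state machine. This is a minimal sketch, not anything from his team's setup: the TriageStatus enum, the ALLOWED transition table, and the advance function are all hypothetical names, and the rule that a green incident cannot be reopened is our assumption.

```python
from enum import Enum


class TriageStatus(Enum):
    RED = "critical"
    YELLOW = "in process"
    GREEN = "done"


# Assumed transition rules: work must be picked up (yellow) before it can be
# closed (green), and an in-process failure can be bumped back to red.
ALLOWED = {
    TriageStatus.RED: {TriageStatus.YELLOW},
    TriageStatus.YELLOW: {TriageStatus.GREEN, TriageStatus.RED},
    TriageStatus.GREEN: set(),
}


def advance(current, new):
    """Return the new status, or raise if the move skips a triage step."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {new.name}")
    return new


status = advance(TriageStatus.RED, TriageStatus.YELLOW)
```

Writing the rules down this way makes "what you would do if things failed" explicit: nobody can mark a failure as done without someone first owning it.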

If you Google the “Hierarchy of Data Science Needs,” you will find dozens of articles and subreddits about how teams can evolve from the rote work of collecting and cleaning data, to strategic work like analytics and building AI models. They normally include a pyramid or hierarchy that looks similar to this:

The Analytics Hierarchy of Needs from Towards Data Science 

This article rightly points out that the steps of the pyramid are sequential and dependent: “You cannot jump to [the next] stage before being accomplished in the underlying stage.” The use of the pyramid also rightly implies that laying these foundational layers takes time and effort. 

This is all true, but at its core, this is a technological argument. It is an argument for how data teams implement technologies to move up the pyramid to higher-value, more interesting, and more impactful work. Our issue is that it oversimplifies, and divorces technological progress from the reality of data work in an actual organization. 

The reality is that data teams face countervailing forces that pull their focus and time away from moving up the pyramid.

These forces include supporting core data infrastructure and handling flagged incidents that require resolution and communication. The best teams have adopted tools like dbt to transform and test data, or solutions like Monte Carlo to observe it and ensure its quality. These technologies have become foundational to ensuring that data is properly collected, cleaned, and defined, and they enable broader data availability in organizations. 

Responding to and solving data incidents, diagnosing data problems, and communicating to the business consumes an extraordinary amount of time. Getting to the top of the pyramid also requires automating these processes that otherwise distract a data team from advancing projects up the pyramid.

Data incident management is a process

What is data incident management? In short, it is everything your team does after a data incident is identified by your data observability solution. 

For the modern data team, this typically starts with some type of alerting to the on-call person. On-call teams are most often sent Slack alerts from systems like Monte Carlo, or dbt, to let them know that something broke during their most recent production run. 
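
As a rough illustration of that first step, here is a minimal sketch of routing an alert to whoever is on call this week. The rotation list, the #data-alerts channel name, and the on_call_for and build_alert helpers are hypothetical; the sketch only builds the message and leaves delivery to whatever chat integration you already use.

```python
from datetime import date

# Hypothetical weekly rotation; the names are placeholders.
ROTATION = ["ana", "ben", "chloe"]


def on_call_for(day, rotation=ROTATION):
    """Pick the on-call person from the ISO week number of `day`.

    Note: ISO week numbers reset each year, so a real schedule would need
    to handle the year boundary more carefully.
    """
    week = day.isocalendar()[1]
    return rotation[week % len(rotation)]


def build_alert(source, asset, message, day):
    """Format a chat message addressed to whoever is on call that day."""
    owner = on_call_for(day)
    return {
        "channel": "#data-alerts",  # assumed channel name
        "text": f"@{owner} [{source}] {asset} failed: {message}",
    }


alert = build_alert("dbt", "model.orders", "not_null test on order_id",
                    date(2024, 2, 27))
```

The point of the sketch is that the alert names a person, not just a channel, so nobody has to guess whose turn it is.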

From there, teams need to understand and control the blast radius of an issue. While some tools are great at providing lineage from test to model to dashboard, understanding exactly what broke what can still become a fishing expedition. And data lineage only tells you half of the story. You also need to understand whether the issue is a critical P0 or a lower priority, based on its impact on the business, your data users, and the decisions they make. 
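
To make the two halves of that diagnosis concrete, here is a minimal sketch that walks a lineage graph to find everything downstream of a failure, then assigns a priority from business criticality. The LINEAGE graph, the CRITICAL_ASSETS set, the asset names, and the P0/P2 rule are all illustrative assumptions, not the output of any particular tool.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE = {
    "test.orders_not_null": ["model.orders"],
    "model.orders": ["model.revenue", "dashboard.ops_daily"],
    "model.revenue": ["dashboard.exec_kpis"],
}

# Hypothetical business-criticality tags maintained by the data team.
CRITICAL_ASSETS = {"dashboard.exec_kpis"}


def blast_radius(failed_asset, lineage):
    """Return every asset downstream of failed_asset, in breadth-first order."""
    seen = {failed_asset}
    queue = deque([failed_asset])
    affected = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


def priority(affected, critical_assets):
    """Assumed rule: anything touching a critical asset is a P0, else P2."""
    return "P0" if any(a in critical_assets for a in affected) else "P2"


affected = blast_radius("test.orders_not_null", LINEAGE)
level = priority(affected, CRITICAL_ASSETS)
```

The traversal is the lineage half of the story. The criticality set is the other half, and it is the part that has to come from people who know how the business uses each dashboard.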

From diagnosis, team members then need to loop others into that context, and coordinate follow-up activities. This includes creating, scheduling, and assigning ownership of development tasks. It can require escalation, answering end-user questions, and proactive communication about the outage. And it should conclude by writing a retrospective once a particularly troublesome incident is resolved, or one’s week on call has ended.
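
The bookkeeping in that step is the kind of thing that is easy to script. Below is a minimal sketch of an incident record that collects follow-up tasks and renders a plain-text retrospective. The Incident and FollowUpTask classes, their fields, and the retrospective layout are hypothetical, and a real setup would push the tasks into your ticketing system rather than keep them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class FollowUpTask:
    title: str
    owner: str
    done: bool = False


@dataclass
class Incident:
    title: str
    priority: str
    root_cause: str = "unknown"
    tasks: list = field(default_factory=list)

    def add_task(self, title, owner):
        """Create a follow-up task with an explicit owner and attach it."""
        task = FollowUpTask(title, owner)
        self.tasks.append(task)
        return task


def retrospective(incident):
    """Render a short plain-text retrospective from the incident record."""
    lines = [
        f"Retrospective: {incident.title} ({incident.priority})",
        f"Root cause: {incident.root_cause}",
        "Follow-ups:",
    ]
    for task in incident.tasks:
        mark = "x" if task.done else " "
        lines.append(f"- [{mark}] {task.title} (owner: {task.owner})")
    return "\n".join(lines)


incident = Incident("Null order IDs in model.orders", "P0",
                    root_cause="upstream schema change")
incident.add_task("Add schema contract test", "ana").done = True
incident.add_task("Backfill affected rows", "ben")
report = retrospective(incident)
```

Because every task is created with an owner, the retrospective doubles as a checklist of what is still open when the on-call week ends.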

This, or something like it, is the process of being on call for data incident management. As data cultures mature, and more data is created and adopted, this process will require your team to dedicate more time and people power to it. 

Why data incident management is different from software incident management

Data teams have had a lot of success by borrowing software engineering practices and tools. 

While this has been a helpful catalyst for progress in data engineering and analytics, we at Workstream believe strongly that there are limitations to how far this will take teams. We have previously spoken ad nauseam about how agile is the worst form of government for data teams, except for all the others that we have tried. 

The limitations of borrowing tools and processes from software engineering are two-fold: 

To start, the core tools used to engineer data and software are divergent. For example, the leading observability tool for production software is Datadog; for data observability, it is Monte Carlo. Another important example of divergent tools is business intelligence platforms for consuming data, vs applications for using software. 

Secondly, data teams are smaller than software teams, and their customer base is largely internal. The same small group of data people is required to simultaneously a) triage data downtime, b) handle support requests, and c) progress their projects up the Hierarchy of Needs. This requires data people to wear product, engineering, and support hats all at once, and is precisely why on-call becomes a necessity. Data teams do not have the luxury of handing tier-1 triage off to a support function.

This drives unique dynamics for data teams that solutions like PagerDuty or OpsGenie are not designed for. A divergent technology stack. Internal customers. Faster, more iterative feedback loops. Multidimensional inputs and outputs. Far less modular, and more interdependent, workflows. 

How to automate your data incident workflow

We have found that when teams automate their data incident workflow, they experience a 40% reduction in on-call time and a noticeable improvement in stakeholder relationships. Automation makes on-call suck less, helps you control the blast radius of data outages, and rebuilds trust within your organization. 

It starts with a dedicated on-call system built for data teams and data incident management. That system plugs into your core tools, and automatically routes issues and alerts to the person on call via the appropriate channels. It helps you streamline collaboration by sharing critical context and managing handoffs within your team. It automates the creation of follow-on tasks in Jira, as well as retrospective documentation on issue resolution. 

“You must centralize some of this communication…if a data pipeline fails, have a space you can go to where you can have that conversation. Save that information so that other people on the team can learn from that later, they can search for it and everything. Don’t do a DM to somebody who manages it. Don’t create a silo in the process of resolving errors because you want to build that information in a place where your team can learn from it later.”

– Brent Brewington, Data Engineer at Aimpoint Digital, formerly Home Depot

Data incident automation allows you to contain the blast radius of an incident. It instantly tells your on-call team what and who is affected by data downtime. You can immediately identify which downstream dashboards and data assets might be affected, and understand who viewed broken data and what they saw. You can streamline all of the communication and processes for a single incident in one place, even when a single incident impacts many dashboards or is linked to many failing tests. 
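
Working out who viewed broken data comes down to joining a view log against the incident window. Here is a minimal sketch under stated assumptions: the view log is a list of (user, dashboard, timestamp) tuples, which is a hypothetical shape, and the exposed_viewers function and sample names are ours. How you export view history from your business intelligence tool will vary.

```python
from datetime import datetime


def exposed_viewers(view_log, affected_dashboards, started, resolved):
    """Map each user to the affected dashboards they opened during the incident."""
    exposed = {}
    for user, dashboard, viewed_at in view_log:
        if dashboard in affected_dashboards and started <= viewed_at <= resolved:
            exposed.setdefault(user, set()).add(dashboard)
    return exposed


# Hypothetical view log rows: (user, dashboard, time of view).
VIEW_LOG = [
    ("maria", "dashboard.exec_kpis", datetime(2024, 2, 27, 9, 15)),
    ("sam", "dashboard.exec_kpis", datetime(2024, 2, 27, 13, 0)),
    ("maria", "dashboard.ops_daily", datetime(2024, 2, 27, 9, 30)),
]

viewers = exposed_viewers(
    VIEW_LOG,
    affected_dashboards={"dashboard.exec_kpis"},
    started=datetime(2024, 2, 27, 8, 0),
    resolved=datetime(2024, 2, 27, 11, 0),
)
```

In this example only maria opened an affected dashboard while it was broken, so she is the one who needs a proactive heads-up. sam viewed it after the fix and saw correct data.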

And finally, it allows you to keep stakeholders informed throughout the process. It automatically alerts them wherever they already work and consume data. This includes proactive alerting in your business intelligence tool and proactive updates via your #data-support channels in Slack. Data consumers can view historical insights about data reliability, and understand the current status of an asset including whether it is up or down. 

Ultimately, the benefits of data incident automation accrue to your people and your data culture. To your core data team, it gives time back to continue your important journey up the Hierarchy of Needs. And to everyone else, it provides more trust in your data, and in your data team. 

If you're interested in learning more about how to fast-track and improve your data incident automation programs, check out our quick demo below:


You can also watch the recording of an in-depth webinar about this topic with Monte Carlo's Co-Founder Lior Gavish. We took a deep dive into the root causes of the technological and process breakdowns that create issues in data incident management in the first place.
