Blog Post

Data Knowledge Pioneers Episode 1: Exploring Entropy

by
Workstream
-
January 31, 2024

We are pleased to announce the start of a new podcast, where we speak with data leaders and practitioners about the problems they experience in creating, curating, and disseminating knowledge about their data. You can find the video version and full transcript below, or download the episode on your podcast app of choice.

Transcript

Nick Freund: Hey everyone. Welcome to Data Knowledge Pioneers, presented by Workstream.io. We're exploring how organizations create shared consciousness about their data. I'm Nick Freund, and I'm speaking with leaders and data practitioners about the acute problems they experience in creating, curating, and disseminating knowledge about their data. Today we're gonna explore the problem of data asset sprawl. And joining me are two awesome data leaders, who really know how easy it is for analytics environments to descend into what I would call a state of chaos.

So, first off, we've got Jamie Davidson, who's the co-founder of Omni, an awesome new BI platform that they're developing. I highly recommend you check it out. He's the former VP of Product at Looker. And we also have Ted Conbeer, who's the Chief Data Officer at Privacy Dynamics, another great new company in the data space. I asked Ted to join because he was the former SVP of Data and Strategy at MakeSpace, and he was a very early adopter of Looker. So hopefully we'll have some good discussions about this problem from both the builder and the data practitioner sides of the house. Jamie, Ted, thanks so much for joining today.

Jamie Davidson: Thanks for having us. 

Ted Conbeer: Yeah thanks, Nick. 

Nick Freund: Of course. To start, I just thought maybe we could dive into the problem of data asset sprawl and what we think it is, and how we want to define it. For me, as I was just mentioning, I kind of define it as this phenomenon of entropy in analytics environments. Or this descent into chaos, where over time more and more stuff gets created, and you just end up in this state where there's a lot of things and people don't know what to use or trust. Ted, we've talked a lot about this. I would just be curious, is that how you would define it? Would you define it in a different way? How would you characterize this problem? 

Ted Conbeer: The way I look at it is, new tools like Looker and the other modern BI tools have made it so easy to create and share analytical assets. You know, dashboards and charts and things like that. Really, those things are pieces of software, and as they say with any software, 80% of the cost of building software is actually maintaining it. But no one is actually maintaining these analytical assets full-time. And they're not treating them like true products. There's a lot of talk of data as a product. But there's really hardly any maintenance. There's no observability or anything like that downstream of actually creating the asset.

And so, at MakeSpace, for a long time I was a data team of one. And then we had a small team supporting a couple of hundred people in the field, and so we would have to partner with power users across the organization. And they were really good at creating assets. But those things would quickly go stale or break or get out of date. And it wasn't clear in the tool which assets were trustworthy and which assets were maybe stale or out of date, and could lead people to the wrong answer. And so that was the problem that was keeping me up at night. And that's really what I think about when I think about data asset sprawl.

Nick Freund: Makes sense. Jamie, I know you ran data at Looker internally for a while, and obviously you've worked with lots of customers. Does that resonate with you? Does that sound right? Do you have a different take on it?

Jamie Davidson: I completely agree. I do think entropy is a good way of describing it. Part of the value of these tools is lowering the friction of asking every incremental question, so you can ultimately derive insight from your data, making it as easy as possible. The problem with lowering the friction, though, is you end up with a proliferation of logic. Every single permutation of every single question ends up with a new asset, a new dashboard that's looking at your sales funnel for this product line, for this region, for this timeframe and whatnot, without proper maintenance of those assets. Or without looking at a software development process, or a product process, and recognizing there's a cost to those things. You end up inevitably in a state of what we would call data chaos. Where you ask, “what is my canonical sales pipeline?” and get 15 different answers from 15 different dashboards, potentially with materially different results, driving different decision making, too. I think it's a key problem, and it's a technology problem. But I think it's also a process and people problem, kind of like software development and product development in the first place.

Nick Freund: Totally. I mean, if it's both a technology and process problem, is it shared 50-50? Is it 80-20 one way or the other? 

Jamie Davidson: Well, the people side is actually typically way, way, way more difficult. How do you deprecate assets? How do you discover things? How do you empower an organization to make good decisions? There's been a rise in folks talking about things like the metrics layer. How do you define your KPIs in the first place and get a canonical, agreed-upon definition across the organization? With those things, it's much more about the business and the process associated with it. Defining the metric is actually not that difficult to do. Every SQL analyst or every data scientist will go and do this, and redo it in every permutation.

I think in general, it ends up being mostly a people problem. But people problems can be helped with technology, and I think technology needs to serve as a means to reinforce canonical definitions. With Looker, we talk about a single source of truth: having a single place where you define metrics. And then you have the ability to have change management processes like a software development lifecycle, Git and version control and the like. That absolutely enables some of that people process. But the people process is the most important part. That's where I think most organizations fall over.

Nick Freund: Ted, I think you were calling that out too, right? That a lot of this was felt as you were enabling folks around self-service. Do you agree it's fundamentally a people problem?

Ted Conbeer: I guess so. I think in some ways, the technology and the expectations around the technology have created the people problem, from my standpoint. When you emailed someone an Excel spreadsheet, you put the date in the title of the spreadsheet. Somebody opens it a year and a half later. They don't expect to be able to make decisions using that year-and-a-half-old spreadsheet, right? But with a dashboard that automatically pulls down the latest data, they can fire that up and they're like, “Oh, cool, I'm good to go now.”

And so it's fighting that default expectation that every dashboard should live forever, and that an analysis is no longer a snapshot of a point in time. That expectation didn't exist back then; it exists now. And it is a really hard problem to fight, especially on a small team. I was, like I said, one person supporting 200. At one point we had up to a thousand Looker dashboards, and many thousands of saved Looks. You could say that's a people problem, because I enabled all these people to create that information to begin with. But we were getting a ton of value and unlocking a ton of great things by allowing that proliferation. Not having any tools to scale myself, to rein that back in, was really where I ran into a bit of a wall.

Jamie Davidson: Just to pile on there, I think if anything the technology has exacerbated the people problem, not solved it. This is one of the founding core ideas for Omni. I worked with probably thousands of Looker customers, and personally met with a thousand of them over my time there. And almost all of them ended up inevitably in a state of effective data chaos. We would call it model rot, where the content is suddenly disconnected from the data. You've lost columns in the database. You've lost tables in the database. You've now got inconsistent logic. Even though you've got a software development lifecycle that can govern the development of LookML, it's sort of a net additive thing. You end up with a huge proliferation of it, which I think is one of the things you were highlighting. But I think it's super interesting to actually think about the maintenance of your instance as part of the product problem in and of itself. Where perhaps not everything should be shared, and shared consistently across it.

At Looker we would have customers that would turn off features like our PDTs, which are a way of creating a materialized table that basically encodes business logic and materializes it in the database. People would turn that off because they'd end up with a huge proliferation of those, too. You see it even worse now with dbt, where everyone is an analyst. Everyone's got their own schemas, and they keep adding tables. We have to think of data and data assets as a curation problem, and maybe not everything should be part of a canonical data model. Maybe some things should be siloed. But we firmly believe there needs to be a maintenance step. There need to be higher-order constructs that allow optional promotion paths for the things that are, in fact, actually reusable, the things that need to be part of something that looks a lot closer to CI/CD: a testable, verifiable kind of process for key operational metrics or key operational workflows.
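
To make Jamie's promotion-path idea concrete, a CI-style gate might refuse to promote an asset into the canonical layer unless it has an owner, tests, and real usage. Here's a minimal sketch in Python; the asset fields and thresholds are invented for illustration, not drawn from any particular tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataAsset:
    """Hypothetical metadata for a dashboard or model."""
    name: str
    owner: str | None
    has_tests: bool
    last_used: datetime
    weekly_viewers: int

def promotion_blockers(asset: DataAsset,
                       min_viewers: int = 5,
                       max_idle_days: int = 30) -> list[str]:
    """Return failed checks; an empty list means the asset can be promoted."""
    failures = []
    if asset.owner is None:
        failures.append("no owner assigned")
    if not asset.has_tests:
        failures.append("no tests defined")
    if asset.weekly_viewers < min_viewers:
        failures.append(f"fewer than {min_viewers} weekly viewers")
    if datetime.now() - asset.last_used > timedelta(days=max_idle_days):
        failures.append(f"unused for more than {max_idle_days} days")
    return failures
```

Run as part of CI, a check along these lines would keep one-off assets in their silos while letting genuinely reusable ones graduate into the tested, canonical model.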

Ted Conbeer: I totally agree with that. I think borrowing from software helps a lot. But there is one key difference between data and software, which is that, in data, your truth is defined by who's using those assets. There's a social aspect to data and the conversations that happen around it.

You might have multiple definitions of a metric across an organization that are similar but meaningfully, and purposefully, different between the finance and operations teams. But if you're on the Ops team, what you really care about is, “What are the numbers that my boss is looking at? What are the numbers that the COO is looking at?” When you search for an asset, you shouldn't find the dashboard that the CFO is looking at, right? Or you should at least know: my boss hasn't looked at that set of numbers in two years, so he probably doesn't care about them.

That's literally how it worked when I was trying to manage the data sprawl at MakeSpace. Even though I had in many cases built the canonical sales dashboard at one point, or built the canonical Ops dashboard, I would always reach out to our VP of Sales on Slack and be like, “What dashboard do you guys look at every day now? Because that's the one that I'm gonna use.” It doesn't matter what I bookmarked six months ago. It really doesn't. They might have moved on. And I think it's great how reporting can evolve and how different functions can be really involved in that. But there's this social graph that's missing, one that would really empower especially centralized teams to understand, “What the heck is going on out there across all these data assets that have been created?”

Jamie Davidson: I completely agree. It's amazing to me that we don't use usage-based features or functionality to feed back into the consumption process. We don't know the metadata about who created things, when they were last used, and who's using them in disparate ways. I think that's ultimately the most important signal: if this metric is what's being used to drive operations, that's the metric we should actually care about. Versus the theoretical best metric, or the right metric, or whatever it is.
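
The usage metadata Jamie describes is straightforward to sketch, even though BI tools rarely surface it. Assuming a hypothetical log of view events (the dashboard names and viewers below are invented), a few lines of Python can rank assets by who actually uses them and how recently:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical view-event log: (dashboard, viewer, timestamp).
events = [
    ("sales_pipeline_v1", "vp_sales", datetime(2024, 1, 29)),
    ("sales_pipeline_v1", "coo", datetime(2024, 1, 30)),
    ("sales_pipeline_v2", "analyst", datetime(2023, 6, 1)),
]

stats = defaultdict(lambda: {"viewers": set(), "last_used": datetime.min})
for dashboard, viewer, ts in events:
    stats[dashboard]["viewers"].add(viewer)
    stats[dashboard]["last_used"] = max(stats[dashboard]["last_used"], ts)

# Rank by breadth and recency of real use: the operational signal,
# rather than the "theoretically best" metric.
ranked = sorted(stats.items(),
                key=lambda kv: (len(kv[1]["viewers"]), kv[1]["last_used"]),
                reverse=True)
for name, s in ranked:
    print(name, len(s["viewers"]), s["last_used"].date())
```

Fed back into search and discovery, even a crude signal like this answers Ted's earlier question: which dashboard does my boss actually look at?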

Nick Freund: It goes to Ted's earlier point, about data as a product in general, right?

Jamie Davidson: Yeah. 

Nick Freund: And what does that actually mean? And if you're building an actual product, the most important metric about a feature or an area of your product is, do people use it? How do they interact with it, and why? And if they're not, you probably wanna take that to end of life, right? That should be part of the discussion, but often it's not. 

Jamie Davidson: Also, defaults are really important. By default, not every dashboard should be a persistent dashboard forever. By default, actually, everything should be auto-end-of-lifed. Maybe you don't lose the conceptual logic. We're not throwing away the SQL. But you get a big warning: “Hey, this hasn't been touched in three months.” A user-beware kind of thing.
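
Jamie's stale-by-default idea is easy to express in code: keep the asset and its SQL, but attach a user-beware banner once it goes untouched past a TTL. A minimal sketch, assuming a hypothetical last_touched timestamp from the BI tool's metadata:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=90)  # Jamie's "three months" default

def staleness_banner(last_touched: datetime,
                     now: datetime | None = None) -> str | None:
    """Return a user-beware warning for a stale asset, or None if fresh.

    The underlying SQL and logic are kept either way; the default just
    stops presenting the asset as current once it goes quiet.
    """
    now = now or datetime.now()
    idle = now - last_touched
    if idle > STALE_AFTER:
        return (f"Warning: this dashboard hasn't been touched in "
                f"{idle.days} days. Verify it before making decisions.")
    return None
```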

Nick Freund: I mean, this is where I potentially jump the shark and take us in a completely random direction. But as I've thought about these problems, especially the people dynamics, I think a lot of it comes down, in some ways, to the problem of information asymmetries, right? If you believe that, in some way, shape and form, data builders and data consumers in an organization constitute a market, there are asymmetries in the information that they have about the data and how it's used. And there are whole theories in economics about how you resolve information asymmetries. The classic example of the problem this creates is the used car market, right? Where the buyer of a car doesn't know whether the car is a lemon or not, while the seller does. And therefore the buyer isn't willing to pay a lot of money for the used car, and the market can't price things efficiently. I'm not saying that every dashboard or piece of data that gets shipped is a lemon. But I do think consumers are looking at what's available to them and asking, “Is this a lemon or not? And how do I figure that out?” There's an information asymmetry on the consumer side. But then, to what both of you were just saying, there's also information asymmetry on the builder side. Because you just don't have the information that you need to maintain the product or evolve it.

Ted Conbeer: Yeah, it would be like listing your car for sale and never knowing if it's sold or not, right? 

Jamie Davidson: Yeah. 

Ted Conbeer: I think there's also just the recognition that some of these things we build are lemons. Some of them aren't useful. Some of them aren't accurate. And often those feedback loops are kind of broken to begin with. But when you know those things exist out there, it's impossible to build trust across an organization. There are extremely high-value, very trustworthy datasets within this landscape that you can use to make high-value decisions. But building up that kind of trust is a really difficult task for any data team. Because once you have it, the whole organization can move a lot faster. They can be a lot more independent. That's the dream of self-service. But it takes a long time to build that trust, and a very short amount of time to lose it. And you don't always even get the feedback when someone has a bad experience or discovers that they've been consuming a lemon.

Jamie Davidson: I completely, one-hundred percent agree. I think ultimately it really needs to be a partnership. If you're building data as a product, you gotta be a product manager. You gotta talk to the customers and contextualize the product that you're building. Data folks often have the technical context: they understand the schema, the ETL process, the freshness of the data, the accuracy of the data. But they often have a disconnect from actual use. What is actually important for the sales pipeline, or for this product, or for this department? The partnership, I think, is kind of a people problem, or a process problem, too. How do you contextualize enough of the technical side, pair it with that business context, and have that feedback loop going both ways?

Ted Conbeer: I loved embedding or pseudo-embedding analysts in teams. My team would report to me, but every week they would attend the marketing metrics review, or every morning they'd attend the ops daily review. Because then they get to actually see how the teams are using their dashboards. They get to hear the questions that come up around the data, and be part of that kind of dialogue. Without that, you're completely blind. If we got busy and our team members stopped attending those meetings, it felt like things always blew up. Because the business drifted faster than you thought it would. The data or the software drifted faster than you would think. And it takes a ton of engagement, and a ton of feedback that, honestly, isn't natural for stakeholders to provide. They might not even know about the changes happening behind the scenes or under the hood that might make their data less accurate. And it's asking them to finish their day job and then think critically, and think hard, about how they could use data better, or what parts of their reporting don't answer their questions today. That's a lot to ask of somebody who's very busy with a completely different job. But it's also a very expensive people solution to a problem that one would hope could be supplemented by usage data or the other tech we talked about.

Nick Freund: Totally. I think part of the challenge there, just from having talked to lots of teams, is: how do you manage the demands on your team across all of your various workflows? You've got all this engineering work that you have to do. There are service and partnership-related interactions, and then there's the long-term product management. It's a lot to put on a generally small group of people, with lots of context switching. And so, if you just lean fully into the people version of the solution, it's a hard one to scale and get right.

Ted Conbeer: It requires a lot of judgment, in many cases. And I hope this is changing. I feel like it is. But in many cases, our data analysts are among the most junior people in the organization. They're 23-year-olds: smart, ambitious, but without the context and the years of experience to understand the long-range consequences of what they're doing. And so often analysts are over-eager to build something new, create a new solution, and answer a question for the first time, because it feels great. But they don't appreciate the cost of supporting that thing they just built into perpetuity. That's definitely something I learned the hard way over and over again. Now, having been in this job for many years, I appreciate the difference between getting someone a quick answer, and making it clear that this is a one-off exercise, a one-off analysis that may be contributed to a knowledge repository but doesn't become a piece of software I have to maintain forever. I make decisions about how I build it, how I communicate the results, and how I set expectations in order to enable that. Because if you don't do that, and you think you're in product-building mode, and you set those expectations or make certain investments in building that product, you can end up in a really bad spot six months down the road, where you've built all these products and you haven't maintained any of them. And half of them aren't being used, and the other half are broken. So it can be a really tricky balance to strike.

Nick Freund: I mean, you've just been talking about it now, but historically – you talked about embedding – what are the top mechanisms or solutions that you had implemented to try to address some of this in the past? 

Ted Conbeer: My biggest hack, the one I wish I could copyright and take credit for, is that I would offer to do a dashboard review for anyone in the organization, turned around in less than 24 hours. So: go out, build whatever you want, send me the link, and within 24 hours I'll give you a thumbs up or thumbs down on whether you built it right, or flag the things you need to know about or be aware of. And the same thing if anybody wanted to say, “I'm about to use this dashboard to make a decision. Does this look right to you? I found it. Someone built it a year ago. Can I use this?” Over and over again, in almost every interaction with my stakeholders across the business and Product and Ops and Marketing, I would just repeat this. If you ever don't know, or if you don't use the dashboard every day and you just found something, or you want to build something new, just send it to me. Because I can take 15 minutes and probably tell you if it's close to right or not. But if you never send it to me, if you never ask, I'll never know. And then I kind of wash my hands of it a bit: if you end up making the wrong decision, I'm not gonna back you up in that meeting. I really think that helped build a lot of trust, and it put a people-and-process step in place that helped at least guard against the worst-case scenario, which is: somebody wakes up, finds a dashboard, and makes a huge decision off of that data.

Jamie Davidson: I'm sure that also enables folks to learn, and empowers them to go do self-service, too. Because you've got a step to validate and verify that something is okay. Whenever a product analyst goes and builds a new dashboard, someone can vet it, and that's a feedback loop in and of itself. That makes a ton of sense. That's awesome.

Ted Conbeer: Yeah. It empowered people. It definitely felt like it was time very well spent. It dramatically lowered my stress, and improved everybody's trust, and improved everybody's usage of the tool once that was in place, and once I made that very clear across the organization. 

Nick Freund: Ted, did people take you up on that a lot? Or was that more of the thing you did every once in a while, but then people didn't actually... 

Ted Conbeer: No, it was multiple times every day.

Nick Freund: Oh wow. That's a lot. 

Ted Conbeer: Yeah.

Nick Freund: So it wasn't a formal request that Ted validate this and certify it in perpetuity, right? This was more of a one-off gut check? Or was it more of a spectrum?

Ted Conbeer: I would say the request was always just, “Hey, is this right?” Because I think even that level of subtlety or nuance is way beyond what you can expect from most business stakeholders. The difference between, “is this right now?” and “is this gonna be right in a month?” People don't really think that way, in my experience. So, it's like, “Hey, is this right?” It's a quick message in Slack. One line, with a link. That's it. That's all I need. And then I would take 15 minutes and write back usually just a couple of sentences around, “Yes, this looks great.” Or, “Most of this is good. This one thing worries me.” Or “I wanna make sure you're thinking about this metric the right way, because I know that metric can be really confusing and I don't know that we've trained you on that metric yet.” You know, things like that. 

So, we kind of ran the spectrum. The highest level of engagement for me would be: I'm gonna schedule a meeting with you for 20 minutes tomorrow. We're gonna talk about my concerns about this. I'm gonna push you in the right direction, or train you on a new way to get to the numbers that you're looking for. Or: I need more context about what you're doing; let's talk about that and then I can give you other feedback. So it usually was a very high-leverage activity. Once the team scaled up, I would either divert those requests to members of my team, or, instead of a one-on-one conversation, it would become a conversation between me, my team member, and our stakeholder. So again, we're building that knowledge internally, building that trust, and building relationships between the more junior members of my team and the business. I just found it to be a way I could offer to be really helpful and increase my engagement with everyone across the business. And it always ended in conversations that just felt really good, and honestly didn't take that much of my time.

Nick Freund: Jamie, are there other things you've seen teams do, or you or your teams have implemented in the past to solve some of this?

Jamie Davidson: I always loved a hub-and-spoke embedded analyst model, too. I think this is true for Product teams, but it's certainly true for data teams building for the product: you have to build empathy for customers. That is clearly a best practice. I've seen folks do things like office hours, or training very deliberately as part of onboarding. So whether they're rolling out a new tool or bringing new people onto the team, they're promoting an atmosphere of data literacy with some contextual piece. You're the Marketing Ops person, so we're gonna train you on the marketing metrics and the marketing dashboards. You learn some best practices from that process.

But there, it's variable. It varies with the quality of the content and the engagement. And then there's also the question of, how does this help me do my job better? You have to inspire the “Why look at data? Why will this improve your process?” And I think that tends to be a cultural thing that's embedded, as opposed to something that can be brought in or changed in a one-off way. A single office hours session is hard-pressed to change it.

Aside from that, I think the other area where folks break down quite a bit is the interface between the business and the analysts and BI folks. And then there's also the interface between analysts and the data engineering, infrastructure-y folks. Whether it's ETL, or Fivetran replicating SaaS metrics, or whatnot, there are losses of context there. I think the rise of the analytics engineer is amazing. People are actually able to do things like data transformation and preprocessing of data. Snowflake and BigQuery make things possible, and Fivetran makes things possible, which is great. But we still see the Product team ship a new feature and change how the underlying data model for their product works. Which is totally fair, and should never be gated by the data team, but it may break downstream things, or have unintended consequences for downstream things, too. So it's another area where we see a lot of the entropy again. The columns change, the tables change, the meaning of things may change in a way that's not intuitive to folks, and it can still produce that same terrible end user experience. Where I thought all the logic was consistent, the SQL is good, but you're now throwing all of the experimental records in. Or the deleted records from Salesforce are in. And they're all present because the dbt model changed a little bit and we didn't really realize it downstream.

But I think all of this is always an iterative process where you have to continually talk with more and more folks, and go back to first principles. Like, “Hey, we wanna solve this problem, and we need this data. Let's make sure everything is verified all the way up and down the stack.” And you sort of have partnerships all the way up and down, from the end business user through the engineer who's actually enabling it.

Ted Conbeer: Jamie, one thing that you kind of touched on: one of the reasons that good tech solutions don't exist for this is that it's really hard tech to build. I mean, detecting schema changes is one thing, but the schema doesn't have to change to totally break reporting, right?

You're building a dashboard of men's pants sales, and then you launch a kids' line or something, right? The data can change without the schema changing. You could write a dbt test for the accepted values of that column. But the person who's a few layers deep in the marketing organization, who cares about the sales of kids' pants, is not gonna go and write a dbt test to validate the assumption they made in their dashboard. That's never happened. It's really hard to future-proof something when there are so many degrees of freedom, and so many different ways that your data or your assumptions can become incorrect.
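
For readers who haven't written one, the test Ted mentions is dbt's built-in accepted_values test. The equivalent check, sketched here in Python with an invented column and values, shows why a kids' launch breaks the dashboard without any schema change:

```python
import pandas as pd

# The assumption baked into the men's pants dashboard: only these
# product lines exist. A kids' launch violates it without touching the schema.
ACCEPTED_PRODUCT_LINES = {"mens_pants", "womens_pants"}

def unexpected_values(df: pd.DataFrame, column: str, accepted: set) -> set:
    """Return values in the column that the dashboard never anticipated."""
    return set(df[column].unique()) - accepted

sales = pd.DataFrame({"product_line": ["mens_pants", "kids_pants"]})
surprises = unexpected_values(sales, "product_line", ACCEPTED_PRODUCT_LINES)
if surprises:
    print(f"Dashboard assumption violated by new values: {surprises}")
```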

Jamie Davidson: I completely agree. It ends up being a problem that has a bunch of disparate owners of various parts of the solution, and they all have to interface in order for it to actually come together and work. When you're the one-person data team and you're doing everything, and you have all the context of the business, that's a favorable place. That's amazing. You can go really quickly. But then inevitably complexity creeps in, and you want specialization. And suddenly the definitions of these things change. So as a result, even if the SQL is still valid, it's no longer what you think it is.

Nick Freund: With some of these changes, it becomes impossible to control for all of them, right? I think it goes back to what both of you have been saying around investing in the relationships and the partnerships. In some ways, this is everyone's problem to address. It's not just the data people; it's also the folks throughout the business who are consuming the data, right? And maybe it's the data team's job to go and evangelize that, and to catalyze that. But you ultimately have to create a culture around it where you've got the quote-unquote “customers” talking to the builders or the Product people, right?

Jamie Davidson: Yeah, exactly. 

Nick Freund: Cool, well, we've been going for a while. I think that might be a good place for us to wrap, unless either of you have anything else that you think we should cover. Again, Ted, Jamie, thanks so much for joining, and thanks to everyone for listening to Data Knowledge Pioneers. I'm your host, Nick Freund, founder and CEO of Workstream.io. If you're excited to learn more, please join us next time, when I'll be talking to the founder and CEO of Brooklyn Data Company, Scott Breitenother, and the Head of Analytics at Future, Michelle Ballen. We're gonna dive more into tribal knowledge about data, and how to capture it and empower your team.
