Data mesh is all the rage in the data industry right now. However, even after 50 years of investment in systems for enterprise data quality, most sizable corporations struggle to break down data silos and publish clean, curated, comprehensive, and continuously updated data. One of the primary reasons companies still grapple with this issue is that they’re missing a critical component of the modern data ecosystem: data mastering.
The history of data mastering and the evolution of its ecosystem
Let’s get out our flux capacitor, hop into the DeLorean and go back in time! We’re going to explore the 1990s and the early 2000s when most data professionals were focused on building enterprise data warehouses, an approach that was somewhat successful because a company’s data ecosystem existed as a large monolithic artifact within the organization, one that could be governed and contained.
Now let’s speed ahead a decade. Here you’ll witness the introduction of next-generation analytics tools such as Qlik, Tableau, and Domo geared to democratize the data ecosystem. The goal of these tools was to have analysts – rather than database administrators – dictating how data should be processed and consumed in a distributed manner. The assumption at this time was that data aggregation was ineffective.
On the last leg of our time travel journey, we’ll make a pit stop in the mid-2010s to witness how cloud infrastructure provided the ability to quickly scale storage and compute efficiently. Everyone wanted to aggregate their data in a data lake during this time – or at least move their data to the cloud first, then figure out how to use it.
If Biff can be tamed, a new approach to the data ecosystem can be embraced
We are back…no, not in 1985, and come to think of it, I haven’t heard a Huey Lewis song in ages or perused a childhood bully’s Facebook page to see how they turned out. But it is 2022 and the data ecosystem is changing constantly with volume and variety exploding. Data is becoming more and more external and the best version of it exists outside of the firewall, not in your organization’s ERP or CRM solution.
To deal with data silos in analytics use cases, there are four strategies one can employ:
- Rationalization: consolidating data from different systems into one
- Standardization: creating consistent vocabularies and schemas and pushing them from one system to the rest
- Aggregation: assembling all data into a central repository such as a data warehouse
- Federation: storing and governing data in a distributed manner, with data sources interconnected by domain
In any successful data project, all four strategies are necessary, but none is sufficient on its own. Each requires a centralized entity table and persistent Universal IDs (UIDs) linking data together. This is where data mastering comes in.
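To make the idea concrete, here is a minimal sketch (in Python, with invented source names, record keys, and fields — not any particular product’s API) of what a centralized entity table with persistent UIDs might look like: records about the same customer from two hypothetical silos are linked to a single UID, so downstream queries can join on that UID regardless of where the data lives.

```python
import uuid

# Hypothetical records about the same customer from two silos.
crm_record = {"source": "crm", "id": "C-1001", "name": "Acme Corp.", "city": "Boston"}
erp_record = {"source": "erp", "id": "E-77", "name": "ACME Corporation", "city": "Boston"}

# The centralized entity table: one persistent UID per real-world entity,
# plus a mapping from every source-specific key to that UID.
entity_table = {}    # uid -> canonical ("golden") record
source_to_uid = {}   # (source, source_id) -> uid

def master(records, uid=None):
    """Link a group of records that refer to the same entity under one UID."""
    uid = uid or str(uuid.uuid4())
    # Naive survivorship rule for illustration: take fields from the first record.
    entity_table[uid] = {"name": records[0]["name"], "city": records[0]["city"]}
    for r in records:
        source_to_uid[(r["source"], r["id"])] = uid
    return uid

uid = master([crm_record, erp_record])

# Any distributed query can now resolve either source key to the same entity.
assert source_to_uid[("crm", "C-1001")] == source_to_uid[("erp", "E-77")] == uid
```

Because the UID is persistent, new sources can be linked in later without breaking existing joins — which is exactly the property a federated, mesh-style query layer depends on.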
According to Zhamak Dehghani, former director of emerging technologies at Thoughtworks North America, data mesh is a new enterprise data architecture that embraces “the reality of ever-present, ubiquitous, and distributed nature of data.” And it has four aspirational principles:
- Data ownership by domain
- Data as a product
- Data available everywhere (self-serve)
- Data governed where it is
In this paradigm, the data is distributed and external. That’s why traditional, top-down master data management (MDM) simply will not work. Instead, organizations need to start with a machine learning-first data mastering approach, with a human in the loop to validate the results. It is a cornerstone of a successful data mesh strategy because it provides the centralized entity table and persistent Universal IDs (UIDs) that allow users to run distributed queries.
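As a rough illustration of the human-in-the-loop pattern — a sketch with made-up thresholds and a naive string-similarity stand-in for a trained matching model, not a production matcher — an ML-first mastering loop scores candidate record pairs and routes only the uncertain middle band to a human reviewer:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Naive stand-in for a trained matching model's score (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Assumed thresholds: scores above AUTO_MATCH are accepted automatically,
# scores below AUTO_REJECT are rejected automatically.
AUTO_MATCH, AUTO_REJECT = 0.9, 0.5

def triage(pairs):
    """Auto-accept confident matches, auto-reject clear non-matches,
    and queue the ambiguous middle band for human review."""
    matched, rejected, review = [], [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            matched.append((a, b, score))
        elif score < AUTO_REJECT:
            rejected.append((a, b, score))
        else:
            review.append((a, b, score))  # a human validates these
    return matched, rejected, review

pairs = [
    ("Acme Corp.", "Acme Corp"),         # near-identical: auto-match
    ("Acme Corp.", "Zenith Ltd."),       # clearly different: auto-reject
    ("Acme Corp.", "ACME Corporation"),  # ambiguous: send to a reviewer
]
matched, rejected, review = triage(pairs)
```

The point of the middle band is scale: the model handles the overwhelming majority of pairs automatically, and human effort is spent only where the model is genuinely unsure — which is what makes mastering thousands of sources tractable.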
Data mastering within the context of data mesh strategy
What’s inherent in data mesh is the belief that data is more distributed than ever. Today, we need to think about data in terms of logical entities: customers, products, suppliers … the list goes on. But I’ll let you in on a dirty little secret: most companies actually have thousands of sources that provide data about these entities, making it difficult, if not impossible, to implement a data mesh.
Companies embarking on a data mesh strategy will quickly realize that they need a consistent version of the best data across the organization. The only method of achieving this at scale is through ML-driven data mastering.
Peanut butter sandwiches taste WAY better with jelly!
Think about data mastering as a complement to data mesh. On its own, each produces a good result. But combined, the results are spectacular. You can’t make a truly perfect sandwich with peanut butter alone, and data mesh without data mastering is a dry offering, missing the jelly that makes it juicy and sweet.
When you apply human-guided, ML-driven data mastering, you clean up your internal and external data sources. You engage in a bi-directional cycle allowing you to cleanse and curate your data efficiently. You also effectively realize the promise of distributed data mesh. It’s a critical continuous loop so that you can incorporate changes to your data – or sources – over time.
As you build your data mesh strategy, remember that the key to success is starting with modern data mastering.
And should you require a refresher course to leverage learning from past mistakes, don’t forget you need to be traveling at 88 miles per hour and generating 1.21 gigawatts…. And look out for Doc – he can be found by the clock tower!