Hardly anyone relying on data can say their data is perfect. There is always that difference between the dataset you have and the dataset you wish you had. This difference is what Data Quality is all about.
Data quality problems exist wherever data is used: in tech and non-tech businesses, in the public sector, in engineering, in science. Each of these domains has its own data specifics and its own set of data quality criteria.
Enterprise data quality deals with data quality in ERP data — data describing the flow of business processes in organizations. These include financial transactions, sales transactions, contracts, inventories, as well as lists of customers, vendors, etc.
Large organizations and most medium-sized businesses use highly integrated Enterprise Resource Planning (ERP) systems to run their business processes. ERP data is a central component of such applications; it drives and controls the automatic flow of business processes in them. Every tick of this flow feeds into the company’s financials. That is why any business would want to make sure its ERP data is good enough to support the consistent and correct running of its business processes.
Companies understand this so well that they spend up to 50% of their data analysts’ time finding and correcting data issues.
All modern tools and processes for maintaining Enterprise Data Quality are effectively rule-based, which means, in essence, they work by evaluating data against some set of pre-defined rules or conditions.
This approach has dominated business data landscapes since mainframe times, and its central principle hasn’t changed since. There is a good reason for that: it is robust and predictable.
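To make the rule-based principle concrete, here is a minimal sketch of what such a check looks like in code. The record layout, the field names and the three rules are all hypothetical and invented for illustration; real tools wrap the same idea in far more machinery.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record layout and rules, invented for illustration only.
Record = dict


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


RULES = [
    Rule("amount_is_positive", lambda r: r["amount"] > 0),
    Rule("currency_is_known", lambda r: r["currency"] in {"USD", "EUR", "GBP"}),
    Rule("vendor_is_present", lambda r: bool(r.get("vendor_id"))),
]


def violations(record, rules=RULES):
    """Return the names of all rules the record fails."""
    return [rule.name for rule in rules if not rule.check(record)]


invoices = [
    {"id": 1, "amount": 120.0, "currency": "USD", "vendor_id": "V-001"},
    {"id": 2, "amount": -5.0, "currency": "USD", "vendor_id": "V-002"},
    {"id": 3, "amount": 80.0, "currency": "XXX", "vendor_id": ""},
]

for invoice in invoices:
    print(invoice["id"], violations(invoice))
```

Every problem the system can catch has to be written down as one of these rules in advance, which is exactly where both the predictability and the limitations come from.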
The world, however, has changed dramatically since then — corporate databases have grown thousands of times both in volume and complexity. Today, this old rule-based principle has started to show its disadvantages:
- As data becomes more diverse, the number of combinations and interactions in data grows exponentially, which means the number of rules required to maintain the same level of Data Quality grows exponentially too. For businesses, this means the costs and efforts they have to spend on data quality also grow fast. This explains why companies have to pay so much to maintain good data quality today.
- Any rule-based system has an intrinsic limitation: it can only deal with problems known to the people maintaining the system. But because people learn from mistakes, this also means that every issue they know about has shown itself before as a data incident, and most likely caused losses. This intrinsic dependency renders all rule-based processes reactive. It explains why, in reality, all Data Quality assurance systems are so closely related to incident management.
- All rule-based systems are rigid. This adds the burden of updating the rule sets to keep up with an ever-evolving business. That burden includes updating documentation, changing and testing new rules, cleaning up old and no longer relevant ones, and so on. For large and older businesses that have a long history of changes, this becomes very tricky.
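The growth in combinations mentioned in the first point can be illustrated with a few lines of arithmetic. The sketch below assumes a handful of hypothetical fields with made-up cardinalities, and counts how many field pairs and distinct value combinations exist as fields are added. It is an upper bound on what rules might need to cover, not a measurement from any real system.

```python
from math import comb, prod

# Hypothetical fields and the number of distinct values each can take.
field_cardinalities = {
    "company_code": 5,
    "document_type": 8,
    "currency": 10,
    "payment_terms": 12,
    "tax_code": 20,
}


def value_combinations(cardinalities):
    """Number of distinct value combinations across all given fields."""
    return prod(cardinalities)


def field_pairs(n_fields):
    """Number of field pairs that a cross-field rule could relate."""
    return comb(n_fields, 2)


sizes = list(field_cardinalities.values())
for n in range(1, len(sizes) + 1):
    print(
        n, "fields:",
        field_pairs(n), "pairs,",
        value_combinations(sizes[:n]), "value combinations",
    )
```

Each added field multiplies the number of value combinations, so even a modest increase in data diversity can enlarge the space that rules are expected to cover by an order of magnitude.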
In the past ten years, the pace of change has only increased: more and more businesses are migrating to modern cloud infrastructure and getting access to more powerful databases. The data an average company uses has exploded in size and complexity.
As a result, the Data Quality function in any large organization is under enormous pressure, which will only get worse with time.
Enterprise data quality is a big business dominated by such behemoths as Informatica, IBM, SAP, Oracle and others. To help businesses, they offer all sorts of apps to simplify and accelerate rule management. But they do not question the foundational principle and therefore do not address the fundamental disadvantages of the rule-based model that has been in use for more than 60 years.
Unlike others, we do question this model! Over the past three years, we have done extensive research into new ways of handling data quality in typical business data. And we found an answer in AI, as you might have already guessed from the title. We found that a non-rule-based approach to Enterprise Data Quality is possible and that it has many new benefits, which look so fantastic that they will make any data quality professional sceptical:
- No need to maintain rules, and therefore, there is no scaling problem as your business processes become more complex, and as your data gets more diverse.
- An AI algorithm can discover not-yet-known issues: issues that are already in the data but haven’t shown themselves as incidents yet.
- An AI algorithm can be self-learning, which means you don’t need to program it to understand your data or your business process. You don’t need to have up-to-date documentation describing your as-is state to start using it. All you need to do is feed your actual data into it.
- The algorithm is also self-adjusting, which means it will automatically keep up with changes in business processes.
- Because of the above two properties, it can work in a deploy-and-forget mode.
- It can not only find problems but also suggest a correction for each record it flags as wrong.
- It can potentially replace most rules in any existing Data Quality Assurance system.
- And finally, it can form a closed-loop, fully automated Data Quality assurance system where data issues are corrected before you know it. All you need to do is watch reports showing how many data quality incidents the algorithm has prevented.
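To give a feel for what "learning from the data instead of from rules" can mean, here is a deliberately simple sketch. It is not the algorithm discussed in this article; it is a plain frequency count standing in for it. It learns which value of one field usually accompanies each value of another, flags records that deviate from a dominant pattern, and proposes the usual value as a fix. The field names, the sample data and both thresholds are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_patterns(records, key_field, value_field):
    """Count, for each key, how often each value appears alongside it."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[key_field]][record[value_field]] += 1
    return counts


def find_suspects(records, key_field, value_field, min_support=5, min_share=0.8):
    """Flag records whose value deviates from a dominant pattern for its key.

    A pattern counts as dominant when the key has at least `min_support`
    records and one value covers at least `min_share` of them.
    Both thresholds are illustrative assumptions.
    """
    counts = learn_patterns(records, key_field, value_field)
    suspects = []
    for record in records:
        seen = counts[record[key_field]]
        total = sum(seen.values())
        usual_value, usual_count = seen.most_common(1)[0]
        if (
            total >= min_support
            and usual_count / total >= min_share
            and record[value_field] != usual_value
        ):
            suspects.append((record, usual_value))  # (record, suggested fix)
    return suspects


# Hypothetical data: vendor V-001 is almost always paid on NET30 terms.
records = [{"vendor": "V-001", "terms": "NET30"} for _ in range(9)]
records.append({"vendor": "V-001", "terms": "NET03"})  # likely a typo
records += [{"vendor": "V-002", "terms": t} for t in ("NET30", "NET60")]

for record, suggestion in find_suspects(records, "vendor", "terms"):
    print(record, "-> suggested:", suggestion)
```

Nobody wrote a rule saying that V-001 must use NET30; the expectation comes from the data itself, and it shifts on its own if the vendor's usual terms change. Note also that V-002 is left alone because two records are too few to establish a pattern, which mirrors the small-data limitation any such method has.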
Looks too good to be true, doesn’t it? Of course, it also has downsides.
Like any other machine learning algorithm, it will not replace methods that work well without the need for AI, such as validating addresses, phone prefixes, and email addresses. It will not work well when your dataset is small or when every record in it is unique and does not follow any pattern.
But the critical, unfixable problem of this approach is precisely what makes it so fantastic: it is non-rule-based. Because business applications, in general, have been using business rules for years, the business-rule mindset is deeply rooted in business culture everywhere. Introducing AI algorithms that question this core principle will not be easy.
But difficult doesn’t mean impossible! With such an impressive list of benefits and a gradual, step-by-step implementation plan, AI methods such as this will eventually shift the business culture from scepticism to cautious enthusiasm, just as happened with Big Data platforms and Cloud infrastructure over the past ten years.