What Risk Managers Should Know About Data Integrity to Reduce AI Risks

Kevin Gaut | April 21, 2026


Ask many corporate leaders about the foundation for their organization’s success with artificial intelligence and they will likely talk about using the latest models with the most sophisticated algorithms to generate “game-changing” insights. However, this laser focus on choosing the best AI model is flawed.


Success with AI does not start with algorithms. Instead, it begins with clean data. Without trusted information, even the most intelligent AI models will hallucinate, creating a potential landslide of operational, financial and reputational risks.

A data-first, algorithms-second approach is especially important for risk managers, who are left to deal with the consequences when inaccurate AI outputs create downstream problems. The risk is so great that AI now ranks as the second-largest global business concern, behind cyber incidents, in the 2026 Allianz Risk Barometer.

To best protect their organizations, risk managers need to know about several key issues around data integrity and AI.

Inside the ‘Dirty Data’ Dilemma

AI models depend on data to deliver trusted outputs, which is why data accuracy and cleanliness are essential. Yet as insurers and other businesses race to adopt AI and stay competitive, many default to an algorithm-first approach. Without equal attention to data, proving return on investment becomes a real challenge.


When AI misfires, the problem can often be traced back to where organizational data is stored and how models can access and train on it. Inside many companies, data is scattered across different systems, limiting the context that AI models need to do their job properly. The problem is compounded for companies saddled with core legacy mainframe systems built on COBOL or other outdated programming languages. Trying to integrate AI into these rigid solutions is extremely difficult. 

Even in organizations with modern, cloud-based core systems, unclean and fragmented data can cause unexpected consequences, especially once it is entered into automated workflows. Consider an insurance carrier that implements an automated underwriting process. An AI model pulls property and loss data from multiple sources. Any gaps in that data, such as missing or outdated building characteristics or loss details, could create pricing errors and inconsistencies that can go unnoticed for weeks or months.
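To make that failure mode concrete, here is a minimal sketch, using hypothetical field names and thresholds, of a pre-pricing data check that routes gappy records to human review instead of letting an automated flow price them silently:

```python
from datetime import date

# Hypothetical required fields for an automated property underwriting step.
REQUIRED_FIELDS = ["construction_type", "year_built", "square_footage", "loss_history"]
MAX_DATA_AGE_DAYS = 365  # assumption: building data older than a year is "stale"

def validate_submission(record: dict) -> list[str]:
    """Return a list of data-quality issues; an empty list means safe to auto-price."""
    issues = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    last_verified = record.get("last_verified")
    if last_verified is None:
        issues.append("no verification date on building characteristics")
    elif (date.today() - last_verified).days > MAX_DATA_AGE_DAYS:
        issues.append("building characteristics are stale")
    return issues

submission = {"construction_type": "masonry", "year_built": 1987,
              "square_footage": None, "last_verified": date(2024, 3, 1)}
problems = validate_submission(submission)
if problems:
    print("Route to human review:", problems)  # do not auto-price on gappy data
```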

For risk managers in industries like financial services, health care, or life and health insurance, data concerns are even greater. These organizations typically handle troves of personally identifiable information (PII) and must ensure any data used to train AI models complies with privacy laws such as the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). 

Considerations for Risk Managers

As organizations expand their use of AI, risk managers need to evaluate how their organization—or the carrier they partner with—uses data and AI to make decisions. These four questions can help organizations assess and reduce data-related risks:

1. Is all data preserved or is it cleaned too early?

Risk managers do not have to understand data pipelines to create a lower-risk, AI-ready future, but they do need to understand how data preservation and extraction have changed in the AI era.

Just a few years ago, companies focused on “big data,” trying to collect as much information as possible. To do so, they used a process called “extract, transform and load” (ETL) that flattened all incoming data into a single, clean version before it was used.

For AI, ETL no longer makes sense. When you standardize data before it is analyzed, you end up training AI models on diluted, one-size-fits-all information that lacks context. This approach makes it difficult for companies to verify decisions made by AI tools.

A smarter approach is to seek carrier partners who preserve all raw data, no matter how ugly and dirty it might be. Companies that do so can take the best of that raw data, remove any anomalies or biases, and then train their AI models to understand and enhance specific use cases, such as streamlining the underwriting process. Doing so reduces risks and preserves raw data for future use, if needed.
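As a rough illustration of the difference, the sketch below, with hypothetical fields, lands every record untouched and then derives a cleaned, use-case-specific view, so cleaning decisions stay reversible and auditable:

```python
import copy
import json

raw_store = []  # immutable landing zone: every record kept exactly as received

def ingest(record: dict) -> None:
    """Land the raw record untouched; cleaning happens later, per use case."""
    raw_store.append(copy.deepcopy(record))

def underwriting_view() -> list[dict]:
    """Derive a cleaned view for one use case without mutating the raw data."""
    view = []
    for rec in raw_store:
        if rec.get("loss_amount") is None:
            continue  # excluded here, but the raw record remains available for audit
        view.append({
            "policy_id": str(rec.get("policy_id", "")).strip(),
            "loss_amount": float(rec["loss_amount"]),
        })
    return view

ingest({"policy_id": " P-100 ", "loss_amount": "12500.00", "notes": "hail"})
ingest({"policy_id": "P-101", "loss_amount": None})  # kept raw, filtered from this view
print(json.dumps(underwriting_view(), indent=2))
```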

2. Are data extraction tools fully integrated into core systems and processes?

Many carriers and organizations rely on optical character recognition (OCR) solutions to extract unstructured data from policy forms and claims submissions and turn it into a structured format. Just because data is extracted, however, does not mean it is ready for AI.

OCR used in isolation does not ensure that your data can deliver an accurate underwriting or claims decision, nor does it remove bias from historical inputs. It only structures the data. Carriers must decide how that information is governed and contextualized for AI engines.
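To illustrate the gap between "extracted" and "AI-ready," here is a minimal sketch of a governance layer between an OCR engine and the models that consume its output, assuming the OCR step emits per-field confidence scores (a common but not universal feature):

```python
# Hypothetical post-OCR step: OCR has already produced key/value pairs with
# per-field confidence scores; this layer decides what is actually AI-ready.
CONFIDENCE_FLOOR = 0.90  # assumption: below this, a human re-keys the field

def govern_extracted_fields(ocr_output: dict[str, tuple[str, float]]) -> dict:
    accepted, needs_review = {}, {}
    for field, (value, confidence) in ocr_output.items():
        if confidence >= CONFIDENCE_FLOOR and value.strip():
            accepted[field] = value.strip()
        else:
            needs_review[field] = value  # structured, but not trusted yet
    return {"accepted": accepted, "needs_review": needs_review}

result = govern_extracted_fields({
    "insured_name": ("Acme Logistics LLC", 0.98),
    "policy_limit": ("1,OOO,OOO", 0.41),  # classic OCR confusion: letter O vs zero
})
print(result)
```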

This is where integration matters the most. Carriers and organizations with agile, cloud-based core systems can embed AI easily, meaning they can implement OCR tools with AI features that also clean and enrich extracted data, thereby improving accuracy. Organizations and carriers with legacy systems cannot do this, which means their AI-related risks are amplified.

3. How will AI models use PII and other sensitive data?

When sensitive data such as PII leaves an organization’s direct control, risk managers must understand where it goes, how long it is retained, who can access it, and what happens if something goes wrong.

If your organization or partner sends PII to an external AI tool such as ChatGPT, the tool's provider becomes a data sub-processor. Once that happens, your company may be required to disclose the relationship and ensure the sub-processor complies with applicable privacy and security regulations. The organization also remains responsible for how that data is handled downstream. Risk managers must understand these potential pitfalls and develop strategies to address them.
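A minimal sketch of one such safeguard, assuming only simple, regex-detectable identifiers, masks PII before a prompt ever leaves the organization's control:

```python
import re

# Hypothetical pre-processing step before any text leaves the organization's
# control: mask common PII patterns so the external model never sees raw values.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

claim_note = "Claimant Jane Doe, SSN 123-45-6789, reachable at jane@example.com."
print(redact(claim_note))
# Names still leak through here; real deployments need entity-recognition
# tooling on top of simple pattern matching.
```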

4. When is human intervention required?

As organizations and carriers advance from generative AI, which requires user input at every step, toward agentic AI, which plans and executes multi-step processes autonomously, they must determine when humans need to intervene.

Risk managers should verify that their organizations and carrier partners have proper human-in-the-loop controls in place so humans can verify accuracy, especially as automation expands to complex use cases. At a minimum, organizations using AI to automate multi-step processes like underwriting and claims should hold those automated decisions to the same criteria that human underwriters and adjusters apply.
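One common control is an explicit routing gate: the model's recommendation executes automatically only within narrow, pre-agreed bounds, and everything else escalates to a person. A minimal sketch with hypothetical thresholds:

```python
from dataclasses import dataclass

# Hypothetical human-in-the-loop gate for an automated claims step: the model's
# recommendation only executes automatically inside narrow, pre-agreed bounds.
AUTO_APPROVE_LIMIT = 25_000.0  # assumption: payouts above this need a human
CONFIDENCE_FLOOR = 0.95

@dataclass
class Recommendation:
    claim_id: str
    payout: float
    model_confidence: float

def route(rec: Recommendation) -> str:
    if rec.model_confidence < CONFIDENCE_FLOOR:
        return "human_review: low model confidence"
    if rec.payout > AUTO_APPROVE_LIMIT:
        return "human_review: payout exceeds auto-approval limit"
    return "auto_approve"

print(route(Recommendation("CLM-42", payout=8_200.0, model_confidence=0.97)))
print(route(Recommendation("CLM-43", payout=90_000.0, model_confidence=0.99)))
```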


While other parts of the business may focus on algorithms or hot topics in the space, risk managers must remain grounded in what matters most with AI: ensuring that the data used to train models and generate insights is accurate. Those who invest time in understanding where carrier or organization data comes from, how it is cleaned and how it is used will reduce their AI-related risks, build trust and create a faster path to ROI.
Kevin Gaut is chief technology officer at INSTANDA.