“Clean” two birds with one taxonomy

June 24, 2026

Nicolas Delporte

Ensuring data fitness and quality in one go.

Cleaning, curating, wrangling, reshaping, scrubbing, pre-processing, harmonizing… Everyone does it, no one enjoys it.

Notice what every one of those words leaves out: what the data is for.

In real-world evidence, the fit-for-purpose literature defines quality by use — the test is whether the data let you estimate the causal effect you are after, not whether the columns are tidy (Gatto et al., 2023).

The same dataset can be flawless for one question and useless for the next. A variable you must adjust for in one analysis is one you must leave untouched in another: adjusting for a confounder removes bias, but adjusting for a mediator removes part or all of the very effect you set out to measure. Same flaw, opposite response. Only the question can tell them apart.
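A small simulation can make the confounder/mediator contrast concrete. The sketch below is an illustration only, under assumed linear models with made-up coefficients; the helper names (`cov`, `slope`, `slope_adjusted`) and both scenarios are hypothetical and not taken from any Augura component. In both scenarios the true effect of x on y is 1.0, and the only thing that changes is the role played by z.

```python
import random

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(y, x):
    """Coefficient of x in the least-squares fit y ~ x."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(y, x, z):
    """Coefficient of x in the least-squares fit y ~ x + z."""
    num = cov(x, y) * cov(z, z) - cov(z, y) * cov(x, z)
    den = cov(x, x) * cov(z, z) - cov(x, z) ** 2
    return num / den

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Scenario 1 (assumed): z is a confounder. z -> x, z -> y, and x -> y with effect 1.0.
z1 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [z + rng.gauss(0, 1) for z in z1]
y1 = [1.0 * x + 2.0 * z + rng.gauss(0, 1) for x, z in zip(x1, z1)]

# Scenario 2 (assumed): z is a mediator. x -> z -> y, total effect 1.0 * 1.0 = 1.0.
x2 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [x + rng.gauss(0, 1) for x in x2]
y2 = [1.0 * z + rng.gauss(0, 1) for z in z2]

confounder_raw = slope(y1, x1)               # biased upward, about 2.0 in expectation
confounder_adj = slope_adjusted(y1, x1, z1)  # about 1.0: adjustment removes the bias
mediator_raw = slope(y2, x2)                 # about 1.0: the total effect
mediator_adj = slope_adjusted(y2, x2, z2)    # about 0.0: adjustment blocks the pathway

print(confounder_raw, confounder_adj, mediator_raw, mediator_adj)
```

In this toy setup the effect is fully mediated, so adjusting for z drives the estimate toward zero; if x also had a direct path to y, adjusting for the mediator would leave only that direct part.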

The Augura semantic layer was built to bring data quality and fitness for purpose together. At its core is a governed taxonomy and ontology: every data element is mapped to a concept with a stable meaning, including its synonyms and standard codes, expected type, canonical unit, plausible range, and temporal characteristics. A column named HbA1c stops being an arbitrary field and becomes a known measurement with well-defined expectations.
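To picture what a concept entry of this kind might hold, here is a minimal sketch. It is not Augura's actual schema: the `Concept` class, its field names, the `resolve` helper, the placeholder code system, and the range values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

@dataclass(frozen=True)
class Concept:
    """Hypothetical taxonomy entry: what a data element *is*."""
    name: str
    synonyms: frozenset
    codes: Mapping[str, str]          # code system -> code (placeholder values here)
    dtype: type                       # expected Python type of a value
    unit: Optional[str] = None        # canonical unit
    plausible_range: Optional[Tuple[float, float]] = None

# Illustrative entry only; the code and the range are placeholders, not clinical guidance.
HBA1C = Concept(
    name="hba1c",
    synonyms=frozenset({"a1c", "glycated hemoglobin", "hemoglobin a1c"}),
    codes={"example_system": "X-0001"},
    dtype=float,
    unit="%",
    plausible_range=(3.0, 20.0),
)

def resolve(column_name, registry):
    """Map a raw column name to a known concept, or None if nothing matches."""
    key = column_name.strip().lower().replace("_", " ")
    for concept in registry:
        if key == concept.name or key in concept.synonyms:
            return concept
    return None

registry = [HBA1C]
print(resolve("HbA1c", registry).unit)      # canonical unit of the matched concept
print(resolve("Glycated_Hemoglobin", registry) is HBA1C)
print(resolve("zip_code", registry))        # unmapped columns resolve to None
```

A real mapping layer would likely need fuzzier matching and human review of ambiguous columns; exact synonym lookup is used here only to keep the sketch short.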

From this single foundation emerge two complementary capabilities.

First, causal modeling. A causal question is resolved against the same concepts, allowing Augura to assemble a defensible DAG for that specific question and classify each variable as an exposure, confounder, mediator, collider, or effect modifier. Those roles are not intrinsic to the concepts themselves; the taxonomy defines what a variable is, while the causal question determines what role it plays.

The taxonomy defines what HbA1c is. The causal question defines what role it plays.
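The idea that roles come from the question rather than the concept can be sketched with a toy graph. The code below is a simplified illustration, not Augura's method: the function names, the node names, and both example graphs are hypothetical, and the structural rules used here (a mediator lies on a directed path from exposure to outcome, a confounder is a common cause that reaches the outcome without passing through the exposure, a collider is a common effect) are deliberately coarse. Effect modification cannot be read off graph structure alone, so it is left out.

```python
def descendants(graph, node, skip=None):
    """All nodes reachable from `node`, never passing through `skip`."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child != skip and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def classify_roles(graph, exposure, outcome):
    """Assign a coarse structural role to every node for one causal question."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    from_exposure = descendants(graph, exposure)
    from_outcome = descendants(graph, outcome)
    roles = {}
    for node in nodes:
        below = descendants(graph, node)
        if node == exposure:
            roles[node] = "exposure"
        elif node == outcome:
            roles[node] = "outcome"
        elif node in from_exposure and outcome in below:
            roles[node] = "mediator"
        elif exposure in below and outcome in descendants(graph, node, skip=exposure):
            roles[node] = "confounder"
        elif node in from_exposure and node in from_outcome:
            roles[node] = "collider"
        else:
            roles[node] = "other"
    return roles

# Question A (hypothetical): does a drug reduce cardiovascular events?
graph_a = {
    "age": {"drug", "cv_event"},
    "drug": {"hba1c", "admission"},
    "hba1c": {"cv_event"},
    "cv_event": {"admission"},
}
# Question B (hypothetical): does a coaching app reduce cardiovascular events?
graph_b = {
    "hba1c": {"app", "cv_event"},
    "app": {"cv_event"},
}

roles_a = classify_roles(graph_a, "drug", "cv_event")
roles_b = classify_roles(graph_b, "app", "cv_event")
print(roles_a["hba1c"], roles_b["hba1c"])  # mediator confounder
```

The same concept, HbA1c, comes out as a mediator in the first question and a confounder in the second, which is the distinction the text draws.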

Second, automated data quality. Because every concept already carries its semantic expectations, validation rules are generated automatically. Type, range, unit, temporal consistency, and coded-value checks produce traceable findings rather than a simple pass/fail report.
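As a rough sketch of how checks might be derived from a concept's expectations, the example below turns one expectation record into type, unit, and range checks that emit findings rather than a single pass/fail flag. Everything here is assumed for illustration: the `Expectation` and `Finding` classes, the `validate` function, the record layout, and the range values are hypothetical and do not describe Augura's implementation. Temporal and coded-value checks are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expectation:
    """Hypothetical semantic expectations attached to a concept."""
    concept: str
    dtype: type
    unit: str
    low: float
    high: float

@dataclass(frozen=True)
class Finding:
    """One traceable finding: where it occurred, which check, and why."""
    row: int
    concept: str
    check: str
    value: object
    detail: str

def validate(records, exp):
    """Run type, unit, and range checks; return one finding per failing row."""
    findings = []
    for i, rec in enumerate(records):
        value, unit = rec.get("value"), rec.get("unit")
        if not isinstance(value, exp.dtype):
            findings.append(Finding(i, exp.concept, "type", value,
                                    f"expected {exp.dtype.__name__}"))
        elif unit != exp.unit:
            findings.append(Finding(i, exp.concept, "unit", value,
                                    f"expected unit {exp.unit!r}, got {unit!r}"))
        elif not exp.low <= value <= exp.high:
            findings.append(Finding(i, exp.concept, "range", value,
                                    f"outside [{exp.low}, {exp.high}]"))
    return findings

# Illustrative expectation and records; the range is a placeholder, not clinical guidance.
hba1c = Expectation(concept="hba1c", dtype=float, unit="%", low=3.0, high=20.0)
records = [
    {"value": 6.8, "unit": "%"},          # passes all checks
    {"value": "7.1", "unit": "%"},        # string instead of float
    {"value": 53.0, "unit": "mmol/mol"},  # non-canonical unit
    {"value": 42.0, "unit": "%"},         # implausible value
]

findings = validate(records, hba1c)
for f in findings:
    print(f.row, f.check, f.detail)
```

Each finding keeps the row, the check that fired, and the offending value, so a reviewer can trace a decision back to the evidence instead of re-deriving it.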

This automation filters out the vast majority of routine data issues, allowing experts to focus on the few decisions that truly require human judgment. It does not eliminate the difficult cases—no general algorithm can determine recoverability in every situation (Holovchak et al., 2025)—but it automates the routine, documents the evidence, and brings a human into the loop only when it matters.

One governed vocabulary, two capabilities: causal modeling and data quality become two views of the same semantic foundation. We believe this combination—a state-of-the-art taxonomy and ontology driving both the causal graph and automated data quality—is the most logical foundation for unlocking the full potential of real-world data.

The routine work is automated; the critical decisions remain human.

About Augura

We built our platform around exactly this challenge: helping digital health companies and health systems design, execute, and communicate the causal evidence that turns a promising AI product into a credible, defensible clinical intervention.
