Adverse event (AE) detection is vital for pharmacovigilance and regulatory compliance. To address this need, Infinitus developed SAGE, a robust NLP system that detects AEs using a human-in-the-loop approach. 

While LLMs are exceptional at identifying potential events, their broad contextual awareness often introduces noise, misinterpreting unrelated phrases as safety concerns. To solve this, we leverage concept activation vectors (CAVs) to verify that the evidence emitted by an LLM judge aligns with a learned, directional representation of the concept itself.

To visualize how we achieve this, consider the task of surfacing safety-relevant spans from long, unstructured call transcripts.

  • The LLM is a broad candidate generator of these spans. It uses contextual cues to surface anything that might be relevant. This gives us high recall, but also produces false positives because contextual resemblance isn’t the same as correctness.
  • The CAV acts as an invariant check. Instead of re-analyzing the input, it applies a fixed, narrow test to determine whether the defining signal of the concept is actually present. CAV answers “Is the core signal actually there?” and effectively behaves like a gate with a fast, deterministic validation.

Our modular system separates search from identification, a design we explore in this post. Our results demonstrate the feasibility of a system that effectively balances wide-ranging discovery with pinpoint precision.

The precision problem for AE detection

Identifying potential adverse events (PAEs) on healthcare phone calls is difficult because they are infrequent; however, every mention of such an event must be flagged, no matter how rare. LLMs can detect PAEs with sufficiently high recall, but speech-to-text (STT) errors in these calls lead to false positives, so we rely on human review of detected adverse events to mitigate the resulting noise.

This approach to false positive handling has its limitations. An uncontrolled false positive rate increases the review load on humans in the loop: reviewers are forced to spend valuable time and cognitive resources sifting through and evaluating spurious reports. The precision failures of these systems are, in effect, operational failures.

Even in systems with low false positive rates, a critical problem emerges when the recent history of alerts is dominated by false alarms: reviewer fatigue and mistrust set in. If a reviewer sees only false alarms for, say, 30 days, conditioning makes them likely to dismiss or delay action on the first true anomaly.

Therefore, maintaining both high precision and high recall at low operational cost is critical.

LLM as a judge

LLM judges are appealing because they are excellent at not missing things. However, they require frequent human calibration to ensure accuracy; the prompt instructions can be too vague, or the model might misinterpret the context, leading to over-generalization. For instance, a phrase like “I am calling with respect to the burial estate plan” could be wrongly interpreted as indicating the patient is deceased, when it may simply be a mistranscription of the patient’s insurance plan.

These judges allow us to extract evidence for decisions, but their explanations must otherwise be taken at face value. When misclassifications occur, we can use this evidence to adjust the judge, but those adjustments are point-in-time evaluations: they need to be revisited continuously for new phrasings and new customers, and to account for the evolution of the model itself.

Concept activation vectors

Our goal, therefore, is to ascertain that the evidence emitted by the LLM judge aligns with a concept. For instance, for termination AEs, we would like to see evidence that the patient’s passing was discussed on the healthcare phone call.

We could construct a centroid of embeddings of key phrases that represent this concept and compute cosine similarity. While intuitive, computationally convenient, and effective for many semantic tasks, similarity methods like these implicitly assume that proximity in embedding space corresponds to conceptual equivalence. 

However, longer inputs, added context, or unrelated content can obfuscate centroid similarity even when the core evidence remains unchanged. In the example “The patient passed away; their plan information shows a plan effective date of 20th January 2026,” the added date and plan information dilutes the embedding, moving it away from the “patient death” centroid even though the core fact hasn’t changed.
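The dilution effect can be seen with a toy sketch. The four-dimensional vectors below are hypothetical stand-ins for real embeddings, with dimension 0 carrying the “patient death” signal and the remaining dimensions carrying unrelated plan/date content; cosine similarity to the centroid drops when context is added, while a projection onto the concept dimension does not move.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embedding space": dim 0 carries the death concept,
# dims 1-3 carry unrelated plan/date content.
centroid = np.array([1.0, 0.0, 0.0, 0.0])        # centroid of "patient death" phrases
core = np.array([0.9, 0.1, 0.0, 0.0])            # "The patient passed away"
diluted = np.array([0.9, 0.1, 0.8, 0.7])         # same fact + plan/date details

sim_core = cosine(core, centroid)                # high, ~0.99
sim_diluted = cosine(diluted, centroid)          # drops substantially, ~0.64

# A direction that weights only the concept dimension is unaffected
# by the added context: both inputs project to the same score.
cav = np.array([1.0, 0.0, 0.0, 0.0])
proj_core = float(core @ cav)
proj_diluted = float(diluted @ cav)
```

The similarity score penalizes the extra content even though the core fact is identical, which is precisely the failure mode the directional approach avoids.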

CAVs provide a structured way to detect the presence of a specific concept within a model’s embedding space via learned linear directions. Rather than asking whether an input is similar to known examples, CAVs ask whether the input is aligned with the concept itself.

By operating on direction rather than distance, CAVs function as concept alignment checks, not similarity scores. In essence, we ask: does the input activate the direction in representation space that corresponds to the concept? This alignment is substantially more robust to wording, length, and surrounding context.

Training setup

Because CAVs do not attempt to model the full decision boundary of a task, a carefully curated contrastive example set is sufficient. This makes CAVs particularly suitable for domains where labeling is expensive, sensitive, or requires expert judgment, and it allows rapid iteration on concept definitions in production safety-critical settings with minimal labeling cost and computational overhead.

Our high level approach is to:

  • curate a contrastive examples dataset for termination AEs 
  • embed the input using a fixed representation model and learn a concept vector
  • project the embedding onto the concept direction
  • compare the resulting scalar score against an explicitly tuned threshold

We hand-labeled approximately 100 evidence spans of patient termination being discussed on phone calls. We also enriched this dataset with key phrases signifying patient termination. For the negative dataset, we used random calls with no adverse events, along with hard negatives: spans flagged by the stage 1 LLM judges that human experts labeled as false positives.

The same embedding model is used throughout training and inference. For this experiment, we take the final layer activations as the embeddings. We then train a linear model (e.g. logistic regression, linear SVC) to separate positive vs negative activations.

The resulting weight vector of the model, normalized, is our concept activation vector. This vector gives us the direction of maximal contrast between the two classes. The linear model learns to ignore the dimensions of the embedding space that represent “noise” (e.g. dates or insurance plan jargon) and assigns weight only to the dimensions that signify “termination.”
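A minimal sketch of this training step, using synthetic embeddings in which dimension 0 encodes the concept and the rest are noise. The plain-numpy gradient-descent loop below stands in for an off-the-shelf linear classifier such as sklearn’s LogisticRegression; the data, dimensions, and learning rate are illustrative assumptions, not our production values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings: dim 0 encodes "termination",
# the remaining dims are noise (dates, plan jargon, etc.).
dim, n = 16, 200
X_pos = rng.normal(0.0, 0.5, (n, dim)); X_pos[:, 0] += 2.0   # concept present
X_neg = rng.normal(0.0, 0.5, (n, dim)); X_neg[:, 0] -= 2.0   # concept absent
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Plain-numpy logistic regression (stand-in for sklearn's LogisticRegression):
# gradient descent on the log-loss.
w = np.zeros(dim)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)      # gradient step

# The normalized weight vector is the concept activation vector.
cav = w / np.linalg.norm(w)
```

Because the noise dimensions are uncorrelated with the labels, almost all of the vector’s mass ends up on the concept dimension, which is exactly the “direction of maximal contrast” described above.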

The validation split is used to tune the projection scoring threshold via the precision–recall curve.  At inference time, we embed the utterance and project it onto this learned CAV. The magnitude of this score reflects the strength of conceptual alignment, enabling graded analysis and post-hoc inspection. 

The training process is budget-friendly: its main dependency is the choice of embedding model, and it operates externally to the current system. Learning a linear vector makes training computationally inexpensive, and easily tunable when we want to further maximize recall or further maximize precision.

Results

The learned CAV acts as an effective second-stage semantic filter, reducing stage 1 false positives by 95.65% while maintaining recall. The very few missed true PAEs are cases where patient death is discussed using terminology absent from the dataset (such as “involuntary disenrollment”). As observed in the CAV score distribution chart below, projection onto a learned concept direction yields a clear separation that enables a single threshold to substantially reduce false positives while preserving recall, confirming that the projection essentially zeroes out features that don’t align with the concept.

Using a 200-iteration permutation test, we compared the real CAV to random CAVs learned with label permutation. None of the permuted CAVs matched or exceeded the observed separation, yielding an empirical p-value of p < 0.005. The real CAV achieved a Cohen’s d of 5.08, an effect size 12.4 times stronger than that of the average permuted CAV. This extreme separation indicates that the learned direction is not an artifact of high-dimensional geometry, but a stable semantic axis that generalizes beyond the training set.
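The permutation test amounts to shuffling the labels, refitting, and checking whether chance directions ever separate the classes as well as the real one. The sketch below uses synthetic projection scores in place of real CAV projections and compares effect sizes directly; the distributions and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size between two score distributions (pooled std)."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return float((a.mean() - b.mean()) / pooled)

# Synthetic projection scores along the learned CAV.
pos = rng.normal(2.0, 0.5, 100)
neg = rng.normal(-2.0, 0.5, 100)
observed = cohens_d(pos, neg)

# Permutation test: shuffle the pooled scores (equivalent to permuting
# labels) and recompute the effect size each iteration.
scores = np.concatenate([pos, neg])
n_pos = len(pos)
perm_ds = np.array([
    cohens_d(perm[:n_pos], perm[n_pos:])
    for perm in (rng.permutation(scores) for _ in range(200))
])

# Empirical p-value with the standard +1 correction.
p_value = (np.sum(perm_ds >= observed) + 1) / (200 + 1)
```

With clean separation, no permuted split matches the observed effect size, so the empirical p-value bottoms out at 1/201 ≈ 0.005, matching the bound reported above.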

Production implications

By using CAVs as a Stage-2 filter, we achieve:

  • Reduced reviewer fatigue: Human experts spend time on substantive evaluations, not dismissing true alarms due to false positive fatigue.
  • Auditability: Unlike black-box LLM decisions, a CAV score is a measurable scalar. We can point to exactly how much an utterance aligns with a concept. This layer can also be validated in isolation.
  • Modular growth: We can add new safety concepts (e.g., drug interaction) by simply learning a new vector, without retraining our entire core model. The thresholds for each of these concepts can be tuned independently.
  • Separation of concerns: From a machine learning systems perspective, each part of the pipeline has a clear responsibility. The first stage is optimized for high-recall evidence extraction and can tolerate over-triggering. The second stage with concept vectors is optimized with precision for narrow concept definitions.

Conclusion

The method builds on interpretability research on concept vectors, using directional representations for semantic control. CAVs are lightweight linear directions that are cheap to compute, easy to audit, and simple to deploy.

Using CAVs as a simple, interpretable Stage-2 filter reliably reduces over-flagging from high-recall LLM judges, materially lowering human review load while maintaining safety-critical recall, and making the approach scalable in deployment.

A major advantage is that adding new safety concepts requires neither complex architectural changes nor large amounts of labeled data. Combining multiple concept vectors enhances safety guardrail robustness without altering the underlying ML system.