Sanitize, filter, and classify data that the Knowledge Layer ingests from internal and external knowledge bases before it is embedded into the vector store or used for retrieval-augmented generation. This prevents inadvertent exposure or manipulation of sensitive organizational knowledge.
Data Filtering From External Knowledge Bases
CCC.MARefArc.CN01 · PREV
Related Capabilities
| ID | Title | Description |
|---|---|---|
| CCC.MARefArc.CP14 | Approved-model registry and lifecycle | Catalog of approved models with metadata, version information, configuration parameters, and usage constraints, ensuring agents access only models meeting organizational, regulatory, and security standards. |
| CCC.MARefArc.CP11 | Adaptive learning | Generates learning signals based on execution outcomes to refine prompts, adjust agent configurations, or improve tool-selection strategies. |
| CCC.MARefArc.CP16 | Model-interaction zero-trust guardrails | Enforces authentication and authorization for every inference request and applies input validation against prompt injection, output filtering and redaction, access control, rate limits, and cost management before and after model execution. |
Related Threats
| ID | Title | Description |
|---|---|---|
| CCC.MARefArc.TH06 | Foundation-model training and fine-tuning data poisoning | Adversaries tamper with training, fine-tuning, or third-party data feeds behind the approved models, mislabeling data or embedding backdoor triggers and biases that corrupt downstream decisions without visible symptoms until a major failure. |
| CCC.MARefArc.TH07 | Adaptive-learning and continuous-learning exploitation | The adaptive-learning capability that refines prompts and configurations from execution outcomes can be steered by an adversary who systematically feeds misleading signals, gradually skewing agent behaviour when validation of learning inputs is inadequate. |
| CCC.MARefArc.TH01 | Model memorization leaks sensitive data across sessions | The hosted models accessed through the LLM layer may memorize sensitive inputs or training data and later disclose customer PII, proprietary algorithms, or trading strategies, including cross-user leakage into unrelated sessions. |
| CCC.MARefArc.TH02 | Hosted-provider data-handling exposure | Sensitive data submitted through the LLM gateway to third-party hosted models is exposed when the provider lacks transparent encryption, retention limits, or secure-deletion guarantees, leaving the institution without control over data it no longer holds. |
Assessment Requirements
| ID | Text | Applicability |
|---|---|---|
| CCC.MARefArc.CN01.AR01 | Data ingested into the Knowledge Layer MUST be scanned and filtered for sensitive content before it is embedded or indexed for retrieval. | tlp-clear, tlp-green, tlp-amber, tlp-red |
| CCC.MARefArc.CN01.AR02 | Ingestion pipelines MUST enforce source-level allow and deny rules so that unapproved repositories cannot be embedded into the vector store. | tlp-clear, tlp-green, tlp-amber, tlp-red |
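
The two requirements above can be illustrated with a minimal Python sketch of an ingestion gate. It is not a reference implementation: the host names, denied path prefixes, regular expressions, and the names `source_is_approved`, `redact_sensitive`, `prepare_for_embedding`, and `IngestDecision` are all hypothetical and chosen only for illustration. The sketch checks the source against allow and deny rules first (AR02), then scans and redacts sensitive content before anything is handed to an embedding step (AR01). A production pipeline would likely rely on a dedicated classification or DLP service rather than two regular expressions.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical source policy (assumption): approved hosts and denied path prefixes.
ALLOWED_HOSTS = {"wiki.internal.example", "docs.example.com"}
DENIED_PATH_PREFIXES = ("/hr/", "/legal/privileged/")

# Illustrative detectors only (assumption); real coverage would be much broader.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class IngestDecision:
    accepted: bool          # False means the document must not be embedded
    reason: str             # short explanation for audit logs
    text: str = ""          # redacted text that may be embedded
    findings: tuple = ()    # labels of the detectors that matched


def source_is_approved(source_url: str) -> bool:
    """AR02 sketch: deny by default, allow only listed hosts, then apply deny rules."""
    parsed = urlparse(source_url)
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    return not parsed.path.startswith(DENIED_PATH_PREFIXES)


def redact_sensitive(text: str) -> tuple:
    """AR01 sketch: replace each match with a labelled placeholder."""
    findings = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            findings.append(label)
    return text, tuple(findings)


def prepare_for_embedding(source_url: str, text: str) -> IngestDecision:
    """Run the source check before the content scan, so rejected sources are never read further."""
    if not source_is_approved(source_url):
        return IngestDecision(False, "source not approved")
    clean, findings = redact_sensitive(text)
    return IngestDecision(True, "ok", clean, findings)


decision = prepare_for_embedding(
    "https://wiki.internal.example/eng/runbook",
    "Contact jane.doe@example.com, SSN 123-45-6789.",
)
print(decision.text)
```

In this sketch the source check is deny-by-default: a repository that is not explicitly listed is rejected before its content is scanned, which keeps unapproved material out of the vector store even when the content scan would have found nothing.
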
Guideline Mappings
| Framework | ID | Remarks |
|---|---|---|
| finos-air | AIR-PREV-002 | |