Data Filtering From External Knowledge Bases

CCC.MARefArc.CN01 · PREV

Sanitize, filter, and classify data that the Knowledge Layer ingests from internal and external knowledge bases before it is embedded into the vector store or used for retrieval-augmented generation. This prevents inadvertent exposure or manipulation of sensitive organizational knowledge.
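
As a rough illustration of the sanitize-and-classify step, the sketch below redacts matches from a text chunk and attaches a coarse label before the chunk would be handed to an embedding job. Everything named here is an assumption made for illustration: the `PATTERNS` detectors, the `ScanResult` type, the `scan_chunk` function, and the `clean` / `sensitive` labels are hypothetical and are not defined by this control. Two regular expressions are nowhere near adequate detection; a real pipeline would likely rely on a vetted DLP or PII-detection service.

```python
import re
from dataclasses import dataclass, field

# Hypothetical detectors, for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class ScanResult:
    text: str                                     # redacted text
    findings: dict = field(default_factory=dict)  # detector name -> match count
    label: str = "clean"                          # hypothetical classification label


def scan_chunk(text: str) -> ScanResult:
    """Redact every detector match and classify the chunk by whether any fired."""
    findings = {}
    for name, pattern in PATTERNS.items():
        # subn returns the rewritten text and the number of substitutions made.
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            findings[name] = count
    label = "sensitive" if findings else "clean"
    return ScanResult(text=text, findings=findings, label=label)


if __name__ == "__main__":
    result = scan_chunk("Contact jane.doe@example.com, SSN 123-45-6789.")
    print(result.text)
    print(result.label, result.findings)
```

In this sketch the redacted text and the label travel together, so a downstream step could choose to embed the redacted chunk, quarantine it, or drop it based on the label.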

Related Capabilities

ID	Title	Description
CCC.MARefArc.CP14	Approved-model registry and lifecycle	Catalog of approved models with metadata, version information, configuration parameters, and usage constraints, ensuring agents access only models meeting organizational, regulatory, and security standards.
CCC.MARefArc.CP11	Adaptive learning	Generates learning signals based on execution outcomes to refine prompts, adjust agent configurations, or improve tool-selection strategies.
CCC.MARefArc.CP16	Model-interaction zero-trust guardrails	Enforces authentication and authorization for every inference request and applies input validation against prompt injection, output filtering and redaction, access control, rate limits, and cost management before and after model execution.

Related Threats

ID	Title	Description
CCC.MARefArc.TH06	Foundation-model training and fine-tuning data poisoning	Adversaries tamper with training, fine-tuning, or third-party data feeds behind the approved models, mislabeling data or embedding backdoor triggers and biases that corrupt downstream decisions without visible symptoms until a major failure.
CCC.MARefArc.TH07	Adaptive-learning and continuous-learning exploitation	The adaptive-learning capability that refines prompts and configurations from execution outcomes can be steered by an adversary who systematically feeds misleading signals, gradually skewing agent behaviour when validation of learning inputs is inadequate.
CCC.MARefArc.TH01	Model memorization leaks sensitive data across sessions	The hosted models accessed through the LLM layer may memorize sensitive inputs or training data and later disclose customer PII, proprietary algorithms, or trading strategies, including cross-user leakage into unrelated sessions.
CCC.MARefArc.TH02	Hosted-provider data-handling exposure	Sensitive data submitted through the LLM gateway to third-party hosted models is exposed when the provider lacks transparent encryption, retention limits, or secure-deletion guarantees, leaving the institution without control over data it no longer holds.

Assessment Requirements

ID	Text	Applicability
CCC.MARefArc.CN01.AR01	Data ingested into the Knowledge Layer MUST be scanned and filtered for sensitive content before it is embedded or indexed for retrieval.	tlp-clear, tlp-green, tlp-amber, tlp-red
CCC.MARefArc.CN01.AR02	Ingestion pipelines MUST enforce source-level allow and deny rules so that unapproved repositories cannot be embedded into the vector store.	tlp-clear, tlp-green, tlp-amber, tlp-red
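
The source-level allow and deny rules in CCC.MARefArc.CN01.AR02 could be enforced with a default-deny check at the start of the ingestion pipeline. The sketch below is one possible shape, not a prescribed implementation: the host names, the `ALLOWED_HOSTS` and `DENIED_PATHS` rule sets, and the `is_source_approved` and `partition_sources` functions are all hypothetical. It matches on parsed URL components rather than raw strings so that host-name casing cannot bypass a deny rule; path normalization (for example `..` segments) is deliberately left out of scope.

```python
from urllib.parse import urlsplit

# Hypothetical rule sets; a real pipeline would load these from governed config.
ALLOWED_HOSTS = {"wiki.example.internal", "docs.example.com"}
DENIED_PATHS = {("docs.example.com", "/drafts/")}  # (host, path prefix)


def is_source_approved(source_url: str) -> bool:
    """Default deny: a source needs an allowed host and no matching deny rule."""
    parts = urlsplit(source_url)
    host = parts.hostname  # lowercased by urlsplit; None if no host is present
    if host is None or host not in ALLOWED_HOSTS:
        return False
    # Deny rules take precedence over the host allow list.
    return not any(
        host == denied_host and parts.path.startswith(prefix)
        for denied_host, prefix in DENIED_PATHS
    )


def partition_sources(urls):
    """Split candidate sources into (approved, rejected) lists for auditing."""
    approved, rejected = [], []
    for url in urls:
        (approved if is_source_approved(url) else rejected).append(url)
    return approved, rejected


if __name__ == "__main__":
    ok, blocked = partition_sources([
        "https://wiki.example.internal/runbooks/restart",
        "https://docs.example.com/drafts/unreviewed",
        "https://unknown.example.org/page",
    ])
    print("embed:", ok)
    print("skip:", blocked)
```

Keeping the rejected list, rather than silently dropping those sources, gives an assessor evidence that unapproved repositories were seen and blocked.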

Guideline Mappings

Framework	ID	Remarks
finos-air	AIR-PREV-002