Data Provenance and Source Vetting

CCC.GenAI.CN03 · MachineLearning

Ensure that all data used for training, fine-tuning, or retrieval-augmented generation (RAG) comes from trusted, approved sources and is authorised for its intended purpose, so that malicious content and leaked sensitive data are not introduced in the first place.

Related Capabilities

ID	Title	Description
CCC.Core.CP02	Encryption at Rest Enabled by Default	The service automatically encrypts all data using industry-standard cryptographic protocols prior to being written to a storage medium.
CCC.Core.CP06	Access Control	The service automatically enforces user configurations to restrict or allow access to a specific component or a child resource based on factors such as user identities, roles, groups, or attributes.
CCC.GenAI.CP03	Embedding Model Selection	Ability to select a foundation model used for tasks like semantic search, clustering, and document similarity by converting text into vector embeddings.
CCC.GenAI.CP06	Customizable Model Selection	Provide users the ability to fine-tune models with their own data.
CCC.GenAI.CP21	Generate Content	Ability to generate a response given a foundation model, parameter values, and a prompt.
CCC.GenAI.CP22	Data Control	Ensures prompts, model outputs, embeddings, and training data fed by customers are not used to train foundation models.
CCC.GenAI.CP24	Content Moderation	The service detects and filters abusive, harmful, and sensitive information to support responsible and safe use of the service.

Related Threats

ID	Title	Description
CCC.GenAI.TH02	Data Poisoning	Data poisoning occurs when training, fine-tuning or embedding data is tampered with in order to modify the model's behaviour, for example steering it towards specific outputs, degrading performance or introducing backdoors.
CCC.GenAI.TH03	Sensitive Information Disclosure	A model can memorise sensitive data from user interactions or training data and later leak it to unintended, unauthorised parties who query the model, for example through crafted prompts.

Assessment Requirements

ID	Text	Applicability
CCC.GenAI.CN03.AR01	When data is designated for model training or RAG ingestion, then its source MUST be explicitly approved and its provenance documented.	tlp-clear, tlp-green, tlp-amber, tlp-red
CCC.GenAI.CN03.AR02	Data from unvetted sources MUST NOT be used in production systems.	tlp-clear, tlp-green, tlp-amber, tlp-red
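
The two requirements above can be read as a gate in front of any training or RAG ingestion pipeline. The following is a minimal sketch of such a gate, not a prescribed implementation. The names `APPROVED_SOURCES`, `ProvenanceRecord`, `fingerprint`, and `admit_for_ingestion` are hypothetical, as are the example source names. The sketch assumes that source approval is tracked in an allowlist and that each dataset carries a documented provenance record, including a SHA-256 digest taken at approval time.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical allowlist; in practice this would live in a governed registry.
APPROVED_SOURCES = frozenset({"internal-kb", "licensed-news-feed"})


@dataclass(frozen=True)
class ProvenanceRecord:
    """Documented provenance for one dataset (illustrative fields only)."""
    dataset_id: str
    source: str
    approved_by: str
    approved_on: date
    sha256: str  # digest of the content as it was when approved


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset content."""
    return hashlib.sha256(content).hexdigest()


def admit_for_ingestion(
    record: Optional[ProvenanceRecord],
    content: bytes,
    approved_sources: frozenset = APPROVED_SOURCES,
) -> bool:
    """Admit data only if provenance is documented, the source is approved,
    and the content still matches the digest recorded at approval."""
    if record is None:
        return False  # no documented provenance (AR01)
    if record.source not in approved_sources:
        return False  # unvetted source (AR02)
    return record.sha256 == fingerprint(content)


doc = b"Quarterly policy handbook"
record = ProvenanceRecord(
    dataset_id="ds-001",
    source="internal-kb",
    approved_by="data-governance",
    approved_on=date(2024, 1, 15),
    sha256=fingerprint(doc),
)
print(admit_for_ingestion(record, doc))                # True
print(admit_for_ingestion(record, doc + b" (edited)"))  # False
print(admit_for_ingestion(None, doc))                  # False
```

The digest comparison is an addition beyond the literal wording of the requirements: it is one way to notice content that changed after its source was approved, which relates to the data poisoning threat listed above.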

Guideline Mappings

Framework	ID	Remarks
FINOS-AIGF	AIR-PREV-006	Data Quality & Classification/Sensitivity
SAIF	Training Data Management
MITRE-ATLAS	AML.M0025	Maintain AI Dataset Provenance