Update LLM04_DataModelPoisoning.md
Signed-off-by: DistributedApps.AI <kenhuangus@users.noreply.github.com>
kenhuangus authored Dec 9, 2024
1 parent e1a45eb commit 7b14a03
Showing 1 changed file with 68 additions and 66 deletions.
134 changes: 68 additions & 66 deletions 2_0_vulns/translations/zh-CN/LLM04_DataModelPoisoning.md
@@ -1,66 +1,68 @@
## LLM04: Data and Model Poisoning

### Description

Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages helps identify where vulnerabilities may originate. Data poisoning is considered an integrity attack since tampering with training data impacts the model's ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.

Moreover, models distributed through shared repositories or open-source platforms can carry risks beyond data poisoning, such as malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Also, consider that poisoning may allow for the implementation of a backdoor. Such backdoors may leave the model's behavior untouched until a certain trigger causes it to change. This may make such changes hard to test for and detect, in effect creating the opportunity for a model to become a sleeper agent.
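
The pickling risk described above can be checked before a file is ever loaded. The following is a minimal sketch, assuming the model artifact is a plain pickle byte stream; the function name `suspicious_opcodes`, the opcode set, and the `Payload` class are illustrative choices, not part of any referenced tool. It walks the opcode stream with the standard library's `pickletools.genops` and reports opcodes that can import or call objects during unpickling.

```python
import pickle
import pickletools

# Opcodes that let a pickle import names or call objects while being loaded.
# This set is an assumption made for illustration, not an exhaustive list.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def suspicious_opcodes(data: bytes) -> set[str]:
    """Return risky opcode names found in a pickle stream without loading it."""
    found = set()
    for opcode, _arg, _pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            found.add(opcode.name)
    return found


class Payload:
    """Harmless stand-in for a malicious object; it would call print on load."""

    def __reduce__(self):
        return (print, ("code executed during unpickling",))


# Plain containers and numbers need no imports or calls to reconstruct.
benign = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
# An object with __reduce__ serializes a callable plus its arguments.
malicious = pickle.dumps(Payload())

print(suspicious_opcodes(benign))     # empty set
print(suspicious_opcodes(malicious))  # includes "REDUCE"
```

Legitimate model checkpoints often use these opcodes too, so a scan like this is a triage step that flags files for closer review rather than a verdict on its own.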

### Common Examples of Vulnerability

1. Malicious actors introduce harmful data during training, leading to biased outputs. Techniques like "Split-View Data Poisoning" or "Frontrunning Poisoning" exploit model training dynamics to achieve this.
(Ref. link: [Split-View Data Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%201%20Split-View%20Data%20Poisoning.jpeg))
(Ref. link: [Frontrunning Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%202%20Frontrunning%20Data%20Poisoning.jpeg))
2. Attackers can inject harmful content directly into the training process, compromising the model’s output quality.
3. Users unknowingly inject sensitive or proprietary information during interactions, which could be exposed in subsequent outputs.
4. Unverified training data increases the risk of biased or erroneous outputs.
5. Lack of resource access restrictions may allow the ingestion of unsafe data, resulting in biased outputs.

### Prevention and Mitigation Strategies

1. Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM. Verify data legitimacy during all model development stages.
2. Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
3. Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
4. Tailor models for different use cases by using specific datasets for fine-tuning. This helps produce more accurate outputs based on defined goals.
5. Ensure sufficient infrastructure controls to prevent the model from accessing unintended data sources.
6. Use data version control (DVC) to track changes in datasets and detect manipulation. Versioning is crucial for maintaining model integrity.
7. Store user-supplied information in a vector database, allowing adjustments without re-training the entire model.
8. Test model robustness with red team campaigns and adversarial techniques, and consider approaches such as federated learning to minimize the impact of data perturbations.
9. Monitor training loss and analyze model behavior for signs of poisoning. Use thresholds to detect anomalous outputs.
10. During inference, integrate Retrieval-Augmented Generation (RAG) and grounding techniques to reduce risks of hallucinations.
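
The dataset tracking idea in item 6 can be illustrated with a small sketch. This is not DVC itself; it is a standard-library stand-in that assumes the dataset is a directory of files, and the names `build_manifest` and `diff_manifests` are hypothetical. It records a SHA-256 digest per file and compares a trusted manifest against the current state to surface added, removed, or modified files.

```python
import hashlib
from pathlib import Path


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(data_dir).as_posix()] = digest
    return manifest


def diff_manifests(trusted: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report which files were added, removed, or modified since the trusted snapshot."""
    shared = trusted.keys() & current.keys()
    return {
        "added": sorted(current.keys() - trusted.keys()),
        "removed": sorted(trusted.keys() - current.keys()),
        "modified": sorted(name for name in shared if trusted[name] != current[name]),
    }
```

A typical use would be to call `build_manifest` once on a vetted copy of the data, store the result alongside the model's provenance records, and run `diff_manifests` before each training job. Any non-empty entry in the report is a reason to stop and investigate.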

### Example Attack Scenarios

#### Scenario #1
An attacker biases the model's outputs by manipulating training data or using prompt injection techniques, spreading misinformation.
#### Scenario #2
Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.
#### Scenario #3
A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.
#### Scenario #4
Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.
#### Scenario #5
An attacker uses poisoning techniques to insert a backdoor trigger into the model. This could leave you open to authentication bypass, data exfiltration or hidden command execution.

### Reference Links

1. [How data poisoning attacks corrupt machine learning models](https://www.csoonline.com/article/3613932/how-data-poisoning-attacks-corrupt-machine-learning-models.html): **CSO Online**
2. [MITRE ATLAS (framework) Tay Poisoning](https://atlas.mitre.org/studies/AML.CS0009/): **MITRE ATLAS**
3. [PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news](https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/): **Mithril Security**
4. [Poisoning Language Models During Instruction](https://arxiv.org/abs/2305.00944): **Arxiv White Paper 2305.00944**
5. [Poisoning Web-Scale Training Datasets - Nicholas Carlini | Stanford MLSys #75](https://www.youtube.com/watch?v=h9jf1ikcGyk): **Stanford MLSys Seminars YouTube Video**
6. [ML Model Repositories: The Next Big Supply Chain Attack Target](https://www.darkreading.com/cloud-security/ml-model-repositories-next-big-supply-chain-attack-target): **OffSecML**
7. [Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor](https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/): **JFrog**
8. [Backdoor Attacks on Language Models](https://towardsdatascience.com/backdoor-attacks-on-language-models-can-we-trust-our-models-weights-73108f9dcb1f): **Towards Data Science**
9. [Never a dill moment: Exploiting machine learning pickle files](https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/): **TrailofBits**
10. [arXiv:2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training): **Anthropic (arXiv)**
11. [Backdoor Attacks on AI Models](https://www.cobalt.io/blog/backdoor-attacks-on-ai-models): **Cobalt**

### Related Frameworks and Taxonomies

Refer to this section for comprehensive information, scenarios, and strategies relating to infrastructure deployment, applied environment controls, and other best practices.

- [AML.T0018 | Backdoor ML Model](https://atlas.mitre.org/techniques/AML.T0018) **MITRE ATLAS**
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework): Strategies for ensuring AI integrity. **NIST**
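### LLM04: 2025 Data and Model Poisoning

#### Description

Data poisoning occurs when data is manipulated during the pre-training, fine-tuning, or embedding stages to introduce vulnerabilities, backdoors, or biases. Such manipulation can compromise the model's security, performance, or ethical behavior, leading to harmful outputs or impaired functionality. Common risks include degraded model performance, biased or toxic output, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages helps locate where vulnerabilities may originate. As an integrity attack, data poisoning tampers with training data and thereby affects the model's predictive ability. The risk is especially pronounced with external data sources, where unverified or malicious content can become a tool of attack.

In addition, models distributed through shared repositories or open-source platforms may face risks beyond data poisoning, such as malicious code embedded in malicious serialized files (for example, through pickling), which executes when the model is loaded. More insidiously, poisoning can also implant a backdoor that stays hidden until a specific condition triggers it, which makes it difficult to detect.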
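
One way to limit load-time code execution is to restrict which globals an unpickler may resolve. The sketch below follows the "restricting globals" pattern from the Python `pickle` documentation; the allowlist contents and the name `restricted_loads` are assumptions chosen for illustration, and a real model format would need its own carefully reviewed allowlist.

```python
import builtins
import io
import pickle

# Only these builtins may be reconstructed; everything else is rejected.
# The allowlist is illustrative and deliberately small.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allowlist."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but raises UnpicklingError on disallowed globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

With this in place, data made of plain containers and allowlisted builtins still loads, while a stream that references any other callable (for example `builtins.print` or `os.system`) fails with `pickle.UnpicklingError` before anything is called.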

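#### Common Examples of Vulnerability

1. Malicious actors introduce harmful data into the training data, leading to biased outputs. For example, techniques such as "Split-View Data Poisoning" or "Frontrunning Poisoning" exploit training dynamics to carry out the attack.
   (Ref. link: [Split-View Data Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%201%20Split-View%20Data%20Poisoning.jpeg))
   (Ref. link: [Frontrunning Poisoning](https://github.com/GangGreenTemperTatum/speaking/blob/main/dc604/hacker-summer-camp-23/Ads%20_%20Poisoning%20Web%20Training%20Datasets%20_%20Flow%20Diagram%20-%20Exploit%202%20Frontrunning%20Data%20Poisoning.jpeg))

2. Attackers inject malicious content directly into the training process, degrading the model's output quality.
3. Users unknowingly inject sensitive or proprietary information, which may be exposed in subsequent outputs.
4. Unverified training data increases the risk of biased or erroneous outputs.
5. Insufficient resource access restrictions may allow unsafe data to be ingested, resulting in biased outputs.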

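#### Prevention and Mitigation Strategies

1. Track data origins and transformations using tools such as OWASP CycloneDX or ML-BOM, and verify data legitimacy at every stage of model development.
2. Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
3. Implement strict sandboxing to limit the model's exposure to unverified data sources, and use anomaly detection techniques to filter out adversarial data.
4. Tailor models to different use cases by fine-tuning on specific datasets, which improves the accuracy of outputs.
5. Ensure infrastructure controls are in place to prevent the model from accessing unintended data sources.
6. Use data version control (DVC) to track changes to datasets and detect potential manipulation. Versioning is crucial for maintaining model integrity.
7. Store user-supplied information in a vector database, allowing the data to be adjusted without re-training the entire model.
8. Test model robustness with red teaming and adversarial techniques, for example using federated learning to reduce the impact of data perturbations.
9. Monitor training loss and analyze model behavior to detect signs of poisoning. Set thresholds to identify anomalous outputs.
10. During inference, combine Retrieval-Augmented Generation (RAG) with grounding (attribution) techniques to reduce the risk of hallucinations.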
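
The loss monitoring in item 9 can be sketched as a simple threshold check. This is a minimal illustration under stated assumptions: losses are logged once per step as floats, a spike is judged against a trailing window using a z-score, and the name `flag_loss_anomalies` plus the default `window` and `z_threshold` values are hypothetical rather than recommended settings.

```python
from statistics import mean, stdev


def flag_loss_anomalies(losses, window=5, z_threshold=3.0):
    """Return indices where the loss deviates sharply from the trailing window."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    flagged = []
    for i in range(window, len(losses)):
        history = losses[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if losses[i] != mu:
                flagged.append(i)
            continue
        if abs(losses[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


# A steadily decreasing loss curve with one abrupt spike at index 6.
losses = [1.00, 0.95, 0.91, 0.88, 0.85, 0.83, 2.50, 0.80, 0.78]
print(flag_loss_anomalies(losses))  # [6]
```

A flagged step is only a prompt to inspect the batch that produced it. Loss spikes also have benign causes, and a carefully crafted backdoor may not produce a visible spike at all, so this check complements rather than replaces the other controls in the list.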

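#### Example Attack Scenarios

##### Scenario 1
An attacker biases the model's outputs by manipulating training data or using prompt injection techniques, spreading misinformation.

##### Scenario 2
Toxic data that lacks proper filtering leads to harmful or biased outputs, propagating dangerous information.

##### Scenario 3
A malicious actor or competitor creates falsified documents for training, causing model outputs to reflect inaccurate information.

##### Scenario 4
Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.

##### Scenario 5
An attacker uses poisoning techniques to insert a backdoor trigger into the model, enabling, for example, authentication bypass or data exfiltration.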

#### Reference Links

1. [How data poisoning attacks corrupt machine learning models](https://www.csoonline.com/article/3613932/how-data-poisoning-attacks-corrupt-machine-learning-models.html): **CSO Online**
2. [MITRE ATLAS (framework) Tay Poisoning](https://atlas.mitre.org/studies/AML.CS0009/): **MITRE ATLAS**
3. [PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news](https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/): **Mithril Security**
4. [Poisoning Language Models During Instruction](https://arxiv.org/abs/2305.00944): **Arxiv White Paper 2305.00944**
5. [Poisoning Web-Scale Training Datasets - Nicholas Carlini | Stanford MLSys #75](https://www.youtube.com/watch?v=h9jf1ikcGyk): **Stanford MLSys Seminars YouTube Video**
6. [ML Model Repositories: The Next Big Supply Chain Attack Target](https://www.darkreading.com/cloud-security/ml-model-repositories-next-big-supply-chain-attack-target): **OffSecML**
7. [Data Scientists Targeted by Malicious Hugging Face ML Models with Silent Backdoor](https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/): **JFrog**
8. [Backdoor Attacks on Language Models](https://towardsdatascience.com/backdoor-attacks-on-language-models-can-we-trust-our-models-weights-73108f9dcb1f): **Towards Data Science**
9. [Never a dill moment: Exploiting machine learning pickle files](https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/): **TrailofBits**
10. [Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training): **Anthropic (arXiv)**

#### Related Frameworks and Taxonomies

- [AML.T0018 | Backdoor ML Model](https://atlas.mitre.org/techniques/AML.T0018): **MITRE ATLAS**
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework): Strategies for ensuring AI integrity. **NIST**
