A curated list of trustworthy deep learning papers. Daily updating...
Updated Jan 10, 2025
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Code accompanying the paper Pretraining Language Models with Human Preferences
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
📚 A curated list of papers & technical articles on AI Quality & Safety
[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".
Reading list for adversarial perspective and robustness in deep reinforcement learning.
A curated list of awesome resources for Artificial Intelligence Alignment research
A curated list of awesome academic research, books, code of ethics, data sets, institutes, maturity models, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.
Full code for the sparse probing paper.
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.
Scan your AI/ML models for problems before you put them into production.
Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023
Official Implementation of Nabla-GFlowNet
A prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation
An implementation of iterated distillation and amplification