Publications | Javier Rando

Take a look at my Google Scholar for updated publications and citations.

2024

Pre-print
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Javier Rando, Francesco Croce, Krystof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, and Florian Tramèr

2024

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.
@article{rando2024competition,
  title = {Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs},
  author = {Rando, Javier and Croce, Francesco and Mitka, Krystof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tramèr, Florian},
  year = {2024},
}

Agenda

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh, Erik Jenner, Stephen Casper, Oliver Sourbut, and 28 more authors

2024

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.

@article{anwar2024foundational,
  title = {Foundational Challenges in Assuring Alignment and Safety of Large Language Models},
  author = {Anwar, Usman and Saparov, Abulhair and Rando, Javier and Paleka, Daniel and Turpin, Miles and Hase, Peter and Singh, Ekdeep and Jenner, Erik and Casper, Stephen and Sourbut, Oliver and Edelman, Benjamin and Zhang, Zhaowei and Gunther, Mario and Korinek, Anton and Hernandez-Orallo, Jose and Hammond, Lewis and Bigelow, Eric and Pan, Alexander and Langosco, Lauro and Korbak, Tomasz and Zhang, Heidi and Zhong, Ruiqi and Ó hÉigeartaigh, Seán and Rachet, Gabriel and Corsi, Giulio and Chan, Alan and Anderljung, Markus and Edwards, Lillian and Bengio, Yoshua and Chen, Danqi and Albanie, Samuel and Maharaj, Tegan and Foerster, Jakob and Tramèr, Florian and He, He and Kasirzadeh, Atoosa and Choi, Yejin and Krueger, David},
  year = {2024},
}

ICLR
Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando and Florian Tramèr
🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
ICLR, 2024

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
@article{rando2023universal,
  title = {Universal Jailbreak Backdoors from Poisoned Human Feedback},
  author = {Rando, Javier and Tram{\`e}r, Florian},
  journal = {ICLR},
  year = {2024},
}

2023

Workshop
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando

SoLaR Workshop @ NeurIPS, 2023

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
@article{scalable2023shah,
  title = {Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation},
  journal = {SoLaR Workshop @ NeurIPS},
  author = {Shah, Rusheb and Pour, Soroush and Tagade, Arush and Casper, Stephen and Rando, Javier},
  year = {2023},
}
Pre-print
Personas as a Way to Model Truthfulness in Language Models

Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He

arXiv preprint arXiv:2310.18168, 2023

Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model’s answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
@article{personas2023joshi,
  title = {Personas as a Way to Model Truthfulness in Language Models},
  author = {Joshi, Nitish and Rando, Javier and Saparov, Abulhair and Kim, Najoung and He, He},
  journal = {arXiv preprint arXiv:2310.18168},
  year = {2023},
}
TMLR
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authors

Transactions on Machine Learning Research, 2023

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
@article{open2023casper,
  title = {Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback},
  author = {Casper, S. and Davies, X. and Shi, C. and Gilbert, T. K. and Scheurer, J. and Rando, J. and Freedman, R. and Korbak, T. and Lindner, D. and Freire, P. and Wang, T. and Marks, S. and Segerie, C.-R. and Carroll, M. and Peng, A. and Christoffersen, P. and Damani, M. and Slocum, S. and Anwar, U. and Siththaranjan, A. and Nadeau, M. and Michaud, E. J. and Pfau, J. and Krasheninnikov, D. and Chen, X. and Langosco, L. and Hase, P. and Bıyık, E. and Dragan, A. and Krueger, D. and Sadigh, D. and Hadfield-Menell, D.},
  year = {2023},
  journal = {Transactions on Machine Learning Research},
}
ESORICS
PassGPT: Password Modeling and (Guided) Generation with Large Language Models

Javier Rando, Fernando Perez-Cruz, and Briland Hitaj

28th European Symposium on Research in Computer Security, 2023

Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision. In this paper, we investigate the efficacy of LLMs in modeling passwords. We present PassGPT, an LLM trained on password leaks for password generation. PassGPT outperforms existing methods based on generative adversarial networks (GAN) by guessing twice as many previously unseen passwords. Furthermore, we introduce the concept of guided password generation, where we leverage PassGPT's sampling procedure to generate passwords matching arbitrary constraints, a feat lacking in current GAN-based strategies. Lastly, we conduct an in-depth analysis of the entropy and probability distribution that PassGPT defines over passwords and discuss their use in enhancing existing password strength estimators.
@article{passgpt2023rando,
  title = {PassGPT: Password Modeling and (Guided) Generation with Large Language Models},
  journal = {28th European Symposium on Research in Computer Security},
  author = {Rando, Javier and Perez-Cruz, Fernando and Hitaj, Briland},
  year = {2023},
}

2022

Workshop
Red-Teaming the Stable Diffusion Safety Filter

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆
arXiv preprint arXiv:2210.04610, 2022

Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALL·E, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter’s limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
@article{red2022rando,
  title = {Red-Teaming the Stable Diffusion Safety Filter},
  year = {2022},
  author = {Rando, Javier and Paleka, Daniel and Lindner, David and Heim, Lennart and Tram{\`e}r, Florian},
  journal = {arXiv preprint arXiv:2210.04610},
}
Workshop
How is Real-World Gender Bias Reflected in Language Models?

J. Rando, A. Theus, R. Sevastjanova, and M. El-Assady

VISxAI Workshop @ IEEE VIS, Sep 2022

Our work explores, through visualization, a potential relationship between gender bias in language models and real-world demographics. In what follows, we revisit the main insights we gathered from the visualizations. However, we want to emphasize that this dashboard is exploratory in nature. Hence, we strongly encourage readers to interact with the visualizations and come to their own conclusions.
@article{rando2022what,
  title = {How is Real-World Gender Bias Reflected in Language Models?},
  author = {Rando, J. and Theus, A. and Sevastjanova, R. and El-Assady, M.},
  journal = {VISxAI Workshop @ IEEE VIS},
  year = {2022},
  month = sep,
}
Workshop
Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO

Javier Rando, Nasib Naimi, Thomas Baumann, and Max Mathys
AdvML Workshop @ ICML
arXiv preprint arXiv:2206.06761, Sep 2022

This work conducts the first analysis on the robustness against adversarial attacks on self-supervised Vision Transformers trained using DINO. First, we evaluate whether features learned through self-supervision are more robust to adversarial attacks than those emerging from supervised learning. Then, we present properties arising for attacks in the latent space. Finally, we evaluate whether three well-known defense strategies can increase adversarial robustness in downstream tasks by only fine-tuning the classification head to provide robustness even in view of limited compute resources. These defense strategies are: Adversarial Training, Ensemble Adversarial Training and Ensemble of Specialized Networks.
@article{rando2022exploring,
  title = {Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO},
  author = {Rando, Javier and Naimi, Nasib and Baumann, Thomas and Mathys, Max},
  journal = {arXiv preprint arXiv:2206.06761},
  year = {2022},
}
ACL
“That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks

Edoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg Groh

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
@inproceedings{suspicious2022mosca,
  title = {“That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks},
  author = {Mosca, Edoardo and Agarwal, Shreyash and Rando, Javier and Groh, Georg},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages = {7806--7816},
  year = {2022},
  month = may,
}

2020

ISCRAM
Uneven coverage of natural disasters in Wikipedia: The case of floods

Valerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos Castillo

In ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020

The usage of non-authoritative data for disaster management presents the opportunity of accessing timely information that might not be available through other means, as well as the challenge of dealing with several layers of biases. Wikipedia, a collaboratively-produced encyclopedia, includes in-depth information about many natural and human-made disasters, and its editors are particularly good at adding information in real-time as a crisis unfolds. In this study, we focus on the English version of Wikipedia, which is by far the most comprehensive version of this encyclopedia. Wikipedia tends to have good coverage of disasters, particularly those having a large number of fatalities. However, we also show that a tendency to cover events in wealthy countries and not cover events in poorer ones permeates Wikipedia as a source for disaster-related information. By performing careful automatic content analysis at a large scale, we show how the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada. We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries. These results have implications for systems using Wikipedia or similar collaborative media platforms as an information source for detecting emergencies or for gathering valuable information for disaster response.
@inproceedings{uneven2020lorini,
  title = {Uneven coverage of natural disasters in Wikipedia: The case of floods},
  author = {Lorini, Valerio and Rando, Javier and S{\'a}ez-Trumper, Diego and Castillo, Carlos},
  booktitle = {ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management},
  year = {2020},
  month = oct,
  pages = {688--703},
}