Publications

See my Google Scholar profile for an up-to-date list of publications and citations.

2023

  1. Pre-print
    Universal Jailbreak Backdoors from Poisoned Human Feedback
    Rando, J., and Tramèr, F.
    2023
  2. Workshop
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    Shah, R., Feuillade–Montixi, Q., Pour, S., Tagade, A., Casper, S., and Rando, J.
    SoLaR Workshop @ NeurIPS 2023
  3. Pre-print
    Personas as a Way to Model Truthfulness in Language Models
    Joshi, N., Rando, J., Saparov, A., Kim, N., and He, H.
    2023
  4. TMLR
    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C.-R., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E. J., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L., Hase, P., Bıyık, E., Dragan, A., Krueger, D., Sadigh, D., and Hadfield-Menell, D.
    Transactions on Machine Learning Research (TMLR) 2023
  5. ESORICS
    PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Rando, J., Perez-Cruz, F., and Hitaj, B.
    28th European Symposium on Research in Computer Security (ESORICS) 2023

2022

  1. Workshop
    Red-Teaming the Stable Diffusion Safety Filter
    🏆 Best Paper Award 🏆
    Rando, J., Paleka, D., Lindner, D., Heim, L., and Tramèr, F.
    ML Safety Workshop @ NeurIPS 2022
  2. Workshop
    How is Real-World Gender Bias Reflected in Language Models?
    Rando, J., Theus, A., Sevastjanova, R., and El-Assady, M.
    VISxAI Workshop @ IEEE VIS Sep 2022
  3. Workshop
    Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
    Rando, J., Naimi, N., Baumann, T., and Mathys, M.
    AdvML Workshop @ ICML Jul 2022
  4. ACL
    “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks
    Mosca, E., Agarwal, S., Rando, J., and Groh, G.
    ACL May 2022

2020

  1. ISCRAM
    Uneven coverage of natural disasters in Wikipedia: The case of floods
    Lorini, V., Rando, J., Saez-Trumper, D., and Castillo, C.
    ISCRAM Oct 2020