ALIS: Aligned LLM Instruction Security Strategy for Unsafe Input Prompt

摘要

In large language models, existing instruction tuning methods may fail to balance the performance with robustness against attacks from user input like prompt injection and jailbreaking. Inspired by computer hardware and operating systems, we propose an instruction tuning paradigm named Aligned LLM Instruction Security Strategy (ALIS) to enhance model performance by decomposing user inputs into irreducible atomic instructions and organizing them into instruction streams which will guide the response generation of model. ALIS is a hierarchical structure, in which user inputs and system prompts are treated as user and kernel mode instructions respectively. Based on ALIS, the model can maintain security constraints by ignoring or rejecting the input instructions when user mode instructions attempt to conflict with kernel mode instructions. To build ALIS, we also develop an automatic instruction generation method for training ALIS, and give one instruction decomposition task and respective datasets. Notably, the ALIS framework with a small model to generate instruction streams still improve the resilience of LLM to attacks substantially without any lose on general capabilities.

出版物
In International Conference on Computational Linguistics
宋鑫浩
宋鑫浩
博士研究生
段苏峰
段苏峰
助理研究员
刘功申
刘功申
教授