Hacking LLMs with prompt injections
I recently had the opportunity to attend Google I/O, where many new products were announced. One thing that stood out about many of these products was their focus on AI, particularly generative AI.
Generative AI is fascinating and I am excited to see what we can do by integrating its capabilities with different functionalities. I already use these tools for scripting, copywriting, and generating ideas about blogs. While these applications are super cool, it’s also important to study how we can use these technologies securely.
Large language models are already being woven into products that handle private information, going way beyond just summarizing news articles and copyediting emails. It’s like a whole new world out there! I’ve already seen them being planned for use in customer service chatbots, content moderation, and generating ideas and advice based on user needs. They’re also being used for code generation, unit test generation, rule generation for security tools, and much more.
Let me preface this by saying that I’m a total newbie in the AI and machine learning field. Just like everyone else, I’m still trying to wrap my head around the capabilities of these advancements. When it comes to large language models, I’m a beginner prompt-kiddie at best. Nevertheless, I am very interested in learning more. Feel free to call me out if any of the information I present here is inaccurate.
What are LLMs doing exactly?
Given a block of text, or “context”, an LLM tries to compute the most probable next character, word, or phrase. For example, take the partial sentence “I am a security…”
What is the most likely next word, based on the data the model was trained on? The probabilities of the three most likely next words might be as follows:
I am a security… guard: 50%
I am a security… engineer: 20%
I am a security… researcher: 10%
The model would generate the most likely option: “I am a security guard.”
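To make that concrete, here is a minimal Python sketch of the "pick the most likely next word" step. It is only an illustration, not how a real LLM is implemented: the probabilities are the made-up numbers from the example above, the `<other>` entry is my own placeholder for the remaining 20%, and the names `next_word_probs` and `most_likely_next_word` are hypothetical.

```python
# Toy next-word table using the made-up numbers from the example above.
# "<other>" is an assumed placeholder for every word not listed (the remaining 20%).
next_word_probs = {
    "guard": 0.50,
    "engineer": 0.20,
    "researcher": 0.10,
    "<other>": 0.20,
}


def most_likely_next_word(probs):
    """Return the candidate word with the highest probability."""
    return max(probs, key=probs.get)


context = "I am a security"
completion = f"{context} {most_likely_next_word(next_word_probs)}."
print(completion)
```

A real model works with a much larger vocabulary and recomputes the probabilities after every word it generates, but the selection step in this simplified picture is the same idea.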
But if you can provide the model with more context, it changes the probability of the…