I think if the 2nd LLM has ever seen the actual prompt, then no, you could just jailbreak the 2nd LLM too. But you may be able to create a bot that is really good at spotting jailbreak-type prompts in general, and then prevent it from going through to the primary one. I also assume I’m not the first to come up with this and OpenAI knows exactly how well this fares.
Can you explain how you would jailbfeak it, if it does not actually follow any instructions in the prompt at all? A model does not magically learn to follow instructuons if you don’t train it to do so.
I think if the 2nd LLM has ever seen the actual prompt, then no, you could just jailbreak the 2nd LLM too. But you may be able to create a bot that is really good at spotting jailbreak-type prompts in general, and then prevent it from going through to the primary one. I also assume I’m not the first to come up with this and OpenAI knows exactly how well this fares.
Can you explain how you would jailbfeak it, if it does not actually follow any instructions in the prompt at all? A model does not magically learn to follow instructuons if you don’t train it to do so.