Large language models are gaining traction in critical processes such as academic peer review, but researchers now demonstrate a significant vulnerability to a subtle form of attack. Panagiotis Theocharopoulos of the International School of Athens, along with Ajinkya Kulkarni and Mathew Magimai. -Doss from the Idiap Research Institute, and their colleagues constructed a dataset of genuine academic papers to investigate hidden prompt injection attacks. The team successfully embedded disguised instructions within these papers, written in multiple languages, and then used a language model to review the altered documents. Results reveal that these hidden prompts substantially influence review scores and acceptance decisions for papers injected with English, Japanese, and Chinese instructions, exposing a clear risk for LLM-based reviewing systems and highlighting how vulnerabilities differ across languages.
The study involves injecting semantically equivalent instructions, written in English, Japanese, Chinese, and Arabic, into accepted papers intended for a machine learning conference. These modified papers then undergo review by a large language model, allowing the team to assess whether the embedded prompts influence the evaluation process. The results demonstrate that prompt injection significantly alters review scores and acceptance/rejection decisions when the instructions are in English, Japanese, or Chinese, whereas Arabic injections elicit minimal response, highlighting a notable difference in vulnerability across languages. This work underscores the susceptibility of current LLM-based reviewing systems to document-level prompt injection attacks and reveals important variations in their resilience.
Hidden Prompt Injection Attacks on LLMs
Large language models (LLMs) are increasingly integrated into real-world pipelines due to their capabilities in efficient task automation, but their robustness and reliability are central requirements when processing external, untrusted inputs. Prompt injection is a key vulnerability, where malicious instructions can compromise the model’s intended behaviour, particularly concerning when LLMs support decision-making. This study evaluated the robustness of LLM-based academic reviewing under document-level hidden prompt injection, using a dataset of accepted conference papers. The methodology involved injecting adversarial prompts into the papers and assessing the resulting changes in both numerical review scores and acceptance decisions.
Experiments were conducted using a locally served language model, with deterministic inference and a fixed input length. Score drift was quantified as the difference between injected and baseline scores, while injection success rate measured the frequency of differing decisions. The results demonstrate that prompt injection can substantially influence both numerical review scores and accept/reject recommendations. Strong and consistent effects were observed for English, Japanese, and Chinese injections, frequently leading to harsher reviews and high-impact decision reversals. In contrast, Arabic prompt injection exhibited markedly weaker effects, with limited score drift and fewer decision changes.
This suggests that vulnerability to prompt injection is not uniform across languages, potentially due to uneven multilingual alignment and instruction-following reliability. These findings underscore the need for caution when deploying LLMs in document-based evaluative settings and motivate further investigation into multilingual robustness and effective defences against indirect prompt injection attacks. Future work will extend this analysis to additional conferences and LLMs, explore diverse injection strategies, and investigate mitigation techniques.
Prompt Injection Attacks Compromise Peer Review
Scientists conducted a systematic evaluation of large language models (LLMs) used in academic peer review, revealing a significant vulnerability to document-level prompt injection attacks. The team constructed a dataset of approximately 500 real academic papers accepted to a machine learning conference and embedded hidden adversarial prompts within these documents, testing the LLM’s responses. Results demonstrate that prompt injection substantially alters review scores and accept/reject decisions when using English, Japanese, and Chinese instructions, indicating a clear susceptibility to manipulation. Conversely, injections written in Arabic produced minimal to no effect, revealing notable differences in vulnerability across languages.
The study measured the impact of these injections by comparing baseline reviews of original papers with reviews of the injected variants, focusing on shifts in numerical scores and final decisions. This work highlights a critical concern for the increasing use of LLMs in high-stakes decision-support systems, particularly within academic contexts where submission volumes are rapidly increasing. The breakthrough delivers crucial insights into the robustness of LLM-based reviewing systems and underscores the need for further research into effective mitigation strategies.
Hidden Prompts Subvert Peer Review Models
This research demonstrates a significant vulnerability in large language models when applied to tasks involving document evaluation, specifically academic peer review. By embedding hidden instructions within accepted research papers, scientists successfully altered the models’ scoring and acceptance decisions, revealing that seemingly innocuous text can exert substantial influence over automated assessments. The team observed that prompt injection frequently resulted in harsher reviews and, critically, a reversal of acceptance recommendations to rejection, highlighting a real risk for decision-support systems relying on untrusted textual inputs. Notably, the susceptibility to these attacks varied considerably by language, with English, Japanese, and Chinese injections proving highly effective, while Arabic injections yielded minimal impact. Researchers suggest this asymmetry stems from imbalances in multilingual alignment and training resources, where English-centric development may lead to reduced compliance with adversarial instructions in other languages. The team plans to expand this work to include more diverse datasets, explore varied attack strategies, and investigate potential mitigation techniques to enhance the robustness of multilingual systems against prompt injection.
👉 More information
🗞 Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing
🧠 ArXiv: https://arxiv.org/abs/2512.23684
