What question did this study set out to answer?

To improve prompt optimization in large language models by addressing prompt sensitivity through multi-objective reinforcement learning.

March 23, 2026 | Open Access

A Multi-Objective Reinforcement Learning Approach to Prompt Optimization in NLP

Key Points

Developed a Multi-Objective Reinforcement Learning system for prompt optimization.
Utilized a culturally diverse dataset of 2520 samples containing religious and non-religious sentence pairs.
Analyzed dimension-specific variability and biases affecting prompts.
Optimized prompts using Deep Q-Network and Proximal Policy Optimization strategies.
Proximal Policy Optimization outperformed Deep Q-Network on both the validation and test sets.
Achieved higher rewards, improved action diversity, and enhanced generalization across tasks.
Reduced high-bias responses and balanced sentence sentiments more effectively.

Abstract

Prompt sensitivity remains a major challenge in Large Language Models (LLMs) such as GPT-4, Qwen, DeepSeek, and Llama, often leading to inconsistent, biased, or excessively long outputs. This paper presents a Multi-Objective Reinforcement Learning (MORL) system that automatically optimizes prompts for accuracy, fairness, robustness, informativeness, and token efficiency. The system is conditioned on a culturally diverse dataset of 2520 samples comprising religious and non-religious sentence pairs across four semantic dimensions and different countries. Statistical analysis found high dimension-specific variability, with food showing the highest bias and time representation the lowest; these differences drive the reward shaping. Every encounter is encoded into a rich linguistic contextual state vector, which allows the agent to learn corrective strategies. Optimization uses a weighted multi-objective reward with Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). Although DQN narrowly converged on the rewards, PPO performed better on both the validation and test sets in all respects, including higher rewards, greater action diversity, and better generalization. PPO was effective in reducing high-bias cases, produced more balanced sentence sentiment, and stabilized performance across dimensions. Overall, the framework yields more accurate, informative, and consistent LLM responses, showing that MORL is an effective approach to culturally sensitive prompt optimization.
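As a rough illustration of what a weighted multi-objective reward can look like, the sketch below combines five per-objective scores into one scalar by a normalized weighted sum. The abstract does not give the paper's reward formula, weights, or score ranges, so the function name `scalarized_reward`, the objective keys, and the weight values are all hypothetical, and scores are assumed to lie in [0, 1].

```python
# Hypothetical objective names mirroring the five criteria named in the abstract.
OBJECTIVES = (
    "accuracy",
    "fairness",
    "robustness",
    "informativeness",
    "token_efficiency",
)

# Illustrative weights only (assumption); the abstract does not report actual values.
EXAMPLE_WEIGHTS = {
    "accuracy": 2.0,
    "fairness": 2.0,
    "robustness": 1.0,
    "informativeness": 1.0,
    "token_efficiency": 1.0,
}


def scalarized_reward(scores, weights=EXAMPLE_WEIGHTS):
    """Combine per-objective scores (assumed in [0, 1]) into one scalar reward.

    Uses a weighted sum divided by the total weight, so the result stays
    in [0, 1] whenever every score does.
    """
    missing = [name for name in OBJECTIVES if name not in scores]
    if missing:
        raise ValueError(f"missing objective scores: {missing}")
    total_weight = sum(weights[name] for name in OBJECTIVES)
    weighted_sum = sum(weights[name] * scores[name] for name in OBJECTIVES)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up scores for a single prompt rewrite, purely for demonstration.
    example_scores = {
        "accuracy": 0.9,
        "fairness": 0.5,
        "robustness": 0.8,
        "informativeness": 0.7,
        "token_efficiency": 0.6,
    }
    print(scalarized_reward(example_scores))
```

A scalar of this kind can serve as the reward signal for either a DQN or a PPO agent. Raising the weight on one objective, such as fairness, makes low scores on that objective cost more reward.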

Authors

Muhammad Junaid Iqbal

Muhammad Asghar Khan

Tahir Alyas

Journals

Procedia Computer Science

Institutions

University of Rome Tor Vergata

Prince Mohammad bin Fahd University

Lahore Garrison University
