Prompt sensitivity remains a major challenge for Large Language Models (LLMs) such as GPT-4, Qwen, DeepSeek, and Llama, often leading to inconsistent, biased, or excessively long outputs. This paper presents a Multi-Objective Reinforcement Learning (MORL) system that automatically optimizes prompts for accuracy, fairness, robustness, informativeness, and token efficiency. The system uses a 2,520-sample, culturally diverse dataset of religious and non-religious sentence pairs covering four semantic dimensions and several countries. Statistical analysis revealed high dimension-specific variability: food showed the highest bias and time representation the lowest, and these differences drive the reward shaping. Each interaction is encoded as a rich linguistic-contextual state vector, which allows the agent to learn corrective strategies. Optimization uses a weighted multi-objective reward with two agents, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). Although DQN converged narrowly on the rewards, PPO outperformed it on both validation and test sets across all measures, including reward, action diversity, and generalization. PPO also reduced the high-bias cases, produced more balanced sentence sentiment, and stabilized performance across dimensions. Overall, the framework yields more accurate, informative, and consistent LLM responses, showing that MORL is an effective approach to culturally sensitive prompt optimization.
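The weighted multi-objective reward mentioned in the abstract can be illustrated with a simple weighted-sum scalarization over the five named objectives. This is a minimal sketch only: the abstract does not give the weights, the score ranges, or the exact combination rule, so the `ObjectiveScores` container, the `WEIGHTS` values, and the `scalarize` function are all hypothetical and assume each objective has already been scored in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectiveScores:
    """Per-response scores for the five objectives, each assumed to lie in [0, 1]."""
    accuracy: float
    fairness: float
    robustness: float
    informativeness: float
    token_efficiency: float


# Hypothetical weights chosen for illustration; the paper's values are not stated here.
WEIGHTS = {
    "accuracy": 0.30,
    "fairness": 0.25,
    "robustness": 0.15,
    "informativeness": 0.15,
    "token_efficiency": 0.15,
}


def scalarize(scores: ObjectiveScores, weights: dict = WEIGHTS) -> float:
    """Collapse the objective scores into one scalar reward via a normalized weighted sum."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * getattr(scores, name) for name, w in weights.items())
    return weighted / total


if __name__ == "__main__":
    example = ObjectiveScores(
        accuracy=0.9,
        fairness=0.6,
        robustness=0.8,
        informativeness=0.7,
        token_efficiency=0.5,
    )
    print(f"scalar reward: {scalarize(example):.3f}")
```

A scalar reward of this kind is what a single-objective learner such as DQN or PPO can consume directly; raising the weight on one objective (for example fairness in a high-bias dimension) is one plausible way that dimension-specific statistics could feed into reward shaping.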
Muhammad Junaid Iqbal
Muhammad Asghar Khan
Tahir Alyas
Procedia Computer Science
University of Rome Tor Vergata
Prince Mohammad bin Fahd University
Lahore Garrison University
www.synapsesocial.com/papers/69c0df0bfddb9876e79c1540 — DOI: https://doi.org/10.1016/j.procs.2026.01.110