Artificial Intelligence Value Alignment via Inverse Reinforcement Learning
Keywords:
Value alignment, Artificial Intelligence, Inverse Reinforcement Learning, AI Alignment, Deep Learning, RLHF

Abstract
Value alignment, one of the Artificial Intelligence (AI) Alignment problems, concerns ensuring that AI systems adhere to human values. These problems have no definitive solution, and although substantial research has been devoted to them, considerable progress is still required. Current trajectories in AI development, particularly Deep Learning and Reinforcement Learning from Human Feedback (RLHF), pose significant existential risks because AI objectives may become misaligned with human values. In the present study, we address the AI value alignment problem using Inverse Reinforcement Learning (IRL). The central idea is to employ the IRL framework to learn a reward function from an expert whose behaviour is consistent with human values. The AI system then imitates the expert's behaviour, aligning its actions with human values in a verifiable way.
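To make the proposed pipeline concrete, below is a minimal sketch of one standard IRL formulation, maximum-entropy IRL, on a toy gridworld. The environment, the right-moving "expert" standing in for value-aligned demonstrations, and all function names and hyperparameters are illustrative assumptions rather than the study's actual setup: a linear reward is learned so that the soft-optimal policy's expected state visitation counts match the expert's.

```python
import numpy as np

# Minimal maximum-entropy IRL sketch on a toy 1-D gridworld.
# States 0..N_STATES-1, actions {left, right}; the "expert" walks toward
# the rightmost state, standing in for behaviour consistent with a
# (hidden) human-valued goal. Every name and number here is illustrative.

N_STATES = 5
ACTIONS = [-1, +1]        # move left / move right
GAMMA = 0.9
HORIZON = 10

def step(s, a):
    return int(np.clip(s + a, 0, N_STATES - 1))

# One-hot state features; reward is linear in features: r(s) = theta . phi(s)
PHI = np.eye(N_STATES)

def expert_demonstrations(n_episodes=20):
    """Expert always moves right (a proxy for value-aligned behaviour)."""
    trajs = []
    for _ in range(n_episodes):
        s, traj = 0, []
        for _ in range(HORIZON):
            traj.append(s)
            s = step(s, +1)
        trajs.append(traj)
    return trajs

def soft_value_iteration(reward, n_iters=50):
    """Soft-optimal (MaxEnt) policy for the current reward estimate."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(n_iters):
        q_max = q.max(axis=1, keepdims=True)
        v = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).ravel()
        for s in range(N_STATES):
            for i, a in enumerate(ACTIONS):
                q[s, i] = reward[s] + GAMMA * v[step(s, a)]
    policy = np.exp(q - q.max(axis=1, keepdims=True))
    return policy / policy.sum(axis=1, keepdims=True)

def expected_state_visitation(policy):
    """Expected state visitation counts of the learner over the horizon."""
    d = np.zeros((HORIZON, N_STATES))
    d[0, 0] = 1.0                      # every episode starts in state 0
    for t in range(1, HORIZON):
        for s in range(N_STATES):
            for i, a in enumerate(ACTIONS):
                d[t, step(s, a)] += d[t - 1, s] * policy[s, i]
    return d.sum(axis=0)

def maxent_irl(trajs, lr=0.1, epochs=100):
    """Recover a reward whose soft-optimal policy matches the expert's feature counts."""
    expert_fe = sum(PHI[s] for traj in trajs for s in traj) / len(trajs)
    theta = np.zeros(N_STATES)
    for _ in range(epochs):
        policy = soft_value_iteration(PHI @ theta)
        learner_fe = expected_state_visitation(policy)
        theta += lr * (expert_fe - learner_fe)   # MaxEnt log-likelihood gradient
    return PHI @ theta

if __name__ == "__main__":
    recovered = maxent_irl(expert_demonstrations())
    print("Recovered reward per state:", np.round(recovered, 2))
    # The rightmost state should receive the highest reward, so a policy
    # optimising this reward reproduces the expert's (value-aligned) behaviour.
```

Once such a reward is recovered, the learner's expected feature counts can be compared against the expert's, which is one way the verifiability claim could be operationalised in practice.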