AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

Abstract

Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RL from Human Feedback (RLHF) to quantify human preferences, addressing their abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customization, addressing their mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish multi-perspective human feedback datasets, which contain comparisons of the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling the behavioral datasets with these relative strengths, we train an attribute-conditioned diffusion model, which serves as a planner, with the attribute strength model acting as a director for preference alignment at inference time. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases its promising potential for human-AI collaboration. More visualization videos are available at https://aligndiff.github.io/.


Method


Overview of AlignDiff. AlignDiff leverages RLHF to quantify human preferences and uses them to guide diffusion planning for zero-shot behavior customization. First, we collect multi-perspective human feedback through crowdsourcing. Second, we use this feedback to train an attribute strength model, which we then use to relabel the behavioral datasets. Third, we train a diffusion model on the annotated datasets, which can understand and generate trajectories with various attribute strengths. Finally, we use AlignDiff for inference, aligning agent behaviors with human preferences at any time.
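To make the four stages concrete, here is a minimal sketch that walks through them with placeholder data and stub models. Every data format, class, and function name below (e.g. `attribute_strength_model`, `diffusion_planner`) is an illustrative assumption, not the released implementation.

```python
import numpy as np

# Stage 1: multi-perspective human feedback (hypothetical record format).
# Each query compares two trajectory segments along one attribute and records
# which segment exhibits that attribute more strongly (0 = first, 1 = second).
feedback = [
    {"seg_a": 0, "seg_b": 1, "attribute": "speed", "label": 1},
    {"seg_a": 2, "seg_b": 0, "attribute": "jump_height", "label": 0},
]

# Stage 2: attribute strength model (illustrative stand-in).
# The real model is trained on the feedback above to predict, for any segment,
# a relative strength in [0, 1] for every attribute.
def attribute_strength_model(segment: np.ndarray) -> dict:
    return {"speed": float(np.clip(segment.mean(), 0.0, 1.0)),
            "jump_height": float(np.clip(segment.std(), 0.0, 1.0))}

# Stage 3: relabel the behavioral dataset with predicted relative strengths.
dataset = [np.random.rand(100, 8) for _ in range(3)]          # dummy segments
annotations = [attribute_strength_model(seg) for seg in dataset]

# Stage 4: attribute-conditioned diffusion planning (stubbed).
# The real planner denoises a trajectory conditioned on the target strengths,
# with the attribute strength model acting as a director during inference.
def diffusion_planner(state: np.ndarray, target_attributes: dict) -> np.ndarray:
    return np.zeros((100, state.shape[0]))                    # placeholder plan

plan = diffusion_planner(np.zeros(8), {"speed": 0.9, "jump_height": 0.1})
print(len(annotations), plan.shape)
```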


Results

Matching performance

In each video, starting from the same initial state, a specific attribute is set to three different levels, 0.1, 0.5, and 0.9 (increasing from left to right). The video on the left is produced by AlignDiff, while the video on the right is produced by the baseline method.


Hopper - Jump height - A larger value indicates a higher jump. AlignDiff exhibits a clear disparity in jump height across the three levels, while the baseline shows little difference in height and exhibits peculiar jumping postures.

AlignDiff
Baseline

Walker - Leg preference - A larger value indicates a preference for the left leg, a smaller value indicates a preference for the right leg, and a middle value indicates normal walking with both legs. AlignDiff clearly varies which leg it favors: in the left video it relies on the right leg, in the middle video it uses both legs, and in the right video it relies on the left leg. The baseline, however, exhibits no behavioral variation in this regard.

AlignDiff
Baseline

Walker - Torso height - A larger value indicates a higher torso. AlignDiff rapidly adjusts its behavior and exhibits noticeable differences in torso height, whereas the baseline, although it eventually shows height disparities as well, takes significantly longer to adapt.

AlignDiff
Baseline

Humanoid - Speed - A larger value indicates a faster movement speed. AlignDiff demonstrates clear differences in speed, while the baseline fails even to perform the locomotion behavior.

AlignDiff
Baseline

We also conduct a quantitative evaluation of matching performance. The scores of the compared algorithms are presented in the table below, where higher scores indicate stronger preference-matching capability. The quantitative metrics and the procedure for obtaining the table data are described in the experimental section of the paper.
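As a rough illustration only (the actual metrics are defined in the paper's experimental section), one plausible matching score is the mean absolute error between the requested and the realized relative attribute strengths. The sketch below computes it on made-up numbers.

```python
import numpy as np

def matching_error(targets: np.ndarray, realized: np.ndarray) -> float:
    """Mean absolute error between requested and realized relative attribute
    strengths, averaged over evaluation episodes and attributes."""
    return float(np.mean(np.abs(targets - realized)))

# Toy numbers: three episodes, two attributes (e.g. speed, torso height).
targets  = np.array([[0.10, 0.50], [0.50, 0.50], [0.90, 0.50]])
realized = np.array([[0.15, 0.48], [0.52, 0.51], [0.83, 0.47]])
print(f"matching error: {matching_error(targets, realized):.3f}")
```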

Switching performance

In the video, we adjust the target value of the speed attribute every 200 steps and visualize how the produced behavior tracks the target (ground truth). The video on the left is produced by AlignDiff, while the video on the right is produced by the baseline method. AlignDiff tracks the ground truth more quickly and accurately.
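For intuition, the sketch below mimics this protocol with toy data: the target speed changes every 200 steps, a placeholder `measured_speed` function stands in for the attribute measured from the rolled-out trajectory, and a mean tracking error summarizes how closely the target is followed. It is illustrative only, not the actual experiment harness.

```python
import numpy as np

STEPS_PER_TARGET = 200                      # preference changes every 200 steps
targets = [0.2, 0.8, 0.5, 0.9]              # hypothetical speed targets

def measured_speed(target: float, step: int, lag: int = 30) -> float:
    """Toy stand-in for the speed measured from the rolled-out trajectory:
    the agent drifts from a neutral value to the target within `lag` steps."""
    return target if step >= lag else 0.5 + (target - 0.5) * step / lag

ground_truth, tracked = [], []
for target in targets:
    for step in range(STEPS_PER_TARGET):
        ground_truth.append(target)
        tracked.append(measured_speed(target, step))

tracking_error = float(np.mean(np.abs(np.array(ground_truth) - np.array(tracked))))
print(f"mean tracking error: {tracking_error:.3f}")
```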


We further test AlignDiff's zero-shot capability under human instructions by deploying it on unseen downstream tasks. With a human instructor adjusting attributes such as speed, torso height, and stride length, the walker robot, trained only on locomotion datasets, successfully completes the gap-crossing and obstacle-avoidance tasks.
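A minimal, hypothetical sketch of how such instructions could be turned into planner conditions: each instruction simply overwrites the targeted relative attribute strengths, and the planner would be re-conditioned on the updated targets. The instruction vocabulary and attribute names below are illustrative assumptions, not part of the released code.

```python
# Hypothetical mapping from high-level instructions to attribute targets; after
# each instruction, the planner is re-conditioned on the new target strengths.
attribute_targets = {"speed": 0.5, "torso_height": 0.5, "stride_length": 0.5}

instructions = {
    "speed up":     {"speed": 0.9},
    "crouch":       {"torso_height": 0.2},                 # e.g. duck under an obstacle
    "long strides": {"stride_length": 0.9, "speed": 0.7},  # e.g. cross a gap
}

def apply_instruction(targets: dict, instruction: str) -> dict:
    """Overwrite only the attributes that the instruction mentions."""
    return {**targets, **instructions.get(instruction, {})}

for command in ["speed up", "long strides", "crouch"]:
    attribute_targets = apply_instruction(attribute_targets, command)
    print(command, "->", attribute_targets)
```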


Covering performance


Covering distribution of AlignDiff and other baselines trained with human labels. The x-axis represents relative attribute strength, and the y-axis represents the corresponding actual attribute value. A point (x, y) in the highlighted region indicates that, conditioned on x, the algorithm is more likely to produce trajectories whose actual attribute value is y. Compared to the ground truth (dataset distribution), AlignDiff better covers the dataset distribution and can produce behaviors that are not present in the dataset (intuitively, it fills the gaps in the disconnected regions within the highlighted area).
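The sketch below shows how such a covering plot can be built from (relative strength, actual attribute) pairs; synthetic data stands in for the generated trajectories, and the 2D histogram density plays the role of the highlighted region described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic (relative strength, actual attribute) pairs standing in for the
# trajectories generated by an algorithm; the real plot uses rollout data.
rng = np.random.default_rng(0)
relative_strength = rng.uniform(0.0, 1.0, size=5000)
actual_attribute = 2.0 * relative_strength + rng.normal(0.0, 0.15, size=5000)

# 2D histogram: bright cells mark (x, y) pairs the algorithm produces often,
# i.e. the highlighted region in the figure above.
plt.hist2d(relative_strength, actual_attribute, bins=50, cmap="viridis")
plt.xlabel("relative attribute strength (condition)")
plt.ylabel("actual attribute of generated trajectory")
plt.colorbar(label="trajectory density")
plt.title("Covering distribution (illustrative)")
plt.savefig("covering_distribution.png")
```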

Human evaluation


Results of human evaluation. We invite human evaluators to assess the behaviors produced by each algorithm, with higher scores indicating stronger alignment with human preferences. AlignDiff significantly outperforms the other baselines. For details of the human evaluation protocol, please refer to the experimental section of the paper.


Released behavioral & human feedback datasets

Datasets: Download (zip)
Pre-trained models: Download (zip)