wangjs9/ForesightOptim

Foresight-based Policy Optimization for Strategic Conversations

This repo supports the following methods:

  • Online ForesightOptim
  • Offline ForesightOptim

And the following environments:

  • Adversarial WordTaboo
  • RSA Collaborative Game

as well as the checkpoints.

Quick Start

1. Install Dependencies

git clone https://github.com/wangjs9/ForesightOptim.git
cd ForesightOptim

conda env create -f environment.yaml
conda activate foresight
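
Before running the steps above, it can help to confirm that the command-line tools this README relies on (git, conda, and tmux for the training sessions) are on the PATH. The helper below is a minimal sketch; `require_cmd` is a hypothetical name used for illustration and is not part of the repository.

```shell
# Sketch: report which required tools are missing from PATH.
# require_cmd is a hypothetical helper, not part of the repository.
require_cmd() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Return 0 only when every tool was found.
  return "$missing"
}

require_cmd git conda tmux || echo "Install the missing tools before continuing."
```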

2. Download Datasets and Models

Offline datasets used in this work can be found [here](a link). Please create a "datasets" folder and place the contents in it accordingly.

We use meta-llama/Meta-Llama-3-8B-Instruct, Qwen/Qwen3-8B, and google/gemma-2-9b-it as the backbone models.
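
One possible way to prepare the folder and fetch the backbone models is sketched below. It assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed, which this README does not confirm, and it stores models under a hypothetical `models/` directory. The `download_model` helper and the `MODEL_ROOT` and `DRY_RUN` variables are illustrative names, not part of the repository. By default the script only prints the commands it would run.

```shell
# Sketch: create the datasets folder and fetch the backbone models.
# Assumptions (not from the repository): huggingface-cli is installed,
# and models are stored under a hypothetical models/ directory.
MODEL_ROOT="${MODEL_ROOT:-models}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

mkdir -p datasets   # the folder this README asks for

download_model() {
  repo_id="$1"
  target="${MODEL_ROOT}/${repo_id##*/}"   # e.g. models/Qwen3-8B
  if [ "$DRY_RUN" = "1" ]; then
    echo "huggingface-cli download ${repo_id} --local-dir ${target}"
  else
    huggingface-cli download "${repo_id}" --local-dir "${target}"
  fi
}

for m in meta-llama/Meta-Llama-3-8B-Instruct Qwen/Qwen3-8B google/gemma-2-9b-it; do
  download_model "$m"
done
```

Run it with `DRY_RUN=0` to actually download. The Llama and Gemma repositories are gated on Hugging Face, so accepting their licenses and running `huggingface-cli login` is typically required first.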

3. Model Training

1. Preparation:

  • Change the --output_dir, --model_name_or_path, and --train_data_path arguments in the corresponding bash script.

  • tmux usage:

    tmux new-session -d -s Foresight # create a session 
    tmux attach -t Foresight # attach to the session
    conda activate foresight
    bash scripts/xxx.sh xxx # run the programs (SFT, SelfPlayPPO, ForesightOptim)
    # kill the session: tmux kill-session -t Foresight

2. Data Collection:

  • Download WordTaboo
  • Generate RSAGame
    cd cooperative_rsa
    
    See the cooperative_rsa folder for details.

3. SFT Stage: Fine-tune the model using LoRA:

    bash scripts/sft.sh Meta-Llama-3-8B-Instruct WordTaboo
    # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
    # the task can be RSA or WordTaboo

4. Episode Collection with Rewards:

  • Generate the game episodes:

    bash scripts/get_episode.sh Meta-Llama-3-8B-Instruct WordTaboo
    # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
    # the task can be RSA or WordTaboo
  • Prepare training datasets for self-play:

    bash scripts/get_trainset.sh Meta-Llama-3-8B-Instruct WordTaboo
    # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
    # the task can be RSA or WordTaboo

5. RL Stage:

  • SelfPlayPPO:

    bash scripts/ppo.sh Meta-Llama-3-8B-Instruct WordTaboo
    # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
    # the task can be RSA or WordTaboo
  • ForesightOptim:

    bash scripts/fopo.sh Meta-Llama-3-8B-Instruct WordTaboo
    # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
    # the task can be RSA or WordTaboo
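
The stages above can be chained for one model and task, as in the sketch below. It simply calls the scripts in the order this README lists them. It does not wire the output of one stage into the next, so the paths inside each script still need to be set as described under Preparation. The `run_stage` and `run_pipeline` helpers and the `DRY_RUN` variable are hypothetical names for illustration, and the script only prints commands unless `DRY_RUN=0`.

```shell
# Sketch: run SFT, episode collection, trainset preparation, and one RL
# method in README order. Helper names are hypothetical, not from the repo.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

run_stage() {
  # Print the command, and execute it unless DRY_RUN=1.
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

run_pipeline() {
  model="$1"; task="$2"; rl="${3:-fopo}"
  case "$task" in
    RSA|WordTaboo) ;;
    *) echo "unknown task: $task" >&2; return 1 ;;
  esac
  case "$rl" in
    ppo|fopo) ;;   # scripts/ppo.sh (SelfPlayPPO) or scripts/fopo.sh (ForesightOptim)
    *) echo "unknown RL script: $rl" >&2; return 1 ;;
  esac
  for stage in sft get_episode get_trainset "$rl"; do
    run_stage bash "scripts/${stage}.sh" "$model" "$task"
  done
}

run_pipeline Meta-Llama-3-8B-Instruct WordTaboo fopo
```

Rejecting unknown task or RL names before the first stage avoids starting a long SFT run with a mistyped argument.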

Model Checkpoints

4. Model Evaluation

We tested LLMs on the following tasks, listed with their repositories, using the setups described in our paper.

1. GAMABench: CUHK-ARISE/GAMABench

Citing ForesightOptim

If you find our paper, code, or checkpoints useful, please cite our work:

Recommended Works:

This work was significantly inspired by the following works and their code. We have cited them in our paper and would also like to recommend them here:

About

Foresight-based Optimization by Self-Play
