Foresight-based Policy Optimization for Strategic Conversations

This repo supports the following methods:

Online ForesightOptim
Offline ForesightOptim

And the following environments:

Adversarial WordTaboo
RSA Collaborative Game

as well as the checkpoints.

Quick Start

1. Install Dependencies

git clone https://github.com/wangjs9/ForesightOptim.git
cd ForesightOptim

conda env create -f environment.yaml
conda activate foresight

2. Download Datasets and Models

The offline datasets used in this work can be found [here](a link). Please create a "datasets" folder and place the downloaded contents in it.
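A minimal sketch of the folder setup, assuming the commands are run from the repository root. The layout inside "datasets" is not specified here, so the check below only confirms that something has been copied in:

```shell
# Create the folder the README asks for (run from the repository root).
mkdir -p datasets

# After copying the downloaded files in, confirm the folder is not empty.
if [ -z "$(ls -A datasets)" ]; then
    echo "datasets/ is empty: copy the downloaded files here first"
fi
```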

We use meta-llama/Meta-Llama-3-8B-Instruct, Qwen/Qwen3-8B, and google/gemma-2-9b-it as the backbone models.
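One possible way to fetch these backbones is the Hugging Face CLI. The sketch below is a dry run by default (it only prints the commands), and the "models" target folder is an assumption for illustration, not a path the training scripts are known to expect. Gated models such as Llama 3 and Gemma 2 may require running huggingface-cli login first.

```shell
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

download_backbones() {
    for repo in meta-llama/Meta-Llama-3-8B-Instruct Qwen/Qwen3-8B google/gemma-2-9b-it; do
        # models/ is a hypothetical target folder; ${repo##*/} keeps the
        # part after the last slash, e.g. Qwen3-8B.
        $RUN huggingface-cli download "$repo" --local-dir "models/${repo##*/}"
    done
}

download_backbones
```

Whatever location is chosen, it should match the --model_name_or_path value set in the training scripts.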

3. Model Training

1. Preparation:

Set --output_dir, --model_name_or_path, and --train_data_path in the corresponding bash script before running it.
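To locate these arguments before editing, a small sketch may help. The show_flags helper is hypothetical (it is not part of the repo), and scripts/sft.sh is used only as an example of a script that may contain the flags:

```shell
# Hypothetical helper (not part of the repo): print the lines of a script
# that mention the three arguments named above, with their line numbers.
show_flags() {
    grep -nE -e '--(output_dir|model_name_or_path|train_data_path)' "$1"
}

# Example: inspect the SFT script if it exists in the current checkout.
if [ -f scripts/sft.sh ]; then
    show_flags scripts/sft.sh || echo "no matching flags found"
fi
```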

tmux usage:

tmux new-session -d -s Foresight # create a session 
tmux attach -t Foresight # attach to the session
conda activate foresight
bash scripts/xxx.sh xxx # run the programs (SFT, SelfPlayPPO, ForesightOptim)
# kill the session: tmux kill-session -t Foresight

2. Data Collection:

Download WordTaboo
Generate RSAGame
```
cd cooperative_rsa
```
See the cooperative_rsa folder for details.

3. SFT Stage:

Fine-tune the model using LoRA:

bash scripts/sft.sh Meta-Llama-3-8B-Instruct WordTaboo
# the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
# the task can be RSA or WordTaboo

4. Episode Collection with Rewards:

Generate the game episodes:

bash scripts/get_episode.sh Meta-Llama-3-8B-Instruct WordTaboo
# the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
# the task can be RSA or WordTaboo

Prepare the training datasets for self-play:

bash scripts/get_trainset.sh Meta-Llama-3-8B-Instruct WordTaboo
# the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
# the task can be RSA or WordTaboo

5. RL Stage:

SelfPlayPPO:

bash scripts/ppo.sh Meta-Llama-3-8B-Instruct WordTaboo
# the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
# the task can be RSA or WordTaboo

ForesightOptim:

bash scripts/fopo.sh Meta-Llama-3-8B-Instruct WordTaboo
# the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
# the task can be RSA or WordTaboo
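The stages above can be chained for one model and task. The sketch below is a dry run by default, so it only prints the commands in the order listed in this section; the pipeline function and the MODEL, TASK, and RUN variables are illustrative names, not part of the repo. Replace fopo with ppo in the loop to run SelfPlayPPO instead.

```shell
# Illustrative wrapper (not part of the repo). Dry run by default:
# set RUN= (empty) to actually execute each stage.
MODEL="${MODEL:-Meta-Llama-3-8B-Instruct}"
TASK="${TASK:-WordTaboo}"
RUN="${RUN-echo}"

pipeline() {
    # SFT, episode collection, training-set preparation, then ForesightOptim.
    for stage in sft get_episode get_trainset fopo; do
        # Stop at the first stage that fails.
        $RUN bash "scripts/${stage}.sh" "$MODEL" "$TASK" || return 1
    done
}

pipeline
```

Running the stages one at a time inside the tmux session remains the safer option for long jobs, since each stage can be inspected before the next one starts.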

Model Checkpoints

SFT Models:
- SFT Meta-Llama-3-8B-Instruct on wordtaboo: YourName/im_Llama3-8B-I_word
- SFT Qwen3-8B on wordtaboo: YourName/word_sft_Qwen3-8B
- SFT DeepSeek-R1-Distill-Qwen-7B on wordtaboo: YourName/word_sft_DeepSeek-R1-Distill-Qwen-7B
PPO Models:
- SelfPlay + PPO Meta-Llama-3-8B-Instruct on wordtaboo: YourName/word_ppo_Llama3-8B-I
- SelfPlay + PPO Qwen3-8B on wordtaboo: YourName/word_ppo_Qwen3-8B
- SelfPlay + PPO DeepSeek-R1-Distill-Qwen-7B on wordtaboo: YourName/word_ppo_DeepSeek-R1-Distill-Qwen-7B
FoPO Models:
- SelfPlay + FoPO Meta-Llama-3-8B-Instruct on wordtaboo: YourName/word_fopo_Llama3-8B-I
- SelfPlay + FoPO Qwen3-8B on wordtaboo: YourName/word_fopo_Qwen3-8B
- SelfPlay + FoPO DeepSeek-R1-Distill-Qwen-7B on wordtaboo: YourName/word_fopo_DeepSeek-R1-Distill-Qwen-7B

4. Model Evaluation

We evaluated the LLMs on the following benchmarks, using their repositories and the setups described in our paper.

1. GAMABench: CUHK-ARISE/GAMABench

Citing ForesightOptim

If you find our paper, code, or checkpoints useful, please cite our work:

Recommended Works:

This work was significantly inspired by the following works and their code. We have cited them in our paper and would also like to recommend them here:
