This repo supports the following methods:
- Online ForesightOptim
- Offline ForesightOptim
And the following environments:
- Adversarial WordTaboo
- RSA Collaborative Game
Checkpoints are provided as well.
git clone https://github.com/wangjs9/ForesightOptim.git
cd ForesightOptim
conda env create -f environment.yaml
conda activate foresight

Offline datasets used in this work can be found [here](a link). Please create a "datasets" folder and place the contents in it accordingly.
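As a minimal sketch of the folder setup described above, assuming the "datasets" folder sits at the repository root (the README does not state its exact location or internal layout), the following creates it and warns while it is still empty:

```shell
# Create the datasets folder (assumed to live at the repository root).
mkdir -p datasets

# Illustrative sanity check: warn if nothing has been copied in yet.
if [ -z "$(ls -A datasets)" ]; then
    echo "datasets/ is empty: download the offline datasets and place them here."
fi
```

`mkdir -p` is safe to rerun, so the snippet can be repeated after the download to confirm the folder is populated.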
We use meta-llama/Meta-Llama-3-8B-Instruct, Qwen/Qwen3-8B, and google/gemma-2-9b-it as the backbone models.
1. Preparation:
   - Change the --output_dir, --model_name_or_path, and --train_data_path in the corresponding bash file.
   - tmux usage:

         tmux new-session -d -s Foresight   # create a session
         tmux attach -t Foresight           # attach to the session
         conda activate foresight
         bash scripts/xxx.sh xxx            # run the programs (SFT, SelfPlayPPO, ForesightOptim)
         # kill the session:
         tmux kill-session -t Foresight
2. Data Collection:
   - Download WordTaboo
   - Generate RSAGame

   The detailed information can be found in the cooperative_rsa folder:

       cd cooperative_rsa
3. SFT Stage: Fine-tune the model using LoRA:

       bash scripts/sft.sh Meta-Llama-3-8B-Instruct WordTaboo
       # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
       # the task can be RSA or WordTaboo
4. Episode Collection with Rewards:
   - Generate the game episodes:

         bash scripts/get_episode.sh Meta-Llama-3-8B-Instruct WordTaboo
         # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
         # the task can be RSA or WordTaboo

   - Prepare training datasets for self-play:

         bash scripts/get_trainset.sh Meta-Llama-3-8B-Instruct WordTaboo
         # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
         # the task can be RSA or WordTaboo
5. RL Stage:
   - SelfPlayPPO:

         bash scripts/ppo.sh Meta-Llama-3-8B-Instruct WordTaboo
         # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
         # the task can be RSA or WordTaboo

   - ForesightOptim:

         bash scripts/fopo.sh Meta-Llama-3-8B-Instruct WordTaboo
         # the model can be Qwen3-8B or DeepSeek-R1-Distill-Qwen-7B
         # the task can be RSA or WordTaboo
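The stages above can be chained for one model/task pair. The following is a rough sketch, not a script shipped with the repo: `run_pipeline` and `DRY_RUN` are illustrative names, and it assumes every stage script takes the model and task as its two positional arguments, as in the commands above. By default it only prints the commands in order.

```shell
# Sketch: run SFT, episode collection, trainset preparation, and one RL stage in order.
# run_pipeline and DRY_RUN are hypothetical helpers, not part of the repo.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_pipeline() {
    model="$1"
    task="$2"
    rl="${3:-fopo}"

    # Task names documented above. The model name is passed through unchecked.
    case "$task" in
        RSA|WordTaboo) ;;
        *) echo "unknown task: $task" >&2; return 1 ;;
    esac
    # RL scripts documented above: scripts/ppo.sh and scripts/fopo.sh.
    case "$rl" in
        ppo|fopo) ;;
        *) echo "unknown RL stage: $rl" >&2; return 1 ;;
    esac

    for stage in sft get_episode get_trainset "$rl"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "bash scripts/${stage}.sh ${model} ${task}"
        else
            bash "scripts/${stage}.sh" "$model" "$task"
        fi
    done
}

run_pipeline Meta-Llama-3-8B-Instruct WordTaboo fopo
```

Setting `DRY_RUN=0` would execute the stages for real. Because of `set -e`, a failing stage stops the chain, so a later stage never runs on missing outputs.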
- SFT Models:
  - SFT Meta-Llama-3-8B-Instruct on WordTaboo: YourName/im_Llama3-8B-I_word
  - SFT Qwen3-8B on WordTaboo: YourName/word_sft_Qwen3-8B
  - SFT DeepSeek-R1-Distill-Qwen-7B on WordTaboo: YourName/word_sft_DeepSeek-R1-Distill-Qwen-7B
- PPO Models:
  - SelfPlay + PPO Meta-Llama-3-8B-Instruct on WordTaboo: YourName/word_ppo_Llama3-8B-I
  - SelfPlay + PPO Qwen3-8B on WordTaboo: YourName/word_ppo_Qwen3-8B
  - SelfPlay + PPO DeepSeek-R1-Distill-Qwen-7B on WordTaboo: YourName/word_ppo_DeepSeek-R1-Distill-Qwen-7B
- FoPO Models:
  - SelfPlay + FoPO Meta-Llama-3-8B-Instruct on WordTaboo: YourName/word_fopo_Llama3-8B-I
  - SelfPlay + FoPO Qwen3-8B on WordTaboo: YourName/word_fopo_Qwen3-8B
  - SelfPlay + FoPO DeepSeek-R1-Distill-Qwen-7B on WordTaboo: YourName/word_fopo_DeepSeek-R1-Distill-Qwen-7B
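The checkpoint names listed above mostly follow a `word_<stage>_<model>` pattern, with the Llama SFT checkpoint as the exception. As a small sketch, the hypothetical helper `checkpoint_id` (not part of the repo) maps a stage and a backbone to the ID as listed. "YourName" is the placeholder from the list and has to be replaced with the actual account name.

```shell
# Sketch: map (stage, model) to the WordTaboo checkpoint ID listed above.
# checkpoint_id is an illustrative helper; "YourName" is the list's placeholder.
checkpoint_id() {
    stage="$1"   # sft | ppo | fopo
    model="$2"   # Meta-Llama-3-8B-Instruct | Qwen3-8B | DeepSeek-R1-Distill-Qwen-7B

    case "$stage" in
        sft|ppo|fopo) ;;
        *) echo "unknown stage: $stage" >&2; return 1 ;;
    esac

    case "$model" in
        Meta-Llama-3-8B-Instruct)
            # The Llama SFT checkpoint is named differently from the rest.
            if [ "$stage" = "sft" ]; then
                echo "YourName/im_Llama3-8B-I_word"
            else
                echo "YourName/word_${stage}_Llama3-8B-I"
            fi
            ;;
        Qwen3-8B|DeepSeek-R1-Distill-Qwen-7B)
            echo "YourName/word_${stage}_${model}"
            ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
}

checkpoint_id fopo Qwen3-8B
```

The resulting ID can then be used wherever a model path is expected, for example as the value of --model_name_or_path.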
We evaluated the LLMs on the following tasks, using their repos and the setups described in our paper.
1. GAMABench: CUHK-ARISE/GAMABench
If you find our paper, code, or checkpoints useful, please cite our work:
This work was significantly inspired by the following amazing works and their code. We have cited them in our paper and would also like to recommend them here: