Ziqing Qian*, Jiaying Lei*, Shengqi Dang, Nan Cao
Tongji University
Social media has fundamentally transformed how people access information and form social connections, with content expression playing a critical role in driving information diffusion. While prior research has focused largely on network structures and tipping point identification, it provides limited tools for automatically generating content tailored for virality within a specific audience.
To fill this gap, we propose the novel task of Diffusion-Oriented Content Generation (DOCG) and introduce an information enhancement algorithm for generating content optimized for diffusion. Our method includes:
- Influence Indicator: enables content-level diffusion assessment without requiring access to network topology
- Information Editor: employs reinforcement learning to explore interpretable editing strategies, leveraging generative models to produce semantically faithful, audience-aware textual or visual content
Experiments on real-world social media datasets and a user study demonstrate that our approach significantly improves diffusion effectiveness while preserving the core semantics of the original content.
Designed to Spread: A Generative Approach to Enhance Information Diffusion
Ziqing Qian, Jiaying Lei, Shengqi Dang, Nan Cao
Proceedings of the AAAI Conference on Artificial Intelligence, 2026, 40(2): 944–952
Paper | DOI
designed-to-spread/
├── Data/                 # Raw inputs & pipeline outputs (posts/accounts JSONL, features, models, …)
├── Data_preprocess/      # End-to-end preprocessing (PySpark + Python; entry: main.py)
├── Long_CLIP/            # Long-CLIP model code (used by image / CLIP features)
├── baselines/            # llm_baseline.py, in_context_learning.py
├── influence_indicator/  # train_pijc_model.py, Evaluation.ipynb
├── information_editor/   # text_editor.py, visual_editor.py, base_editor.py (imported by scripts/)
├── scripts/              # Runnable entry points: text/visual editors, tweet→image preprocess
└── requirements.txt      # Python dependencies
Note: There is no preprocess.ipynb or run_pipeline.sh in this repo. Preprocessing is driven by Data_preprocess/main.py (and its modules). The only notebook is influence_indicator/Evaluation.ipynb (evaluation / analysis). Text and image editing are run via scripts/run_text_editor.py and scripts/run_visual_editor.py.
Make sure you have the following installed:
- Python >= 3.8
- CUDA 12.1 (required for GPU acceleration; the dependencies are built against cu121)
- Conda (recommended for environment management)
git clone https://github.com/idvxlab/designed-to-spread.git
cd designed-to-spread
It is strongly recommended to use a dedicated virtual environment to avoid dependency conflicts:
# Using conda (recommended)
conda create -n docg python=3.10
conda activate docg
# Or using venv
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Key dependencies include:
| Package | Version | Purpose |
|---|---|---|
| torch | 2.4.1 | Core deep learning framework |
| transformers | 4.46.3 | Pre-trained language models |
| diffusers | 0.36.0 | Diffusion-based image generation |
| openai | 2.2.0 | LLM API calls |
| scikit-learn | 1.3.2 | Machine learning utilities |
| pyspark | 3.5.8 | Large-scale data processing |
Note: If you do not have a CUDA 12.1 GPU, you may need to install a CPU-only version of PyTorch separately. See the PyTorch installation guide.
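To confirm which PyTorch build ended up in the environment, a small check along these lines may help. It is a sketch, not part of the repository: `describe_torch_build` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch_build():
    """Report the installed PyTorch build, or note that torch is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

If `cuda_build` is `None` or `cuda_available` is `False`, the environment is running a CPU-only build or cannot see a GPU.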
- Place (or symlink) your social data under Data/ as expected by Data_preprocess/extract_tables.py:
  - posts.jsonl (required), optionally posts_all.jsonl
  - accounts.jsonl (required for the user table)

  The released sample under Data/raw_data/ only contains ID lists and small *_example.jsonl snippets; for the full pipeline you need full posts.jsonl / accounts.jsonl (or pass explicit paths).
- Run the eight-step preprocessing pipeline from the repo root (uses PySpark in step 1):
cd Data_preprocess
python main.py
# Optional: custom Data dir or JSONL paths
# python main.py --base_dir /path/to/Data --posts_path ... --accounts_path ...
main.py runs, in order: extract_tables → extract_network → split_dataset → generate_samples → filter_texts → compute_message_features → compute_user_features → compute_features_and_labels.
You can also import and run any single module's main() for debugging. Intermediate artifacts are written under Data/ (e.g. features, splits, outputs/ for .pt training packs).
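Before launching the pipeline, it can be useful to verify that the required JSONL inputs exist and parse line by line. The following is a minimal sketch under stated assumptions: the helper names `check_jsonl` and `check_data_dir` are hypothetical, the files are assumed to sit directly under the data directory, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path

# File names taken from the setup notes above.
REQUIRED_FILES = ("posts.jsonl", "accounts.jsonl")


def check_jsonl(path):
    """Return (number of valid records, line numbers that failed to parse)."""
    valid, bad_lines = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad_lines.append(line_no)
    return valid, bad_lines


def check_data_dir(data_dir):
    """Map each required file name to its check result, or None if missing."""
    data_dir = Path(data_dir)
    report = {}
    for name in REQUIRED_FILES:
        path = data_dir / name
        report[name] = check_jsonl(path) if path.is_file() else None
    return report


if __name__ == "__main__":
    for name, result in check_data_dir("Data").items():
        print(name, "missing" if result is None else result)
```

A `None` entry means the file is absent; a non-empty list of line numbers points at records worth inspecting before step 1 runs.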
Train the cascade / diffusion predictor on the preprocessed .pt samples:
cd influence_indicator
python train_pijc_model.py
By default this reads Data/outputs/outputs_train and Data/outputs/outputs_test, and saves checkpoints under Data/outputs/model/ (e.g. model_epoch_100.pth referenced elsewhere in the repo).
Optional (Jupyter evaluation): open and run influence_indicator/Evaluation.ipynb.
If cells fail to import Long-CLIP weights, note that the repo directory is Long_CLIP/ (with an underscore); align sys.path and model_path in the notebook with that folder and your weight file, e.g. longclip-B.pt.
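One way to do that alignment from a notebook cell is sketched below. `add_long_clip_to_path` is a hypothetical helper, and it assumes the weight file sits directly inside Long_CLIP/, which may not match your setup.

```python
import sys
from pathlib import Path


def add_long_clip_to_path(repo_root, weights_name="longclip-B.pt"):
    """Put <repo_root>/Long_CLIP on sys.path and return the assumed weight path."""
    long_clip_dir = Path(repo_root).resolve() / "Long_CLIP"  # underscore, as in the repo tree
    if not long_clip_dir.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {long_clip_dir}")
    if str(long_clip_dir) not in sys.path:
        sys.path.insert(0, str(long_clip_dir))
    # Assumption: the weights live directly in Long_CLIP/; change this if yours do not.
    return long_clip_dir / weights_name
```

From influence_indicator/Evaluation.ipynb, a call such as `model_path = add_long_clip_to_path("..")` would point at the repository root one level up.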
Core logic lives in information_editor/ (text_editor.py, visual_editor.py). Entry points are under scripts/ from the repository root:
Text editing
python scripts/run_text_editor.py
# See defaults and overrides:
python scripts/run_text_editor.py --help
Image / visual editing (expects CSVs with tweet_id, tweet_text, image_prompt, plus images under --orig_image_folder):
python scripts/run_visual_editor.py
python scripts/run_visual_editor.py --help
Defaults assume a trained PIJC model at Data/outputs/model/model_epoch_100.pth, user features at Data/features/user_features_dict.pt, and tweet splits under Data/tweet2image/ (see script --help).
Tweet → image preprocessing & train/test split (uses LLM_API_KEY / LLM_BASE_URL from a .env file; generates prompts, images, and train_tweets.csv / test_tweets.csv):
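Since the visual editor expects CSVs with tweet_id, tweet_text, and image_prompt columns, a quick header check can catch a malformed file before a long run. This is a sketch using only the standard library; `missing_columns` is a hypothetical helper and is not part of the repository.

```python
import csv

# Column names taken from the description of the visual editor's input above.
REQUIRED_COLUMNS = ("tweet_id", "tweet_text", "image_prompt")


def missing_columns(csv_path):
    """Return the required columns that are absent from the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

An empty list means all three columns are present; it does not verify that the referenced images exist under --orig_image_folder.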
python scripts/tweet2image_preprocess_and_split.py
# Edit the paths in the `if __name__ == "__main__"` block if your CSV/output locations differ.
Two Python scripts (no notebooks):
# LLM-style baseline
python baselines/llm_baseline.py --help
# In-context learning baseline
python baselines/in_context_learning.py --help
Use --text and/or --image to select modalities; see argparse defaults for CSVs, model paths, and output dirs.
There is no single scripts/run_pipeline.sh. A typical order is:
- Data_preprocess/main.py → features & Data/outputs/ tensors
- influence_indicator/train_pijc_model.py → Data/outputs/model/*.pth
- (Optional) scripts/tweet2image_preprocess_and_split.py → Data/tweet2image/ for the image track
- scripts/run_text_editor.py / scripts/run_visual_editor.py (and/or baselines/*.py)

Configure paths via each script's CLI flags or the hard-coded defaults in tweet2image_preprocess_and_split.py's main block where applicable.
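Because each stage consumes the previous stage's outputs, a small pre-flight check can show which default artifacts are still missing. The sketch below uses the default paths mentioned in this README; the grouping into stages and the helper name `missing_outputs` are assumptions made for illustration, not something the repository provides.

```python
from pathlib import Path

# Default locations named in this README, relative to the repository root.
# Assumption: user_features_dict.pt is produced by the preprocessing stage.
STAGE_OUTPUTS = {
    "preprocess": [
        "Data/outputs/outputs_train",
        "Data/outputs/outputs_test",
        "Data/features/user_features_dict.pt",
    ],
    "train_indicator": ["Data/outputs/model/model_epoch_100.pth"],
    "tweet2image (optional)": ["Data/tweet2image"],
}


def missing_outputs(repo_root="."):
    """Return, per stage, the default output paths that do not exist yet."""
    root = Path(repo_root)
    return {
        stage: [p for p in paths if not (root / p).exists()]
        for stage, paths in STAGE_OUTPUTS.items()
    }


if __name__ == "__main__":
    for stage, missing in missing_outputs().items():
        print(f"{stage}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

If you override paths through CLI flags, adjust `STAGE_OUTPUTS` to match; the check only knows about the defaults.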
If you find this work useful in your research, please consider citing:
@inproceedings{qian2026designed,
title = {Designed to Spread: A Generative Approach to Enhance Information Diffusion},
author = {Qian, Ziqing and Lei, Jiaying and Dang, Shengqi and Cao, Nan},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {40},
number = {2},
pages = {944--952},
year = {2026},
doi = {10.1609/aaai.v40i2.37063},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/37063}
}
If you have any questions, feel free to open an issue or contact us at 2411920@tongji.edu.cn.
This project is from the Intelligent Big Data Visualization Lab (iDVX Lab) at Tongji University. The projects released here are associated with publications from the lab.