I am a graduate student at Xi'an Jiaotong University (XJTU), focusing on AI infrastructure, LLM serving, and RL post-training systems.
I enjoy turning systems ideas into practical open-source implementations: efficient rollout execution, distributed training workflows, weight synchronization, and CUDA kernel optimization for GRPO-style workloads.
- Currently contributing to: vLLM-Omni, a framework for efficient omni-modality model inference and serving.
- Core contributor to: RL-Kernel / Kernel-Align, building high-performance RL post-training infrastructure.
- Research interests: Efficient inference, multimodal serving, GRPO/RLHF systems, CUDA kernels, and AI4S/PINN applications.
- Writing: Engineering notes and learning records on my personal blog.

| Project | Focus | Status |
|---|---|---|
| vLLM-Omni | Efficient omni-modality model inference and serving in the vLLM ecosystem | 🔥 Contributing |
| RL-Kernel / Kernel-Align | High-performance infrastructure for RL post-training and kernel optimization | ⚡ Core Contributor |

| Area | Selected Work |
|---|---|
| vLLM Rollout | Shared-prefix caching for GRPO candidate generation, lazy sampler construction, grouped outputs, and normalized rollout schemas |
| Training & Distributed Runtime | DeepSpeed training workers, Ray actor orchestration, health checks, cleanup, and real CUDA/NCCL smoke validation |
| Overlap Pipeline | Asynchronous rollout and training execution with explicit versioned weight publication |
| Weight Synchronization | Low-copy and shared-memory bridge contracts with publish/import/ack/release lifecycle handling |
| RL Kernel Validation | RL-shaped fixtures, PyTorch reference operators, benchmark adapters, and loss-step tests for logprob, ratio, KL, masking, and objective drift |
| CUDA Optimization | Fused selected-logprob kernel paths for GRPO-style workloads with RL-shaped benchmark and profiling evidence |
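
For readers unfamiliar with the terms in the last two rows, here is a minimal pure-Python sketch of what a "selected-logprob" reference operator and the ratio/KL/masking checks might compute. It is an illustration only, not code from RL-Kernel / Kernel-Align: the function names `selected_logprobs` and `masked_ratio_and_kl` are hypothetical, and the KL term assumes the `exp(d) - d - 1` estimator commonly used with GRPO.

```python
import math


def selected_logprobs(logits, token_ids):
    """Log-probability of the chosen token at each position.

    logits: list of per-position rows (one float per vocabulary entry).
    token_ids: the sampled token index for each position.
    """
    out = []
    for row, tok in zip(logits, token_ids):
        # Numerically stable log-sum-exp over the vocabulary axis.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[tok] - lse)
    return out


def masked_ratio_and_kl(new_lp, old_lp, ref_lp, mask):
    """Mean importance ratio and mean KL estimate over unmasked tokens.

    Assumption: KL uses the estimator exp(d) - d - 1 with d = ref_lp - new_lp.
    """
    ratios, kls = [], []
    for n, o, r, keep in zip(new_lp, old_lp, ref_lp, mask):
        if not keep:
            continue  # padding or prompt tokens do not contribute
        ratios.append(math.exp(n - o))
        d = r - n
        kls.append(math.exp(d) - d - 1.0)
    count = max(len(ratios), 1)  # avoid division by zero on a fully masked row
    return sum(ratios) / count, sum(kls) / count


# Two positions, vocabulary of two tokens.
logits = [[0.0, 0.0], [0.0, math.log(3.0)]]
lp = selected_logprobs(logits, [0, 1])
mean_ratio, mean_kl = masked_ratio_and_kl(lp, lp, lp, [1, 1])
```

A fused kernel would aim to produce the same per-token values without first materializing the full log-softmax over the vocabulary, which is why a slow reference like this is useful for checking drift.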


