inaniloquentee/README.md


👨‍💻 About Me

I am a graduate student at Xi'an Jiaotong University (XJTU), focusing on AI infrastructure, LLM serving, and RL post-training systems.

I enjoy turning systems ideas into practical open-source implementations: efficient rollout execution, distributed training workflows, weight synchronization, and CUDA kernel optimization for GRPO-style workloads.

  • 🔭 Currently contributing to: vLLM-Omni, a framework for efficient omni-modality model inference and serving.
  • 🚀 Core contributor to: RL-Kernel / Kernel-Align, building high-performance RL post-training infrastructure.
  • 🔬 Research interests: Efficient inference, multimodal serving, GRPO/RLHF systems, CUDA kernels, and AI4S/PINN applications.
  • 📝 Writing: Engineering notes and learning records on my personal blog.

🚀 Current Focus: Open Source

| Project | Focus | Status |
| --- | --- | --- |
| vLLM-Omni | Efficient omni-modality model inference and serving in the vLLM ecosystem | 🔥 Contributing |
| RL-Kernel / Kernel-Align | High-performance infrastructure for RL post-training and kernel optimization | ⚡ Core Contributor |

🧩 Selected RL-Kernel Work

| Area | Selected Work |
| --- | --- |
| vLLM Rollout | Shared-prefix caching for GRPO candidate generation, lazy sampler construction, grouped outputs, and normalized rollout schemas |
| Training & Distributed Runtime | DeepSpeed training workers, Ray actor orchestration, health checks, cleanup, and real CUDA/NCCL smoke validation |
| Overlap Pipeline | Asynchronous rollout and training execution with explicit versioned weight publication |
| Weight Synchronization | Low-copy and shared-memory bridge contracts with publish/import/ack/release lifecycle handling |
| RL Kernel Validation | RL-shaped fixtures, PyTorch reference operators, benchmark adapters, and loss-step tests for logprob, ratio, KL, masking, and objective drift |
| CUDA Optimization | Fused selected-logprob kernel paths for GRPO-style workloads with RL-shaped benchmark and profiling evidence |
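
For readers unfamiliar with the quantities named under RL Kernel Validation, the following is a minimal plain-Python sketch of what a reference for selected logprob, ratio, KL, and masking might compute. It is not RL-Kernel code: the function names `selected_logprob` and `masked_token_stats` are hypothetical, and the KL term assumes the non-negative estimator used in the GRPO paper, which may differ from what the project tests against.

```python
import math


def selected_logprob(logits, token_id):
    """Log-probability of one chosen token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - log_z


def masked_token_stats(new_logps, old_logps, ref_logps, mask):
    """Mean importance ratio and mean KL estimate over unmasked tokens.

    Hypothetical helper: inputs are per-token selected logprobs from the
    current policy, the rollout policy, and a reference policy.
    """
    ratios, kls = [], []
    for new, old, ref, keep in zip(new_logps, old_logps, ref_logps, mask):
        if not keep:
            continue  # masked positions (e.g. padding) are ignored
        ratios.append(math.exp(new - old))  # pi_new / pi_old
        d = ref - new
        kls.append(math.exp(d) - d - 1.0)  # assumed estimator, always >= 0
    if not ratios:
        raise ValueError("mask removed every token")
    n = len(ratios)
    return sum(ratios) / n, sum(kls) / n
```

A fused kernel would avoid materializing the full log-softmax and compute only the selected entry per token; a reference like this serves only as the correctness baseline that such a kernel is compared against.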

🛠️ Tech Stack

| Languages & Core | AI Infrastructure | Distributed & Tooling |
| --- | --- | --- |
| C++, Python, CUDA | PyTorch, vLLM, DeepSpeed | Ray, Linux, Git |

📊 GitHub Analytics




Activity Graph

Building efficient systems, one kernel and one iteration at a time.

Pinned

  1. vllm-omni (Public)

    Forked from vllm-project/vllm-omni

    A framework for efficient model inference with omni-modality models

    Python

  2. Mooncake (Public)

    Forked from kvcache-ai/Mooncake

    Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

    C++

  3. DeepSpeed (Public)

    Forked from deepspeedai/DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

    Python

  4. RL-Kernel (Public)

    Forked from RL-Align/RL-Kernel

    Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM integration, Triton kernels, and transparent hardware-aware scaling.

    Python 2

  5. vime (Public)

    Forked from vllm-project/vime

    An LLM post-training framework with vLLM for RL Scaling

    Python