
refactor: unify linear/quantization architecture and remove deprecated interfaces #366

Open
qinyiqun wants to merge 1 commit into main from refactor/unify-linear-quantization

Conversation

@qinyiqun (Contributor) commented on May 12, 2026

Summary

  • Move linear module from InfiniCore to InfiniLM with quantization-based dispatch (see the sketch after this list)
  • Add GPTQ->GPTQ_QY weight conversion gated by QY device type
  • Implement fused linear weight splitting and re-registration
  • Fix TP split dimensions for all quantization schemes
  • Add alpha scaling parameter and logical dim size delegation
  • Move set_zeros/set_minus_one to utils.hpp as shared utilities
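
To illustrate the dispatch mentioned in the first bullet, here is a minimal C++ sketch, not the actual InfiniLM code: a factory picks a linear implementation from the model's quantization scheme, so callers never switch on the scheme themselves. The names `QuantScheme`, `LinearImpl`, `DenseLinear`, `GptqLinear`, and `make_linear` are illustrative assumptions.

```cpp
#include <memory>
#include <stdexcept>

// Illustrative quantization schemes; the real enum lives in the model config.
enum class QuantScheme { None, GPTQ, GPTQ_QY };

// Common interface shared by all linear implementations.
struct LinearImpl {
    virtual ~LinearImpl() = default;
    // `alpha` mirrors the output-scaling parameter added in this PR.
    virtual void forward(const float *x, float *y, float alpha) const = 0;
};

struct DenseLinear final : LinearImpl {
    void forward(const float *, float *, float) const override { /* plain GEMM */ }
};

struct GptqLinear final : LinearImpl {
    void forward(const float *, float *, float) const override { /* dequantize, then GEMM */ }
};

// Callers ask the factory for a linear layer; the scheme switch lives here,
// not at every call site.
std::unique_ptr<LinearImpl> make_linear(QuantScheme scheme) {
    switch (scheme) {
    case QuantScheme::None:
        return std::make_unique<DenseLinear>();
    case QuantScheme::GPTQ:
    case QuantScheme::GPTQ_QY:
        // GPTQ_QY would select the QY-converted weight layout in the real code.
        return std::make_unique<GptqLinear>();
    }
    throw std::runtime_error("Unsupported quantization scheme.");
}
```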

Motivation

The linear/quantization modules should be moved from InfiniCore to InfiniLM.

Closes #

Type of Change

  • feat — new feature / new model
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from main — the branch is rebased cleanly on top of the current main.
  • No fixup! / squash! / wip commits remain.
  • Existing PRs, branches, or commits that already followed the legacy issue format are acceptable.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • No raw new/delete; RAII / smart pointers / existing allocators are used.
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to python/infinilm/auto_config.py.

Testing

  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • Passed single request test (examples/test_infer.py), or specify the reason for skipping.
  • Passed offline performance test (examples/bench.py), or specify the reason for skipping.
  • Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
  • Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Commit: refactor: unify linear/quantization architecture and remove deprecated interfaces

- Move linear module from InfiniCore to InfiniLM with quantization-based dispatch
- Add GPTQ->GPTQ_QY weight conversion gated by QY device type
- Implement fused linear weight splitting and re-registration
- Fix TP split dimensions for all quantization schemes
- Add alpha scaling parameter and logical dim size delegation
- Move set_zeros/set_minus_one to utils.hpp as shared utilities
@qinyiqun requested a review from a team on May 12, 2026 at 03:27.
@qinyiqun (Contributor, Author) commented:

Question for discussion: in InfiniLM, should modules be declared and initialized via macros, or held as smart pointers?
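
For reference, a rough sketch of the two options under discussion; `INFINILM_MODULE`, `register_module`, and the `Linear` stand-in are hypothetical names, not the actual InfiniLM API.

```cpp
#include <memory>

// Stand-in for the real linear module.
struct Linear {
    Linear(int in, int out) : in_features(in), out_features(out) {}
    int in_features;
    int out_features;
};

// Option A: a declaration/initialization macro that expands to a plain member.
// A real macro could also register the submodule with the parent module.
#define INFINILM_MODULE(name, Type, ...) Type name{__VA_ARGS__}

struct AttentionMacroStyle {
    INFINILM_MODULE(qkv_proj, Linear, 4096, 12288);
};

// Option B: the submodule is held behind a smart pointer and registered
// explicitly in the constructor body.
struct AttentionPointerStyle {
    std::shared_ptr<Linear> qkv_proj;
    AttentionPointerStyle()
        : qkv_proj(std::make_shared<Linear>(4096, 12288)) {
        // register_module("qkv_proj", qkv_proj);  // hypothetical registration call
    }
};
```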

Comment thread csrc/cache/kv_cache.cpp
k_dim_},
dtype_,
rank_info.device);
set_zeros(k_caches_);
Collaborator:

Does the KV cache need to be zeroed? Is this a requirement of a particular platform?

Contributor (Author):

The domestic accelerators return dirty memory: a buffer obtained via `malloc` that is not zeroed still contains whatever data was there before.
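
For context, a minimal sketch of the idea behind such a zero-initialization helper, using a plain buffer as a stand-in; the real `set_zeros` in this PR operates on InfiniCore tensors and device memory, so the type and call below are assumptions.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for a freshly allocated device buffer; the real code would issue a
// device memset or fill kernel instead of a host memset.
struct RawBuffer {
    void *data;
    std::size_t size_in_bytes;
};

// Zero the buffer immediately after allocation so stale bytes left by the
// allocator can never be read back as cache contents.
void set_zeros_sketch(RawBuffer &buf) {
    std::memset(buf.data, 0, buf.size_in_bytes);
}
```

Every code path that allocates KV-cache storage would then call the helper right after allocation.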

Collaborator:

Four functions in kv_cache.cpp allocate the KV cache, so all four of them need this zeroing added.

rank_info_.device,
pending_cache_config_ != nullptr ? pending_cache_config_.get() : nullptr);
} else {
std::vector<std::string> classic_models = {"llama", "qwen2", "minicpm", "fm9g", "fm9g7b"};
Collaborator:

Please do not remove this `classic_models` code for now. If the `llama_legacy` folder is to be removed, that should be done in a separate PR.

break;
}
}
auto register_fn = [this](const std::string &n, infinicore::nn::Parameter p) { this->register_parameter(n, std::move(p)); };
Collaborator:

Can the `register_fn` variable be moved into the `init_kv_cache_quant_params` function?

break;
}
}
init_kv_cache_quant_params(register_fn, device, kv_cache_k_scale_, kv_cache_v_scale_);
Collaborator:

Could this function be a private member of the `Attention` class and be called by the class itself?
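
A rough sketch of the suggested refactor, with assumed names and signatures (not the actual InfiniLM code): `init_kv_cache_quant_params` becomes a private member of the attention class and calls `register_parameter` directly, so no external `register_fn` lambda is needed.

```cpp
#include <string>

// Minimal stand-in for the parameter type.
struct Parameter {
    float value = 1.0f;
};

class AttentionSketch {
public:
    AttentionSketch() {
        // The quant-param setup is now an implementation detail of the class.
        init_kv_cache_quant_params();
    }

private:
    // Registers the KV-cache scale parameters from inside the class, so the
    // caller no longer builds and passes a `register_fn` lambda.
    void init_kv_cache_quant_params() {
        register_parameter("kv_cache_k_scale", Parameter{});
        register_parameter("kv_cache_v_scale", Parameter{});
    }

    void register_parameter(const std::string & /*name*/, Parameter /*p*/) {
        // Placeholder for the real registration logic.
    }
};
```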

Comment thread csrc/layers/mlp/mlp.cpp
break;
}
}
auto register_fn = [this](const std::string &n, infinicore::nn::Parameter p) { this->register_parameter(n, std::move(p)); };
Collaborator:

Isn't this `register_fn` variable unused?

@@ -19,12 +19,12 @@ MoeMLP::MoeMLP(std::shared_ptr<infinilm::config::ModelConfig> model_config,
auto quant_scheme = model_config->get_quant_scheme();
auto quantization_method = model_config->get_quantization_method();
switch (quant_scheme) {
Collaborator:

Isn't the `switch (quant_scheme)` already hidden inside `linear`? Why is another `switch (quant_scheme)` needed here?

Collaborator:

Can `ColumnParallelLinear` and `RowParallelLinear` call the constructor that takes a quantization parameter?

Contributor (Author):

It looks like this file was added in after my changes; I will fix it.

break;
}
}
infinilm::layers::attention::init_kv_cache_quant_params(register_fn, device, kv_cache_k_scale_, kv_cache_v_scale_);
Collaborator:

Should the `init_kv_cache_quant_params` function be reused here?

std::shared_ptr<infinilm::config::ModelConfig> model_config,
engine::distributed::RankInfo rank_info = engine::distributed::RankInfo(),
const cache::CacheConfig *cache = nullptr,
backends::AttentionBackend attention_backend = backends::AttentionBackend::Default);
Collaborator:

If this constructor is removed as well, `llama_legacy` will no longer be reachable.


} // namespace infinilm::nn

#include "fused_linear.hpp"
Collaborator:

Move the `#include` to the top of the file.

const size_t block_per_req = nblocks;
input.block_tables = block_tables_holder_->as_strided({b, block_per_req}, {(ptrdiff_t)block_per_req, 1});
input.slot_mapping = infinicore::Tensor::empty({b}, infinicore::DataType::I64, infinicore::context::getDevice());
set_zeros(input.slot_mapping.value());
Collaborator:

Why was this line deleted? Is it no longer necessary to reset `input.slot_mapping`?
