Is there a problem with the Telechat-12b tokenizer? #79

@Sun-Xiaohui

In the vast majority of cases, encode and decode should be inverse operations, but the 12b model's tokenizer behaves as follows:

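from transformers import AutoTokenizer

PATH = '/toolchain/LLM/telechat-12b-hf'
tokenizer = AutoTokenizer.from_pretrained(PATH, trust_remote_code=True)
# Round trip through a single id: 2000 comes back as two different ids
print(tokenizer.encode(tokenizer.decode([2000])))  # [561, 579]
# Two different ids decode to the same string
print(tokenizer.decode([579]))   # 'red'
print(tokenizer.encode('red'))   # [2952]
print(tokenizer.decode([2952]))  # 'red'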
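
To check how widespread this is, the same round trip can be run over a range of ids. Below is a minimal sketch; `find_roundtrip_mismatches` is a hypothetical helper, not part of transformers. The toy vocabulary is made up purely for illustration: it assumes a SentencePiece-style word-boundary marker that decode strips, which is one possible way for two ids to decode to the same string. I have not confirmed that this is what happens in Telechat-12b.

```python
from typing import Callable, Iterable, List, Tuple

def find_roundtrip_mismatches(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    token_ids: Iterable[int],
) -> List[Tuple[int, str, List[int]]]:
    """Return (id, decoded_text, re_encoded_ids) for every id that does not round-trip."""
    mismatches = []
    for tid in token_ids:
        text = decode([tid])
        ids = encode(text)
        if ids != [tid]:
            mismatches.append((tid, text, ids))
    return mismatches

# Toy tokenizer (hypothetical, for illustration only).
MARKER = "\u2581"  # word-boundary marker used by SentencePiece-style vocabularies
VOCAB = {1: MARKER + "red", 2: "red", 3: "blue"}
TEXT_TO_ID = {piece: tid for tid, piece in VOCAB.items()}

def toy_decode(ids: List[int]) -> str:
    # Join pieces, turn the marker into a space, drop one leading space.
    text = "".join(VOCAB[i] for i in ids).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text

def toy_encode(text: str) -> List[int]:
    # Exact-match lookup only; enough for this toy vocabulary.
    return [TEXT_TO_ID[text]]

for tid, text, ids in find_roundtrip_mismatches(toy_encode, toy_decode, VOCAB):
    print(tid, repr(text), ids)
```

With the real tokenizer, I would expect something like `find_roundtrip_mismatches(lambda t: tokenizer.encode(t, add_special_tokens=False), tokenizer.decode, range(len(tokenizer)))` to work, though the custom tokenizer code loaded via `trust_remote_code=True` may behave differently.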

Could you explain this? @hannawong @ZiYu0427 @liuxz0801 @Unknown-Body @LSX-Sneakerprogrammer
