Unexpected behavior regarding re.findall() and different syntax for searching for the same sequence #145604

@ketchupfan

Bug report

Bug description:

There appears to be a bug in the findall() function of the re library: two different patterns that search for the same sequence do not produce the same results. For example, when searching a DNA sequence for a GC-rich region, re.findall("[GC]{12,}", dnaString) and re.findall("(G|C){12,}", dnaString) produce different results:

import re

dnaString = "GCCGCGGGGGCCCCCGCGCCCGGGGATATTATAAAGGGGGGGGCCCCCCCCCCCCCCCCCCCCGC"
allGCrich = re.findall("[GC]{12,}", dnaString)
print(allGCrich)
# prints "['GCCGCGGGGGCCCCCGCGCCCGGGG', 'GGGGGGGGCCCCCCCCCCCCCCCCCCCCGC']" as desired

allGCrich = re.findall("(G|C){12,}", dnaString)
print(allGCrich)
# prints "['G', 'C']" which doesn't appear to be correct
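
One way to narrow this down is to check whether the capturing group is what changes the output. The sketch below is an added illustration, not part of the original test: it reruns the search with a non-capturing group, `(?:G|C)`, and compares the result with the character-class version. The variable names `non_capturing` and `character_class` are hypothetical.

```python
import re

dnaString = "GCCGCGGGGGCCCCCGCGCCCGGGGATATTATAAAGGGGGGGGCCCCCCCCCCCCCCCCCCCCGC"

# Same alternation as before, but (?:...) groups without capturing.
non_capturing = re.findall("(?:G|C){12,}", dnaString)

# Character-class version from the report, used as the reference result.
character_class = re.findall("[GC]{12,}", dnaString)

# True if both patterns return the same list of full runs.
print(non_capturing == character_class)
```

If the two lists agree, the difference seems to come from the capturing parentheses rather than from the alternation itself, which would be consistent with the re documentation's note that findall() returns group contents when the pattern contains groups.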

This does not appear to occur with some of the other functions in the re library, such as search(): re.search("(G|C){12,}", dnaString) and re.search("[GC]{12,}", dnaString) produce the same result, as desired.
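
To make the search() comparison concrete, here is a minimal sketch (added for illustration, with hypothetical variable names) that inspects the match object returned for the capturing pattern. It assumes the relevant distinction is between the whole match, `group(0)`, and the capturing group, `group(1)`, and it also uses re.finditer() to collect whole matches without changing the pattern.

```python
import re

dnaString = "GCCGCGGGGGCCCCCGCGCCCGGGGATATTATAAAGGGGGGGGCCCCCCCCCCCCCCCCCCCCGC"

match = re.search("(G|C){12,}", dnaString)
whole_match = match.group(0)   # the entire matched run
last_capture = match.group(1)  # what the repeated group (G|C) captured last

print(whole_match)
print(last_capture)

# finditer() yields match objects, so group(0) gives each full run
# even though the pattern still contains a capturing group.
full_runs = [m.group(0) for m in re.finditer("(G|C){12,}", dnaString)]
print(full_runs)
```

Here search() looks consistent across the two patterns because printing or comparing the match reflects `group(0)`, whereas findall() with a capturing group appears to report the group's content instead.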

CPython versions tested on:

3.14

Operating systems tested on:

macOS

Labels: type-bug (An unexpected behavior, bug, or error)