Filter repeating running heads and feet before segmentation #616
de-code wants to merge 5 commits
Related to eLifePathways/ScienceBeam2.0#61

Add a pre-segmentation noise filter that detects layout blocks repeating at the top or bottom of pages across a document (running heads, running feet), using position and cross-page text repetition. Detected blocks are excluded from the segmentation model input and preserved in the output XML as `<note type="running-head">` / `<note type="running-foot">` elements for auditability.

Enabled by default via `noise_filter_enabled` in `config.yml`.
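To make the description concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes each block already carries a `zone` label (how that zone is derived is out of scope here), groups top and bottom blocks by normalized text, treats a group seen on at least two pages as running content, and emits those blocks as `<note>` elements. `LayoutBlock`, `normalize`, `split_running_blocks`, `to_note`, the digit masking, and the `min_pages` threshold are all hypothetical names and choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
import re
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class LayoutBlock:  # hypothetical stand-in for the parser's layout block type
    page: int
    zone: str  # "top", "bottom" or "body"; assumed to be computed upstream
    text: str


NOTE_TYPE_BY_ZONE = {"top": "running-head", "bottom": "running-foot"}


def normalize(text):
    # Assumption: mask digit runs so "Page 3" and "Page 4" compare equal,
    # then collapse whitespace and lowercase.
    return re.sub(r"\s+", " ", re.sub(r"\d+", "#", text)).strip().lower()


def to_note(block):
    # Keep the filtered text in the output so the decision can be audited.
    note = ET.Element("note", {"type": NOTE_TYPE_BY_ZONE[block.zone]})
    note.text = block.text
    return note


def split_running_blocks(blocks, min_pages=2):
    # Group candidate blocks by (zone, normalized text).
    groups = defaultdict(list)
    for index, block in enumerate(blocks):
        if block.zone in NOTE_TYPE_BY_ZONE:
            groups[(block.zone, normalize(block.text))].append(index)

    # A group is noise only if its text repeats on enough distinct pages.
    noise_indices = set()
    for indices in groups.values():
        if len({blocks[i].page for i in indices}) >= min_pages:
            noise_indices.update(indices)

    kept = [b for i, b in enumerate(blocks) if i not in noise_indices]
    notes = [to_note(blocks[i]) for i in sorted(noise_indices)]
    return kept, notes
```

In this sketch `kept` is what would be passed on to the segmentation model, while `notes` would be appended to the output XML.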
ScienceBeam Parser Evaluation

Number of documents evaluated per corpus and system:

| Corpus | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 |
| --- | --- | --- | --- |
| Overall (59 docs across 6 corpora) | 60 | 59 | 59 |
| biorxiv (9 docs) | 10 | 9 | 9 |
| ore (10 docs) | 10 | 10 | 10 |
| pkp (10 docs) | 10 | 10 | 10 |
| scielo_br (10 docs) | 10 | 10 | 10 |
| scielo_mx (10 docs) | 10 | 10 | 10 |
| scielo_preprints-jats (10 docs) | 10 | 10 | 10 |
Rewrite the noise block classifier to avoid hardcoded 0.2/0.8 page fractions. Instead, the top/bottom quartile of each page's own block y-distribution defines the noise zone, so the threshold adapts to each document's layout.

Two additional guards prevent false positives:
- Stddev check: position must be stable across occurrences
- Height check: occurrences whose block height exceeds 2× the group median are not filtered (catches a large title on page 1 that also repeats as a small footer on pages 2+, as seen in scielo_br)
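The classifier described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the PR's implementation: `Block`, `assign_zones`, and `classify_noise` are hypothetical names, `y` is taken to be the block's vertical centre as a fraction of page height, text is matched exactly, and the `max_y_stddev`, `min_pages`, and `min_blocks_per_page` values are made up. Only the quartile zones and the 2× median height ratio come from the commit message.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median, pstdev, quantiles


@dataclass(frozen=True)
class Block:  # hypothetical layout block
    page: int
    y: float       # vertical centre, 0.0 = top of page, 1.0 = bottom
    height: float  # block height as a fraction of page height
    text: str


def assign_zones(blocks, min_blocks_per_page=4):
    # Per page, the first and third quartile of block y positions bound the zones.
    by_page = defaultdict(list)
    for index, block in enumerate(blocks):
        by_page[block.page].append(index)
    zones = [None] * len(blocks)
    for indices in by_page.values():
        if len(indices) < min_blocks_per_page:
            continue  # too few blocks to estimate quartiles (assumption)
        q1, _, q3 = quantiles([blocks[i].y for i in indices], n=4)
        for i in indices:
            if blocks[i].y <= q1:
                zones[i] = "top"
            elif blocks[i].y >= q3:
                zones[i] = "bottom"
    return zones


def classify_noise(blocks, max_y_stddev=0.02, max_height_ratio=2.0, min_pages=2):
    zones = assign_zones(blocks)
    groups = defaultdict(list)
    for index, (block, zone) in enumerate(zip(blocks, zones)):
        if zone is not None:
            groups[(zone, block.text)].append(index)

    is_noise = [False] * len(blocks)
    for indices in groups.values():
        members = [blocks[i] for i in indices]
        if len({b.page for b in members}) < min_pages:
            continue  # text does not repeat across pages
        # Stddev check: the position must be stable across occurrences.
        if pstdev([b.y for b in members]) > max_y_stddev:
            continue
        # Height check: skip occurrences much taller than the group median.
        median_height = median([b.height for b in members])
        for i in indices:
            if blocks[i].height <= max_height_ratio * median_height:
                is_noise[i] = True
    return is_noise
```

With this shape, an unusually tall occurrence (such as a large title on page 1) stays in the model input, while the same text at normal height on later pages is filtered.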
I have not looked at the validation dataset, but there are two cases in the training dataset where the PDF doesn't match the JATS document:
Both are In those two cases, the title is now extracted as English + Portuguese, whereas before it was English only.