Filter repeating running heads and feet before segmentation#616

Draft
de-code wants to merge 5 commits into main from filter-header-footer

Collaborator

@de-code de-code commented May 27, 2026

related to https://github.com/eLifePathways/ScienceBeam2.0/issues/61

Add a pre-segmentation noise filter that detects layout blocks
repeating at the top or bottom of pages across a document (running
heads, running feet) using position and cross-page text repetition.
Detected blocks are excluded from the segmentation model input and
preserved in the output XML as <note type="running-head"> /
<note type="running-foot"> elements for auditability.

Enabled by default via noise_filter_enabled in config.yml.
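
As a rough illustration of the idea described above (not the code in this PR), the sketch below groups blocks by normalized text within a fixed top or bottom zone and flags text that recurs on several distinct pages. The `Block` class, the `find_running_blocks` and `to_note_xml` helpers, and the `min_pages`, `top` and `bottom` parameters are all hypothetical names and values chosen for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block (not the parser's real data model)."""
    page: int     # zero-based page index
    rel_y: float  # vertical centre as a page fraction: 0.0 = top, 1.0 = bottom
    text: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_running_blocks(blocks, min_pages=3, top=0.2, bottom=0.8):
    """Map each repeated top/bottom block to 'running-head' or 'running-foot'."""
    groups = defaultdict(list)
    for block in blocks:
        if block.rel_y <= top:
            groups[("running-head", normalize(block.text))].append(block)
        elif block.rel_y >= bottom:
            groups[("running-foot", normalize(block.text))].append(block)
    noise = {}
    for (note_type, _), members in groups.items():
        # Require the same text on several distinct pages, not just several times.
        if len({b.page for b in members}) >= min_pages:
            noise.update({b: note_type for b in members})
    return noise


def to_note_xml(block, note_type):
    """Serialize a filtered block as a <note type="..."> element."""
    note = ET.Element("note", {"type": note_type})
    note.text = block.text
    return ET.tostring(note, encoding="unicode")
```

In this sketch, a footer such as a page number changes on every page, so it never forms a repeating group and is left for the segmentation model.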
@de-code de-code changed the title from "Filter running headers, footers and page numbers before segmentation" to "Filter repeating running heads and feet before segmentation" May 27, 2026

github-actions Bot commented May 27, 2026

ScienceBeam Parser Evaluation

Overall (59 docs across 6 corpora)

grobid 0.9.0-crf: 60 docs | sciencebeam-parser:main-83dec818-20260527.2241: 59 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 59 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.504 | 0.453 | 0.431 | -0.073 | -0.022 |
| title (levenshtein) | string | 0.643 | 0.509 | 0.488 | -0.155 | -0.022 |
| title (edit_sim) | string | 0.639 | 0.605 | 0.599 | -0.041 | -0.006 |
| abstract (levenshtein) | string | 0.642 | 0.516 | 0.622 | -0.020 | +0.106 |
| abstract (edit_sim) | string | 0.662 | 0.546 | 0.616 | -0.045 | +0.071 |
| author_full_names (levenshtein) | partial_ulist | 0.701 | 0.679 | 0.677 | -0.024 | -0.002 |
| author_full_names (edit_sim) | partial_ulist | 0.717 | 0.704 | 0.703 | -0.014 | -0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.476 | 0.494 | +0.494 | +0.018 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.532 | 0.546 | +0.546 | +0.014 |
| keywords (levenshtein) | partial_ulist | 0.500 | 0.000 | 0.000 | -0.500 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.457 | 0.000 | 0.000 | -0.457 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.223 | 0.295 | 0.287 | +0.064 | -0.008 |
| body_section_titles (edit_sim) | partial_list | 0.224 | 0.298 | 0.291 | +0.067 | -0.007 |
| acknowledgement (levenshtein) | string | 0.264 | 0.303 | 0.483 | +0.219 | +0.180 |
| acknowledgement (edit_sim) | string | 0.374 | 0.418 | 0.477 | +0.102 | +0.059 |
| first_reference_text (levenshtein) | string | 0.000 | 0.386 | 0.386 | +0.386 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.554 | 0.554 | +0.554 | +0.000 |
| reference_title (levenshtein) | partial_list | 0.282 | 0.298 | 0.439 | +0.156 | +0.141 |
| reference_title (edit_sim) | partial_list | 0.306 | 0.321 | 0.429 | +0.123 | +0.108 |
| reference_doi (levenshtein) | partial_ulist | 0.546 | 0.315 | 0.339 | -0.207 | +0.024 |
| reference_doi (edit_sim) | partial_ulist | 0.448 | 0.324 | 0.347 | -0.101 | +0.024 |

biorxiv (9 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 9 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 9 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.889 | 0.941 | 0.941 | +0.052 | +0.000 |
| title (levenshtein) | string | 0.947 | 0.941 | 0.941 | -0.006 | +0.000 |
| title (edit_sim) | string | 0.939 | 0.944 | 0.944 | +0.005 | +0.000 |
| abstract (levenshtein) | string | 0.947 | 0.364 | 0.875 | -0.072 | +0.511 |
| abstract (edit_sim) | string | 0.947 | 0.541 | 0.870 | -0.076 | +0.329 |
| author_full_names (levenshtein) | partial_ulist | 0.970 | 0.962 | 0.962 | -0.008 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.933 | 0.926 | 0.926 | -0.007 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.907 | 0.962 | +0.962 | +0.054 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.891 | 0.958 | +0.958 | +0.068 |
| keywords (levenshtein) | partial_ulist | 0.901 | 0.000 | 0.000 | -0.901 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.907 | 0.000 | 0.000 | -0.907 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.516 | 0.819 | 0.833 | +0.317 | +0.014 |
| body_section_titles (edit_sim) | partial_list | 0.472 | 0.764 | 0.785 | +0.313 | +0.020 |
| acknowledgement (levenshtein) | string | 0.750 | 0.875 | 0.941 | +0.191 | +0.066 |
| acknowledgement (edit_sim) | string | 0.726 | 0.871 | 0.903 | +0.177 | +0.032 |
| first_reference_text (levenshtein) | string | 0.000 | 0.875 | 0.875 | +0.875 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.870 | 0.872 | +0.872 | +0.002 |
| reference_title (levenshtein) | partial_list | 0.766 | 0.291 | 0.765 | -0.001 | +0.475 |
| reference_title (edit_sim) | partial_list | 0.727 | 0.343 | 0.704 | -0.023 | +0.362 |
| reference_doi (levenshtein) | partial_ulist | 0.954 | 0.860 | 0.950 | -0.004 | +0.090 |
| reference_doi (edit_sim) | partial_ulist | 0.871 | 0.775 | 0.880 | +0.009 | +0.104 |

ore (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 10 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.462 | 1.000 | 1.000 | +0.538 | +0.000 |
| title (levenshtein) | string | 0.462 | 1.000 | 1.000 | +0.538 | +0.000 |
| title (edit_sim) | string | 0.547 | 1.000 | 1.000 | +0.453 | +0.000 |
| abstract (levenshtein) | string | 0.571 | 0.571 | 0.571 | +0.000 | +0.000 |
| abstract (edit_sim) | string | 0.680 | 0.600 | 0.618 | -0.062 | +0.018 |
| author_full_names (levenshtein) | partial_ulist | 0.757 | 0.897 | 0.938 | +0.182 | +0.041 |
| author_full_names (edit_sim) | partial_ulist | 0.757 | 0.898 | 0.939 | +0.182 | +0.041 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.805 | 0.824 | +0.824 | +0.019 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.802 | 0.821 | +0.821 | +0.019 |
| keywords (levenshtein) | partial_ulist | 0.431 | 0.000 | 0.000 | -0.431 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.395 | 0.000 | 0.000 | -0.395 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.276 | 0.101 | 0.107 | -0.169 | +0.006 |
| body_section_titles (edit_sim) | partial_list | 0.301 | 0.189 | 0.195 | -0.106 | +0.006 |
| acknowledgement (levenshtein) | string | 0.833 | 1.000 | 1.000 | +0.167 | +0.000 |
| acknowledgement (edit_sim) | string | 0.888 | 1.000 | 1.000 | +0.112 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.462 | 0.462 | +0.462 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.739 | 0.723 | +0.723 | -0.016 |
| reference_title (levenshtein) | partial_list | 0.237 | 0.424 | 0.536 | +0.299 | +0.112 |
| reference_title (edit_sim) | partial_list | 0.270 | 0.439 | 0.521 | +0.251 | +0.082 |
| reference_doi (levenshtein) | partial_ulist | 0.681 | 0.016 | 0.019 | -0.662 | +0.003 |
| reference_doi (edit_sim) | partial_ulist | 0.489 | 0.004 | 0.004 | -0.485 | -0.000 |

pkp (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 10 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.571 | 0.182 | 0.182 | -0.390 | +0.000 |
| title (levenshtein) | string | 0.667 | 0.182 | 0.182 | -0.485 | +0.000 |
| title (edit_sim) | string | 0.564 | 0.317 | 0.317 | -0.247 | +0.000 |
| abstract (levenshtein) | string | 0.824 | 0.889 | 0.889 | +0.065 | +0.000 |
| abstract (edit_sim) | string | 0.744 | 0.730 | 0.730 | -0.014 | +0.000 |
| author_full_names (levenshtein) | partial_ulist | 0.853 | 0.831 | 0.831 | -0.022 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.843 | 0.845 | 0.845 | +0.002 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.489 | 0.489 | +0.489 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.438 | 0.438 | +0.438 | +0.000 |
| keywords (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_title (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_title (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |

scielo_br (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 10 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.462 | 0.462 | 0.333 | -0.128 | -0.128 |
| title (levenshtein) | string | 0.462 | 0.462 | 0.333 | -0.128 | -0.128 |
| title (edit_sim) | string | 0.483 | 0.479 | 0.443 | -0.040 | -0.036 |
| abstract (levenshtein) | string | 0.714 | 0.462 | 0.500 | -0.214 | +0.038 |
| abstract (edit_sim) | string | 0.620 | 0.477 | 0.526 | -0.094 | +0.049 |
| author_full_names (levenshtein) | partial_ulist | 0.667 | 0.571 | 0.549 | -0.118 | -0.022 |
| author_full_names (edit_sim) | partial_ulist | 0.708 | 0.645 | 0.615 | -0.093 | -0.030 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.348 | 0.327 | +0.327 | -0.021 |
| keywords (levenshtein) | partial_ulist | 0.429 | 0.000 | 0.000 | -0.429 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.387 | 0.000 | 0.000 | -0.387 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.247 | 0.375 | 0.278 | +0.031 | -0.097 |
| body_section_titles (edit_sim) | partial_list | 0.242 | 0.399 | 0.309 | +0.067 | -0.090 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 1.000 | +1.000 | +1.000 |
| acknowledgement (edit_sim) | string | 0.632 | 0.683 | 1.000 | +0.368 | +0.317 |
| first_reference_text (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.385 | 0.384 | +0.384 | -0.001 |
| reference_title (levenshtein) | partial_list | 0.147 | 0.379 | 0.313 | +0.166 | -0.066 |
| reference_title (edit_sim) | partial_list | 0.248 | 0.423 | 0.360 | +0.111 | -0.064 |
| reference_doi (levenshtein) | partial_ulist | 0.889 | 0.333 | 0.333 | -0.556 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.671 | 0.538 | 0.538 | -0.133 | +0.000 |

scielo_mx (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 10 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.182 | 0.182 | 0.182 | -0.000 | +0.000 |
| title (levenshtein) | string | 0.571 | 0.333 | 0.333 | -0.238 | +0.000 |
| title (edit_sim) | string | 0.589 | 0.393 | 0.391 | -0.199 | -0.002 |
| abstract (levenshtein) | string | 0.333 | 0.333 | 0.462 | +0.128 | +0.128 |
| abstract (edit_sim) | string | 0.524 | 0.439 | 0.492 | -0.032 | +0.053 |
| author_full_names (levenshtein) | partial_ulist | 0.389 | 0.323 | 0.294 | -0.095 | -0.028 |
| author_full_names (edit_sim) | partial_ulist | 0.398 | 0.332 | 0.319 | -0.078 | -0.012 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.168 | 0.168 | +0.168 | +0.000 |
| keywords (levenshtein) | partial_ulist | 0.532 | 0.000 | 0.000 | -0.532 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.447 | 0.000 | 0.000 | -0.447 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.364 | 0.364 | +0.364 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.575 | 0.590 | +0.590 | +0.014 |
| reference_title (levenshtein) | partial_list | 0.368 | 0.198 | 0.404 | +0.036 | +0.207 |
| reference_title (edit_sim) | partial_list | 0.361 | 0.242 | 0.390 | +0.029 | +0.149 |
| reference_doi (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |

scielo_preprints-jats (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 10 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-83dec818-20260527.2241 | sciencebeam-parser:pr-616-00ef481e-20260601.1137 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-83dec818-20260527.2241 |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| title (exact) | string | 0.462 | 0.000 | 0.000 | -0.462 | +0.000 |
| title (levenshtein) | string | 0.750 | 0.182 | 0.182 | -0.568 | +0.000 |
| title (edit_sim) | string | 0.714 | 0.533 | 0.533 | -0.180 | +0.000 |
| abstract (levenshtein) | string | 0.462 | 0.462 | 0.462 | +0.000 | +0.000 |
| abstract (edit_sim) | string | 0.457 | 0.486 | 0.488 | +0.030 | +0.001 |
| author_full_names (levenshtein) | partial_ulist | 0.574 | 0.517 | 0.517 | -0.057 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.663 | 0.598 | 0.598 | -0.064 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.696 | 0.737 | +0.737 | +0.041 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.580 | 0.606 | +0.606 | +0.026 |
| keywords (levenshtein) | partial_ulist | 0.706 | 0.000 | 0.000 | -0.706 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.608 | 0.000 | 0.000 | -0.608 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.302 | 0.528 | 0.559 | +0.256 | +0.030 |
| body_section_titles (edit_sim) | partial_list | 0.328 | 0.480 | 0.505 | +0.177 | +0.025 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.667 | 0.667 | +0.667 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.787 | 0.789 | +0.789 | +0.003 |
| reference_title (levenshtein) | partial_list | 0.177 | 0.495 | 0.645 | +0.468 | +0.151 |
| reference_title (edit_sim) | partial_list | 0.231 | 0.482 | 0.626 | +0.395 | +0.144 |
| reference_doi (levenshtein) | partial_ulist | 0.751 | 0.734 | 0.792 | +0.041 | +0.057 |
| reference_doi (edit_sim) | partial_ulist | 0.659 | 0.669 | 0.714 | +0.055 | +0.045 |

Collaborator Author

de-code commented Jun 1, 2026

For scielo_br, the title extraction score is lower because the title is repeated across all pages, although on the first page it is larger and at a slightly different position (e.g. for S1413-24782006000300002).

Rewrite the noise block classifier to avoid hardcoded 0.2/0.8 page
fractions. Instead, the top/bottom quartile of each page's own block
y-distribution defines the noise zone, so the threshold adapts to each
document's layout.

Two additional guards prevent false positives:
- Stddev check: position must be stable across occurrences
- Height check: occurrences whose block height exceeds 2× the group
  median are not filtered (catches a large title on page 1 that also
  repeats as a small footer on pages 2+, as seen in scielo_br)
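
The classifier described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the `Block` fields, the function names, and the `min_pages`, `max_y_stddev` and `max_height_ratio` defaults are hypothetical, and only the 2× median height factor comes from the message above.

```python
from dataclasses import dataclass
from statistics import median, pstdev, quantiles


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block (not the parser's real data model)."""
    page: int
    y: float       # top coordinate of the block, in page units
    height: float
    text: str


def page_zones(blocks):
    """Return {page: (q1, q3)} from each page's own block y-distribution."""
    ys_by_page = {}
    for block in blocks:
        ys_by_page.setdefault(block.page, []).append(block.y)
    zones = {}
    for page, ys in ys_by_page.items():
        if len(ys) < 2:
            continue  # too few blocks to estimate quartiles
        q1, _, q3 = quantiles(ys, n=4, method="inclusive")
        zones[page] = (q1, q3)
    return zones


def classify_noise(blocks, min_pages=3, max_y_stddev=5.0, max_height_ratio=2.0):
    """Map filtered blocks to 'running-head' or 'running-foot'."""
    zones = page_zones(blocks)
    groups = {}
    for block in blocks:
        if block.page not in zones:
            continue
        q1, q3 = zones[block.page]
        if block.y <= q1:
            zone = "running-head"
        elif block.y >= q3:
            zone = "running-foot"
        else:
            continue  # middle of the page: never treated as noise
        groups.setdefault((zone, block.text.strip().lower()), []).append(block)

    noise = {}
    for (zone, _), members in groups.items():
        if len({b.page for b in members}) < min_pages:
            continue  # text does not repeat across enough pages
        # Stddev check: the position must be stable across occurrences.
        if pstdev([b.y for b in members]) > max_y_stddev:
            continue
        # Height check: keep unusually tall occurrences, e.g. a first-page title.
        median_height = median([b.height for b in members])
        for b in members:
            if b.height <= max_height_ratio * median_height:
                noise[b] = zone
    return noise
```

With a title that repeats as a small running head on later pages but is three times taller on the first page, this sketch filters only the later occurrences and leaves the first-page title in the segmentation input.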
@de-code de-code deployed to benchmark June 1, 2026 11:37 — with GitHub Actions Active
Collaborator Author

de-code commented Jun 2, 2026

I have not yet looked at the validation dataset, but there are two cases in the training dataset where the PDF doesn't match the JATS document:

  • 10.1590/S0102-311X2001000400007
  • S0102-311X2001000400008

Both are titled "Debate on the paper by ...", whereas the PDF is probably the original paper rather than the debate. The DOI links to the debate and the JATS reflects that content.

In those two cases, the title is now extracted as English + Portuguese, whereas before it was English only.
