Filter repeating running heads and feet before segmentation by de-code · Pull Request #616 · eLifePathways/sciencebeam-parser

de-code · 2026-05-27T21:07:18Z

Related to https://github.com/eLifePathways/ScienceBeam2.0/issues/61

Add a pre-segmentation noise filter that detects layout blocks whose text repeats at the top or bottom of pages across a document (running heads and running feet), using block position and cross-page text repetition. Detected blocks are excluded from the segmentation model input and preserved in the output XML as <note type="running-head"> / <note type="running-foot"> elements for auditability.

Enabled by default via noise_filter_enabled in config.yml.
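The detection idea described above (position plus cross-page text repetition) can be illustrated with a minimal sketch. This is not the code in this PR: the names LayoutBlock and find_running_blocks, the normalised coordinates, the margin and page-count thresholds, and the choice to mask digits so that changing page numbers still match are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LayoutBlock:
    """Hypothetical layout block; coordinates are normalised to page height."""
    page: int
    top: float      # 0.0 = top edge of the page, 1.0 = bottom edge
    bottom: float
    text: str


def normalize(text: str) -> str:
    # Assumption: mask digit runs so "Page 1" and "Page 2" compare equal.
    text = re.sub(r"\d+", "#", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def zone_of(block: LayoutBlock, margin: float) -> Optional[str]:
    # A block counts as a candidate only if it lies fully inside a margin band.
    if block.bottom <= margin:
        return "running-head"
    if block.top >= 1.0 - margin:
        return "running-foot"
    return None


def find_running_blocks(
    blocks: List[LayoutBlock],
    margin: float = 0.1,              # assumed size of the top/bottom band
    min_page_fraction: float = 0.5,   # assumed share of pages that must repeat
    min_pages: int = 2,
) -> Dict[LayoutBlock, str]:
    """Map each detected block to 'running-head' or 'running-foot'."""
    page_count = len({b.page for b in blocks})
    pages_by_key = defaultdict(set)
    for b in blocks:
        zone = zone_of(b, margin)
        key = normalize(b.text)
        if zone and key:
            pages_by_key[(zone, key)].add(b.page)

    threshold = max(min_pages, min_page_fraction * page_count)
    detected: Dict[LayoutBlock, str] = {}
    for b in blocks:
        zone = zone_of(b, margin)
        if zone is None:
            continue
        pages = pages_by_key.get((zone, normalize(b.text)), ())
        if len(pages) >= threshold:
            detected[b] = zone
    return detected
```

In a pipeline like the one described, the blocks returned here would be withheld from the segmentation model input and written out separately, with the returned label deciding which note type each block gets. A block that sits in the top band on a single page only, such as the article title on page one, is left alone because it does not repeat.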

github-actions · 2026-05-27T21:16:22Z

ScienceBeam Parser Evaluation

Overall (59 docs across 6 corpora)

grobid 0.9.0-crf: 60 docs | sciencebeam-parser:main-83dec818-20260527.2241: 59 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 59 docs

Field (method)	Type	grobid 0.9.0-crf	sciencebeam-parser:main-83dec818-20260527.2241	sciencebeam-parser:pr-616-00ef481e-20260601.1137	Δ grobid 0.9.0-crf	Δ sciencebeam-parser:main-83dec818-20260527.2241
title (exact)	string	0.504	0.453	0.431	-0.073	-0.022
title (levenshtein)	string	0.643	0.509	0.488	-0.155	-0.022
title (edit_sim)	string	0.639	0.605	0.599	-0.041	-0.006
abstract (levenshtein)	string	0.642	0.516	0.622	-0.020	+0.106
abstract (edit_sim)	string	0.662	0.546	0.616	-0.045	+0.071
author_full_names (levenshtein)	partial_ulist	0.701	0.679	0.677	-0.024	-0.002
author_full_names (edit_sim)	partial_ulist	0.717	0.704	0.703	-0.014	-0.000
affiliation_text (levenshtein)	partial_ulist	0.000	0.476	0.494	+0.494	+0.018
affiliation_text (edit_sim)	partial_ulist	0.000	0.532	0.546	+0.546	+0.014
keywords (levenshtein)	partial_ulist	0.500	0.000	0.000	-0.500	+0.000
keywords (edit_sim)	partial_ulist	0.457	0.000	0.000	-0.457	+0.000
body_section_titles (levenshtein)	partial_list	0.223	0.295	0.287	+0.064	-0.008
body_section_titles (edit_sim)	partial_list	0.224	0.298	0.291	+0.067	-0.007
acknowledgement (levenshtein)	string	0.264	0.303	0.483	+0.219	+0.180
acknowledgement (edit_sim)	string	0.374	0.418	0.477	+0.102	+0.059
first_reference_text (levenshtein)	string	0.000	0.386	0.386	+0.386	+0.000
first_reference_text (edit_sim)	string	0.000	0.554	0.554	+0.554	+0.000
reference_title (levenshtein)	partial_list	0.282	0.298	0.439	+0.156	+0.141
reference_title (edit_sim)	partial_list	0.306	0.321	0.429	+0.123	+0.108
reference_doi (levenshtein)	partial_ulist	0.546	0.315	0.339	-0.207	+0.024
reference_doi (edit_sim)	partial_ulist	0.448	0.324	0.347	-0.101	+0.024

biorxiv (9 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-83dec818-20260527.2241: 9 docs | sciencebeam-parser:pr-616-00ef481e-20260601.1137: 9 docs

Field (method)	Type	grobid 0.9.0-crf	sciencebeam-parser:main-83dec818-20260527.2241	sciencebeam-parser:pr-616-00ef481e-20260601.1137	Δ grobid 0.9.0-crf	Δ sciencebeam-parser:main-83dec818-20260527.2241
title (exact)	string	0.889	0.941	0.941	+0.052	+0.000
title (levenshtein)	string	0.947	0.941	0.941	-0.006	+0.000
title (edit_sim)	string	0.939	0.944	0.944	+0.005	+0.000
abstract (levenshtein)	string	0.947	0.364	0.875	-0.072	+0.511
abstract (edit_sim)	string	0.947	0.541	0.870	-0.076	+0.329
author_full_names (levenshtein)	partial_ulist	0.970	0.962	0.962	-0.008	+0.000
author_full_names (edit_sim)	partial_ulist	0.933	0.926	0.926	-0.007	+0.000
affiliation_text (levenshtein)	partial_ulist	0.000	0.907	0.962	+0.962	+0.054
affiliation_text (edit_sim)	partial_ulist	0.000	0.891	0.958	+0.958	+0.068
keywords (levenshtein)	partial_ulist	0.901	0.000	0.000	-0.901	+0.000
keywords (edit_sim)	partial_ulist	0.907	0.000	0.000	-0.907	+0.000
body_section_titles (levenshtein)	partial_list	0.516	0.819	0.833	+0.317	+0.014
body_section_titles (edit_sim)	partial_list	0.472	0.764	0.785	+0.313	+0.020
acknowledgement (levenshtein)	string	0.750	0.875	0.941	+0.191	+0.066
acknowledgement (edit_sim)	string	0.726	0.871	0.903	+0.177	+0.032
first_reference_text (levenshtein)	string	0.000	0.875	0.875	+0.875	+0.000
first_reference_text (edit_sim)	string	0.000	0.870	0.872	+0.872	+0.002
reference_title (levenshtein)	partial_list	0.766	0.291	0.765	-0.001	+0.475
reference_title (edit_sim)	partial_list	0.727	0.343	0.704	-0.023	+0.362
reference_doi (levenshtein)	partial_ulist	0.954	0.860	0.950	-0.004	+0.090
reference_doi (edit_sim)	partial_ulist	0.871	0.775	0.880	+0.009	+0.104

ore (10 docs)