Add ore, pkp, scielo_br, scielo_mx, scielo_preprints-jats corpora to eval.yml by de-code · Pull Request #617 · eLifePathways/sciencebeam-parser

de-code · 2026-05-27T21:21:27Z

Part of https://github.com/eLifePathways/ScienceBeam2.0/issues/83

Extends both the train and validation splits in eval.yml with the five additional corpora from the sciencebeam-v2-benchmarking dataset. Smoke sampling stays at 10 documents per corpus; full counts match the dataset row counts. No code changes are needed: predict.py and score.py already iterate over each split's corpus keys dynamically.
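
For readers unfamiliar with the setup, here is a minimal sketch of why adding corpora needs no code changes when scripts iterate over a split's corpus keys. The config layout, the `smoke` key, and the `iter_corpora` helper are assumptions made for illustration; the real eval.yml schema and the internals of predict.py and score.py are not shown in this PR.

```python
# Hypothetical stand-in for a parsed eval.yml; the real schema may differ.
# Corpus names are taken from the PR title and the evaluation comment.
EVAL_CONFIG = {
    "validation": {
        "biorxiv": {"smoke": 10},
        "ore": {"smoke": 10},
        "pkp": {"smoke": 10},
        "scielo_br": {"smoke": 10},
        "scielo_mx": {"smoke": 10},
        "scielo_preprints-jats": {"smoke": 10},
    },
}


def iter_corpora(config, split, mode="smoke"):
    """Yield (corpus_name, sample_size) for every corpus key in a split.

    Nothing here names a corpus explicitly, so a new key in the config
    is picked up without touching the code.
    """
    for corpus, sizes in config[split].items():
        yield corpus, sizes[mode]


plan = list(iter_corpora(EVAL_CONFIG, "validation"))
for corpus, sample_size in plan:
    print(f"{corpus}: {sample_size} docs")
```

Under this assumed layout, adding a corpus is a pure config edit: one more key under each split.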

github-actions · 2026-05-27T21:37:39Z

ScienceBeam Parser Evaluation

Overall (59 docs across 6 corpora)

grobid 0.9.0-crf: 60 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 9 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 59 docs

Field (method)	Type	grobid 0.9.0-crf	sciencebeam-parser:main-3874e53e-20260527.2141	sciencebeam-parser:pr-617-3afbeb64-20260527.2150	Δ grobid 0.9.0-crf	Δ sciencebeam-parser:main-3874e53e-20260527.2141
title (exact)	string	0.504	0.941	0.453	-0.052	-0.488
title (levenshtein)	string	0.643	0.941	0.509	-0.134	-0.432
title (edit_sim)	string	0.639	0.944	0.605	-0.034	-0.338
abstract (levenshtein)	string	0.642	0.364	0.516	-0.126	+0.152
abstract (edit_sim)	string	0.662	0.541	0.546	-0.116	+0.005
author_full_names (levenshtein)	partial_ulist	0.701	0.962	0.679	-0.023	-0.283
author_full_names (edit_sim)	partial_ulist	0.717	0.926	0.704	-0.013	-0.222
affiliation_text (levenshtein)	partial_ulist	0.000	0.907	0.481	+0.481	-0.426
affiliation_text (edit_sim)	partial_ulist	0.000	0.891	0.532	+0.532	-0.359
keywords (levenshtein)	partial_ulist	0.500	0.000	0.000	-0.500	+0.000
keywords (edit_sim)	partial_ulist	0.457	0.000	0.000	-0.457	+0.000
body_section_titles (levenshtein)	partial_list	0.223	0.819	0.295	+0.072	-0.524
body_section_titles (edit_sim)	partial_list	0.224	0.764	0.298	+0.074	-0.467
acknowledgement (levenshtein)	string	0.264	0.875	0.303	+0.039	-0.572
acknowledgement (edit_sim)	string	0.374	0.871	0.418	+0.044	-0.453
first_reference_text (levenshtein)	string	0.000	0.875	0.386	+0.386	-0.489
first_reference_text (edit_sim)	string	0.000	0.870	0.554	+0.554	-0.316
reference_title (levenshtein)	partial_list	0.282	0.291	0.298	+0.015	+0.007
reference_title (edit_sim)	partial_list	0.306	0.343	0.321	+0.015	-0.022
reference_doi (levenshtein)	partial_ulist	0.546	0.860	0.315	-0.231	-0.545
reference_doi (edit_sim)	partial_ulist	0.448	0.775	0.324	-0.125	-0.452
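
The Δ columns appear to be the PR build's score minus the named baseline's score, so a positive value means the PR build scored higher. The sketch below reproduces that arithmetic for one row; `format_delta` is a hypothetical helper, not a function from the evaluation scripts.

```python
def format_delta(pr_score: float, baseline_score: float) -> str:
    """Return the signed difference (PR minus baseline) to three decimals."""
    return f"{pr_score - baseline_score:+.3f}"


# abstract (levenshtein) row from the Overall table:
# grobid = 0.642, main = 0.364, PR build = 0.516
delta_vs_grobid = format_delta(0.516, 0.642)
delta_vs_main = format_delta(0.516, 0.364)
print(delta_vs_grobid, delta_vs_main)
```

Some reported deltas differ from this recomputation by 0.001 (for example title (exact): 0.453 - 0.504 gives -0.051, while the table shows -0.052), which suggests the report subtracts unrounded scores before rounding. Note also that in the Overall table the main build covers 9 docs while the PR build covers 59, so the Δ against main compares different document sets.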

biorxiv (9 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 9 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 9 docs

Field (method)	Type	grobid 0.9.0-crf	sciencebeam-parser:main-3874e53e-20260527.2141	sciencebeam-parser:pr-617-3afbeb64-20260527.2150	Δ grobid 0.9.0-crf	Δ sciencebeam-parser:main-3874e53e-20260527.2141
title (exact)	string	0.889	0.941	0.941	+0.052	+0.000
title (levenshtein)	string	0.947	0.941	0.941	-0.006	+0.000
title (edit_sim)	string	0.939	0.944	0.944	+0.005	+0.000
abstract (levenshtein)	string	0.947	0.364	0.364	-0.584	+0.000
abstract (edit_sim)	string	0.947	0.541	0.541	-0.406	+0.000
author_full_names (levenshtein)	partial_ulist	0.970	0.962	0.962	-0.008	+0.000
author_full_names (edit_sim)	partial_ulist	0.933	0.926	0.926	-0.007	+0.000
affiliation_text (levenshtein)	partial_ulist	0.000	0.907	0.907	+0.907	+0.000
affiliation_text (edit_sim)	partial_ulist	0.000	0.891	0.891	+0.891	+0.000
keywords (levenshtein)	partial_ulist	0.901	0.000	0.000	-0.901	+0.000
keywords (edit_sim)	partial_ulist	0.907	0.000	0.000	-0.907	+0.000
body_section_titles (levenshtein)	partial_list	0.516	0.819	0.819	+0.303	+0.000
body_section_titles (edit_sim)	partial_list	0.472	0.764	0.764	+0.293	+0.000
acknowledgement (levenshtein)	string	0.750	0.875	0.875	+0.125	+0.000
acknowledgement (edit_sim)	string	0.726	0.871	0.871	+0.145	+0.000
first_reference_text (levenshtein)	string	0.000	0.875	0.875	+0.875	+0.000
first_reference_text (edit_sim)	string	0.000	0.870	0.870	+0.870	+0.000
reference_title (levenshtein)	partial_list	0.766	0.291	0.291	-0.475	+0.000
reference_title (edit_sim)	partial_list	0.727	0.343	0.343	-0.385	+0.000
reference_doi (levenshtein)	partial_ulist	0.954	0.860	0.860	-0.094	+0.000
reference_doi (edit_sim)	partial_ulist	0.871	0.775	0.775	-0.096	+0.000

ore (10 docs)