Add ore, pkp, scielo_br, scielo_mx, scielo_preprints-jats corpora to eval.yml#617

Merged
de-code merged 3 commits into main from expand-datasets
May 27, 2026
@de-code de-code commented May 27, 2026

part of https://github.com/eLifePathways/ScienceBeam2.0/issues/83

Extends both train and validation splits with the five additional corpora from the sciencebeam-v2-benchmarking dataset. Smoke sampling stays at 10 per corpus; full counts match the dataset row counts. No code changes needed — predict.py and score.py already iterate over the split's corpus keys dynamically.
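A minimal sketch of why no code change is needed, assuming the split is loaded as a mapping from corpus key to its sample counts. The `EVAL_CONFIG` layout, the `smoke` field name, and `iter_corpora` are hypothetical illustrations, not the actual schema of eval.yml or the code in predict.py and score.py:

```python
# Hypothetical in-memory view of one split from eval.yml; the real schema may differ.
# Only the smoke sample size (10 per corpus) is taken from the PR description;
# full counts are omitted here because they depend on the dataset row counts.
EVAL_CONFIG = {
    "validation": {
        "biorxiv": {"smoke": 10},
        "ore": {"smoke": 10},
        "pkp": {"smoke": 10},
        "scielo_br": {"smoke": 10},
        "scielo_mx": {"smoke": 10},
        "scielo_preprints-jats": {"smoke": 10},
    }
}


def iter_corpora(config, split, mode="smoke"):
    """Yield (corpus, sample_size) for every corpus key present in the split."""
    for corpus, counts in config[split].items():
        yield corpus, counts[mode]


# Adding a corpus to the config is enough for it to be picked up here.
for corpus, sample_size in iter_corpora(EVAL_CONFIG, "validation"):
    print(f"{corpus}: {sample_size} docs")
```

Because the loop is driven by the keys of the split rather than a hard-coded list, a new corpus entry is processed without touching the iteration code.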


github-actions Bot commented May 27, 2026

ScienceBeam Parser Evaluation

Overall (59 docs across 6 corpora)

grobid 0.9.0-crf: 60 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 9 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 59 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.504 | 0.941 | 0.453 | -0.052 | -0.488 |
| title (levenshtein) | string | 0.643 | 0.941 | 0.509 | -0.134 | -0.432 |
| title (edit_sim) | string | 0.639 | 0.944 | 0.605 | -0.034 | -0.338 |
| abstract (levenshtein) | string | 0.642 | 0.364 | 0.516 | -0.126 | +0.152 |
| abstract (edit_sim) | string | 0.662 | 0.541 | 0.546 | -0.116 | +0.005 |
| author_full_names (levenshtein) | partial_ulist | 0.701 | 0.962 | 0.679 | -0.023 | -0.283 |
| author_full_names (edit_sim) | partial_ulist | 0.717 | 0.926 | 0.704 | -0.013 | -0.222 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.907 | 0.481 | +0.481 | -0.426 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.891 | 0.532 | +0.532 | -0.359 |
| keywords (levenshtein) | partial_ulist | 0.500 | 0.000 | 0.000 | -0.500 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.457 | 0.000 | 0.000 | -0.457 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.223 | 0.819 | 0.295 | +0.072 | -0.524 |
| body_section_titles (edit_sim) | partial_list | 0.224 | 0.764 | 0.298 | +0.074 | -0.467 |
| acknowledgement (levenshtein) | string | 0.264 | 0.875 | 0.303 | +0.039 | -0.572 |
| acknowledgement (edit_sim) | string | 0.374 | 0.871 | 0.418 | +0.044 | -0.453 |
| first_reference_text (levenshtein) | string | 0.000 | 0.875 | 0.386 | +0.386 | -0.489 |
| first_reference_text (edit_sim) | string | 0.000 | 0.870 | 0.554 | +0.554 | -0.316 |
| reference_title (levenshtein) | partial_list | 0.282 | 0.291 | 0.298 | +0.015 | +0.007 |
| reference_title (edit_sim) | partial_list | 0.306 | 0.343 | 0.321 | +0.015 | -0.022 |
| reference_doi (levenshtein) | partial_ulist | 0.546 | 0.860 | 0.315 | -0.231 | -0.545 |
| reference_doi (edit_sim) | partial_ulist | 0.448 | 0.775 | 0.324 | -0.125 | -0.452 |

biorxiv (9 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 9 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 9 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.889 | 0.941 | 0.941 | +0.052 | +0.000 |
| title (levenshtein) | string | 0.947 | 0.941 | 0.941 | -0.006 | +0.000 |
| title (edit_sim) | string | 0.939 | 0.944 | 0.944 | +0.005 | +0.000 |
| abstract (levenshtein) | string | 0.947 | 0.364 | 0.364 | -0.584 | +0.000 |
| abstract (edit_sim) | string | 0.947 | 0.541 | 0.541 | -0.406 | +0.000 |
| author_full_names (levenshtein) | partial_ulist | 0.970 | 0.962 | 0.962 | -0.008 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.933 | 0.926 | 0.926 | -0.007 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.907 | 0.907 | +0.907 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.891 | 0.891 | +0.891 | +0.000 |
| keywords (levenshtein) | partial_ulist | 0.901 | 0.000 | 0.000 | -0.901 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.907 | 0.000 | 0.000 | -0.907 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.516 | 0.819 | 0.819 | +0.303 | +0.000 |
| body_section_titles (edit_sim) | partial_list | 0.472 | 0.764 | 0.764 | +0.293 | +0.000 |
| acknowledgement (levenshtein) | string | 0.750 | 0.875 | 0.875 | +0.125 | +0.000 |
| acknowledgement (edit_sim) | string | 0.726 | 0.871 | 0.871 | +0.145 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.875 | 0.875 | +0.875 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.870 | 0.870 | +0.870 | +0.000 |
| reference_title (levenshtein) | partial_list | 0.766 | 0.291 | 0.291 | -0.475 | +0.000 |
| reference_title (edit_sim) | partial_list | 0.727 | 0.343 | 0.343 | -0.385 | +0.000 |
| reference_doi (levenshtein) | partial_ulist | 0.954 | 0.860 | 0.860 | -0.094 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.871 | 0.775 | 0.775 | -0.096 | +0.000 |

ore (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 0 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | | 1.000 | +0.538 | |
| title (levenshtein) | string | 0.462 | | 1.000 | +0.538 | |
| title (edit_sim) | string | 0.547 | | 1.000 | +0.453 | |
| abstract (levenshtein) | string | 0.571 | | 0.571 | +0.000 | |
| abstract (edit_sim) | string | 0.680 | | 0.600 | -0.080 | |
| author_full_names (levenshtein) | partial_ulist | 0.757 | | 0.897 | +0.140 | |
| author_full_names (edit_sim) | partial_ulist | 0.757 | | 0.898 | +0.142 | |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | | 0.805 | +0.805 | |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | | 0.802 | +0.802 | |
| keywords (levenshtein) | partial_ulist | 0.431 | | 0.000 | -0.431 | |
| keywords (edit_sim) | partial_ulist | 0.395 | | 0.000 | -0.395 | |
| body_section_titles (levenshtein) | partial_list | 0.276 | | 0.101 | -0.175 | |
| body_section_titles (edit_sim) | partial_list | 0.301 | | 0.189 | -0.112 | |
| acknowledgement (levenshtein) | string | 0.833 | | 1.000 | +0.167 | |
| acknowledgement (edit_sim) | string | 0.888 | | 1.000 | +0.112 | |
| first_reference_text (levenshtein) | string | 0.000 | | 0.462 | +0.462 | |
| first_reference_text (edit_sim) | string | 0.000 | | 0.739 | +0.739 | |
| reference_title (levenshtein) | partial_list | 0.237 | | 0.424 | +0.187 | |
| reference_title (edit_sim) | partial_list | 0.270 | | 0.439 | +0.169 | |
| reference_doi (levenshtein) | partial_ulist | 0.681 | | 0.016 | -0.665 | |
| reference_doi (edit_sim) | partial_ulist | 0.489 | | 0.004 | -0.485 | |

pkp (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 0 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.571 | | 0.182 | -0.390 | |
| title (levenshtein) | string | 0.667 | | 0.182 | -0.485 | |
| title (edit_sim) | string | 0.564 | | 0.317 | -0.247 | |
| abstract (levenshtein) | string | 0.824 | | 0.889 | +0.065 | |
| abstract (edit_sim) | string | 0.744 | | 0.730 | -0.014 | |
| author_full_names (levenshtein) | partial_ulist | 0.853 | | 0.831 | -0.022 | |
| author_full_names (edit_sim) | partial_ulist | 0.843 | | 0.845 | +0.002 | |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | | 0.522 | +0.522 | |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | | 0.438 | +0.438 | |
| keywords (levenshtein) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| keywords (edit_sim) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| body_section_titles (levenshtein) | partial_list | 0.000 | | 0.000 | +0.000 | |
| body_section_titles (edit_sim) | partial_list | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (edit_sim) | string | 0.000 | | 0.000 | +0.000 | |
| first_reference_text (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| first_reference_text (edit_sim) | string | 0.000 | | 0.000 | +0.000 | |
| reference_title (levenshtein) | partial_list | 0.000 | | 0.000 | +0.000 | |
| reference_title (edit_sim) | partial_list | 0.000 | | 0.000 | +0.000 | |
| reference_doi (levenshtein) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| reference_doi (edit_sim) | partial_ulist | 0.000 | | 0.000 | +0.000 | |

scielo_br (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 0 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | | 0.462 | +0.000 | |
| title (levenshtein) | string | 0.462 | | 0.462 | +0.000 | |
| title (edit_sim) | string | 0.483 | | 0.479 | -0.004 | |
| abstract (levenshtein) | string | 0.714 | | 0.462 | -0.253 | |
| abstract (edit_sim) | string | 0.620 | | 0.477 | -0.143 | |
| author_full_names (levenshtein) | partial_ulist | 0.667 | | 0.571 | -0.095 | |
| author_full_names (edit_sim) | partial_ulist | 0.708 | | 0.645 | -0.063 | |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | | 0.348 | +0.348 | |
| keywords (levenshtein) | partial_ulist | 0.429 | | 0.000 | -0.429 | |
| keywords (edit_sim) | partial_ulist | 0.387 | | 0.000 | -0.387 | |
| body_section_titles (levenshtein) | partial_list | 0.247 | | 0.375 | +0.128 | |
| body_section_titles (edit_sim) | partial_list | 0.242 | | 0.399 | +0.157 | |
| acknowledgement (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (edit_sim) | string | 0.632 | | 0.683 | +0.051 | |
| first_reference_text (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| first_reference_text (edit_sim) | string | 0.000 | | 0.385 | +0.385 | |
| reference_title (levenshtein) | partial_list | 0.147 | | 0.379 | +0.232 | |
| reference_title (edit_sim) | partial_list | 0.248 | | 0.423 | +0.175 | |
| reference_doi (levenshtein) | partial_ulist | 0.889 | | 0.333 | -0.556 | |
| reference_doi (edit_sim) | partial_ulist | 0.671 | | 0.538 | -0.133 | |

scielo_mx (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 0 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.182 | | 0.182 | -0.000 | |
| title (levenshtein) | string | 0.571 | | 0.333 | -0.238 | |
| title (edit_sim) | string | 0.589 | | 0.393 | -0.196 | |
| abstract (levenshtein) | string | 0.333 | | 0.333 | -0.000 | |
| abstract (edit_sim) | string | 0.524 | | 0.439 | -0.085 | |
| author_full_names (levenshtein) | partial_ulist | 0.389 | | 0.323 | -0.066 | |
| author_full_names (edit_sim) | partial_ulist | 0.398 | | 0.332 | -0.066 | |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | | 0.168 | +0.168 | |
| keywords (levenshtein) | partial_ulist | 0.532 | | 0.000 | -0.532 | |
| keywords (edit_sim) | partial_ulist | 0.447 | | 0.000 | -0.447 | |
| body_section_titles (levenshtein) | partial_list | 0.000 | | 0.000 | +0.000 | |
| body_section_titles (edit_sim) | partial_list | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (edit_sim) | string | 0.000 | | 0.000 | +0.000 | |
| first_reference_text (levenshtein) | string | 0.000 | | 0.364 | +0.364 | |
| first_reference_text (edit_sim) | string | 0.000 | | 0.575 | +0.575 | |
| reference_title (levenshtein) | partial_list | 0.368 | | 0.198 | -0.170 | |
| reference_title (edit_sim) | partial_list | 0.361 | | 0.242 | -0.119 | |
| reference_doi (levenshtein) | partial_ulist | 0.000 | | 0.000 | +0.000 | |
| reference_doi (edit_sim) | partial_ulist | 0.000 | | 0.000 | +0.000 | |

scielo_preprints-jats (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-3874e53e-20260527.2141: 0 docs | sciencebeam-parser:pr-617-3afbeb64-20260527.2150: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-3874e53e-20260527.2141 | sciencebeam-parser:pr-617-3afbeb64-20260527.2150 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-3874e53e-20260527.2141 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | | 0.000 | -0.462 | |
| title (levenshtein) | string | 0.750 | | 0.182 | -0.568 | |
| title (edit_sim) | string | 0.714 | | 0.533 | -0.180 | |
| abstract (levenshtein) | string | 0.462 | | 0.462 | +0.000 | |
| abstract (edit_sim) | string | 0.457 | | 0.486 | +0.029 | |
| author_full_names (levenshtein) | partial_ulist | 0.574 | | 0.517 | -0.057 | |
| author_full_names (edit_sim) | partial_ulist | 0.663 | | 0.598 | -0.064 | |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | | 0.696 | +0.696 | |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | | 0.580 | +0.580 | |
| keywords (levenshtein) | partial_ulist | 0.706 | | 0.000 | -0.706 | |
| keywords (edit_sim) | partial_ulist | 0.608 | | 0.000 | -0.608 | |
| body_section_titles (levenshtein) | partial_list | 0.302 | | 0.528 | +0.226 | |
| body_section_titles (edit_sim) | partial_list | 0.328 | | 0.480 | +0.152 | |
| acknowledgement (levenshtein) | string | 0.000 | | 0.000 | +0.000 | |
| acknowledgement (edit_sim) | string | 0.000 | | 0.000 | +0.000 | |
| first_reference_text (levenshtein) | string | 0.000 | | 0.667 | +0.667 | |
| first_reference_text (edit_sim) | string | 0.000 | | 0.787 | +0.787 | |
| reference_title (levenshtein) | partial_list | 0.177 | | 0.495 | +0.318 | |
| reference_title (edit_sim) | partial_list | 0.231 | | 0.482 | +0.251 | |
| reference_doi (levenshtein) | partial_ulist | 0.751 | | 0.734 | -0.017 | |
| reference_doi (edit_sim) | partial_ulist | 0.659 | | 0.669 | +0.010 | |

de-code added 2 commits May 27, 2026 22:41
… report

With multiple corpora the report was growing too long to scan at a glance.
Each corpus section is now wrapped in a collapsed <details> block, and an
Overall section is added at the top showing doc-count-weighted F1 across
all corpora so the headline result is immediately visible.
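
The doc-count weighting and the collapsed sections described in this commit message can be sketched as follows. This is an illustration only, not the report code from the PR: `weighted_mean` and `collapsed_section` are hypothetical names, and the input numbers are the per-corpus affiliation_text (levenshtein) scores and doc counts from the pr-617 column of the report above.

```python
from html import escape


def weighted_mean(scores, doc_counts):
    """Average per-corpus scores, weighting each corpus by its document count."""
    total_docs = sum(doc_counts[corpus] for corpus in scores)
    weighted_sum = sum(scores[corpus] * doc_counts[corpus] for corpus in scores)
    return weighted_sum / total_docs


def collapsed_section(title, body_markdown):
    """Wrap a Markdown body in a collapsed <details> block with a summary line."""
    return (
        f"<details>\n<summary>{escape(title)}</summary>\n\n"
        f"{body_markdown}\n\n</details>"
    )


# Doc counts and scores copied from the pr-617 column of the report above.
DOC_COUNTS = {
    "biorxiv": 9, "ore": 10, "pkp": 10,
    "scielo_br": 10, "scielo_mx": 10, "scielo_preprints-jats": 10,
}
AFFILIATION_LEVENSHTEIN = {
    "biorxiv": 0.907, "ore": 0.805, "pkp": 0.522,
    "scielo_br": 0.000, "scielo_mx": 0.000, "scielo_preprints-jats": 0.696,
}

overall = weighted_mean(AFFILIATION_LEVENSHTEIN, DOC_COUNTS)
print(f"{overall:.3f}")  # 0.481, matching the Overall row for this field
print(collapsed_section("ore (10 docs)", "| Field (method) | ... |"))
```

Weighting by document count keeps biorxiv, which has 9 scored documents, from counting as much as the 10-document corpora. The blank lines inside the `<details>` block are there so that GitHub renders the enclosed Markdown table.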
@de-code de-code marked this pull request as ready for review May 27, 2026 22:00
@de-code de-code merged commit 5b07693 into main May 27, 2026
7 checks passed
@de-code de-code deleted the expand-datasets branch May 27, 2026 22:00