
AMS Check Logic


Assumptions

  • Article source credibility has already been verified upstream
  • Input name and DOB are assumed correct; DOB expected in YYYY-MM-DD (with basic format validation)
  • Name commonality is assessed globally (cultural/geographic context would be considered in future development)
  • Only name and DOB are used as inputs — many other useful signals (occupation, location, employer, family members) are intentionally excluded in this task

Test Data

Manually created test data used to inform focus areas and scoring can be found in data/ams_test_dataset_filled.csv


Pipeline Overview

URL → [1. Fetch, Extract & Parse] → [2. LLM Extraction] → [3. Deterministic Checks] → [4. LLM Assessment] → [5. Scoring]

Justification of the two-call structure

  • LLM call 1 is extraction only
  • LLM call 2 is judgement only

This separates the LLM's concerns rather than asking it to do two things at once, keeping each prompt (and therefore the model) focused and making the output more reliable. A model asked to simultaneously extract and assess tends to let its assessment colour its extraction.

Stage 1: Fetch, Extract & Parse (Deterministic)

  • Fetch article URL, extract clean text
  • Extract publish date from metadata or webpage (null if not found)
  • Fallback if unavailable: record failure reason, attempt metadata-only extraction, try cached version if possible
  • If nothing extractable → verdict must be refer (never discard on fetch failure)
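A minimal sketch of the Stage 1 fallback ladder. The function and field names here are illustrative, not the actual implementation; the key property from the list above is that a total failure records its reasons and is surfaced for a refer verdict rather than being discarded.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class FetchResult:
    content_source: str            # "full_text" | "metadata_only" | "cached" | "none"
    text: Optional[str]
    failure_reason: Optional[str]  # why earlier fetchers failed, if any

def fetch_with_fallback(url: str,
                        fetchers: List[Tuple[str, Callable[[str], Optional[str]]]]) -> FetchResult:
    """Try each fetcher in order (full text, metadata-only, cached copy),
    recording why each earlier attempt failed."""
    reasons = []
    for label, fetch in fetchers:
        try:
            text = fetch(url)
            if text:
                return FetchResult(label, text, "; ".join(reasons) or None)
            reasons.append(f"{label}: empty")
        except Exception as exc:  # network error, parse error, etc.
            reasons.append(f"{label}: {exc}")
    # Nothing extractable -> downstream verdict must be "refer", never discard
    return FetchResult("none", None, "; ".join(reasons))
```

The caller would map `content_source == "none"` straight to a refer verdict in Stage 5.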

Stage 2: LLM Extraction

Single LLM call to extract structured facts. No match assessment yet — extraction only. With more time, this would be combined with NER (Named Entity Recognition) systems.

  • Article language / script used
  • Country of origin (where was the article published) if available
  • Extracted entities as name + entity_type where entity_type is one of person | company | animal | other
    • Each entity also includes is_deceased: true | false | null (entity-level only; no input-name matching in extraction)
  • All name forms used in the article for the screened candidate, cultural origin, name commonality
  • Gender signals - counts of he / she / they / them pronouns
  • DOB if stated, age if stated, life events with dates and event_type enum:
    • For each life event date, extract structured components where possible: event_date_year, event_date_month, event_date_day (integers or null)
    • Also extract the type of event (mainly for deterministic checks)
      • birth (person's own birth event, tolerated implied age range -1 to 1; do not confuse with children_mentioned)
      • first_job_internship (roughly 16-26)
      • employment (at least 14)
      • prison_sentence (deterministic: <10 impossible, 10-15 highly implausible, 16+ generally plausible)
      • university_graduation (roughly 21-24)
      • founded_company (at least 18)
      • military_service (18+)
      • marriage (at least 16-18 depending context)
      • children_mentioned (at least 16-18)
      • retirement (roughly 60+, with early retirement possible)
      • other (only if none apply)

Unique identifiers (corroborative, not primary signals)

  • Twitter / X handle
  • LinkedIn profile
  • Facebook profile
  • Instagram handle
  • Note: handles are unreliable as primary signals (accounts get reused, people have multiples) but potentially useful for further research (see PART 2)

Stage 3: Deterministic Checks (Code)

Fast, deterministic rule-based checks run before the second LLM call.

Name

  • Normalise both strings (lowercase, trim, remove punctuation)
  • Check in the extracted names AND in the raw article text to confirm existence
  • Output: exact, derived_match or no_exact_match
  • (In future would consider looking at explicit romanisation / transliterations of names so we can test deterministically)
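The normalise-then-compare step can be sketched as below. The exact semantics of derived_match are an assumption here (taken as: the normalised input name appears in the raw article text even though no extracted entity matched exactly).

```python
import re
import unicodedata

def normalise_name(name: str) -> str:
    """Lowercase, trim, strip accents and punctuation, collapse whitespace."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def deterministic_name_check(input_name: str, extracted_names, article_text: str) -> str:
    """exact: a normalised extracted name matches the normalised input.
    derived_match: the normalised input appears in the raw article text.
    no_exact_match: neither."""
    target = normalise_name(input_name)
    if any(normalise_name(n) == target for n in extracted_names):
        return "exact"
    if target and target in normalise_name(article_text):
        return "derived_match"
    return "no_exact_match"
```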

DOB

  • Exact DOB match (parsed to ISO)
  • Component match scoring against input DOB:
    • +1 when year matches
    • +2 when year and month match
    • +3 when year, month, and day all match (year / year & month matches captured as positive for cases where that's the only DOB data available)
    • If a provided component conflicts (e.g. month or day mismatch), treat as inconsistent
  • Age arithmetic: stated age + publish date → implied birth year, check vs. input DOB (±1 year)
  • Flag: day ≤ 12 → possible day/month transposition
  • Flag: 01/01/XXXX → possibly approximate
  • If an ambiguous date format is used (e.g. 02/03/2000 in a US article), treat either reading as potentially valid
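The component-match scoring, age arithmetic, and flags above could look roughly like this (a sketch under the stated rules; how a conflict is numerically penalised is left to the Stage 5 DOB score table, so here a conflict just returns the inconsistent status):

```python
from datetime import date

def dob_component_score(input_dob: date, year, month, day):
    """Score extracted DOB components (possibly partial) against the input DOB.
    Returns (component_points, status)."""
    if year is None:
        return 0, "not_found"
    if year != input_dob.year:
        return 0, "inconsistent"
    if month is None:
        return 1, "year_match"
    if month != input_dob.month:
        return 0, "inconsistent"
    if day is None:
        return 2, "year_month_match"
    if day != input_dob.day:
        return 0, "inconsistent"
    return 3, "exact"

def implied_age_consistent(stated_age: int, publish_year: int, input_dob_year: int) -> bool:
    """stated age + publish date -> implied birth year, checked ±1 year."""
    return abs((publish_year - stated_age) - input_dob_year) <= 1

def dob_flags(month, day):
    flags = []
    if day is not None and month is not None and day <= 12 and day != month:
        flags.append("possible_day_month_transposition")
    if month == 1 and day == 1:
        flags.append("possibly_approximate")
    return flags
```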

Stage 4: LLM Assessment

Second LLM call. Receives extraction output + deterministic results. Provides independent assessments.

Name match exact | likely_variant | possible | unlikely | no_match + reason. Considers: nicknames, initials, middle names, cultural ordering, transliteration, particles, name uniqueness

Deceased check (input-name specific) input_name_is_deceased: true | false | null + reason Computed in assessment call using input_name against extracted person entities and article context.

Gender match consistent | inconsistent | ambiguous | not_found + reason. Considers: does the gender implied by the input name match the gender signals in the article (I'm aware that a person could transition with both a name and gender change - but this is uncommon, so we're ignoring it for this analysis). Only negative scoring for this section (a positive gender match means very little)

DOB match Owned by deterministic checks (not by LLM in call 2). LLM receives deterministic DOB output as context but does not return a separate DOB verdict field.

Life Event match impossible | implausible | plausible + reason. Considers: the age at which the life event took place and how plausible that would be

Sentiment positive | negative | neutral | n/a + reason. Only assessed if the screened person is identified as a person in the article and the name result is not no_match. Scoped to this individual specifically, not the article overall (with more time the implementation would be refined)


Stage 5: Confidence Scoring (Deterministic)

Code combines all signals into a final verdict in two layers:

Layer 1: Hard Overrides (applied before scoring)

  • fetch/extract failed or incomplete (content_source != full_text or fetch failure reason present) → refer
  • exact input-name match to non-person entity type (company/animal/other) → discard
  • screened input name is assessed as deceased (input_name_is_deceased = true) → discard (assuming it's a live application, the person making it cannot be confirmed deceased, even if all other details match)
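The Layer 1 overrides reduce to a short guard that runs before any scoring. This is a sketch; the parameter names are illustrative, and `None` means "fall through to the Layer 2 additive score".

```python
from typing import Optional

def hard_override(content_source: str,
                  fetch_failure_reason: Optional[str],
                  input_name_entity_type: str,
                  input_name_is_deceased: Optional[bool]) -> Optional[str]:
    """Apply the hard overrides in order; return a verdict or None."""
    # Fetch/extract failed or incomplete -> refer (never discard on fetch failure)
    if content_source != "full_text" or fetch_failure_reason:
        return "refer"
    # Exact input-name match to a non-person entity -> discard
    if input_name_entity_type in ("company", "animal", "other"):
        return "discard"
    # Input name assessed as deceased -> discard (live applicant cannot be deceased)
    if input_name_is_deceased is True:
        return "discard"
    return None  # no override; proceed to additive scoring
```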

Layer 2: Additive Score (non-overridden paths)

Name score (highest weight)

Deterministic vs LLM | Score
exact + exact | +5
no_exact_match + likely_variant | +2
no_exact_match + possible | +0
no_exact_match + unlikely | -2
no_exact_match + no_match | -5
mixed exact/likely_variant or exact/possible | reduced positive
any disagreement | -1 uncertainty penalty
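One way to encode the name score table, assuming (as an interpretation, since the table doesn't pin it down) that "reduced positive" means +2 for the mixed pairings and that the -1 disagreement penalty applies whenever the deterministic check says exact but the LLM does not:

```python
NAME_SCORE_TABLE = {
    ("exact", "exact"): 5,
    ("no_exact_match", "likely_variant"): 2,
    ("no_exact_match", "possible"): 0,
    ("no_exact_match", "unlikely"): -2,
    ("no_exact_match", "no_match"): -5,
}

def name_score(det: str, llm: str) -> int:
    score = NAME_SCORE_TABLE.get((det, llm))
    if score is None:
        # mixed exact/likely_variant or exact/possible: reduced positive (assumed +2)
        score = 2 if det == "exact" and llm in ("likely_variant", "possible") else 0
    if det == "exact" and llm != "exact":
        score -= 1  # disagreement uncertainty penalty (interpretation)
    return score
```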

Name commonality thresholding (no commonality multiplier on score). Name commonality does not scale points; it only changes the match threshold. (With more time to test, I would experiment with having it shift the discard threshold as well.)

  • rare: match threshold >= 3
  • uncommon: match threshold >= 4
  • unknown: match threshold >= 4
  • common: match threshold >= 5
  • very_common: match threshold >= 7

DOB score (deterministic-only)

Deterministic DOB result | Score
exact | +4
age_consistent | +2
transposition_flagged / approximate_flagged | +0
not_found | +0
inconsistent | -4 (or -2 when exactly 1 day off)

Life event score (per event, deterministic baseline with LLM-preferred resolution)

Plausibility | Score
impossible | -3 each
highly_implausible | -2 each
implausible | -1 each
plausible | +0
unknown | +0

For now, life events are scored purely negatively (or neutral at 0).
With occupation/background inputs in the future, life events could also contribute positively when role progression and context align with the article.
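Since events only score negatively (or neutrally), the per-event table above collapses to a simple sum. A sketch:

```python
# Score per plausibility label; unlisted labels contribute 0.
LIFE_EVENT_SCORES = {
    "impossible": -3,
    "highly_implausible": -2,
    "implausible": -1,
    "plausible": 0,
    "unknown": 0,
}

def life_event_score(plausibilities) -> int:
    """Sum the (purely non-positive) per-event contributions."""
    return sum(LIFE_EVENT_SCORES.get(p, 0) for p in plausibilities)
```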

Deterministic vs LLM resolution

  • Deterministic outputs are baseline signals
  • Where both deterministic and LLM provide the same signal family (especially life events), LLM judgment is preferred for scoring
  • If deterministic is more negative than LLM for the same signal, add a warning flag for analyst review

Gender score (negative-only)

Condition | Score
highly_inconsistent + strong/strong evidence | -3
inconsistent + one strong side | -1
consistent / unknown / ambiguous / unisex | +0

Final Thresholds

  • rare: match if score >= 3
  • uncommon: match if score >= 4
  • unknown: match if score >= 4
  • common: match if score >= 5
  • very_common: match if score >= 7
  • discard if score <= -3
  • otherwise refer
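The final threshold logic is small enough to state directly; this sketch assumes an unrecognised commonality label falls back to the "unknown" threshold:

```python
# Match threshold depends on how common the input name is.
MATCH_THRESHOLDS = {
    "rare": 3,
    "uncommon": 4,
    "unknown": 4,
    "common": 5,
    "very_common": 7,
}
DISCARD_THRESHOLD = -3

def final_verdict(score: int, name_commonality: str) -> str:
    """Map the additive score to match / discard / refer."""
    threshold = MATCH_THRESHOLDS.get(name_commonality, MATCH_THRESHOLDS["unknown"])
    if score >= threshold:
        return "match"
    if score <= DISCARD_THRESHOLD:
        return "discard"
    return "refer"
```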

Sentiment

  • Sentiment is reported for analyst context only and does not contribute to the score.

Output

  • verdict: match | refer | discard
  • confidence: high | medium | low (derived from verdict strength and distance from threshold)
  • scoring_explanation: explicit breakdown of score components, thresholds, and gate decisions (I have included a match outcome as this could, with testing, reduce the need for the analyst to review)

Some Next Steps & Improvements

(For some extraction and analysis, I've used LLMs over alternative, more deterministic methods, primarily because sourcing the data for / implementing those alternatives would be time-consuming for an MVP; calling this out as an FYI, with some mentions below as well.)

Pipeline & extraction

  • Improve article fetcher (website url scraper) robustness and completeness — input quality is the biggest single lever on output quality
    • Also experiment with ways to bypass bot checkers / paywalls (if legal and ethical to do so). I would also compare against feeding URLs directly to an LLM, with guardrails against hallucinations
  • Better extraction, and parsing of dates (including publish date, which is sometimes NOT in the metadata, but instead in the HTML itself)
  • Cross-check LLM-extracted entities against raw article text to catch hallucinated DOBs or life events
  • Experiment with NER systems alongside the LLM extraction call for higher confidence entity identification
  • Use ONS (and/or other census) data for a more robust assessment of the popularity / commonality of names (potentially even in the context of the time period)

Scoring & logic

  • Tune scoring thresholds against a larger labelled dataset — current thresholds are principled but not empirically validated in any real depth; any tuning without held-out ground truth is guesswork
  • Tiered DOB scoring: greater penalty the further the article DOB deviates from the input, rather than a flat inconsistent score
  • Tighter subject-scoping: confirm the named person is not just mentioned in the article but is the actual subject of the adverse sentiment

Evaluation

  • The current test set is heavily weighted toward well-known public figures, which helps guard against false negatives; a more realistic distribution would also include obscure individuals where corroborating signals are sparse (i.e. increase the number / proportion of true negatives)