- Article source credibility has already been verified upstream
- Input name and DOB are assumed correct; DOB expected in YYYY-MM-DD (with basic format validation)
- Name commonality is assessed globally (cultural/geographic context would be considered in future development)
- Only name and DOB are used as inputs — many other useful signals (occupation, location, employer, family members) are intentionally excluded in this task
Manually created test data used to inform focus areas and scoring can be found in data/ams_test_dataset_filled.csv
URL → [1. Fetch, Extract & Parse] → [2. LLM Extraction] → [3. Deterministic Checks] → [4. LLM Assessment] → [5. Scoring]
- LLM call 1 is extraction only
- LLM call 2 is judgement only

This separates the concerns of the LLM rather than asking it to do two things at once. Each prompt (and therefore the LLM) stays focused, which makes the output more reliable. A model asked to simultaneously extract and assess tends to let its assessment colour its extraction.
- Fetch article URL, extract clean text
- Extract publish date from metadata or webpage (null if not found)
- Fallback if unavailable: record failure reason, attempt metadata-only extraction, try cached version if possible
- If nothing extractable → verdict must be refer (never discard on fetch failure)
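The fallback ordering above can be sketched as follows. This is a minimal illustration, not the actual implementation; the fetch strategies are passed in as named callables (the names `full_text`, `metadata_only`, `cached` are assumptions matching the bullets above).

```python
def fetch_article(url, fetchers):
    """Try each fetch strategy in order; record why earlier ones failed.

    `fetchers` is an ordered list of (name, fn) pairs, e.g. full_text,
    metadata_only, cached -- the strategy names are illustrative only.
    """
    failures = []
    for name, fn in fetchers:
        try:
            text = fn(url)
            if text:
                return {"content_source": name, "text": text, "failures": failures}
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    # Nothing extractable: never discard on fetch failure, always refer
    return {"content_source": None, "text": None, "failures": failures,
            "forced_verdict": "refer"}
```

The key property is the last return: a fetch failure can only ever force `refer`, never `discard`.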
Single LLM call to extract structured facts. No match assessment yet — extraction only. With more time, this could be combined with NER (Named Entity Recognition) systems.
- Article language / script used
- Country of origin (where was the article published) if available
- Extracted entities as `name` + `entity_type`, where `entity_type` is one of `person | company | animal | other`
- Each entity also includes `is_deceased: true | false | null` (entity-level only; no input-name matching in extraction)
- All name forms used in the article for the screened candidate, cultural origin, name commonality
- Gender signals - counts of he / she / they / them pronouns used for the entity
- DOB if stated, age if stated, life events with dates and an `event_type` enum
- For each life event date, extract structured components where possible: `event_date_year`, `event_date_month`, `event_date_day` (integers or null)
- Also extract the type of event (mainly for deterministic checks):
  - `birth` (person's own birth event, tolerated implied age range `-1` to `1`; do not confuse with `children_mentioned`)
  - `first_job_internship` (roughly 16-26)
  - `employment` (at least 14)
  - `prison_sentence` (deterministic: <10 impossible, 10-15 highly implausible, 16+ generally plausible)
  - `university_graduation` (roughly 21-24)
  - `founded_company` (at least 18)
  - `military_service` (18+)
  - `marriage` (at least 16-18 depending on context)
  - `children_mentioned` (at least 16-18)
  - `retirement` (roughly 60+, with early retirement possible)
  - `other` (only if none apply)
- Twitter / X handle
- LinkedIn profile
- Facebook profile
- Instagram handle
- Note: handles are unreliable as primary signals (accounts get reused, people have multiples) but potentially useful for further research (see PART 2)
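Pulling the bullets above together, the extraction call targets a structure along these lines. Field names here are illustrative assumptions matching the bullets, not a fixed spec:

```python
# Illustrative target schema for LLM call 1 (extraction only).
# Every field corresponds to a bullet above; names are assumptions.
EXAMPLE_EXTRACTION = {
    "language": "en",
    "country_of_origin": "GB",
    "entities": [
        {
            "name": "Jane Doe",                 # hypothetical example entity
            "entity_type": "person",            # person | company | animal | other
            "is_deceased": None,                # true | false | null
            "name_forms": ["Jane Doe", "J. Doe"],
            "name_commonality": "uncommon",
            "pronoun_counts": {"he": 0, "she": 4, "they": 1},
            "dob": None,                        # only if explicitly stated
            "stated_age": 34,
            "life_events": [
                {
                    "event_type": "university_graduation",
                    "event_date_year": 2012,
                    "event_date_month": 6,
                    "event_date_day": None,     # integers or null per component
                },
            ],
            "social_handles": {"twitter": "@janedoe", "linkedin": None},
        },
    ],
}
```

Note there is no match field anywhere in this structure; matching is deferred to the deterministic checks and the second LLM call.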
Fast, deterministic rule-based checks run before the second LLM call.
Name
- Normalise both strings (lowercase, trim, remove punctuation)
- Check in the extracted names AND in the raw article text to confirm existence
- Output: `exact`, `derived_match`, or `no_exact_match`
- (In future, would consider looking at explicit romanisation / transliterations of names so we can test these deterministically)
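A minimal sketch of the normalisation and deterministic name check described above. The `derived_match` rule shown (same tokens, different order, e.g. surname-first) is one assumed interpretation of "derived", not the full set of rules:

```python
import re
import unicodedata

def normalise(name: str) -> str:
    """Lowercase, trim, strip punctuation and accents for matching."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())

def deterministic_name_check(input_name, extracted_names, article_text):
    """Return exact / derived_match / no_exact_match (labels from above).

    Checks the extracted names AND the raw article text to confirm
    the exact form actually exists in the article.
    """
    target = normalise(input_name)
    candidates = {normalise(n) for n in extracted_names}
    if target in candidates and target in normalise(article_text):
        return "exact"
    # Assumed "derived" rule: same tokens in a different order
    if any(sorted(target.split()) == sorted(c.split()) for c in candidates):
        return "derived_match"
    return "no_exact_match"
```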
DOB
- Exact DOB match (parsed to ISO)
- Component match scoring against input DOB:
  - `+1` when the year matches
  - `+2` when year and month match
  - `+3` when year, month, and day all match
  - (year-only and year-and-month matches are captured as positive for cases where that's the only DOB data available)
  - If a provided component conflicts (e.g. month or day mismatch), treat as inconsistent
- Age arithmetic: stated age + publish date → implied birth year, check vs. input DOB (±1 year)
- Flag: day ≤ 12 → possible day/month transposition
- Flag: 01/01/XXXX → possibly approximate
- If an ambiguous date format is used, e.g. 02/03/2000 in a US article, treat either reading as potentially valid
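The component scoring and the two flags above can be sketched as a single function. This is an illustrative sketch of the rules as stated (it omits the age-arithmetic and ambiguous-format branches for brevity):

```python
from datetime import date

def dob_component_score(input_dob: date, year, month, day):
    """Score article DOB components against the input DOB.

    Returns (score, flags): +1 year, +2 year+month, +3 full match;
    conflicting stated components are inconsistent; 01/01 and possible
    day/month transpositions are flagged and scored neutrally.
    """
    if year is None:
        return 0, ["not_found"]
    if year != input_dob.year:
        return -4, ["inconsistent"]
    if month is None:
        return 1, []                      # year-only match
    if month != input_dob.month:
        if (month, day) == (1, 1):
            return 0, ["approximate_flagged"]   # 01/01: possibly approximate
        # day <= 12: possible day/month transposition
        if day is not None and day <= 12 and (month, day) == (input_dob.day, input_dob.month):
            return 0, ["transposition_flagged"]
        return -4, ["inconsistent"]
    if day is None:
        return 2, []                      # year + month match
    if day != input_dob.day:
        return -4, ["inconsistent"]
    return 3, []                          # full match
```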
Second LLM call. Receives extraction output + deterministic results. Provides independent assessments.
Name match
exact | likely_variant | possible | unlikely | no_match + reason
Considers: nicknames, initials, middle names, cultural ordering, transliteration, particles, name uniqueness
Deceased check (input-name specific)
input_name_is_deceased: true | false | null + reason
Computed in assessment call using input_name against extracted person entities and article context.
Gender match
consistent | inconsistent | ambiguous | not_found + reason
Considers: whether the gender implied by the input name and the gender used in the article match (a person could transition with both a name and gender change, but this is uncommon, so it is ignored for this analysis)
Only negative scoring for this section (a positive gender match means very little)
DOB match Owned by deterministic checks (not by LLM in call 2). LLM receives deterministic DOB output as context but does not return a separate DOB verdict field.
Life Event match
impossible | implausible | plausible + reason
Considers: the age at which the life event took place and how plausible it would be for it to occur at that age
Sentiment
positive | negative | neutral | n/a + reason
Only assessed if the screened person is identified as a person in the article and name is not no_match
Scoped to this individual specifically, not the article overall (with more time would refine implementation)
Code combines all signals into a final verdict in two layers:
- fetch/extract failed or incomplete (`content_source != full_text` or fetch failure reason present) → refer
- exact input-name match to a non-person entity type (`company` / `animal` / `other`) → discard
- screened input name is assessed as deceased (`input_name_is_deceased = true`) → discard (assuming it's a live application, the person making it cannot be confirmed deceased, even if all other details match)
Name score (highest weight)
| Deterministic vs LLM | Score |
|---|---|
| `exact` + `exact` | +5 |
| `no_exact_match` + `likely_variant` | +2 |
| `no_exact_match` + `possible` | +0 |
| `no_exact_match` + `unlikely` | -2 |
| `no_exact_match` + `no_match` | -5 |
| mixed `exact`/`likely_variant` or `exact`/`possible` | reduced positive |
| any disagreement | -1 uncertainty penalty |
Name commonality thresholding (no commonality multiplier on score)

Name commonality does not scale points; it only changes the match threshold. (With more time to test, I would experiment with having it shift the discard threshold as well.)
- rare: match threshold `>= 3`
- uncommon: match threshold `>= 4`
- unknown: match threshold `>= 4`
- common: match threshold `>= 5`
- very_common: match threshold `>= 7`
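As a small sketch, the threshold lookup is just a table with a conservative fallback (falling back to the `unknown` bucket for unrecognised labels is an assumption, not stated above):

```python
# Match thresholds from the list above; commonality never scales points,
# it only moves the bar the name score must clear.
MATCH_THRESHOLDS = {
    "rare": 3,
    "uncommon": 4,
    "unknown": 4,
    "common": 5,
    "very_common": 7,
}

def match_threshold(commonality: str) -> int:
    # Assumption: unrecognised labels fall back to the "unknown" bucket
    return MATCH_THRESHOLDS.get(commonality, MATCH_THRESHOLDS["unknown"])
```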
DOB score (deterministic-only)
| Deterministic DOB result | Score |
|---|---|
| `exact` | +4 |
| `age_consistent` | +2 |
| `transposition_flagged` / `approximate_flagged` | +0 |
| `not_found` | +0 |
| `inconsistent` | -4 (or -2 when exactly 1 day off) |
Life event score (per event, deterministic baseline with LLM-preferred resolution)
| Plausibility | Score |
|---|---|
| `impossible` | -3 each |
| `highly_implausible` | -2 each |
| `implausible` | -1 each |
| `plausible` | +0 |
| `unknown` | +0 |
For now, life events are scored purely negatively (or neutral at 0).
With occupation/background inputs in the future, life events could also contribute positively when role progression and context align with the article.
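The deterministic baseline for life-event scoring can be sketched from the age ranges in the extraction enum. The mapping of "outside the rough range" to `-1` (implausible) and "before birth" to `-3` (impossible) is an assumed simplification; only the `prison_sentence` tiers are fully specified above:

```python
# Rough plausible age ranges per event_type, from the extraction enum.
# None means unbounded on that side.
EVENT_AGE_RANGES = {
    "first_job_internship": (16, 26),
    "employment": (14, None),
    "university_graduation": (21, 24),
    "founded_company": (18, None),
    "military_service": (18, None),
    "marriage": (16, None),
    "children_mentioned": (16, None),
    "retirement": (60, None),
}

def life_event_score(event_type: str, age_at_event: int) -> int:
    """Per-event penalty (negative-only scoring, per the table above)."""
    if event_type == "prison_sentence":       # explicitly tiered above
        if age_at_event < 10:
            return -3                         # impossible
        if age_at_event <= 15:
            return -2                         # highly implausible
        return 0                              # 16+ generally plausible
    if age_at_event < 0:
        return -3                             # event before birth: impossible
    lo, hi = EVENT_AGE_RANGES.get(event_type, (None, None))
    if (lo is not None and age_at_event < lo) or (hi is not None and age_at_event > hi):
        return -1                             # outside rough range: implausible
    return 0                                  # plausible / unknown
```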
Deterministic vs LLM resolution
- Deterministic outputs are baseline signals
- Where both deterministic and LLM provide the same signal family (especially life events), LLM judgment is preferred for scoring
- If deterministic is more negative than LLM for the same signal, add a warning flag for analyst review
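The resolution rule above reduces to a small function. This is a sketch of the stated preference order; the flag name is an illustrative assumption:

```python
def resolve_signal(det_score: int, llm_score: int):
    """Prefer the LLM judgment for scoring, but flag for analyst review
    when the deterministic check was more negative than the LLM."""
    flag = "deterministic_more_negative" if det_score < llm_score else None
    return llm_score, flag
```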
Gender score (negative-only)
| Condition | Score |
|---|---|
| `highly_inconsistent` + strong/strong evidence | -3 |
| `inconsistent` + one strong side | -1 |
| `consistent` / `unknown` / `ambiguous` / `unisex` | +0 |
- rare: match if score `>= 3`
- uncommon: match if score `>= 4`
- unknown: match if score `>= 4`
- common: match if score `>= 5`
- very_common: match if score `>= 7`
- discard if score `<= -3`
- otherwise refer
Sentiment
- Sentiment is reported for analyst context only and does not contribute to the score.
Output
- verdict: `match` | `refer` | `discard`
- confidence: `high` | `medium` | `low` (derived from verdict strength and distance from threshold)
- scoring_explanation: explicit breakdown of score components, thresholds, and gate decisions (I have included a match outcome as this could, with testing, reduce the need for the analyst to review)
(For some extraction and analysis, I've used LLMs over alternative, more deterministic methods, primarily because finding the data for, and implementing, the alternative methods would be time-consuming for an MVP. Calling this out as an FYI; some mentions below as well.)
Pipeline & extraction
- Improve article fetcher (website url scraper) robustness and completeness — input quality is the biggest single lever on output quality
- Also experiment with ways to bypass bot checks / paywalls (where legal and ethical to do so). I would also compare against feeding URLs directly to an LLM, with guardrails built to guard against hallucinations
- Better extraction, and parsing of dates (including publish date, which is sometimes NOT in the metadata, but instead in the HTML itself)
- Cross-check LLM-extracted entities against raw article text to catch hallucinated DOBs or life events
- Experiment with NER systems alongside the LLM extraction call for higher confidence entity identification
- ONS (and/or other census) data to perform a more robust assessment of the popularity / commonality of names (potentially even in the context of the time period)
Scoring & logic
- Tune scoring thresholds against a larger labelled dataset — current thresholds are principled but not empirically validated in any real depth; any tuning without held-out ground truth is guesswork
- Tiered DOB scoring: greater penalty the further the article DOB deviates from the input, rather than a flat inconsistent score
- Tighter subject-scoping: confirm the named person is not just mentioned in the article but is the actual subject of the adverse sentiment
Evaluation
- The current test set is heavily weighted toward well-known public figures (to guard against false negatives); a more realistic distribution would include obscure individuals where corroborating signals are sparse (i.e. increase the number / proportion of true negatives)