Learning Doesn't Imply Generalisation

Literary NER generalisation experiment

I wanted a lightweight NER model for Jane Austen and nearby nineteenth-century fiction. There was no off-the-shelf version sitting there waiting for me, and building one sounded fun.

What followed was a useful reminder that a high validation score can be perfectly real and still hide a model that learned names instead of patterns.

Narrative structure

1. Validation scores don't imply generalisation

| Model   | In-domain | Cross-val† | Pickwick OOD |
|---------|-----------|------------|--------------|
| CRF     | 0.9438    | 0.2010     | 0.0668       |
| TextCNN | 0.8736    | 0.2548     | 0.0919       |

CRF goes from 0.94 → 0.20 → 0.07. On held-out test it looks nearly solved. On a genuinely unseen book it is basically useless.

TextCNN is weaker in-domain but loses less. It learned something slightly more general, even if only by accident.

The model you would ship from the validation score is the worst one on the real test.


2. The obvious next move: a transformer


huawei-noah/TinyBERT_General_4L_312D — 14.3M params, 4 layers, 312d hidden. Fine-tuned 4 epochs.
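One fiddly detail in fine-tuning: WordPiece splits a word into several tokens, but BIO labels are per word, so only the first subword of each word keeps its label and the rest get -100 so the loss ignores them. A minimal sketch of that alignment, assuming a `word_ids` sequence like the one a Hugging Face fast tokenizer's `word_ids()` returns:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword token; None for special tokens
              ([CLS]/[SEP]), otherwise the index of the source word.
    Only the first subword of each word keeps its label; continuation
    pieces and special tokens get ignore_index so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)
        elif wid != prev:
            aligned.append(word_labels[wid])
        else:
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Pickwick smiled" -> [CLS] pick ##wick smiled [SEP]
# word labels: B-PER, O  (as ids: 1, 0)
print(align_labels([None, 0, 0, 1, None], [1, 0]))
# -> [-100, 1, -100, 0, -100]
```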

```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#7e8fa6, #c1b6ff, #f5a97f"
  xyChart:
    width: 1120
    height: 500
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
    title "F1 across tiers"
    x-axis ["In-domain", "Cross-val", "Pickwick"]
    y-axis 0 --> 1
    line "CRF" [0.9438, 0.2010, 0.0668]
    line "TextCNN" [0.8736, 0.2548, 0.0919]
    line "TinyBERT" [0.9707, 0.2896, 0.1004]
```

It wins the easy numbers. Pickwick barely moves.

[reasonable person closes the notebook here]


3. Different architectures, same failure

We kept going. Three classical architectures, each aimed at a different failure mode.

  • XGB+CRF: continuous GloVe features + CRF transitions
  • CharHybrid: char-level + word-level CNN + CRF
  • NeuralGBT+CRF: TextCNN softmax → GBT arbitration → CRF decode

XGB+CRF lands at 0.0673 on Pickwick, CharHybrid at 0.0743, NeuralGBT+CRF at 0.0941, and TinyBERT at 0.1004. Different architectures, same outcome: nobody is really generalising.

The stack is the clue.


4. The model learned names, not patterns

The cleanest test: vandalise the names.

Replace entity tokens in the in-domain test set two ways: with invented strings that have no GloVe vector, and with common words whose embeddings point away from the usual name cluster.
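Mechanically the swap is small: walk the gold labels and overwrite entity tokens, leaving the labels alone so the eval still scores the original spans. A sketch (the replacement pool is illustrative):

```python
import random

def vandalise(tokens, labels, replacements, seed=0):
    """Replace every token inside a B-/I- span with a word drawn
    from `replacements`, keeping the gold labels untouched so the
    original spans are still what gets scored."""
    rng = random.Random(seed)
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != "O":
            repl = rng.choice(replacements)
            # preserve the capitalisation of the original token
            out.append(repl.capitalize() if tok[:1].isupper() else repl)
        else:
            out.append(tok)
    return out

tokens = ["Mr.", "Darcy", "smiled", "at", "Elizabeth", "."]
labels = ["O", "B-PER", "O", "O", "B-PER", "O"]
print(vandalise(tokens, labels, ["zorfax", "threlk"]))
```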

| Model         | Baseline | Invented names | Common words |
|---------------|----------|----------------|--------------|
| CRF           | 0.9534   | 0.6466         | 0.6298       |
| NeuralGBT+CRF | 0.9451   | 0.6578         | 0.6247       |

About a thirty-point drop either way. Precision barely moves. Recall caves in.

The models were doing name lookup, not entity recognition.

When Darcy becomes Zorfax, they stop firing. When it becomes running, the embedding now actively points away from “name”.

Fix: poison the shortcut.

Replace 30% of entity spans during training with invented names. Labels unchanged. Context unchanged. Token identity becomes unreliable, forcing the model onto surrounding patterns.

```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#6a8759, #7e8fa6"
  xyChart:
    width: 1120
    height: 430
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
    title "Pickwick OOD: baseline plus augmentation lift"
    x-axis ["NGBT", "TinyBERT"]
    y-axis 0 --> 1
    bar "Augmented total" [0.7718, 0.2550]
    bar "Baseline" [0.0941, 0.1004]
```

NeuralGBT+CRF goes from 0.09 → 0.77 on Pickwick. It gives up about two in-domain F1 points and gains almost seventy where it matters.

TinyBERT improves too, just less dramatically.

WordPiece already softens the OOV problem by breaking unseen names into subwords. The GloVe-backed GBT had no fallback. Unknown names were dead space. Augmentation forced it to learn around that.
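A toy greedy longest-match splitter makes the contrast concrete: the subword model can assemble an unseen name out of known fragments, while a plain embedding-table lookup just misses. Vocabulary and vectors here are invented, not TinyBERT's actual vocab:

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.
    Returns None if no segmentation exists (i.e. would become [UNK])."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand       # continuation-piece marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return None
        pieces.append(piece)
        start = end
    return pieces

# Toy subword vocab: no full "pickwick", but useful fragments exist.
vocab = {"pick", "##wick", "dar", "##cy"}
glove = {"darcy": [0.1, 0.2]}            # pretend GloVe table

print(wordpiece_split("pickwick", vocab))   # ['pick', '##wick']
print(glove.get("pickwick"))                # None -> dead space for the GBT
```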

NeuralGBT+CRF (aug) at 0.77 still beats TinyBERT (aug) at 0.26 by roughly 3×.


Examples

What the three tiers actually test

Name overlap between train and val is normal for NER. Standard benchmarks split by document, not by entity name.

This setup uses a clean gradient instead:

| Tier | What the model has seen | What it tests |
|------|-------------------------|---------------|
| In-domain | Same author, same characters, held-out sentences | That learning happened at all |
| Cross-val† | Known authors, new books, mostly unseen names | Generalisation within distribution |
| Pickwick OOD | Nothing — cold start | Actual generalisation |

Pickwick is the only true cold-start test.


Some sentence-level examples make the failure mode clearer than another average. Key: crf = CRF, ngbt = NeuralGBT+CRF, aug = NeuralGBT+CRF (augmented).

  • Easy in-domain hit. "Live with me, dear Lady Bertram!" All three fire: crf=B-PER I-PER, ngbt=B-PER I-PER, aug=B-PER I-PER.

  • Possessive form breaks the baseline. The sudden termination of Colonel Brandon's visit at the park... crf=B-PER I-PER, ngbt=O O, aug=B-PER I-PER. CRF anchors on Colonel; the augmented model recovers the possessive form.

  • Some patterns transfer across books. "Do you know, Miss Linton, that brute Hareton laughs at me!" CRF misses all three tokens, while ngbt and aug get them from the title + capitalised-surname pattern.

  • Augmentation is what finally survives OOD. Pickwick undertook to drive... and Pickwick's Determination... both get crf=O, ngbt=O, aug=B-PER. The augmented model learned a transferable person-pattern instead of a name lookup.

  • Some OOD cases still beat everyone. ...the Company at the Peacock assembled... and ...Advantage of Dodson and Fogg... all stay O. Pub names and mid-phrase legal surnames still do not expose enough signal.


Tech specs

Data — v2 clean partition

| Split | Sentences | Source | Notes |
|-------|-----------|--------|-------|
| Training | ~17,930 | Austen (6 novels, 80% split) | Gazetteer-labeled |
| + Training | ~19,290 | David Copperfield + Jane Eyre | Seen authors |
| + Silver pool | 360 | 8 books × 45 sentences | CRF+spaCy scored, Claude Haiku labeled |
| In-domain test | 4,483 | Austen held-out (20%) | Same distribution as training |
| Cross-val† | 320 | Bleak House + Wuthering Heights | Known authors, unseen books |
| Pickwick OOD | 3,814 | Pickwick Papers | Entirely unseen — book, characters, register |
| Training total | 37,580 | | |

Augmented training: 37,580 × 4 copies (3 augmented + original) = 150,320 sentences. Each augmented copy replaces 30% of entity spans per sentence with invented strings.

Model specs

| Model | Key details |
|-------|-------------|
| CRF | sklearn-crfsuite, orthographic + honorific + suffix features, Viterbi decode |
| TextCNN | GloVe 100d frozen, kernels (2,3,4), 64 filters, softmax |
| XGB+CRF | HistGradientBoostingClassifier (251-dim: GloVe ±2 context + shape), proba → CRF |
| CharHybrid | Char-CNN (32 filters) + word-CNN (128 filters, kernels 2–5) + CRF layer |
| NeuralGBT+CRF | TextCNN softmax (6-dim) + GloVe + shape = 257-dim → GBT (150 trees, depth 5) → CRF |
| TinyBERT | 4L/312d, fine-tuned 4 epochs, lr=3e-5, first-subword label alignment |
| NeuralGBT+CRF (aug) | Same architecture, trained on augmented data |
| TinyBERT (aug) | Same architecture, fine-tuned on augmented data |
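To make the CRF row concrete, here is the shape of an orthographic + honorific + suffix feature function in the sklearn-crfsuite style. This is a sketch, not the exact feature set from the experiment; the honorific list and feature names are illustrative:

```python
HONORIFICS = {"mr", "mrs", "miss", "sir", "lady", "colonel", "captain"}

def word2features(sent, i):
    """Per-token feature dict, sklearn-crfsuite style:
    orthography, short suffixes, and an honorific cue from the left
    neighbour (the classic literary-NER signal: 'Mr. ___')."""
    w = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isupper": w.isupper(),
        "suffix3": w[-3:].lower(),
        "suffix2": w[-2:].lower(),
        "is_honorific": w.lower().rstrip(".") in HONORIFICS,
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
        feats["prev.is_honorific"] = sent[i - 1].lower().rstrip(".") in HONORIFICS
    else:
        feats["BOS"] = True
    return feats

sent = ["Mr.", "Pickwick", "smiled"]
print(word2features(sent, 1)["prev.is_honorific"])   # True
```

Note the trap these features set: `word.lower` puts the literal token in the feature space, which is exactly the name-lookup shortcut the name-swap experiment exposes.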

Name-swap augmentation

  • Each entity span is swapped with probability 0.3
  • Invented strings use random consonant-vowel alternation, e.g. Zorfax, Threlk
  • Capitalisation preserved; trailing punctuation preserved
  • Labels unchanged
  • Effect: token identity stops being reliable, so the model has to use context
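The steps above fit in a few lines. A sketch, with illustrative consonant/vowel alphabets and simplified punctuation handling:

```python
import random

CONSONANTS = "bcdfghjklmnprstvwz"
VOWELS = "aeiou"

def invented_name(rng, min_len=4, max_len=8):
    """Pronounceable nonsense via strict consonant-vowel alternation,
    e.g. 'Zorat' -- guaranteed to have no GloVe vector."""
    n = rng.randint(min_len, max_len)
    chars = [rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(n)]
    return "".join(chars).capitalize()

def augment(tokens, labels, rng, p=0.3):
    """Swap each B-/I- entity span with invented names with prob p.
    Labels stay untouched, so the model must learn the context that
    licenses the span, not the token identity."""
    out = list(tokens)
    i = 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j].startswith("I-"):
                j += 1
            if rng.random() < p:
                for k in range(i, j):
                    out[k] = invented_name(rng)
            i = j
        else:
            i += 1
    return out

rng = random.Random(42)
tokens = ["Lady", "Bertram", "wrote", "to", "Fanny"]
labels = ["B-PER", "I-PER", "O", "O", "B-PER"]
print(augment(tokens, labels, rng, p=1.0))
```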

Evaluation

  • seqeval span-level F1
  • BIO tags: O, B-PER, I-PER, B-LOC, I-LOC
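Span-level F1 means a prediction only counts if the whole span matches, type and boundaries both. A simplified re-implementation of the idea (not seqeval itself, which also handles edge cases like spans opened by a stray I- tag):

```python
def extract_spans(tags):
    """BIO tags -> set of (type, start, end) spans, end exclusive.
    A span closes on O, on a new B-, or on an I- of a different type."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or \
           (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # an I- of the open type simply extends the current span
    return spans

def span_f1(gold, pred):
    """Exact-match span F1: a predicted span is a true positive only
    if type, start, and end all agree with a gold span."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(round(span_f1(gold, pred), 3))   # 0.667: one span of two recovered
```

This strictness is why recall collapses in the name-swap tables: a model that tags only half of a two-token name scores zero on that span.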

Notes / potential callouts

  • XGBoost 3.x segfaults on Python 3.14 ARM (OpenMP). Used sklearn.HistGradientBoostingClassifier throughout.
  • Name-swap eval: precision barely moves on invented/common swaps (~0.89–0.92), recall collapses (~0.95 → ~0.48). The models are still precise when they fire. They just stop firing on unknown names.
  • TinyBERT's WordPiece tokeniser breaks OOV names into subword pieces (Jarndyce → Jar ##nd ##yce), which is why augmentation helps it less than the GBT.
  • BookCorpus includes Victorian-era text. TinyBERT likely saw Dickens-like prose before fine-tuning. That makes its cross-val less impressive, and its Pickwick failure more interesting.

Appendix — full results

| Model | In-domain | Cross-val† | Pickwick OOD |
|-------|-----------|------------|--------------|
| CRF | 0.9438 | 0.2010 | 0.0668 |
| TextCNN | 0.8736 | 0.2548 | 0.0919 |
| XGB+CRF | 0.9406 | 0.2046 | 0.0673 |
| CharHybrid | 0.9425 | 0.2256 | 0.0743 |
| NeuralGBT+CRF | 0.9212 | 0.2777 | 0.0941 |
| TinyBERT 4L/312d | 0.9707 | 0.2896 | 0.1004 |
| NeuralGBT+CRF (aug) | 0.8973 | 0.5753 | 0.7718 |
| TinyBERT 4L/312d (aug) | 0.9597 | 0.3330 | 0.2550 |

† Cross-val: Bleak House + Wuthering Heights — known authors, unseen books