Mon 06 April 2026
Literary NER generalisation experiment
I wanted a lightweight NER model for Jane Austen and nearby nineteenth-century fiction. There was no off-the-shelf version sitting there waiting for me, and building one sounded fun.
What followed was a useful reminder that a high validation score can be perfectly real and still hide a model that learned names instead of patterns.
Narrative structure
1. Validation scores don't imply generalisation
| Model | In-domain | Cross-val† | Pickwick OOD |
|---|---|---|---|
| CRF | 0.9438 | 0.2010 | 0.0668 |
| TextCNN | 0.8736 | 0.2548 | 0.0919 |
CRF goes from 0.94 → 0.20 → 0.07. On held-out test it looks nearly solved. On a genuinely unseen book it is basically useless.
TextCNN is weaker in-domain but loses less. It learned something slightly more general, even if by accident.
The model you would ship from the validation score is the worst one on the real test.
2. The obvious next move: a transformer
huawei-noah/TinyBERT_General_4L_312D — 14.3M params, 4 layers, 312d hidden. Fine-tuned 4 epochs.
```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#7e8fa6, #c1b6ff, #f5a97f"
  xyChart:
    width: 1120
    height: 500
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
  title "F1 across tiers"
  x-axis ["In-domain", "Cross-val", "Pickwick"]
  y-axis 0 --> 1
  line "CRF" [0.9438, 0.2010, 0.0668]
  line "TextCNN" [0.8736, 0.2548, 0.0919]
  line "TinyBERT" [0.9707, 0.2896, 0.1004]
```
It wins the easy numbers. Pickwick barely moves.
[reasonable person closes the notebook here]
3. Different architectures, same failure
We kept going. Three classical architectures, each aimed at a different failure mode.
- XGB+CRF: continuous GloVe features + CRF transitions
- CharHybrid: char-level + word-level CNN + CRF
- NeuralGBT+CRF: TextCNN softmax → GBT arbitration → CRF decode
XGB+CRF lands at 0.0673 on Pickwick, CharHybrid at 0.0743, NeuralGBT+CRF at 0.0941, and TinyBERT at 0.1004. Different architectures, same outcome: nobody is really generalising.
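For concreteness, the stacking in NeuralGBT+CRF is just per-token feature concatenation before the GBT. A minimal numpy sketch; the 6-dim softmax matches the model specs, but the exact split of the remaining 251 dims (here 5 × 50-d GloVe window + 1 shape flag) is my assumption, not the original feature layout:

```python
import numpy as np

def stack_token_features(cnn_softmax, glove_window, shape_feats):
    """Concatenate the TextCNN class distribution, a GloVe context
    window, and orthographic shape features into one per-token vector."""
    return np.concatenate([cnn_softmax, glove_window.ravel(), shape_feats])

rng = np.random.default_rng(0)
cnn_softmax = rng.dirichlet(np.ones(6))   # 6 BIO classes from the TextCNN
glove_window = rng.normal(size=(5, 50))   # +/-2-token window (dims assumed)
shape_feats = np.array([1.0])             # e.g. is-capitalised (assumed)

x = stack_token_features(cnn_softmax, glove_window, shape_feats)
assert x.shape == (257,)  # 6 + 250 + 1: the 257-dim input to the GBT
```

The GBT then classifies each 257-dim vector, and the CRF decodes its per-class probabilities into a consistent tag sequence.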
The stack is the clue.
4. The model learned names, not patterns
The cleanest test: vandalise the names.
Replace entity tokens in the in-domain test set with invented strings that have no GloVe vector, and, separately, with common words whose embeddings point away from the usual name cluster.
| Model | Baseline | Invented names | Common words |
|---|---|---|---|
| CRF | 0.9534 | 0.6466 | 0.6298 |
| NeuralGBT+CRF | 0.9451 | 0.6578 | 0.6247 |
About a thirty-point drop either way. Precision barely moves. Recall caves in.
The models were doing name lookup, not entity recognition.
When `Darcy` becomes `Zorfax`, they stop firing. When it becomes `running`, the embedding now actively points away from “name”.
Fix: poison the shortcut.
Replace 30% of entity spans during training with invented names. Labels unchanged. Context unchanged. Token identity becomes unreliable, forcing the model onto surrounding patterns.
```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#6a8759, #7e8fa6"
  xyChart:
    width: 1120
    height: 430
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
  title "Pickwick OOD: baseline plus augmentation lift"
  x-axis ["NGBT", "TinyBERT"]
  y-axis 0 --> 1
  bar "Augmented total" [0.7718, 0.2550]
  bar "Baseline" [0.0941, 0.1004]
```
NeuralGBT+CRF goes from 0.09 → 0.77 on Pickwick. It gives up about two in-domain F1 points and gains almost seventy where it matters.
TinyBERT improves too, just less dramatically.
WordPiece already softens the OOV problem by breaking unseen names into subwords. The GloVe-backed GBT had no fallback. Unknown names were dead space. Augmentation forced it to learn around that.
NeuralGBT+CRF (aug) at 0.77 still beats TinyBERT (aug) at 0.26 by roughly 3×.
Examples
What the three tiers actually test
Name overlap between train and val is normal for NER. Standard benchmarks split by document, not by entity name.
This setup uses a clean gradient instead:
| Tier | What the model has seen | What it tests |
|---|---|---|
| In-domain | Same author, same characters, held-out sentences | That learning happened at all |
| Cross-val† | Known authors, new books, mostly unseen names | Generalisation within distribution |
| Pickwick OOD | Nothing — cold start | Actual generalisation |
Pickwick is the only true cold-start test.
Some sentence-level examples make the failure mode clearer than another average. Key: `crf` = CRF, `ngbt` = NeuralGBT+CRF, `aug` = NeuralGBT+CRF (augmented).

- **Easy in-domain hit.** "Live with me, dear Lady Bertram!" All three fire: `crf=B-PER I-PER`, `ngbt=B-PER I-PER`, `aug=B-PER I-PER`.
- **Possessive form breaks the baseline.** "The sudden termination of Colonel Brandon's visit at the park..." `crf=B-PER I-PER`, `ngbt=O O`, `aug=B-PER I-PER`. CRF anchors on "Colonel"; the augmented model recovers the possessive form.
- **Some patterns transfer across books.** "Do you know, Miss Linton, that brute Hareton laughs at me!" CRF misses all three tokens, while `ngbt` and `aug` get them from the title + capitalised-surname pattern.
- **Augmentation is what finally survives OOD.** "Pickwick undertook to drive..." and "Pickwick's Determination..." both get `crf=O`, `ngbt=O`, `aug=B-PER`. The augmented model learned a transferable person-pattern instead of a name lookup.
- **Some OOD cases still beat everyone.** "...the Company at the Peacock assembled..." and "...Advantage of Dodson and Fogg..." all stay `O`. Pub names and mid-phrase legal surnames still do not expose enough signal.
Tech specs
Data — v2 clean partition
| Split | Sentences | Source | Notes |
|---|---|---|---|
| Training | ~17,930 | Austen (6 novels, 80% split) | Gazetteer-labeled |
| + Training | ~19,290 | David Copperfield + Jane Eyre | Seen authors |
| + Silver pool | 360 | 8 books × 45 sentences | CRF+spaCy scored, Claude Haiku labeled |
| In-domain test | 4,483 | Austen held-out (20%) | Same distribution as training |
| Cross-val† | 320 | Bleak House + Wuthering Heights | Known authors, unseen books |
| Pickwick OOD | 3,814 | Pickwick Papers | Entirely unseen — book, characters, register |
| Training total | 37,580 | — | — |
Augmented training: 37,580 × 4 copies (3 augmented + original) = 150,320 sentences. Each augmented copy replaces 30% of entity spans per sentence with invented strings.
Model specs
| Model | Key details |
|---|---|
| CRF | sklearn-crfsuite, orthographic + honorific + suffix features, Viterbi decode |
| TextCNN | GloVe 100d frozen, kernels (2,3,4), 64 filters, softmax |
| XGB+CRF | HistGradientBoostingClassifier (251-dim: GloVe ±2 context + shape), proba → CRF |
| CharHybrid | Char-CNN (32 filters) + word-CNN (128 filters, kernels 2–5) + CRF layer |
| NeuralGBT+CRF | TextCNN softmax (6-dim) + GloVe + shape = 257-dim → GBT (150 trees, depth 5) → CRF |
| TinyBERT | 4L/312d, fine-tuned 4 epochs, lr=3e-5, first-subword label alignment |
| NeuralGBT+CRF (aug) | Same architecture, trained on augmented data |
| TinyBERT (aug) | Same architecture, fine-tuned on augmented data |
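For the CRF row, the "orthographic + honorific + suffix" features might look like the sketch below: per-token dicts in the style sklearn-crfsuite consumes. The specific feature names and the honorific list are illustrative guesses, not the original feature set:

```python
# Hypothetical honorific list; the original set is not documented here.
HONORIFICS = {"Mr", "Mrs", "Miss", "Sir", "Lady", "Colonel", "Captain", "Dr"}

def token_features(sent, i):
    """Feature dict for token i of a tokenised sentence, in the
    dict-per-token format sklearn-crfsuite expects."""
    w = sent[i]
    feats = {
        "lower": w.lower(),
        "is_title": w.istitle(),                      # orthographic: Capitalised?
        "is_upper": w.isupper(),
        "suffix3": w[-3:],                            # suffix features
        "suffix2": w[-2:],
        "is_honorific": w.rstrip(".") in HONORIFICS,  # honorific lookup
    }
    if i > 0:
        prev = sent[i - 1]
        feats["prev_lower"] = prev.lower()
        feats["prev_is_honorific"] = prev.rstrip(".") in HONORIFICS
    else:
        feats["BOS"] = True  # beginning-of-sentence marker
    return feats
```

The `prev_is_honorific` feature is exactly the shortcut that makes the CRF fire on "Colonel Brandon" while missing bare unseen surnames.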
Name-swap augmentation
- Each entity span is swapped with probability 0.3
- Invented strings use random consonant-vowel alternation, e.g. `Zorfax`, `Threlk`
- Capitalisation preserved; trailing punctuation preserved
- Labels unchanged
- Effect: token identity stops being reliable, so the model has to use context
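A minimal sketch of the swap under those rules. The generator, the punctuation handling, and `SWAP_P` as a module constant are my choices, not the original code:

```python
import random

CONS, VOWELS = "bcdfghjklmnprstvwz", "aeiou"
SWAP_P = 0.3  # each entity span swapped with probability 0.3

def invented_name(rng, length=6):
    """Random consonant-vowel alternation, capitalised."""
    pools = [(CONS if i % 2 == 0 else VOWELS) for i in range(length)]
    return "".join(rng.choice(p) for p in pools).capitalize()

def augment(tokens, labels, rng):
    """Replace whole entity spans with invented names; labels unchanged."""
    out = list(tokens)
    i = 0
    while i < len(labels):
        if labels[i].startswith("B-") and rng.random() < SWAP_P:
            j = i + 1
            while j < len(labels) and labels[j].startswith("I-"):
                j += 1
            for k in range(i, j):
                # keep trailing punctuation; capitalisation comes from the generator
                trail = out[k][len(out[k].rstrip(".,;:!?")):]
                out[k] = invented_name(rng) + trail
            i = j
        else:
            i += 1
    return out, labels
```

Because labels and context are untouched, the only thing the model loses is the ability to memorise the token itself.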
Evaluation
- seqeval span-level F1
- BIO tags: O, B-PER, I-PER, B-LOC, I-LOC
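For reference, span-level F1 in the seqeval sense reduces to exact-match span comparison. A pure-Python re-implementation (mine, not seqeval itself), assuming well-formed BIO:

```python
def spans(tags):
    """Extract (type, start, end) spans from a well-formed BIO sequence."""
    out, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype, start = tags[i][2:], i
            i += 1
            while i < len(tags) and tags[i] == "I-" + etype:
                i += 1
            out.append((etype, start, i))
        else:
            i += 1
    return out

def span_f1(y_true, y_pred):
    """Micro span-level F1 over sentences; a span counts only if type
    and both boundaries match exactly."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        ts, ps = set(spans(t)), set(spans(p))
        tp += len(ts & ps)
        fp += len(ps - ts)
        fn += len(ts - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
# one of two gold spans found, no false positives: P=1.0, R=0.5, F1≈0.667
```

Exact-boundary matching is why the possessive and mid-phrase examples above score zero even when the model is "close".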
Notes / potential callouts
- XGBoost 3.x segfaults on Python 3.14 ARM (OpenMP). Used `sklearn.ensemble.HistGradientBoostingClassifier` throughout.
- Name-swap eval: precision barely moves on invented/common swaps (~0.89–0.92); recall collapses (~0.95 → ~0.48). The models are still precise when they fire. They just stop firing on unknown names.
- TinyBERT's WordPiece tokeniser breaks OOV names into subword pieces (`Jarndyce` → `Jan ##rd ##yce`), which is why augmentation helps it less than the GBT.
- BookCorpus includes Victorian-era text. TinyBERT likely saw Dickens-like prose before fine-tuning. That makes its cross-val score less impressive, and its Pickwick failure more interesting.
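To see why subwords soften the OOV cliff, here is a toy greedy longest-match tokenizer over a made-up vocabulary. Real WordPiece vocabularies and splits differ; this only illustrates that an unseen name can still decompose into known pieces instead of a single UNK:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match subword split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes UNK
        pieces.append(match)
        start = end
    return pieces

# Toy vocab: 'Pickwick' was never seen whole, but its pieces were.
vocab = {"Pick", "##wick", "Darcy", "##s"}
```

Under this vocab, `wordpiece("Pickwick", vocab)` yields `["Pick", "##wick"]`, each piece with a trained embedding, whereas a GloVe lookup on the whole word would return nothing.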
Appendix — full results
| Model | In-domain | Cross-val† | Pickwick OOD |
|---|---|---|---|
| CRF | 0.9438 | 0.2010 | 0.0668 |
| TextCNN | 0.8736 | 0.2548 | 0.0919 |
| XGB+CRF | 0.9406 | 0.2046 | 0.0673 |
| CharHybrid | 0.9425 | 0.2256 | 0.0743 |
| NeuralGBT+CRF | 0.9212 | 0.2777 | 0.0941 |
| TinyBERT 4L/312d | 0.9707 | 0.2896 | 0.1004 |
| — | — | — | — |
| NeuralGBT+CRF (aug) | 0.8973 | 0.5753 | 0.7718 |
| TinyBERT 4L/312d (aug) | 0.9597 | 0.3330 | 0.2550 |
† Cross-val: Bleak House + Wuthering Heights — known authors, unseen books