Learning Doesn't Imply Generalisation

Literary NER generalisation experiment

I wanted a lightweight NER model for Jane Austen and nearby nineteenth-century fiction. There was no off-the-shelf version sitting there waiting for me, and building one sounded fun.

What followed was a useful reminder that a high validation score can be perfectly real and still hide a model that learned names instead of patterns.

Narrative structure

1. Validation scores don't imply generalisation

| Model   | In-domain | Cross-val† | Pickwick OOD |
|---------|-----------|------------|--------------|
| CRF     | 0.9438    | 0.2010     | 0.0668       |
| TextCNN | 0.8736    | 0.2548     | 0.0919       |

CRF goes from 0.94 → 0.20 → 0.07. On held-out test it looks nearly solved. On a genuinely unseen book it is basically useless.

TextCNN is weaker in-domain but loses less. It learned something slightly more general, even if only by accident.

The model you would ship from the validation score is the worst one on the real test.


2. The obvious next move: a transformer


huawei-noah/TinyBERT_General_4L_312D — 14.3M params, 4 layers, 312d hidden. Fine-tuned 4 epochs.
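One fiddly detail in fine-tuning: WordPiece splits a word into several tokens, but BIO labels are per word, so only the first subword of each word keeps its label and the rest get -100 so the loss ignores them. A minimal sketch of that alignment, assuming a `word_ids` sequence like the one a Hugging Face fast tokenizer's `word_ids()` returns:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword token; None for special tokens
              ([CLS]/[SEP]), otherwise the index of the source word.
    Only the first subword of each word keeps its label; continuation
    pieces and special tokens get ignore_index so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)
        elif wid != prev:
            aligned.append(word_labels[wid])
        else:
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Pickwick smiled" -> [CLS] pick ##wick smiled [SEP]
# word labels: B-PER, O  (as ids: 1, 0)
print(align_labels([None, 0, 0, 1, None], [1, 0]))
# -> [-100, 1, -100, 0, -100]
```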

```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#7e8fa6, #c1b6ff, #f5a97f"
  xyChart:
    width: 1120
    height: 500
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
    title "F1 across tiers"
    x-axis ["In-domain", "Cross-val", "Pickwick"]
    y-axis 0 --> 1
    line "CRF" [0.9438, 0.2010, 0.0668]
    line "TextCNN" [0.8736, 0.2548, 0.0919]
    line "TinyBERT" [0.9707, 0.2896, 0.1004]
```

It wins the easy numbers. Pickwick barely moves.

[reasonable person closes the notebook here]


3. Different architectures, same failure

We kept going. Three classical architectures, each aimed at a different failure mode.

  • XGB+CRF: continuous GloVe features + CRF transitions
  • CharHybrid: char-level + word-level CNN + CRF
  • NeuralGBT+CRF: TextCNN softmax → GBT arbitration → CRF decode

XGB+CRF lands at 0.0673 on Pickwick, CharHybrid at 0.0743, NeuralGBT+CRF at 0.0941, and TinyBERT at 0.1004. Different architectures, same outcome: nobody is really generalising.

The stack is the clue.


4. The model learned names, not patterns

The cleanest test: vandalise the names.

Replace entity tokens in the in-domain test set two ways: with invented strings that have no GloVe vector, and with common words whose embeddings point away from the usual name cluster.
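Mechanically the swap is small: walk the gold labels and overwrite entity tokens, leaving the labels alone so the eval still scores the original spans. A sketch (the replacement pool is illustrative):

```python
import random

def vandalise(tokens, labels, replacements, seed=0):
    """Replace every token inside a B-/I- span with a word drawn
    from `replacements`, keeping the gold labels untouched so the
    original spans are still what gets scored."""
    rng = random.Random(seed)
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != "O":
            repl = rng.choice(replacements)
            # preserve the capitalisation of the original token
            out.append(repl.capitalize() if tok[:1].isupper() else repl)
        else:
            out.append(tok)
    return out

tokens = ["Mr.", "Darcy", "smiled", "at", "Elizabeth", "."]
labels = ["O", "B-PER", "O", "O", "B-PER", "O"]
print(vandalise(tokens, labels, ["zorfax", "threlk"]))
```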

| Model         | Baseline | Invented names | Common words |
|---------------|----------|----------------|--------------|
| CRF           | 0.9534   | 0.6466         | 0.6298       |
| NeuralGBT+CRF | 0.9451   | 0.6578         | 0.6247       |

About a thirty-point drop either way. Precision barely moves. Recall caves in.

The models were doing name lookup, not entity recognition.

When Darcy becomes Zorfax, they stop firing. When it becomes running, the embedding now actively points away from “name”.

Fix: poison the shortcut.

Replace 30% of entity spans during training with invented names. Labels unchanged. Context unchanged. Token identity becomes unreliable, forcing the model onto surrounding patterns.

```mermaid
---
config:
  theme: base
  fontFamily: "system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Helvetica Neue, Arial, sans-serif"
  themeVariables:
    xyChart:
      backgroundColor: "#f7f7fa"
      titleColor: "#1f2328"
      xAxisLabelColor: "#1f2328"
      xAxisTickColor: "#1f2328"
      xAxisLineColor: "#1f2328"
      yAxisLabelColor: "#1f2328"
      yAxisTickColor: "#1f2328"
      yAxisLineColor: "#1f2328"
      plotColorPalette: "#6a8759, #7e8fa6"
  xyChart:
    width: 1120
    height: 430
    titleFontSize: 18
    plotReservedSpacePercent: 40
    xAxis:
      labelFontSize: 12
      labelPadding: 14
      showTitle: false
    yAxis:
      labelFontSize: 12
      showTitle: false
---
xychart-beta
    title "Pickwick OOD: baseline plus augmentation lift"
    x-axis ["NGBT", "TinyBERT"]
    y-axis 0 --> 1
    bar "Augmented total" [0.7718, 0.2550]
    bar "Baseline" [0.0941, 0.1004]
```

NeuralGBT+CRF goes from 0.09 → 0.77 on Pickwick. It gives up about two in-domain F1 points and gains almost seventy where it matters.

TinyBERT improves too, just less dramatically.

WordPiece already softens the OOV problem by breaking unseen names into subwords. The GloVe-backed GBT had no fallback. Unknown names were dead space. Augmentation forced it to learn around that.
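A toy greedy longest-match splitter makes the contrast concrete: the subword model can assemble an unseen name out of known fragments, while a plain embedding-table lookup just misses. Vocabulary and vectors here are invented, not TinyBERT's actual vocab:

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.
    Returns None if no segmentation exists (i.e. would become [UNK])."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand       # continuation-piece marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return None
        pieces.append(piece)
        start = end
    return pieces

# Toy subword vocab: no full "pickwick", but useful fragments exist.
vocab = {"pick", "##wick", "dar", "##cy"}
glove = {"darcy": [0.1, 0.2]}            # pretend GloVe table

print(wordpiece_split("pickwick", vocab))   # ['pick', '##wick']
print(glove.get("pickwick"))                # None -> dead space for the GBT
```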

NeuralGBT+CRF (aug) at 0.77 still beats TinyBERT (aug) at 0.26 by roughly 3×.


Examples

What the three tiers actually test

Name overlap between train and val is normal for NER. Standard benchmarks split by document, not by entity name.

This setup uses a clean gradient instead:

| Tier | What the model has seen | What it tests |
|------|-------------------------|---------------|
| In-domain | Same author, same characters, held-out sentences | That learning happened at all |
| Cross-val† | Known authors, new books, mostly unseen names | Generalisation within distribution |
| Pickwick OOD | Nothing — cold start | Actual generalisation |

Pickwick is the only true cold-start test.


Some sentence-level examples make the failure mode clearer than another average. Key: crf = CRF, ngbt = NeuralGBT+CRF, aug = NeuralGBT+CRF (augmented).

  • Easy in-domain hit. "Live with me, dear Lady Bertram!" All three fire: crf=B-PER I-PER, ngbt=B-PER I-PER, aug=B-PER I-PER.

  • Possessive form breaks the baseline. The sudden termination of Colonel Brandon's visit at the park... crf=B-PER I-PER, ngbt=O O, aug=B-PER I-PER. CRF anchors on Colonel; the augmented model recovers the possessive form.

  • Some patterns transfer across books. "Do you know, Miss Linton, that brute Hareton laughs at me!" CRF misses all three tokens, while ngbt and aug get them from the title + capitalised-surname pattern.

  • Augmentation is what finally survives OOD. Pickwick undertook to drive... and Pickwick's Determination... both get crf=O, ngbt=O, aug=B-PER. The augmented model learned a transferable person-pattern instead of a name lookup.

  • Some OOD cases still beat everyone. ...the Company at the Peacock assembled... and ...Advantage of Dodson and Fogg... all stay O. Pub names and mid-phrase legal surnames still do not expose enough signal.


Tech specs

Data — v2 clean partition

| Split | Sentences | Source | Notes |
|-------|-----------|--------|-------|
| Training | ~17,930 | Austen (6 novels, 80% split) | Gazetteer-labeled |
| + Training | ~19,290 | David Copperfield + Jane Eyre | Seen authors |
| + Silver pool | 360 | 8 books × 45 sentences | CRF+spaCy scored, Claude Haiku labeled |
| In-domain test | 4,483 | Austen held-out (20%) | Same distribution as training |
| Cross-val† | 320 | Bleak House + Wuthering Heights | Known authors, unseen books |
| Pickwick OOD | 3,814 | Pickwick Papers | Entirely unseen — book, characters, register |
| Training total | 37,580 | | |

Augmented training: 37,580 × 4 copies (3 augmented + original) = 150,320 sentences. Each augmented copy replaces 30% of entity spans per sentence with invented strings.

Model specs

| Model | Key details |
|-------|-------------|
| CRF | sklearn-crfsuite, orthographic + honorific + suffix features, Viterbi decode |
| TextCNN | GloVe 100d frozen, kernels (2,3,4), 64 filters, softmax |
| XGB+CRF | HistGradientBoostingClassifier (251-dim: GloVe ±2 context + shape), proba → CRF |
| CharHybrid | Char-CNN (32 filters) + word-CNN (128 filters, kernels 2–5) + CRF layer |
| NeuralGBT+CRF | TextCNN softmax (6-dim) + GloVe + shape = 257-dim → GBT (150 trees, depth 5) → CRF |
| TinyBERT | 4L/312d, fine-tuned 4 epochs, lr=3e-5, first-subword label alignment |
| NeuralGBT+CRF (aug) | Same architecture, trained on augmented data |
| TinyBERT (aug) | Same architecture, fine-tuned on augmented data |
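To make the CRF row concrete, here is the shape of an orthographic + honorific + suffix feature function in the sklearn-crfsuite style. This is a sketch, not the exact feature set from the experiment; the honorific list and feature names are illustrative:

```python
HONORIFICS = {"mr", "mrs", "miss", "sir", "lady", "colonel", "captain"}

def word2features(sent, i):
    """Per-token feature dict, sklearn-crfsuite style:
    orthography, short suffixes, and an honorific cue from the left
    neighbour (the classic literary-NER signal: 'Mr. ___')."""
    w = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isupper": w.isupper(),
        "suffix3": w[-3:].lower(),
        "suffix2": w[-2:].lower(),
        "is_honorific": w.lower().rstrip(".") in HONORIFICS,
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
        feats["prev.is_honorific"] = sent[i - 1].lower().rstrip(".") in HONORIFICS
    else:
        feats["BOS"] = True
    return feats

sent = ["Mr.", "Pickwick", "smiled"]
print(word2features(sent, 1)["prev.is_honorific"])   # True
```

Note the trap these features set: `word.lower` puts the literal token in the feature space, which is exactly the name-lookup shortcut the name-swap experiment exposes.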

Name-swap augmentation

  • Each entity span is swapped with probability 0.3
  • Invented strings use random consonant-vowel alternation, e.g. Zorfax, Threlk
  • Capitalisation preserved; trailing punctuation preserved
  • Labels unchanged
  • Effect: token identity stops being reliable, so the model has to use context
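The steps above fit in a few lines. A sketch, with illustrative consonant/vowel alphabets and simplified punctuation handling:

```python
import random

CONSONANTS = "bcdfghjklmnprstvwz"
VOWELS = "aeiou"

def invented_name(rng, min_len=4, max_len=8):
    """Pronounceable nonsense via strict consonant-vowel alternation,
    e.g. 'Zorat' -- guaranteed to have no GloVe vector."""
    n = rng.randint(min_len, max_len)
    chars = [rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(n)]
    return "".join(chars).capitalize()

def augment(tokens, labels, rng, p=0.3):
    """Swap each B-/I- entity span with invented names with prob p.
    Labels stay untouched, so the model must learn the context that
    licenses the span, not the token identity."""
    out = list(tokens)
    i = 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j].startswith("I-"):
                j += 1
            if rng.random() < p:
                for k in range(i, j):
                    out[k] = invented_name(rng)
            i = j
        else:
            i += 1
    return out

rng = random.Random(42)
tokens = ["Lady", "Bertram", "wrote", "to", "Fanny"]
labels = ["B-PER", "I-PER", "O", "O", "B-PER"]
print(augment(tokens, labels, rng, p=1.0))
```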

Evaluation

  • seqeval span-level F1
  • BIO tags: O, B-PER, I-PER, B-LOC, I-LOC
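Span-level F1 means a prediction only counts if the whole span matches, type and boundaries both. A simplified re-implementation of the idea (not seqeval itself, which also handles edge cases like spans opened by a stray I- tag):

```python
def extract_spans(tags):
    """BIO tags -> set of (type, start, end) spans, end exclusive.
    A span closes on O, on a new B-, or on an I- of a different type."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or \
           (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # an I- of the open type simply extends the current span
    return spans

def span_f1(gold, pred):
    """Exact-match span F1: a predicted span is a true positive only
    if type, start, and end all agree with a gold span."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(round(span_f1(gold, pred), 3))   # 0.667: one span of two recovered
```

This strictness is why recall collapses in the name-swap tables: a model that tags only half of a two-token name scores zero on that span.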

Notes / potential callouts

  • XGBoost 3.x segfaults on Python 3.14 ARM (OpenMP). Used sklearn.HistGradientBoostingClassifier throughout.
  • Name-swap eval: precision barely moves on invented/common swaps (~0.89–0.92), recall collapses (~0.95 → ~0.48). The models are still precise when they fire. They just stop firing on unknown names.
  • TinyBERT's WordPiece tokeniser breaks OOV names into subword pieces (Jarndyce → Jar ##nd ##yce), which is why augmentation helps it less than the GBT.
  • BookCorpus includes Victorian-era text. TinyBERT likely saw Dickens-like prose before fine-tuning. That makes its cross-val less impressive, and its Pickwick failure more interesting.

Appendix — full results

| Model | In-domain | Cross-val† | Pickwick OOD |
|-------|-----------|------------|--------------|
| CRF | 0.9438 | 0.2010 | 0.0668 |
| TextCNN | 0.8736 | 0.2548 | 0.0919 |
| XGB+CRF | 0.9406 | 0.2046 | 0.0673 |
| CharHybrid | 0.9425 | 0.2256 | 0.0743 |
| NeuralGBT+CRF | 0.9212 | 0.2777 | 0.0941 |
| TinyBERT 4L/312d | 0.9707 | 0.2896 | 0.1004 |
| NeuralGBT+CRF (aug) | 0.8973 | 0.5753 | 0.7718 |
| TinyBERT 4L/312d (aug) | 0.9597 | 0.3330 | 0.2550 |

† Cross-val: Bleak House + Wuthering Heights — known authors, unseen books