OCR postcorrection task 1

This is a BertForTokenClassification model that predicts whether a token is an OCR mistake or not. It is based on bert-base-multilingual-cased and fine-tuned on the dataset of the 2019 ICDAR competition on post-OCR correction. The dataset contains texts in the following languages:

  • BG
  • CZ
  • DE
  • EN
  • ES
  • FI
  • FR
  • NL
  • PL
  • SL

10% of the texts, stratified by language, were held out for validation. The test set is used as provided.
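A stratified hold-out of this kind can be sketched as follows. This is a minimal illustration rather than the code used to build the model: the function name `stratified_split`, the `(text_id, language)` input format, the seed, and the rounding rule are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(texts, val_fraction=0.1, seed=0):
    """Split (text_id, language) pairs into train and validation ids.

    The validation share is taken separately within each language, so
    every language contributes roughly `val_fraction` of its texts.
    """
    by_language = defaultdict(list)
    for text_id, language in texts:
        by_language[language].append(text_id)

    rng = random.Random(seed)  # fixed seed (assumed) keeps the split reproducible
    train, val = [], []
    for language in sorted(by_language):
        ids = by_language[language]
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical corpus: 20 texts each for two of the languages.
texts = [(f"{lang}-{i}", lang) for lang in ("DE", "EN") for i in range(20)]
train_ids, val_ids = stratified_split(texts)
```

With 20 texts per language and a fraction of 0.1, each language contributes two texts to the validation set.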

The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance below 0.3 were included in the training and validation sets. The test set was not filtered on edit distance.
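The windowing and filtering can be sketched as below. Several details are assumptions, since the text does not state them: the step between windows (75 here), normalization by the length of the longer string, and comparing the OCR text of a sequence with its ground truth. The names `windows`, `normalized_edit_distance`, and `keep_sequence` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # drop char_a
                               current[j - 1] + 1,       # add char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr, gold):
    """Edit distance divided by the length of the longer string (assumed)."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens, size=150, step=75):
    """Return overlapping windows of at most `size` tokens covering `tokens`."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[start:start + size]
            for start in range(0, len(tokens) - size + step, step)]


def keep_sequence(ocr, gold, threshold=0.3):
    """True if the sequence is clean enough for the train/validation sets."""
    return normalized_edit_distance(ocr, gold) < threshold


parts = windows(list(range(400)))
keep = keep_sequence("Tbe quick brovvn fox", "The quick brown fox")
```

A step smaller than the window size is what makes consecutive sequences overlap; the last window may be shorter than 150 tokens.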

There are 3 classes in the data:

  • 0: No OCR mistake
  • 1: Start token of an OCR mistake
  • 2: Inside token of an OCR mistake
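To make the label scheme concrete, the sketch below groups a sequence of per-token labels into mistake spans. It is an illustration only: the function name `mistake_spans` is hypothetical, and treating an inside label without a preceding start label as the beginning of a new mistake is an assumption.

```python
def mistake_spans(labels):
    """Group per-token labels (0, 1, 2) into (start, end) spans, end exclusive."""
    spans = []
    start = None  # index where the currently open mistake began, if any
    for i, label in enumerate(labels):
        if label == 1:
            # A start token closes any open mistake and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # An inside token extends the open mistake; a stray one
            # opens a new mistake (assumption).
            if start is None:
                start = i
        else:
            # A correct token closes any open mistake.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


spans = mistake_spans([0, 1, 2, 0, 1, 1, 0, 2])
# spans == [(1, 3), (4, 5), (5, 6), (7, 8)]
```

Using a separate start label lets two adjacent mistakes, such as tokens 4 and 5 above, be told apart from one two-token mistake.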

Results

Set    Loss
Train  0.224500
Val    0.285791
Test   0.4178357720375061

Average F1 by language:

BG    CZ    DE    EN    ES    FI    FR    NL    PL    SL
0.74  0.69  0.96  0.67  0.63  0.83  0.65  0.69  0.8   0.69

Demo

Space for this model.

Code
