Segmentext is a specialized language model for text segmentation. It has been trained to be resilient to broken and unstructured text, including digitization artifacts and poorly recognized layouts.
In contrast with most text-segmentation approaches, Segmentext is based on token classification. Editorial structures are reconstructed from the raw text alone, without any reference to the original layout.
Segmentext was trained using HPC resources from GENCI–IDRIS on Ad Astra with 3,500 examples of manually annotated texts, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).
Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.
Segmentext can be tested on PleIAs-Bad-Data-Editor, a free demo that also includes OCRonos, another model trained by PleIAs for the correction of OCR errors and other digitization artifacts.
Use
Segmentext supports the following segment types:
- Text
- Separator - a segmentation separator, generally based on newlines (rendered as ¶), with some variation depending on how the model understands the segmentation.
- Title
- Table
- Dialog - any kind of speaker-attributed turn or statement.
- Bibliography - statement of a specific bibliographic reference, either in a bibliography section or a footnote.
- Contact - personal information, can be especially useful in the context of PII removal.
- Paratext - any non-meaningful text included in standard documents, such as headers, page numbers, section recalls, etc.
- Author - author names and signatures.
- Date - statement of date and time, common in letters and newspaper articles.
- Keyword - list of keywords, especially common in scientific publications.
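Because the model classifies tokens rather than layout blocks, a downstream step usually has to merge labelled tokens back into segments. The sketch below illustrates one way to do that. It assumes the predictions are already available as a list of (token, label) pairs using the label names listed above; this input format and the helper names `group_segments` and `drop_labels` are hypothetical, not part of Segmentext itself.

```python
# Assumption: Separator tokens mark segment boundaries and are not kept in the output.
SEPARATOR = "Separator"


def group_segments(tagged_tokens):
    """Merge consecutive tokens sharing a label into (label, text) segments.

    A Separator token, or a change of label, closes the current segment.
    """
    segments = []
    current_label, current_tokens = None, []
    for token, label in tagged_tokens:
        if label == SEPARATOR or label != current_label:
            # Flush the segment built so far, if any.
            if current_tokens:
                segments.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if label != SEPARATOR:
            current_label = label
            current_tokens.append(token)
    if current_tokens:
        segments.append((current_label, " ".join(current_tokens)))
    return segments


def drop_labels(segments, unwanted=("Paratext", "Contact")):
    """Remove segments whose label is unwanted, e.g. for cleanup or PII removal."""
    return [(label, text) for label, text in segments if label not in unwanted]


# Hypothetical token-level predictions for a short document.
tagged = [
    ("Annual", "Title"), ("Report", "Title"), ("¶", "Separator"),
    ("Page", "Paratext"), ("3", "Paratext"), ("¶", "Separator"),
    ("Revenue", "Text"), ("grew", "Text"), ("steadily.", "Text"), ("¶", "Separator"),
    ("Costs", "Text"), ("fell.", "Text"),
]

segments = group_segments(tagged)
cleaned = drop_labels(segments)
for label, text in cleaned:
    print(f"{label}: {text}")
```

Joining tokens with a single space is a simplification. With subword tokens, the original character offsets would be a safer way to recover the exact text of each segment.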
Example