Deep learning for information extraction from digital documents: An innovative approach to automatic parsing and rich text extraction from PDF files

Publisher

IGI Global

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Print-oriented PDF documents preserve the position of text and other objects well but are difficult to process programmatically. Processable PDF documents can address the needs of different sectors by enabling innovations such as searching within documents, linking across documents, and restructuring content into formats that improve the reading experience. This chapter presents a deep learning-based system design that aims to export clean text content, separate all visual elements, and extract rich information without losing the integrated structure of the content types. An F-RCNN model built with the Detectron2 library extracts the layout, cosine similarity between the word2vec representations of the texts identifies related clips, and transformer language models classify the clip type. On a 200-sample data set created by the researchers, the system achieved a word error rate (WER) of 1.87 and a character error rate (CER) of 2.11 on headings, and a WER of 0.22 and a CER of 0.21 on paragraphs. © 2023 Elsevier B.V., All rights reserved.
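
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between the reference and the extracted text, divided by the reference length, counted in words for WER and in characters for CER. The abstract does not say how the authors implemented the metrics or whether the reported values are percentages, so the function names (edit_distance, wer, cer) and the sample strings are illustrative assumptions, not the chapter's code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and extracted text, for illustration only.
    truth = "deep learning for pdf parsing"
    extracted = "deep learning for pdf parsng"
    print(f"WER = {wer(truth, extracted):.4f}")
    print(f"CER = {cer(truth, extracted):.4f}")
```

In this sketch a single dropped character counts as one wrong word out of five but only one wrong character out of twenty-nine, which is why WER and CER for the same output can differ substantially.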
