Summary of Carlos Alonso: Inteligencia artificial aplicada a los archivos históricos. El proyecto Carabela
This virtual conference, organized by the General Archive of the Region of Murcia and presented by historian and archaeologist Carlos Alonso Villalobos, focuses on the application of artificial intelligence (AI) to historical archives, specifically through the Carabela Project. The project aims to facilitate access to, research on, and management of handwritten historical documents, particularly those related to underwater archaeological heritage such as historical shipwrecks.
Main Ideas and Concepts
- Context and Motivation:
- Traditional archival research on handwritten historical documents is slow and resource-intensive, because old manuscripts are hard to read and must be examined page by page.
- The Carabela Project seeks to leverage AI, particularly handwriting recognition technologies, to overcome these obstacles.
- The project is a collaboration involving the Polytechnic University of Valencia, the Andalusian Institute of Historical Heritage, and external collaborators, funded by the BBVA Foundation.
- Project Background:
- Originated from the need to protect and document underwater archaeological heritage, focusing on historical shipwrecks from the 15th to 19th centuries.
- Utilizes archival sources such as the General Archive of the Indies in Seville and the Provincial Historical Archive of Cádiz.
- Documents include manuscripts, textual records, and graphic materials like historical cartography.
- Challenges include low-resolution digitization (125 dpi), poor contrast, varied handwriting styles, and complex document layouts.
- Artificial Intelligence Methodology:
- The project applies advanced AI techniques, including neural networks and probabilistic models, for handwritten text recognition (HTR).
- The AI system uses two main models:
- Optical Model: Learns the visual patterns of handwritten words based on morphological features.
- Language Model: Uses human-corrected transcriptions to understand linguistic context and grammar, improving recognition accuracy.
- Probabilistic indices assign confidence scores to recognized words, enabling search and indexing even with imperfect transcriptions.
- Human-machine interaction is crucial: humans correct AI outputs to train and improve the system.
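The interplay of the two models and the resulting confidence scores can be illustrated with a minimal sketch. It assumes a toy setup that the talk does not describe in detail: each model returns a score per candidate word, the scores are multiplied and normalized, and every candidate is kept with its probability rather than only the best one. The function name `word_posteriors`, the candidate words, and all numbers are hypothetical.

```python
def word_posteriors(optical_scores, language_scores):
    """Combine per-word scores from two models into normalized confidences.

    optical_scores: dict word -> likelihood of the image given the word (assumed)
    language_scores: dict word -> probability of the word in context (assumed)
    """
    # Only words scored by both models can be combined.
    candidates = optical_scores.keys() & language_scores.keys()
    joint = {w: optical_scores[w] * language_scores[w] for w in candidates}
    total = sum(joint.values())
    if total == 0:
        return {}
    # Normalizing turns the joint scores into confidences that sum to 1.
    return {w: p / total for w, p in joint.items()}


# Hypothetical scores for one word image with three plausible readings.
optical = {"navio": 0.40, "nabio": 0.35, "nario": 0.25}
language = {"navio": 0.70, "nabio": 0.25, "nario": 0.05}

posteriors = word_posteriors(optical, language)
for word, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.3f}")
```

Keeping all candidates with their confidences, instead of committing to a single reading, is what lets a later search find a word even when the top transcription is wrong.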
- Work Process:
- Phase 1: Collection Formation
- Compiled and digitized approximately 150,000 documents, refined to 125,000 after removing duplicates.
- Documents include diverse handwriting styles and conservation states.
- Phase 2: Digital Document Processing
- Automatic detection of text boxes and lines within digitized images.
- Preprocessing to handle marginal annotations, crossed-out text, abbreviations, and ink bleed-through.
- Phase 3: Algorithm Training
- Selected 550 representative images covering various handwriting and document conditions.
- Used a specialized assisted-transcription tool, with symbolic notation for abbreviations and archaic terms.
- Transcriptions were modernized for easier search and access.
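The line detection step in Phase 2 can be pictured with a classic projection-profile sketch. This is a stand-in chosen for brevity, not the project's method: the summary does not say how Carabela detects lines, and real manuscript layouts usually need more robust layout analysis. The function name `detect_text_lines`, the binary image format, and the threshold are assumptions.

```python
def detect_text_lines(image, ink_threshold=1):
    """Return (top, bottom) row ranges that contain ink.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    ink_threshold: minimum ink pixels for a row to count as text (assumed).
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= ink_threshold
        if has_ink and start is None:
            start = y                       # a text line begins here
        elif not has_ink and start is not None:
            lines.append((start, y - 1))    # the line ended on the previous row
            start = None
    if start is not None:                   # a line touching the bottom edge
        lines.append((start, len(image) - 1))
    return lines


# A tiny hypothetical page: two ink bands separated by blank rows.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
print(detect_text_lines(page))  # [(1, 2), (5, 5)]
```

A profile like this breaks down on skewed lines, marginal annotations, and ink bleed-through, which is why the preprocessing listed above matters.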
- Search and Access System:
- Developed a web interface allowing probabilistic searches with adjustable confidence thresholds.
- Supports complex queries including synonyms grouped as macros (e.g., different types of ships).
- Allows proximity searches (e.g., terms appearing within a certain distance).
- Demonstrated examples include searches for terms like "captain general," "Francisco," "shipwreck," and place names.
- The system returns images with highlighted search terms and confidence scores.
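The three search features above (confidence thresholds, synonym macros, and proximity) can be sketched over a toy probabilistic index. Everything here is an assumption made for illustration: the `Entry` fields, the index contents, the `SHIP` macro, and the function names `search` and `near` are hypothetical and do not reflect the project's actual interface.

```python
from collections import namedtuple

# One word hypothesis in a toy probabilistic index (all fields are assumed).
Entry = namedtuple("Entry", "page position word confidence")

INDEX = [
    Entry(1, 0, "capitan", 0.95),
    Entry(1, 1, "general", 0.80),
    Entry(1, 7, "galeon", 0.60),
    Entry(2, 3, "nao", 0.30),
    Entry(2, 4, "naufragio", 0.90),
    Entry(3, 2, "fragata", 0.85),
    Entry(3, 40, "naufragio", 0.75),
]

# A macro groups synonyms under one query name, here kinds of ships.
MACROS = {"SHIP": {"galeon", "nao", "fragata"}}


def expand(term):
    """Return the set of words a query term stands for."""
    return MACROS.get(term, {term})


def search(index, term, min_conf=0.5):
    """Entries matching the term (or its macro) at or above the threshold."""
    words = expand(term)
    return [e for e in index if e.word in words and e.confidence >= min_conf]


def near(index, term_a, term_b, max_distance=5, min_conf=0.5):
    """Pairs of hits on the same page at most max_distance positions apart."""
    hits_a = search(index, term_a, min_conf)
    hits_b = search(index, term_b, min_conf)
    return [
        (a, b)
        for a in hits_a
        for b in hits_b
        if a.page == b.page and abs(a.position - b.position) <= max_distance
    ]


print([e.page for e in search(INDEX, "SHIP")])        # [1, 3]
print([e.page for e in search(INDEX, "SHIP", 0.2)])   # [1, 2, 3]
pairs = near(INDEX, "SHIP", "naufragio", max_distance=5, min_conf=0.2)
print([(a.page, a.word) for a, _ in pairs])           # [(2, 'nao')]
```

Lowering the threshold trades precision for recall: the low-confidence "nao" on page 2 only surfaces at 0.2, and it is exactly the hit that sits next to "naufragio".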
- Results and Impact:
- The system enabled discovery of approximately 400 new shipwreck references in minutes, compared to decades of traditional research.
- Achieved around 92% success in search and classification tasks.
- Success rates varied depending on digitization quality (e.g., lower for Cádiz archive due to poorer image quality).
- The tool is valuable for researchers and archivists alike, speeding up document location, transcription assistance, and metadata extraction.
- Demonstrated automatic pre-cataloging capabilities, including document type classification and keyword extraction with over 93% accuracy in tests.
- The system can reorder pages of documents correctly even if digitized out of order.
- Future Perspectives:
- AI tools can be scaled to millions of documents, limited mainly by computing power and processing time.
- The system requires adaptation and training for each archival collection due to differences in handwriting, document types, and digitization quality.
- The project is research-based, not a commercial software package; deployment requires collaboration with archives and tailored training.
- Potential spin-offs or service models may emerge from university research groups.
- The technology is language-agnostic and can be trained for different languages, including Finnish and Portuguese.
- Ongoing work includes expanding applications to automatic pre-cataloging and document classification in archives.
- The team encourages collaboration with other archives and institutions to expand the training data and improve the system.
- Limitations and Considerations:
- Transcriptions generated are probabilistic aids, not perfect paleographic transcriptions.
- Human expertise remains essential for validation and detailed research.
- Digitization quality critically affects AI performance.
Notable Quotes
— 15:07 — « We live in a very good moment in which there are indeed technologies that are capable of working with data at time scales far below, infinitely inferior to human capacity, and at the same time, artificial intelligence allows you to move forward towards making machines think. »
— 67:30 — « This is a bit of a corset with atrocious clothing, yes, but the system already has a deep learning of the Castilian graphics from the century, which is the first simile, that already has that deep wardrobe that exists of character recognition. »
— 74:01 — « Here what we are looking for is the help to the probabilistic index, that is to say the help that tells us what word possibilities these documents have, and that is what gives us all this search power from metadata to classify the documents. »
— 79:51 — « It allows you to do truncation, to brush the rabbits with terms with proximity, with eliminating one, let's say, of those in the digital press. It allows you to do thematic searches without you knowing anything about that documentation and locates it. »
— 82:03 — « This team from the Polytechnic University of Valencia is a world pioneer in this tool in terms of quality of results. »