Details
Artificial intelligence-driven multilingual corpus for enhancing information retrieval in academic libraries
Document type: Journal article
English title: Artificial intelligence-driven multilingual corpus for enhancing information retrieval in academic libraries
Authors: Wu, Mingwei[1]; Luo, Xiaofeng[1]; Wang, Xiulan[1]
First author: 吴明蔚
Corresponding author: Wang, XL[1]
Affiliation: [1] Gansu Univ Tradit Chinese Med, Sch Hlth Management, Lanzhou 730000, Peoples R China
First affiliation: 甘肃中医药大学
Corresponding affiliation: [1] (corresponding author) Gansu Univ Tradit Chinese Med, Sch Hlth Management, Lanzhou 730000, Peoples R China | [10735] 甘肃中医药大学
Year: 2025
Journal: INFORMATION DEVELOPMENT
Indexed in: Scopus (accession no. 2-s2.0-105019957326); WOS: SSCI (accession no. WOS:001604685900001)
Language: English
Keywords: multilingualism; information retrieval; natural language processing; metadata; low-resource languages
Abstract: Academic libraries in Sub-Saharan Africa and China face persistent challenges in cross-lingual retrieval due to linguistic fragmentation and uneven metadata infrastructures. This study constructed and evaluated a multilingual academic corpus designed to enhance semantic retrieval, metadata interoperability, and inclusive access across 13 languages. The core innovation lies in the Multilingual Adaptive Corpus for Retrieval Equity (MACRE), a modular architecture that integrates language-specific adapters, ontology-driven metadata harmonisation, and an intent disambiguation engine, features that collectively surpass existing retrieval frameworks. The project aligns with core LIS objectives by advancing user-centred discovery, cross-language cataloguing, and metadata standardisation in institutional repositories. A 41,129,582-token corpus was compiled from 39,725 academic records drawn from university repositories across both regions. The corpus incorporated Mandarin, English, and 11 African languages selected to reflect regional LIS priorities. Metadata was harmonised to SKOS and Schema.org standards. The proposed MACRE retrieval model was benchmarked against ColBERT-X, SwahiliDocBERT, and CrossLingual2Vec using cosine similarity, MRR, MAP, and NDCG. Evaluation included ablation and post hoc analysis. Mandarin and English accounted for 64.6% of all tokens; Swahili reached 16.9%, while nine African languages contributed under 1.8% each. MACRE significantly outperformed all baselines (MRR = 0.864; MAP = 0.812; p < .001), particularly in LIS-aligned fields such as metadata accuracy (98.6%) and entity completion (94.9%). Adapter performance exceeded 90% in dominant languages but revealed key gaps in under-annotated African records. These findings illustrate that retrieval accuracy is not just a technical challenge, but also reflects underlying LIS concerns, such as language equity, cataloguing depth, and metadata policy enforcement.
This study contributes a scalable LIS infrastructure for multilingual academic retrieval, advancing both technological and policy innovations for cross-lingual access in library systems.
