Details
Research on word segmentation technology of acupuncture ancient books based on CmabBERT-BILSTM-CRF (Cited by: 2)
Document type: Journal article
Chinese title: 基于CmabBERT-BILSTM-CRF的针灸古籍分词技术研究
English title: Research on word segmentation technology of acupuncture ancient books based on CmabBERT-BILSTM-CRF
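Authors: Zhong Xinyu (钟昕妤)[1]; Li Yan (李燕)[1]; Xu Lina (徐丽娜)[1]; Chen Yueyue (陈月月)[1]; Shuai Yaqi (帅亚琦)[1]
First author: Zhong Xinyu (钟昕妤)
Affiliation: [1] School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou, Gansu 730101
First affiliation: School of Information Engineering (Educational Technology Center), Gansu University of Chinese Medicine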
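Year: 2023
Issue: 4
Pages: 11
Chinese journal name: 计算机时代
Foreign-language journal name: Computer Era
Funding: Construction of a Traditional Chinese Medicine Knowledge Graph Based on AI Deep Learning (2021LDA09002); Gansu University of Chinese Medicine Graduate Innovation Fund Project (2022CX137).
Language: Chinese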
Keywords: ancient acupuncture books (针灸古籍); word segmentation (分词); sequence tagging (序列标注); pre-training (预训练)
Abstract: Ancient acupuncture books contain many interchangeable characters (通假字), ambiguous words, and technical terms. Deep-learning word segmentation methods are limited by the fixed representations of static character vectors and by the lack of large-scale, high-quality corpora. To alleviate these problems, a pre-training strategy is introduced: starting from the ALBERT model, further training on a large collection of ancient Chinese medicine books yields the CmabBERT model, and a CmabBERT-BILSTM-CRF fusion model is built and applied to word segmentation of ancient acupuncture books. Experimental results show that, with a small-sample corpus, the fusion model achieves better segmentation performance than the Jieba tokenizer and the BILSTM-CRF and ALBERT-BILSTM-CRF models.