# NER-Pytorch-Chinese

**Repository Path**: lvbs/ner-pytorch-chinese

## Basic Information

- **Project Name**: NER-Pytorch-Chinese
- **Description**: NER series: Chinese entity recognition models in practice. Built on PyTorch, this project evaluates common NER paradigms on Chinese NER datasets of different kinds (flat, nested, discontinuous). Environment: python==3.8, transformers>=4.12.3, torch==1.8.0
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-25
- **Last Updated**: 2025-06-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# NER Series: Chinese Entity Recognition Models in Practice

## Introduction

Built on PyTorch, this project evaluates how common NER paradigms perform on Chinese NER datasets of different kinds (flat, nested, and discontinuous).

The NER model series covers:

1. BERT-Softmax, BERT-CRF, BERT-BiLSTM-Softmax, BERT-BiLSTM-CRF (a minimal sketch of the BERT-Softmax head appears at the end of this README)
2. Word-feature models (lexicon enhancement): FlatNER, [LEBERT](https://arxiv.org/abs/2105.07148)
3. PointerNet (to do)
4. MRC (Machine Reading Comprehension)
5. Span-based NER (to do)

### Dataset Introduction

The models are mainly tested on the following Chinese NER datasets:

- **Flat** NER datasets: Ontonote4, MSRA
- **Nested** NER datasets: ACE 2004, ACE 2005
- **Discontinuous** NER datasets: CADEC

General sequence-labeling NER data is processed into the following format (a loading sketch appears at the end of this README):

```yaml
{
  "text": ["吴", "重", "阳", ",", "中", "国", "国", "籍", ","],
  "label": ["B-NAME", "I-NAME", "I-NAME", "O", "B-CONT", "I-CONT", "I-CONT", "I-CONT", "O"]
}
```

Machine-reading-comprehension NER (MRC-NER) data is processed into the following format (a decoding sketch also appears at the end of this README):

```yaml
{
  "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
  "end_position": [4, 15],
  "entity_label": "NS",
  "impossible": false,
  "qas_id": "3820.1",
  "query": "按照地理位置划分的国家,城市,乡镇,大洲",
  "span_position": ["2;4", "7;15"],
  "start_position": [2, 7]
}
```

## Environment

python==3.8, transformers>=4.12.3, torch==1.8.0

Or install the dependencies with:

```
pip install -r requirements.txt
```

## Project Structure

- config: model parameter definitions
- datasets: data pipelines
- losses: loss functions
- metrics: evaluation metrics
- models: the BERT-based model implementations
- output: output directory for saved models and training logs
- processors: data processing
- script: shell scripts
- utils: utility classes
- train.py: main entry point

## Usage

### Quick Start

You can start training a model with the shell scripts below.

1. Train any NER model except the MRC model:

```
bash script/train.sh
```

2. Train the MRC model:

```
bash script/mrc_train.sh
```

### Results

Best F1 scores on the test sets:

| model / F1 score            | MSRA       | Ontonote   |
|-----------------------------|------------|------------|
| BERT-Softmax                | 0.9553     | 0.8181     |
| BERT-BiLSTM-Softmax         | __0.9566__ | 0.8177     |
| BERT-BiLSTM-LabelSmooth     | 0.9549     | 0.8215     |
| BERT-CRF                    | 0.9562     | 0.8218     |
| BERT-BiLSTM-CRF             | 0.9561     | __0.8227__ |
| BERT-BiLSTM-CRF-LabelSmooth | 0.9547     | 0.8216     |
| BERT-BiLSTM-CRF-LEBERT      | 0.9518     | 0.8094     |
| BERT-BiLSTM-Softmax-LEBERT  | 0.9544     | 0.8196     |
| MRC                         | 0.942      | 0.812      |

#### Speed

GPU: RTX 3060 Ti (8 GB). Taking the MSRA dataset (41,728 training examples) as an example, approximate end-to-end training times are listed below; overall, the CRF-based models are noticeably slower.

| model               | time      | batch_size |
|---------------------|-----------|------------|
| BERT-Softmax        | 6min 14s  | 24         |
| BERT-BiLSTM-Softmax | 6min 46s  | 24         |
| BERT-CRF            | 8min 06s  | 24         |
| BERT-BiLSTM-CRF     | 8min 20s  | 24         |
| MRC                 | 50min 10s | 4          |

## Papers & References

- [A Unified MRC Framework for Named Entity Recognition](https://arxiv.org/abs/1910.11476)
- [Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter](https://arxiv.org/abs/2105.07148)
- [Tencent AI Lab Embeddings](https://ai.tencent.com/ailab/nlp/en/embedding.html)
- https://github.com/yangjianxin1/LEBERT-NER-Chinese
- https://github.com/lonePatient/BERT-NER-Pytorch
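## Appendix: Illustrative Sketches

The sketches below are illustrative companions to the sections above; they are not this repository's own code. First, a minimal BERT-Softmax token-classification head of the kind named in the Introduction. The checkpoint name `bert-base-chinese`, the dropout rate, and the class name are assumptions for illustration; a BERT-CRF variant would feed the same emission logits into a CRF layer instead of computing token-wise cross-entropy.

```python
# Minimal sketch (not this repository's code): a BERT-Softmax head for
# sequence labeling. "bert-base-chinese" and the 0.1 dropout rate are
# illustrative assumptions.
import torch.nn as nn
from transformers import BertModel

class BertSoftmaxNER(nn.Module):
    def __init__(self, num_labels: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        # ignore_index=-100 skips padding/special tokens in the loss
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))  # (batch, seq, num_labels)
        if labels is None:
            return logits
        loss = self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits
```

A BERT-BiLSTM variant would insert an `nn.LSTM(..., bidirectional=True)` between the BERT encoder and the classifier.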
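Second, a sketch of reading the sequence-labeling format shown in the Dataset Introduction. The JSON-lines layout (one `{"text": ..., "label": ...}` object per line) and any file name passed in are assumptions, not a documented interface of this project.

```python
# Minimal sketch (assumed layout: one {"text": [...], "label": [...]} object
# per line). Builds a label vocabulary and checks that each example carries
# exactly one label per character.
import json

def load_flat_ner(path: str):
    examples, label2id = [], {"O": 0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            chars, tags = obj["text"], obj["label"]
            assert len(chars) == len(tags), "one label per character expected"
            for tag in tags:
                label2id.setdefault(tag, len(label2id))
            examples.append((chars, [label2id[t] for t in tags]))
    return examples, label2id
```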
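Finally, a sketch of decoding one MRC-NER record of the format shown in the Dataset Introduction. Judging from the sample record, each `span_position` string pairs a start and an inclusive end index ("start;end") into the whitespace-separated `context`, mirroring `start_position`/`end_position`; the helper name is hypothetical.

```python
# Minimal sketch: turn one MRC-NER record into (entity_text, start, end)
# tuples. Indices are token positions into the whitespace-separated context,
# with the end index inclusive, as in the sample record.
def decode_mrc_record(record: dict):
    tokens = record["context"].split()
    spans = []
    for span in record["span_position"]:
        start, end = (int(i) for i in span.split(";"))
        spans.append(("".join(tokens[start:end + 1]), start, end))
    return spans

record = {
    "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
    "entity_label": "NS",
    "span_position": ["2;4", "7;15"],
}
print(decode_mrc_record(record))  # [('马拉维', 2, 4), ('利隆圭政府办公大楼', 7, 15)]
```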