# bert-sklearn

**Repository Path**: ooco/bert-sklearn

## Basic Information

- **Project Name**: bert-sklearn
- **Description**: a sklearn wrapper for Google's BERT model
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2019-12-02
- **Last Updated**: 2022-02-05

## README

# scikit-learn wrapper to finetune BERT


 A scikit-learn wrapper to finetune [Google's BERT](https://github.com/google-research/bert) model for text and token sequence tasks based on the [huggingface pytorch](https://github.com/huggingface/pytorch-pretrained-BERT) port.
 
* Includes configurable MLP as final classifier/regressor for text and text pair tasks
* Includes token sequence classifier for NER, PoS, and chunking tasks
* Includes [**`SciBERT`**](https://github.com/allenai/scibert) and [**`BioBERT`**](https://github.com/dmis-lab/biobert) pretrained models for scientific and biomedical domains


Try in [Google Colab](https://colab.research.google.com/drive/1-wTNA-qYmOBdSYG7sRhIdOrxcgPpcl6L)!


## installation

requires python >= 3.5 and pytorch >= 0.4.1

```bash
git clone -b master https://github.com/charles9n/bert-sklearn
cd bert-sklearn
pip install .
```

## basic operation

**`model.fit(X,y)`**, i.e., finetune **`BERT`**

* **`X`**: list, pandas dataframe, or numpy array of text, text pairs, or token lists

* **`y`** : list, pandas dataframe, or numpy array of labels/targets

```python3
from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

# define model
model = BertClassifier()         # text/text pair classification
# model = BertRegressor()        # text/text pair regression
# model = BertTokenClassifier()  # token sequence classification

# finetune model
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

# make probability predictions
y_proba = model.predict_proba(X_test)

# score model on test data
model.score(X_test, y_test)

# save model to disk
savefile = '/data/mymodel.bin'
model.save(savefile)

# load model from disk
new_model = load_model(savefile)

# do stuff with new model
new_model.score(X_test, y_test)
```
See [demo](https://github.com/charles9n/bert-sklearn/blob/master/demo.ipynb) notebook.
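
The three input kinds listed above (text, text pairs, token lists) can be pictured with toy data. The sketch below is illustrative only: the sample sentences, the labels, the two-column layout for text pairs, the per-token label lists, and the helper `check_aligned` are all assumptions made for this example, not part of the bert-sklearn API.

```python
# Hypothetical toy inputs, one per task type named above.

# 1. Single texts: one string per sample, one label per sample.
X_text = ["the movie was great", "the plot made no sense"]
y_text = ["pos", "neg"]

# 2. Text pairs: two strings per sample (assumed layout: one [text_a, text_b] row each).
X_pairs = [
    ["A man is playing a guitar.", "A person plays an instrument."],
    ["A dog runs on the beach.", "The cat sleeps indoors."],
]
y_pairs = ["entailment", "neutral"]

# 3. Token sequences: one token list per sample (assumed: one tag per token).
X_tokens = [["John", "lives", "in", "Berlin"], ["She", "joined", "Google"]]
y_tokens = [["B-PER", "O", "O", "B-LOC"], ["O", "O", "B-ORG"]]


def check_aligned(X, y):
    """Return True if X and y have the same number of samples and, for
    token-level targets, every tag list matches its token list in length."""
    if len(X) != len(y):
        return False
    for sample, target in zip(X, y):
        if isinstance(target, list) and len(sample) != len(target):
            return False
    return True


print(check_aligned(X_text, y_text))
print(check_aligned(X_pairs, y_pairs))
print(check_aligned(X_tokens, y_tokens))
```

A check like this before calling `fit` can catch a dropped label or a mis-tokenized sentence early, before any finetuning time is spent.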

## model options

```python3
# try different options...
model.bert_model = 'bert-large-uncased'
model.num_mlp_layers = 3
model.max_seq_length = 196
model.epochs = 4
model.learning_rate = 4e-5
model.gradient_accumulation_steps = 4

# finetune
model.fit(X_train, y_train)

# do stuff...
model.score(X_test, y_test)
```
See [options](https://github.com/charles9n/bert-sklearn/blob/master/Options.md)
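
For readers unfamiliar with the `gradient_accumulation_steps` option, the general technique can be sketched in plain Python. This is a generic illustration on a made-up one-parameter regression problem, not bert-sklearn's internal code; how the library splits its batches is not described here, and the data, `grad`, and the batch sizes below are assumptions for the example.

```python
def grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical data for a one-parameter linear model y ~ w * x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.0

accumulation_steps = 4
micro_batch_size = len(data) // accumulation_steps  # 2 samples per micro-batch

# Gradient computed on the full batch in one pass.
full_batch_grad = grad(w, data)

# Same gradient, built up from 4 micro-batches. Each micro-batch gradient is
# scaled by 1 / accumulation_steps so the sum equals the full-batch mean.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated_grad += grad(w, micro_batch) / accumulation_steps

print(abs(full_batch_grad - accumulated_grad) < 1e-9)
```

The point of the technique is that only one micro-batch has to fit in memory at a time, which matters for larger models such as `bert-large-uncased`.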


## hyperparameter tuning

```python3
from sklearn.model_selection import GridSearchCV
from bert_sklearn import BertClassifier

params = {'epochs':[3, 4], 'learning_rate':[2e-5, 3e-5, 5e-5]}

# wrap classifier in GridSearchCV
clf = GridSearchCV(BertClassifier(validation_fraction=0), 
                    params,
                    scoring='accuracy',
                    verbose=True)

# fit gridsearch
clf.fit(X_train, y_train)
```
See [demo_tuning_hyperparameters](https://github.com/charles9n/bert-sklearn/blob/master/demo_tuning_hyperparams.ipynb) notebook.
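
Because every grid point triggers a full finetuning run, it can help to count the fits before starting. The sketch below expands the same `params` grid with the standard library only. The fold count of 5 and the extra refit on the best setting are scikit-learn defaults in recent versions (an assumption about the installed version), and `expand_grid` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

params = {'epochs': [3, 4], 'learning_rate': [2e-5, 3e-5, 5e-5]}


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


candidates = expand_grid(params)

n_folds = 5  # assumed: default `cv` in recent scikit-learn versions
n_fits = len(candidates) * n_folds + 1  # +1 for the final refit (refit=True)

for candidate in candidates:
    print(candidate)
print("total finetuning runs:", n_fits)
```

Passing a smaller `cv` value to `GridSearchCV`, or trimming the grid, reduces this count proportionally.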

## GLUE datasets
The train and dev data sets from the [GLUE (General Language Understanding Evaluation)](https://github.com/nyu-mll/GLUE-baselines) benchmark were used with the `bert-base-uncased` model, and the results were compared against those reported in the Google paper and on the [GLUE leaderboard](https://gluebenchmark.com/leaderboard).

|    | MNLI(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
| - | - | - | - | - | - | - | - | - |
| BERT base (leaderboard) | 84.6/83.4 | 89.2 | 90.1 | 93.5 | 52.1 | 87.1 | 84.8 | 66.4 |
| bert-sklearn | 83.7/83.9 | 90.2 | 88.6 | 92.32 | 58.1 | 89.7 | 86.8 | 64.6 |

Individual runs can be found [here](https://github.com/charles9n/bert-sklearn/tree/master/glue_examples).

## CoNLL-2003 Named Entity Recognition (NER)

NER results for the [**`CoNLL-2003`**](https://www.clips.uantwerpen.be/conll2003/ner/) shared task:

|    | dev f1 | test f1   |
| - | - | - |
| BERT paper| 96.4 | 92.4|
| bert-sklearn | 96.04 | 91.97|

Span level stats on test:
```bash
processed 46666 tokens with 5648 phrases; found: 5740 phrases; correct: 5173.
accuracy:  98.15%; precision:  90.12%; recall:  91.59%; FB1:  90.85
              LOC: precision:  92.24%; recall:  92.69%; FB1:  92.46  1676
             MISC: precision:  78.07%; recall:  81.62%; FB1:  79.81  734
              ORG: precision:  87.64%; recall:  90.07%; FB1:  88.84  1707
              PER: precision:  96.00%; recall:  96.35%; FB1:  96.17  1623
```
See [ner_english notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_english.ipynb) for a demo using `'bert-base-cased'` model.
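
The overall precision, recall, and FB1 in the span-level report follow directly from the three phrase counts on its first line. The short sketch below recomputes them; the variable names are chosen for this example, and the formulas are the standard precision, recall, and F1 definitions.

```python
# Counts taken from the first line of the span-level report above.
gold_phrases = 5648     # phrases in the annotated test set
found_phrases = 5740    # phrases predicted by the model
correct_phrases = 5173  # predicted phrases that exactly match a gold phrase

precision = correct_phrases / found_phrases
recall = correct_phrases / gold_phrases
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {100 * precision:.2f}%")
print(f"recall:    {100 * recall:.2f}%")
print(f"FB1:       {100 * f1:.2f}")
```

Rounded to two decimals, these reproduce the 90.12%, 91.59%, and 90.85 shown in the report.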

## NCBI Biomedical NER

NER results using bert-sklearn with **`SciBERT`** and **`BioBERT`** on the [**`NCBI disease Corpus`**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/) name recognition task.

The previous [SOTA](https://arxiv.org/pdf/1711.07908.pdf) for this task was a test-set f1 of **87.34**.


|    | test f1 (bert-sklearn) | test f1 (from papers) |
| - | - | - |
| BERT base cased | 85.09 | 85.49 |
| SciBERT basevocab cased | 88.29 | 86.91 |
| SciBERT scivocab cased | 87.73 | 86.45 |
| BioBERT pubmed_v1.0 | 87.86 | 87.38 |
| BioBERT pubmed_pmc_v1.0 | 88.26 | 89.36 |
| BioBERT pubmed_v1.1 | 87.26 | NA |

See [ner_NCBI_disease_BioBERT_SciBERT notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_NCBI_disease_BioBERT_SciBERT.ipynb) for a demo using **`SciBERT`** and **`BioBERT`** models.

See [SciBERT paper](https://arxiv.org/pdf/1903.10676.pdf) and [BioBERT paper](https://arxiv.org/pdf/1901.08746.pdf) for more info on the respective models.

## Other examples

* See [IMDb notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/IMDb.ipynb) for a text classification demo on the Internet Movie Database review sentiment task.

* See [chunking_english notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/chunker_english.ipynb) for a demo on syntactic chunking using the [**`CoNLL-2000`**](https://www.clips.uantwerpen.be/conll2000/chunking/) chunking task data.

* See [ner_chinese notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_chinese.ipynb) for a demo using `'bert-base-chinese'` for Chinese NER.


## tests

Run tests with pytest:
```bash
python -m pytest -sv tests/
```

## references

* [Google `BERT` github](https://github.com/google-research/bert) and [paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (10/2018) by J. Devlin, M. Chang, K. Lee, and K. Toutanova](https://arxiv.org/abs/1810.04805)

* [huggingface `pytorch-pretrained-BERT` github](https://github.com/huggingface/pytorch-pretrained-BERT)

* [`SciBERT` github](https://github.com/allenai/scibert) and [paper: "SCIBERT: Pretrained Contextualized Embeddings for Scientific Text" (3/2019) by I. Beltagy, A. Cohan, and K. Lo](https://arxiv.org/pdf/1903.10676.pdf)

* [`BioBERT` github](https://github.com/dmis-lab/biobert) and [paper: "BioBERT: a pre-trained biomedical language representation model for biomedical text mining" (2/2019) by J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, and J. Kang](https://arxiv.org/pdf/1901.08746.pdf)