# bert

**Repository Path**: ibsyl/bert

## Basic Information

- **Project Name**: bert
- **Description**: Use BERT as feature. TensorFlow code and pre-trained models for BERT
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2021-08-09
- **Last Updated**: 2022-10-21

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

[TOC]

# Use BERT as feature

1. How do you call BERT so that an input sentence is turned into vectors?
2. If you want to use BERT as a low-level feature extractor in your own code, do you really need as much code as the official run_classifier.py example?

# Environment

```python
mac:     tf==1.4.0  python=2.7
windows: tf==1.12   python=3.5
```

# Entry point

Load the pre-trained model and run prediction on a sentence.

- Entry script: bert_as_feature.py
- Set `data_root` to the directory of the model files.
- Pre-trained model used: chinese_L-12_H-768_A-12

Core code:

```python
import tensorflow as tf

import modeling
import tokenization

# bert_config, init_check_point and bert_vocab_file are built from data_root
# (see bert_as_feature.py for the full setup).

# graph
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

# Initialize BERT
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# Load the pre-trained BERT weights from the checkpoint
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_check_point)
tf.train.init_from_checkpoint(init_check_point, assignment)

# Last and second-to-last encoder layers
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    token = tokenization.CharTokenizer(vocab_file=bert_vocab_file)
    query = u'Jack,请回答1988, UNwant\u00E9d,running'
    split_tokens = token.tokenize(query)
    word_ids = token.convert_tokens_to_ids(split_tokens)
    word_mask = [1] * len(word_ids)
    word_segment_ids = [0] * len(word_ids)
    fd = {input_ids: [word_ids], input_mask: [word_mask], segment_ids: [word_segment_ids]}
    last, last2 = sess.run([encoder_last_layer, encoder_last2_layer], feed_dict=fd)
    print('last shape:{}, last2 shape: {}'.format(last.shape, last2.shape))
```

Full code: [bert_as_feature.py](https://github.com/InsaneLife/bert/blob/master/bert_as_feature.py)

Code repository: https://github.com/InsaneLife/bert

Chinese model download: **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

# Final result

Last layer and second-to-last layer:

last shape:(1, 14, 768), last2 shape: (1, 14, 768)

```
# last value
[[ 0.8200665   1.7532703  -0.3771637  ... -0.63692784 -0.17133102  0.01075665]
 [ 0.79148203 -0.08384223 -0.51832616 ...  0.8080162   1.9931345   1.072408  ]
 [-0.02546642  2.2759912  -0.6004753  ... -0.88577884  3.1459959  -0.03815675]
 ...
 [-0.15581022  1.154014   -0.96733016 ... -0.47922543  0.51068854  0.29749477]
 [ 0.38253042  0.09779643 -0.39919692 ...  0.98277044  0.6780443  -0.52883977]
 [ 0.20359193 -0.42314947  0.51891303 ... -0.23625426  0.666618    0.30184716]]
```
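The outputs above are per-token vectors. If a single fixed-size sentence vector is needed, one common option (not part of the original script) is to pool over the token axis. A minimal sketch, assuming `last` is the `(1, seq_len, 768)` NumPy array returned by `sess.run` above:

```python
import numpy as np

# last: (batch, seq_len, 768); batch == 1 in the example above.
sent_vec_mean = last[0].mean(axis=0)   # mean pooling over tokens -> (768,)
sent_vec_max = last[0].max(axis=0)     # max pooling over tokens  -> (768,)

print(sent_vec_mean.shape, sent_vec_max.shape)  # (768,) (768,)
```

Note that the snippet above does not prepend [CLS]/[SEP] to the input, so taking position 0 would give the first character's vector rather than a [CLS] embedding.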
# Preprocessing

`tokenization.py` handles preprocessing of the input sentence and contains two main classes: `BasicTokenizer` and `FullTokenizer`.

`BasicTokenizer` splits the text character by character for Chinese, keeps English words whole, and merges consecutive digits, for example:

```
query: 'Jack,请回答1988, UNwant\u00E9d,running'
token: ['jack', ',', '请', '回', '答', '1988', ',', 'unwanted', ',', 'running']
```

`FullTokenizer` additionally applies WordPiece matching to the English tokens, splitting words into sub-words, e.g. running becomes run and ##ing; this mainly affects English text:

```
query: 'UNwant\u00E9d,running'
token: ["un", "##want", "##ed", ",", "runn", "##ing"]
```

For Chinese data, especially NER, keeping digits and English words as whole tokens produces many UNKs, so they should be split apart. The desired result:

```
query: 'Jack,请回答1988'
token: ['j', 'a', 'c', 'k', ',', '请', '回', '答', '1', '9', '8', '8']
```

The concrete change is as follows:

```python
class CharTokenizer(object):
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            # Split every basic token (including English words and digit runs)
            # into single characters.
            for sub_token in token:
                split_tokens.append(sub_token)
        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_tokens_to_ids(self.vocab, tokens)
```
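A short usage sketch of the modified tokenizer. The vocab path below is an assumption; point it at the vocab.txt inside the unzipped chinese_L-12_H-768_A-12 directory:

```python
import tokenization

# Assumed local path to the downloaded Chinese model's vocabulary.
vocab_file = 'chinese_L-12_H-768_A-12/vocab.txt'

tokenizer = tokenization.CharTokenizer(vocab_file=vocab_file)
tokens = tokenizer.tokenize(u'Jack,请回答1988')
print(tokens)
# Expected: ['j', 'a', 'c', 'k', ',', '请', '回', '答', '1', '9', '8', '8']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # one vocabulary id per character
```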