How to Use Chinese ELECTRA

By 泥膩泥膩, published in 男友說我是宅包, Jul 7, 2020 (12 min read)

According to Wikipedia, ELECTRA (Electra, 厄勒克特拉) is a female figure in Greek mythology.

OK, that's not the point XD

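ymcui/Chinese-ELECTRA

ELECTRA, the latest pre-trained model jointly developed by Google and Stanford University, has drawn wide attention for its compact model size and strong performance. …

github.com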

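I ran into a few problems recently while surveying Chinese ELECTRA, so I'm writing them down here QQ

From the Chinese-ELECTRA GitHub repo above I downloaded:

1. Chinese-ELECTRA (the repo itself)
2. The model it provides: ELECTRA-small, Chinese

Then I followed its "Usage" section and hit some problems, because things didn't quite match what it says XD So here I record the steps I followed, the problems I ran into, and how I solved them.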

Step 0: Download Chinese-ELECTRA

$ git clone https://github.com/ymcui/Chinese-ELECTRA.git

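Step 1: Download the pre-trained model and unzip it

Here I downloaded ELECTRA-small, Chinese.

Then I put it under the Chinese-ELECTRA folder, in a folder I named electra-small.

Under ./Chinese-ELECTRA/electra-small/ there are three folders:

1. chinese_electra_small_L-12_H-256_A-4/ I only use this one to remind myself which model this is; it has no other purpose XD
2. finetuning_data/ You need to create another folder inside this one XD and put train.json and dev.json there, for example: ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/
3. models/ Create another folder inside this one to hold the pre-trained model, and put all five unzipped files into it (electra_model.*, vocab.txt, checkpoint), for example: ./Chinese-ELECTRA/electra-small/models/electra-small/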
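If you would rather not create these folders by hand, here is a minimal sketch that builds the same layout. The function name make_layout and its parameters are my own, purely for illustration; the folder names are the ones listed above.

```python
from pathlib import Path


def make_layout(root, task="cmrc2018", model="electra-small"):
    """Create the three folders described above under <root>/electra-small."""
    data_dir = Path(root) / "electra-small"
    folders = [
        # Label-only folder; it just records which model this is.
        data_dir / "chinese_electra_small_L-12_H-256_A-4",
        # train.json and dev.json go here.
        data_dir / "finetuning_data" / task,
        # electra_model.*, vocab.txt and checkpoint go here.
        data_dir / "models" / model,
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling make_layout("Chinese-ELECTRA") from the directory that contains the cloned repo should create all three folders, and it is safe to run again because existing folders are left alone.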

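Step 2: Prepare the task data

Download the CMRC 2018 training set and validation set, then rename them to train.json and dev.json.

Put these two files in ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/

(This differs a little from the path given on GitHub; anyway, it runs if you put them in the path above.)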
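The rename-and-move step can also be scripted. This is only a sketch: place_task_data is a made-up helper, and the source file names you pass in depend on what your download is actually called.

```python
import shutil
from pathlib import Path


def place_task_data(src_train, src_dev, data_dir="electra-small", task="cmrc2018"):
    """Copy the downloaded files to finetuning_data/<task>/ as train.json and dev.json."""
    target = Path(data_dir) / "finetuning_data" / task
    target.mkdir(parents=True, exist_ok=True)
    # Copy rather than move, so the original downloads stay untouched.
    shutil.copyfile(src_train, target / "train.json")
    shutil.copyfile(src_dev, target / "dev.json")
    return target
```

For example, if the downloads were saved as cmrc2018_train.json and cmrc2018_dev.json (an assumption about the file names), you would run place_task_data("cmrc2018_train.json", "cmrc2018_dev.json") from inside ./Chinese-ELECTRA/.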

Another very important step: create a new file under ./Chinese-ELECTRA/ named params_cmrc2018.json with the following content:

{
    "task_names": ["cmrc2018"],
    "max_seq_length": 512,
    "vocab_size": 21128,
    "model_size": "small",
    "do_train": true,
    "do_eval": true,
    "write_test_outputs": true,
    "num_train_epochs": 2,
    "learning_rate": 3e-4,
    "train_batch_size": 32,
    "eval_batch_size": 32
}
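Hand-typed JSON is easy to break with a stray comma, so one option is to generate the file from Python. A minimal sketch with the same values as above; write_params is a hypothetical helper name.

```python
import json
from pathlib import Path

# Same settings as the params_cmrc2018.json shown above.
PARAMS = {
    "task_names": ["cmrc2018"],
    "max_seq_length": 512,
    "vocab_size": 21128,
    "model_size": "small",
    "do_train": True,
    "do_eval": True,
    "write_test_outputs": True,
    "num_train_epochs": 2,
    "learning_rate": 3e-4,
    "train_batch_size": 32,
    "eval_batch_size": 32,
}


def write_params(path="params_cmrc2018.json"):
    """Write PARAMS as JSON and read it back to confirm that it parses."""
    target = Path(path)
    target.write_text(json.dumps(PARAMS, indent=4) + "\n", encoding="utf-8")
    return json.loads(target.read_text(encoding="utf-8"))
```

Note that json.dumps writes the learning rate as 0.0003 instead of 3e-4; both are the same number in JSON.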

Step 3: Start fine-tuning the model

Go back to the ./Chinese-ELECTRA/ directory and run

python run_finetuning.py \
    --data-dir electra-small \
    --model-name electra-small \
    --hparams params_cmrc2018.json
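Since most of my trouble came from files being in the wrong place, a quick check before launching the run can save some time. A rough sketch, assuming the folder layout from Step 1 and Step 2; missing_inputs is my own helper, not part of the repo.

```python
from pathlib import Path


def missing_inputs(data_dir="electra-small", model="electra-small", task="cmrc2018"):
    """Return the expected input paths that do not exist yet (empty list means ready)."""
    root = Path(data_dir)
    model_dir = root / "models" / model
    task_dir = root / "finetuning_data" / task
    expected = [
        model_dir / "vocab.txt",
        model_dir / "checkpoint",
        task_dir / "train.json",
        task_dir / "dev.json",
    ]
    missing = [str(path) for path in expected if not path.is_file()]
    # The checkpoint itself is split over several electra_model.* files.
    if not list(model_dir.glob("electra_model.*")):
        missing.append(str(model_dir / "electra_model.*"))
    return missing
```

Run missing_inputs() from ./Chinese-ELECTRA/ and only start fine-tuning when it returns an empty list.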

Step 4: Evaluate the results

After training finishes, cmrc2018_dev_preds.json is saved in ./Chinese-ELECTRA/electra-small/models/electra-small/results/cmrc2018_qa/

You can use the external evaluation script to get the final scores

python cmrc2018_drcd_evaluate.py \
    electra-small/finetuning_data/cmrc2018/dev.json \
    electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json

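But running it throws a lot of problems Q_Q, about five errors in total Q_Q

So you need to modify the cmrc2018_drcd_evaluate.py code first:

1. Change the import section:
   import importlib  # add this line
   importlib.reload(sys)  # originally reload(sys); change it to this
   # sys.setdefaultencoding('utf8')  # comment this line out
2. Delete every .decode('utf-8')
3. Install the punkt package:
   import nltk
   nltk.download('punkt')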
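The first two edits can also be applied automatically to the text of the script. This is only a sketch of the manual changes listed above: patch_for_python3 is a hypothetical helper, and it assumes reload(sys) and sys.setdefaultencoding(...) each sit on their own unindented line, as they do in the tracebacks further down.

```python
def patch_for_python3(source):
    """Apply the import fix and the .decode('utf-8') removal to the script text."""
    patched = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == "reload(sys)":
            # Python 3 has no builtin reload(); importlib.reload replaces it.
            patched.append("import importlib")
            patched.append("importlib.reload(sys)")
        elif stripped.startswith("sys.setdefaultencoding("):
            # This function does not exist in Python 3, so comment it out.
            patched.append("# " + line)
        else:
            # Python 3 str is already Unicode, so .decode('utf-8') is dropped.
            patched.append(line.replace(".decode('utf-8')", ""))
    return "\n".join(patched) + "\n"
```

To use it, read cmrc2018_drcd_evaluate.py into a string, pass it through patch_for_python3, and write the result back, ideally after keeping a copy of the original file. Running it twice gives the same result as running it once.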

Once the code is modified, the script runs:

{"AVERAGE": "71.967", "F1": "80.592", "EM": "63.343", "TOTAL": 3219, "SKIP": 0, "FILE": "electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json"}
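A small aside on reading this output: the numbers are consistent with AVERAGE being the mean of F1 and EM. That is my inference from the values, not something I checked in the script, so treat it as an assumption. The line below is the output above with the FILE field left out.

```python
import json

# Output line from the evaluation script (FILE field omitted for brevity).
line = '{"AVERAGE": "71.967", "F1": "80.592", "EM": "63.343", "TOTAL": 3219, "SKIP": 0}'

result = json.loads(line)
# The scores are stored as strings, so convert them before doing arithmetic.
f1 = float(result["F1"])
em = float(result["EM"])
mean = (f1 + em) / 2

print(f"mean of F1 and EM: {mean:.4f}")
print(f"reported AVERAGE:  {result['AVERAGE']}")
```

The two printed values agree to within rounding of the three-decimal scores.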

And that completes the Chinese-ELECTRA tutorial!

💪💪😀😃😄😁😆👍👍

Below are just the error messages; feel free to skip them XD

Error 1: NameError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 16, in <module>
    reload(sys)
NameError: name 'reload' is not defined

Error 2: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 100, in evaluate
    prediction 	= str(prediction_file[query_id]).decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
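The reason for this error is that a Python 3 str is already Unicode text and has no decode method; only bytes has one. Deleting .decode('utf-8') is what I did, but if a value might still arrive as bytes, a small helper like the one below is another option. to_text is a made-up name and is not part of the evaluation script.

```python
def to_text(value):
    """Return value as a Python 3 str, decoding UTF-8 bytes when necessary."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    # str has no .decode() in Python 3; it is already text.
    return str(value)
```

With this, str(x).decode('utf-8') could become to_text(x), and the code would accept both bytes and str input.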

Error 3: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
    f1 += calc_f1_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
    ans_segs = mixed_segmentation(ans, rm_punc=True)
  File "cmrc2018_drcd_evaluate.py", line 24, in mixed_segmentation
    in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'

Error 4: LookupError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
    f1 += calc_f1_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
    ans_segs = mixed_segmentation(ans, rm_punc=True)
  File "cmrc2018_drcd_evaluate.py", line 44, in mixed_segmentation
    ss = nltk.word_tokenize(temp_str)
  File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 834, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 952, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 673, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
    - ''
**********************************************************************
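As the message says, the fix is to download the punkt resource once. A small sketch that only downloads when the resource is missing; ensure_punkt is my own helper name, while nltk.data.find and nltk.download are the NLTK calls it relies on. Newer NLTK releases may ask for a differently named resource (for example punkt_tab), in which case use whatever name the LookupError message gives.

```python
def ensure_punkt():
    """Download the NLTK punkt tokenizer data only if it is not installed yet."""
    import nltk  # imported here so this file still loads without NLTK installed

    try:
        # Raises LookupError when the resource is missing from every search path.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
```

Calling ensure_punkt() once before running the evaluation script should be enough; later calls find the data on disk and do nothing.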

Error 5: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 102, in evaluate
    em += calc_em_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 128, in calc_em_score
    ans_ = remove_punctuation(ans)
  File "cmrc2018_drcd_evaluate.py", line 52, in remove_punctuation
    in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'