How to Use Chinese ELECTRA

泥膩泥膩
My boyfriend says I'm a homebody
Jul 7, 2020

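According to Wikipedia, Electra (厄勒克特拉) is a female figure in Greek mythology.

OK~ that's not the point XD

I recently ran into some problems while surveying Chinese ELECTRA, so I'm writing them down QQ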

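From the Chinese-ELECTRA GitHub repo I downloaded:

  1. Chinese-ELECTRA itself
  2. One of the models it provides: ELECTRA-small, Chinese

Then I followed its "usage" section and hit a few problems, because things didn't work quite the way it describes XD So here I'm recording the steps I followed, the problems I ran into, and how I solved them.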

Step 0: Download Chinese-ELECTRA

$ git clone https://github.com/ymcui/Chinese-ELECTRA.git

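Step 1: Download the pretrained model and unzip it

I downloaded ELECTRA-small, Chinese.

Then I put it under the Chinese-ELECTRA folder, in a folder I named electra-small.

Under ./Chinese-ELECTRA/electra-small/ there are three folders:

  1. chinese_electra_small_L-12_H-256_A-4/ I only use this one to remind myself which model this is; it has no other purpose XD
  2. finetuning_data/ You need to create another folder inside this one XD and put train.json and dev.json in it, for example: ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/
  3. models/ Create another folder inside this one to hold the pretrained model, and put all five unzipped files in it (electra_model.*, vocab.txt, checkpoint), for example: ./Chinese-ELECTRA/electra-small/models/electra-small/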
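If you'd rather not create those folders by hand, here is a minimal Python sketch that builds the same layout. The function name make_layout and its default arguments are my own, chosen to match the folder names used in this post; nothing here comes from the Chinese-ELECTRA repo.

```python
from pathlib import Path


def make_layout(repo_root, data_dir="electra-small",
                model_name="electra-small", task="cmrc2018"):
    """Create the two folders this walkthrough needs and return them."""
    base = Path(repo_root) / data_dir
    # train.json and dev.json go in here
    task_dir = base / "finetuning_data" / task
    # electra_model.*, vocab.txt and checkpoint go in here
    model_dir = base / "models" / model_name
    for folder in (task_dir, model_dir):
        folder.mkdir(parents=True, exist_ok=True)
    return task_dir, model_dir


# Example call (creates the folders if they are missing):
# make_layout("./Chinese-ELECTRA")
```

Because of exist_ok=True, running it twice is harmless.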

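Step 2: Prepare the task data

Download the CMRC 2018 training and dev sets, then rename them to train.json and dev.json.

Put both files in ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/

(This differs a bit from the path written on GitHub, but it runs as long as the files are at the path above.)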
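The rename-and-move step can also be scripted. This is only a sketch: place_cmrc_files is a hypothetical helper, and the source file names in the example call are an assumption about what the downloaded files are called, so swap in whatever names you actually have.

```python
import shutil
from pathlib import Path


def place_cmrc_files(train_src, dev_src, task_dir):
    """Copy the downloaded files into task_dir as train.json and dev.json."""
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(train_src, task_dir / "train.json")
    shutil.copyfile(dev_src, task_dir / "dev.json")
    # Return what is now in the folder so it is easy to eyeball
    return sorted(p.name for p in task_dir.iterdir())


# Example call (the two source names are assumptions):
# place_cmrc_files("cmrc2018_train.json", "cmrc2018_dev.json",
#                  "./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018")
```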

One more very important step: create a new file under ./Chinese-ELECTRA/ → params_cmrc2018.json

{
"task_names": ["cmrc2018"],
"max_seq_length": 512,
"vocab_size": 21128,
"model_size": "small",
"do_train": true,
"do_eval": true,
"write_test_outputs": true,
"num_train_epochs": 2,
"learning_rate": 3e-4,
"train_batch_size": 32,
"eval_batch_size": 32
}
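A typo in this file (a missing comma, say) is easy to make, so one option is to write it from Python and read it straight back. This is a small sketch with the same values as above; write_params is my own helper name, not part of the repo.

```python
import json

params = {
    "task_names": ["cmrc2018"],
    "max_seq_length": 512,
    "vocab_size": 21128,
    "model_size": "small",
    "do_train": True,
    "do_eval": True,
    "write_test_outputs": True,
    "num_train_epochs": 2,
    "learning_rate": 3e-4,
    "train_batch_size": 32,
    "eval_batch_size": 32,
}


def write_params(path, params):
    """Write params as JSON, then reload it to confirm the file parses."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example call, run from ./Chinese-ELECTRA/:
# write_params("params_cmrc2018.json", params)
```

Python writes 3e-4 as 0.0003, which is the same number.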

Step 3: Start fine-tuning the model

Go back to the ./Chinese-ELECTRA/ directory and run

python run_finetuning.py \
--data-dir electra-small \
--model-name electra-small \
--hparams params_cmrc2018.json
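Since most of my trouble came from files sitting in the wrong place, a quick pre-flight check before launching can save a failed run. This is a sketch under the folder layout used in this post; missing_inputs is a hypothetical helper, and it only checks the files named in this walkthrough (it does not look at the electra_model.* files).

```python
from pathlib import Path


def missing_inputs(data_dir, model_name, task="cmrc2018"):
    """Return the expected input files that are not on disk yet."""
    base = Path(data_dir)
    expected = [
        base / "models" / model_name / "vocab.txt",
        base / "models" / model_name / "checkpoint",
        base / "finetuning_data" / task / "train.json",
        base / "finetuning_data" / task / "dev.json",
    ]
    return [str(p) for p in expected if not p.is_file()]


# Example call, run from ./Chinese-ELECTRA/:
# print(missing_inputs("electra-small", "electra-small"))
```

An empty list means everything this check knows about is in place.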

Step 4: Evaluate the results

When training finishes, cmrc2018_dev_preds.json is saved in ./Chinese-ELECTRA/electra-small/models/electra-small/results/cmrc2018_qa/

You can use the external evaluation script to get the final scores

python cmrc2018_drcd_evaluate.py \
electra-small/finetuning_data/cmrc2018/dev.json \
electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json

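But running it throws a bunch of problems Q_Q, about five errors in total Q_Q The script is written for Python 2, and I was running it with Python 3.6.

So cmrc2018_drcd_evaluate.py needs a few changes first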

  1. Change the import section
    import importlib # add this line
    importlib.reload(sys) # originally reload(sys); change it to this
    # sys.setdefaultencoding('utf8') # comment this line out
  2. Delete every .decode('utf-8')
  3. Download NLTK's punkt data
    import nltk
    nltk.download('punkt')
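Why does deleting .decode('utf-8') work? In Python 3 a str is already Unicode text, so only bytes objects have a decode method. Here is a minimal sketch of the idea; to_text and normalize are my own illustrative names, not functions from cmrc2018_drcd_evaluate.py.

```python
def to_text(value):
    """Return value as a Python 3 str, decoding only if it is really bytes."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def normalize(in_str):
    # Python 3 counterpart of: str(in_str).decode('utf-8').lower().strip()
    return to_text(in_str).lower().strip()


print(normalize("  ELECTRA 模型 "))
print(normalize("中文".encode("utf-8")))
```

If every input is already a str (which is the case after json.load), the plain str(in_str).lower().strip() you get from deleting .decode('utf-8') does the same thing.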

Once the code is fixed, it runs

{"AVERAGE": "71.967", "F1": "80.592", "EM": "63.343", "TOTAL": 3219, "SKIP": 0, "FILE": "electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json"}
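The numbers above are consistent with AVERAGE being the mean of F1 and EM. A small sketch to check that from the output line (I left out the FILE field to keep it short; mean_of_f1_and_em is just an illustrative helper):

```python
import json

result_line = ('{"AVERAGE": "71.967", "F1": "80.592", '
               '"EM": "63.343", "TOTAL": 3219, "SKIP": 0}')


def mean_of_f1_and_em(line):
    """Parse the evaluation output and average the two scores."""
    scores = json.loads(line)
    # The scores are printed as strings, so convert before doing arithmetic
    return (float(scores["F1"]) + float(scores["EM"])) / 2


mean = mean_of_f1_and_em(result_line)
reported = float(json.loads(result_line)["AVERAGE"])
# Compare with a tolerance, since the printed scores are rounded to 3 decimals
print(abs(mean - reported) < 0.001)
```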

And that wraps up the Chinese-ELECTRA tutorial

💪💪😀😃😄😁😆👍👍

Below are just the error messages, feel free to skip XD

Error 1: NameError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 16, in <module>
    reload(sys)
NameError: name 'reload' is not defined

Error 2: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 100, in evaluate
    prediction = str(prediction_file[query_id]).decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

Error 3: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
    f1 += calc_f1_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
    ans_segs = mixed_segmentation(ans, rm_punc=True)
  File "cmrc2018_drcd_evaluate.py", line 24, in mixed_segmentation
    in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'

Error 4: LookupError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
    f1 += calc_f1_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
    ans_segs = mixed_segmentation(ans, rm_punc=True)
  File "cmrc2018_drcd_evaluate.py", line 44, in mixed_segmentation
    ss = nltk.word_tokenize(temp_str)
  File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 834, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 952, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 673, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'
- ''
**********************************************************************

Error 5: AttributeError

Traceback (most recent call last):
  File "cmrc2018_drcd_evaluate.py", line 142, in <module>
    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
  File "cmrc2018_drcd_evaluate.py", line 102, in evaluate
    em += calc_em_score(answers, prediction)
  File "cmrc2018_drcd_evaluate.py", line 128, in calc_em_score
    ans_ = remove_punctuation(ans)
  File "cmrc2018_drcd_evaluate.py", line 52, in remove_punctuation
    in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'
