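I ran into a few problems recently while surveying Chinese ELECTRA, so I'm writing them down here QQ
From the Chinese-ELECTRA GitHub repository I downloaded:
- Chinese-ELECTRA
- one of the models it provides:
ELECTRA-small, Chinese
Then I followed its usage instructions and hit some problems, because things did not work quite the way they are written there XD. Below are the steps I followed, the problems I ran into, and how I fixed them.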
Step 0: Download Chinese-ELECTRA
$ git clone https://github.com/ymcui/Chinese-ELECTRA.git
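Step 1: Download and unpack the pretrained model
I downloaded ELECTRA-small, Chinese.
I put it under the Chinese-ELECTRA folder, in a folder that I named electra-small.
Under ./Chinese-ELECTRA/electra-small/ there are three folders:
- chinese_electra_small_L-12_H-256_A-4/ : I only use this to remember the name of the model; it has no other purpose XD
- finetuning_data/ : create a subfolder for the task here and put train.json and dev.json in it, for example ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/
- models/ : create a subfolder here for the pretrained model and put all five unpacked files in it (the electra_model.* files, vocab.txt, and checkpoint), for example ./Chinese-ELECTRA/electra-small/models/electra-small/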
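The folder layout above can also be created from Python. This is only a minimal sketch: `make_layout` is a hypothetical helper of mine, not part of Chinese-ELECTRA, and the default folder names are simply the ones I chose in this post.

```python
from pathlib import Path


def make_layout(repo_root, data_dir="electra-small",
                model_name="electra-small", task="cmrc2018"):
    """Create the two folders this post fills by hand (hypothetical helper)."""
    base = Path(repo_root) / data_dir
    # train.json and dev.json go here
    task_dir = base / "finetuning_data" / task
    # electra_model.*, vocab.txt and checkpoint go here
    model_dir = base / "models" / model_name
    for folder in (task_dir, model_dir):
        folder.mkdir(parents=True, exist_ok=True)
    return task_dir, model_dir
```

Calling `make_layout("./Chinese-ELECTRA")` would create both folders and return their paths; running it again is harmless because existing folders are left alone.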
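Step 2: Prepare the task data
Download the CMRC 2018 training and validation sets and rename them to train.json and dev.json.
Put both files in ./Chinese-ELECTRA/electra-small/finetuning_data/cmrc2018/
(This is not quite the path given on GitHub, but with the path above it runs.)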
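If you would rather script the rename-and-copy step, something like the following should work. It is a sketch under assumptions: `place_cmrc_files` is a hypothetical helper, and the source file names in the usage note are placeholders for whatever your downloaded files are actually called.

```python
import shutil
from pathlib import Path


def place_cmrc_files(train_src, dev_src, task_dir):
    """Copy the downloaded files into task_dir as train.json and dev.json."""
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    train_dst = task_dir / "train.json"
    dev_dst = task_dir / "dev.json"
    # copyfile keeps the originals in place, so the download is not lost
    shutil.copyfile(train_src, train_dst)
    shutil.copyfile(dev_src, dev_dst)
    return train_dst, dev_dst
```

For example, `place_cmrc_files("my_train_download.json", "my_dev_download.json", "electra-small/finetuning_data/cmrc2018")`, where both input names are made up for illustration.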
There is one more very important step: create a new file, params_cmrc2018.json, under ./Chinese-ELECTRA/ with the following content:
{
"task_names": ["cmrc2018"],
"max_seq_length": 512,
"vocab_size": 21128,
"model_size": "small",
"do_train": true,
"do_eval": true,
"write_test_outputs": true,
"num_train_epochs": 2,
"learning_rate": 3e-4,
"train_batch_size": 32,
"eval_batch_size": 32
}
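Writing the file by hand is fine, but generating it from Python avoids JSON typos such as a trailing comma. A minimal sketch, assuming the same values as above; `write_params` is a hypothetical helper name, not something the repository provides.

```python
import json
from pathlib import Path

# Same values as the params_cmrc2018.json shown in this post
PARAMS = {
    "task_names": ["cmrc2018"],
    "max_seq_length": 512,
    "vocab_size": 21128,
    "model_size": "small",
    "do_train": True,
    "do_eval": True,
    "write_test_outputs": True,
    "num_train_epochs": 2,
    "learning_rate": 3e-4,
    "train_batch_size": 32,
    "eval_batch_size": 32,
}


def write_params(path="params_cmrc2018.json", params=PARAMS):
    """Serialize the hyperparameters as JSON and return what was written."""
    text = json.dumps(params, indent=2)
    Path(path).write_text(text + "\n", encoding="utf-8")
    # Parse it back so a malformed file would fail here rather than mid-run
    return json.loads(text)
```

Note that `json.dumps` writes the learning rate as `0.0003`, which is the same number as `3e-4`.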
Step 3: Start fine-tuning the model
Go back to the ./Chinese-ELECTRA/ directory and run:
python run_finetuning.py \
--data-dir electra-small \
--model-name electra-small \
--hparams params_cmrc2018.json
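Since most of my trouble came from files sitting in the wrong place, a quick check before launching can save a failed run. This is a sketch only: `missing_inputs` is a hypothetical helper, and its list of required files comes from the layout described in this post, not from the ELECTRA source code.

```python
from pathlib import Path


def missing_inputs(repo_root, data_dir="electra-small",
                   model_name="electra-small", task="cmrc2018"):
    """Return the expected input paths that do not exist yet."""
    base = Path(repo_root) / data_dir
    task_dir = base / "finetuning_data" / task
    model_dir = base / "models" / model_name
    required = [
        task_dir / "train.json",
        task_dir / "dev.json",
        model_dir / "vocab.txt",
        model_dir / "checkpoint",
    ]
    missing = [str(p) for p in required if not p.is_file()]
    # The weights are spread over several electra_model.* files
    has_weights = model_dir.is_dir() and any(model_dir.glob("electra_model.*"))
    if not has_weights:
        missing.append(str(model_dir / "electra_model.*"))
    return missing
```

An empty list from `missing_inputs(".")`, run inside ./Chinese-ELECTRA/, means everything this post expects is in place.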
Step 4: Evaluate the results
After training finishes, cmrc2018_dev_preds.json
is saved in ./Chinese-ELECTRA/electra-small/models/electra-small/results/cmrc2018_qa/
An external evaluation script can then be used to get the final scores:
python cmrc2018_drcd_evaluate.py \
electra-small/finetuning_data/cmrc2018/dev.json \
electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json
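But running it raises a whole series of problems Q_Q, about five errors in total Q_Q
The script was written for Python 2 and I was running it with Python 3, so cmrc2018_drcd_evaluate.py has to be patched first:
- Change the import section:
import importlib  # add this line
importlib.reload(sys)  # originally reload(sys); change it to this
# sys.setdefaultencoding('utf8')  # comment this line out
- Delete every occurrence of .decode('utf-8')
- Install the punkt package:
import nltk
nltk.download('punkt')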
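Deleting `.decode('utf-8')` works because a Python 3 `str` is already Unicode text. If you prefer one helper that tolerates both text and raw bytes, here is a minimal sketch; `to_text` and `normalize` are hypothetical names of mine and do not exist in cmrc2018_drcd_evaluate.py.

```python
def to_text(value):
    """Return value as a Python 3 str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def normalize(value):
    """Lowercase and strip, mirroring what the script does after decoding."""
    return to_text(value).lower().strip()
```

With this, a line such as `str(in_str).decode('utf-8').lower().strip()` could become `normalize(in_str)`, but simply deleting the `.decode` calls as described above is enough.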
Once the code is patched, the script runs:
{"AVERAGE": "71.967", "F1": "80.592", "EM": "63.343", "TOTAL": 3219, "SKIP": 0, "FILE": "electra-small/models/electra-small/results/cmrc2018_qa/cmrc2018_dev_preds.json"}
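As a sanity check on that output, AVERAGE looks like the plain mean of F1 and EM. That is my reading of the numbers, not something I verified line by line in the script. The snippet below recomputes it from the reported values (the FILE field is left out for brevity).

```python
import json

# The result line printed above, without the FILE field
RESULT_LINE = ('{"AVERAGE": "71.967", "F1": "80.592", "EM": "63.343", '
               '"TOTAL": 3219, "SKIP": 0}')


def mean_of_f1_em(line):
    """Recompute the average of F1 and EM from one result line."""
    result = json.loads(line)
    # The scores are stored as strings, so convert them first
    return (float(result["F1"]) + float(result["EM"])) / 2
```

(80.592 + 63.343) / 2 = 71.9675, which agrees with the reported 71.967 up to rounding of the printed scores.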
And that completes the Chinese-ELECTRA tutorial
💪💪😀😃😄😁😆👍👍
Everything below is just the error messages, so feel free to skip it XD
Error 1: NameError
Traceback (most recent call last):
File "cmrc2018_drcd_evaluate.py", line 16, in <module>
reload(sys)
NameError: name 'reload' is not defined
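This first error is the giveaway that the script targets Python 2. A small check, using only the standard library, shows what changed in Python 3:

```python
import builtins
import sys

# Python 2 had a built-in reload(); Python 3 moved it to importlib.reload()
print(hasattr(builtins, "reload"))          # False on Python 3

# sys.setdefaultencoding() was removed as well ...
print(hasattr(sys, "setdefaultencoding"))   # False on Python 3

# ... because the default string encoding is already UTF-8
print(sys.getdefaultencoding())             # utf-8
```

So on Python 3 the reload line and the setdefaultencoding line are not needed at all; switching to `importlib.reload(sys)` just keeps the edit small.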
Error 2: AttributeError
Traceback (most recent call last):
File "cmrc2018_drcd_evaluate.py", line 142, in <module>
F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
File "cmrc2018_drcd_evaluate.py", line 100, in evaluate
prediction = str(prediction_file[query_id]).decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
Error 3: AttributeError
Traceback (most recent call last):
File "cmrc2018_drcd_evaluate.py", line 142, in <module>
F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
f1 += calc_f1_score(answers, prediction)
File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
ans_segs = mixed_segmentation(ans, rm_punc=True)
File "cmrc2018_drcd_evaluate.py", line 24, in mixed_segmentation
in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'
Error 4: LookupError
Traceback (most recent call last):
File "cmrc2018_drcd_evaluate.py", line 142, in <module>
F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
File "cmrc2018_drcd_evaluate.py", line 101, in evaluate
f1 += calc_f1_score(answers, prediction)
File "cmrc2018_drcd_evaluate.py", line 112, in calc_f1_score
ans_segs = mixed_segmentation(ans, rm_punc=True)
File "cmrc2018_drcd_evaluate.py", line 44, in mixed_segmentation
ss = nltk.word_tokenize(temp_str)
File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 834, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 952, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python3.6/dist-packages/nltk/data.py", line 673, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'
- ''
**********************************************************************
Error 5: AttributeError
Traceback (most recent call last):
File "cmrc2018_drcd_evaluate.py", line 142, in <module>
F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
File "cmrc2018_drcd_evaluate.py", line 102, in evaluate
em += calc_em_score(answers, prediction)
File "cmrc2018_drcd_evaluate.py", line 128, in calc_em_score
ans_ = remove_punctuation(ans)
File "cmrc2018_drcd_evaluate.py", line 52, in remove_punctuation
in_str = str(in_str).decode('utf-8').lower().strip()
AttributeError: 'str' object has no attribute 'decode'