Working Around the 25 MB Limit of OpenAI's Whisper Transcription: A File-Splitting Example with PHP, Laravel, and ffmpeg

Published in Nyle Engineering Blog, Aug 6, 2023


Speech transcription through the OpenAI API is now used in many applications. This article focuses on the 25 MB limit of Whisper transcription and explains in detail how to split files using PHP, Laravel, ffmpeg, and PHP-FFMpeg.

About the OpenAI API

OpenAI API (openai.com): We're releasing an API for accessing new AI models developed by OpenAI.

The OpenAI API offers a wide range of AI-powered services.

Whisper transcription

Introducing Whisper (openai.com): We've trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on…

Whisper is a speech recognition model provided by OpenAI. Converting audio data into text is useful across a wide range of fields.

File splitting

OpenAI Platform (platform.openai.com): Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

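Under the current specification (as of August 2023), the API caps the size of a file that can be processed in a single request. Files over 25 MB (a little more than 10 minutes for an audio file of typical quality) have to be split and processed in pieces.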
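To make the "a little more than 10 minutes" figure concrete, here is a rough sketch of the arithmetic. The helper names are hypothetical, the limit is assumed to mean 25 × 1024 × 1024 bytes, and the calculation assumes a constant bitrate and ignores container overhead, so treat the result as an estimate rather than a guarantee.

```php
<?php

declare(strict_types=1);

// Assumption: "25 MB" means 25 * 1024 * 1024 bytes; adjust if the API counts differently.
const WHISPER_MAX_BYTES = 25 * 1024 * 1024;

/**
 * Hypothetical helper: longest chunk, in whole seconds, that stays under the
 * size limit for a constant-bitrate stream (container overhead is ignored).
 */
function maxChunkSeconds(int $bitrateKbps, int $maxBytes = WHISPER_MAX_BYTES): int
{
    if ($bitrateKbps <= 0) {
        throw new InvalidArgumentException('Bitrate must be positive.');
    }
    $bytesPerSecond = $bitrateKbps * 1000 / 8;

    return (int) floor($maxBytes / $bytesPerSecond);
}

/** Hypothetical helper: tells whether a file on disk exceeds the limit. */
function needsSplitting(string $path, int $maxBytes = WHISPER_MAX_BYTES): bool
{
    $size = filesize($path);
    if ($size === false) {
        throw new RuntimeException("Cannot read size of {$path}");
    }

    return $size > $maxBytes;
}

echo maxChunkSeconds(320), PHP_EOL; // 655 seconds, just under 11 minutes
echo maxChunkSeconds(128), PHP_EOL; // 1638 seconds, roughly 27 minutes
```

At 320 kbps the limit is reached after about 11 minutes, which is in line with the estimate above; lower bitrates leave considerably more room per chunk.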

Implementation with PHP and Laravel

An implementation using PHP and Laravel is shown below.

PHP-FFMpeg

https://github.com/PHP-FFMpeg/PHP-FFMpeg

ffmpeg is a tool for converting video and audio. It makes splitting a file straightforward.

PHP-FFMpeg is a library for driving ffmpeg from PHP. It makes the coding simpler still.
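For comparison, ffmpeg can also split a file by itself through its segment muxer, without PHP-FFMpeg. This is not the approach this article takes; the sketch below only shows what such a command could look like when built from PHP. The function name and file names are hypothetical, and the quoting assumes a Unix-like shell (on Windows, `escapeshellarg` rewrites `%` characters, which would break the output pattern).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds an ffmpeg command line that cuts $input into
 * fixed-length pieces with the segment muxer, copying streams without re-encoding.
 */
function buildSegmentCommand(string $input, int $segmentSeconds, string $outputPattern): string
{
    if ($segmentSeconds <= 0) {
        throw new InvalidArgumentException('Segment length must be positive.');
    }

    return sprintf(
        'ffmpeg -i %s -f segment -segment_time %d -c copy %s',
        escapeshellarg($input),
        $segmentSeconds,
        escapeshellarg($outputPattern)
    );
}

// Only prints the command; hand it to exec() or a process library to actually run it.
echo buildSegmentCommand('interview.mp3', 600, 'interview-%03d.mp3'), PHP_EOL;
```

Because the streams are copied rather than re-encoded, the cut points fall on frame boundaries, so segment lengths may differ slightly from the requested value.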

Installation

$ apt-get install -y ffmpeg
$ composer require php-ffmpeg/php-ffmpeg

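Sample code

The sample code in this section creates a chunk every 2 minutes for video and every 10 minutes for audio, then processes each chunk in turn.

Accuracy can be tuned by passing the previous chunk's result, or words you expect to appear, as the prompt.

For example, with a word that has several homophones, such as 「異動」 (idō, a personnel transfer), leaving the prompt empty can produce 「移動」 (idō, movement) instead, whereas including 「異動」 in the prompt yields 「異動」. (The model generally picks the right rendering from context, so this particular problem is rare, but proper nouns vary a fair amount, so it is worth supplying them.)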
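As a small illustration of the prompt idea, separate from the article's sample, the sketch below combines a fixed glossary of expected words with the tail of the previous chunk's transcript. The function name and the 200-character cap are hypothetical; the cap is only there because a long prompt is of limited use (OpenAI's speech-to-text guide says the model considers only the final part of the prompt, 224 tokens at the time of writing).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: joins expected words with the end of the previous
 * transcript. $maxTailChars is an illustrative cap, not a value from the API.
 */
function buildPrompt(array $glossary, string $previousText, int $maxTailChars = 200): string
{
    // Keep only the last $maxTailChars characters (multibyte-safe).
    $tail = $maxTailChars > 0 ? mb_substr($previousText, -$maxTailChars) : '';

    // Drop empty parts so the prompt never starts or ends with a stray space.
    $parts = array_filter(
        [implode(', ', $glossary), $tail],
        fn (string $part): bool => $part !== ''
    );

    return implode(' ', $parts);
}

echo buildPrompt(['異動', 'Nyle'], ''), PHP_EOL;        // 異動, Nyle
echo buildPrompt(['異動'], 'abcdef', 3), PHP_EOL;       // 異動 def
```

Keeping the glossary at the front and the recent text at the end means every chunk sees the expected spellings as well as the immediately preceding context.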

<?php

namespace App\Services;

use App\Contracts\Transcribable;
use App\Enums\JobStatus;
use FFMpeg\Coordinate\TimeCode;
use FFMpeg\FFMpeg;
use FFMpeg\Format\Audio\Mp3;
use FFMpeg\Format\Video\X264;
use Illuminate\Support\Facades\Storage;
use OpenAI;

class AudioTranscriptionService
{
    public static function transcribeAudio(Transcribable $model, string $filePath, string $initialText): string
    {
        $model->updateJobStatus(JobStatus::Processing);
        $fileName = basename($filePath);
        $isVideo = pathinfo($fileName, PATHINFO_EXTENSION) === 'mp4';
        // Initialize FFMpeg
        $ffmpeg = FFMpeg::create();
        // Get the length of the source file in seconds
        $duration = (float) $ffmpeg->open($filePath)->getFormat()->get('duration');
        // Create the OpenAI client
        $client = OpenAI::client(config('services.openai.key'));

        // Split the file into chunks and send each one to the transcription API
        $transcript = '';
        $previousText = $initialText;
        $chunkDuration = $isVideo ? 120 : 600; // 2-minute chunks for mp4, 10-minute chunks otherwise
        for ($i = 0; $i < $duration; $i += $chunkDuration) {
            // Re-open the source for every chunk so that clip filters from
            // earlier iterations do not pile up on the same media object
            $media = $ffmpeg->open($filePath);
            $media
                ->filters()
                ->clip(TimeCode::fromSeconds($i), TimeCode::fromSeconds($chunkDuration));
            $format = $isVideo ? new X264() : new Mp3();
            $chunkFileName = $fileName . '-' . $i . ($isVideo ? '.mp4' : '.mp3');
            $chunkFilePath = storage_path('app/' . $chunkFileName);
            // Save the chunk
            $media->save($format, $chunkFilePath);
            // Send the chunk to the transcription API
            $file = fopen($chunkFilePath, 'r');
            $response = $client->audio()->transcribe([
                'model' => 'whisper-1',
                'file' => $file,
                'prompt' => $previousText,
            ]);
            if (is_resource($file)) {
                fclose($file);
            }
            // The next prompt is the initial text plus the chunk just transcribed
            $previousText = $initialText . ' ' . $response->text;
            $transcript .= $response->text;
            $model->updateTranscript($transcript); // Save progress so far
            // Assumes the default disk is rooted at storage/app
            Storage::delete($chunkFileName);
        }
        $model->updateJobStatus(JobStatus::Completed);

        return $transcript;
    }
}
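The loop in the sample steps through the file in fixed increments. Pulled out on its own, that calculation might look like the sketch below, which is not part of the service above. The function name is hypothetical, and unlike the sample it shortens the final chunk to the remaining length; ffmpeg stops at the end of the input anyway, so the trimming is optional.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns [start, length] pairs in seconds that cover
 * $duration, with the final chunk shortened to whatever remains.
 */
function chunkOffsets(float $duration, int $chunkSeconds): array
{
    if ($chunkSeconds <= 0) {
        throw new InvalidArgumentException('Chunk length must be positive.');
    }

    $chunks = [];
    for ($start = 0; $start < $duration; $start += $chunkSeconds) {
        $chunks[] = [$start, min((float) $chunkSeconds, $duration - $start)];
    }

    return $chunks;
}

// A 25-minute file in 10-minute chunks: starts at 0, 600, and 1200 seconds,
// with lengths of 600, 600, and 300 seconds.
print_r(chunkOffsets(1500.0, 600));
```

Separating the arithmetic from the ffmpeg and API calls makes the boundaries easy to unit-test without touching any media files.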

Conclusion

This article covered how to handle files that exceed the size limit of the transcription feature available through the OpenAI API.