Efficient searching through PDFs with MySQL and Laravel

smknstd
code16
Published in
4 min readJul 6, 2023
Photo by James on Unsplash

Search functionality is often crucial in a modern web application. When a website provides documents with extensive text, such as PDF files, it is often necessary to allow users to list and search within the content of these documents. While specialized tools like Elasticsearch, MeiliSearch, or Typesense may seem enticing, it’s important to consider their significant impact. By incorporating one of these tools into your stack, the architecture of your application becomes more complex. Not only do you need to understand how to use such a tool, but you also have to install, configure, maintain, ensure its security, and monitor its operation.

However, specialized search tools also bring undeniable advantages. Here are some examples of features that a modern search engine can offer:

  • Typo tolerance (also known as Fuzzy search)
  • Partial word search
  • Result highlighting and snippet previews
  • Autocomplete
  • Custom ranking
  • Search synonyms
  • Stemming
  • Text classification
  • and more.

When deciding whether to use searching tool, keep in mind that this type of searching has many subtleties unique to itself, know both the benefits and disadvantages of utilizing each of them and choose wisely.

MySQL can do it

If you can settle for a basic search feature without anything too sophisticated then there is an interesting simpler approach available. It is straightforward to set up and offers appealing advantages. Let’s dig into it.

First, you need to extract the text from the PDF files and insert it into your database. MySQL can then index this content, enabling you to search within it. To assist with this, there is a PHP package based on the popular tool “pdftotext” that allows easy extraction of text from a PDF document.

composer require spatie/pdf-to-text

It is very simple to use. Since this operation can be resource-intensive, it is advisable to perform it asynchronously in a background job. Note that MySQL’s “text” column type has a maximum length of 65,535 characters (~10000 words), so your PDF content should not be too large.

use App\Models\PdfDocument;
use Illuminate\Support\Facades\Storage;
use Illuminate\Support\Str;
use Spatie\PdfToText\Pdf;

class StorePdfDocumentAsText implements ShouldQueue
{
public function handle()
{
PdfDocument::create([
'title' => 'My great document about something',
'content' => Str::limit(
Pdf::getText(
Storage::disk('s3')->path('my_file.pdf'),
),
60000,
),
]);
}
}

While using a “like” clause to search within the content of files is possible, it may not be suitable for large or numerous documents in terms of performance. Additionally, using a “like” clause limits searches to exact matches. The purpose of this article is to demonstrate that MySQL has a built-in feature called “Natural Language Full-Text Searches”.

It can be considered a middle ground between a “like” clause and a dedicated external tool like Elasticsearch. MySQL has provided this feature for a long time, but it is only since the beginning of 2022 that Laravel allows native usage of it with Eloquent. This allows more advanced searches using a relatively simple syntax. For example, you can:

  • Search for a set of multiple words using everyday language rather than keywords
  • Search within multiple different columns (e.g., document title, file name, and, of course, its content)
  • Order the results by relevance
  • Good performance

To set it up, you need to create a specific index on the desired columns using a migration:

class CreateFullTextIndex extends Migration
{
public function up()
{
Schema::table('pdf_documents', function (Blueprint $table) {
$table->fullText(['title', 'content'])->language('english');
});
}
}

Then, configure it in the model. As it uses attributes syntax you’ll need at least PHP8.

use Laravel\Scout\Searchable;
use Laravel\Scout\Attributes\SearchUsingFullText;

class PdfDocument extends Model
{
use Searchable;

#[SearchUsingFullText(["title", "content"])]
public function toSearchableArray()
{
return [
"title" => $this->title,
"content" => $this->content,
];
}
}

That’s it! As soon as the migration has been run, you can immediately use the search functionality on your model using the traditional scout syntax. No action is required on your part to synchronize the index; MySQL handles it automatically, ensuring it is always up to date.

Here’s an example of how you can populate a search result page based on the content of your PDF files :

$results = PdfDocument::search('my search terms')
->where('user_id', $user->id) //additionnal clauses
->paginate(20);

The query would look something like:

select *
from users
where (match (title, content) against ('my search terms' in natural language mode))
and user_id = 29;

MySQL does calculate a score for each result and uses it for ordering. The score is only used internally by MySQL, but it is represented in this way:

+----+-------------------------------------+-----------------+
| id | title | score |
+----+-------------------------------------+-----------------+
| 4 | My first document about flowers | 1.5219271183014 |
| 6 | Another document about flowers | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

That’s it for today. Happy searching y’all !

--

--