Efficiently Handling Large Data Retrievals in PHP: Solving Memory Limit Issues

Abdelrahman Emam
4 min read · Aug 2, 2024


Retrieving and processing large datasets from a data source can present significant challenges: when the volume of data is large enough, handling it efficiently becomes crucial to maintaining the application’s performance.

The Issue

I was tasked with implementing a new feature that required retrieving a massive amount of data from Elasticsearch. The dataset was large enough that loading it into memory in one go pushed the PHP script past its memory limit, leading to fatal errors and application crashes.
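To make the failure mode concrete, here is a hypothetical sketch of the naive approach (the method name getAllResults is a placeholder, not the actual code): a single search call returns everything, and every hit accumulates in one array.

// Hypothetical naive approach: load every hit into a single array.
public function getAllResults(array $data): array
{
    $params = $this->GetElasticSearchParams($data);
    $params['size'] = 10000; // capped by index.max_result_window, and every hit lands in memory

    $response = $this->client->search($params);

    $results = [];
    foreach ($response['hits']['hits'] as $hit) {
        $results[] = $hit; // the array grows with the dataset...
    }

    return $results; // ...until PHP aborts with an "Allowed memory size ... exhausted" fatal error
}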

Initial Solution: Using the Scroll API

To address the issue of exceeding PHP’s memory limit, the first solution that came to mind was to use Elasticsearch’s Scroll API. The Scroll API is designed to handle large datasets by retrieving data in chunks, rather than loading the entire dataset into memory all at once. This approach allows for processing large volumes of data efficiently while staying within PHP’s memory constraints.

Limitations of the Scroll API

While the Scroll API efficiently handles the retrieval of large datasets by processing data in smaller chunks, it doesn’t completely resolve the issue of memory consumption. As these chunks are retrieved, they still need to be processed and stored temporarily in memory. If the application accumulates too much data at once or fails to release memory effectively, it can still lead to memory exhaustion issues.

Another potential but generally not recommended solution is to simply raise the PHP memory limit. While this might temporarily alleviate the issue, it doesn’t address the underlying problem of inefficient data handling and can lead to other issues such as increased resource consumption and potential instability in your application.
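For completeness, that quick fix is a one-liner; it is shown here only to point out what it doesn’t solve.

// Quick fix, not recommended: raise (or remove) the per-script memory cap.
ini_set('memory_limit', '2048M'); // or '-1' to disable the limit entirely

// The script may now survive, but it still holds the whole dataset in RAM,
// so resource usage grows with the data and the crash is merely postponed.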

The Solution

To address these ongoing memory-consumption challenges, the rest of this section walks through the techniques I combined to mitigate the impact of handling large datasets and keep memory usage under control.

Retrieve Data Using the Scroll API:

  • Use Elasticsearch’s Scroll API to fetch data in manageable chunks. This prevents the entire dataset from being loaded into memory at once, thus helping to avoid memory limit issues.
public function GetElasticSearchChunks(array $data): Generator
{
    $params = $this->GetElasticSearchParams($data);

    // Ask Elasticsearch to keep a scroll cursor open for one minute.
    $params['scroll'] = '1m';
    $response = $this->client->search($params);

    $scrollId = $response['_scroll_id'];
    $hits = $response['hits']['hits'];
    // ... (full implementation in the next step)
}

Map Data with Generators:

  • Employ PHP generators to process and map each chunk of data to the desired structure. Generators allow you to iterate over data without loading it all into memory, thus optimizing memory usage.
  • Generators use a lazy evaluation approach, meaning they yield one item at a time rather than loading all items into memory simultaneously. This keeps the memory footprint low and prevents memory exhaustion issues.
  • By processing data incrementally, generators reduce the overhead of handling large datasets. This can lead to faster data processing and response times, as you avoid the delays associated with large-scale data loading.
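Before wiring generators into the Elasticsearch client, here is a minimal, self-contained sketch (standalone, not part of the original code) of the difference lazy evaluation makes:

// Eager: builds the entire array in memory before returning it.
function eagerRange(int $limit): array
{
    $values = [];
    for ($i = 0; $i < $limit; $i++) {
        $values[] = $i;
    }
    return $values;
}

// Lazy: yields one value at a time; only the current value is in memory.
function lazyRange(int $limit): Generator
{
    for ($i = 0; $i < $limit; $i++) {
        yield $i;
    }
}

foreach (lazyRange(10000000) as $value) {
    // Memory stays flat here; the eager version would allocate
    // millions of array slots up front before the loop could even start.
}
echo memory_get_usage(true); // remains small with the generator

Applied to the Elasticsearch retrieval, the full generator-based implementation looks like this: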
public function GetElasticSearchChunks(array $data): Generator
{
    $params = $this->GetElasticSearchParams($data);

    // Keep the scroll context alive for one minute between requests.
    $params['scroll'] = '1m';
    $response = $this->client->search($params);

    $scrollId = $response['_scroll_id'];
    $hits = $response['hits']['hits'];

    while (count($hits) > 0) {
        // Yield one mapped document at a time instead of
        // accumulating the whole page in memory.
        foreach ($hits as $hit) {
            yield $this->getDataMapped($hit);
        }

        // Fetch the next page of results using the scroll cursor.
        $scrollParams = [
            'scroll_id' => $scrollId,
            'scroll' => '1m',
        ];

        $response = $this->client->scroll($scrollParams);
        $scrollId = $response['_scroll_id'];
        $hits = $response['hits']['hits'];
    }

    // Release the server-side scroll context once all pages are consumed.
    $this->client->clearScroll(['body' => ['scroll_id' => $scrollId]]);
}
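Consuming the generator is then an ordinary foreach; at no point does the full result set exist in memory. (The processing body below is a placeholder; exportRow is a hypothetical consumer.)

foreach ($this->GetElasticSearchChunks($filters) as $document) {
    // Each $document is one mapped hit; handle it and let it go out of scope.
    $this->exportRow($document);
}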

Use Streamed Responses:

  • Streamed responses are a technique for handling large datasets by sending data to the client in small, incremental chunks rather than all at once. This approach is particularly useful for reducing memory usage and improving performance when dealing with substantial amounts of data.
  • By sending data in chunks, you avoid holding the entire dataset in memory. This minimizes the memory footprint of your application and helps prevent memory exhaustion.
  • Streamed responses are well-suited for applications that need to scale, as they manage memory consumption and processing load more effectively compared to loading large datasets all at once.
try {
    $chunks = $this->GetElasticSearchChunks($filters);

    $callback = function () use ($chunks) {
        // Emit a valid JSON array incrementally: open it, stream
        // one encoded chunk at a time, then close it.
        echo '[';
        $first = true;
        foreach ($chunks as $chunk) {
            if (!$first) {
                echo ',';
            }
            echo json_encode($chunk);
            $first = false;

            // Push the chunk to the client right away instead of
            // letting it pile up in PHP's output buffer.
            if (ob_get_level() > 0) {
                ob_flush();
            }
            flush();
        }
        echo ']';
    };

    $headers = [
        "Content-Type" => "application/json",
        "Cache-Control" => "no-cache, no-store, must-revalidate",
        "Pragma" => "no-cache",
        "Expires" => "0",
    ];

    return response()->stream($callback, 200, $headers);
} catch (\Exception $e) {
    Log::error($e->getMessage());

    // Without a return here the controller would respond with nothing.
    return response()->json(['message' => 'Failed to retrieve data.'], 500);
}
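A variation worth considering (not part of the original implementation) is streaming newline-delimited JSON (NDJSON) instead of one large JSON array. Each line is a complete JSON document, so clients can parse rows as they arrive without waiting for the closing bracket:

$callback = function () use ($chunks) {
    foreach ($chunks as $chunk) {
        // One complete JSON document per line (NDJSON).
        echo json_encode($chunk) . "\n";

        if (ob_get_level() > 0) {
            ob_flush();
        }
        flush();
    }
};

return response()->stream($callback, 200, [
    'Content-Type' => 'application/x-ndjson',
]);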

Conclusion

By retrieving data with the Scroll API, mapping it through PHP generators, and returning it as a streamed response, you can efficiently handle large data retrievals in PHP, optimizing memory usage and improving application performance. This approach not only prevents memory exhaustion but also keeps the system scalable and responsive as datasets grow. Adopting these techniques will help you manage large datasets more effectively, leading to more robust and reliable applications.
