The Hidden Villain: Java File Operations in PostgreSQL CopyManager API

Shailesh C Jamloki
IBM Data Science in Practice
3 min read · Jun 13, 2024

Introduction

Hook
Imagine spending countless hours debugging a data corruption issue, only to find the problem where you least expected it. We faced this exact scenario when multibyte characters in our PostgreSQL database were corrupted during bulk loads using Java’s CopyManager API.

Background
Handling multibyte characters correctly is essential in global applications. Our UTF-8 encoded CSV files seemed perfect, yet data corruption persisted. Initially, we suspected issues with the pgjdbc driver and the PostgreSQL CopyManager API. However, the real culprit turned out to be our Java file handling method.

In this blog post, we uncover how improper file handling in Java caused the issue and share the solution that restored our data integrity.

The Problem

Issue Description

We encountered a perplexing issue while using the PostgreSQL CopyManager API to bulk load data into our database. Despite our CSV files being correctly encoded in UTF-8, the data inserted into the database appeared corrupted. Specifically, rows inserted using the COPY command showed corrupted multibyte characters, while rows inserted using JDBC appeared correctly.

Example of the issue:

Rows bulk-loaded with COPY arrive corrupted on the target:

select * from kschema.tgt;
 pkcol | str1
-------+------
    33 | å
    34 | å

Rows inserted through JDBC on the source arrive intact on the target:

select * from kschema.tgt;
 pkcol | str1
-------+------
    43 | å

Initial Investigations

Our initial investigation confirmed that the CSV files were valid UTF-8. We then scrutinized the pgjdbc driver and the PostgreSQL CopyManager API, suspecting that they might not handle multibyte characters properly: even though the input file was UTF-8, the data loaded through the CopyManager API kept coming out corrupted.

Example code snippet:

// Bulk-load the CSV through the COPY protocol; 'con' is the open JDBC
// connection and 'in' is the Reader over the CSV file
String COPY_CMD = "COPY kschema.src FROM STDIN (FORMAT csv, delimiter '|')";
CopyManager cm = new CopyManager((BaseConnection) con);
long numrowsinserted = cm.copyIn(COPY_CMD, in);

However, further investigation showed that while the rows inserted using JDBC were correct, the rows inserted using the COPY command had corrupted multibyte characters. This discrepancy indicated that the issue might lie elsewhere. After thorough debugging, we realized that the problem was rooted in how the Java program read the UTF-8 encoded files, not in the PostgreSQL tools themselves.
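
For contrast, the JDBC path that produced correct rows looked roughly like the sketch below. The statement and column names are hypothetical, reconstructed from the query output above, and `con` is the open JDBC connection. Because the value travels as a Java String and the driver encodes it itself, no file decoding step is involved, which is why this path was unaffected.

// Hypothetical JDBC insert used for comparison; the value is already a
// Java String, so no file-decoding step can corrupt it
String sql = "INSERT INTO kschema.src (pkcol, str1) VALUES (?, ?)";
try (PreparedStatement ps = con.prepareStatement(sql)) {
    ps.setInt(1, 43);
    ps.setString(2, "å");
    ps.executeUpdate();
}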

Unmasking the Hidden Villain

Discovery

After spending considerable time investigating potential issues with the pgjdbc driver and the PostgreSQL CopyManager API, we decided to take a closer look at our Java code, specifically how it handled file operations. It was during this deep dive that we uncovered the true culprit: the way our Java program read the UTF-8 encoded CSV files.

Technical Explanation

The issue stemmed from this line of code:

Reader in = new BufferedReader(new FileReader("./fail.csv"));

This line does not handle UTF-8 content reliably. `FileReader` decodes the file using the platform's default charset, and its classic constructors offer no way to specify a different one (a charset parameter was only added in Java 11). On a system whose default charset is a single-byte encoding such as windows-1252 or ISO-8859-1, each byte of a multibyte UTF-8 sequence is decoded as a separate character, so characters like å are mangled before they ever reach the CopyManager API.
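
The effect is easy to reproduce in isolation. The short sketch below (a hypothetical MojibakeDemo class; ISO-8859-1 stands in for whatever single-byte default charset the platform uses) shows how the two UTF-8 bytes of å become two unrelated characters when decoded with the wrong charset:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "å" is a single character but two bytes (0xC3 0xA5) in UTF-8
        byte[] utf8 = "å".getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes with a single-byte charset, as FileReader does
        // when the platform default is not UTF-8, yields two wrong characters
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã¥
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // å
    }
}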

To correctly read UTF-8 encoded files, we needed to explicitly specify the encoding. The proper approach involves using `InputStreamReader` with the specified charset:

// Explicitly decode the file as UTF-8 before handing it to CopyManager
Reader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("./fail.csv"), "UTF-8"));

By informing the reader that the file content is UTF-8, the program can correctly interpret and handle multibyte characters, preserving the data integrity.
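
Putting the pieces together, a corrected load path looks roughly like the following sketch. The table, file name, and COPY options come from the snippets above; the connection URL, credentials, and class name are placeholders.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class Utf8CopyLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             // The fix: decode the CSV file explicitly as UTF-8
             Reader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("./fail.csv"), StandardCharsets.UTF_8))) {

            String copyCmd = "COPY kschema.src FROM STDIN (FORMAT csv, delimiter '|')";
            CopyManager cm = new CopyManager((BaseConnection) con);
            long rows = cm.copyIn(copyCmd, in);
            System.out.println("Rows inserted: " + rows);
        }
    }
}

Using StandardCharsets.UTF_8 instead of the string "UTF-8" avoids the checked UnsupportedEncodingException and rules out any typo in the charset name.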

Process Flow

[Process flow diagram]

Proof of Correctness

To verify the solution, we tested the corrected file reading approach with various multibyte characters. The results showed that the data was correctly loaded into PostgreSQL without any corruption. This confirmed that the issue was entirely related to the improper handling of file reading in Java, and not a problem with the pgjdbc driver or the PostgreSQL CopyManager API.
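
One way to make such a check concrete (a sketch, not the exact test we ran; the table and columns follow the earlier examples, and `con` is the open JDBC connection) is to compare byte and character counts on the target column: for correctly stored UTF-8, multibyte values report more bytes than characters, and the strings read back through JDBC match the source values.

// Hypothetical spot check using java.sql.Statement/ResultSet: correctly
// stored multibyte values have octet_length(str1) > char_length(str1)
String check = "SELECT pkcol, str1, octet_length(str1) AS bytes, "
             + "char_length(str1) AS chars FROM kschema.tgt ORDER BY pkcol";
try (Statement st = con.createStatement();
     ResultSet rs = st.executeQuery(check)) {
    while (rs.next()) {
        System.out.printf("%d | %s | bytes=%d chars=%d%n",
                rs.getInt("pkcol"), rs.getString("str1"),
                rs.getInt("bytes"), rs.getInt("chars"));
    }
}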

By making this simple adjustment, we ensured that our data load process handled multibyte characters correctly, eliminating the data corruption issue and maintaining the integrity of our database.
