Step-by-step Guide to Converting TMX Files to Excel using Python

Said Sürücü
3 min read · May 24, 2023


In my previous posts, I’ve walked you through the process of converting between various file types, such as RTF, MQXLIFF, and Excel, to streamline the localization workflow. Now, we’re going to take it a step further by focusing on TMX (Translation Memory eXchange) files. This XML-based format is widely used to interchange translation memory data between tools and is a vital aspect of Computer-Assisted Translation (CAT). But before we unpack the Python code that performs this task, let’s consider why this conversion is beneficial.

Benefits of Converting TMX to Excel

Flexibility and Accessibility: Excel’s user-friendly interface and powerful functionality make data manipulation a breeze. Not everyone is comfortable working with XML files directly, and Excel provides a more accessible platform.

Easy Data Analysis and Editing: Excel allows for quick data analysis and bulk editing. You can easily sort, filter, find & replace, and perform other operations on your data, which can be a lifesaver when dealing with large TMX files.

Sharing and Collaboration: Excel files are widely used and easy to share, which makes collaboration smoother. Excel also offers features such as comments and Track Changes, which are useful during collaborative translation projects.

The Code

import sys
import os
from lxml import etree
import pandas as pd

Here, we import the modules the script needs. From the standard library, sys gives access to the command-line arguments and os provides the path helpers used to build file names. lxml handles XML parsing, which is how we read the TMX file. pandas holds the extracted segments in a table and writes them to Excel; note that writing .xlsx files through pandas also requires an Excel engine such as openpyxl to be installed.

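input_file = os.path.basename(sys.argv[1])

We fetch the input file name from the first command-line argument. os.path.basename returns the final component of a path, which in our case is the file name. Because any directory part is discarded, the script expects the TMX file to be in the current working directory.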
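If the script may be run against files in other folders, or without any argument at all, a small helper can make that step more forgiving. The following is only a sketch: the function name resolve_input and the script name tmx_to_excel.py are hypothetical and not part of the original script.

```python
import os


def resolve_input(argv):
    """Return (full_path, base_name) for the TMX file named in argv.

    argv is expected to look like sys.argv: [script_name, tmx_path].
    """
    if len(argv) < 2:
        # Exit with a usage hint instead of an IndexError traceback.
        raise SystemExit("usage: python tmx_to_excel.py <file.tmx>")
    full_path = argv[1]
    # Keep the full path for reading and the base name for naming the output.
    return full_path, os.path.basename(full_path)


full_path, base_name = resolve_input(["tmx_to_excel.py", "memories/project.tmx"])
print(full_path)  # memories/project.tmx
print(base_name)  # project.tmx
```

In a real run you would pass sys.argv instead of the hard-coded list, parse full_path, and use base_name only to build the output file name.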

df = pd.DataFrame(columns=["Source", "Target"], dtype="string")

We create a pandas DataFrame, which is a two-dimensional, size-mutable, heterogeneous tabular data structure. We define two columns, “Source” and “Target”, both with a data type of string.

xml_tree = etree.parse(input_file)

We use lxml.etree.parse to parse the XML content of the input TMX file. This creates an XML tree that we can then navigate.
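Parsing fails with an lxml XMLSyntaxError when the file is not well-formed XML. As a rough sketch of how that could be reported more readably, the hypothetical load_tmx helper below wraps etree.parse; the sample document is a trimmed-down TMX written to a temporary file purely for illustration, and its header is far shorter than a real one.

```python
import os
import tempfile

from lxml import etree


def load_tmx(path):
    """Parse a TMX file and return its root element, or raise a readable error."""
    try:
        return etree.parse(path).getroot()
    except etree.XMLSyntaxError as exc:
        raise ValueError(f"{path} is not well-formed XML: {exc}") from exc


# Write a tiny, simplified TMX document to a temporary file to exercise the loader.
sample = (
    '<tmx version="1.4"><header srclang="en-US"/>'
    "<body><tu>"
    '<tuv xml:lang="en-US"><seg>Save</seg></tuv>'
    '<tuv xml:lang="tr-TR"><seg>Kaydet</seg></tuv>'
    "</tu></body></tmx>"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    tmx_path = os.path.join(tmp_dir, "sample.tmx")
    with open(tmx_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    root = load_tmx(tmx_path)

print(root.tag)                            # tmx
print(root.find("header").get("srclang"))  # en-US
```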

trans_units = xml_tree.findall(".//tu")

Within the XML tree, we find every translation unit (tu) element. A tu groups the same segment in different languages, one tuv element per language; this script assumes a bilingual file in which each tu holds exactly two of them, a source and a target.
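Before relying on position, it can be worth checking that every unit really has two variants, since indexing a missing one raises an IndexError. Here is a minimal sketch on an invented two-unit sample, the second of which has no target:

```python
from lxml import etree

# Simplified sample: the second unit deliberately lacks a target variant.
sample = """<tmx version="1.4">
  <header srclang="en-US"/>
  <body>
    <tu><tuv xml:lang="en-US"><seg>Open</seg></tuv><tuv xml:lang="tr-TR"><seg>Aç</seg></tuv></tu>
    <tu><tuv xml:lang="en-US"><seg>Close</seg></tuv></tu>
  </body>
</tmx>"""

root = etree.fromstring(sample.encode("utf-8"))
trans_units = root.findall(".//tu")

# Positional indexing ([0] and [1]) only works when a unit has at least two variants.
incomplete = [i for i, tu in enumerate(trans_units) if len(tu.findall("tuv")) < 2]

print(len(trans_units))  # 2
print(incomplete)        # [1]
```

Units listed in incomplete could be skipped or logged, depending on how strict the conversion needs to be.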

source_texts = []
target_texts = []
for trans_unit in trans_units:
    # The first <tuv> is treated as the source, the second as the target.
    tuvs = trans_unit.findall(".//tuv")
    source = tuvs[0].findall(".//seg")[0]
    target = tuvs[1].findall(".//seg")[0]

    # itertext() also yields text nested inside inline tags.
    source_text = ''.join(source.itertext())
    target_text = ''.join(target.itertext())

    source_texts.append(source_text)
    target_texts.append(target_text)

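We create empty lists for the source and target texts. For each tu, we take the first tuv (translation unit variant) as the source and the second as the target, then locate the seg element inside each. itertext() yields every piece of text within the segment, including text nested in inline tags, and we join those pieces into a single string before appending it to the corresponding list.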
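The loop above relies on the source variant always coming first. If a tool writes the variants in a different order, the columns end up swapped for those units. One alternative, sketched here rather than taken from the original script, is to pick variants by their xml:lang attribute. The helper names and the language codes are assumptions made for illustration.

```python
from lxml import etree

# lxml exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segment_text(tuv):
    """Return the text of a variant's <seg>, including text inside inline tags."""
    seg = tuv.find("seg")
    return "" if seg is None else "".join(seg.itertext())


def extract_pairs(root, source_lang, target_lang):
    """Collect (source, target) pairs by language code rather than by position."""
    src_key, tgt_key = source_lang.lower(), target_lang.lower()
    pairs = []
    for tu in root.findall(".//tu"):
        # Map lower-cased language code -> segment text for this unit.
        by_lang = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute, used by older TMX versions.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            by_lang[lang.lower()] = segment_text(tuv)
        # Units missing either language are skipped.
        if src_key in by_lang and tgt_key in by_lang:
            pairs.append((by_lang[src_key], by_lang[tgt_key]))
    return pairs


# The target variant is deliberately listed first in this sample.
sample = """<tmx version="1.4"><header srclang="en-US"/><body>
<tu>
  <tuv xml:lang="tr-TR"><seg>Dosyayı <bpt i="1">&lt;b&gt;</bpt>kaydet<ept i="1">&lt;/b&gt;</ept></seg></tuv>
  <tuv xml:lang="en-US"><seg>Save the <bpt i="1">&lt;b&gt;</bpt>file<ept i="1">&lt;/b&gt;</ept></seg></tuv>
</tu>
</body></tmx>"""

pairs = extract_pairs(etree.fromstring(sample.encode("utf-8")), "en-US", "tr-TR")
print(pairs)  # [('Save the <b>file</b>', 'Dosyayı <b>kaydet</b>')]
```

The output also shows what itertext() does with inline tags: the content of bpt and ept elements is kept as part of the text, so formatting codes end up in the Excel cells alongside the translatable words.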

df["Source"] = source_texts
df["Target"] = target_texts

We assign the source and target text lists to the “Source” and “Target” columns of our DataFrame, respectively.
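The same table can also be built in one step from the two lists, which avoids creating an empty DataFrame first. The lists below are invented stand-ins for the ones filled by the loop, and dropping duplicate pairs is an optional extra, not something the original script does.

```python
import pandas as pd

# Hypothetical data standing in for the lists built by the loop.
source_texts = ["Open", "Close", "Open"]
target_texts = ["Aç", "Kapat", "Aç"]

df = pd.DataFrame({"Source": source_texts, "Target": target_texts}, dtype="string")

# Optional: translation memories often repeat identical pairs.
deduped = df.drop_duplicates().reset_index(drop=True)

print(df.shape)       # (3, 2)
print(deduped.shape)  # (2, 2)
```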

df.to_excel("TMX_" + os.path.splitext(input_file)[0] + ".xlsx", index=False)

Finally, we write the DataFrame to an Excel file using the to_excel() method. The output name is the original file name with its extension removed, "TMX_" prepended, and ".xlsx" appended; index=False keeps the DataFrame's row index out of the sheet.
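To see how the output name is derived, the naming rule can be pulled into a small function and tried on a few inputs. The function name output_name is hypothetical and used here only for illustration.

```python
import os


def output_name(input_path):
    """Build the output name: 'TMX_' + base name without extension + '.xlsx'."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return "TMX_" + stem + ".xlsx"


print(output_name("project.tmx"))           # TMX_project.xlsx
print(output_name("memories/project.tmx"))  # TMX_project.xlsx
print(output_name("release.v2.tmx"))        # TMX_release.v2.xlsx
```

Only the last extension is removed, so dots elsewhere in the file name are preserved.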

Stay tuned for our next Medium article where we’ll continue exploring how Python can boost your localization workflow. As always, at 23 Studios, we’re committed to providing you with cutting-edge solutions in the field of localization and Computer-Assisted Translation. Feel free to reach out for any inquiries or tailored services. Happy localizing!

This article is part of a series on localization and Computer-Assisted Translation tools. Don’t forget to check out my previous articles on converting memoQ’s RTF and MQXLIFF files to Excel and back.
