Extracting Tables From Plain Text Files Statistically With Numpy

Ming
Data Science with Ming
5 min read · Jan 22, 2018

No rules, no standards. — Mencius

Tables are not created equal. Some are born conforming to rigorous standards; some are not. The former are easily parsable by machines, while the latter are made for human eyes only.

Take this excerpt from an annual report of the Imperial Bancorp as an example:

This example perfectly demonstrates some major difficulties in parsing textual tables:

  • No column indicators.
  • The words “Imperial” align so perfectly in the middle of the table that one may confuse these entries of “Imperial Bank” as two separate columns.
  • Multi-line cells exist (e.g. Imperial Municipal Services Group, Inc. (1) (2)).
  • Non-tabular data is present (e.g. footnotes, descriptions and even XML tags).
  • There may be no spaces between cells sharing a row. E.g., “Imperial Global Trading Company, Inc. (1)Imperial Bank” should actually be split at “)I”.
  • Some columns are left-aligned, while some are right-aligned (e.g. the percentages).

Intuitively, when detecting the structure of a table, we should consider the distribution of whitespaces as our most significant feature.

(Note: other approaches exist, too; this blog post serves merely as a starter. For instance, Extracting Tabular Information From Text Files is a great read.)

The code below converts the 2D matrix of raw characters to boolean values. Knowing that some tables use the period sign (“.”) as a placeholder, I included in my code the case of a character being either “ ” or “.”:

arr = np.asarray(                   # use the result as a 2D array
    list(map(lambda string:         # mapped for each line in the content
        list(map(lambda x: x in (' ', '.'), string)),  # tell letters from whitespaces
        content))).astype(int)      # True/False to 1/0
arr

Output:

array([[0, 0, 0, ..., 1, 1, 1],
[0, 0, 0, ..., 1, 1, 1],
[0, 0, 0, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[0, 0, 0, ..., 1, 1, 1]])

Notice that not all whitespaces serve as delimiters. In English documents, unlike in CJK languages, whitespaces also exist between nearly every pair of adjacent words. (Interestingly, this used to be an annoyance in natural language processing.) Perhaps we should consider a whitespace to be a delimiter only when the character next to it is also one:
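The snippet building this refined mask appears to be missing here; the projection code further down refers to an array arr3, which presumably encodes exactly this neighbor test. A minimal sketch, assuming arr is the boolean whitespace mask from above (True where the character is “ ” or “.”) and arr3 is the name of the result:

```python
import numpy as np

# Toy whitespace mask: True marks ' ' or '.' characters.
arr = np.array([
    [False, False, True,  True, False, False],
    [False, False, False, True, True,  False],
])

# Keep a whitespace only if the character to its right is also
# whitespace; pad the right edge with True so trailing blanks count.
right_neighbor = np.pad(arr[:, 1:], ((0, 0), (0, 1)), constant_values=True)
arr3 = arr & right_neighbor
```

Keeping arr3 boolean also means the ~arr3 negation in the projection code works as intended.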

Fantastic! Now let’s project the characters onto each axis — in plain English, this simply means counting the number of non-whitespace characters in each column and each row:

proj_x = (~arr3).sum(axis=0)
proj_y = (~arr3).sum(axis=1)

Along each axis, when the projection suddenly goes up or down, it should indicate the start or end of a table column. Let’s detect this “steep gradient” by calculating the consecutive-pairwise differences in the projections:

proj_x_diff_y = np.ediff1d(proj_x)
proj_x_diff_y = np.insert(proj_x_diff_y, 0, proj_x[0])
proj_x_diff_x = np.arange(len(proj_x))
proj_x_diff = np.stack([proj_x_diff_x, proj_x_diff_y]).transpose()

How steep is steep, though? After playing with the number for a bit, I’d say going over half of the range of the differences would be a reliable indicator of such events:

proj_x_diff_y_min = proj_x_diff_y.min()
proj_x_diff_y_max = proj_x_diff_y.max()
proj_x_diff_y_threshold = .5*(proj_x_diff_y_max-proj_x_diff_y_min)+proj_x_diff_y_min
column_positions_start = np.where( proj_x_diff_y > proj_x_diff_y_threshold )
column_positions_end = np.where( proj_x_diff_y < -proj_x_diff_y_threshold )

After applying the same inspection along the y-axis (to detect where the table is, rather than where individual columns are), we can plot the result:
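The y-axis version is not shown in the post; it would mirror the x-axis code above, something like the following sketch (the variable names and the toy projection are my guesses):

```python
import numpy as np

# Toy row projection: rows 2-4 carry the table's characters.
proj_y = np.array([0, 0, 10, 10, 10, 0, 0])

# Same consecutive-pairwise differences and half-range threshold
# as on the x axis, applied to the row projection.
proj_y_diff = np.insert(np.ediff1d(proj_y), 0, proj_y[0])
lo, hi = proj_y_diff.min(), proj_y_diff.max()
threshold = .5 * (hi - lo) + lo
row_positions_start = np.where(proj_y_diff > threshold)[0]
row_positions_end = np.where(proj_y_diff < -threshold)[0]
```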

Great! We have found sort of a grid.

In the x direction, the red lines are good enough as separators. The blue lines (where the projections decrease rapidly) are not as good indicators of columns as the red lines are. We will toss the blue lines and simply str.rstrip(' ') when extracting cell values.

The direction that really could use more work is the y direction. This steepest-projection method failed to detect the horizontal separator under the header of the table. We have to come up with more relevant features to look into. The repetitiveness of characters (ignoring whitespaces) might be a suitable option:

from collections import Counter
repetitiveness = np.asarray(
    [max(Counter(line.replace(' ', '')).values(), default=0) for line in content])
_ = plt.plot(repetitiveness)

The start and end rows of the table are defined as follows:

  • The start row is where the repetitiveness first reaches a peak, provided the most-repeated character on that line takes up more than half of the available spots. Otherwise, there are no start or end separator lines.
  • The end row is where, after the start row, the repetitiveness next reaches a peak. However, if the difference in repetitiveness between the start and end rows is too large (exceeding half of the range of all repetitiveness values in this document), then there is no such separator.
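Those two rules can be sketched roughly as below; the exact peak logic and the line width are my assumptions, not the post’s actual code:

```python
import numpy as np

# Toy repetitiveness profile: dashed separator lines at rows 2 and 6.
repetitiveness = np.array([3, 4, 40, 5, 6, 5, 38, 4])
width = 60  # number of character slots per line (assumed)

half_range = .5 * (repetitiveness.max() - repetitiveness.min())

# Start row: first line whose most-repeated character fills more
# than half of the available spots.
candidates = np.where(repetitiveness > width / 2)[0]
start_row = candidates[0] if len(candidates) else None

# End row: the next such peak, rejected if its repetitiveness is
# too far from the start row's (more than half the overall range).
end_row = None
if start_row is not None and len(candidates) > 1:
    nxt = candidates[1]
    if abs(repetitiveness[nxt] - repetitiveness[start_row]) <= half_range:
        end_row = nxt
```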

Not bad, eh?

In the case where the table does not start immediately at the first row, there may be a header. We assume that the header, as part of the table, is separated from the rest of the document by an empty line:
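That search might look like the following sketch; the sample content and start_row are illustrative, not from the post:

```python
# Walk upward from the table's first detected row until we hit a
# blank line (or the top of the document); the lines in between
# are treated as the header.
content = [
    'Some introductory text.',
    '',
    'Name            Location',
    '------------------------',
    'Imperial Bank   Inglewood',
]
start_row = 3  # index of the separator line under the header

header_start = start_row
while header_start > 0 and content[header_start - 1].strip():
    header_start -= 1
header_lines = content[header_start:start_row]
```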

Finally we can cut off the unwanted lines:
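The cut itself is just list slicing; the indices here are hypothetical:

```python
# Everything before table_start and from table_end onward is dropped.
content = ['<PAGE> 1', 'Name       Location', 'row one', 'row two', 'Footnote (1)']
table_start, table_end = 1, 4
table_lines = content[table_start:table_end]
```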

One advantage of having a header is that headers are often better indicators of column positions. We can shift a column’s guessed starting position to the left, all the way to where the header is entirely empty at that x coordinate:

Perfect! Now we can finally read out the data:
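Reading the cells then amounts to slicing each table line at the detected column starts and stripping trailing blanks; a sketch with made-up column positions:

```python
# Build a fixed-width line so the column offsets are unambiguous.
line = 'Imperial Bank'.ljust(26) + 'Inglewood, California'.ljust(24) + '100%'
col_starts = [0, 26, 50]

# Slice from each column start to the next; rstrip handles the
# right-aligned-vs-left-aligned issue noted earlier.
bounds = col_starts + [len(line)]
cells = [line[a:b].rstrip(' ') for a, b in zip(bounds, bounds[1:])]
```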

Done!

With generality in mind, the steps above were designed to transfer to other tables fairly well.
