The Case of The Disappearing Strings

Orson Adams
5 min read · Apr 9, 2018

This post is about debugging some Python code I wrote to parse court opinions, and finding that text would disappear when printed, seemingly arbitrarily. In one case the name of the judge did not appear in the printed output, and I got a bit frightened.

The Opinions

After a case is appealed and the judges reach a decision, they write that decision down. These documents are called opinions. I’m no scholar of law, but I think that definition suffices.

Anyway, I’m interested in understanding the language used in these opinions. In particular, I’m curious about whether and how the language changes with the political affiliations of the panel judges. So I pulled some opinions from the First Circuit Court of Appeals using the CourtListener API.

The data is provided as JSON, but the actual text of the opinions is (in some cases, like the one below) OCR’d text. Here is the running example for this post:

'\n\nOctober 12, 1995      [NOT FOR PUBLICATION]\r\n\n                  UNITED STATES COURT OF APPEALS\r\n                      FOR THE FIRST CIRCUIT\r\n\n                                           \r\n\nNo. 95-1448\r\n\n                          UNITED STATES,\r\n\n                            Appellee,\r\n\n                                v.\r\n\n                        REX FORBES OICKLE,\r\n\n                      Defendant - Appellant.\r\n\n                                           \r\n\n           APPEAL FROM THE UNITED STATES DISTRICT COURT\r\n\n                    FOR THE DISTRICT OF MAINE\r\n\n           [Hon. Morton A. Brody, U.S. District Judge]\r                                                               \r\n\n                                           \r\n\n                              Before\r\n\n                     Torruella, Chief Judge,\r                                                     \r\n\n                 Stahl and Lynch, Circuit Judges.\r                                                          \r\n\n                                           \r\n\n     Julio V. DeSanctis, with  whom Downeast Law Associates, P.A.\r                                                                           \r\nwas on brief for appellant.\r\n     F. Mark Terison, Assistant United States Attorney, with whom\r                              \r\nJay  P.  McCloskey,  United  States Attorney,  and  Elizabeth  C.\r                                                                           \r\nWoodcock,  Assistant United  States Attorney,  were on  brief for\r                  \r\nappellee.\r\n\n                                           \r\n\n                                           \r\r\n\n          Per Curiam.   
Upon full  review of the  record in  this\r                    Per Curiam.\r                              \r\n\ncase,  and after  hearing argument  from counsel,  we affirm  the\r\n\ndecision of the district court.\r\n\n          Affirmed.\r                    Affirmed.\r                            \r\r\n'

Pretty nasty bit. It is dated October 12, 1995 and was probably written on a typewriter. I would love it if someone could confirm that.

Let’s have a good look at this bad boy:

>>> print(nasty_string)
[Image: truncated printing of nasty_string]

That is missing some text now, isn’t it? What I expected was something like this:

[Image: non-truncated printing of nasty_string]

Much better. You can’t truncate the Chief Judge, that just won’t do. To some the reason for this might be obvious, but to me it looked a bit buggy.

Briefest History of The Carriage Return

Image is probably subject to copyright.

Back in the olden days, manual devices known as typewriters were used to print characters onto paper, much as our current civilization does with computers and printers.

To advance to a new line while typing on these artifacts (as depicted above), one would operate a carriage return lever that moved the typing element back to the left side of the paper. This operation was often followed by a line feed operation that advanced the paper to a new line.

Some ponderers might ask why these were not a single operation. After all, if you wanted to go to the leftmost point of the page, wouldn’t you also want to go to a new line? It turns out that there are at least two interesting reasons:

  • Moving the typing element back to the left of the page is much slower than the line feed operation. So an optimization would be to send out the return operation and do some other work in the meantime, like feeding a new line or multiple new lines! Double spacing, anybody?
  • A return operation could also be used to bold text by returning to a character and printing it a few times: tap A, \r, tap A → bold A.

Both operations were mapped to a single control (end-of-line-return-start-new-line) when electronic typewriters were introduced, and computers followed suit. So today we hit ‘Enter’ or ‘Return’ on our keyboards and get the combined operation.

However, the operations kept their independent meanings and representations on computers. That is, while computers merged the carriage return and newline operations into a single control (Enter/Return), the two still have different encodings:

decimal | octal | hex  | description
10      | 0012  | 0x0a | New Line \n
13      | 0015  | 0x0d | Carriage Return \r
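The values in the table can be checked from Python itself. This is only a small sketch; the helper name codes is made up for illustration.

```python
def codes(char):
    """Return the (decimal, octal, hex) representations of a single character."""
    code = ord(char)
    return code, f"{code:04o}", f"0x{code:02x}"


# codes() is a hypothetical helper, not part of any library
for name, char in [("New Line", "\n"), ("Carriage Return", "\r")]:
    print(name, repr(char), *codes(char))
```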

Additionally, what represents a newline varies by OS: ‘\r’, ‘\r\n’ and ‘\n’ for classic Mac OS, Windows and Unix respectively. I’m on Ubuntu 14.04 (a Unix-like system). So what does Python do when it sees these variations of the newline signal?

Python & Newline

PEP 278 introduced universal newline support to Python. So on f.read() and similar calls like f.readlines(), the variations mentioned above are converted to ‘\n’. I tested this with the string above: I wrote it to disk, confirmed that no conversion was done while writing, then read it back in and printed it. Voila!
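A rough sketch of that round trip, assuming Python 3. The sample string and file name are made up, and I pass newline="" on the write so that nothing is translated on any platform (that argument is my addition, not part of the experiment described above).

```python
import os
import tempfile

# Made-up sample with Windows, classic Mac and Unix line endings
sample = "one\r\ntwo\rthree\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # newline="" turns off translation on write, so sample lands on disk unchanged
    with open(path, "w", newline="") as f:
        f.write(sample)
    # Default text mode applies universal newlines on read
    with open(path) as f:
        converted = f.read()

print(repr(sample))
print(repr(converted))
```

After the read, every line ending in converted is a plain ‘\n’, which is why printing the re-read text looks fine.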

However, my case is a bit different. I’m reading this text from a key in a JSON file (from some unknown OS). So the ‘\r’ is not handled on read; print() passes it through and the terminal interprets it literally, that is, return to the start of the current line. What’s happening here is that any text after ‘\r’ begins at the leftmost point of the current line (just as in the typewriter days) and thus overwrites the existing text. Here is an illustration:

>>> print('orson\radams')
adams

Sweet, my names have equal lengths.
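To make the mechanism a little more concrete, here is a minimal sketch under a few assumptions. The key name plain_text and the helper visible are made up for illustration, and visible only approximates how a terminal treats a single line. Inside a JSON document a carriage return is stored as the two characters \ and r, so universal newline handling on the file never sees it. It only becomes a real ‘\r’ once the JSON is decoded.

```python
import json


def visible(line):
    """Approximate what a terminal shows for one line containing carriage returns."""
    cells = []
    for segment in line.split("\r"):
        # Each segment is typed from column 0, overwriting whatever is already there
        cells[:len(segment)] = segment
    return "".join(cells)


# Build a small JSON document; "plain_text" is a hypothetical key name
original = "Torruella, Chief Judge,\r" + " " * 30 + "\r\n"
raw_json = json.dumps({"plain_text": original})

print("\r" in raw_json)   # False: only the escaped form is in the file
text = json.loads(raw_json)["plain_text"]
print("\r" in text)       # True: decoding restores the real carriage return

print(repr(visible("orson\radams")))
# The run of spaces after the first \r wipes out the judge
print(repr(visible(text.split("\n")[0])))
```

When debugging, print(repr(text)) is a handy way to see the ‘\r’ characters instead of having the terminal act on them.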

This isn’t really a big deal, since the string itself is not modified; it’s just that printing (to debug my regex, for instance) was problematic. All goes as expected, of course, if I just replace the ‘\r’ with str.replace('\r', '\n')!
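One thing to keep in mind with that replace: a ‘\r\n’ pair becomes ‘\n\n’, so extra blank lines show up. That is harmless for debugging, but here is a slightly more careful variant as a sketch. The function name normalize_newlines is made up, and this is just one way to do it.

```python
def normalize_newlines(text):
    """Convert CRLF pairs and lone CRs to LF without adding blank lines."""
    # Handle the two-character sequence first so it does not become two newlines
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "Per Curiam.\r\nAffirmed.\r"

naive = sample.replace("\r", "\n")
normalized = normalize_newlines(sample)

print(repr(naive))
print(repr(normalized))
```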

So ends the journey for now. This was obviously not buggy behaviour; I just had incorrect expectations. Consider that fixed!

Talk Soon.

Tidbits

  • I used od -x filename to confirm that the ‘\r’s are still in the file after a write from Python (in their hex form: 0d).
  • You can specify what your newline character should be on read by setting the newline keyword on open. With open(filename, 'r', newline=''), the file is read without any conversions.
  • You can also avoid any conversion of newline characters by reading the file in binary mode (files are read in text mode by default, but we knew that, right?): open(filename, 'rb')
  • Also, let me know if you would have been okay with the title “The Case Of The Disappearing Judge”
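The two file-reading tidbits above can be sketched as follows, assuming Python 3. The sample line and file name are made up, and the hex check at the end is a rough stand-in for what od shows.

```python
import os
import tempfile

sample = "Stahl and Lynch, Circuit Judges.\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "opinion.txt")
    with open(path, "w", newline="") as f:   # write without any translation
        f.write(sample)
    with open(path, "r", newline="") as f:   # text mode, newlines left untouched
        untranslated = f.read()
    with open(path, "rb") as f:              # binary mode, raw bytes
        raw = f.read()

print(repr(untranslated))
print(raw[-2:].hex())  # the last two bytes in hex
```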

Orson Adams

I’m super curious about language and the tools we’re using to model it. Currently building NLP products @MoodysAnalytics.