Linguistic Analysis of the Indian Tax Case — I

Published in The Startup · 8 min read · Feb 27, 2019

The legal corpus of a country, its law delivered in written form, is a remarkable record of that country's progress over time. The writing styles of judges are perceptibly diverse and yet discernibly uniform, and that combination makes the corpus interesting for linguistic analysis. I work at a legal tech firm that aims to minimize subjectivity in the interpretation and application of law. I had the chance to analyze over 223k Income Tax cases in India. This data comprised every tax case litigated in India up to the time this article was written, with the obvious exceptions of certain inaccessible courts and of judgments that exist only as scanned images embedded in PDFs (looking at you, High Court of Karnataka).

Intro to law through judgment documents

Case files (hereafter referred to as judgments) contain three essential elements: a semi-structured Header, the Body, and a concluding Footer.
The Header of the judgment contains details such as the names of the contending parties, their attorneys, information about the parties (addresses, PAN, TAN, etc.), and other important case details (the assessment year for the case, the disposal date, and so on). There is no uniformity in the writing style, in the order of these elements, or even in whether all of them are present. A lack of uniformity at the ITAT level is understandable, since the tribunals are quasi-judicial institutions, but there is no reason for High Courts (HC) and the Supreme Court (SC) not to follow a set of rules when penning a judgment. What could have been a structured set is now a semi-structured one at best.
The Body of the judgment contains the proceedings of the case in an orderly fashion. More on reading a case as a layman.
The Footer contains names, dates of decisions, information about whom the copy is forwarded to, the digital signatures of the judges presiding over the case, and some other details about the case.
Sometimes judgments contain footnotes, images providing specific information, or data organized into tables and, very rarely, handwritten amendments to typed content.

Script

Case files are in PDF format. The language used is almost always English, with some exceptions. In a random sample of ten thousand cases, 195 (~2%) contained one or more Devanagari characters in the body of the judgment file. The following table shows a court-wise distribution of cases that contained text written in the Devanagari script:

The data was generated using a Postgres query which looked something like this:

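-- "judgments" is a placeholder table name (TABLE itself is a reserved word)
select s.id
from judgments s
-- POSIX regex match: the regex engine resolves the \uXXXX escapes
where cast(s.content as varchar) ~ '[\u0900-\u097F]'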

The Unicode range for Devanagari characters is 0900–097F. Source.
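The same check can be run outside the database once the text has been extracted. The following is a minimal sketch, assuming each judgment body is already available as a Python string; the function names are illustrative and not part of the original pipeline.

```python
import re

# Devanagari block, U+0900 to U+097F, the range quoted above.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def contains_devanagari(text):
    """Return True if the text has at least one Devanagari character."""
    return DEVANAGARI.search(text) is not None


def devanagari_share(texts):
    """Fraction of documents containing Devanagari (0.0 for an empty list)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(contains_devanagari(t) for t in texts) / len(texts)
```

Applied to a sample of judgment bodies, `devanagari_share` gives the kind of proportion reported above, provided the PDF text extraction preserved the original characters.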

Judgment Size

Cases seem to be increasing both in number and in size. The average judgment is 6.87 pages long. On average, a judgment has 2015 words in 69.27 sentences, grouped into 22.89 paragraphs.
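Per-judgment counts of this kind can be approximated with a few regular expressions. This is a rough sketch, not the method used for the figures above: it assumes paragraphs are separated by blank lines and uses a naive sentence splitter, so abbreviations such as "Ltd." or "vs." will inflate the sentence count.

```python
import re


def judgment_stats(text):
    """Rough word, sentence and paragraph counts for one judgment body."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: runs of word characters.
    words = re.findall(r"\b\w+\b", text)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }
```

Averaging these dictionaries over a corpus would give figures comparable in kind to the ones quoted, although a proper sentence tokenizer would be needed for legal text.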

This page length has varied significantly over the years. Here is a regression plot of order 2 (a quadratic fit) showing the average size of a judgment across assessment years. A plot of order 3 was a classic example of over-fitting, since the fitted curve seemed to go down after 2000–01.

Figure: The rising size of Income Tax judgment files.

A note on the number of cases

A threefold increase in the size of judgments over a span of 20 years cannot be overlooked. The late 2010s were important in Indian legal history. Digitization of almost all cases led to an enormous increase in the number of cases a layman could access without having to jump through the hoops of legal proceedings. Prior to this, cases could be broadly categorized as reported and unreported. In simple terms, the handful of legal journals that were practically considered official chose the cases they deemed important and published them in their editions. This meant that a large chunk of cases went unreported. Because reported cases were easier to access, judges preferred to refer to them over unreported ones and to use them as precedents when adjudicating. Even when judges did refer to unreported cases, it was difficult for the common man to track them down and understand the law at hand. This sudden increase in the number of freely available cases skews every legal graph around 2010.

Looking deeper into the cases themselves, three peculiar features surface:

Blatant Quoting

Judges and members of the court, when responding to an argument that cites a precedent, are supposed to summarize that precedent in the current judgment. This applies both to precedents that are relevant to the current case and to those that are not. However, there has been a spurt of cases that quote a precedent verbatim.

People Interactive (I) P. Ltd vs. Asst Cit 4(3), a case disposed of in 2013, suits our analysis perfectly. Adjudicated by Shri Shailendra Kumar Yadav and Shri Ramit Kochar, the judgment cites another case, Biocon Limited v. DCIT(LTU), in making its case. In fact, it quotes a staggering 10 pages from that case, word for word. Of the 30-page judgment, 10 pages are direct quotation! It is no wonder that People Interactive has been cited as a precedent exactly zero times in the entire legal universe to date. Direct quoting, as opposed to intelligent paraphrasing, is one of the causes of the increasing swell in judgments.

Seven Steps To Clearer Judgment Writing reads,

“When citing from a decided case, the passage of the judgment should be chosen carefully and frugally. Only so much of it as expresses the proposition in question should be quoted. Ideally, that may amount to no more than one or two sentences, rather than a paragraph, or several paragraphs.”
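Verbatim quoting of this kind can be estimated mechanically. The sketch below is one possible approach, not the analysis used for the page counts above: it breaks both texts into overlapping runs of `n` words (shingles) and reports what fraction of the citing judgment's shingles also occur in the precedent. The function names and the default of 8 words are assumptions chosen for illustration.

```python
def shingles(text, n=8):
    """Set of all overlapping n-word sequences in the text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def quoted_share(judgment, precedent, n=8):
    """Fraction of the judgment's n-word shingles also found in the precedent."""
    own = shingles(judgment, n)
    if not own:
        # Judgment shorter than n words: nothing to compare.
        return 0.0
    return len(own & shingles(precedent, n)) / len(own)
```

A value near one third for a pair like the one discussed above would be consistent with 10 quoted pages out of 30, though differences in whitespace and punctuation after PDF extraction will lower the measured overlap.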

Legal Fluff

Commissioner Of Income Tax vs. B. N. Bhattachargee & Anr. is a popular case, cited in more than 1800 other tax cases. Addressing an issue in the case, V. R. Krishna Iyer writes:

The vampirish vices of black money and colossal tax evasion, both together using money power to prevent action against white-collar offender, had been a terrible menace to the health and wealth of the nation.

Rajiv Shakdher, in the case of Shanti Bhushan vs. Commissioner Of Income Tax, writes:

No two individuals deal with matters of heart similarly; often confounded, as to how to deal with it which is why a famous lyricists expounds on this very peculiar quandary thus : Dil-E-Nadan Tujhe Hua Kya Hai Akhir Ess Dard Ke Dawa Kya Hai. (Here heart is personified. It is asked of it what ails it ? What is the remedy for the malady.)

These judgments definitely make for an interesting (bedtime) read.

Interpreting judgments in the way the judge intended is increasingly taxing. The BBC and several other sources have called out HC and SC judges for introducing unnecessary flair and fluff into legal judgments. While adding a smiley to a judgment may be up for debate, judges should take care to ensure that a judgment is simple enough for another judge, a lawyer or a layman to follow. The use of legalese and abstruse language should be restricted to cases of absolute necessity.

What did you say? Kya kaha aapne?

The case of Ajay Kumar vs. ITO, ITAT Agra, assumes one's ability to read and comprehend Hinglish:

In para 7 of the penalty order, it is stated as follows:
“Uprokt Varnit Tatthyon Se Spasht Hai Ki Nirdhariti Dwara Rs. 16,59,914/- Ki Aay Ke Sambandh Mein Galat Byorey Prastut Kiye Gaye Hain Evam Tatthyon Ko Chhipaya Gaya, Hai”.

The case goes on without explaining what this line translates to.

A lot of judgments contain Hinglish, a beautiful amalgamation of Hindi and English. What is palpably missing is the explanation or translation that should accompany it.

This is different from using the Devanagari script to write in a specific language (Marathi, Hindi, etc.), which has its own set of drawbacks. Cases containing Devanagari script become blind spots in the realm of researchable law, since they cannot be searched conventionally for a particular point of law. It is also difficult to identify a precedent laid down in such a case. This is, of course, barring the narrow band of people familiar with the language.

Inside the cases

I selected 100 random cases disposed of in 2018 to analyze the kind of language used. The Body of each judgment was cleaned to drop non-alphanumeric characters, and some other basic text preprocessing was done to exclude outliers. Here is a quick graph showing the various parts of speech that appeared in these cases.

Though adverbs (ADV) at 4.3% and adjectives (ADJ) at 3.7% emphasize certain points, they are avoidable in most cases.

Another interesting insight that surfaced was that the words undisclosed, unaccounted and unexplained were among the top 15 most frequent adjectives used. Together, their frequency of occurrence was 862! These necessarily talk about income sources and possibly fall under income escaping assessment, but that is a topic for another day.

A display of verbal dexterity is evident, with adverbs like ‘arithmetically’, ‘clandestinely’ and ‘felicitously’ occurring in 5 instances.
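Shares and frequency lists like these can be derived from tagged tokens with a couple of counters. The sketch below assumes the text has already been run through a part-of-speech tagger that yields `(word, tag)` pairs with tags such as `ADJ` and `ADV`; the tagger itself and the helper names are assumptions, since the article does not say which tool was used.

```python
from collections import Counter


def pos_shares(tagged):
    """Percentage share of each part-of-speech tag among (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * count / total for tag, count in counts.items()}


def top_words(tagged, tag, k=15):
    """The k most frequent lower-cased words carrying the given tag."""
    return Counter(w.lower() for w, t in tagged if t == tag).most_common(k)
```

Calling `top_words(tagged, "ADJ")` on the tagged corpus would produce a top-15 adjective list of the kind described above.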

A little bit of sophisticated querying can produce n-grams right at the query level:
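The query is not reproduced in this text. As a rough equivalent outside the database, n-grams can be counted in a few lines of Python. This is a sketch under simple assumptions (lower-casing, alphanumeric tokens only) and is not a translation of the original query.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count n-grams over lower-cased alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Zipping n staggered copies of the token list yields consecutive n-tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))
```

For example, `ngram_counts(body, 3).most_common(20)` lists the twenty most frequent trigrams in a judgment body.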

I did not want to search for beautiful English words manually, so I trained a word2vec model on this legal corpus to learn word vectors. Using the trained model, I was able to extract the words most closely related to undisclosed. A visual representation of the relevant words is as follows:

The rising size of documents and the manifold increase in the number of tax disputes raise the question: should judges and members of the court make a conscious effort to alter their writing styles in favor of legibility, cogency and ease of understanding?
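Once word vectors exist, related words are typically found by ranking the rest of the vocabulary by cosine similarity to the query word. The sketch below shows only that lookup step, assuming the trained vectors are available as a plain dictionary from word to list of floats; it does not train a model, and the helper names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, k=10):
    """The k words whose vectors are closest to the given word's vector."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Libraries that train word2vec models generally ship an equivalent lookup, so in practice this step would come from the library instead of being hand-written.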

End Notes:

Interim orders, records of proceedings and other non-final documents are excluded from this analysis since they contain trivial information.
The art of writing judgments is an interesting read.
Commissioner Of Income Tax vs. T. N. Aravinda Reddy: “A point of suffocating scholarship sometimes arrives in Court when one nostalgically remembers the escapist verse : ‘Where ignorance is bliss, ’Tis folly to be wise.’ Amen!”
All the cases mentioned in the article can be found here.
Prof. Volokh of UCLA Law School and Prof. Tanford of Indiana University’s Maurer School of Law shared an excellent 12-page guide to fixing common legal writing issues. More.