Making India’s Budgets Accessible

Hrishant Singhal
CivicDataLab
Published in
8 min readJan 14, 2023

In our last article, we elaborated on the role of fiscal transparency in a democracy and how governments can ensure it. Currently only four state governments, Assam, Odisha, Kerala, and Tamil Nadu release their budget documents in machine readable format which facilitates research. The budget documents released by other states are originally in machine readable format but converted into PDFs and then put in public domain.

Standardizing State Government Budgets

Assam and Odisha budgets are also released in different formats so comparing them is exceedingly difficult. As a part of our initiative Open Budgets India, we want to ensure intra-state (across years) and inter-state comparison of government finances. We have made a standard template for budgets, which we call the Open Budget Data Standard. Research would become easier if we could have all the budget documents in a single format.

I was hired as an intern at CivicDataLab to help in standardizing the government budgets. Initially, I learnt what a budget contains and how we would look to standardize it. Then I was assigned the Odisha Budget document, I felt lost, it was so confusing that I could not even understand it, let alone write a code to standardize it. This article goes through the process of standardizing the Odisha Budget data.

But before that, here is a brief description on the elements of a government budget.

Understanding the Government Budget

Consider our personal budget, there are various categories like groceries, luxuries, entertainment, travel etc., and then we have further subdivisions of each of the categories. Government budgets are also grouped into various categories, there are 5–6 levels of division, these divisions are called heads. The divisions help present the objectives and purposes of government expenditure in terms of functions, programs, and activities (Source:GAO_Classification).

For example, a number like 2202–01–112–0900–47003 might not mean anything to most of us, but it impacts a vast number of children’s lives in Odisha. It represents the expenditure on the LPG connection of the Mid-Day Meal Scheme. It is categorized under:

2202 — General Education (Major Head)

01- Elementary Education (Sub Major Head)

112- National Programme of mid-day meals in schools (Minor Head)

0900-Mid-Day Meals (Sub Minor Head)

47003-LPG connection for MDM Programme (Object Head)

Similarly, all transactions are recorded with such a coding method in our Public Finance System.

The budget has other variables too:

  1. Nature of Expenditure/Receipt — Recurring (Revenue) or Once (Capital)
  2. The Nodal Department which is spending/receiving the money.

The Odisha Budget

Here’s one of its pages. On its left, there are various kinds of heads, and it is not immediately obvious which line corresponds to which head.

My task was to put it in a wide format where each row represents a line of expenditure in the budget and the columns on its left indicate its various heads. The reader can refer to the Open Budget Data Standard here (hyperlink).

Though I panicked at first, I finally did figure out how to go about it. The code used for wrangling the data can be directly accessed from here. The code is in R, but only basic knowledge of coding is assumed in this article. Even if you are not familiar with coding but want to learn working with data, this article should help you understand the work that goes into cleaning data.

Cleaning and Standardizing the Odisha Budget

We take a subset of our budget data to understand all the problems that we face in it and then we proceed to solve each problem before finally converting it into the standardized format. Our initial data looks like this:

Figure 1

Note: The TOTAL rows are there for each major, minor, and sub-minor head, we have not taken complete entries from the major head 0038 and 0041 in this subset so those rows are not showing up here.

Observations from this raw data before we attempt to standardize it:

  1. Major Head and Sub-minor Heads are both 4-digit codes so there can be common codes. We will have to figure out when the 4-digit code refers to a sub-minor head and when it refers to a major head.
  2. Similarly, Minor Head and Object Head are both 3-digit codes.
  3. Sub-major head “00” has not been mentioned for “0037” and “0041” major heads. ‘0038’ has a sub-major head “01”.
  4. Object Head “000” has not been mentioned for multiple detail heads. Budget numbers have been given in front of detail heads only.

Before we move onto solving each of these problems, you can have a look at this image to understand the final output that we are hoping to achieve.

Solving Problem 1: Figuring out when the 4-digit code refers to a Major Head and a Sub-Minor Head.

When we observe the data closely and filter out the 4-digit codes, we find that there is a pattern. If the head descriptions are in block letters, then those heads are major heads.

Figure 3

We use this observation to write a code to add “sm” at the end of all the sub-minor heads.

#Add position column for each row.
raw_budget<-raw_budget %>% mutate(position=1:n())
#Separate out the rows which have a 4 digit head.
major_twice<-raw_budget %>% filter(grepl("^[0-9]{4}$",heads))
#Filter columns which have lower case letters and add sm (for subminor) at their end.
major_twice_lower<-major_twice %>% filter(grepl("[[:lower:]]",heads_in_english)) %>%
mutate(heads=paste(heads,"sm",sep=""))
#Combine the above with major heads.
raw_budget<-raw_budget %>% filter(!position %in% major_twice_lower$position) %>%
bind_rows(major_twice_lower) %>% arrange(position)

Our output looks like this:

Figure 4 (Total Rows have been removed to avoid confusion)

But what if the pattern breaks somewhere? The budget file released by the Odisha Government also has a summation of budget figures for all the major heads, so we were able to cross check our major head totals using that

Solving Problem 2: Figuring out when the 3-digit code refers to a Minor Head and an Object Head.

Minor Head descriptions are also in block letters so we can follow the same method, but we would not have a way of cross checking our work since a list of totals of minor heads is not available in a machine-readable format. So, we use another observation.

Minor Codes do not have any value in the budget columns (represented by the arrows). Since the object head is the lowest head; they will always have values in those columns as the rectangular boxes show.

Figure 5

We use this observation to write a code to add “ob” at the end of all the object heads.

#Taking out object codes.
object_code<-raw_budget %>% filter(grepl("^[0-9]{3}$",heads) &
!grepl("TOTAL",heads_in_english) &
!is.na(`actuals_last_audited_year`)) %>%
mutate(heads=paste(heads,"ob",sep=""))
#Merging with the main dataset.
raw_budget<-raw_budget %>% filter(!position %in% voucher_code$position) %>%
bind_rows(voucher_code) %>% arrange(position)

Our output at this stage looks like this:

Figure 6

Problems 3 and 4 arise because of the method we have used for standardizing the dataset. If you are curious about why we can’t proceed ahead without solving problems 3 and 4, then you may want to skip to the standardization part first and come back to this section later.

Solving Problem 3: Adding Sub Major Heads where they are not present.

We add “00” after all the major heads (4-digit codes which don’t end with sm).

#Identifying Major Heads.
raw_budget_major<-raw_budget %>% filter(grepl("^[0-9]{4}$",heads))
#Adding Sub-Major Heads at position + 0.5.
raw_budget_submajor<-raw_budget_major %>% mutate(heads="00",heads_in_oriya="Null",
heads_in_english="Null",position=position+0.5)
#Merging the main dataset with the sub-major dataset and arranging by position so that sub-major heads come after major heads.
raw_budget<-raw_budget %>% bind_rows(raw_budget_submajor) %>% arrange(position) %>% mutate(position=1:n())

Solving Problem 4: Adding Object Heads “000” wherever required and adding budget values in front of them.

Step I: Add the Object Head “000” after each detail ahead and copy paste its budget values from its respective detail head.

Code:

#Identifying Detail Heads.
raw_budget_containg_detail<-raw_budget %>% filter(grepl("^[0-9]{5}$",heads))
#Adding Object Heads at position + 0.5.
raw_budget_object<-raw_budget_containg_detail %>% mutate(heads="000ob",heads_in_oriya="Null",
heads_in_english="Null",position=position+0.5)
#Merging the main dataset with the object dataset.
raw_budget<-raw_budget %>% bind_rows(raw_budget_object) %>% arrange(position)

Output:

Figure 7

Step II: Remove values in front of detail heads.

raw_budget<-raw_budget %>% filter(!grepl("^[0-9]{5}$",heads)) %>% 
mutate(`actuals_last_audited_year`=NULL,
`budget_estimates_previous_financial_year`=NULL,
`revised_estimates_previous_financial_year`=NULL,
`budget_estimates_financial_year`=NULL)

Output after doing both the steps:

Figure 8

The Standardization Process

After we have done all the data cleaning, the last step is to change the format of the data. The gif below gives an idea of what we will be trying to do.

Step I: Identifying the major heads.

We write this code to identify which rows contain a major head and based on that we copy values from its right. Code:

#The first dataframe goes through every row for each major head 
abc<-data.frame(sapply(major_heads,grepl,raw_budget$heads))*1
#This sums up all the rows to give us the row no. in which a major head is present
def<-data.frame(presence=rowSums(abc))
#We mutate the value from the column on the right based on that
budget_major<-raw_budget %>% mutate(def) %>% select(presence,everything())
budget_major<-budget_major %>% mutate(presence=ifelse(presence==1,heads,NA))

Our output looks like this:

Figure 9

Step II: Extending the row of major heads till another major head is found. (Filling all NAs)

Code:

budget_major<-budget_major %>%
fill(presence,major_head_desc_en,major_head_desc_olang, .direction = "down") %>%
select(major_head_code_en=presence,major_head_desc_en,major_head_desc_olang,everything()) %>%
mutate(position = 1:n())

Output:

Figure 10

We follow a similar procedure for all heads. Find them and extend them till another head of the same kind has been found. Our output after doing it for all the heads looks like this.

Figure 11

We proceed to remove all rows which do not have numbers in the last four columns. Our final output is this.

Figure 12

A similar process can be followed for head descriptions.

Conclusion

Even though Odisha is one of the two states where data is released in machine readable formats, we still had to spend a lot of time and put in a lot of work to reach a stage where the data is comfortably readable and usable for research.

It took me more than 60 hours to standardize all the Odisha Budgets, I had the technical expertise and was paid well to do it. It would be difficult for economic researchers and journalists to go through such a process before they start learning from the budgets and writing about them.

We can save crores of rupees if we can have high quality research on state public finances, it would all start with the governments releasing their budgets in an accessible format.

--

--