Database Reverse Engineering, Part 2: Main Approaches
Read in English:
Database Reverse Engineering, Part 1: Introduction
Database Reverse Engineering, Part 2: Main Approaches
Database Reverse Engineering, Part 3: Code Reuse, Conclusion
Read in Russian:
Database Reverse Engineering, Part 1: Introduction
Database Reverse Engineering, Part 2: Main Approaches
Database Reverse Engineering, Part 3: Code Reuse, Conclusion
A brief introduction to the proprietary database research field was made in the first part. From now we move to a practical side of the subject. The most important things are always given in the beginning and what will be discussed below is the main information I wanted to share starting the series. The article describes my vision of database reverse engineering process, and when I say “should”, “need”, and “must”, this means an appeal to oneself in the past rather than calling for anything. Also, please note that words “cross reference”, “bind”, “tie”, “interconnection”, and “dependency” are used as synonyms.
We will research a database of Microcat Ford USA software which is one of those developed by car manufacturers for use in service centers. The purpose of such software is to display car parts, repair times, search for a car by VIN number and so on. They provide a lot of space to work for a reverse engineer since they are written for different architectures, in different programming languages, at different times, by different level programmers, and most importantly, are relevant and actively used.
One of database RE goals of such software is to gain access to an information provided by a program to a user. In other words, we need to work with data directly, bypassing a GUI. Reverse engineering would not be required at all if developers would choose mainstream DBMSs based on SQL rather than creating their own proprietary DBs, which is true for the most cases I have met. Those DBs are not standardized among vehicle software developers, completely differs between each other and have a complex internals.
Our task will be to understand how the DB keeps a vehicle information, how to obtain a part list for each car, and how to process part diagrams. I want to say immediately that the task was chosen to illustrate a database reverse engineering process and Microcat Ford USA was picked only because of a good set of real-life challenges. We will not be interested in data semantics so feel free to read if you do not understand cars.
The First Database Reverse Engineering Approach
Considering the fact that each program working with a DB contains a DBMS component or itself is a DBMS it is reasonable to treat an examined program as a DBMS. This useful assumption is the first main approach.
[2.1] Treat a program working with a DB as a DBMS.
The point is that if there is a database and a program interacting with it, then it is enough to reproduce actions the DBMS performs to understand DB internals. There is no need to reverse engineer a DB from scratch since there is a program that “knows” the DB structure. The usefulness of these statements will be substantiated in practice later.
We see two folders after installing the program: MCFNA and FNA_Data.
It is clear by file extensions that MCFNA is an executable files directory and FNA_Data is a data files directory. Note that there are three data file extensions only: dat, idx, and xcd. Do not be deceived, often extensions do not mean anything.
A primary analysis consists of four steps:
- a GUI overview;
- a data files assessment;
- a code files assessment;
- search for code reuse opportunities.
I will give a few screenshots in order you to visualize the program we are talking about.
Data Files Assessment
Quickly review each file in a hex editor to assess their structure (text / binary, open source / proprietary, etc). If you use Hiew, start with a text mode and switch to a hex mode if necessary. Below are screenshots of several files.
As you can see, despite the same extension the files are structured in completely different ways. The only standardized extension is xcd — these are zip archives, but we will not study them.
Code Files Assessment
Using DiE or Hiew or any other PE tool we find the executables are of NE format i.e. New Executable 16-bit Windows programs. Many of them have VBX file extension, and it is Visual Basic as can be guessed by a content. Most DLLs are written in this language also.
In addition, some DLLs and VBXs were compressed by a packer which cannot be identified by any PE / NE tool.
Googling revealed this is Shrinker, an ancient Visual Basic 3.0-compiled NE file packer. Finding the unpacker (NED 2.30) among dead links was a great success because I did not want to waste time on the hassle of static unpacking or NTVDM debugging (which I still had to do but for another reason).
We have assessed data and code we are dealing — it is a legacy. Let’s summarize the performed actions in the form of a main approach.
[2.2] Perform a primary analysis of data and code you are working with. Find the maximum information on the internet for the keywords you found during an assessment.
Search for Code Reuse Opportunities
In programming, it is a good practice to code reuse and, in my opinion this is just as true for reverse engineering. The difference is that programmers usually have a module interface documentation while we have to first find modules and functions and then reverse engineer their interface, but this is may be much easier than writing code from scratch. In one of my projects there was a modified open source SQL database with an expected long reverse engineering time. Fortunately, there was a library in the program with export procedures similar to stored procedures, so it was enough for me to reconstruct several classes, to pass objects to library functions and to receive data from the database in response.
We apply this technique to our task and look at exports in all libraries. We find a lot of interesting in almost each of them despite the unknown database terminology (CODESTR, DMC, NISSANSECNUM, TMCSECNUM, etc).
[2.3] Assess program modules for potential code reusability. Use this approach periodically as you progress in the database reverse engineering as it usually not always clear what is reusable and what is not.
This is a very powerful method that can save you weeks or even months but it has a drawback: the more black boxes you have and the less actions they perform, the more complex code reuse process is. You see functions that compute and verify something in the screenshot above. Perhaps they must be called in a certain order, perhaps they depend on a global program state, etc etc. It would be much easier if there is one function GetAllDatabaseRecords, isn’t it?
Sometimes I meet software written for easy-to-decompile platforms like Java and .NET. There was only a single case where a convenient database interface existed, in all other cases I had to ignore original (decompiled) sources which were too large and difficult to understand in a reasonable time. These are the cases when it is easier to reverse engineer than to read a source code.
[2.4] Code reuse complexity is directly proportional to a number of black boxes and inversely proportional to a volume of actions they perform.
If there are many functions and the role of each of them is small, there is a risk to spend too much time studying their interfaces. If functions are small but the work they perform promises to simplify your work, assess the ratio of time to research their interfaces to the time of reverse engineering from scratch. If there are many functions and they do a lot, by all means try to use them!
This completes a primary analysis. It can include other steps such as reading documentation, comparing a database / a program being analyzed with previous projects, and so on, but I always do the three steps described above.
Entry Point Selection
The next step to understand a database is to select an entry point — a data set whose structure, firstly, does not depend on other structures, and second, will allow you to investigate other structures. Figuratively it looks like this: points corresponding to data required in a task are placed on an axis, and data on the right is conceptually dependent on data on the left, after which the most left point is selected.
The first data set to be processed in our task is a vehicle information. Moreover, we know that the program displays a vehicle selection menu at startup from which it can be concluded that vehicles are an original entry point; think of it as an OEP of the database. From this we make an assumption that “vehicles” does not depend on other structures and is the right choice of an EP.
On the other hand, there are chances that a task requires data whose point is not at a tree root. You have two ways in this case.
- Start from the required point in the hope that knowledge of a previous points structure will not be needed. There is a possibility this will not allow to understand complex structures later. Go this way only when you are acquainted with the subject domain and / or have a database RE experience.
- Research the tree from the root to the required point. This is a stable but more time-consuming way. I always choose it.
[2.5] Select a database entry point. The closer it is to an original entry point, the fewer structures you will have to research beforehand. Ideally if they match.
How can we make sure the EP is selected correctly? I do not know such methods except the intuitive-experienced. Continue to reverse engineer, and if you find yourself in a dead end then try to find another ways. Use a disassembler as a last resort.
Entry Point Reconnaissance
After selecting an EP you need to find a file or files containing required data, which is a vehicle information in our case. I do this by choosing several keywords and searching them in all files. Keywords are strings that exactly characterize data you are looking for. For example, it is known that there is a vehicle with “Bob Cat ML 1980” name, there is “2.3 L OHC” line in Engine drop-down menu, etc.
You should remember a few things when searching for strings.
- They can be in different encodings, usually ASCII or UTF-16. This is easy to handle using Total Commander which allows you to search in several encodings at once (perhaps Far Manager can do that also).
- They can be stored in a nonstandard form, be compressed or be saved in compressed files.
The search found “Bob Cat ML 1980” string in one file only — FNAc.dat
After examining the file we understand that it really contains names of all the vehicles displayed in the program main menu. We will call this file the catalogue. Therefore, the physical representation of the entry point is found.
[2.6] Investigate what file(s) represent a database entry point.
Entry Point Reverse Engineering
The definition of a database was given in the part one and it is time to recall it. Database is a set of binary files which contain structured data and cross reference to each other. Now we are interested in the first part of the definition, namely the statement that data in files is structured. The structure property means that a file is of records of the same format. For example, a single record of the catalogue will describe a single vehicle. Hence, a format of all records is automatically defined after we define a format of one record. In other words, it is sufficient to research the meaning of the bytes related to Bob Cat ML 1980 from the screenshot above to understand the meaning of bytes for all the other vehicles. But files are rarely consist of records of one type in reality; as a rule, they are consist of several tables containing records of different formats.
How to reverse engineer a record format and a file format in general? All the file format reverse engineering techniques are applicable here (recall the abstraction level hierarchy from the first article, I have placed database RE at the top). They are known for a long time (see the references at the end of the article) and I will use them below but will not separate them as main approaches.
The first step during a file research is to try to determine the number of tables in the file. This is done by a visual search for adjacent records which format is different. Let’s take grpinc.dat file as an example. Look at the screenshots below where a contrast between two record formats are clearly distinguished.
Four colored records can be related to, let’s say, table 1 followed by a solid text related to, say, table 2. Each table 1 record has a size of 0x3A. A size of table 2 records is unknown and it is not a fact that all the text is not a single large record. We can conclude that grpinc.dat file has two tables i.e. has two different record formats. This is a simple example. Sometimes you may meet high entropy binaries whose transition from one table to another can be figured out only by disappearance / appearance of some byte pattern. Nevertheless, they are cease to be an issue for a hinted eye.
Let’s return to the catalogue. Having looked the whole file we have not found any contrasting data: there are vehicle names at the beginning and at the end.
At the next step we should determine a size of a single record. For that we should find a pattern and calculate the difference between two adjacent patterns. Take a glance at an example. Let’s seek at the beginning of the catalogue and think of what information is repeated again and again. I’m talking about the meaning of information, not about exactly repeated bytes. The repeated information is vehicle names, strings beginning with “!CAT”, and strings beginning with “VL”. These are the three different patterns. Take one of them — the vehicle name pattern — and calculate the difference between the first and the second vehicle names.
The difference is 0x800–0x0 = 0x800. Thus, the catalogue record size is 0x800 bytes. This is an important file format RE technique and it should be mastered (as well as others, however). As an addition, it is not necessary to take the first two or the last two records to calculate the difference between patterns. You do not know from what byte records begin when going to determine a size of records residing in the middle of a file but still calculate the delta without problems.
It is time to document the results. We will use Kaitai Struct mentioned in the previous article to describe file formats. You can easily compile a file format description into a code in your programming language supported by KS if you need to process the file programmatically later. All we know about the catalogue at this point is the record size so we treat their content as raw bytes.
- id: vehicle
- id: unk_1
Kaitai Struct visualizer named ksv displays it as follows.
Next, you need to gradually investigate record fields instead of a single large field (of 0x800 bytes in our case). There are approaches for that depending on a heterogeneity of bytes. In our case, you can see that almost all bytes mean string characters, therefore we can try to separate each string from its beginning to the beginning of the next string into an individual field. Having done this, we get the following record format.
Let’s analyze each field. The first 0x64 bytes from the beginning of the record to the last space character is named vehicle_name_1 because it is a car name. The next two bytes (0x0026) are likely to be a separate field but its meaning is currently unknown so we named it unk_1. How to determine whether it is a separate field, or those two bytes are belong to the previous string, or may be even to the next string? There is no silver bullet-answer, we should make assumptions and use an analogy: if all the strings end with a space character (0x20) and without NULL-byte then it is unlikely the first string is an exception. Also, these two bytes may represent a length field for the second string but, firstly, its length is 0x64 (from “Aerostar” to “VL38!”) and, secondly, it is seems like an exception again which is doubtful. Checking the assumption on other records (remember, we are talking about the first one now) we make sure it is valid for them.
unk_1 field is followed by vehicle_name_2 as its content is the copy of vehicle_name_1. The next fields are vl_code_1 and cat_code and vl_code_2 strings named for the first letters of their values. The first two fields have a size of 0x16 bytes. vl_code_2 can be of 0x1A2 bytes until the next “MX” string but let’s not be hasty and set its size to 0x16.
The fact is that vl_code_2 field has the same value as vl_code_1 in many records and it would be fair to assume their length is the same. On the other hand, bytes from an vl_code_2 + 0x16 offset to “MX” string is equal to 0x20 (spaces) in all the records, hence they can be included into vl_code_2 but we will separate them into empty_1 field for clarity.
The next supposed field from “MX” string to “A” string has a size of 0xD2 but let’s not rush again and look its value in other records.
It turns out it is an array of strings of 0xA bytes each rather than a single string. Divide 0xD2 to 0xA and get 0x15 which is the number of entries of the array. Their meaning is not clear yet, although some of them look like country codes (US, CA) or state codes (CA, MX) so let’s name the field unk_2.
The following is “A” string whose size can be similarly determined as until the last 0x20 byte. Doing the same as for the previous field we find that it is also an array of 0x9 entries of length 0xC each, let’s name it unk_3.
The meaning of the remaining bytes is less clear since they are not strings.
The first thing we see is the two words (or dwords) in different places where the first value is 0x7C2 and the second is 0x7CB. Over time you will begin to notice numbers like these because they look like year numbers when you translate them to decimal notation: 0x7C2 = 1986, 0x7CB = 1995. Remember that vehicle_name_1 = “Aerostar A 1986–1995” to make sure these numbers are start production year and end production year, so we call them year_from and year_to.
The bytes between year_from and year_to are zeroed out in all the records so we separate them to zero_1 field of size 0x28. The same number of zeroed out bytes is after year_to, let’s separate them to zero_2 also. The purpose of the rest of bytes is unknown at this time so we name the unk_4.
And the last. Passing carefully through several records and noting unk_1 and vl_code_1 and vl_code_2 and cat_code fields we can see that a numeric value of unk_1 is present in those fields in a character form. It could be guessed that unk_1 is a vehicle identifier so we rename it to vehicle_id. As a result, the file format is as follows now.
- id: vehicle
- id: vehicle_name_1
- id: vehicle_id
- id: vehicle_name_2
- id: vl_code_1
- id: cat_code
- id: vl_code_2
- id: empty_1
- id: unk_2
- id: unk_3
- id: year_from
- id: zero_1
- id: year_to
- id: zero_2
- id: unk_4
That was a very simple case because the whole research was to consequently identify the meaning of each field. In anything more complex cases a research ceases to be linear and requires to suppose field types and field lengths, to search cross references between fields of different tables, and so on. Since the focus of this article is database RE rather than file format RE we will not present a detailed analysis of other individual files because our interest is in a higher level — an interconnection between those files.
[2.7] Research and describe a file format representing a database entry point.
BTW, there are almost none of doubts whether database EP is chosen right: we have not met any dependencies on other structures and have obtained enough information for the further research at the same time.
Cross Reference Research
We can continue to reverse engineer in several ways:
- search a required information by known one;
- search a required information and understand how it interconnects with known one.
The first way is to take values of fields — such as vehicle_id, vl_code_1, vl_code_2, cat_code — and search them in other files. If something will be found it could signal a possible interconnection between the catalogue and a file. The meaning of the second way is to search for keywords, use the keywords to find a file containing the required information, to reverse engineer it and to reveal cross references between the file and the catalogue. You can consider these two ways as vectors directed to each other. These are two ways to achieve the same goal.
In life all is more complicated than on abstract pictures so these two ways should be combined. First we should define what a required information is. According to the figure from “Entry Point Selection” section the next structure to research is vehicle parts. Researching part diagrams is not worth to do now since they are conceptually interconnected with vehicles via parts (if this hypothesis is true). Let’s see how the program provides an access to vehicle parts.
After choosing a vehicle a section list appears (“10 WHEELS & RELATED PARTS”, “20 BRAKE SYSTEM”, etc). When selecting one of them a second section list shows up (“10 WHEELS AND WHEEL COVERS WHEELS 1980–1980 [P10853]”, etc). After selecting an entry a part list is displayed with hyperlinks named as part numbers which can be clicked to pop up a bottom menu containing a part list. Currently our goal is to “reach out” to part lists.
As you can see, vehicles and parts are binded via the two section lists which we will call the part tree. The part tree has the two levels named “Alpha Index — Major” and “Alpha Index — Minor” which will be called the first and the second levels for simplicity. Based on the experience of study other vehicle databases, I make an assumption that the part tree and parts themselves are resided either in different data structures or in different files. It means that we should correct the goal: we go from vehicles to the first tree level, to the second level from that, and to parts at last.
How to achieve this goal? If you do a file search for names of the first level (“10 WHEELS & RELATED PARTS”), then you will get only one match in FEULex.dat file containing all localized program strings. Then we could go from this point taking offset to the string and searching the four bytes of the offset in little-endian in files again — this is an effective technique but it does not work in our case because FEULex.dat is referenced by offsets not from the beginning of the file.
It’s time to apply a different approach that is so powerful how is simple. It is based on [2.1] approach which states that it make sense to treat a program as a DBMS. Since it is a DBMS, it requests to a DB i.e. reads from files. Because a DBMS must be effective it does not read all files into memory at startup. So access to the files occurs when a user does corresponding requests to the program by clicking on certain menus, buttons and so on. Consequently, we can determine what files are accessed by the program in the moment of a user action i.e. select a subset of files containing the required data from a set of all the files.
To monitor a file access we will use patched ProcMon. The patch changes the decimal number representation to the hexadecimal one in Detail column. You can compare the screenshots below.
So we are going to find out what file represents the first part tree level. Let’s launch ProcMon, establish two filters: “Process Name is ntvdm.exe” and “Operation is ReadFile”, and stop the monitor. Launch the program, select a vehicle, start to monitor using ProcMon again and click the vehicle data loading button.
Bingo! There are only few files referenced by the program when reading the first part tree level: MCData.idx, MCData.dat, XGROUPS.idx, FEULex.dat, LexInd.idx. The last two are related to localized strings as it has been said earlier. XGROUPS.idx looks uninteresting and does not contain any hint of part trees. The first two is the only what we search for.
[2.8] Monitor an access to files when a program is going to load data you interested in.
From now fun ends and a daily reverse engineer work begins. As we agreed, I will not describe how the files was researched, also because of each of these two deserve individual articles. I had spent more than month to understand them but some other files was researched also, of course. Here is a little summary.
Size: 5,5 MB
Number of tables: 4 (2 headers + 2 data tables)
Purpose: binds a vehicle (vehicle_id) to the first part tree level (ai_major_id), the first level to the second level (ai_minor_id), the second level to a part list (part_list_offset) in MCData.dat and to part diagrams (image_offset) in MCImage.dat and MCImage2.dat
Size: 1,5 GB
Format: encrypted text
Purpose: consists of a part list at part_list_offset for each tree
I must to mention the second derivative from [2.1] approach. A monitoring using ProcMon shows another two things to boost your understanding file internals:
- a block offset and a block size read by a program at a time;
- a block read order.
(Block is a couple of bytes.) Programs read data block-by-block and one block is usually represents one record. Knowing the block offset and the block size we can determine the first table record, the table record size, the number of table records, the number of tables (different record sizes mean reads from different tables). Knowledge of a block read order allows us to plan a file research process: if a program reads table 1 first and then table 2, then it is good to start to research from table 1 since there are chances the understanding of the first table is required to understand the second one.
Also, sometimes knowledge of a block read order allows to understand what algorithm a program implements. Let’s look at the screenshot below.
Here we see many “chaotic” reads from LexInd.idx followed by one “precise” read from FEULex.dat. If we figure out why the program does these reads (a user action) and compare the last read from LexInd.idx with other reads, then it will be clear that it is a binary search. In this case the program searches for a string ID in LexInd.idx, after that it reads the string offset and reads the string from FEULex.dat. Understanding the algorithm allowed me to research these two files faster.
[2.9] Use knowledge of block offsets, block sizes, and block read orders gathered during a monitoring, to research files.
It is time to summarize our findings. We figured out what files contain structures of vehicles, a part tree, and parts, how the files are structured and what are cross references between them. Let’s visualize it.
But this is not the most important thing. My main goal was to introduce you to the database reverse engineering using the approaches developed by myself over several years. I tried to keep the narration on two levels, one of which was about the principles independent of any database internals, and another level was to apply these principles to the real DB research to prove their viability and practicality.
In the next part we will finish to explore the database using the code reuse approach to access part diagrams. And yes, we did not touch a disassembler and a debugger until now.
(There are two attachments to the Russian version of the article, they are an archive of pictures used in this part and NED 2.30 unpacker.)
- DGTEFF — XentaxWiki — http://wiki.xentax.com/index.php/DGTEFF
- How to crack a Binary File Format — http://www.iwriteiam.nl/Ha_HTCABFF.html
- BFF: A grammar for Binary File Formats — http://www.iwriteiam.nl/Ha_BFF.html
- File format reverse engineering, an introduction. — Nada LabsNada Labs — https://nada-labs.net/2010/file-format-reverse-engineering-an-introduction/
- Reverse Engineering/File Formats — Wikibooks, open books for an open world — https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats
- Reverse Engineering a file format — Matthew Ekenstedt — http://matthewekenstedt.com/73-06/reverse-engineering-a-file-format/
- Reverse Engineering Design File Formats | Details | Hackaday.io — https://hackaday.io/project/3149-reverse-engineering-design-file-formats/details
- Reverse engineering visual novels 101 — https://hackernoon.com/reverse-engineering-visual-novels-101-d0bc3bf7ab8
- Reverse engineering visual novels 101, part 2 — https://hackernoon.com/reverse-engineering-visual-novels-101-part-2-9258f547262a
- Experimentation with Reverse Engineering — Trails in the Sky (FC / SC) Extracting Sprite Data w/ Unix Tools & Kaitai Struct — http://vaughanhilts.me/blog/2016/11/16/reverse-engineering-trails-in-the-sky-ed-6-game-engine.html
- Kaitai Struct — http://kaitai.io/
A List of Main Approaches
The First Database Reverse Engineering Approach
- [2.1] Treat a program working with a DB as a DBMS.
- [2.2] Perform a primary analysis of data and code you are working with. Find the maximum information on the internet for the keywords you found during an assessment.
- [2.3] Assess program modules for potential code reusability. Use this approach periodically as you progress in the database reverse engineering as it usually not always clear what is reusable and what is not.
- [2.4] Code reuse complexity is directly proportional to a number of black boxes and inversely proportional to a volume of actions they perform.
Entry Point Research
- [2.5] Select a database entry point. The closer it is to an original entry point, the fewer structures you will have to research beforehand. Ideally if they match.
- [2.6] Investigate what file(s) represent a database entry point.
- [2.7] Research and describe a file format representing a database entry point.
Cross Reference Research
- [2.8] Monitor an access to files when a program is going to load data you interested in.
- [2.9] Use knowledge of block offsets, block sizes, and block read orders gathered during a monitoring, to research files.
The text is for informational purposes only. The author does not liable for any misuse of knowledge obtained from this article. Copying and distribution of information without the permission of the rightholders is illegal.