Business data induction 2022–04–21
Cross-referencing BoardEx tables with Excel
This week, we continued to explore the BoardEx challenge:
Tell me something interesting about UK directors using data from more than one BoardEx table
Person Z demonstrated their working.
- Get all individual employment profiles (UK).
- In Excel, filter and select three directors of interest.
- Copy and paste the three
DirectorIDfields, separated by spaces.
- Get the individual profile details for these three directors.
- Review the results in Excel, only three rows (one for each director).
- In Merge the fields from both tables, one at a time, into a new sheet.
What is good about this approach?
DirectorIDis the common element between the tables.
- Manual approach is good if you have a few matches you just want to look up.
- Good for proof of concept of what data is available.
- Keeping original, unedited data is safer, you can go back and try again.
What is bad about this approach?
- It is slow.
- You could not scale it up, copying many IDs is not efficient.
New approach: working at scale
The second search relies on selecting a number of companies to search on, based on the first search. Instead of copying the ID fields one at a time, we can create a list of many IDs, one per line, as a plain text file. The second search has an option to upload this text file, instead of pasting all the codes. This process is much more efficient when working at scale. However, more effort is required to join this data back together.
We looked at how to prepare this list of IDs and create a plain text file.
Side note: Person A suggested an alternative approach which was familiar, where you could select the codes, paste into Microsoft Word, replace the linebreaks with a space character, and treated in the same way as pasting a list of codes by hand. This approach would work but maybe for tens of codes, rather than thousands.
Steps to prepare list of IDs:
- Select cells in a column which you wish to use, here is
- Create new workbook.
- In the new workbook, choose to Paste as Values.
- Remove duplicates (Data tab, click ‘Remove Duplicates’ button).
- Save the file as a Text (Tab Delimited) (*.txt) and close workbook.
We can then upload them in the second query, choose ‘Browse…’ then Upload a plain text file.
We then download the larger second query and wish to join it back together with the first query. That is much harder when there could be thousands of rows. We look at using tables and VLOOKUP formulas instead.
(High level) Steps to create the mapping:
- Copy the first worksheet ‘employment’. Copy and paste the second worksheet ‘personal’ into the first workbook. These are now our left and right tables.
- In the right table, move the
DirectorIDcolumn to be the first column, then make the data into a table and name it
- In the left table, insert a new column to the right of the
DirectorIDcolumn. We will be pulling in something from the right column here. Let’s assume that the ID is in column G, the new column is H, the first row of data is row 2. We are going to bring in
Nationalitywhich is the sixth column of
personaltable (let’s say).
- In the first row of the new column (H2), start to type the formula:
=VLOOKUP(G2, personal, 6, false)where G2 is the ID you are looking up, personal is the name of the right table, 6 is the column index you are looking up, and false means we want an exact match of the ID. Press Enter.
- You will see the nationality of the director in this row. Copy the formula down for all rows in the left table.
- There may be N/A where a match was not found. You can wrap the formula with another function to remove these, or copy-paste as value and replace any error messages with blanks.
The steps above are only meant to give you an idea of what students will be doing, they are not intended to show you exactly what to do by yourself. Also, we made a named table so that the VLOOKUP formula is neater; you could use a locked cell range.
Why might you choose something other than Excel?
- Other formats you can download from WRDS include Stata, SPSS, SAS.
- The next steps (eg regression) has tools in the other tools already, ready to use, so you may get the data in those formats instead.
- Other analysis packages will encourage you do write scripts which will be more reliable than writing VLOOKUP and manual rough Excel setup. You could keep a record of what you did, make tweaks to re-run it easily, whereas Excel is more rough-and-ready.
- Made Person A reflect on student comments in the UoM Flexible and Blended Learning training and skills group, where people assume that students have such Excel skills which they may not have.
What have learned in general, what is transferrable?
- We saw the context of why you might want to combine data tables. Some searches cannot be done in one step, you need to follow through from one search to another search. We can compare this process to a systematic search (in a systematic review), in that you can’t get the data all in one step and have to show the process you gathered it.
- There is a different emphasis to systematic reviews.
- Excel can be used to manipulate the data and prepare subsequent searches.
- These processes can be applied to other datasets within WRDS.
- We can maintain data integrity by keeping records of each stage, processes, intermediate data files.
- Excel is one tool which we can’t assume all students will use, however, there are more appropriate/advanced ones for reproducible analysis of larger and multiple data sets.
- Install Datastream to your laptop. We will be starting from the beginning with Datastream next time.