What is a Data Dictionary?

Bonus: a data catalog template to get you started

Louise de Leyritz
CastorDoc
6 min readNov 3, 2021

--

If you’ve belonged to a data team for any length of time, you’ve probably felt submerged by the amount of tables you’re meant to keep track of. You’ve also probably noticed that other data analysts and data scientists spend a lot of their time dealing with questions of the style:

  • Where is the best data to answer my question?
  • What does the column name “XXXX” mean?
  • Can I trust it?
  • When was it last updated? What is the process to create it?
  • Who can I contact if I see something wrong?
  • Has someone already worked on this question?

That is, data people are spending more time on metadata management than on meaningful value-generating data analytics work. The good news is that the data dictionary, or metadata repository is a tool that can help with all these questions, allowing data people to focus on the core of their work. If you’re looking for such a tool, take a look at our data catalog benchmark. But let’s say you don’t want to install a data dictionary just yet, for any reason, what can you do?

The Excel solution

An “on the go” solution is for your data team to take care of data documentation using a shared excel spreadsheet and maintaining the data dictionary manually.

This spreadsheet; central repository of metadata, will contains a description of data objects. This includes table location, table and database description, quality of the data, details about the table columns, etc.. This spreadsheet will be a point of reference for anyone in your company looking for specific a table, allowing them to quickly find, understand and use the data objects.

‍This excel data dictionary is easy to use, reliable, and will quickly deliver value to your data team. The only drawback is that the spreadsheet becomes harder to maintain as your tables multiply. However, nothing stops you from starting manually before migrating to an automated data dictionary later.

‍Get the data discovery catalog template here. ‍

How to use this template?

For all the data objects you have, you can document it using this template. For example, take a table containing a list of your platform’s users in 2020. When documenting your table, you should pay particular attention to three aspects: Location, quality, and governance. We go through each aspect in turn.

Location

In the documentation process, your should aim at locating your table in terms where it can be found, and who is responsible for it. This, so that employees can find the table easily. You thus start by indicating the location of the table ( data storage — database — schema). That is, whether the table is located in your data warehouse or your data lake. You should also specify in which database and which schema the table can be found. If possible, indicate the URL of the table ( Data source endpoint), so it can be found easily by employees. You then outline the table name, a short description of what’s contained in the table, and a few tags. It’s essential to precise the name of the table owner, usually the person who created the table. This, so that employees know exactly who to contact for questions about the table instead of roaming desperately in quest of an answer. You can feature a list of contributors: the users who voluntarily enrich the table documentation. This a way to give credit to these data fairies, and to set them as reference points for fellows who might have questions about the table.

Quality

The second aspect of data documentation relates to data quality. You want to ensure that your data is coherent and qualitative, and to organise that the best data assets get the most users. In this light, start by filing the refresh frequency column. Is this table refreshed every 30 min, weekly? Monthly? This will allow user to spot an outdated table and wave a red flag. The quality score bit is meant to be filled by the data steward, or the person in charge of data quality. Is the data complete, accurate, consistent? We simply use emojis for this. But feel free to use whatever system you feel most comfortable with. The quality score is important, as analysts will be able to locate quickly which tables are readily usable for conducting great analysis.

Once your table is clearly documented, it’s time to clarify what’s happening on the column side. I hear your sighing already. Don’t. You have probably observed analysts reach high levels of frustration when finding themselves unable to start work because they (understandably) can’t figure out the name of some columns. And that’s the start of huge productivity losses for your data team. Hence the importance of outlining the column name and a short column description for each column. You should also precise the type of each column, describing the value it contains. Column types can be Text, Number, Date/Time or Location. This adds a layer of context for your analysts, who won’t have to pull out the table in question to get this information.

For the documentation level part, you don’t have to do anything. The template calculates the quality and level of your documentation. The score is between 0 and 1, with 1 indicating the highest documentation level. Again, this guides analysts towards well-documented tables, and encourages table owners to document their tables (a poorly documented table reflects badly on table owners.)‍

Governance

Finally, even a manual data dictionary can help you deal with governance concerns. In this dictionary, you can indicate whether a given column qualifies as PII (Personally Identifiable Information), meaning that it contains any data that could potentially be used to identify a particular person. This allows you to deal with issues of privacy, for example by restricting access to columns marked as PII.

Additional features

We encourage you to customize this data dictionary to your particular business, so don’t hesitate to add additional fields of documentation! The template proposes a few, although they aren’t part of our automated data catalog offer.

Dashboards & data dictionary

The “Dashboard” tab allows you to document your dashboards in the exact same manner as with tables.

The template also contains a “dictionary” tab. This is the space of the business glossary, containing common vocabulary and definitions for shared data. where words that are often used in the company are clearly defined. This ensures that business keywords are always used in the right context, and that everyone in the company agrees on one specific meaning. Take the term “paying user” for example. This word is used recurrently in the product department of companies. Although this word seems straightforward, it can be interpreted differently by different employees. For example, an analyst might categorize users on a free trial as “paying users”, while another might exclude them. Hence the importance to have a glossary for each department. The term owner is the person who wrote the definitions and is generally ready to answer questions about a specific word.

Data dictionaries communicate metadata regarding data in a system or a database. They are at the root of any database management system (DBMS). The primary purpose of a data dictionary is good metadata management. For each table in the database, a data dictionary includes the following:

  • Table name — the predefined name for the table
  • Table location — Precise database, file, book…
  • Table description — A detailed description of the contents and purpose for the table in the database schema. Example: “Each row represents a user of our platform”
  • Column Name — The predefined name for the column
  • Column Description — A detailed description of the contents and purpose for the column including Primary Key and Foreign Key designations.
  • Data Type — The predefined characteristics for the column. Data types vary according to the database management system (DBMS). For example, a column marked as BLOB data type in Oracle would be marked as TEXT in Microsoft SQL server.
  • Null Value — Indicates whether or not null values can be stored for the column.
  • Business rules — Includes schema validation or data quality
  • Entity-relationship and other system-level diagrams‍

If you have any questions about how to use the template, please reach out at louise@castordoc.com

Originally published at https://www.castordoc.com.

--

--

Louise de Leyritz
CastorDoc

Bridging Data and Business Value | Technical Writer | Host of the Data Couch Podcast 🛋️ https://www.linkedin.com/in/louise-de-leyritz-873049b2/