Why I hate the “Data Catalog” term

The messier your data warehouse is, the sooner you should implement a data catalog

Arnaud de Turckheim
CastorDoc
Jul 21, 2021


For the last twelve months we’ve been building Castor, and yes, according to industry standards, it’s a data catalog.

Here is the problem: too many people believe that the end goal of a data catalog is to nicely document your data assets. In that view, you plug a data catalog only after you’ve cleaned your warehouse and defined your data KPIs, like the icing on the cake.

We have a different vision. To bring visibility to the internet, we didn’t organize it in clean folders. We plugged Google on top. If your data warehouse is messy, if it takes time to find the relevant data, if you have trouble trusting your data, don’t spend weeks cleaning it: plug a search engine on top. Castor is a powerful search engine meant to help you find and trust data assets.

At Castor, our end goal is to make data users more efficient in answering questions.

Thanks to Castor, they find, understand, and use their assets faster, no matter how messy your warehouse is, and even if you do not have any documentation yet.

The best metaphor I’ve found so far is hiking in a forest. Without Castor, any new analyst gets dumped in the forest. If staffing allows at that time, an experienced analyst comes along for a few hours to give them a quick tour, show where water can be found, where the grizzly sleeps… Then the buddy leaves and the new analyst is left alone to do the job. Yes, a question can be asked around from time to time, but that depends on the team’s availability.

And with Castor? Your new analyst gets a fancy map of the forest, showing points of interest and paths frequently used. Oh, and this map is automatically updated. To add two metaphors on top of the first one, it’s kind of like the Marauder’s Map in Harry Potter, or the Age of Empires map when using cheat codes (I swear these are the last metaphors of this article).

So this map is amazing if you’re in the darkest forest ever (like Fangorn Forest, oopsy, another super geeky metaphor), but it’s also useful in our beautiful city of Paris in France.

How does that translate to the data world? Through key features such as:

  • Never getting lost using an unknown/unpopular/deprecated table, thanks to table and dashboard popularity
  • Finding all tables containing a specific column
  • Not reinventing the wheel, thanks to a query history organized by table
  • Grasping dependencies, thanks to lineage
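
To make a few of these features concrete, here is a minimal Python sketch. It is not how Castor works internally; it only illustrates the ideas of column search, query history by table, and popularity. The table names, the query log, and the function names are all made up for the example, and the regex is a deliberately naive stand-in for a real SQL parser.

```python
import re
from collections import Counter, defaultdict

# Hypothetical warehouse metadata: table name -> column names.
TABLE_COLUMNS = {
    "analytics.orders": ["order_id", "customer_id", "amount"],
    "analytics.customers": ["customer_id", "email"],
    "legacy.orders_v1": ["order_id", "amount"],
}

# Hypothetical query log, as raw SQL strings.
QUERY_LOG = [
    "SELECT * FROM analytics.orders",
    "SELECT o.amount FROM analytics.orders o "
    "JOIN analytics.customers c ON o.customer_id = c.customer_id",
    "SELECT order_id FROM legacy.orders_v1",
]

# Naive assumption: a table is whatever identifier follows FROM or JOIN.
# A real SQL parser handles subqueries, CTEs, quoting, and much more.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def tables_with_column(column):
    """Return the sorted names of all tables that contain `column`."""
    return sorted(t for t, cols in TABLE_COLUMNS.items() if column in cols)


def queries_by_table(log):
    """Group queries by the tables they reference."""
    history = defaultdict(list)
    for query in log:
        # set() so a table referenced twice in one query is counted once.
        for table in set(TABLE_REF.findall(query)):
            history[table].append(query)
    return dict(history)


def popularity(log):
    """Count how many queries reference each table."""
    return Counter({t: len(qs) for t, qs in queries_by_table(log).items()})
```

With this toy data, `tables_with_column("customer_id")` lists the two `analytics` tables, and `popularity(QUERY_LOG).most_common(1)` puts `analytics.orders` on top, which is the kind of signal that tells a newcomer where to look first.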

The messier your data is, the more value Castor can bring. Castor is a data exploration/discovery tool (leveraging data cataloging features, yes). If it is a mess, use cheat codes now and clean later.

Clean with Castor

Castor helps you clean your data warehouse and BI tools. Credits: The Creative Exchange

We have some useful features to help Castor admins make their warehouse cleaner. The main one is that content is always prioritized by popularity, so that documentation effort can be aligned with what people actually use.

A reflex we’ve seen a lot is to document source tables first, even if these are never used directly. Instead, we advise our clients to start with the top 10 most popular tables, listed in Castor. Of course, these tables, thanks to our very own SQL parser, are already enriched with lineage information and query history.

Also, at first, some of our clients only wanted to show their neatest schemas in Castor, the well-structured and approved ones, and hide the ugly ones, regardless of how much these were used.

After a few weeks of working together, another strategy emerged: we added all schemas back into Castor. Our clients tagged their officially approved tables and dashboards. Finally, they added redirects from popular, soon-to-be-deprecated ones to their new counterparts. Here again, Castor played its “map” role. This pattern is even stronger when clients are migrating data to a new warehouse.
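
The redirect idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the `REDIRECTS` mapping, the table names, and the `resolve` function are hypothetical and say nothing about how Castor stores or follows redirects.

```python
# Hypothetical redirects: deprecated table -> its replacement.
REDIRECTS = {
    "redshift.sales_old": "snowflake.sales_v2",
    "snowflake.sales_v2": "snowflake.sales",
}


def resolve(table, redirects=REDIRECTS):
    """Follow redirects until reaching a table that has no replacement."""
    seen = {table}
    while table in redirects:
        table = redirects[table]
        if table in seen:
            # Guard against a loop such as a -> b -> a.
            raise ValueError(f"redirect cycle at {table}")
        seen.add(table)
    return table
```

Here `resolve("redshift.sales_old")` follows the chain to `snowflake.sales`, while a table with no redirect is returned unchanged. Users who land on the old, popular table are pointed to its current counterpart.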

To put things simply: make your users more efficient on your brand-new, well-documented dbt models, but also on that old production database you never want to hear about, if they use it a lot.

Migrate with Castor

Castor helps you migrate from one warehouse cloud provider to another. Credits: Kenrick Mills

I love hearing the sentence “oh, we’re in the middle of a migration from Redshift to Snowflake, we’ll plug Castor when we’re done”. Why do I love that sentence? Simply because I know by heart the arguments for plugging Castor as soon as possible.

Remember the paragraph above? About “The messier your data… The more value”. Could a warehouse be messier than during data migration? We typically see clients use Castor to map new & old tables; their users can see all old and new content in the same place, with links between them. (note: a reference to the best TV show ever is hidden in this paragraph, I’m afraid only very few people will get it)

Build with Castor

Use Castor as you build your data stack. The earlier the better. Credits: Randy Fath

Are you laying down the foundation of a modern data stack in a company starting its data journey? Lucky you! These are amazing times indeed. Plug Castor, now. Why? Because the sooner you enable exploration and documentation, the less work and hassle it will be. It’s super hard to climb that mountain when it’s 8,000 meters high…

Plug Castor if:

  • You have more than 100 tables and the only way to get some knowledge about them is by asking questions on Slack
  • You’re building a data team and stack from scratch (see this article): plug Castor to avoid legacy and enable your users from month 1
  • You’re migrating from Redshift to Snowflake or from Snowflake to BigQuery (yes, we’ve seen that too): plug Castor to help your users find their way between old datasets and new ones

And if you don’t want a fancy tool and Excel makes sense for you, we’ve built a delightful data catalog template.

Enough of this self-promoting talk, I think you got it now 😉

Originally published at https://www.castordoc.com.
