The Circle of Data, Information, Knowledge, Understanding, and Wisdom
Published in TAUS Review, Issue 3
Rather than attempting a comprehensive catalogue of translation-related data, this article tries to generalize data-gathering and curation approaches, especially for Asian translation demands. Before looking closer at the details, let's review the famous hierarchy of data, information, knowledge, understanding, and wisdom (DIKUW), and then morph the hierarchy into a circle, or a positive feedback loop, in the hope of shedding some light on the path toward bigger and smarter data.
In 1989, American organizational theorist Russell Ackoff published the paper "From Data to Wisdom" in the Journal of Applied Systems Analysis, proposing the DIKUW relationship:
- Data: symbols;
- Information: answers to "who", "what", "where", and "when" questions;
- Knowledge: answers to "how" questions;
- Understanding: answers to "why" questions;
- Wisdom: evaluated understanding.
The hierarchy is usually depicted as a pyramid, sometimes with an implication of time or value: the earlier, rawer, and easier the layer, the less valuable. It is quite intriguing to view the current trend of big data through this lens: data is valued not only by its size but by its swiftness. Thanks to the Internet, that now seems attainable, if not contradictory. The question is, however, does the urge for big data really not contradict DIKUW?
Perhaps whether two concepts are compatible is largely a matter of context. Once we drop the desire for a universal theory, it becomes clearer that, at least in the context of translation, disambiguation is without doubt a crucial part; hence the compatibility in question can be transformed into a definition problem: when we talk about data, are you thinking what I am thinking?
In the shared field of translation-related research and industry, the data we need is rather more sophisticated than mere symbols. Unlike the common big-data stories of the day, the data translation requires is not just search results or server logs; hence the terms "corpus", "bi-text", "translation memory", etc., along with the actions "curation", "alignment", "annotation", and so on. In other words, when I said translation-related data, the "data" alone, without the modifier, was really just a pile of uncooked ingredients, such as texts, images, audio, or even video, while the whole phrase "translation-related data" was closer to information: who is involved with the data? What is the data about? Where does the data come from? When did the data originate? Here comes a new quest: how do we acquire that information?
Again, thanks to the Internet, and to search engines in particular, ingredients are almost free. The catch is, there is still no free lunch. At first glance, a simple keyword search may lead us to some nice resources. For example, combining what just popped up in the previous paragraph, one may formulate queries with specific language pairs, like "Japanese English corpus", which happens to yield a nice list. Yet the quest of information acquisition for translation-related data demands interdisciplinary collaboration. For Asian translation business in particular, it is not difficult to imagine that, besides the typical prerequisites of domain knowledge and the genre/style of the outcome, one also needs a deep understanding of the differences among Asian languages, or of the heterogeneity between Asian and non-Asian languages. If that sounds exaggerated and intimidating, allow me to provide an almost stupid example beginning with a naïve question (and please bear with me if you already know the answer): where can one find data to assist Japanese-to-English place-name translation for online shopping and shipping?
As a computational linguist, the answer seemed trivial, if tedious, to me: just go scraping Wikipedia, or, to sound more competitive for job security, query DBpedia with SPARQL as an exercise in the "Semantic Web." Of course, it turned out disappointing: neither the coverage nor the quality sufficed. Then came the moment. A colleague in the sales department, worried about sounding amateur and overstepping, suggested: how about the address data of Japan Post? Ta-da!
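For the curious, the DBpedia route looks roughly like the sketch below. The endpoint URL and the vocabulary terms (`dbo:Place`, `dbo:country`, `rdfs:label`) are real DBpedia conventions, but this particular query pattern is my illustrative assumption, and, as noted above, coverage and quality from this route did not suffice for the task:

```python
from urllib.parse import urlencode

# Illustrative SPARQL: Japanese places carrying both Japanese and English labels.
SPARQL = """
SELECT ?ja ?en WHERE {
  ?place a dbo:Place ;
         dbo:country dbr:Japan ;
         rdfs:label ?ja, ?en .
  FILTER (lang(?ja) = "ja" && lang(?en) = "en")
}
LIMIT 100
"""

def dbpedia_url(query: str) -> str:
    """Build a GET request URL for DBpedia's public SPARQL endpoint."""
    return "https://dbpedia.org/sparql?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"})

# To actually fetch results (network access required):
#   import json, urllib.request
#   rows = json.load(urllib.request.urlopen(dbpedia_url(SPARQL)))
```

The request-building step is separated out so the query can be inspected, logged, or swapped without touching any networking code.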
Well, unlike fairy tales, there is always much more after the happy ending. Japan Post's address data turned out to be Romanized in upper case. Normalizing case is not a big deal, but some Romanized terms proved problematic: basement, floor, ward, and several other typical units remain Roman-script Japanese. Luckily, it is still not too hard to search-and-replace them. The really important thing is to be aware of the situation in the first place, and then talk to the customers for a mutual understanding: do you want "ward" to stay "ku" or become "area", or...? Why?
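A minimal sketch of that normalization step might look like this. The unit table below is illustrative, not Japan Post's actual vocabulary, and the `style` parameter encodes exactly the decision that has to be settled with each customer:

```python
# Illustrative mapping of upper-case Romanized Japanese address units.
# Whether to keep the Romanization ("ku") or translate ("ward") is a
# per-customer decision, not a universal rule.
UNITS = {
    "KU":    {"romaji": "ku",    "english": "ward"},
    "CHOME": {"romaji": "chome", "english": "block"},
    "KAI":   {"romaji": "kai",   "english": "floor"},
    "CHIKA": {"romaji": "chika", "english": "basement"},
}

def normalize(address: str, style: str = "romaji") -> str:
    """Tame the upper case, then map each unit term per the chosen style."""
    words = []
    for word in address.split():
        mapped = UNITS.get(word.upper())
        words.append(mapped[style] if mapped else word.capitalize())
    return " ".join(words)

print(normalize("CHIYODA KU"))             # -> "Chiyoda ku"
print(normalize("CHIYODA KU", "english"))  # -> "Chiyoda ward"
```

Keeping the customer-facing choice as data (the `style` column) rather than code means a new customer preference is one dictionary entry, not a new branch.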
Understanding, here we are. For shipment, if the place name will eventually be presented to Japan Post, why not keep it as it is? For other potential customers, say an online photo-sharing site that wants Japanese-English bilingual geo-locations, plain English is better than Romanized Japanese, up to a certain level, and "floor" and "basement" will probably be useless anyway. Furthermore, now DBpedia is welcome again. Every decision the customer approves subsequently becomes evaluated knowledge, hopefully qualifying as wisdom, however small and silly it may look back on the story above.
Wait, isn't this still a long, tiresome, uncertain journey that conflicts with the idea of bigger, quicker, smarter data? I certainly hope not. Imagine: the wisdom of why and how to prepare place-name data will soon feed the next round of data acquisition, and inspire more keywords for search-engine queries. Even better, if one is willing to invest time and money to semi-automate this positive feedback loop, the pyramid of DIKUW becomes the circle of DIKUW for the translation industry. Once the engine of the circle is started, the collision between big data and DIKUW will ease, and the next post-happy-ending quest shall reveal itself.