Derived Knowledge: The Next Frontier for Internet Search

An understanding of dispersed knowledge is enormously important to the functioning of a modern economy or an individual firm. The idea dates to the work of Friedrich Hayek, winner of the 1974 Nobel Prize in economic sciences.

Hayek puts forth the essence of his belief [1]:

“… knowledge [or information] … never exists in concentrated or integrated form, but solely as the dispersed bits of incomplete and frequently contradictory knowledge which all the separate individuals possess.”

The argument is that centralized systems, in government or business, are suboptimal because they cannot capture all existing dispersed knowledge. Much of this knowledge is local, held by many individuals. This is exactly the philosophy behind part of the M Language, a means of making rapid connections between mathematical models and data using the Internet. It emphasizes a link between the global and the local. The M Language was developed at the MIT Data Center (2003–2009), and many publications describe the idea and the associated software [2] [3] [4]. The authors won two major awards, and several startups formed.

As powerful as the thinking of Hayek has proven to be concerning dispersed knowledge, he did not consider the refined vision of derived knowledge. Even today no rigorous definition exists in this area. Yet in the future, derived knowledge will be at least as important as dispersed knowledge. Both share the common element of fragmentation. While Google has commercialized methods for searching dispersed knowledge, there is no distinct technique for derived knowledge. Google treats both as the same, causing problems for users. This gap is a big opportunity for search, whether on the Internet or elsewhere, internal or external to a firm or entity.

Most define derived knowledge as knowledge inferred from another source, with the source being primary qualitative knowledge or data. It is a basic idea in artificial intelligence (A.I.). I extend the definition to include the outcome of any mathematical model. A model uses inputs to produce outputs, which give a new insight and create knowledge. The outputs are derived knowledge, something that exists separate from the data inputs. Models are a specific case and are an abstraction. Not all derived knowledge is model based. However, all derived knowledge involves some degree of abstraction.

Derived knowledge offers a special insight not widely known. It is the product of computation and draws from many primary sources. For example, a large data set on retail operations, when analyzed using machine learning, would yield rules that describe patterns of buyer behavior. The rules are derived knowledge.
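To make the retail example concrete, here is a minimal sketch of how rules might be derived from primary data. The baskets, the thresholds, and the function name derive_rules are all hypothetical, and simple pairwise support and confidence counting stands in for whatever machine learning method a real analysis would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical primary data: each set is one shopper's basket.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def derive_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    Support is the share of baskets containing both items.
    Confidence is the share of baskets with the antecedent that
    also contain the consequent.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = derive_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

The baskets are primary knowledge. The rules printed at the end exist nowhere in the data as stated facts; they are derived from it by computation.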

My viewpoint is that for the Internet, all primary knowledge expressed in web pages must somehow be kept separate from derived knowledge. The argument for separation is that various versions of derived knowledge could exist based on different assumptions for a single model. All versions are equally valid from a model execution standpoint. However, a single version usually is the best for describing real-world behavior. In contrast, dispersed knowledge is primary, and few if any alternate versions of it exist.
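The point about versions can be sketched in a few lines. The compound growth model, the candidate growth rates, and the observed values below are invented purely for illustration; the names Version, growth_model, and best_version are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    assumption: float  # assumed growth rate per period (hypothetical)
    outputs: tuple     # model outputs produced under that assumption


def growth_model(start, rate, periods):
    """A deliberately simple model: compound growth from a starting value."""
    return tuple(start * (1 + rate) ** t for t in range(periods))


def derive_versions(start, rates, periods):
    """Run the same model once per assumption; every run is a valid version."""
    return [Version(rate, growth_model(start, rate, periods)) for rate in rates]


def best_version(versions, observed):
    """Pick the version with the smallest squared error against observations."""
    def squared_error(version):
        return sum((p - o) ** 2 for p, o in zip(version.outputs, observed))
    return min(versions, key=squared_error)


versions = derive_versions(100.0, [0.00, 0.05, 0.10], periods=4)
observed = (100.0, 104.0, 111.0, 116.0)  # invented real-world measurements
best = best_version(versions, observed)
print(f"{len(versions)} versions derived; best assumption = {best.assumption}")
```

All three versions execute correctly, yet only one tracks the observed behavior closely. A search system that mixes these outputs with primary knowledge has no way to tell the user which version it is returning.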

As such, organizing derived knowledge using the Internet is a grand challenge. Thinking in this area is just beginning. In this regard, search engine technology has several shortcomings that need attention.

First, one or several concepts (in the computer science context) make up derived knowledge. A concept is a unit of meaning. The precision of defining a concept in a machine understandable way is lacking. This makes search less efficient. In practice, many concepts overlap. Often the connections between concepts are not stable through time. This makes consistent separation and linkage difficult. In this case, search algorithms break down.

Second, all derived knowledge must have some form of identification. This is critical for search. Simply knowing that a bit of knowledge derives from other sources is important. To date, there are no systematic identifiers in this regard and no computerized frameworks exist to organize identifiers. Further, it is unlikely that all existing derived knowledge can undergo categorization without extensive use of A.I. The success of such applications is an open question because A.I. is most effective when a foundation of identifiers exists.
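As one possible illustration of what an identifier could look like, the sketch below builds a deterministic ID from the sources, method, and parameters behind a piece of derived knowledge. The "dk-" prefix, the record fields, and the example URLs are assumptions made for this sketch; no such scheme is established, and this is not part of the M Language.

```python
import hashlib
import json


def derived_id(sources, method, parameters):
    """Build a deterministic identifier from the provenance of a result.

    The same sources, method, and parameters always give the same ID,
    regardless of the order in which the sources are listed.
    """
    payload = json.dumps(
        {"sources": sorted(sources), "method": method, "parameters": parameters},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return "dk-" + digest[:16]  # hypothetical prefix marking derived knowledge


def make_record(statement, sources, method, parameters):
    """Wrap a derived statement with its identifier and provenance."""
    return {
        "id": derived_id(sources, method, parameters),
        "kind": "derived",
        "statement": statement,
        "sources": sorted(sources),
        "method": method,
        "parameters": parameters,
    }


record = make_record(
    statement="Shoppers who buy bread usually buy butter.",
    sources=["https://example.com/retail-2023.csv", "https://example.com/stores.csv"],
    method="pairwise-association",
    parameters={"min_support": 0.4, "min_confidence": 0.7},
)
print(record["id"], record["kind"])
```

A search engine that encountered such a record could tell at once that the statement is derived, trace it to its sources, and distinguish it from another version produced with different parameters.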

Finally, at the base of A.I. categorization of derived knowledge must be a treatment of machine understandable semantics. This suggests that central agreement must exist on semantics with the goal of ending ambiguity. Advances in this area will make A.I. soar.

The age is approaching where, in contrast to humans, models will create large amounts of derived knowledge. Among these, A.I. and machine learning are foremost. Other types of models will contribute as well. The framework put forth by the M Language is a prototype that would deal with some of the shortcomings of current Internet technology in searching for derived knowledge. While much more work needs to happen, many of the basic concepts of the M Language apply.

Edmund W. Schuster

schuster.us.com

References:

[1] Hayek, F.A., 1945. The use of knowledge in society. The American Economic Review, 35:4, pp. 519–530.

[2] Schuster, E.W., H-G Lee, R. Ehsani, S.J. Allen, J.S. Rogers, 2011. Machine-to-machine communication for agricultural systems: an XML-based auxiliary language to enhance semantic interoperability. Computers and Electronics in Agriculture 78, 150–161.

[3] Lee, H-G, E.W. Schuster, S.J. Allen, P. Kar, 2010. The open system for master production scheduling: information technology for semantic connections between data and mathematical models. International Journal of Operations Research and Information Systems 1, 1–15.

[4] Brock, D.L., E.W. Schuster, S.J. Allen, P. Kar, 2004. An introduction to semantic modeling for logistical systems. Journal of Business Logistics 26, 97–117.