Data - Information - Knowledge
This article explains in a not too technical way the basic prerequisites for information integration and knowledge-based processes on the Web. The explanations take a simplified, largely IT-oriented point of view.
Leveraging the World's largest Knowledge Base
These days almost any imaginable piece of information is available on the Web, with search engines and browsers as ubiquitous tools that help us find and retrieve whatever we are interested in. However, utilizing Web information is still far from perfect. While it's easy for a human to figure out the meaning of a web page ("Mom! They sell pokemons!"), this usually isn't true for software programs. How much more could we benefit from the knowledge freely available online if our computers understood the information encoded in Web documents?
The figure below illustrates the general flow of knowledge processes. The right part of the graphic shows examples corresponding to the concepts on the left.

Data
"Data" as an IT term is usually understood as just raw bits, bytes, or characters. From a Web client perspective, this could be opaque ASCII code retrieved via HTTP; for a Web user, it could be some (unread) text on a Web page. The example data in the graphic is composed of simple characters (<, d, c, t, :, m, o, ...).
Syntax
"Syntax" means that the data conforms to a defined structure (i.e. to a grammar). You probably immediately identified the dct:modified tag in the data example. That's because humans are very good at format detection (and at the higher layers as well). Any "consumer" trying to turn data into something useful has to be able to identify syntax constructs in the code at hand (or in any other carrier such as audio or video). A human may recognize a text as being written in English, an RSS reader could detect a supported feed format, and a speech recognition program has to extract sentences from an incoming audio signal. The result of applying a known grammar to data (aka "parsing" or "de-serializing") is a structure which can then be further processed.
Parsing a format is not necessarily a one-step activity, as syntaxes can be layered: HTML, RSS, and Atom are built on top of XML, Microformats are based on HTML, etc. The same holds in non-IT domains: characters are used to construct words and delimiters, which can be arranged into sentences conforming to, for example, English grammar (the author of this article probably does a bad job concerning the latter ;).
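The parsing step described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document and the function name parse_to_pairs are made up here, since the article's figure only shows that the data contains a dct:modified element. The sketch applies the XML grammar to a string of characters and ends up with a plain structure, nothing more.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: the wrapper element "doc" is hypothetical,
# only the dct:modified element is taken from the article's example.
DATA = (
    '<doc xmlns:dct="http://purl.org/dc/terms/">'
    '<dct:modified>2006-12-12</dct:modified>'
    '</doc>'
)

def parse_to_pairs(data):
    """Apply the XML grammar to raw characters and return name/text pairs."""
    root = ET.fromstring(data)  # raises ET.ParseError if the syntax is violated
    pairs = {}
    for element in root:
        # ElementTree reports names as "{namespace}local"; keep the local part.
        local_name = element.tag.split('}', 1)[-1]
        pairs[local_name] = (element.text or '').strip()
    return pairs

structure = parse_to_pairs(DATA)
print(structure)  # {'modified': '2006-12-12'}
```

Note that the result is still just a pair of strings. The parser knows the XML grammar, but nothing about what "modified" or "2006-12-12" mean.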
Semantics
Being able to parse data is a core requirement, but strictly speaking, it's not sufficient for getting at the encoded information. What is missing is the actual meaning of the elements in the identified structures: their so-called "semantics". For instance, the semantics of dct:modified is "Date on which the resource was changed." (taken from the Dublin Core website). There are different ways to express semantics, e.g. in natural language, or in more formal ways. The Semantic Web focuses on the latter, which opens the door to automated information processing (and thus less custom program code, as we will see later).
Structured Data + Semantics = Information
Using the example from the graphic again, parsing the XML only gets us as far as, say, creating a key-value pair such as modified => 2006-12-12. Again, the information is obvious to humans, but a machine knows neither the semantics of "modified" nor those of "2006-12-12". Without further help, it can't do anything reasonable with the structure.
As software developers we usually wire the semantics directly into our program code, i.e. we don't make them explicit, but for example programmatically loop through structures, convert strings into native types, and write methods that do comparisons and output results. With regard to our example, this would mean parsing a number of available documents, extracting the values of the dct:modified tags, converting them into dates, ordering them, picking the most recent one, and returning the identifier of the document the date was extracted from. So we do information processing today (of course!), but we don't really separate the semantics from program code. This means that we have to create custom code not only for parsing data (usually only once for each format, though), but also for processing the encoded information (often for each use case individually).
Another aspect of the distinction between information and data is that in order to persistently store information, we have to move down the pyramid, i.e. we have to convert ("serialize") information back into data. The data can then be saved, duplicated, sent around, and parsed again. Semantics are always involved in data conversions. Take for example an RSS-to-Atom converter: it needs a parser for RSS data, built-in semantics to transform RSS information into Atom structures, and an Atom serializer. Even if the implementation were based solely on data-level string replacements, semantics (and information processing) would still be needed, in this case buried in the replace parameters.
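The "most recent document" use case from this section, with the semantics wired directly into the code, might look like the following Python sketch. The document identifiers, the sample XML, and the function name most_recent_document are assumptions made for illustration. Note where the semantics hide: in the hard-coded element name, in the decision to treat its text as a date, and in the choice of max as the comparison.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The element name is hard-wired: the program "knows" where the date lives.
DCT_MODIFIED = '{http://purl.org/dc/terms/}modified'

def make_doc(modified):
    """Build a hypothetical sample document carrying a dct:modified value."""
    return (
        '<doc xmlns:dct="http://purl.org/dc/terms/">'
        '<dct:modified>' + modified + '</dct:modified></doc>'
    )

# Hypothetical documents keyed by identifier.
DOCUMENTS = {
    'doc-a': make_doc('2006-12-12'),
    'doc-b': make_doc('2006-11-30'),
    'doc-c': make_doc('2007-01-05'),
}

def most_recent_document(documents):
    """Return the identifier of the document with the latest dct:modified."""
    dated = []
    for doc_id, data in documents.items():
        root = ET.fromstring(data)          # syntax: data -> structure
        element = root.find(DCT_MODIFIED)   # semantics buried in a constant
        if element is None or not element.text:
            continue                        # skip documents without a date
        # Semantics again: we simply decide that this string is an ISO date.
        modified = date.fromisoformat(element.text.strip())
        dated.append((modified, doc_id))
    if not dated:
        return None
    return max(dated)[1]                    # purpose: pick the latest one

print(most_recent_document(DOCUMENTS))  # doc-c
```

Every new question about the same data (oldest document, documents changed in 2006, and so on) would need another function like this one, which is the cost of keeping the semantics implicit.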
Recap:
- Information is interpreted structured data.
- Information as such is an "in-memory" thing.
- Only simple data operations (e.g. copy, move, delete) don't need to know the semantics of processed data.
- In any other case: No Semantics => No Party.
Targeted Combination of Information = Knowledge
From a technical point of view, the border between Information and Knowledge is blurry, and there are different definitions of the term "Knowledge". The way it is presented here is just one of them, but (we hope) it fits well into the IT perspective taken for this article. Information processing is not an end in itself. There is always an objective behind it. The objective of the little use case in the graphic is finding the most recent document. Creating knowledge means combining information for a certain purpose, i.e. knowledge is context-specific. However, knowledge can again be combined with other information to "generate" further knowledge (possibly in a completely different context). The Semantic Web offers technologies to represent information in a way that facilitates this sort of purposeful integration.
The higher layers of the Semantic Web technology stack even include ingredients for advanced knowledge processing, but as we hope to have conveyed by now, the essential pieces are parsable structures and descriptions of their semantics. The Semantic Web is still based on data. Moreover, in most cases, a software program does not even need to consider the complete semantics of the data it is working with. With their bottom-up and Web-based approach, Semantic Web technologies allow making just a part of the semantics explicit (i.e. processable by software), namely the part needed for the use case at hand. This explicit part can be extended in the future, even by other developers, without going down the slippery slope of defining some global "world model".
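To make the idea of partially explicit semantics a bit more tangible, here is a toy Python sketch. It is not RDF, RDF Schema, or any actual Semantic Web API: the SEMANTICS table, the CONVERTERS mapping, and the function names are all invented for this illustration. The point is only that the meaning of a property is kept as data next to the program, so that generic code can use it and others can extend it later.

```python
from datetime import date

# Hypothetical, hand-written description of property semantics.
# Only the part needed for the use case (the datatype) is made explicit.
SEMANTICS = {
    'http://purl.org/dc/terms/modified': {
        'label': 'Date on which the resource was changed.',
        'datatype': 'date',
    },
}

# Generic converters, selected by the declared datatype.
CONVERTERS = {'date': date.fromisoformat}

def interpret(prop, raw):
    """Turn a raw string into a typed value if the property is described."""
    description = SEMANTICS.get(prop)
    if description is None:
        return raw  # no known semantics: the value stays plain data
    return CONVERTERS[description['datatype']](raw)

def latest(records, prop):
    """Generic: return the id whose value for prop is greatest."""
    typed = [(interpret(prop, values[prop]), rec_id)
             for rec_id, values in records.items() if prop in values]
    return max(typed)[1] if typed else None

MODIFIED = 'http://purl.org/dc/terms/modified'
RECORDS = {
    'doc-a': {MODIFIED: '2006-12-12'},
    'doc-b': {MODIFIED: '2007-01-05'},
}
print(latest(RECORDS, MODIFIED))  # doc-b
```

Supporting another date-valued property would then mean adding an entry to the description table rather than writing new processing code, which is the kind of upgrade path the paragraph above refers to.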
Summary
- Data = Documents, Pages, Files, ...
- Structured Data = Data + Syntax
- Semantics = Meaning of Structured Data elements
- Information = Structured Data + Semantics
- Knowledge = Purposeful Combination of Information
- The Web: A Data repository
- The Semantic Web: Turning the Web of Data into a Web of Information
