Classifying the web and getting people what they are looking for: now that is a daunting task when you consider that there are over 10 billion web pages, that most of them are unstructured, and that a whole lot of them contain information hidden behind security, scripts and non-textual media.
The challenge is made more daunting by the emergence of folksonomies in Web 2.0. Everyone can “tag” a piece of content and build up their own “tag cloud”, i.e. each person has a classification that makes intuitive sense to them and can assign one or more tags to a piece of information. Each person can then use this classification privately or share it with others collaboratively.
So it is entirely possible that we are both talking about the same thing but have very different ways of classifying it. For example, an interstate highway could be classified by one person as “path to the east”, by another as “path to the west”, and by a third as “highway greater than 2000 miles, unsafe route, drive by day only”. People looking for “unsafe routes” would find the third person’s comments on this highway, and so on.
All these single or multiple tags reflect an individual perception of the information, point to the same object (“the interstate highway”) and are searchable under different contexts. Obviously, the quality of any search will depend heavily on the quality of the tagging that describes the content, and that in turn will depend on... you!
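To make the highway example concrete, here is a minimal sketch of a folksonomy-style tag index in Python. The class name `TagIndex`, its methods and the user names are all hypothetical, invented for illustration; real tagging services differ in many ways. The sketch assumes tags are matched exactly after trimming and lowercasing.

```python
from collections import defaultdict


class TagIndex:
    """Toy folksonomy: any user can attach any tags to any object."""

    def __init__(self):
        # tag -> set of (user, object) pairs that carry this tag
        self._by_tag = defaultdict(set)
        # object -> set of every tag anyone has applied to it
        self._by_object = defaultdict(set)

    @staticmethod
    def _normalize(tag):
        # Assumption: tags match exactly after trimming and lowercasing.
        return tag.strip().lower()

    def tag(self, user, obj, *tags):
        """Record that `user` applied each of `tags` to `obj`."""
        for raw in tags:
            t = self._normalize(raw)
            self._by_tag[t].add((user, obj))
            self._by_object[obj].add(t)

    def search(self, tag):
        """Return sorted (user, object) pairs tagged with `tag`."""
        return sorted(self._by_tag.get(self._normalize(tag), set()))

    def tags_for(self, obj):
        """Return every tag applied to `obj`, by anyone, sorted."""
        return sorted(self._by_object.get(obj, set()))


index = TagIndex()
index.tag("alice", "interstate highway", "path to the east")
index.tag("bob", "interstate highway", "path to the west")
index.tag("carol", "interstate highway",
          "highway greater than 2000 miles", "unsafe route",
          "drive by day only")

print(index.search("unsafe route"))    # only carol's tagging matches
print(index.search("unsafe routes"))   # plural form finds nothing
print(index.tags_for("interstate highway"))
```

Note that the plural “unsafe routes” returns nothing in this sketch: with exact matching, search quality really does depend on how each person chose to word their tags.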
The problem is compounded when we move to other media such as pictures and sound. A picture is worth a thousand words alright, but whose thousand words?
So engines like Vivisimo offer clustered search as a potential solution. Vertical search (as propounded by Sramana Mitra) provides another solution, while the newly proposed authoring tool from Google (“Knol”) provides yet another route to such classifications, this time for content at an “internal to page” or sectioned level.
So does Sir Tim Berners-Lee with his semantic web. It makes sense, doesn’t it? A whole lot of information is based on facts or objects whose definition everyone agrees on, such as “credit card” or “interstate 101”. So it seems natural that these keywords/tags should be used uniformly across the web.
And then there is faceted classification, a really interesting area that builds upon the uniqueness of the challenge we have here.
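As a rough illustration of the general idea (not a description of any particular product), faceted classification describes each item along several independent dimensions, or facets, and lets a searcher combine them freely. The sketch below assumes a small hypothetical catalogue; the item names, facet names and values are all made up for the example.

```python
from collections import Counter

# Hypothetical catalogue: each item has one value per facet.
ITEMS = {
    "interstate highway": {
        "kind": "road", "length": "over 2000 miles", "safety": "unsafe",
    },
    "coastal route": {
        "kind": "road", "length": "under 500 miles", "safety": "safe",
    },
    "night ferry": {
        "kind": "ferry", "length": "under 500 miles", "safety": "safe",
    },
}


def filter_by_facets(items, **wanted):
    """Return sorted names of items matching every requested facet value."""
    return sorted(
        name
        for name, facets in items.items()
        if all(facets.get(facet) == value for facet, value in wanted.items())
    )


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return dict(Counter(f[facet] for f in items.values() if facet in f))


print(filter_by_facets(ITEMS, kind="road", safety="safe"))
print(filter_by_facets(ITEMS, length="under 500 miles"))
print(facet_counts(ITEMS, "safety"))
```

Unlike free-form tags, the facets here are fixed in advance, so two people filtering on “safety” are at least using the same dimension even if they care about different values.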
For companies and Learning and Development organizations, content is a key asset. The inability to classify it in a way that allows re-use, and the inability to assemble old content with the new, presents an amazing cost to all of us. It also limits our ability to make content portable and context driven. The problem is compounded when we use more unstructured knowledge through Web 2.0 artifacts such as blogs.