Monday, June 1, 2009

Taxonomies and their Cousins are Important

Understanding several models of organizing information will help in developing classification schemes.
David Shaw © 2009

It’s cruel when you discover a taxonomy for a wiki isn’t working to your satisfaction. Generally it’s easy to develop categories for a small project, but this one was for the deconstruction of a boat. Like many other large real-life objects (MRIs, cars, airplanes and even IT projects), boats cannot be categorized by decimal library systems.

In this case I started with the numerical taxonomy defined by S1000D, an international standard for airplanes, boats and vehicles. I adapted it based on experience from a project at the Coastguard, then three experts reviewed it and gave a thumbs-up.

The taxonomy has over 400 starter nodes. Around the creation of node 200 I began to feel “This isn’t right. It’s not how people think.”

This was a body blow. MediaWiki, the base of this project, doesn’t allow renaming categories in any easy way. The tempting response was a displacement activity like raiding the fridge. Instead, I wrote this post to help you understand why taxonomies are important.

The Problem

S1000D uses a numerical classification scheme. The idea is that if you know the reference for engine repairs, then it doesn’t matter which aircraft is on the ramp for repairs because the information in its manual will be under the standard reference number. Even if it’s a car on the ramp for repairs, its engine information will be under the same reference number as in the aircraft manual.

Many taxonomies, library systems, record-management (RM) systems and even accounting systems use numerical taxonomies. Shown below is an extract from the standard Government of Canada scheme.
[Figure: Numerical classifications are best left to witch doctors]

The advantage of numerical schemes is that they allow a simple standard breakdown of information. For example, general information about taxes is 1008.100 and likewise, general information about marketing is 5006.100.
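To make the breakdown concrete, here is a minimal sketch of how such a code might be decoded. Only the two codes quoted above come from the text; the lookup tables and the `describe` function are hypothetical, and a real scheme would have far more entries.

```python
# Hypothetical lookup tables: only 1008.100 and 5006.100 are taken from the text.
SUBJECTS = {"1008": "Taxes", "5006": "Marketing"}
SUBTOPICS = {"100": "General"}


def describe(code: str) -> str:
    """Translate a code such as '1008.100' into a readable label."""
    subject, _, subtopic = code.partition(".")
    return f"{SUBJECTS[subject]} / {SUBTOPICS[subtopic]}"


print(describe("1008.100"))  # Taxes / General
print(describe("5006.100"))  # Marketing / General
```

The same subtopic number means the same thing under every subject, which is the appeal of the scheme. Without the lookup tables, though, the code tells a reader nothing.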

But the problem with numerical taxonomies is that they are inaccessible to mere mortals. It requires a trained RM witch doctor to tell you how to classify a document, and then how to find it a year later. The taxonomy becomes a black hole, probably of less use than the jumble on your hard drive.

Knowing this I converted the S1000D numerical model to a natural-language model. In other words, plain English. But one of the problems encountered was this:

 Boat Type
Boat Type.Sailboat
Boat Type.Sailboat.Sloop
Boat Type.Sailboat.Yawl
Boat Type.Trawler


This was logical in a numerical topic.subtopic.subtopic scheme but in plain language it was better as:
 Boat Type
    Sailboat
        Sloop
        Yawl
    Trawler


Also,

    Maintenance
Maintenance.Suppliers
Maintenance.Suppliers.Equipment


Was better as an entirely different construct:

 Maintenance
    Suppliers
    Equipment Suppliers


And

 Operations
Operations.Harbour
Operations.Mooring


Was better as:

 Operations
    Harbour Operations
    Mooring Operations


As you can see, a numerical model that allows an expert to know the topic and level of subtopic doesn’t translate easily to plain English. The danger with this revision is that I might encounter namespace clashes that I wouldn’t have with the original strongly typed names.
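The clash risk can be checked mechanically. The sketch below groups fully qualified names by their last segment and reports any short name that would be shared once the prefixes are dropped. The `Operations.Equipment` entry is a hypothetical node added only to force a clash, and `find_clashes` is an illustrative name, not part of S1000D or MediaWiki.

```python
from collections import defaultdict

QUALIFIED = [
    "Boat Type.Sailboat.Sloop",
    "Boat Type.Sailboat.Yawl",
    "Boat Type.Trawler",
    "Maintenance.Suppliers.Equipment",
    "Operations.Equipment",  # hypothetical node, added to force a clash
]


def find_clashes(qualified_names):
    """Return short names that more than one qualified name would collapse to."""
    by_leaf = defaultdict(list)
    for name in qualified_names:
        leaf = name.rsplit(".", 1)[-1]  # keep only the last segment
        by_leaf[leaf].append(name)
    return {leaf: names for leaf, names in by_leaf.items() if len(names) > 1}


print(find_clashes(QUALIFIED))
# {'Equipment': ['Maintenance.Suppliers.Equipment', 'Operations.Equipment']}
```

Renaming the leaf to something like Equipment Suppliers, as in the revision above, is one way to keep the plain-language name unique.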

So why bother with this? Why not just put it into a wiki? Well, it’s not that simple. Even wikis need some kind of classification framework.

Let’s review some of the basics.

Namespaces

Taxonomies are associated with namespaces. If you read the technical definition of namespaces, your head will probably start to hurt. Here’s a simple one. A table in a written report is in the document namespace. A wooden table in your kitchen is in the furniture namespace. So, a namespace identifies the domain or context for a given vocabulary or set of terms.

Many namespaces are formalized. If you look at the source of an HTML page and see dc:publisher in the header, it simply means that the term publisher as used there is the one defined by the folks who developed the Dublin Core (dc) set of definitions.

Similarly, in MediaWiki a page name that starts with Category: denotes that the page is in the category namespace.
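Both examples follow the same prefix:term pattern, which a few lines of code can illustrate. This is a naive sketch: `split_namespace` is a hypothetical helper, and it treats anything before the first colon as a namespace, whereas real systems only recognize prefixes they have registered.

```python
def split_namespace(name: str, default: str = "") -> tuple[str, str]:
    """Split 'prefix:term' into (namespace, term); no colon means the default namespace."""
    prefix, sep, term = name.partition(":")
    if not sep:
        return default, name
    return prefix, term


print(split_namespace("dc:publisher"))      # ('dc', 'publisher')
print(split_namespace("Category:Recipes"))  # ('Category', 'Recipes')
print(split_namespace("Coq au Vin"))        # ('', 'Coq au Vin')
```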

Without going into deep detail, you will recognize that this is similar to the example below of an office document having multiple relationships.

But there are other possible knowledge representations.

Wheel of Wheels

Years ago on a project with thousands of nodes we experimented with a wheel of wheels. The goal was to let users navigate while always knowing what content was to the left and right of them, and also up and down.
[Figure: Wheel of wheels for navigating nodespace]

In the example in the figure, a user would start at 1.0. She could then navigate from 1.0 to either 2.0 or 3.0. If she went to 2.0, she could navigate to 2.2 through one of three paths. The node at 2.2 is also part of a sub-wheel with its own paths of navigation.

This example wheel represents the simple taxonomy shown on the left of the figure below. The right side of the same figure shows all the possible navigational paths for one section. You could also adapt this to a navigational scheme for a set of web pages, using nested drop-down menus and hyperlinks for both the inter-links and back-links. But this would give you just one organizational scheme for the content, and people often come at content with a different context in mind, i.e., a different mindset or facet.
[Figure: Wheel of wheels in a more familiar form (left) and its navigational scheme (right)]

Taxonomies

Taxonomies are trees or hierarchies of classification, much like filing cabinets. We’ve all been brought up to understand filing cabinets or their modern equivalent: the Windows folder structure. In a taxonomy, or in your Windows folders, there’s only one place to put a file. Supposedly.
[Figure: Just tell me where to put my file...]

Back in the real world, a year has gone by, you now have 12,000 files on your hard drive, you have a new task, and you remember a document that would help you. But you can’t remember the thought process that caused you to file it…where? You can’t even remember the file name or the title. And you didn’t put any keywords in the properties because you didn’t know you might need it in the future, or what the context would be. So you try Windows advanced search, but you can’t remember words specific enough for Windows’ brain-dead search, and it returns 1,200 hits, including spreadsheets.

More than 50 hits is too many for you to process. Fewer than 20 means you might have missed the file you’re looking for.

Then you learn some simple tricks. Your organization doesn’t have a user-friendly document-management system, or if it does, you don’t have permission to set up categories. So you start using MS-DOS-type file names to denote versions of a document: name_v1_2009-05-12.doc. You even create an archive folder to simplify housekeeping.

You make sure you email copies to co-workers, so you have backups. You start to put copies of the file into different folders because it has more than one context, and you want to be able to find it again in the distant future. Then somebody asks, “Who sent a copy to the customer, and does anyone know what version it was?”

That’s because your document really has relationships similar to this one:
[Figure: Documents usually have many relationships]

Relational Models

Hierarchical databases soon ran into the same sorts of classification problems, and so the relational database model was invented. Let’s take a simple example: a recipe for Coq au Vin. When you first started collecting recipes, it was good enough to put this chicken recipe on an index card or copy-paste it into an OpenOffice or Word document.

As your collection of recipes grew, you started thinking about different ways of classifying them. One way would be to put them into a relational database that had a few classification tables such as shown here:
[Figure: A relational database uses linked tables to establish relationships]
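As a rough illustration of linked tables, the sketch below builds a tiny in-memory SQLite database with Python's standard sqlite3 module. The table and column names are assumptions made for this example, not a reconstruction of the original figure; the link table `recipe_category` is what lets one recipe carry several classifications.

```python
import sqlite3

# Hypothetical schema: one table of recipes, one of categories, one linking them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recipe (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, label TEXT NOT NULL);
CREATE TABLE recipe_category (
    recipe_id INTEGER REFERENCES recipe(id),
    category_id INTEGER REFERENCES category(id),
    PRIMARY KEY (recipe_id, category_id)
);
INSERT INTO recipe VALUES (1, 'Coq au Vin');
INSERT INTO category VALUES (1, 'Cuisine', 'French');
INSERT INTO category VALUES (2, 'Food Type', 'Chicken');
INSERT INTO category VALUES (3, 'Meal Type', 'Dinner');
INSERT INTO recipe_category VALUES (1, 1);
INSERT INTO recipe_category VALUES (1, 2);
INSERT INTO recipe_category VALUES (1, 3);
""")


def recipes_in(label):
    """Return the names of all recipes linked to the category with this label."""
    rows = conn.execute(
        """SELECT r.name FROM recipe r
           JOIN recipe_category rc ON rc.recipe_id = r.id
           JOIN category c ON c.id = rc.category_id
           WHERE c.label = ? ORDER BY r.name""",
        (label,),
    )
    return [name for (name,) in rows]


print(recipes_in("French"))   # ['Coq au Vin']
print(recipes_in("Chicken"))  # ['Coq au Vin']
```

The recipe is stored once, yet it can be reached through any of its categories, which is exactly what a single filing-cabinet location cannot offer.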

Faceted Taxonomies

As folks began to realize that information has many facets, web designers developed faceted taxonomies. These have become very common.

Below is an example of a web site that categorizes recipes by Meal Type, Food Type (meat, vegetable, etc.) and Cuisine. Each of these categories is a facet. In the case of the Meal Type, it has been exposed at its second level, to reduce mouse clicks and simplify navigation. Food Type and Cuisine could be expandable menus.
[Figure: Faceted taxonomies provide several navigational entry points]

Thus, Coq au Vin could be found as:
  • Its name in the index.
  • A dinner meal.
  • A chicken recipe.
  • French cuisine.
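The entry points listed above amount to filtering on one or more facets. Here is a minimal sketch, assuming each recipe is a plain dictionary keyed by facet name. The second recipe and the `filter_by_facets` helper are hypothetical, added only so the filter has something to exclude.

```python
RECIPES = [
    {"name": "Coq au Vin", "Meal Type": "Dinner", "Food Type": "Chicken", "Cuisine": "French"},
    # Hypothetical second entry, included only for contrast.
    {"name": "Pancakes", "Meal Type": "Breakfast", "Food Type": "Grain", "Cuisine": "American"},
]


def filter_by_facets(recipes, selected):
    """Return names of recipes matching every selected facet value."""
    return [
        recipe["name"]
        for recipe in recipes
        if all(recipe.get(facet) == value for facet, value in selected.items())
    ]


print(filter_by_facets(RECIPES, {"Cuisine": "French"}))  # ['Coq au Vin']
print(filter_by_facets(RECIPES, {"Meal Type": "Dinner", "Food Type": "Chicken"}))  # ['Coq au Vin']
```

Facets combine freely and in any order, so a visitor can start from whichever one matches the question in their head.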

Network (Wiki) Models

Wikis use a network model of organization. A network has no discernible root, although you might nominate one or more nodes to serve as facets or entry points. The figure below shows our Coq au Vin recipe. Note that it is the only topic in our network. All the other nodes are either categories or subcategories.
[Figure: A network has no root]

Basically, this network says Coq au Vin is:
  • A recipe
  • French cuisine
  • A chicken dish
  • Suitable for a dinner meal
In MediaWiki or Wikka, as an example, at the bottom of the page we would put:

[[Category:Recipes]]
[[Category:FrenchCuisine]]
[[Category:Chicken]]
[[Category:Dinner]]

Also, Category:Dinner could readily be subdivided into Category:Appetizer, Category:Entrée and Category:Dessert.
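To show how these tags turn into a network, the sketch below scans page text for category tags and builds an index from each category to the pages that carry it. The `PAGES` dictionary and `build_category_index` are hypothetical stand-ins for wiki content; this is not how MediaWiki itself stores categories. The subcategory is modelled the MediaWiki way, by tagging the category page with its parent category.

```python
import re
from collections import defaultdict

# Matches [[Category:Name]] and captures Name (stops at ']' or a '|' sort key).
CATEGORY_TAG = re.compile(r"\[\[Category:([^\]|]+)")

# Hypothetical page store: title -> wikitext.
PAGES = {
    "Coq au Vin": (
        "A classic French chicken dish.\n"
        "[[Category:Recipes]]\n[[Category:FrenchCuisine]]\n"
        "[[Category:Chicken]]\n[[Category:Dinner]]"
    ),
    "Category:Entrée": "[[Category:Dinner]]",  # a subcategory of Dinner
}


def build_category_index(pages):
    """Map each category name to the titles of the pages tagged with it."""
    index = defaultdict(list)
    for title, wikitext in pages.items():
        for category in CATEGORY_TAG.findall(wikitext):
            index[category.strip()].append(title)
    return dict(index)


index = build_category_index(PAGES)
print(sorted(index))    # ['Chicken', 'Dinner', 'FrenchCuisine', 'Recipes']
print(index["Dinner"])  # ['Coq au Vin', 'Category:Entrée']
```

No node in the index is a root. Any category can serve as the entry point, and a page can belong to as many of them as it needs.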

Folksonomies & Ontologies

Are best left for another day....
