Fuzzy Text Classification

Failing to locate a piece of information that you know is out there somewhere can be extremely frustrating, particularly if you know you’ve seen it before. Many knowledge managers would be millionaires if they got 10 pounds every time they said "It’s like trying to find a needle in a haystack!"

Text classification systems can ameliorate this problem. Much useful information is in the form of text, ranging from emails, web pages, newspaper articles and market research reports through to CVs, complaint letters from customers, and internally generated reports. So the way forward is to harness techniques for classifying text so that these items can be retrieved via keyword search when required. At least that is the goal. As anyone who has tried web search engines knows, the basic keyword classification on which they are based brings quite mixed results. (See Box 1 for a brief review of basic keyword classification.)

For a more intelligent approach to classifying an item of text, the topic of the text has to be determined. This of course can be very difficult. If we know a newspaper article is about "oil", is it always OK to assume that it is about "petroleum"? No, there are plenty of exceptions, such as when the article is about "cooking oil". Lexicons and thesauri can be part of a better solution, but they are normally just lists of words without any indication of how words work together to indicate topics. One answer is to use rule-based techniques such as those handling fuzzy sets. (See Box 2 for a quick explanation of fuzzy sets and see Box 3 for an illustration of how they can be used in fuzzy rules for text classification.)

An early advocate of rule-based systems for text classification was Reuters. In offering their subscribers text databases that are indexed for easy retrieval of relevant items, they wanted to address the problem that indexing news reports by hand was found to be an expensive, slow and labour-intensive activity. Consistent accuracy was difficult to obtain with human indexers, and the work tended to cause high staff turnover. With these issues in mind, Carnegie Group, based in Pittsburgh, worked with Reuters to develop the Construe system, an automated news story categorization system based on a fuzzy rule-based text categorization shell developed by Carnegie Group.

The Construe system classifies news reports by labelling each report with one or more categories from a set of over 600 defined categories. These categories include economic topics (such as mergers and acquisitions, corporate earnings, interest rates, various commodities and various currencies) and proper names (such as people, countries, international organizations and stock exchanges).

Because the Construe system uses rules that define concepts, the text classification can be more accurate than basic keyword classification. And whilst it involved a significant investment over a period of time, it does indicate how fuzzy text classification can be useful.

To illustrate, many countries use a currency called "dollar", and a currency is not always specified by its full name in a story. In a sentence like "Australia announced today that it would increase overseas development spending by 100 million dollars next year", we infer that the Australian dollar is under consideration rather than the US dollar or Singapore dollar. In Construe, there is a rule that says to assign the Australian dollar category if there is no specific evidence that the US dollar or Singapore dollar is intended. Evidence that the US dollar was intended might be that the article was published in a US publication, or that the Australian government normally discusses overseas development spending in US dollars.

One company that has developed fuzzy text classification for more general usage is Verity with the Topic software system. Verity has aimed the Topic software not at occasional users of an online system but rather at regular users who have a continuous or frequent need to identify reports on a particular subject.

To get optimal performance, the user needs to invest some effort in identifying the subjects that describe the reports of interest. This may involve listing keywords, combinations of keywords that identify useful intermediate concepts, and weightings on these.

A major early customer of Verity was the Pentagon. They deployed a system distributed around 50 sites worldwide that took input from the US Department of Defense messaging system and news stories from commercial wire services, and using Topic’s profiling of users’ needs, fed messages to intelligence analysts. With thousands of these analysts, some of whom have to handle up to 500 messages an hour, the overall system was extensive.

Verity was founded in California in 1988, and they opened a London office shortly afterwards. They have many of the Fortune 500 and FT100 as customers, including suppliers and users of online news resources, though their customer base is by no means restricted to handling news. They have applications in areas as diverse as customer support, litigation support, and processing CVs.

Warner-Lambert, a major US pharmaceutical company, used the Verity system as the basis of a massive information clearinghouse to help cut the time taken to get their new drugs licensed in markets around the world. Once a drug has been patented, a drug firm wants to move as quickly as possible to get returns on the enormous expenditure in research and development before the patent runs out. In this clearinghouse, the company aims to catalogue everything that is written about their drugs and their competitors’ drugs. The system acts as an intermediary between the information and the staff who need it. For example, they may receive questions from the pharmaceutical licensing authorities in a particular country about the use of a drug in a particular patient class. By having sophisticated text classification, such questions can be answered much more promptly.

For an up-front investment in developing the fuzzy rules, fuzzy text classification can offer some significant advantages, and a number of companies are supplying software that incorporates this technology. When compared with other advanced text classification systems, such as those based on neural networks, a key advantage is that fuzzy rule-based systems can harness users’ knowledge about the domain and can allow users to directly develop rules for text classification. And there are various graphical tools to facilitate this.

So in future, if a knowledge manager thinks a needle is likely to be lost, then with this kind of technology, the investment in the rules can be made, and that needle can more easily be found again!

Anthony Hunter is a lecturer in computer science at University College London and can be contacted on a.hunter@cs.ucl.ac.uk.

Box 1: Basic keyword classification

A simple approach to classifying an item of text is to delete the stop words, which are words with little semantic content such as "the", "because", "and", "of", etc. The remaining words can then be used as index terms to describe the item of text. Consider the news report "EMI and Time-Warner have today agreed to merge their record businesses into a new 50-50 owned joint venture to be based in New York". Removing stop words could give us the following set of index terms: "EMI, Time-Warner, today, agreed, merge, record businesses, new, 50-50, owned, joint venture, based, New York". Clearly, some keywords are more important than others in describing this news report. For example, "EMI", "Time-Warner", "merge", and "record businesses" are more important than "today" and "New York", but a naïve approach does not offer any selection. So if a user is looking for items on New York, unfortunately this news report, which is unlikely to be what the user is looking for, will be retrieved.
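The approach described in this box can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, the function names `index_terms` and `retrieve` are hypothetical, and the text is split into single words, so phrases such as "record businesses" or "New York" are indexed as separate words rather than as units.

```python
import re

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "because", "and", "of", "a", "an", "to", "in",
              "into", "be", "have", "their"}

def index_terms(text):
    """Return the non-stop words of `text`, lower-cased, in order, no repeats."""
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", text)
    terms = []
    for word in words:
        key = word.lower()
        if key not in STOP_WORDS and key not in terms:
            terms.append(key)
    return terms

def retrieve(query, documents):
    """Return every document whose index terms contain all the query words."""
    wanted = index_terms(query)
    return [doc for doc in documents
            if all(term in index_terms(doc) for term in wanted)]

report = ("EMI and Time-Warner have today agreed to merge their record "
          "businesses into a new 50-50 owned joint venture to be based "
          "in New York")

print(index_terms(report))
# The merger story is retrieved for a query about New York,
# even though it is unlikely to be what the user wants.
print(retrieve("New York", [report]))
```

Because every surviving word counts equally, the query "New York" matches the merger story just as strongly as a query for "EMI" would, which is the weakness the box describes.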

Box 2: What are fuzzy sets?

A fuzzy set is just a set with a measure of the degree of membership for each element in the set. The value lies between 0 and 1. The larger the value, the more strongly the element is regarded as a member of the set. So 0 means that the element is not in the set, and 1 means that it is most strongly a member of the set. Consider the set of "Cars" that includes the members "Mini", "Ford", and "Ferrari". Here all the members are "Cars" with degree 1. Now consider the same members but with the set "Fast cars". Here, "Mini" might have a low value like 0.1, "Ford" might have a mid-range value like 0.5, and "Ferrari" might have the top value of 1. Using fuzzy sets allows us to represent this information. Furthermore, it allows for a simple and intuitive generalization of important ideas for manipulating sets, including union, intersection, and complement. Fuzzy sets can also be used to handle some of the uncertainty arising in rule-based systems (see Box 3). The downside with fuzzy sets is that the degree of membership values tend to be quite subjective.
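As a rough sketch, the car example can be written in Python with a fuzzy set represented as a mapping from element to degree of membership. The box does not say how union, intersection and complement are defined; the code assumes the commonly used definitions (maximum, minimum, and one minus the degree). The "Expensive cars" set and its values are invented purely for illustration.

```python
# Membership degrees taken from the example in the text.
cars = {"Mini": 1.0, "Ford": 1.0, "Ferrari": 1.0}
fast_cars = {"Mini": 0.1, "Ford": 0.5, "Ferrari": 1.0}

# Hypothetical second fuzzy set, invented for this illustration only.
expensive_cars = {"Mini": 0.25, "Ford": 0.4, "Ferrari": 1.0}

def fuzzy_union(a, b):
    """Degree in the union: the larger of the two degrees (assumed definition)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def fuzzy_intersection(a, b):
    """Degree in the intersection: the smaller of the two degrees (assumed)."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def fuzzy_complement(a):
    """Degree in the complement: one minus the original degree (assumed)."""
    return {x: 1.0 - degree for x, degree in a.items()}

print(fuzzy_union(fast_cars, expensive_cars))         # fast OR expensive
print(fuzzy_intersection(fast_cars, expensive_cars))  # fast AND expensive
print(fuzzy_complement(fast_cars))                    # NOT fast
```

When every degree is either 0 or 1, these definitions give the same results as ordinary set union, intersection and complement, which is why they count as a generalization.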

Box 3: Using fuzzy rules for text classification

Fuzzy rules can be used for text classification. Such a system allows users to specify topics of interest in terms of a hierarchy of sub-concepts. These sub-concepts can then be implemented as a set of rules. At the lowest level, sub-concepts are defined in terms of the combination of words in items of text.

The general scheme for rules is "IF X THEN Y" or "IF X THEN Y1 BUT IF ALSO Z THEN Y2". In the former case, if X is true for the item of text, then Y is true for the item of text. In the latter case, if X is true and Z is not true, then Y1 is true, but if X and Z are both true, then Y2 is true.

IF the text contains the word "bomb"
THEN the text is about an "explosive device" with degree 0.6
BUT IF ALSO the text contains the words "boxing match"
THEN the topic is about an "explosive device" with degree 0.3

IF the text contains the word "shooting"
THEN the text is about "violent event" with degree 0.7

IF the text is about "violent event" with degree more than 0.6
THEN the text is also about "terrorism" with degree 0.7
BUT IF ALSO the text is about "assassination" with degree more than 0.6
THEN the topic is about "terrorism" with degree 0.9

Here we see sub-concepts including "explosive device", "assassination" and "violent event" can be used to help present rules about whether an item of text is about "terrorism".
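The three example rules can be sketched in Python as follows. This is an illustrative reading of the rules, not how any commercial system implements them. The function name `classify` and its `known` parameter are hypothetical. Word matching is reduced to a simple case-insensitive substring test, so "bombing" also matches "bomb". Since the box gives no rule that establishes "assassination", its degree has to be supplied by the caller in this sketch.

```python
def classify(text, known=None):
    """Apply the three example rules; return a mapping of topic to degree.

    `known` optionally supplies degrees for sub-concepts that none of
    these rules can establish, such as "assassination".
    """
    words = text.lower()
    degrees = dict(known or {})

    # Rule 1: IF "bomb" THEN explosive device 0.6,
    #         BUT IF ALSO "boxing match" THEN explosive device 0.3.
    if "bomb" in words:
        degrees["explosive device"] = 0.3 if "boxing match" in words else 0.6

    # Rule 2: IF "shooting" THEN violent event 0.7.
    if "shooting" in words:
        degrees["violent event"] = 0.7

    # Rule 3: IF violent event > 0.6 THEN terrorism 0.7,
    #         BUT IF ALSO assassination > 0.6 THEN terrorism 0.9.
    if degrees.get("violent event", 0.0) > 0.6:
        if degrees.get("assassination", 0.0) > 0.6:
            degrees["terrorism"] = 0.9
        else:
            degrees["terrorism"] = 0.7

    return degrees

print(classify("A bomb was found near the station"))
print(classify("He landed a bomb of a punch in the boxing match"))
print(classify("Police reported a shooting in the city centre"))
print(classify("Police reported a shooting", known={"assassination": 0.8}))
```

Note that the third rule tests the output of the second rule rather than the words of the text, which is how a hierarchy of sub-concepts builds up from word-level evidence.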