Data mining



Data mining is a technology that has been widely used in organizations over the past few years, with some very impressive results. It is based on software that looks for interesting or important patterns in data (see Box 1: What is data mining?). Whilst much of the recent interest in the technology has been at the level of analysing massive corporate data warehouses using very sophisticated statistical techniques, there is also much potential for proactive knowledge management using data mining software intended for business users.

To illustrate how data mining can be used for discovering knowledge in data, consider a database held by a credit card company. Here the number and type of air tickets bought per year could be compared with other kinds of expenditure. For example, customers who buy first-class tickets might also be particularly likely to buy opera tickets. Once discovered, this is knowledge that the organization could exploit.

There is a classic example, often cited in articles on data mining, of Wal-Mart, a pioneering user of data mining in its US supermarket chain, finding that a significant number of men were coming into its stores in the late afternoon and buying just a packet of nappies and a pack of beer. Once identified, such an interesting pattern is a spur to further investigation to determine why it occurs, and whether it is something that could be exploited. We leave it to your imagination what reasons Wal-Mart found for the pattern in this example and how they exploited it! However, it is a common theme that patterns discovered in data provoke further questions. Often it is not the patterns themselves that are most useful, but the resulting explanations for why the patterns occur.

There are a number of software and consultancy companies offering data mining solutions. These include big-name IT companies, such as IBM and Oracle, which offer data mining as part of their data warehousing and e-business products. They also include more focussed companies such as SPSS, based in Chicago and with offices around the world, which supplies statistical and data mining tools for business, finance and scientific users.

SPSS sells one of the most widely used packages, called Clementine (www.spss.com). Clementine is billed as a rapid analysis environment - "a workbench for people to create models based on data mining". Users range from the power user with a background in statistics through to the business user with minimal IT experience, such as the casual user of spreadsheets. To illustrate, Boxes 3 and 4 give examples of applications based on Clementine at the BBC and at Reuters respectively.

In addition to Clementine, there is a significant number of other specialist products offered by smaller technology companies in Europe and the US. Providers of PC-based solutions that can be set up and used by individuals or small groups include Attar (www.attar.com) and LPA (www.lpa.co.uk), which both offer tools for data mining together with tools for building knowledge-based systems that use the learnt knowledge for further reasoning and analysis; InforSense (www.inforsense.com), which has sophisticated tools with applications in the bioinformatics and pharmaceuticals area; and MineIt (www.mineit.com), whose tools include Easyminer for analysing the usage of a website. An extensive resource guide to data mining, including a listing of numerous specialist software and consultancy companies, can be found at www.kdnuggets.com.

With such tools, knowledge managers have a wealth of data that can be mined for useful patterns. Types of pattern include:

  • Classification (e.g. What are the features that would allow a telephone company to identify an existing customer who is likely to upgrade to an ISDN line?)
  • Association (e.g. What books are bought together in the same order by shoppers at an online bookstore?)
  • Sequence (e.g. What sequence of actions by a customer services department is likely to lead to a customer being satisfied after making a complaint?)
  • Clustering (e.g. How can you group different types of shopper at an online grocery store - maybe busy families, party organizers, wine buffs, healthy eaters, and so on? A simple sketch follows this list.)
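
To give a flavour of the last of these, clustering, here is a minimal sketch, not taken from any of the products mentioned above, that groups a handful of shoppers by their spending habits using the k-means algorithm from the freely available scikit-learn library for Python. The features and figures are invented purely for illustration.

    from sklearn.cluster import KMeans

    # Each row is one shopper: [weekly spend, share of spend on wine, share on snacks]
    # (made-up figures, purely for illustration)
    shoppers = [
        [120.0, 0.05, 0.30],   # looks like a busy family: big basket, little wine
        [140.0, 0.02, 0.35],
        [ 60.0, 0.40, 0.40],   # looks like a party organizer: lots of wine and snacks
        [ 55.0, 0.45, 0.35],
        [ 45.0, 0.30, 0.05],   # looks like a wine buff: wine but few snacks
        [ 50.0, 0.35, 0.02],
    ]

    # Ask for three clusters; shoppers with similar habits get the same label
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(shoppers)
    print(labels)

In practice the interesting work begins once the groups are found: as with the Wal-Mart example above, the clusters prompt the question of who these shoppers are and how to serve them.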

Different types of data and different types of pattern call for different tools. Classification is often a useful type of pattern to look for, and there are a number of relatively straightforward tools, based on learning decision trees (see Box 2), that can be used by relatively inexperienced users. This is raising interesting opportunities for knowledge managers.

The value of knowledge management is clearly growing, and in the process it can become proactive. By this we mean that knowledge management can do much more than collate and retrieve knowledge: there are tools and methods that give knowledge management the facilities to discover, organize, check and analyse knowledge. Data mining is a key example. By being proactive, knowledge management can offer better quality knowledge, and can add value to knowledge.

Consider a large pharmaceutical company that recruits a large number of new graduates and PhDs each year. To get these new recruits working effectively as soon as possible, they need access to the company's knowledge resources. A good way of directing them to the most appropriate resources is to identify patterns in how previous cohorts of recruits, and more experienced members of staff, have accessed those resources. For example, if new PhDs working in chemical synthesis research find a particular set of webpages on patenting procedures particularly useful in their first month at the company, then this is a pattern that could be discovered by data mining. Here, data mining helps knowledge managers to improve the quality of their services.

Once discovered, the new knowledge might seem very straightforward and intuitive. But the data mining software does the hard work of ploughing through all the data, rejecting many other possible patterns because they are not sufficiently reliable or because they are too complicated to be useful.

However, knowledge management does not have to be restricted to mining data about the users of knowledge management services. Putting data mining into the remit of knowledge management could be a more general and deeper move. Knowledge discovered from data warehouses can often be of significant interest to a number of different groups across an organization, and knowledge managers are very much concerned with the task of distributing knowledge across an organization.



    Anthony Hunter is a lecturer in computer science at University College London. He can be contacted at: a.hunter@cs.ucl.ac.uk


    Box 1: What is data mining?

    Data mining involves analysing data with the aim of identifying interesting patterns or knowledge. Often, it is not known in advance what the patterns would be, though usually there is some idea of the kinds of relationship that would be interesting. To illustrate, consider the following simple customer database.

    Name      Salary    Sex       Age   Buy widget
    Bloggs    15000     male      19    No
    Jones     25000     male      33    Yes
    Smit      23000     female    50    No
    Smit      16000     male      40    No
    Smit      200       male      10    No
    Patel     30000     female    30    No
    Steel     25000     male      23    Yes
    Higgs     18000     female    55    No
    Puggs     50000     male      57    Yes
    Puggs     51000     female    57    No

    This table contains 10 rows where each row is a customer. The table has four attributes - Salary, Sex, Age, and Buy widget. So a simple question to ask of data mining would be "can we find a pattern using some combination of the attributes Salary, Sex, and Age, to predict whether any future customer might buy a widget?" The answer is yes, and we give a pattern in the form of a decision tree in Box 2.

    Box 2: What is a decision tree?

    A decision tree is a simple way of representing a classification scheme. In particular, it can be a useful way of presenting a pattern identified in data. Below we give a simple example that has been derived from the database given in the previous box. We can use the tree on new customers to predict whether they will buy a widget. So we assume that we have data on the Salary, Sex and Age of our new customer, and then we use this with the tree to classify the customer as either "yes" or "no" for buying a widget.

    We start at the root of the tree, in this case Salary, and find the value for our customer. If it is less than or equal to 23000, we go down the left branch, and in this case classify the customer as "no". If it is greater than 23000, we go down the right branch, and arrive at the node Sex. If the value for Sex for our new customer is female, then we go down the left branch to the classification "no", otherwise we go down the right branch to the classification "yes".

    [Decision tree derived from the database in Box 1]

                          Salary
                         /      \
                <= 23000          > 23000
                    |                |
                   "no"             Sex
                                  /     \
                            female       male
                               |           |
                             "no"        "yes"

    This decision tree was easy to construct from the database. However, in general a database may have many times more examples (thousands, tens of thousands, or even more). This is where efficient software comes in to do the number crunching. One of the goals of a good data mining tool is to find relatively simple patterns. If a pattern is too complex, then it is difficult for users to absorb, understand and verify.
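
    As an illustration of how little code such learning can take, here is a minimal sketch that learns a decision tree from the Box 1 data using the freely available scikit-learn library for Python. This is a stand-in for the commercial tools discussed in the article, not one of them, and the exact splits found may differ slightly from the tree shown above depending on the algorithm's settings.

        from sklearn.tree import DecisionTreeClassifier, export_text

        # The ten customers from Box 1: [Salary, Sex (0 = female, 1 = male), Age]
        X = [
            [15000, 1, 19], [25000, 1, 33], [23000, 0, 50], [16000, 1, 40],
            [  200, 1, 10], [30000, 0, 30], [25000, 1, 23], [18000, 0, 55],
            [50000, 1, 57], [51000, 0, 57],
        ]
        # Whether each customer bought a widget
        y = ["No", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No"]

        tree = DecisionTreeClassifier().fit(X, y)   # learn a classification tree

        # Show the learned rules; expect splits on Salary and then Sex, as in the tree above
        print(export_text(tree, feature_names=["Salary", "Sex", "Age"]))

        # Classify a hypothetical new customer: salary 28000, male, aged 35
        print(tree.predict([[28000, 1, 35]]))       # expected prediction: "Yes"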


    Box 3: Predicting audience share at the BBC

    The BBC were an early user of data mining for predicting audience share. The world of TV is going through some dramatic changes, not least of which is the massive increase in competition, both from new channels and from other media. Programme schedulers must be able to predict the likely audience for a proposed programme and the optimum time to show it.

    BBC planners base their decisions on information about the audience share achieved by previous similar programmes. Using Clementine, the BBC were able to apply data mining to their historical viewing data. This data included the times of programmes, genre, star presenter, the preceding and following programmes on the channel and on other channels, the weather, time of year, major public and sporting events, and so on. Learning from this data, the system was able to predict audience share with a similar accuracy to trained BBC staff.

    However, all learnt knowledge needs to be treated with caution. It needs to be interpreted with respect to wider knowledge of the organization and its business. For example, one of the rules identified in this work was: "Any programme which follows a UK soap opera will achieve six percent less share than if it was put on at any other time." It is tempting to infer from this that UK soaps are so bad that they cause the audience to switch to other channels and then not turn back at the end of the soap. However, it seems that BBC staff knew that UK soap figures are dominated by the BBC programme East Enders, and that ITV targeted this audience by putting on one of its viewers' favourite programmes, The Bill, immediately after East Enders. So this is an example where data mining found an interesting pattern, but not the explanation.



    Box 4: Improving data quality at Reuters

    This is an example of how data mining can improve the quality of information being made available. Reuters provides on-line price data for a variety of financial instruments. Some data is contributed in real time by customers, and may contain errors. Reuters' International Data Quality Group have been exploring a data mining approach to this problem, focussing on error detection in contributed foreign exchange data. Some erroneous prices are easily detected by their deviation from the prevailing price level. Other errors are more subtle, and can only be detected by their deviation from normal patterns of price movements.

    The approach chosen was to build models to make rough predictions of prices on the basis of a summary of recent price movements. Incoming prices could then be flagged as suspicious if they deviated widely from this prediction. Clementine was used to produce an error detection system, based on a combination of a neural network and automatically generated rules; this successfully detected many errors which had not been spotted by conventional means.
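
    To give a flavour of the idea, here is a simplified sketch in Python. It is not Reuters' actual system, which combined a neural network with automatically generated rules; a simple moving average stands in for the predictive model, and the two percent tolerance is an arbitrary choice for illustration.

        def is_suspicious(recent_prices, new_price, tolerance=0.02):
            """Flag new_price if it deviates from a prediction based on
            recent prices by more than the tolerance (an arbitrary 2%)."""
            predicted = sum(recent_prices) / len(recent_prices)   # stand-in for a learned model
            return abs(new_price - predicted) / predicted > tolerance

        # Made-up contributed quotes for a currency pair
        history = [1.5012, 1.5015, 1.5010, 1.5018, 1.5014]
        print(is_suspicious(history, 1.5016))   # False: close to the prevailing level
        print(is_suspicious(history, 1.6500))   # True: deviates widely, flag for checking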

    The data mining approach has the advantage that error detection is based on knowledge derived automatically from easily available data, rather than laboriously collected from experts. In addition, changing circumstances can rapidly be accommodated by re-training using up-to-date data.