White Paper: A Characterization of Data Mining Technologies and Processes
by Information Discovery, Inc.
Source: http://www.dmreview.com/portals/portal.cfm?topicId=230001
Data Mining Processes


Traditionally, there have been two types of statistical analyses: confirmatory analysis and exploratory analysis. 
In confirmatory analysis, one has a hypothesis and either confirms or refutes it. However, the bottleneck for confirmatory 
analysis is the shortage of hypotheses on the part of the analyst. In exploratory analysis (Tukey, 1973), one finds
suitable hypotheses to confirm or refute. Here the system takes the initiative in data analysis, not the user.
The concept of initiative also applies to multidimensional spaces. In a simple OLAP access system, the user may have to 
think of a hypothesis and generate a graph. But in OLAP data mining, the system thinks of the questions by itself (Parsaye,
 1997). I use the term data mining to refer to the automated process of data analysis in which the system takes the 
initiative to generate patterns by itself. 
From a process oriented view, there are three classes of data mining activity: discovery, predictive modeling and forensic 
analysis.


Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about
 what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are,
 without the user thinking of the relevant questions first. In large databases, there are so many patterns that the user 
can never practically think of the right questions to ask. The key issues here are the richness of the patterns that can be expressed and discovered and the quality of the information delivered; together these determine the power and usefulness of the discovery technique.
As a simple example of discovery with system initiative, suppose we have a demographic database of the US. The user may 
take the initiative to ask a question from the database, such as 'what is the average age of bankers?' The system may then 
print 47 as the average age. The user may then ask the system to take the initiative and find something interesting about 
age by itself. The system will then act as a human analyst would. It will look at some data characteristics, distributions,
 etc. and try to find some data densities that might be away from ordinary. In this case the system may print the rule: 
'IF Profession=Athlete THEN Age < 30, with a 71% confidence.' This rule means that if we pick 100 athletes from the 
database, 71 of them are likely to be under 30. The system may also print: 'IF Profession=Athlete THEN Age < 60, with a 97%
 confidence.' This rule means that if we pick 100 athletes from the database, 97 of them are likely to be under 60. This 
delivers information to the user by distilling pattern from data.
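To make the notion of confidence concrete, the calculation behind such a rule can be sketched in a few lines of Python (an illustration only; the records, field names and the confidence helper below are invented):

# Minimal sketch: estimating the confidence of a discovered rule
# from a list of records. The data is invented for illustration.
records = [
    {"name": "A", "profession": "Athlete", "age": 24},
    {"name": "B", "profession": "Athlete", "age": 28},
    {"name": "C", "profession": "Athlete", "age": 41},
    {"name": "D", "profession": "Banker",  "age": 52},
]

def confidence(rows, condition, conclusion):
    # Share of rows satisfying the condition that also satisfy the conclusion.
    matching = [r for r in rows if condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if conclusion(r)) / len(matching)

# IF Profession=Athlete THEN Age < 30
conf = confidence(records,
                  lambda r: r["profession"] == "Athlete",
                  lambda r: r["age"] < 30)
print(f"Confidence: {conf:.0%}")   # 67% on this toy data
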
In predictive modeling, patterns discovered from the database are used to predict the future. Predictive modeling thus
allows the user to submit records with some unknown field values, and the system will guess the unknown values based on 
previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the 
patterns to guess values for new data items.
To use the example above, once we know that athletes are usually under 30, we can guess someone's age if we know that they 
are an athlete. For instance, if we are shown a record for John Smith whose profession is athlete, then by applying the rules we found above we can be over 70% sure that he is under 30 years old, and almost certain that he is under 60. Note
that discovery helps us find general knowledge, but prediction just guesses the value for the age of a specific individual.
 Also note that in this case the prediction is 'transparent' (i.e., we know why we guess the age as under 30). In some 
systems the age is guessed, but the reason for the guess is not provided, making the system opaque.
Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first establish the norm, then detect those items that deviate from it by more than a given
threshold. Again, to use the example above, once we notice that 97% of athletes are under 60, we can wonder about the 3% 
who are over 60 and still listed as athletes. These are unusual, but we still do not know why. They may be unusually 
healthy or play sports where age is less important (e.g., golf) or the database may contain errors, etc. Note that discovery
 helps us find usual knowledge, but forensic analysis looks for unusual and specific cases.
Each of these processes can be further classified. There are several types of pattern discovery such as If/Then rules, 
associations, etc. While the rules discussed above have an IF-THEN nature, association rules refer to item groupings (e.g., when someone buys one product at a store, they may buy another product at the same time -- a process usually called market basket analysis). The power of a discovery system is measured by the types and generality of the patterns it can
find and express in a suitable language.
Data Mining Users and Activities
It is necessary to distinguish the data mining processes discussed above from the data mining activities in which the 
processes may be performed, and the users who perform them. First, the users. Data mining activities are usually performed 
by three different classes of users: executives, end users and analysts. 
Executives need top-level insights and spend far less time with computers than the other groups -- their attention span is
 usually less than 30 minutes. They may want information beyond what is available in their executive information system 
(EIS). Executives are usually assisted by end users and analysts. 
End users know how to use a spreadsheet, but they do not program -- they can spend several hours a day with computers. 
Examples of end users are sales people, market researchers, scientists, engineers, physicians, etc. At times, managers 
assume the role of both executive and end user. 
Analysts know how to interpret data and do occasional computing but are not programmers. They may be financial analysts, 
statisticians, consultants, or database designers. Analysts usually know some statistics and SQL.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and 
continuous data mining. 
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to 
understand this data set, or use it for prediction on new marketing campaigns. Episodic mining is usually performed by 
analysts.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of 
specific measures such as profitability. Hence, a strategic mining exercise may look to answer questions such as: where 
do our profits come from? or how do our customer segments and product usage patterns relate to each other? 
In continuous mining we try to understand how the world has changed within a given time period and try to gain an 
understanding of the factors that influence change. For instance, we may ask: how have sales patterns changed this month? 
or what were the changing sources of customer attrition last quarter? Obviously continuous mining is an on-going activity 
and usually takes place once strategic mining has been performed to provide a first understanding of the issues.
Continuous and strategic mining are often directed towards executives and managers, although analysts may help them here. 
As we shall see later, different technologies are best suited to each of these types of data mining activity.
The Technology Tree
The top level dichotomization of the data mining technologies can be based on the retention of data; that is, do we still 
keep or need the data after we have mined it? In most cases, we do not. However, in some early approaches much of the data set
was still maintained for future pattern matching. Obviously, these retention-based techniques only apply to the tasks of 
predictive modeling and forensic analysis, and not knowledge discovery since they do not distill any patterns.
As one would expect, approaches based on data retention quickly run into problems because of large data sets. However, in 
some cases predictive results can be obtained with these techniques and for the sake of completeness I briefly review them 
in the next section.

As shown in Figure 2, approaches based on pattern distillation fall into three categories: logical, cross-tabulational and
 equational. I will review each of these and their sub-branches separately. Each leaf of the tree in Figure 2 shows a 
distinct method of implementing a system based on a technique (e.g., several types of decision tree algorithms).
Not all approaches based on pattern distillation provide knowledge, since the patterns may be distilled into an opaque 
language or formalism not easily readable by humans such as very complex equations. Hence, some of these approaches produce 
transparent and understandable patterns of knowledge, others just produce patterns used for opaque prediction.
Data Retention
While in pattern distillation we analyze data, extract patterns and then leave the data behind, in the retention approaches 
the data is kept for pattern matching. When new data items are presented, they are matched against the previous data set.
A well known example of an approach based on data retention is the nearest neighbor method. Here, a data set is kept 
(usually in memory) for comparison with new data items. When a new record is presented for prediction, the distance between 
it and similar records in the data set is found, and the most similar (or nearest neighbors) are identified.
For instance, given a prospective customer for banking services, the attributes of the prospect are compared with all 
existing bank customers (e.g., the age and income of the prospect are compared with the age and income of existing customers).
 Then a set of closest neighbors for the prospect are selected (based on closest income, age, etc.). 
The term K-nearest neighbor is used to mean that we select the top K (e.g., top 10) neighbors for the prospect, as in
Figure 3. Next, a closer comparison is performed to select which new product is most suited to the prospect, based on the 
products used by the top K (e.g., top 10) neighbors. 
Figure 3 is currently unavailable.
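As an illustration of the idea only (the customer base, the numbers and the income scaling factor are all invented), a K-nearest neighbor lookup of this kind can be sketched as follows:

import math

# Toy customer base: (age, income, product used).
customers = [
    (25, 30000, "checking"),
    (31, 42000, "savings"),
    (45, 80000, "brokerage"),
    (38, 52000, "savings"),
    (52, 95000, "brokerage"),
]

def k_nearest(prospect, base, k=3):
    # Return the k customers closest to the prospect in (age, income) space.
    def distance(c):
        # Divide income by 1000 so it does not dominate the distance.
        return math.hypot(prospect[0] - c[0], (prospect[1] - c[1]) / 1000.0)
    return sorted(base, key=distance)[:k]

prospect = (35, 48000)
neighbors = k_nearest(prospect, customers, k=3)

# Suggest the product most common among the nearest neighbors.
products = [c[2] for c in neighbors]
print("Suggested product:", max(set(products), key=products.count))
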
Of course, it is quite expensive to keep all the data, and hence sometimes just a set of typical cases is retained. We may 
select a set of 100 typical customers as the basis for comparison. This is often called case-based reasoning. 
Obviously, the key problem here is that of selecting the typical customers as cases. If we do not really understand the
 customers, how can we expect to select the typical cases, and if the customer-base changes, how do we change the typical 
customers?
Another usually fatal problem for these approaches has to do with databases with a large number of non-numeric values (e.g.,
 many supermarket products or car parts). Since distances between these non-numeric values are not easily computed, some 
measure of approximation needs to be used -- and this is often hard to come by. And if there are many non-numeric values,
there will be too many cases to manage.
Pattern Distillation
These technologies extract patterns from a data set, then use the patterns for various purposes. Naturally, the first two 
questions to ask here are: What types of patterns can be extracted and how are they represented? 
Obviously, patterns need to be expressed within a formalism and a language. This choice gives rise to three distinct 
approaches: logic, equations, or cross-tabulations. Each of these approaches traces its historical roots to a distinct
 mathematical origin. 
The concept of the language used for pattern expression can be clarified with a few simple diagrams, as in Figure 4. For
 instance, let us consider the distinction between equations and logic. In an equational system operators such as plus and 
times may be used to relate variables together (e.g., Y = (a * X) + b), while in a logical system the key operators are
conditional (e.g., IF 6 < X < 7 THEN 1 < Y < 2).

Logic can deal with both numeric and non-numeric data. Equations require all data to be numeric, while cross-tabulations 
are the reverse and only work on non-numeric data -- a key source of problems. But more importantly, equations compute distances from surfaces (such as lines) while cross-tabs focus on co-occurrences.
Neural networks are opaque equational techniques since internally they compute surfaces within a numeric space. As data 
is repeatedly fed into the network, the parameters are changed so that the surface becomes closer to the data points.
When discussing data mining, it is necessary to distinguish between directed analysis and free-form roams through the 
database.
In directed analysis, also called supervised learning, there is a teacher who teaches the system, by saying when a 
prediction was correct or incorrect. Here the data has a specific column that is used as the goal for discovery or prediction.

In unsupervised learning, the system has no teacher, but simply tries to find interesting clusters of patterns within the 
dataset. 
Most of the business applications of data mining involve directed data mining, while unsupervised discovery can sometimes 
be used for data segmentation or clustering (e.g., finding classes of customers that group together).
Logical Approaches
Logic forms the basis of most written languages and is essential for left-brain thinking. Patterns expressed in logical 
languages are distinguished by two main features: on one hand they are readable and understandable, on the other hand they
 are excellent for representing crisp boxes and groupings of data elements. 
The central operator in a logical language is usually a variation on the well known If/Then statement (e.g., If it is 
raining, then it is cloudy). However, let us note that while the most common form of logic is conditional logic, often we may need to use other logical forms such as association logic with When/Also rules (e.g., When paint is purchased, Also a paint-brush is purchased) (Parsaye, forthcoming). While the propositional and predicate logics (i.e., conditional logics)
are best known, other forms of logic (e.g., variational and trend logics) are also useful in business data analysis.
Conditional logic systems can be separated into two distinct groups: rules and decision trees. Conditional rules may be 
implemented by induction or genetic algorithms and there are several approaches for generating decision trees (e.g., CART, 
CHAID, C4.5). 
Rules
Logical relationships are usually represented as rules. The simplest types of rules express conditional or association 
relationships. A conditional rule is a statement of the form: 
If Condition1
Then Condition2
For instance, in a demographic database we may have a rule: If Profession=Athlete Then Age < 30. Here we compare the values 
within fields of a given table (i.e., we have an attribute-value representation). Here Profession is the attribute and 
Athlete the value. Another example of an attribute-value expression is State=Arizona, where State is the attribute and 
Arizona the value. 
Conditional rules usually work on tables with attributes (i.e., fields) and values, such as below.
Name          Profession    Age
John Smith    Athlete       27
...           ...           ...
Rules may easily go beyond attribute-value representations. They may have statements such as 
Shipping_State=Receiving_State. Here, in attribute logic, we compare the values of two fields, without explicitly naming
 any values. This relationship cannot be stated by decision trees or cross-tabs. 
Affinity logic is distinct from conditional logic both in terms of the language of expression and the data structures it 
uses. Affinity analysis (or association analysis) is the search for patterns and conditions that describe how various 
items group together or happen together within a series of events or transactions. An affinity rule has the form:
When Item1
Also Item2.
An example of this is, When Paint, Also Paint-Brush. A simple affinity analysis system uses a transaction table such as:
Transaction #    Item
123              Paint
123              Paint-Brush
123              Nails
124              Paint
124              Paint-Brush
124              Wood
125              ...
to identify items that group together within transactions. Here, the Transaction # field is used to group items together, while the Item field contains the entities being grouped. In this example, the affinity for Transactions 123 and 124 is
the pair (Paint, Paint-Brush). 
Please note that this is a distinct data structure from the conditional logic rule above.
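As a rough sketch of how such a transaction table yields flat affinities (an illustration only, not the When/Also engine itself), the following Python groups items by transaction and counts how often each pair of items occurs together:

from collections import defaultdict
from itertools import combinations

# Transaction table as (transaction #, item) rows, mirroring the example above.
rows = [
    (123, "Paint"), (123, "Paint-Brush"), (123, "Nails"),
    (124, "Paint"), (124, "Paint-Brush"), (124, "Wood"),
]

# Group items by transaction.
baskets = defaultdict(set)
for txn, item in rows:
    baskets[txn].add(item)

# Count how often each unordered pair of items appears in the same transaction.
pair_counts = defaultdict(int)
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
# (Paint, Paint-Brush) appears in both transactions, so it tops the list.
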
As pointed out in Data Mining with OLAP Affinities, (Parsaye, forthcoming) flat affinities need to be extended to 
dimensional or OLAP affinities for better results. A dimensional affinity has the form:
Confidence=95%
IF   Day=Saturday
WHEN Item=Paint-Brush
ALSO Item=Paint
Here logical conditions and associations are combined. This form of hybrid structure delivers the real power of transparent 
logic.
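One plausible way to read such a dimensional affinity (a sketch only, assuming each transaction row also carries a Day value; the data below is invented) is to restrict the transactions by the IF condition and then compute the When/Also confidence on what remains:

from collections import defaultdict

# Invented transactions: (transaction #, day, item).
rows = [
    (201, "Saturday", "Paint-Brush"), (201, "Saturday", "Paint"),
    (202, "Saturday", "Paint-Brush"), (202, "Saturday", "Paint"),
    (203, "Monday",   "Paint-Brush"), (203, "Monday",   "Nails"),
]

# Restrict to the IF condition (Day=Saturday), then group items per transaction.
baskets = defaultdict(set)
for txn, day, item in rows:
    if day == "Saturday":
        baskets[txn].add(item)

# Confidence of: WHEN Item=Paint-Brush ALSO Item=Paint, within Saturdays.
when = [items for items in baskets.values() if "Paint-Brush" in items]
also = [items for items in when if "Paint" in items]
print(f"Confidence: {len(also) / len(when):.0%}" if when else "No matching transactions")
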
Rules have the advantage of being able to deal with numeric and non-numeric data in a uniform manner. When dealing with 
numeric data, some approaches have to break numeric fields into codes or specific values. This may effectively remove 
all numeric considerations from the codes, thus resulting in the loss of patterns. For instance, the field Age may need 
to be broken into 3 ranges (1-30), (31-60), (61-100), corresponding to young, middle-aged and old. Of course, the data may 
hold patterns that overlap any of these ranges (e.g., the range (27-34) may be very significant for some patterns and any 
approach based on code-assignment will miss these). 
Rules can also work well on multidimensional and OLAP data because they can deal with ranges of numeric data and their 
logical format allows their patterns to be merged along multiple dimensions (Parsaye, 1997).
Rules do at times look like decision trees, but despite the surface-level similarity they are a distinct technique. This is easy to see when we consider the fact that decision trees do not express associations, or attribute-based patterns such as Shipping_State=Receiving_State where the values of two fields are compared, without explicitly
naming any values.
The main weakness of rules stems from their inability to deal with the smooth surfaces that typically occur in nature (e.g.,
finger-print identification, facial pattern recognition). These naturally smooth surfaces are often best approximated by
 equational approaches such as neural nets. 
Below I review two approaches to rule generation, namely induction and genetic algorithms. However, these are not the only 
approaches to data mining with rules. Some approaches try to pre-compute every possible rule that a data set could include. 
In these cases, only a few columns of data may be used because the logical space is so large. Hence I will not review these
 since they are not practical for large scale applications.
Rule Induction
Rule induction is the process of looking at a data set and generating patterns. By automatically exploring the data set, as 
in Figure 5, the induction system forms hypotheses that lead to patterns. 

The process is in essence similar to what a human analyst would do in exploratory analysis. For example, given a database of
 demographic information, the induction system may first look at how ages are distributed, and it may notice an interesting 
variation for those people whose profession is listed as professional athlete. This hypothesis is then found to be relevant 
and the system will print a rule such as: 
IF Profession=Athlete
THEN Age < 30. 
This rule may have a confidence of 70% attached to it. However, this pattern may not hold for the ages of bankers or 
teachers in the same database.
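The exploration itself can be caricatured in a few lines (an illustration, not the induction algorithm of any particular system; the data, the fields and the 60% confidence floor are invented): enumerate candidate attribute-value conditions, compute the confidence of each candidate rule against the data, and keep the ones that clear the floor.

# Toy rule induction sketch: enumerate candidate rules of the form
# "IF <field>=<value> THEN Age < 30" and keep those above a confidence floor.
records = [
    {"profession": "Athlete", "state": "CA", "age": 24},
    {"profession": "Athlete", "state": "AZ", "age": 27},
    {"profession": "Athlete", "state": "CA", "age": 44},
    {"profession": "Banker",  "state": "NY", "age": 51},
    {"profession": "Banker",  "state": "CA", "age": 47},
]

MIN_CONFIDENCE = 0.6   # a real system would also require minimum support

for field in ("profession", "state"):
    for value in {r[field] for r in records}:
        matching = [r for r in records if r[field] == value]
        conf = sum(1 for r in matching if r["age"] < 30) / len(matching)
        if conf >= MIN_CONFIDENCE:
            print(f"IF {field}={value} THEN Age < 30  "
                  f"(confidence {conf:.0%}, {len(matching)} matching records)")
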
We must also distinguish between fuzzy and inexact rules. Inexact rules often have a fixed confidence factor attached 
to them, i.e. each rule has a specific integer or percentage (such as 70%) representing its validity. However, the 
confidence in a fuzzy rule can vary with the numeric values in the body of the rule; for instance, the confidence
may be proportional to the age of a person and as the age varies so does the confidence. In this way fuzzy rules can 
produce much more compact expressions of knowledge and lead to stable behavior.
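As a rough illustration of the difference (the shape of the fuzzy confidence curve below is invented): an inexact rule carries one fixed number, while a fuzzy rule's confidence is a function of the value appearing in its body.

# Inexact rule: a single fixed confidence.
INEXACT_CONFIDENCE = 0.70   # "IF Profession=Athlete THEN Age < 30, 70%"

# Fuzzy rule: confidence varies smoothly with age (an invented ramp).
def fuzzy_confidence(age):
    # Degree to which an athlete of this age counts as "young".
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15.0   # linear ramp between 25 and 40

for age in (22, 30, 38):
    print(age, round(fuzzy_confidence(age), 2))
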
Rule induction can discover very general rules which deal with both numeric and non- numeric data. And rules can combine 
conditional and affinity statements into hybrid patterns. A key issue here is the ability to go beyond flat databases and
 deal with OLAP patterns (Parsaye, 1997).
Genetic Algorithms
Genetic algorithms also generate rules from data sets, but do not follow the exploration oriented protocol of rule 
induction. Instead, they rely on the idea of mutation to make changes in patterns until a suitable form of pattern 
emerges via selective breeding, as shown in Figure 6. 


The genetic cross-over operation is in fact very similar to the operation breeders use when they cross-breed plants and/or
 animals. The exchange of genetic material by chromosomes is also based on the same method. In the case of rules, the 
material exchanged is a part of the pattern the rule describes. 
Let us note that this is different from rule induction since the main focus in genetic algorithms is the combination of 
patterns from rules that have been discovered so far, while in rule induction the main focus of the activity is the dataset.
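A minimal sketch of the cross-over idea applied to rules (the rule encoding, data and fitness measure are invented for illustration): each rule is a set of conditions, a child rule inherits conditions from two parents, and the child is scored against the data just like any other rule.

import random

# Encode a rule as a dict of conditions, e.g. {"profession": "Athlete"}
# meaning "IF Profession=Athlete THEN Age < 30". Encoding is invented.
records = [
    {"profession": "Athlete", "state": "CA", "age": 24},
    {"profession": "Athlete", "state": "AZ", "age": 27},
    {"profession": "Banker",  "state": "CA", "age": 47},
]

def fitness(rule):
    # Confidence of "IF <conditions> THEN Age < 30" on the data (0 if no match).
    matching = [r for r in records if all(r[f] == v for f, v in rule.items())]
    return sum(1 for r in matching if r["age"] < 30) / len(matching) if matching else 0.0

def crossover(parent_a, parent_b):
    # The child inherits each condition at random from one of the two parents.
    child = {}
    for field in set(parent_a) | set(parent_b):
        source = random.choice([parent_a, parent_b])
        if field in source:
            child[field] = source[field]
    return child

a, b = {"profession": "Athlete"}, {"state": "CA"}
child = crossover(a, b)
print(child, fitness(child))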

Genetic algorithms are not just for rule generation and may be applied to a variety of other tasks to which rules do not 
immediately apply, such as the discovery of patterns in text, planning and control, system optimization, etc. (Holland, 
1995).
Decision Trees
Decision trees express a simple form of conditional logic. A decision tree system simply partitions a table into smaller 
tables by selecting subsets based on values for a given attribute. Depending on how the table is partitioned, we get different decision tree algorithms such as CART, CHAID and C4.5.
For example, consider the table:
Manufacturer    State    City           Product Color    Profit
Smith           CA       Los Angeles    Blue             High
Smith           AZ       Flagstaff      Green            Low
Adams           NY       NYC            Blue             High
Adams           AZ       Flagstaff      Red              Low
Johnson         NY       NYC            Green            Avg
Johnson         CA       Los Angeles    Red              Avg
A decision tree from this table is pictorially shown in Figure 7.

This decision tree first selected the attribute State to start the partitioning operation, then the attribute Manufacturer.
 Of course, if there are 100 columns in the table, the question of which attribute to select first becomes crucial. 
In fact, in many cases, including the table above, there is no best attribute, and whichever attribute the tree chooses 
there will be information loss, as shown in Rules Are Much More Than Decision Trees (Parsaye, 1996). For example, the
two facts:
(a) Blue products are high profit.
(b) Arizona is low profit.
can never be obtained from the table above with a decision tree. We can either get fact (a) or fact (b) from the tree, 
not both, because a decision tree selects one specific attribute for partitioning at each stage. Rules and cross-tabs 
on the other hand, can discover both of these facts. For a more detailed discussion of these issues, please see Rules 
Are Much More Than Decision Trees (Parsaye, 1996).
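To see why the choice of the first splitting attribute matters, here is a rough sketch (an illustration, not CART, CHAID or C4.5; the City column is omitted since it mirrors State) that scores each attribute by how pure the resulting Profit partitions would be. On this table every attribute scores the same, which is exactly the 'no best attribute' situation described above:

from collections import Counter, defaultdict

# The example table as (Manufacturer, State, Color, Profit) tuples.
rows = [
    ("Smith",   "CA", "Blue",  "High"),
    ("Smith",   "AZ", "Green", "Low"),
    ("Adams",   "NY", "Blue",  "High"),
    ("Adams",   "AZ", "Red",   "Low"),
    ("Johnson", "NY", "Green", "Avg"),
    ("Johnson", "CA", "Red",   "Avg"),
]
FIELDS = {"Manufacturer": 0, "State": 1, "Color": 2}
PROFIT = 3

def purity(field_index):
    # Average share of the majority Profit class within each partition.
    groups = defaultdict(list)
    for row in rows:
        groups[row[field_index]].append(row[PROFIT])
    shares = [Counter(g).most_common(1)[0][1] / len(g) for g in groups.values()]
    return sum(shares) / len(shares)

for name, idx in FIELDS.items():
    print(name, round(purity(idx), 2))
# Whichever attribute is split on first, some facts (Blue -> High, AZ -> Low)
# end up spread across different branches of the tree.
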
Cross Tabulation
Cross tabulation is a very basic and simple form of data analysis, well known in statistics, and widely used for reporting.
 A two dimensional cross-tab is similar to a spreadsheet, with both row and column headings as attribute values. The cells 
in the spreadsheet represent an aggregate operation, usually the number of co-occurrences of the attribute values together. Many cross-tabs are effectively equivalent to a 3D bar graph which displays co-occurrence counts.
Consider the table in the previous section. A cross-tab for the profit level could look as follows:

                CA   AZ   NY   Blue   Green   Red
Profit High      1    0    1      2       0     0
Profit Avg       1    0    1      0       1     1
Profit Low       0    2    0      0       1     1
Here we have not included the fields Manufacturer and City because the cross-tab would look too large. However, as is 
readily seen here, the fact that the count of co-occurrence of Blue and High is above the others indicates a stronger
relationship.
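A cross-tab of this kind is just a set of co-occurrence counts, as the following sketch shows (plain Python over the same example rows, again leaving out Manufacturer and City):

from collections import defaultdict

rows = [  # (State, Color, Profit) from the example table.
    ("CA", "Blue",  "High"), ("AZ", "Green", "Low"),
    ("NY", "Blue",  "High"), ("AZ", "Red",   "Low"),
    ("NY", "Green", "Avg"),  ("CA", "Red",   "Avg"),
]

states, colors, profits = ["CA", "AZ", "NY"], ["Blue", "Green", "Red"], ["High", "Avg", "Low"]

# Count co-occurrences of each profit level with each state and each color.
counts = defaultdict(int)
for state, color, profit in rows:
    counts[(profit, state)] += 1
    counts[(profit, color)] += 1

header = states + colors
print("            " + "  ".join(f"{h:>5}" for h in header))
for profit in profits:
    cells = "  ".join(f"{counts[(profit, col)]:>5}" for col in header)
    print(f"Profit {profit:<4} " + cells)
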
When dealing with a small number of non-numeric values, cross-tabs are simple enough to use and find some conditional 
logic relationships (but not attribute logic, affinities or other forms of logic). Cross-tabs usually run into four classes 
of problems: first when the number of non-numeric values goes up, second when one has to deal with numeric values, third 
when several conjunctions are involved, and fourth when the relationships are not just based on counts.
Agents and belief networks are variations on the cross-tab theme and will be discussed next.
Agents
The term agent is sometimes used (among its other uses) to refer to cross-tabs that are graphically displayed in a 
network and allow some conjunctions (i.e., ANDs). In this context the term agent is effectively equivalent to the term 
field-value pair.
For instance, if we consider the cross-tab above, we may define 6 agents for the goal Profit:High and graphically show 
them as in Figure 8.

Note that here the weights 100 and 50 are simply the percentages of times the values are associated with the goal, (i.e., 
they represent impact ratios, not probabilities). Although it is possible to represent conjunctions here by adding nodes 
to the left of each circle, there are often too many possibilities and for larger data sets problems arise rather quickly.
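One plausible reading of such weights (a sketch of the idea only, not of any particular agent product) is the share of records containing a given field-value pair that also reach the goal Profit=High:

from collections import defaultdict

rows = [  # (State, Color, Profit) from the example table.
    ("CA", "Blue",  "High"), ("AZ", "Green", "Low"),
    ("NY", "Blue",  "High"), ("AZ", "Red",   "Low"),
    ("NY", "Green", "Avg"),  ("CA", "Red",   "Avg"),
]

GOAL = "High"
totals, hits = defaultdict(int), defaultdict(int)
for state, color, profit in rows:
    for pair in (("State", state), ("Color", color)):
        totals[pair] += 1
        if profit == GOAL:
            hits[pair] += 1

# Weight of each field-value pair for the goal, as a percentage.
for pair in sorted(totals, key=lambda p: -hits[p] / totals[p]):
    print(pair, round(100 * hits[pair] / totals[pair]))
# Blue scores 100 and CA/NY score 50; AZ, Green and Red score 0.
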
Like other cross-tab techniques, when dealing with numeric values, agents have to break the numbers into fixed codes, 
(e.g., break Age into three age classes: (1-30), (31-60), (61-100)). Of course, the data may hold patterns that overlap any of these ranges (e.g., the range (27-34)), and these will not be detected by the agent. And if the ranges selected
are too small, there will be too many of them and larger patterns will be missed. Moreover, this inability to deal with 
numeric data causes problems with multidimensional data (Parsaye, 1997). 
Belief Networks
Belief networks (sometimes called causal networks) also rely on co-occurrence counts, but both the graphic rendering and
the probabilistic representation are slightly different from agents.
Belief networks are usually illustrated using a graphical representation of probability distributions (derived from counts).
 A belief network is thus a directed graph, consisting of nodes (representing variables) and arcs (representing probabilistic 
dependencies) between the node variables. 
An example of a belief network is shown in Figure 9, where just the color attribute has been drawn for the sake of 
simplicity. This is the same cross-tab as in the previous section.

Each node contains a conditional probability distribution that describes the relationship between the node and the parents 
of that node. The belief network graph is acyclic, meaning that there are no cycles.
Please compare this to Figure 8, to see that the arcs in this diagram denote the probabilistic dependencies between the 
nodes, rather than impacts computed from the cross-tabs. 
Equational Approaches
The underlying method of pattern expression in these systems is surface construction rather than logical expression or 
co-occurrence counts. Such systems usually use a set of equations to define a surface within a numeric space, then measure
 distances from this surface for prediction. 
The best known example of such a surface is a straight line in a two dimensional space, as in Figure 10a. This has the 
simple equation Y=(a * X) + b and leads to the well known approach of linear regression in statistics. As the parameter 
a varies in this equation, the slope of the line changes. 

Figure 10. a) Regression line: Y = (a * X) + b; b) Parabolic equation: Y = X^2; c) Inverse equation: Y = 1 / X.
Regression works well when the points to be approximated lie on a straight line. But as in Figures 10b and 10c it is also 
possible to use non-linear equations to approximate smoother surfaces.
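As a concrete sketch of the equational approach, the line Y=(a * X) + b can be fitted to a handful of points by ordinary least squares (the points below are invented):

# Least-squares fit of Y = a*X + b to a few invented points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(f"Y = {a:.2f} * X + {b:.2f}")

# Prediction then amounts to reading a new X off this surface.
print("Predicted Y at X=6:", round(a * 6 + b, 2))
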
When the surfaces are even more complex (e.g., Y = X^2 + X + (1 / X)), or when there are several dimensions, the ability of
humans to understand the equations and surfaces decreases rather quickly. The system becomes opaque or black-box. However,
 it is still possible to construct such surfaces. 
In fact, neural nets are known to be universal approximators in theory. They can come very close to any function. However,
 present theory does not specify the practical limits of nets for achieving such approximation on large data sets and most 
neural net implementations rely on sampling.
The equational approaches almost always require the data set to be all numeric. Non-numeric data needs to be coded into
 numbers (the reverse of what cross-tabs do). This often causes a number of problems, as discussed below.
Statistics
There are so many books and articles on statistics that it simply does not make sense to paraphrase them here. The 
interested reader is referred to this widely available body of literature for a further exposition of the ideas. 
The non-technical reader may find an almost non-technical book called The Cartoon Guide to Statistics an interesting source
 for the summary of the main concepts, without the traditionally dry tone of technical presentations. 
Neural Nets
Neural nets are a class of predictive modeling system that work by iterative parameter adjustment. Structurally, a neural 
network consists of a number of interconnected elements (called neurons) organized in layers which learn by modifying the
 connection strengths (i.e., the parameters) connecting the layers, as in Figure 11.
Neural nets usually construct complex equational surfaces through repeated iterations, each time adjusting the parameters 
that define the surface. After many iterations, a surface may be internally defined that approximates many of the points 
within the dataset. 
The basic function of each neuron is to: (a) evaluate input values, (b) calculate a total for the combined input values, 
(c) compare the total with a threshold value and (d) determine what its own output will be. While the operation of each 
neuron is fairly simple, complex behavior can be created by connecting a number of neurons together. Typically, the input
 neurons are connected to a middle layer (or several intermediate layers) which is then connected to an outer layer, as is
seen in Figure 11.

To build a neural model, we first train the net on a training dataset, then use the trained net to make predictions. We
 may, at times, also use a monitoring data set during the training phase to check on the progress of the training. 
Each neuron usually has a set of weights that determine how it evaluates the combined strength of the input signals. Inputs
 coming into a neuron can be either positive (excitatory) or negative (inhibitory). Learning takes place by changing the 
weights used by the neuron in accordance with classification errors that were made by the net as a whole. The inputs are 
usually scaled and normalized to produce a smooth behavior. 
During the training phase, the net sets the weights that determine the behavior of the intermediate layer. A popular 
approach is called backpropagation, in which the weights are adjusted based on how close the network's guesses are.
Incorrect guesses reduce the thresholds for the appropriate connections, as in Figure 12.
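A drastically simplified sketch of this iterative adjustment (a single neuron on an invented two-input problem; real backpropagation pushes errors back through several layers, and the learning rate and epoch count here are arbitrary):

import math, random

# One sigmoid neuron learning a two-input OR-like pattern by gradient descent.
random.seed(0)
data = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]

w = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias, rate = 0.0, 0.5

def neuron(x):
    total = sum(wi * xi for wi, xi in zip(w, x)) + bias   # combined input
    return 1.0 / (1.0 + math.exp(-total))                 # squashing function

for epoch in range(5000):
    for x, target in data:
        out = neuron(x)
        error = target - out
        # Adjust each weight in proportion to its input and the error made.
        for i in range(2):
            w[i] += rate * error * out * (1 - out) * x[i]
        bias += rate * error * out * (1 - out)

print([round(neuron(x), 2) for x, _ in data])   # outputs move toward [0, 1, 1, 1]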

Neural nets can be trained to reasonably approximate the behavior of functions on small and medium sized data sets since 
they are universal approximators. However, in practice they work only on subsets and samples of data and at times run into
problems when dealing with larger data sets (e.g., failure to converge or being stuck in a local minimum).
It is well known that backpropagation networks are similar to regression. There are several other network training paradigms
 that go beyond backpropagation, but still have problems in dealing with large data sets. One key problem for applying 
neural nets to large data sets is the preparation problem. The data in the warehouse has to be mapped into real numbers 
before the net can use it. This is a difficult task for commercial data with many non-numeric values.
Since input to a neural net has to be numeric (and scaled), interfacing to a large data warehouse may become a problem. 
For each data field used in a neural net, we need to perform scaling and coding. The numeric (and date) fields are scaled. 
They are mapped into a scale that makes them uniform (i.e., if ages range between 1 and 100 and number of children between
1 and 5, then we scale these into the same interval, such as -1 to +1). This is not a very difficult task.
However, non-numeric values cannot easily be mapped to numbers in a direct manner since this will introduce unexpected 
relationships into the data, leading to errors later. For instance, if we have 100 cities, and assign 100 numbers to them,
 cities with values 98 and 99 will seem more related together than those with numbers 21 and 77. The net will think these 
cities are somehow related, and this may not be so.
To be used in a neural net, values for nonscalar fields such as City, State or Product need to be coded and mapped into 
new fields, taking the values 0 or 1 as in Figure 10. This means that the field State which may have the 7 values: {CA, 
NY, AZ, GA, MI, TX, VA} is no longer used. Instead, we have 7 new fields, called CA, NY, AZ, GA, MI, TX, VA each taking the 
value 0 or 1, depending on the value in the record. For each record, only one of these fields has the value 1, and the 
others have the value 0. In practice, there are often 50 states, requiring 50 new inputs.
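A sketch of this preparation step (plain Python; the records are invented and the scaling interval is the -1 to +1 mentioned above):

# Preparing mixed records for a neural net: scale numeric fields into [-1, +1]
# and expand each non-numeric field into one 0/1 input per value.
records = [
    {"age": 27, "state": "CA"},
    {"age": 45, "state": "NY"},
    {"age": 61, "state": "AZ"},
]

states = sorted({r["state"] for r in records})   # one new 0/1 input per state
ages = [r["age"] for r in records]
lo, hi = min(ages), max(ages)

def encode(record):
    scaled_age = 2 * (record["age"] - lo) / (hi - lo) - 1   # map [lo, hi] to [-1, +1]
    one_hot = [1 if record["state"] == s else 0 for s in states]
    return [scaled_age] + one_hot

for r in records:
    print(r, "->", [round(v, 2) for v in encode(r)])
# A field with 1,000 distinct values would expand into 1,000 such inputs.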


Now the problem should be obvious: What if the field City has 1,000 values? Do we need to introduce 1,000 new input 
elements for the net? In the strict sense, yes, we have to. But in practice this is not easy, since the internal matrix 
representations for the net will become astronomically large and totally unmanageable. Hence, by-pass approaches are often 
used.
Some systems try to overcome this problem by grouping the 1,000 cities into 10 groups of 100 cities each. Yet, this often 
introduces bias into the system, since in practice it is hard to know what the optimal groups are, and for large warehouses 
this requires too much human intervention. In fact, the whole purpose of data mining is to find these clusters, not ask the human analyst to construct them.
The distinguishing power of neural nets comes from their ability to deal with smooth surfaces that can be expressed in 
equations. These suitable application areas are varied and include finger-print identification and facial pattern 
recognition. However, with suitable analytical effort neural net models can also succeed in many other areas such as 
financial analysis and adaptive control.
Eventually, the best way to use neural nets on large data sets will be to combine them with rules, allowing them to make 
predictions within a hybrid architecture.
Summary and Conclusions
The fundamental techniques used for data mining can be classified into distinct groups, each offering advantages and trade-offs. The modern techniques rely on pattern distillation, rather than data retention. Pattern distillation can be classified into logical, equational and cross-tabulational methods. The underlying structure of these approaches was discussed and compared. Hybrid approaches are likely to succeed best, merging logic and equations with multidimensional analysis. However, the overall structure of how these techniques are used should be viewed in the context of machine-man interaction (Parsaye, 1997).
