White Paper: A Characterization of Data Mining Technologies and Processes
by Information Discovery, Inc.
Source: http://www.dmreview.com/portals/portal.cfm?topicId=230001
Data Mining Processes


Traditionally, there have been two types of statistical analyses: confirmatory analysis and exploratory analysis. 
In confirmatory analysis, one has a hypothesis and either confirms or refutes it. However, the bottleneck for confirmatory 
analysis is the shortage of hypotheses on the part of the analyst. In exploratory analysis (Tukey, 1973), one finds
suitable hypotheses to confirm or refute. Here the system takes the initiative in data analysis, not the user.
The concept of initiative also applies to multidimensional spaces. In a simple OLAP access system, the user may have to 
think of a hypothesis and generate a graph. But in OLAP data mining, the system thinks of the questions by itself (Parsaye,
 1997). I use the term data mining to refer to the automated process of data analysis in which the system takes the 
initiative to generate patterns by itself. 
From a process oriented view, there are three classes of data mining activity: discovery, predictive modeling and forensic 
analysis.


Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about
 what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are,
 without the user thinking of the relevant questions first. In large databases, there are so many patterns that the user 
can never practically think of the right questions to ask. The key issues here are the richness of the patterns that can be expressed and discovered and the quality of the information delivered; together these determine the power and usefulness of the discovery technique.
As a simple example of discovery with system initiative, suppose we have a demographic database of the US. The user may 
take the initiative to ask a question from the database, such as 'what is the average age of bankers?' The system may then 
print 47 as the average age. The user may then ask the system to take the initiative and find something interesting about 
age by itself. The system will then act as a human analyst would. It will look at some data characteristics, distributions,
 etc. and try to find some data densities that might be away from ordinary. In this case the system may print the rule: 
'IF Profession=Athlete THEN Age < 30, with a 71% confidence.' This rule means that if we pick 100 athletes from the 
database, 71 of them are likely to be under 30. The system may also print: 'IF Profession=Athlete THEN Age < 60, with a 97%
 confidence.' This rule means that if we pick 100 athletes from the database, 97 of them are likely to be under 60. This 
delivers information to the user by distilling pattern from data.
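To make the notion of confidence concrete, the calculation behind such a rule can be sketched in a few lines of Python (an illustration only; the records, field names and the confidence helper below are invented):

# Minimal sketch: estimating the confidence of a discovered rule
# from a list of records. The data is invented for illustration.
records = [
    {"name": "A", "profession": "Athlete", "age": 24},
    {"name": "B", "profession": "Athlete", "age": 28},
    {"name": "C", "profession": "Athlete", "age": 41},
    {"name": "D", "profession": "Banker",  "age": 52},
]

def confidence(rows, condition, conclusion):
    # Share of rows satisfying the condition that also satisfy the conclusion.
    matching = [r for r in rows if condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if conclusion(r)) / len(matching)

# IF Profession=Athlete THEN Age < 30
conf = confidence(records,
                  lambda r: r["profession"] == "Athlete",
                  lambda r: r["age"] < 30)
print(f"Confidence: {conf:.0%}")   # 67% on this toy data
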
In predictive modeling, patterns discovered from the database are used to predict the future. Predictive modeling thus
allows the user to submit records with some unknown field values, and the system will guess the unknown values based on 
previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the 
patterns to guess values for new data items.
To use the example above, once we know that athletes are usually under 30, we can guess someone's age if we know that they 
are an athlete. For instance, if we are shown a record for John Smith whose profession is athlete, then by applying the rules we found above we can be over 70% sure that he is under 30 years old, and almost certain that he is under 60. Note
that discovery helps us find general knowledge, but prediction just guesses the value for the age of a specific individual.
 Also note that in this case the prediction is 'transparent' (i.e., we know why we guess the age as under 30). In some 
systems the age is guessed, but the reason for the guess is not provided, making the system opaque.
Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first establish the norm, then detect those items that deviate from it by more than a given
threshold. Again, to use the example above, once we notice that 97% of athletes are under 60, we can wonder about the 3% 
who are over 60 and still listed as athletes. These are unusual, but we still do not know why. They may be unusually 
healthy or play sports where age is less important (e.g., golf) or the database may contain errors, etc. Note that discovery
 helps us find usual knowledge, but forensic analysis looks for unusual and specific cases.
Each of these processes can be further classified. There are several types of pattern discovery such as If/Then rules, 
associations, etc. While the rules discussed above have an IF-THEN nature, association rules refer to item groupings (e.g., when someone buys one product at a store, they may buy another product at the same time -- a process usually called market basket analysis). The power of a discovery system is measured by the types and generality of the patterns it can
find and express in a suitable language.
Data Mining Users and Activities
It is necessary to distinguish the data mining processes discussed above from the data mining activities in which the 
processes may be performed, and the users who perform them. First, the users. Data mining activities are usually performed 
by three different classes of users: executives, end users and analysts. 
Executives need top-level insights and spend far less time with computers than the other groups -- their attention span is
 usually less than 30 minutes. They may want information beyond what is available in their executive information system 
(EIS). Executives are usually assisted by end users and analysts. 
End users know how to use a spreadsheet, but they do not program -- they can spend several hours a day with computers. 
Examples of end users are sales people, market researchers, scientists, engineers, physicians, etc. At times, managers 
assume the role of both executive and end user. 
Analysts know how to interpret data and do occasional computing but are not programmers. They may be financial analysts, 
statisticians, consultants, or database designers. Analysts usually know some statistics and SQL.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and 
continuous data mining. 
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to 
understand this data set, or use it for prediction on new marketing campaigns. Episodic mining is usually performed by 
analysts.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of 
specific measures such as profitability. Hence, a strategic mining exercise may look to answer questions such as: where 
do our profits come from? or how do our customer segments and product usage patterns relate to each other? 
In continuous mining we try to understand how the world has changed within a given time period and try to gain an 
understanding of the factors that influence change. For instance, we may ask: how have sales patterns changed this month? 
or what were the changing sources of customer attrition last quarter? Obviously continuous mining is an on-going activity 
and usually takes place once strategic mining has been performed to provide a first understanding of the issues.
Continuous and strategic mining are often directed towards executives and managers, although analysts may help them here. 
As we shall see later, different technologies are best suited to each of these types of data mining activity.
The Technology Tree
The top level dichotomization of the data mining technologies can be based on the retention of data; that is, do we still 
keep or need the data after we have mined it? In most cases, we do not. However, in some early approaches much of the data set
was still maintained for future pattern matching. Obviously, these retention-based techniques only apply to the tasks of 
predictive modeling and forensic analysis, and not knowledge discovery since they do not distill any patterns.
As one would expect, approaches based on data retention quickly run into problems because of large data sets. However, in 
some cases predictive results can be obtained with these techniques and for the sake of completeness I briefly review them 
in the next section.

As shown in Figure 2, approaches based on pattern distillation fall into three categories: logical, cross-tabulational and
 equational. I will review each of these and their sub-branches separately. Each leaf of the tree in Figure 2 shows a 
distinct method of implementing a system based on a technique (e.g., several types of decision tree algorithms).
Not all approaches based on pattern distillation provide knowledge, since the patterns may be distilled into an opaque 
language or formalism not easily readable by humans such as very complex equations. Hence, some of these approaches produce 
transparent and understandable patterns of knowledge, others just produce patterns used for opaque prediction.
Data Retention
While in pattern distillation we analyze data, extract patterns and then leave the data behind, in the retention approaches 
the data is kept for pattern matching. When new data items are presented, they are matched against the previous data set.
A well known example of an approach based on data retention is the nearest neighbor method. Here, a data set is kept 
(usually in memory) for comparison with new data items. When a new record is presented for prediction, the distance between 
it and similar records in the data set is found, and the most similar (or nearest neighbors) are identified.
For instance, given a prospective customer for banking services, the attributes of the prospect are compared with all 
existing bank customers (e.g., the age and income of the prospect are compared with the age and income of existing customers).
 Then a set of closest neighbors for the prospect are selected (based on closest income, age, etc.). 
The term K-nearest neighbor is used to mean that we select the top K (e.g., top 10) neighbors for the prospect, as in
Figure 3. Next, a closer comparison is performed to select which new product is most suited to the prospect, based on the 
products used by the top K (e.g., top 10) neighbors. 
Figure 3 is currently unavailable.
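As an illustration of the idea only (the customer base, the numbers and the income scaling factor are all invented), a K-nearest neighbor lookup of this kind can be sketched as follows:

import math

# Toy customer base: (age, income, product used).
customers = [
    (25, 30000, "checking"),
    (31, 42000, "savings"),
    (45, 80000, "brokerage"),
    (38, 52000, "savings"),
    (52, 95000, "brokerage"),
]

def k_nearest(prospect, base, k=3):
    # Return the k customers closest to the prospect in (age, income) space.
    def distance(c):
        # Divide income by 1000 so it does not dominate the distance.
        return math.hypot(prospect[0] - c[0], (prospect[1] - c[1]) / 1000.0)
    return sorted(base, key=distance)[:k]

prospect = (35, 48000)
neighbors = k_nearest(prospect, customers, k=3)

# Suggest the product most common among the nearest neighbors.
products = [c[2] for c in neighbors]
print("Suggested product:", max(set(products), key=products.count))
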
Of course, it is quite expensive to keep all the data, and hence sometimes just a set of typical cases is retained. We may 
select a set of 100 typical customers as the basis for comparison. This is often called case-based reasoning. 
Obviously, the key problem here is that of selecting the typical customers as cases. If we do not really understand the
 customers, how can we expect to select the typical cases, and if the customer-base changes, how do we change the typical 
customers?
Another usually fatal problem for these approaches has to do with databases with a large number of non-numeric values (e.g.,
 many supermarket products or car parts). Since distances between these non-numeric values are not easily computed, some 
measure of approximation needs to be used -- and this is often hard to come by. And if there are many non-numeric values,
there will be too many cases to manage.
Pattern Distillation
These technologies extract patterns from a data set, then use the patterns for various purposes. Naturally, the first two 
questions to ask here are: What types of patterns can be extracted and how are they represented? 
Obviously, patterns need to be expressed within a formalism and a language. This choice gives rise to three distinct 
approaches: logic, equations, or cross-tabulations. Each of these approaches traces its historical roots to a distinct
 mathematical origin. 
The concept of the language used for pattern expression can be clarified with a few simple diagrams, as in Figure 4. For
 instance, let us consider the distinction between equations and logic. In an equational system operators such as plus and 
times may be used to relate variables together (e.g., Y = (a * X) + b), while in a logical system the key operators are
conditional (e.g., IF 6 < X < 7 THEN 1 < Y < 2).

Logic can deal with both numeric and non-numeric data. Equations require all data to be numeric, while cross-tabulations 
are the reverse and only work on non-numeric data -- a key source of problems. But more importantly, equations compute distances from surfaces (such as lines) while cross-tabs focus on co-occurrences.
Neural networks are opaque equational techniques since internally they compute surfaces within a numeric space. As data 
is repeatedly fed into the network, the parameters are changed so that the surface becomes closer to the data points.
When discussing data mining, it is necessary to distinguish between directed analysis and free-form roams through the 
database.
In directed analysis, also called supervised learning, there is a teacher who teaches the system, by saying when a 
prediction was correct or incorrect. Here the data has a specific column that is used as the goal for discovery or prediction.

In unsupervised learning, the system has no teacher, but simply tries to find interesting clusters of patterns within the 
dataset. 
Most of the business applications of data mining involve directed data mining, while unsupervised discovery can sometimes 
be used for data segmentation or clustering (e.g., finding classes of customers that group together).
Logical Approaches
Logic forms the basis of most written languages and is essential for left-brain thinking. Patterns expressed in logical 
languages are distinguished by two main features: on one hand they are readable and understandable, on the other hand they
 are excellent for representing crisp boxes and groupings of data elements. 
The central operator in a logical language is usually a variation on the well known If/Then statement (e.g., If it is 
raining, then it is cloudy). However, let us note that while the most common form of logic is conditional logic, often we may need to use other logical forms such as association logic with When/Also rules (e.g., When paint is purchased, Also a paint-brush is purchased) (Parsaye, forthcoming). While the propositional and predicate logics (i.e., conditional logics)
are best known, other forms of logic (e.g., variational and trend logics) are also useful in business data analysis.
Conditional logic systems can be separated into two distinct groups: rules and decision trees. Conditional rules may be 
implemented by induction or genetic algorithms and there are several approaches for generating decision trees (e.g., CART, 
CHAID, C4.5). 
Rules
Logical relationships are usually represented as rules. The simplest types of rules express conditional or association 
relationships. A conditional rule is a statement of the form: 
If Condition1
Then Condition2
For instance, in a demographic database we may have a rule: If Profession=Athlete Then Age < 30. Here we compare the values 
within fields of a given table (i.e., we have an attribute-value representation). Here Profession is the attribute and 
Athlete the value. Another example of an attribute-value expression is State=Arizona, where State is the attribute and 
Arizona the value. 
Conditional rules usually work on tables with attributes (i.e., fields) and values, such as below.
Name          Profession    Age
John Smith    Athlete       27
...           ...           ...
Rules may easily go beyond attribute-value representations. They may have statements such as 
Shipping_State=Receiving_State. Here, in attribute logic, we compare the values of two fields, without explicitly naming
 any values. This relationship cannot be stated by decision trees or cross-tabs. 
Affinity logic is distinct from conditional logic both in terms of the language of expression and the data structures it 
uses. Affinity analysis (or association analysis) is the search for patterns and conditions that describe how various 
items group together or happen together within a series of events or transactions. An affinity rule has the form:
When Item1
Also Item2.
An example of this is, When Paint, Also Paint-Brush. A simple affinity analysis system uses a transaction table such as:
Transaction #    Item
123              Paint
123              Paint-Brush
123              Nails
124              Paint
124              Paint-Brush
124              Wood
125              ...
to identify items that group together within transactions. Here, the Transaction # field is used to group items together, while the Item field contains the entities being grouped. In this example, the affinity for Transactions 123 and 124 is
the pair (Paint, Paint-Brush). 
Please note that this is a distinct data structure from the conditional logic rule above.
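As a rough sketch of how such a transaction table yields flat affinities (an illustration only, not the When/Also engine itself), the following Python groups items by transaction and counts how often each pair of items occurs together:

from collections import defaultdict
from itertools import combinations

# Transaction table as (transaction #, item) rows, mirroring the example above.
rows = [
    (123, "Paint"), (123, "Paint-Brush"), (123, "Nails"),
    (124, "Paint"), (124, "Paint-Brush"), (124, "Wood"),
]

# Group items by transaction.
baskets = defaultdict(set)
for txn, item in rows:
    baskets[txn].add(item)

# Count how often each unordered pair of items appears in the same transaction.
pair_counts = defaultdict(int)
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
# (Paint, Paint-Brush) appears in both transactions, so it tops the list.
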
As pointed out in Data Mining with OLAP Affinities, (Parsaye, forthcoming) flat affinities need to be extended to 
dimensional or OLAP affinities for better results. A dimensional affinity has the form:
Confidence=95%
IF   Day=Saturday
WHEN Item=Paint-Brush
ALSO Item=Paint
Here logical conditions and associations are combined. This form of hybrid structure delivers the real power of transparent 
logic.
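One plausible way to read such a dimensional affinity (a sketch only, assuming each transaction row also carries a Day value; the data below is invented) is to restrict the transactions by the IF condition and then compute the When/Also confidence on what remains:

from collections import defaultdict

# Invented transactions: (transaction #, day, item).
rows = [
    (201, "Saturday", "Paint-Brush"), (201, "Saturday", "Paint"),
    (202, "Saturday", "Paint-Brush"), (202, "Saturday", "Paint"),
    (203, "Monday",   "Paint-Brush"), (203, "Monday",   "Nails"),
]

# Restrict to the IF condition (Day=Saturday), then group items per transaction.
baskets = defaultdict(set)
for txn, day, item in rows:
    if day == "Saturday":
        baskets[txn].add(item)

# Confidence of: WHEN Item=Paint-Brush ALSO Item=Paint, within Saturdays.
when = [items for items in baskets.values() if "Paint-Brush" in items]
also = [items for items in when if "Paint" in items]
print(f"Confidence: {len(also) / len(when):.0%}" if when else "No matching transactions")
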
Rules have the advantage of being able to deal with numeric and non-numeric data in a uniform manner. When dealing with 
numeric data, some approaches have to break numeric fields into codes or specific values. This may effectively remove 
all numeric considerations from the codes, thus resulting in the loss of patterns. For instance, the field Age may need 
to be broken into 3 ranges (1-30), (31-60), (61-100), corresponding to young, middle-aged and old. Of course, the data may 
hold patterns that overlap any of these ranges (e.g., the range (27-34) may be very significant for some patterns and any 
approach based on code-assignment will miss these). 
Rules can also work well on multidimensional and OLAP data because they can deal with ranges of numeric data and their 
logical format allows their patterns to be merged along multiple dimensions (Parsaye, 1997).
Rules do at times look like decision trees, but despite the surface-level similarity they are a distinct technique. This is easy to see when we consider the fact that decision trees do not express associations, or attribute-based patterns such as Shipping_State=Receiving_State where the values of two fields are compared, without explicitly
naming any values.
The main weakness of rules stems from their inability to deal with the smooth surfaces that typically occur in nature (e.g.,
finger-print identification, facial pattern recognition). These naturally smooth surfaces are often best approximated by
 equational approaches such as neural nets. 
Below I review two approaches to rule generation, namely induction and genetic algorithms. However, these are not the only 
approaches to data mining with rules. Some approaches try to pre-compute every possible rule that a data set could include. 
In these cases, only a few columns of data may be used because the logical space is so large. Hence I will not review these
 since they are not practical for large scale applications.
Rule Induction
Rule induction is the process of looking at a data set and generating patterns. By automatically exploring the data set, as 
in Figure 5, the induction system forms hypotheses that lead to patterns. 

The process is in essence similar to what a human analyst would do in exploratory analysis. For example, given a database of
 demographic information, the induction system may first look at how ages are distributed, and it may notice an interesting 
variation for those people whose profession is listed as professional athlete. This hypothesis is then found to be relevant 
and the system will print a rule such as: 
IF Profession=Athlete
THEN Age < 30. 
This rule may have a confidence of 70% attached to it. However, this pattern may not hold for the ages of bankers or 
teachers in the same database.
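The exploration itself can be caricatured in a few lines (an illustration, not the induction algorithm of any particular system; the data, the fields and the 60% confidence floor are invented): enumerate candidate attribute-value conditions, compute the confidence of each candidate rule against the data, and keep the ones that clear the floor.

# Toy rule induction sketch: enumerate candidate rules of the form
# "IF <field>=<value> THEN Age < 30" and keep those above a confidence floor.
records = [
    {"profession": "Athlete", "state": "CA", "age": 24},
    {"profession": "Athlete", "state": "AZ", "age": 27},
    {"profession": "Athlete", "state": "CA", "age": 44},
    {"profession": "Banker",  "state": "NY", "age": 51},
    {"profession": "Banker",  "state": "CA", "age": 47},
]

MIN_CONFIDENCE = 0.6   # a real system would also require minimum support

for field in ("profession", "state"):
    for value in {r[field] for r in records}:
        matching = [r for r in records if r[field] == value]
        conf = sum(1 for r in matching if r["age"] < 30) / len(matching)
        if conf >= MIN_CONFIDENCE:
            print(f"IF {field}={value} THEN Age < 30  "
                  f"(confidence {conf:.0%}, {len(matching)} matching records)")
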
We must also distinguish between fuzzy and inexact rules. Inexact rules often have a fixed confidence factor attached 
to them, i.e. each rule has a specific integer or percentage (such as 70%) representing its validity. However, the 
confidence in a fuzzy rule can vary with the numeric values in the body of the rule; for instance, the confidence
may be proportional to the age of a person and as the age varies so does the confidence. In this way fuzzy rules can 
produce much more compact expressions of knowledge and lead to stable behavior.
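As a rough illustration of the difference (the shape of the fuzzy confidence curve below is invented): an inexact rule carries one fixed number, while a fuzzy rule's confidence is a function of the value appearing in its body.

# Inexact rule: a single fixed confidence.
INEXACT_CONFIDENCE = 0.70   # "IF Profession=Athlete THEN Age < 30, 70%"

# Fuzzy rule: confidence varies smoothly with age (an invented ramp).
def fuzzy_confidence(age):
    # Degree to which an athlete of this age counts as "young".
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15.0   # linear ramp between 25 and 40

for age in (22, 30, 38):
    print(age, round(fuzzy_confidence(age), 2))
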
Rule induction can discover very general rules which deal with both numeric and non- numeric data. And rules can combine 
conditional and affinity statements into hybrid patterns. A key issue here is the ability to go beyond flat databases and
 deal with OLAP patterns (Parsaye, 1997).
Genetic Algorithms
Genetic algorithms also generate rules from data sets, but do not follow the exploration oriented protocol of rule 
induction. Instead, they rely on the idea of mutation to make changes in patterns until a suitable form of pattern 
emerges via selective breeding, as shown in Figure 6. 


The genetic cross-over operation is in fact very similar to the operation breeders use when they cross-breed plants and/or
 animals. The exchange of genetic material by chromosomes is also based on the same method. In the case of rules, the 
material exchanged is a part of the pattern the rule describes. 
Let us note that this is different from rule induction since the main focus in genetic algorithms is the combination of 
patterns from rules that have been discovered so far, while in rule induction the main focus of the activity is the dataset.
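A minimal sketch of the cross-over idea applied to rules (the rule encoding, data and fitness measure are invented for illustration): each rule is a set of conditions, a child rule inherits conditions from two parents, and the child is scored against the data just like any other rule.

import random

# Encode a rule as a dict of conditions, e.g. {"profession": "Athlete"}
# meaning "IF Profession=Athlete THEN Age < 30". Encoding is invented.
records = [
    {"profession": "Athlete", "state": "CA", "age": 24},
    {"profession": "Athlete", "state": "AZ", "age": 27},
    {"profession": "Banker",  "state": "CA", "age": 47},
]

def fitness(rule):
    # Confidence of "IF <conditions> THEN Age < 30" on the data (0 if no match).
    matching = [r for r in records if all(r[f] == v for f, v in rule.items())]
    return sum(1 for r in matching if r["age"] < 30) / len(matching) if matching else 0.0

def crossover(parent_a, parent_b):
    # The child inherits each condition at random from one of the two parents.
    child = {}
    for field in set(parent_a) | set(parent_b):
        source = random.choice([parent_a, parent_b])
        if field in source:
            child[field] = source[field]
    return child

a, b = {"profession": "Athlete"}, {"state": "CA"}
child = crossover(a, b)
print(child, fitness(child))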

Genetic algorithms are not just for rule generation and may be applied to a variety of other tasks to which rules do not 
immediately apply, such as the discovery of patterns in text, planning and control, system optimization, etc. (Holland, 
1995).
Decision Trees
Decision trees express a simple form of conditional logic. A decision tree system simply partitions a table into smaller 
tables by selecting subsets based on values for a given attribute. Depending on how the table is partitioned, we get different decision tree algorithms such as CART, CHAID and C4.5.
For example, consider the table:
Manufacturer    State    City           Product Color    Profit
Smith           CA       Los Angeles    Blue             High
Smith           AZ       Flagstaff      Green            Low
Adams           NY       NYC            Blue             High
Adams           AZ       Flagstaff      Red              Low
Johnson         NY       NYC            Green            Avg
Johnson         CA       Los Angeles    Red              Avg
A decision tree from this table is pictorially shown in Figure 7.

This decision tree first selected the attribute State to start the partitioning operation, then the attribute Manufacturer.
 Of course, if there are 100 columns in the table, the question of which attribute to select first becomes crucial. 
In fact, in many cases, including the table above, there is no best attribute, and whichever attribute the tree chooses 
there will be information loss, as shown in Rules Are Much More Than Decision Trees (Parsaye, 1996). For example, the
two facts:
(a) Blue products are high profit.
(b) Arizona is low profit.
can never be obtained from the table above with a decision tree. We can either get fact (a) or fact (b) from the tree, 
not both, because a decision tree selects one specific attribute for partitioning at each stage. Rules and cross-tabs 
on the other hand, can discover both of these facts. For a more detailed discussion of these issues, please see Rules 
Are Much More Than Decision Trees (Parsaye, 1996).
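To see why the choice of the first splitting attribute matters, here is a rough sketch (an illustration, not CART, CHAID or C4.5; the City column is omitted since it mirrors State) that scores each attribute by how pure the resulting Profit partitions would be. On this table every attribute scores the same, which is exactly the 'no best attribute' situation described above:

from collections import Counter, defaultdict

# The example table as (Manufacturer, State, Color, Profit) tuples.
rows = [
    ("Smith",   "CA", "Blue",  "High"),
    ("Smith",   "AZ", "Green", "Low"),
    ("Adams",   "NY", "Blue",  "High"),
    ("Adams",   "AZ", "Red",   "Low"),
    ("Johnson", "NY", "Green", "Avg"),
    ("Johnson", "CA", "Red",   "Avg"),
]
FIELDS = {"Manufacturer": 0, "State": 1, "Color": 2}
PROFIT = 3

def purity(field_index):
    # Average share of the majority Profit class within each partition.
    groups = defaultdict(list)
    for row in rows:
        groups[row[field_index]].append(row[PROFIT])
    shares = [Counter(g).most_common(1)[0][1] / len(g) for g in groups.values()]
    return sum(shares) / len(shares)

for name, idx in FIELDS.items():
    print(name, round(purity(idx), 2))
# Whichever attribute is split on first, some facts (Blue -> High, AZ -> Low)
# end up spread across different branches of the tree.
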
Cross Tabulation
Cross tabulation is a very basic and simple form of data analysis, well known in statistics, and widely used for reporting.
 A two dimensional cross-tab is similar to a spreadsheet, with both row and column headings as attribute values. The cells 
in the spreadsheet represent an aggregate operation, usually the number of co-occurrences of the attribute values together. Many cross-tabs are effectively equivalent to a 3D bar graph which displays co-occurrence counts.
Consider the table in the previous section. A cross-tab for the profit level could look as follows:

                CA   AZ   NY   Blue   Green   Red
Profit High      1    0    1      2       0     0
Profit Avg       1    0    1      0       1     1
Profit Low       0    2    0      0       1     1
Here we have not included the fields Manufacturer and City because the cross-tab would look too large. However, as is 
readily seen here, the fact that the count of co-occurrence of Blue and High is above the others indicates a stronger
relationship.
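A cross-tab of this kind is just a set of co-occurrence counts, as the following sketch shows (plain Python over the same example rows, again leaving out Manufacturer and City):

from collections import defaultdict

rows = [  # (State, Color, Profit) from the example table.
    ("CA", "Blue",  "High"), ("AZ", "Green", "Low"),
    ("NY", "Blue",  "High"), ("AZ", "Red",   "Low"),
    ("NY", "Green", "Avg"),  ("CA", "Red",   "Avg"),
]

states, colors, profits = ["CA", "AZ", "NY"], ["Blue", "Green", "Red"], ["High", "Avg", "Low"]

# Count co-occurrences of each profit level with each state and each color.
counts = defaultdict(int)
for state, color, profit in rows:
    counts[(profit, state)] += 1
    counts[(profit, color)] += 1

header = states + colors
print("            " + "  ".join(f"{h:>5}" for h in header))
for profit in profits:
    cells = "  ".join(f"{counts[(profit, col)]:>5}" for col in header)
    print(f"Profit {profit:<4} " + cells)
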
When dealing with a small number of non-numeric values, cross-tabs are simple enough to use and find some conditional 
logic relationships (but not attribute logic, affinities or other forms of logic). Cross-tabs usually run into four classes 
of problems: first when the number of non-numeric values goes up, second when one has to deal with numeric values, third 
when several conjunctions are involved, and fourth when the relationships are not just based on counts.
Agents and belief networks are variations on the cross-tab theme and will be discussed next.
Agents
The term agent is sometimes used (among its other uses) to refer to cross-tabs that are graphically displayed in a 
network and allow some conjunctions (i.e., ANDs). In this context the term agent is effectively equivalent to the term 
field-value pair.
For instance, if we consider the cross-tab above, we may define 6 agents for the goal Profit:High and graphically show 
them as in Figure 8.

Note that here the weights 100 and 50 are simply the percentages of times the values are associated with the goal, (i.e., 
they represent impact ratios, not probabilities). Although it is possible to represent conjunctions here by adding nodes 
to the left of each circle, there are often too many possibilities and for larger data sets problems arise rather quickly.
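One plausible reading of such weights (a sketch of the idea only, not of any particular agent product) is the share of records containing a given field-value pair that also reach the goal Profit=High:

from collections import defaultdict

rows = [  # (State, Color, Profit) from the example table.
    ("CA", "Blue",  "High"), ("AZ", "Green", "Low"),
    ("NY", "Blue",  "High"), ("AZ", "Red",   "Low"),
    ("NY", "Green", "Avg"),  ("CA", "Red",   "Avg"),
]

GOAL = "High"
totals, hits = defaultdict(int), defaultdict(int)
for state, color, profit in rows:
    for pair in (("State", state), ("Color", color)):
        totals[pair] += 1
        if profit == GOAL:
            hits[pair] += 1

# Weight of each field-value pair for the goal, as a percentage.
for pair in sorted(totals, key=lambda p: -hits[p] / totals[p]):
    print(pair, round(100 * hits[pair] / totals[pair]))
# Blue scores 100 and CA/NY score 50; AZ, Green and Red score 0.
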
Like other cross-tab techniques, when dealing with numeric values, agents have to break the numbers into fixed codes, 
(e.g., break Age into three age classes: (1-30), (31-60), (61-100)). Of course, the data may hold patterns that overlap any of these ranges (e.g., the range (27-34)), and these will not be detected by the agent. And if the ranges selected
are too small, there will be too many of them and larger patterns will be missed. Moreover, this inability to deal with 
numeric data causes problems with multidimensional data (Parsaye, 1997). 
Belief Networks
Belief networks (sometimes called causal networks) also rely on co-occurrence counts, but both the graphic rendering and
the probabilistic representation are slightly different from agents.
Belief networks are usually illustrated using a graphical representation of probability distributions (derived from counts).
 A belief network is thus a directed graph, consisting of nodes (representing variables) and arcs (representing probabilistic 
dependencies) between the node variables. 
An example of a belief network is shown in Figure 9, where just the color attribute has been drawn for the sake of 
simplicity. This is the same cross-tab as in the previous section.

Each node contains a conditional probability distribution that describes the relationship between the node and the parents 
of that node. The belief network graph is acyclic, meaning that there are no cycles.
Please compare this to Figure 8, to see that the arcs in this diagram denote the probabilistic dependencies between the 
nodes, rather than impacts computed from the cross-tabs. 
Equational Approaches
The underlying method of pattern expression in these systems is surface construction rather than logical expression or 
co-occurrence counts. Such systems usually use a set of equations to define a surface within a numeric space, then measure
 distances from this surface for prediction. 
The best known example of such a surface is a straight line in a two dimensional space, as in Figure 10a. This has the 
simple equation Y=(a * X) + b and leads to the well known approach of linear regression in statistics. As the parameter 
a varies in this equation, the slope of the line changes. 

Figure 10. a) Regression line: Y = (a * X) + b; b) Parabolic equation: Y = X^2; c) Inverse equation: Y = 1 / X.
Regression works well when the points to be approximated lie on a straight line. But as in Figures 10b and 10c it is also 
possible to use non-linear equations to approximate smoother surfaces.
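As a concrete sketch of the equational approach, the line Y=(a * X) + b can be fitted to a handful of points by ordinary least squares (the points below are invented):

# Least-squares fit of Y = a*X + b to a few invented points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(f"Y = {a:.2f} * X + {b:.2f}")

# Prediction then amounts to reading a new X off this surface.
print("Predicted Y at X=6:", round(a * 6 + b, 2))
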
When the surfaces are even more complex (e.g., Y = X^2 + X + (1 / X)), or when there are several dimensions, the ability of
humans to understand the equations and surfaces decreases rather quickly. The system becomes opaque or black-box. However,
 it is still possible to construct such surfaces. 
In fact, neural nets are known to be universal approximators in theory. They can come very close to any function. However,
 present theory does not specify the practical limits of nets for achieving such approximation on large data sets and most 
neural net implementations rely on sampling.
The equational approaches almost always require the data set to be all numeric. Non-numeric data needs to be coded into
 numbers (the reverse of what cross-tabs do). This often causes a number of problems, as discussed below.
Statistics
There are so many books and articles on statistics that it simply does not make sense to paraphrase them here. The 
interested reader is referred to this widely available body of literature for a further exposition of the ideas. 
The non-technical reader may find an almost non-technical book called The Cartoon Guide to Statistics an interesting source
 for the summary of the main concepts, without the traditionally dry tone of technical presentations. 
Neural Nets
Neural nets are a class of predictive modeling system that work by iterative parameter adjustment. Structurally, a neural 
network consists of a number of interconnected elements (called neurons) organized in layers which learn by modifying the
 connection strengths (i.e., the parameters) connecting the layers, as in Figure 11.
Neural nets usually construct complex equational surfaces through repeated iterations, each time adjusting the parameters 
that define the surface. After many iterations, a surface may be internally defined that approximates many of the points 
within the dataset. 
The basic function of each neuron is to: (a) evaluate input values, (b) calculate a total for the combined input values, 
(c) compare the total with a threshold value and (d) determine what its own output will be. While the operation of each 
neuron is fairly simple, complex behavior can be created by connecting a number of neurons together. Typically, the input
 neurons are connected to a middle layer (or several intermediate layers) which is then connected to an outer layer, as is
seen in Figure 11.

To build a neural model, we first train the net on a training dataset, then use the trained net to make predictions. We
 may, at times, also use a monitoring data set during the training phase to check on the progress of the training. 
Each neuron usually has a set of weights that determine how it evaluates the combined strength of the input signals. Inputs
 coming into a neuron can be either positive (excitatory) or negative (inhibitory). Learning takes place by changing the 
weights used by the neuron in accordance with classification errors that were made by the net as a whole. The inputs are 
usually scaled and normalized to produce a smooth behavior. 
During the training phase, the net sets the weights that determine the behavior of the intermediate layer. A popular 
approach is called backpropagation, in which the weights are adjusted based on how close the network's guesses are.
Incorrect guesses reduce the thresholds for the appropriate connections, as in Figure 12.
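A drastically simplified sketch of this iterative adjustment (a single neuron on an invented two-input problem; real backpropagation pushes errors back through several layers, and the learning rate and epoch count here are arbitrary):

import math, random

# One sigmoid neuron learning a two-input OR-like pattern by gradient descent.
random.seed(0)
data = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]

w = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias, rate = 0.0, 0.5

def neuron(x):
    total = sum(wi * xi for wi, xi in zip(w, x)) + bias   # combined input
    return 1.0 / (1.0 + math.exp(-total))                 # squashing function

for epoch in range(5000):
    for x, target in data:
        out = neuron(x)
        error = target - out
        # Adjust each weight in proportion to its input and the error made.
        for i in range(2):
            w[i] += rate * error * out * (1 - out) * x[i]
        bias += rate * error * out * (1 - out)

print([round(neuron(x), 2) for x, _ in data])   # outputs move toward [0, 1, 1, 1]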

Neural nets can be trained to reasonably approximate the behavior of functions on small and medium sized data sets since 
they are universal approximators. However, in practice they work only on subsets and samples of data and at times run into
problems when dealing with larger data sets (e.g., failure to converge or being stuck in a local minimum).
It is well known that backpropagation networks are similar to regression. There are several other network training paradigms
 that go beyond backpropagation, but still have problems in dealing with large data sets. One key problem for applying 
neural nets to large data sets is the preparation problem. The data in the warehouse has to be mapped into real numbers 
before the net can use it. This is a difficult task for commercial data with many non-numeric values.
Since input to a neural net has to be numeric (and scaled), interfacing to a large data warehouse may become a problem. 
For each data field used in a neural net, we need to perform scaling and coding. The numeric (and date) fields are scaled. 
They are mapped into a scale that makes them uniform (i.e., if ages range between 1 and 100 and number of children between
1 and 5, then we scale these into the same interval, such as -1 to +1). This is not a very difficult task.
However, non-numeric values cannot easily be mapped to numbers in a direct manner since this will introduce unexpected 
relationships into the data, leading to errors later. For instance, if we have 100 cities, and assign 100 numbers to them,
 cities with values 98 and 99 will seem more related together than those with numbers 21 and 77. The net will think these 
cities are somehow related, and this may not be so.
To be used in a neural net, values for nonscalar fields such as City, State or Product need to be coded and mapped into 
new fields, taking the values 0 or 1 as in Figure 10. This means that the field State which may have the 7 values: {CA, 
NY, AZ, GA, MI, TX, VA} is no longer used. Instead, we have 7 new fields, called CA, NY, AZ, GA, MI, TX, VA each taking the 
value 0 or 1, depending on the value in the record. For each record, only one of these fields has the value 1, and the 
others have the value 0. In practice, there are often 50 states, requiring 50 new inputs.
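A sketch of this preparation step (plain Python; the records are invented and the scaling interval is the -1 to +1 mentioned above):

# Preparing mixed records for a neural net: scale numeric fields into [-1, +1]
# and expand each non-numeric field into one 0/1 input per value.
records = [
    {"age": 27, "state": "CA"},
    {"age": 45, "state": "NY"},
    {"age": 61, "state": "AZ"},
]

states = sorted({r["state"] for r in records})   # one new 0/1 input per state
ages = [r["age"] for r in records]
lo, hi = min(ages), max(ages)

def encode(record):
    scaled_age = 2 * (record["age"] - lo) / (hi - lo) - 1   # map [lo, hi] to [-1, +1]
    one_hot = [1 if record["state"] == s else 0 for s in states]
    return [scaled_age] + one_hot

for r in records:
    print(r, "->", [round(v, 2) for v in encode(r)])
# A field with 1,000 distinct values would expand into 1,000 such inputs.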


Now the problem should be obvious: What if the field City has 1,000 values? Do we need to introduce 1,000 new input 
elements for the net? In the strict sense, yes, we have to. But in practice this is not easy, since the internal matrix 
representations for the net will become astronomically large and totally unmanageable. Hence, by-pass approaches are often 
used.
Some systems try to overcome this problem by grouping the 1,000 cities into 10 groups of 100 cities each. Yet, this often 
introduces bias into the system, since in practice it is hard to know what the optimal groups are, and for large warehouses 
this requires too much human intervention. In fact, the whole purpose of data mining is to find these clusters, not ask the human analyst to construct them.
The distinguishing power of neural nets comes from their ability to deal with smooth surfaces that can be expressed in 
equations. These suitable application areas are varied and include finger-print identification and facial pattern 
recognition. However, with suitable analytical effort neural net models can also succeed in many other areas such as 
financial analysis and adaptive control.
Eventually, the best way to use neural nets on large data sets will be to combine them with rules, allowing them to make 
predictions within a hybrid architecture.
Summary and Conclusions
The fundamental techniques used for data mining can be classified into distinct groups, each offering advantages and trade-offs. The modern techniques rely on pattern distillation, rather than data retention. Pattern distillation can be classified into logical, equational and cross-tabulational methods. The underlying structure of these approaches was discussed and compared. Hybrid approaches are likely to succeed best, merging logic and equations with multidimensional analysis. However, the overall structure of how these techniques are used should be viewed in the context of machine-man interaction (Parsaye, 1997).
