
Characteristics of Machine Learning Model (Part 2)

Author: Ricky Ho
Source: http://horicky.blogspot.com/2012/02/characteristics-of-machine-learning.html

Abstract

The article reviews machine learning models and their characteristics.

Bayesian Network

It is basically a dependency graph where each node represents a binary variable and each directed edge represents a dependency relationship. If NodeA and NodeB both have edges to NodeC, this means the probability of C being true depends on the different combinations of the boolean values of A and B. NodeC can in turn point to NodeD, and then NodeD depends on NodeA and NodeB as well.

The learning is about finding, at each node, the joint probability distribution over all incoming edges. This is done by counting the observed values of A, B and C and then updating the joint probability distribution table at NodeC.
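
To make the counting concrete, here is a minimal Python sketch (the node names and toy observations are hypothetical) that learns the table at NodeC, whose parents are NodeA and NodeB, simply by counting (A, B, C) triples; note that updating it with new data is just bumping a counter:

    from collections import defaultdict

    # (a, b) -> [count of C=False, count of C=True]
    counts = defaultdict(lambda: [0, 0])

    observations = [  # each row is one observed (A, B, C) triple
        (True, True, True), (True, False, True),
        (False, True, False), (True, True, True),
        (False, False, False),
    ]

    for a, b, c in observations:
        counts[(a, b)][int(c)] += 1  # incremental learning: bump a counter

    def p_c_given(a, b):
        """P(C=True | A=a, B=b), estimated from the counts."""
        n_false, n_true = counts[(a, b)]
        return n_true / (n_false + n_true)

    print(p_c_given(True, True))  # -> 1.0 for the toy data above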

Once we have the probability distribution table at every node, we can compute the probability of any hidden node (output variable) from the observed nodes (input variables) by using the Bayes rule.
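
A small worked example of that inference step, for the simplest possible network A -> C (a single parent); the numbers are made up:

    # Prior and conditional probability table (hypothetical values)
    p_a = 0.3                              # P(A=True)
    p_c_given_a = {True: 0.9, False: 0.2}  # CPT at NodeC

    # P(C=True) by total probability, then P(A=True | C=True) by Bayes' rule
    p_c = p_c_given_a[True] * p_a + p_c_given_a[False] * (1 - p_a)
    p_a_given_c = p_c_given_a[True] * p_a / p_c

    print(round(p_a_given_c, 3))  # 0.27 / 0.41, roughly 0.659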

The strength of the Bayesian network is that it is highly scalable and can learn incrementally, because all we do is count the observed variables and update the probability distribution table. Similar to a neural network, a Bayesian network expects all data to be binary; a categorical variable needs to be transformed into multiple binary variables as described above. Numeric variables are generally not a good fit for a Bayesian network.

Support Vector Machine

Support Vector Machine takes numeric input and produces binary output. It is based on finding a linear plane with maximum margin that separates the two classes of output. Categorical input can be turned into numeric input as before, and categorical output can be modeled as multiple binary outputs.
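
As a hedged illustration (the article names no library; scikit-learn is my assumption), a maximum-margin linear SVM on a toy binary problem:

    from sklearn.svm import SVC

    X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]  # numeric inputs
    y = [0, 0, 1, 1]                                      # binary output

    # A linear SVM finds the separating plane with maximum margin
    clf = SVC(kernel="linear")
    clf.fit(X, y)
    print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))  # -> [0 1]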

With a different loss function, SVM can also do regression (called SVR). I haven't used this myself, so I can't say much about it.
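
Since the author doesn't elaborate, here is only a minimal SVR sketch under the same scikit-learn assumption; the epsilon parameter controls the insensitive zone of the loss:

    from sklearn.svm import SVR

    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0.1, 1.1, 1.9, 3.2]

    reg = SVR(kernel="linear", epsilon=0.1)
    reg.fit(X, y)
    print(reg.predict([[1.5]]))  # roughly 1.5 for this near-linear toy data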

The strength of SVM is that it can handle a large number of dimensions. With a kernel function, it can handle non-linear relationships as well.
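
A short sketch of the kernel point, again assuming scikit-learn: concentric circles cannot be separated by a plane, but an RBF kernel handles them; the dataset is synthetic.

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf").fit(X, y)
    print(linear.score(X, y))  # near 0.5: no plane separates the circles well
    print(rbf.score(X, y))     # near 1.0: the kernel handles the non-linearity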

Nearest Neighbor

We are not learning a model at all. The idea is to find the K most similar data points from the training set and use them to interpolate the output value: the majority value for categorical output, or the average (or a weighted average) for numeric output. K is a tunable parameter whose best value needs to be picked by cross-validation.
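
A minimal sketch of the idea in Python (the training points and labels are made up); there is no training step, just a scan, a sort, and a majority vote:

    import math
    from collections import Counter

    train = [([1.0, 1.0], "red"), ([1.2, 0.9], "red"),
             ([5.0, 5.1], "blue"), ([4.8, 5.3], "blue")]

    def predict(query, k=3):
        dist = lambda p: math.dist(p, query)            # Euclidean distance
        nearest = sorted(train, key=lambda t: dist(t[0]))[:k]
        labels = [label for _, label in nearest]
        return Counter(labels).most_common(1)[0][0]     # majority vote

    print(predict([1.1, 1.0]))  # -> "red"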

Nearest Neighbor requires the definition of a distance function, which is used to find the nearest neighbors. For numeric input, the common practice is to normalize it by subtracting the mean and dividing by the standard deviation. Euclidean distance is commonly used when the input features are independent; otherwise Mahalanobis distance (which accounts for correlation between pairs of input features) should be used instead. For binary attributes, Jaccard distance can be used.
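
The distance choices above, sketched with NumPy and SciPy (my assumption; any implementation of these formulas would do):

    import numpy as np
    from scipy.spatial import distance

    # Numeric inputs are first z-normalized: subtract the mean,
    # divide by the standard deviation (toy data).
    X = np.array([[1.0, 5.0], [2.0, 3.5], [3.0, 6.0], [4.0, 2.0]])
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)

    u, v = Xn[0], Xn[1]
    print(distance.euclidean(u, v))               # fine for independent features

    VI = np.linalg.inv(np.cov(Xn, rowvar=False))  # inverse covariance matrix
    print(distance.mahalanobis(u, v, VI))         # accounts for correlated features

    a, b = [1, 0, 1, 1], [1, 1, 0, 1]
    print(distance.jaccard(a, b))                 # for binary attributes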

The strength of K nearest neighbor is its simplicity, as no model needs to be trained. Incremental learning is automatic as more data arrives (and old data can be deleted as well). The data, however, needs to be organized in a distance-aware tree so that finding the nearest neighbor is O(log N) rather than O(N). On the other hand, the weakness of KNN is that it doesn't handle a high number of dimensions well. Also, the weighting of the different factors needs to be hand-tuned (by cross-validation over different weighting combinations), which can be a very tedious process.
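
For the distance-aware tree, a k-d tree is one common choice (an assumption, since the article doesn't name a structure); a sketch using SciPy:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((10_000, 3))  # hypothetical 3-dimensional training data

    # Building the tree once makes each lookup roughly O(log N)
    # instead of a full O(N) scan.
    tree = cKDTree(points)
    dist, idx = tree.query([0.5, 0.5, 0.5], k=5)  # 5 nearest neighbors
    print(idx, dist)

Note that such trees also degrade toward linear scans as the number of dimensions grows, which is consistent with the high-dimensionality weakness mentioned above.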