Ensemble Learning


You are currently browsing the archive for the Ensemble Learning category

“Stacked Generalization”.
February 8, 2013 | Permalink In the seminal paper “Stacked generalization“, David H. Wolpert generalizes the idea of cross-validation.

Suppose you had a data set $(x_i, y_i)$ where $i=1, \ldots, n$, $x_i \in A$, and $y_i \in R$.
For cross validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held out data.
Then you might select one of the $k$ classifiers by choosing the one that did best on the held-out data.
Wolpert generalizes this idea by forming an $n$ by $k$ matrix of predictions from the $k$ classifiers.
The entry in the $i$th row and $j$th column would contain the prediction on $x_i$ of the $j$th classifier, trained on partitions of the data set that do not include $x_i$.
Then Wolpert would train a new classifier using the $i$th row of the matrix as input, trying to match $y_i$ as output.
The new classifier $f$ would map $R^k$ into $R$.
Let $G_j$ be the result of training the $j$th classifier on all of the data.


Then the Wolpert generalized classifier would have the form $$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$ Wolpert actually describes an even more general scheme which could have a large number of layers, much like deep belief networks and autoencoders.
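A minimal sketch of this two-level scheme (my own illustration, not Wolpert's code), using polynomial regressors of different degrees as stand-ins for the $k$ classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 60, 3
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)   # y_i in R

def fit_base(j, x_tr, y_tr):
    # base learner j: a degree-(j+1) polynomial fit, standing in for a classifier
    return np.polyfit(x_tr, y_tr, deg=j + 1)

# n-by-k matrix Z of out-of-fold predictions: Z[i, j] is the prediction of
# learner j on x_i, trained on the partitions that do not include x_i.
folds = np.array_split(rng.permutation(n), 3)
Z = np.empty((n, k))
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)
    for j in range(k):
        coef = fit_base(j, x[train], y[train])
        Z[held_out, j] = np.polyval(coef, x[held_out])

# level-1 learner f: R^k -> R, here a linear least-squares fit on the rows of Z
w, *_ = np.linalg.lstsq(Z, y, rcond=None)

# G_j: learner j retrained on all of the data; h(x) = f(G_1(x), ..., G_k(x))
G = [fit_base(j, x, y) for j in range(k)]
def h(x_new):
    return w @ np.array([np.polyval(c, x_new) for c in G])
```

The meta-weights $w$ are learned only from the out-of-fold matrix, while $h$ applies them to the $G_j$ retrained on all of the data, exactly as in the description above.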
The idea of leaving part of the data out is similar to denoising or dropout.
In the paper “Feature-Weighted Linear Stacking“, Sill, Takacs, Mackey, and Lin describe a faster version of stacking used extensively by the second place team in the Netflix Prize contest.
“Rapid object detection using a boosted cascade of simple features”.
December 28, 2012 | Permalink In the widely cited paper “Rapid object detection using a boosted cascade of simple features“, Viola and Jones (CVPR 2001) apply “Haar-like” features and AdaBoost to a fast “cascade” of increasingly complex image classifiers (mostly facial recognition).
They write, “The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest.” The Haar-like decomposition quickly (constant time) creates mostly localized features and AdaBoost learns quickly, so the combination is fast.
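The constant-time claim comes from the integral image. Here is a small sketch (my own illustration with hypothetical names `rect_sum` and `haar_two_rect`, not the authors' code):

```python
import numpy as np

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 grayscale image

# Integral image, padded with a zero row/column so ii[r, c] = img[:r, :c].sum().
ii = np.zeros((7, 7))
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(r, c, h, w):
    # sum over img[r:r+h, c:c+w] from four table lookups -- constant time
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect(r, c, h, w):
    # two-rectangle Haar-like feature: left half minus right half (w even)
    half = w // 2
    return rect_sum(r, c, h, half) - rect_sum(r, c + half, h, half)
```

Once the integral image is built, any Haar-like feature costs a handful of lookups regardless of the rectangle size.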
They report, “In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features.” [emphasis added]

Product of Experts.

December 6, 2012 | Permalink “A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result.” Here’s the Scholarpedia page.

“Rotation Forest: A New Classifier Ensemble Method”

December 4, 2012 | Permalink In “Rotation Forest: A New Classifier Ensemble Method“, Rodríguez and Kuncheva (2006) develop a Random Forest-style method in which the decision boundaries are not axis-parallel splits on the original data features; instead, new features are created by “rotating” the data in the high-dimensional feature space before creating each decision tree.
Each decision tree is created with a different rotation of the data, but all of the rotations are generated from Principal Component Analysis.
The trees are combined to generate the final classifier.
Here is their abstract: We propose a method for generating classifier ensembles based on feature extraction.
To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset.
All principal components are retained in order to preserve the variability information in the data.
Thus, K axis rotations take place to form the new features for a base classifier.
The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble.
Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name “forest.” Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier.
Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest.
The results were favorable to Rotation Forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models.
Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than these in AdaBoost and Random Forest, and more diverse than these in Bagging, sometimes more accurate as well.
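The rotation step described in the abstract can be sketched as follows (my own NumPy illustration of the K-subset PCA rotation, not the WEKA implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))   # toy data: 100 samples, 6 features
K = 3                               # number of feature subsets (a parameter)

# Randomly split the feature indices into K subsets.
subsets = np.array_split(rng.permutation(X.shape[1]), K)

# Block-diagonal rotation: PCA (all components kept) on each feature subset.
R = np.zeros((X.shape[1], X.shape[1]))
for idx in subsets:
    Xs = X[:, idx] - X[:, idx].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    R[np.ix_(idx, idx)] = Vt.T      # principal axes of this subset

X_rot = X @ R   # one base decision tree would be trained on X_rot
```

Because all principal components are kept, each block of $R$ is orthogonal, so the transformation is a genuine rotation of the feature space; each tree in the forest gets a different random split into subsets and hence a different $R$.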

Using Bregman Distance to Grow Decision Trees

September 4, 2012 | Permalink In “Statistical Learning Algorithms Based on Bregman Distances“, Lafferty, Della Pietra, and Della Pietra (1999) take a fairly standard entropy based tree growing algorithm and replace KL distance with Bregman distance.
“In the feature selection step, each linear constraint in a pool of candidate features is evaluated by the reduction in Bregman distance that would result from adding it to the model.”  This is reminiscent of the distribution update step in Softboost and the “information projection” view of Adaboost (See Boosting as Entropy Projection).
The paper is somewhat technical, containing interesting theorem-proving techniques.
Softboost.
August 24, 2012 | 1 comment I’ve been looking at versions of AdaBoost that are less sensitive to noise, such as Softboost. Softboost works by ignoring a number of outliers set by the user (the parameter $\nu$), finding weak learners that are not highly correlated with the weak learners already in the boosted mix, and updating the distribution by KL projection onto the set of distributions that are uncorrelated with the mistakes of the latest learner and do not place too much weight on any single data point.
Softboost avoids overfitting by stopping when no feasible point is found for the KL projection.
In “Soft Margins for AdaBoost”, Rätsch, Onoda, and Müller generalize AdaBoost by adding a softening parameter $\phi$ to the distribution update step.
They relate soft boosting to simulated annealing and minimization of a generalized exponential loss function.
The paper has numerous helpful graphs and experimental data.
August 16, 2012 | Permalink In “How to Use Expert Advice“, Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth (1997) describe boosting type algorithms to combine classifiers.
They prove a bound of $$\sqrt{n \log(M)}$$ on the total regret where $n$ is the number of data points and $M$ is the number of classifiers.
They relate their bounds to the L1 distance between binary strings and apply their results to pattern recognition theory and PAC learning.
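A minimal sketch of the kind of exponential-weights algorithm behind such bounds (my own illustration with synthetic losses; the learning rate $\eta$ is one standard choice, not necessarily the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 500, 10                        # n rounds, M experts (classifiers)
losses = rng.random((n, M))           # loss of each expert in [0, 1] per round
eta = np.sqrt(8 * np.log(M) / n)      # a standard learning-rate choice

w = np.ones(M)
alg_loss = 0.0
for t in range(n):
    p = w / w.sum()                   # current mixture over the experts
    alg_loss += p @ losses[t]         # expected loss of the combined predictor
    w *= np.exp(-eta * losses[t])     # exponential (multiplicative) update

regret = alg_loss - losses.sum(axis=0).min()   # total loss minus best expert's
```

With this $\eta$ the total regret is guaranteed to stay within the $O(\sqrt{n \log M})$ bound quoted above, no matter how the losses are chosen.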
Cooperative Ensemble Learning System (CELS).
August 12, 2012 | Permalink Carl and I have been discussing concepts related to sparsity in neural nets.
One older idea similar to sparsity is negatively correlated or uncorrelated neural nets.
To use this technique, you could train several neural nets on part or all of the data and then keep the neural networks that are less correlated.
The idea used in the CELS algorithm is to add a correlation penalty term to the loss function of the neural net optimization process. In “Simultaneous Training of Negatively Correlated Neural Networks in an Ensemble”, Liu and Yao (1999) say that The idea behind CELS is to encourage different individual networks in an ensemble to learn different parts or aspects of a training data so that the ensemble can learn the whole training data better.
which seems similar to what happens in sparse neural networks.
NC.
August 10, 2012 | Permalink In “Negative Correlation Learning for Classification Ensembles”, Wang, Chen and Yao (2010) modify AdaBoost to put more weight on classifiers that are less correlated with the boosted classifier.
They do this by introducing a penalty term “$$p_t = (h_t - H) \sum_{k \neq t} (h_k - H),$$ where H is the final output by combining the decisions from the individuals.” The idea for this came from the Cooperative Ensemble Learning System (CELS) of Liu and Yao (1999).
Experimental results and performance bounds were given in “The Effectiveness of a New Negative Correlation Learning Algorithm for Classification Ensembles” Wang and Yao (2010).
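A quick numerical illustration of the penalty term (my own sketch, taking the ensemble output $H$ to be the simple average of the individual outputs):

```python
import numpy as np

# outputs of T = 4 individual classifiers on one example
h = np.array([0.9, -0.2, 0.4, 0.7])
H = h.mean()                # ensemble output, taken here as the simple average
d = h - H                   # each individual's deviation from the ensemble

# p_t = (h_t - H) * sum_{k != t} (h_k - H)
p = d * (d.sum() - d)
```

Since the deviations around the average sum to zero, each $p_t$ reduces to $-(h_t - H)^2$: the penalty is most negative for the members that deviate most from the ensemble, so adding it to a member's loss rewards deviation and promotes diversity.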
August 8, 2012 | Permalink I ran across an interesting paper yesterday: “Logistic Regression, AdaBoost and Bregman Distances” (Collins, Schapire, Singer 2000).

Adaboost is a great way to combine several weak predictors into a strong predictor (Freund 2012 is a great reference).

The Bregman distance is a generalization of the squared Euclidean distance and the KL-divergence.
Many well known algorithms that use the Euclidean norm still work when the Euclidean norm is replaced with a Bregman distance (e.g. Dykstra’s algorithm; see Censor and Reich 2000).
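A small sketch of why both are special cases (my own illustration): the Bregman distance generated by a convex function $F$ is $D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$, and choosing $F$ appropriately recovers each distance.

```python
import numpy as np

def bregman(F, grad_F, p, q):
    # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - grad_F(q) @ (p - q)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# F(x) = ||x||^2 gives the squared Euclidean distance
sq = bregman(lambda x: x @ x, lambda x: 2 * x, p, q)

# F(x) = sum x log x (negative entropy) gives the KL-divergence
kl = bregman(lambda x: x @ np.log(x), lambda x: np.log(x) + 1, p, q)
```

For probability vectors the negative-entropy case collapses exactly to $\sum_i p_i \log(p_i/q_i)$, the KL-distance used in the paper's analysis of AdaBoost.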

In the paper, the authors show that Adaboost solves a best approximation problem using the KL-distance and introduce many variants of Adaboost.
Here are some quotes from the paper: 1) “We are now able to borrow methods from the maximum entropy literature for logistic regression and apply them to the exponential loss used by AdaBoost, especially convergence proof techniques.” 2) “Duffy and Helmbold (1999) gave conditions under which a loss function gives a boosting algorithm.
They showed that minimizing logistic loss does lead to a boosting algorithm in the PAC sense.” 3) “Our work builds heavily on that of Kivinen and Warmuth (1999) who, along with Lafferty, were the first to make a connection between AdaBoost and information geometry. They showed that the update used by AdaBoost is a form of “entropy projection.”” (bold face added.)
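As a sketch of the update being discussed (my own toy example, not from the paper): after AdaBoost's exponential reweighting, the latest weak learner's weighted error is exactly 1/2 under the new distribution, which is the defining property of the entropy-projection view.

```python
import numpy as np

# One round of AdaBoost's distribution update on a toy sample of 6 points.
y = np.array([1, 1, 1, -1, -1, -1])      # true labels
h = np.array([1, 1, -1, -1, -1, 1])      # weak learner's predictions
D = np.full(6, 1 / 6)                    # current distribution over examples

eps = D[h != y].sum()                    # weighted error of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)    # the learner's vote weight

D_new = D * np.exp(-alpha * y * h)       # exponential reweighting
D_new /= D_new.sum()                     # normalize
```

Under `D_new` the weak learner is uncorrelated with the labels, i.e., the update has projected the distribution onto the set of distributions on which this learner is no better than chance.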