Artificial Neural Networks

Tips

Follow-up questions are asked as well, so make sure you know a bit more than just the answer to the question itself.

Questions preceded by a "*" have not been answered yet.

Questions

We were given 50 sample questions, 2 of which we will be asked.

Lecture 1

Explain differences between digital computers, neural networks and the human brain.

Taken literally from the text, p. 3-4

Digital Computer	Artificial Neural Network	Human Brain
Processes bits as described by a program (software). Correctness is based on mathematical logic and Boolean algebra.	Processes patterns (examples in the data set). Mathematical non-linear functions of several variables.	Processes patterns
Processes data sequentially (in general) (disadvantage!)	Processes data in parallel (advantage!). Neurons of one layer can process data in parallel.	Processes data in parallel (advantage!)
Needs software to operate properly	Needs training to operate properly. Training is crucial.	Needs training to operate properly
Rigid, not robust. Works with strict rules and algorithms, so a change in one single number can have major consequences (disadvantage!)	Robust against errors and noise in data (advantage!). Even has error-correcting capabilities!	Robust against errors and noise in data (advantage!)

(See question 4 for the applicability of either ANNs or computers.) There seems to be a strong analogy between neural networks and the human brain (biological neural networks). Be careful though: this analogy is NOT strong enough to convince engineers and computer engineers of the correctness of an artificial neural network merely because the corresponding biological neural network works. The correctness of an artificial network MUST follow from mathematical analysis of non-linear functions or dynamical systems and from computer simulations. Another comparison between the ANN and the human brain:

Artificial Neural Network	Human Brain
Low complexity (max. ±100,000 neurons)	High complexity (±100,000,000,000 neurons)
High processing speed (30-200 million operations per second)	Low processing speed (reaction time ±1-2 ms)
Energetic efficiency poor: min. 10^-6 Joule / (operation / second)	Energetic efficiency good: min. 10^-16 Joule / (operation / second)

Conclusion: Artificial Neural Networks must stay modest with respect to the human brain.

The gap in the number of neurons won't be bridged within a few decades.
The energetic efficiency of ANNs is much worse than that of the human brain.
The only advantage of ANNs is their capability of fast processing.

Explain the difference between learning, memorization and generalization.

Learning

Learning is the automatic process of discovering rules from a data set. This learning happens by presenting examples to the entity (a neural network, the human brain). In artificial neural networks, information is stored in the weights of the connections in the network.

There is supervised and unsupervised learning. In supervised learning a "teacher" is present who guides the learning process. The teacher gives the network feedback on the decision it makes for a given example. Unsupervised learning is a learning process without a teacher.

The big advantage of learning is that no expert is needed to derive the rules of the problem domain. This can be done automatically by the learning process of a neural network.

Generalization

Generalization is the ability of a network to correctly classify unseen examples, i.e. examples that do not occur in the training set.

A neural network learns from a training set. The generality of the learned rules must, however, be monitored using a test set. At a certain point in the training process only the result on the training set keeps improving, while the result on the test set gets worse. This is called overtraining or overfitting, and it degrades the generality of the learned rules. Therefore a stopping rule is introduced that ensures the network is trained only as long as the error on the test set still decreases.
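The stopping rule can be sketched in a few lines. This is only an illustration, not the course's implementation: the function name `early_stopping_epoch`, the `patience` parameter and the error values are all made up for the example, and the errors are assumed to have been measured on the test set after each training epoch.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the index of the epoch whose weights should be kept.

    test_errors: error on the held-out set after each epoch (illustrative data).
    patience: hypothetical parameter, the number of consecutive
              non-improving epochs tolerated before training stops.
    """
    best_epoch = 0
    best_error = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(test_errors):
        if err < best_error:
            # The test error still decreases: keep training.
            best_error = err
            best_epoch = epoch
            bad_epochs = 0
        else:
            # The test error no longer decreases: overtraining has started.
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Made-up learning curves: the training error keeps falling,
# the test error starts rising after epoch 3.
train_errors = [0.50, 0.35, 0.25, 0.18, 0.12, 0.08]
test_errors = [0.52, 0.40, 0.33, 0.31, 0.34, 0.39]
print(early_stopping_epoch(test_errors))
```

In practice the weights from the best epoch are saved, so that training can fall back on them once the rule triggers.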

The ability of a neural net to generalize means that no large amount of input data has to be stored. After training on a limited number of examples, a network can still make good decisions when an unseen example is presented at its input.

Memorization

Memorization is learning the details of the training data "by heart". A neural network that only memorizes will fail to generalize and will not be able to process unseen data.

Explain the operating principle of the neuron and the perceptron and the adaptation of the weights

Neuron
- a single unit in a neural network
- the inputs and output of a neuron connect it to other neurons in the network
- propagates an incoming signal if it is strong enough
- weighted input
- non-linear activation function F

Perceptron
- simulates a single neuron
- propagates a signal if the weighted sum of the inputs is greater than a threshold T
- output: the output of the non-linear activation function applied to the weighted sum of the inputs

Learning in an ANN: learning to associate inputs with outputs by adjusting the weights on the inputs of the perceptrons (e.g. via the backpropagation algorithm).

The task of the perceptron in the learning process is therefore to adjust its weights so as to associate the best possible output with its inputs.
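As an illustration of the weight adaptation, here is a minimal sketch of a threshold perceptron trained with the classic perceptron learning rule on the logical AND function. The choice of AND, the learning rate of 1.0 and the function names are assumptions made for this example. The course may present the rule in a different notation.

```python
def predict(weights, threshold, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds the threshold T."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0


def train_perceptron(samples, n_inputs, rate=1.0, epochs=100):
    """Perceptron rule: w <- w + rate * (target - output) * x."""
    weights = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            delta = target - predict(weights, threshold, inputs)
            if delta != 0:
                errors += 1
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                # The threshold acts like a weight on a constant input of -1.
                threshold -= rate * delta
        if errors == 0:  # every example is classified correctly
            break
    return weights, threshold


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, t = train_perceptron(AND, 2)
print(w, t, [predict(w, t, x) for x, _ in AND])
```

Because AND is linearly separable, the rule stops changing the weights after a finite number of corrections.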

In which areas is it useful to apply neural networks and where is it not useful?

3 areas where ANNs are well suited (see p. 21 & 22):

Classification
- Classifying examples from a large data set → data mining
- Mainly expert systems and pattern recognition
- Fraud detection, medical diagnosis, OCR, recognition of numbers, letters and faces, speech recognition, speaker recognition, quality control

Neural prediction (long and short term)
- e.g. Santa Fe
- predicting e.g. gas consumption, stock prices, customer behaviour
- simulation of a dynamical model (space shuttle)
- BUT: insight is needed, as well as analyses and PC simulations

Optimizing and controlling mechanical, chemical and biochemical processes
- traditionally: linear controllers
- ANNs are non-linear → much better
- e.g. pendulum

ANNs are NOT suited for:

fixed, exact calculations
systems whose behaviour is perfectly specified (e.g. Office)

Sometimes all physical rules are known and one could work exactly, but the rules are so complex that an ANN is still the better choice.

An ANN can perform many tasks, but don't ask why it does something! It is hard to extract exactly what the ANN has learned, because everything is converted to the numerical values it works with: inputs, outputs and weights.

Lecture 2

Which kind of activation functions can one use for neural networks? What are advantages and disadvantages with respect to learning?

The standard choice is the sigmoid function (or soft-limiting function), either in symmetric (bipolar: -1,1) or asymmetric (unipolar: 0,1) form. The sigmoid function is global in the sense that it divides the feature space into two halves, one where the response is approaching 1 and another where it is approaching 0 (-1). Hence it is very efficient for making sweeping cuts in the feature space.

The simplest activation function is a step function: if the total net input is less than 0 (or more generally, less than some threshold T) then the output of the neuron is 0, otherwise it is 1.

Too bad it can't help us out a lot here, although I found this interesting and relevant piece to read on the net:

Activation functions for the hidden units are needed to introduce nonlinearity into the network. Without nonlinearity, hidden units would not make nets more powerful than just plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a linear function of linear functions is again a linear function. However, it is the nonlinearity (i.e, the capability to represent nonlinear functions) that makes multilayer networks so powerful. Almost any nonlinear function does the job, except for polynomials. For backpropagation learning, the activation function must be differentiable, and it helps if the function is bounded; the sigmoidal functions such as logistic and tanh and the Gaussian function are the most common choices. Functions such as tanh or arctan that produce both positive and negative values tend to yield faster training than functions that produce only positive values such as logistic, because of better numerical conditioning (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html).

For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant, hence the gradient either does not exist or is zero, making it impossible to use backprop or more efficient gradient-based training methods. Even for training methods that do not use gradients--such as simulated annealing and genetic algorithms--sigmoid units are easier to train than threshold units. With sigmoid units, a small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs.

For the output units, you should choose an activation function suited to the distribution of the target values:

For binary (0/1) targets, the logistic function is an excellent choice (Jordan, 1995).
For categorical targets using 1-of-C coding, the softmax activation function is the logical extension of the logistic function.
For continuous-valued targets with a bounded range, the logistic and tanh functions can be used, provided you either scale the outputs to the range of the targets or scale the targets to the range of the output activation function ("scaling" means multiplying by and adding appropriate constants).
If the target values are positive but have no known upper bound, you can use an exponential output activation function, but beware of overflow.
For continuous-valued targets with no known bounds, use the identity or "linear" activation function (which amounts to no activation function) unless you have a very good reason to do otherwise.
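The activation functions mentioned above can be written down compactly. The sketch below uses only Python's standard `math` module. The function names are chosen for this example, and the derivatives are included because backpropagation needs them.

```python
import math


def step(x, threshold=0.0):
    """Threshold unit: 0 below the threshold, 1 otherwise (gradient is zero or undefined)."""
    return 1.0 if x >= threshold else 0.0


def logistic(x):
    """Unipolar sigmoid with outputs in (0, 1), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_derivative(x):
    y = logistic(x)
    return y * (1.0 - y)


def tanh_derivative(x):
    """Derivative of the bipolar sigmoid tanh, whose outputs lie in (-1, 1)."""
    return 1.0 - math.tanh(x) ** 2


def softmax(values):
    """Extension of the logistic function to 1-of-C coded targets."""
    m = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(logistic(0.0), logistic_derivative(0.0), math.tanh(0.5), softmax([1.0, 2.0, 3.0]))
```

The step function has no useful derivative, which is the reason given above for preferring sigmoid units when training with gradient-based methods.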

Explain the difference between feedforward and recurrent nets. Give some examples.

In a feedforward network, the connections between units do not form cycles, i.e. the network contains no feedback loops. Signals of unit activation travel in one direction only, namely from the input layer, possibly via intermediate layers containing hidden units, to the output layer. Such networks usually produce a response to an input quickly.

Example: generalized linear models

In a feedback network, or recurrent network, there are cycles in the connections. In some feedback networks, each time an input is presented the NN must iterate for a potentially long time before it produces a response. These NNs are usually more difficult to train than feedforward networks.

Example: Hopfield net

What is the difference between supervised and unsupervised learning? Give examples.

The training set for a TLU (threshold logic unit) consists of a set of pairs (v, t), where v is an input vector and t is the target class or output ('1' or '0') that v belongs to. This type of training is known as supervised training, since we tell the net what output is expected for each vector. In supervised learning there is a teacher who gives the desired output values for given input values. Algorithms can use the difference between the output and the desired output (the error) to adapt the weights so as to minimize the error. In unsupervised learning there is no teacher and there are no target output values. Here, the task of the network is to find regularities in the data (important statistical features, clusters). Examples of supervised learning: perceptrons. Examples of unsupervised learning: competitive learning, self-organizing maps, PCA (Hebbian learning, Oja).

From http://www.faqs.org/faqs/ai-faq/neural-nets/part1/:

In supervised learning, the correct results (target values, desired outputs) are known and are given to the NN during training so that the NN can adjust its weights to try to match its outputs to the target values. After training, the NN is tested by giving it only input values, not target values, and seeing how close it comes to outputting the correct target values.
In unsupervised learning, the NN is not provided with the correct results during training. Unsupervised NN's usually perform some kind of data compression, such as dimensionality reduction or clustering. See "What does unsupervised learning learn?"

The distinction between supervised and unsupervised methods is not always clear-cut. An unsupervised method can learn a summary of a probability distribution, and that summarised distribution can then be used to make predictions. Furthermore, supervised methods come in two sub-varieties: auto-associative and hetero-associative. In auto-associative learning, the target values are the same as the inputs, whereas in hetero-associative learning, the targets are generally different from the inputs. Many unsupervised methods are equivalent to auto-associative supervised methods. For more details, see "What does unsupervised learning learn?" Examples: see also question 1 for supervised learning.

Unsupervised Learning (explaining)

Hebbian Learning
- Extract redundancies from data
- Maximize output when input is similar to earlier inputs
- Applications
  - Principal component analysis
  - Clustering
  - Feature Mapping
Competitive learning
- Only one output unit can be on
  - Unit that wins inhibits all others
  - Output units are called winner-take-all or grandmother cells
- Applications are clustering or categorization

* Explain the discrete and continuous perceptron and delta learning rule. What is the difference and why?

Explain the difference between linearly separable and linearly non-separable tasks. Which tasks can be handled by a single layer neural network or by a multilayer neural network?

No certainty about this solution!

If the points are linearly separable, a single layer (i.e. a perceptron) is enough, because each neuron effectively draws one line. If the points are not linearly separable, you have to draw several lines and then evaluate them together in a second layer, which decides between "class A" and "class B". That second layer then consists of just 1 neuron.

So suppose you are in this situation: not linearly separable. Here you can draw 3 lines => 3 neurons in the first layer. In plain words, each one says whether a point lies to the left of (above) or to the right of (below) its line. In the second layer you then look at the 3 outputs, and from the position relative to the three lines you can determine the class (e.g. above line 1, below line 2, above line 3 → class A).

This is only possible with a multilayer network. If the points are linearly separable, 1 line is enough, with left = "class A" and right = "class B".
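The idea of "drawing several lines and combining them in a second layer" can be made concrete with XOR, the textbook example of a task that is not linearly separable. XOR is chosen here as a stand-in for the situation described above. It needs only 2 lines instead of 3, and the hand-picked weights below are one possible solution, not the result of training.

```python
def threshold_unit(weights, bias, inputs):
    """Outputs 1 if the point lies on the positive side of the unit's line."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


def xor_net(x1, x2):
    # First layer: two neurons, each drawing one line in the input plane.
    h1 = threshold_unit((1, 1), -0.5, (x1, x2))  # above the line x1 + x2 = 0.5
    h2 = threshold_unit((1, 1), -1.5, (x1, x2))  # above the line x1 + x2 = 1.5
    # Second layer: a single neuron that fires when the point lies
    # between the two lines (h1 on, h2 off).
    return threshold_unit((1, -1), -0.5, (h1, h2))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

No single threshold unit can reproduce this table, because no single line separates the points (0,1) and (1,0) from (0,0) and (1,1).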

* Explain the backpropagation algorithm and the choice of the parameters and early stopping rule.

Lecture 3

What is an attractor neural network and how does it work? Give an example.

See p. 1-7, starting from lectures 3-7 (the part by the other professor)

Does the Hebb learning rule guarantee pattern stability in the Hopfield model? Explain.

See p. 6-7

Is the loading capacity of the Hopfield model bounded? Explain.

See p. 7-10

The Hopfield model is used to store a set of p given patterns Z_µi so that when a new, unknown pattern F is presented to the network, the network responds by producing the stored pattern Z_j that most closely resembles F.

The loading capacity of the model (the maximum number of patterns that can be stored) depends on the maximum acceptable error, i.e. the number of bits that may be corrupted when storing this maximum number of patterns, p_max. Clearly, this error increases as we increase the number of patterns to be stored.

The capacity p_max is proportional to N (but never higher than 0.138N) if we are willing to accept a small percentage of errors in each pattern, with N being the number of units in the network.

It is proportional to N/log N if we want most or all patterns to be recalled perfectly.

So yes, the loading capacity is bounded.

Simple reasoning leads to the same conclusion:

Obviously, the fewer patterns are stored with a given number N of units, the easier it is to recall which of the stored patterns most resembles a newly presented pattern.

If too many patterns are stored, the network tends to remember nothing ("catastrophic forgetting"). Also, the larger the network (more units), the more patterns can be stored.

So it is clear that the storage capacity is in some way proportional to N, the number of units.
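To make storage and recall concrete, here is a small sketch of a Hopfield network with N = 8 units that stores 2 patterns using the Hebb rule and recovers one of them from a corrupted version. The patterns, the function names and the asynchronous update order are choices made for this example. With p = 2 and N = 8 the load is well below the 0.138N limit mentioned above.

```python
def hebb_weights(patterns):
    """Hebb rule: w_ij = (1/N) * sum over patterns of p_i * p_j, with w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=10):
    """Asynchronous updates S_i <- sgn(sum_j w_ij S_j) until nothing changes."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            h = sum(w[i][j] * state[j] for j in range(n))
            new = 1 if h >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:  # a stable state (attractor) has been reached
            break
    return state


patterns = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
w = hebb_weights(patterns)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with its first bit flipped
print(recall(w, noisy) == patterns[0])
```

Storing many more patterns in the same 8 units makes the cross-talk between patterns grow, and the stored patterns stop being stable states.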

Explain the concept of energy function in neural networks?

For an extensive explanation of the energy function, see the course notes, the Hopfield Model, pages 11 to 13. Below I try, as the question asks, to sketch the importance of the energy function.

The energy function is a function that has local minima at the attractors of a Hopfield network. It always decreases (or stays constant) as the system evolves according to the dynamical rule. When the system is started in a valley, the network converges to the minimum of that valley.

The energy function exists if the weights of the network are symmetric. For biological networks this is unacceptable, but for artificial neural networks it is a 'smart' strategy, because the energy function then exists. It is easy to show that applying the dynamical rule can only decrease the energy function (see p. 12 for a derivation).

We can therefore regard the energy function as something that is minimized in the stable states of a neural net. This is useful mainly because, for given patterns, one can find suitable weights w_ij by means of the energy function. It is thus possible to write down an energy function that has a minimum corresponding to a particular problem. The weights w_ij can then be determined from the coefficients of S_iS_j. (See p. 13)
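For reference, the standard textbook form of the Hopfield energy function, and the argument that a single update cannot increase it, can be written as follows. This is a sketch assuming symmetric weights (w_ij = w_ji), no self-connections (w_ii = 0), states S_i = ±1 and the asynchronous update S_i' = sgn(h_i). The names H and h_i are notation chosen here and may differ from the derivation on p. 12 of the course.

```latex
H = -\frac{1}{2} \sum_{i} \sum_{j \neq i} w_{ij} S_i S_j ,
\qquad
h_i = \sum_{j \neq i} w_{ij} S_j

% Updating a single unit i from S_i to S_i' = \operatorname{sgn}(h_i),
% and using w_{ij} = w_{ji} to collect all terms that contain S_i:
\Delta H = -(S_i' - S_i)\, h_i =
\begin{cases}
0 & \text{if } S_i' = S_i \\
-2\,|h_i| \le 0 & \text{if } S_i' = -S_i
\end{cases}
```

The symmetry of the weights is what allows the two sums containing S_i to be combined into the single term -S_i h_i, which is why the energy function is only guaranteed to exist in the symmetric case.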

Explain the projection method.

See p. 19 and the preceding/following pages

* Can one describe a temporal sequence of patterns with the Hopfield model? Explain.

Lecture 4

How can one solve optimization problems using the Hopfield model? Give an example.

As you know, a Hopfield model is a method for minimizing an energy function. You can, however, also view this as an optimization problem by working with cost functions instead of energy functions.

By representing such optimization problems with a Hopfield model, we effectively create a particular kind of parallelizable algorithm for solving that specific optimization problem. That algorithm can then be implemented fairly easily by slightly adapting existing Hopfield implementations.

[Cf. 'Optimization Problems', p. 32 of Lecture 4]

In 1986 Hopfield and Tank built a VLSI chip for such networks, and these do indeed converge very quickly to a local minimum. Unfortunately there is no guarantee that the solution is optimal, but experience shows that it is usually a fairly good solution [cf. Traveling Salesman Problem].

* Explain the energy function of the travelling salesman problem.

Design a network for the weighted matching problem.

See p. 33

What is reinforcement learning? Give some examples.

In supervised learning we have so far always had a "teacher" who gave the correct output: we could then compare the obtained output with the correct output.

In reinforcement learning we have no "teacher" but a "critic". The critic only indicates whether the obtained output is right or wrong.

In reinforcement learning we have to picture the network in an environment: the environment provides the input to the network.

The network computes the output and returns it to the environment.

The environment gives a reinforcement signal to the network (right or wrong).

We can distinguish 3 classes of reinforcement learning problems, depending on the nature of the environment.

Class 1:

In the simplest case the reinforcement signal is always the same for a given input-output pair.

The input patterns are chosen by the environment in random order or according to a schedule, but independently of earlier outputs.

Class 2:

Here an input-output pair only determines the probability that the signal is "right". This probability is fixed. The signal is still "right" or "wrong", but for a single input-output pair it will not always be the same. The probability that the signal is "right" does stay the same, however.

Here too the order of the inputs does not depend on the past.

This kind of problem occurs often in "modelling animal learning", "economic systems" and in simple games.

It is not a trivial problem: how do we determine the output with the highest probability of a positive signal?

An example of a problem with only 1 output at a time, which can take only 2 values, is the two-armed bandit problem: each time we may pull 1 of the 2 arms, without knowing what our chance of "scoring" is. http://www.willamette.edu/~gorr/classes/cs449/Reinforcement/Bandit/bandit.html

There you can play the three-armed bandit problem.

Class 3:

This is the most general case. The environment itself can be governed by a complicated dynamical process. Both the reinforcement signals and the input patterns may depend on the past history of the output values.

A classic example of such an application is a game in which the opponent is the environment. Suppose we want the network to play chess: it then only gets a real signal (win or lose) after a whole series of moves. How should we then assign a signal to the intermediate moves? This is called the "credit assignment problem".

The exploration-exploitation problem:

(This is not from the course notes, but it appears on every site about reinforcement learning that I can find.)

One of the challenges in reinforcement learning is the choice we have to make between exploration and exploitation: if we find an action that receives a high reward (reinforcement signal), we should use this action often (exploit), but there may be even better actions, so we also have to try the other actions (explore).

When working with a stochastic task (one that does not always give the same reward for an input-output pair), we have to apply each action a number of times in order to estimate well which reward we are likely to get for it. (To see this, it is best to play the three-armed bandit mentioned above a few times, each time with new rewards.)
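The trade-off between exploring and exploiting can be illustrated on the two-armed bandit with an epsilon-greedy strategy. Epsilon-greedy is not taken from the course notes. It is one common, simple strategy used here as an assumption, and the win probabilities, the value of epsilon and the function name are made up for the example.

```python
import random


def run_bandit(win_probs, pulls=5000, epsilon=0.1, seed=0):
    """Play a bandit whose arms pay 1 with the given (hidden) probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)    # how often each arm was pulled
    values = [0.0] * len(win_probs)  # running estimate of each arm's win rate
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: try a random arm, it might be better than we think.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: pull the arm with the best estimate so far.
            arm = max(range(len(win_probs)), key=lambda a: values[a])
        # The critic only says "right" (1) or "wrong" (0).
        reward = 1 if rng.random() < win_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values


counts, values = run_bandit([0.3, 0.7])
print(counts, values)
```

With epsilon = 0 the strategy never explores and can stay stuck on the first arm it tried. With epsilon = 1 it never uses what it has learned.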

* Explain the recurrent back-propagation algorithm.

* Explain the associative reward-penalty algorithm.

Explain learning with a critic. Can it be applied for the control of a plant?

See question 20

Lecture 5

What sort of general tasks can one perform with unsupervised learning?

See the course notes, Unsupervised learning, p. 14-15. Unsupervised learning:

no teacher
network with inputs and outputs
no feedback from the environment about the correctness of the outputs
the network itself must discover patterns, features, ... in the input and encode them in the output

General tasks:

Familiarity

A single continuous-valued output can tell how much a new input pattern resembles typical or average patterns seen in the past. The network would gradually learn what is typical.

Principal Component Analysis

To extend the previous case to several units, one has to construct a multi-component basis, or set of axes, along which the similarity to previous examples is measured. A commonly used approach from statistics, known as "principal component analysis", uses the dominant eigenvector directions of the correlation matrix of the input patterns.

Clustering

A set of binary outputs, with only one active at a time, could tell us which of several categories an input pattern belongs to. The appropriate categories would have to be found by the network on the basis of the correlations in the input patterns. Each cluster of similar or nearby patterns would then be classified as a single output class.

Prototyping

The network could form categories as in the previous case, but give as output a prototype or exemplar of the appropriate class. It would then have the function of an associative memory, but the memories would be found directly from the input patterns, not imposed from outside.

Encoding

The output could be an encoded version of the input in fewer bits, retaining as much relevant information as possible. This could be used for data compression prior to transmission over a channel with limited bandwidth, assuming that a decoding network can also be constructed.

Feature mapping (global organization of the output)

If the output units had a fixed geometrical arrangement, such as a two-dimensional array, with only one unit active at a time, they could map input patterns to different points in this arrangement. The idea is to make a topographic map of the input, so that similar input patterns always activate nearby outputs. We expect a global organization of the output units to emerge.

These cases are not necessarily distinct and can also be combined in various ways. For example, the encoding problem can be handled using "Principal Component Analysis" or "Clustering". Principal Component Analysis (PCA) can itself be used for dimensionality reduction of the data before applying "clustering" or feature mapping.
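The "dominant eigenvector of the correlation matrix" used by PCA can be computed without any library by power iteration. The sketch below is a plain numerical illustration, not a neural learning rule. The data points are made up, and the correlation matrix is assumed to be the uncentred average of the products x_i * x_j over the input patterns.

```python
import math


def correlation_matrix(patterns):
    """C_ij = average over all patterns of x_i * x_j."""
    n = len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) / len(patterns)
             for j in range(n)] for i in range(n)]


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Made-up 2D patterns scattered roughly along the direction (2, 1).
patterns = [(2.0, 1.0), (-2.0, -1.0), (4.0, 2.1), (-4.0, -1.9)]
C = correlation_matrix(patterns)
v = dominant_eigenvector(C)
print(v)
```

Projecting each pattern onto `v` gives a one-dimensional code that keeps most of the variation in the data, which is the dimensionality reduction mentioned above.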

* Explain the standard competitive learning rule.

* Explain vector quantization and learning vector quantization. What is the difference?

* Explain the adaptive resonance theory algorithm.

* What is a self-organizing map? Give an example.

Are hybrid learning schemes useful? Explain with an example.

See p. 29 of the unsupervised learning part (quite far toward the back)

Lecture 6

* Explain Oja's rule in the framework of unsupervised Hebbian learning.

* Give the connection between Sanger's rule and principal component analysis.

Discuss the relevance of input representation and the number of hidden units in a network construction.

See p. 45

What is the aim of pruning and construction algorithms? Give an example.

See 6.6, p. 60-63

To obtain good generalization ability one has to build into the network as much knowledge about the problem as possible and limit the number of connections appropriately. We need algorithms that optimize not only the weights but also the architecture itself. (optimizing number of layers and units per layer)

There are of course several different criteria judging the "optimality" of the network. There is the generalization ability, learning time, number of units etc. Given so many different hardware restrictions, the cost function for the architecture itself might get pretty complicated. So we mainly focus on using as few units as possible, this should not only reduce computational costs and perhaps training time, but should also improve generalization.

We could mount a search in the space of possible architectures, training each candidate with back-propagation. Such a search can be carried out by a so-called genetic algorithm. More promising approaches, however, construct or modify an architecture to suit a particular task, proceeding incrementally: either starting with too many units and taking some away, or starting with too few and adding some more.

Pruning: starts with too many units and takes some away
Construction algorithm: start with a too small network and gradually grow one of the appropriate size. (e.g.: tiling algorithm, cascade correlation algorithm, upstart algorithm)

Explain the tiling algorithm.

See p. 63-66, figures 6.16 & 6.17

Start with a single unit that tries to produce the correct output on as many training examples as possible.
Add units to take care of the examples that the first unit got wrong.
Only add as many units as needed to cover all examples.
Similar to the decision tree learning algorithm (see machine learning).
Cross-validation can be used to decide whether a network of the right size has been found.

Start at each layer with a master unit that does as well as possible on the target task, and then add further ancillary units until the representation on that layer is faithful. The next layer is constructed in just the same way, using the output of the previous layer as its input. Eventually a master unit itself classifies all patterns correctly, and is therefore the desired output unit.

The master unit in each layer is trained so as to produce the correct target output (±1) on as many of its input patterns as possible. This can be done by the pocket algorithm, a variant of the perceptron learning rule: when the data is not linearly separable, it searches weight space and stores the set of weights that has had the longest unmodified run of successes so far. The algorithm is stopped after a chosen time t. The ancillary units in each layer are also trained using the pocket algorithm, but only on subsets of the patterns (subsets of patterns that still share the same representation on that layer despite having different targets, which makes the representation unfaithful; their representations should all be different).
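The pocket algorithm described above can be sketched as follows. The number of steps plays the role of the chosen time t. The function names, the random presentation order with a fixed seed, and the AND data set are assumptions made for this example.

```python
import random


def classify(w, x):
    """Threshold unit with +1/-1 output; the last weight is the bias."""
    xs = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else -1


def pocket(samples, n_inputs, steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (n_inputs + 1)
    pocket_w = list(w)  # best weights found so far (kept "in the pocket")
    best_run = 0        # longest run of consecutive successes so far
    run = 0             # current run of the working weights
    for _ in range(steps):
        x, target = rng.choice(samples)
        if classify(w, x) == target:
            run += 1
            if run > best_run:
                best_run = run
                pocket_w = list(w)
        else:
            # Ordinary perceptron correction, which also ends the current run.
            xs = list(x) + [1.0]
            w = [wi + target * xi for wi, xi in zip(w, xs)]
            run = 0
    return pocket_w


AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = pocket(AND, 2)
print(w, [classify(w, x) for x, _ in AND])
```

On data that is not linearly separable the working weights never settle, but the pocket still holds the weights that survived the longest run of correct answers.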

Lecture 7

* What is the relation between the Bayesian and maximum likelihood approach?

Explain: Bayesian learning treats the issue of model complexity differently than cross-validation does.

Both Bayesian learning and cross-validation are techniques that can be used to decide on the complexity of a neural network.

Bayesian Learning

From [1] I take the following:

Bayesian methods will provide solutions to such fundamental problems as:

How to judge the uncertainty of predictions. This can be solved by looking at the predictive distribution, as described above.
How to choose an appropriate network architecture (e.g., the number of hidden layers, the number of hidden units in each layer).
How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to which different inputs are relevant.).

More on choosing an architecture:

Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. This approach is emphasized by Neal (1996), who argues that there is no statistical need to limit the complexity of the network architecture when using well-designed Bayesian methods. It is also possible to choose between architectures in a Bayesian fashion, using the "evidence" for an architecture, as discussed by MacKay (1992a, 1992b).

How is this problem solved with Bayesian learning?

The result of Bayesian training is a posterior distribution over network weights. If the inputs of the network are set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a single-valued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is.

Cross Validation

Here I quote from [2]:

Cross-validation is a method for estimating generalization error based on "resampling". The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

A technical explanation of cross-validation can be found in [2].

How does cross-validation go about determining the complexity of a model:

Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.
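Model selection by cross-validation can be sketched with two very simple candidate models instead of neural networks. Everything here is an assumption made for illustration: the k-fold splitting scheme, the two candidates (a constant and a straight line) and the synthetic data. The same loop would apply to networks with different numbers of hidden units.

```python
def k_fold_error(fit, data, k=5):
    """Average squared error on the held-out fold, over k folds."""
    total, count = 0.0, 0
    for fold in range(k):
        test = data[fold::k]  # every k-th example is held out
        train = [d for i, d in enumerate(data) if i % k != fold]
        model = fit(train)
        for x, y in test:
            total += (model(x) - y) ** 2
            count += 1
    return total / count


def fit_constant(train):
    """Simplest candidate: always predict the mean target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """More complex candidate: least-squares straight line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)


# Synthetic data: y = 2x plus a small alternating disturbance.
data = [(x, 2.0 * x + (0.1 if x % 2 else -0.1)) for x in range(20)]
candidates = {"constant": fit_constant, "line": fit_line}
errors = {name: k_fold_error(fit, data) for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

The model with the smallest estimated generalization error is selected. Unlike the Bayesian approach, this requires training every candidate k times.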

References:

[1] http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html

[2] http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

* Why does one favour small values in Bayesian learning of network weights? Explain.

* What are the basic ideas implemented by support vector machines? Explain.

* What are the motivations for fuzzifying the perceptron rule? How can one do this?

Lecture 8

* Explain the problem of echo cancellation and the use of ADALINE networks to overcome it.

* What is a cellular neural network and how does it work?

* What are typical application areas of cellular neural networks?

* What is a CNN template?

Lecture 9

* What are strong points and drawbacks for applications of neural networks to time-series prediction?

* Discuss time-series prediction competitions and results obtained by neural networks.

* Discuss the use of neural networks in the ALVINN vehicle control system.

* Explain the use of backpropagation in control applications. What is backpropagation through the plant?

* Explain the difference between feedforward and feedback neural control.

Lecture 10

* Discuss different neural control strategies for controlling an inverted pendulum.

* Discuss the design of a neural network for controlling a truck backer-upper.

* Discuss neural networks for use in ovarian cancer classification and prediction.

* Discuss the use of neural networks in fraud detection applications.

* Explain the receiver operating curve, its use and its role in comparing classification systems.

Follow-up questions

(General follow-up questions can be put here. If a question extends another question, it is best placed with that other question.)