Why are sparse autoencoders sparse?

In the previous blog post, I summarized the general structure of neural networks and how the training algorithm works. Neural networks can be broadly divided into supervised and unsupervised ones. In this post I'll summarize a more practical unsupervised neural network: the sparse autoencoder.


1. Introduction


The figure above shows the general structure of a sparse autoencoder. Its most distinctive feature is that the number of input-layer nodes (excluding the bias node) equals the number of output-layer nodes, while the hidden layer has fewer nodes than either. The purpose of the model is to learn a function $h_{W,b}(x) \approx x$ and, in doing so, to obtain a low-dimensional representation of the original data (the activations of the hidden layer). The model can only find a good low-dimensional representation if the input data has some latent structure, for example correlated features; in that case a sparse autoencoder can learn a low-dimensional representation similar to the one found by principal component analysis (PCA). If every feature of the input is independent of the others, the learned low-dimensional representation will be much less useful.

In a supervised neural network, the training data are pairs $(x^{(i)}, y^{(i)})$. We want the model to predict $y$ accurately, that is, to minimize a loss function (which loss to use depends on the actual task). To avoid overfitting, we usually also add a penalty (regularization) term. In an unsupervised network the training data are just $x^{(i)}$, with no labels (no $y$ values), but we still need a target value to construct the loss function. For a sparse autoencoder we set $y^{(i)} = x^{(i)}$. In my view, the sparse autoencoder is a special case of a general neural network: it only requires that the output be close to the input, and it additionally makes the hidden layer sparse.
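As a minimal sketch of this idea, assuming NumPy, a sigmoid activation in both layers, and made-up layer sizes (none of which are fixed by the text above), the forward pass of an autoencoder and its reconstruction error with the target set to the input could look like this. The names `forward`, `W1`, `b1`, `W2`, `b2` are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """X has shape (m, n): one training example per row."""
    A2 = sigmoid(X @ W1.T + b1)   # hidden-layer activations, shape (m, s2)
    A3 = sigmoid(A2 @ W2.T + b2)  # reconstruction of the input, shape (m, n)
    return A2, A3

rng = np.random.default_rng(0)
n, s2, m = 8, 3, 5                # hypothetical sizes: inputs, hidden nodes, examples
X = rng.random((m, n))            # toy inputs in [0, 1)
W1 = rng.normal(0.0, 0.1, (s2, n))
b1 = np.zeros(s2)
W2 = rng.normal(0.0, 0.1, (n, s2))
b2 = np.zeros(n)

A2, A3 = forward(X, W1, b1, W2, b2)
Y = X                             # unsupervised: the target is the input itself
reconstruction_error = 0.5 * np.mean(np.sum((A3 - Y) ** 2, axis=1))
print(A2.shape, A3.shape, reconstruction_error)
```

The only difference from a supervised network in this sketch is the line `Y = X`; everything else is an ordinary feedforward pass.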


2. What is sparsity


As mentioned above, the hidden layer generally has fewer nodes than the input layer. We can also give the hidden layer more nodes than the input layer, however, and still get the same effect by adding a sparsity constraint. Sparsity means that most of the hidden nodes are inhibited and only a small fraction are active. What do inhibited and active mean? If the nonlinearity is the sigmoid function, a neuron is active when its output is close to 1 and inhibited when its output is close to 0; if tanh is used, a neuron is active when its output is close to 1 and inhibited when its output is close to -1.



What constraint can we add to make the hidden-layer output sparse? A sparse autoencoder tries to keep the average activation of each hidden neuron at a relatively small value.

The average activation of hidden neuron $j$ over the $m$ training examples is:

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}\left(x^{(i)}\right)$$

where $a_j^{(2)}(x)$ denotes the activation of hidden neuron $j$ when the input is $x$.


To keep the average activation $\hat{\rho}_j$ of each hidden neuron small, we introduce $\rho$, called the sparsity parameter, which is generally a small value close to 0, and require $\hat{\rho}_j \approx \rho$. This way, most hidden-layer nodes are only weakly active.
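To make the average activation concrete, here is a small NumPy sketch. The activation matrix `A2` is made up for illustration (rows are training examples, columns are hidden neurons), and the value `rho = 0.05` is only an assumed example of a small sparsity parameter.

```python
import numpy as np

# Hypothetical hidden-layer activations: 3 examples (rows), 3 hidden neurons (columns).
A2 = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.1, 0.0],
    [0.2, 0.1, 0.3],
])

# Average activation of each hidden neuron over the training examples.
rho_hat = A2.mean(axis=0)
print(rho_hat)  # approximately [0.4, 0.1, 0.1]

rho = 0.05  # assumed sparsity parameter
print(rho_hat - rho)  # how far each neuron is from the desired average activation
```

In this toy example the first neuron is active far more often than the sparsity parameter allows, so it is the one a sparsity penalty would push down the most.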



3. Loss function



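The above is only a conceptual explanation. How do we turn it into a mathematical expression? The model introduces the KL divergence (relative entropy) as a penalty that pushes the hidden-layer nodes toward low activity. The KL divergence between a Bernoulli variable with mean $\rho$ and one with mean $\hat{\rho}_j$ is:

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j}$$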


Suppose $\rho$ is held fixed. The following figure shows how the relative entropy $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ changes as $\hat{\rho}_j$ varies.


As the figure above shows, the relative entropy reaches its minimum value of 0 when $\hat{\rho}_j = \rho$. As $\hat{\rho}_j$ approaches 0 or 1, the relative entropy becomes very large (in fact, it tends to $\infty$).
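This behaviour is easy to check numerically. The sketch below assumes NumPy and an example value `rho = 0.2` chosen only for illustration; `kl_bernoulli` is a hypothetical helper name for the KL divergence between two Bernoulli distributions with means `rho` and `rho_hat`.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.2  # assumed sparsity parameter for this illustration
rho_hats = np.array([0.001, 0.1, 0.2, 0.5, 0.999])
values = kl_bernoulli(rho, rho_hats)

for r, v in zip(rho_hats, values):
    print(f"rho_hat = {r:5.3f}  KL = {v:.4f}")
```

The penalty is zero at `rho_hat == rho` and grows on both sides, sharply so as `rho_hat` gets close to 0 or 1.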




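According to this property, we can add the relative entropy to the loss function to penalize average activations that deviate from $\rho$, so that the learned parameters keep the average activation at about that level. All we need to do, therefore, is add a relative-entropy penalty to the loss function without the sparsity constraint. The loss function without the sparsity constraint is:

$$J(W,b) = \left[\frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}\left(x^{(i)}\right) - y^{(i)} \right\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$

where $n_l$ is the number of layers, $s_l$ is the number of nodes in layer $l$ (excluding the bias node), and $\lambda$ is the weight of the penalty term on the weights.

After adding the sparsity constraint on the hidden layer, the loss function becomes:

$$J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$$

where $s_2$ is the number of hidden-layer nodes and $\beta$ controls the weight of the sparsity penalty.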
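A rough NumPy sketch of this loss, under the assumption of a three-layer network with sigmoid activations and the target equal to the input, is shown below. It adds up the averaged squared reconstruction error, a weight penalty scaled by `lam`, and `beta` times the summed KL penalty. The function name, the parameter names `lam`, `rho`, `beta`, and all numeric values are illustrative, not taken from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, lam, rho, beta):
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)            # hidden activations
    A3 = sigmoid(A2 @ W2.T + b2)           # reconstruction
    recon = 0.5 * np.sum((A3 - X) ** 2) / m
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_hat = A2.mean(axis=0)              # average activation per hidden node
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + decay + beta * kl

rng = np.random.default_rng(0)
X = rng.random((20, 6))                    # toy data: 20 examples, 6 features
W1 = rng.normal(0.0, 0.1, (4, 6)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.1, (6, 4)); b2 = np.zeros(6)

loss_plain = sparse_autoencoder_loss(X, W1, b1, W2, b2, lam=1e-4, rho=0.05, beta=0.0)
loss_sparse = sparse_autoencoder_loss(X, W1, b1, W2, b2, lam=1e-4, rho=0.05, beta=3.0)
print(loss_plain, loss_sparse)
```

Setting `beta=0.0` recovers the loss without the sparsity constraint, which makes it easy to see how much the penalty contributes.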



4. Partial derivatives of the loss function


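The partial derivatives of the loss function are again computed with backpropagation (BP); the only difference is in how the hidden layer is handled. Here $z_i^{(l)}$ is the weighted input to node $i$ of layer $l$, $a_i^{(l)} = f\left(z_i^{(l)}\right)$ is its activation, and $f$ is the activation function. Without the sparsity constraint, the BP algorithm for a single example $(x, y)$ can be written as follows:

1. Perform a feedforward pass and compute the activations of layers $L_2$, $L_3$, and so on up to the output layer $L_{n_l}$.

2. For each node $i$ of the output layer (layer $n_l$), compute:

$$\delta_i^{(n_l)} = -\left(y_i - a_i^{(n_l)}\right) f'\left(z_i^{(n_l)}\right)$$

3. For each layer $l = n_l - 1, n_l - 2, \ldots, 2$ and each node $i$ of layer $l$, compute:

$$\delta_i^{(l)} = \left(\sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)}\right) f'\left(z_i^{(l)}\right)$$

4. Compute the required partial derivatives:

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}, \qquad \frac{\partial}{\partial b_i^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}$$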


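After adding the sparsity constraint, the error term of hidden node $i$ becomes:

$$\delta_i^{(2)} = \left(\left(\sum_{j=1}^{s_3} W_{ji}^{(2)} \delta_j^{(3)}\right) + \beta \left(-\frac{\rho}{\hat{\rho}_i} + \frac{1 - \rho}{1 - \hat{\rho}_i}\right)\right) f'\left(z_i^{(2)}\right)$$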
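The following is a sketch of how backpropagation with the extra sparsity term might be written in NumPy for a three-layer autoencoder. It assumes sigmoid activations (so $f'(z) = a(1 - a)$), a batch of examples stored row-wise, and gradients averaged over the batch; the function name `loss_and_grads` and the parameters `lam`, `rho`, `beta` are illustrative names, and the numeric values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, W1, b1, W2, b2, lam, rho, beta):
    m = X.shape[0]
    # Feedforward pass.
    A2 = sigmoid(X @ W1.T + b1)
    A3 = sigmoid(A2 @ W2.T + b2)
    rho_hat = A2.mean(axis=0)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    loss = (0.5 * np.sum((A3 - X) ** 2) / m
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
            + beta * kl)
    # Output-layer error term (target y = x).
    D3 = -(X - A3) * A3 * (1 - A3)
    # Hidden-layer error term with the additional sparsity term.
    sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    D2 = (D3 @ W2 + sparsity) * A2 * (1 - A2)
    # Partial derivatives, averaged over the batch, plus the weight penalty.
    gW2 = D3.T @ A2 / m + lam * W2
    gb2 = D3.mean(axis=0)
    gW1 = D2.T @ X / m + lam * W1
    gb1 = D2.mean(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

rng = np.random.default_rng(0)
X = rng.random((10, 5))                       # toy data
W1 = rng.normal(0.0, 0.1, (3, 5)); b1 = np.zeros(3)
W2 = rng.normal(0.0, 0.1, (5, 3)); b2 = np.zeros(5)
lam, rho, beta = 1e-3, 0.05, 3.0              # hypothetical hyperparameters

loss, (gW1, gb1, gW2, gb2) = loss_and_grads(X, W1, b1, W2, b2, lam, rho, beta)
print(loss, gW1.shape, gW2.shape)
```

A common way to gain confidence in such code is to compare a few entries of the analytic gradient against a finite-difference estimate of the loss.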





5. Parameter solution

So far we have the loss function of the model and its partial derivatives with respect to all parameters. To find the parameters that minimize the loss function, the natural choice is gradient descent. For details, see the previous blog post.
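For reference, the basic gradient descent update (each parameter moves a small step against its partial derivative) can be sketched as below. This is a generic loop, not the previous post's code: `gradient_descent`, the learning rate `alpha`, and the toy quadratic loss standing in for the autoencoder loss are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(loss_and_grads, params, alpha=0.1, steps=100):
    """Repeatedly apply p := p - alpha * dJ/dp to every parameter array."""
    history = []
    for _ in range(steps):
        loss, grads = loss_and_grads(*params)
        history.append(loss)
        params = [p - alpha * g for p, g in zip(params, grads)]
    return params, history

# Toy stand-in for the real loss: J(w) = 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0])

def toy_loss_and_grads(w):
    diff = w - target
    return 0.5 * np.sum(diff ** 2), (diff,)

params, history = gradient_descent(toy_loss_and_grads, [np.zeros(2)], alpha=0.1, steps=200)
print(params[0], history[0], history[-1])
```

Any function that returns the loss together with a tuple of gradients, one per parameter array, can be plugged in place of the toy loss.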



6. Summary

In general, a sparse autoencoder is a special three-layer neural network: an ordinary neural network with a sparsity constraint added. The model aims to obtain a low-dimensional representation of the original data, and comparing it with PCA helps to understand it. From the input layer to the hidden layer, it resembles PCA finding a low-dimensional space and mapping the high-dimensional data into it; from the hidden layer to the output layer, it resembles PCA reconstructing the original high-dimensional data from the low-dimensional projection. In practical applications, sparse autoencoders are used as a data preprocessing step. Stacked autoencoders use this kind of preprocessing to reduce the dimensionality of the data and extract its latent information.

7. References

Deep learning neural networks

UFLDL tutorial