Asked By : Gigili
Answered By : AJed
- The learning algorithm (gradient descent is the most commonly explained of them; see the sketch after this list)
- The number of layers (3 or 4 layers is usually enough: input, 1 or 2 hidden, output). The size of the output layer depends on your output (e.g., if you want to classify yes/no, your output layer consists of two nodes). The same applies to the input layer. However, you may consider using only a subset of the inputs for learning: you may think that your problem is affected by $k$ variables, but taking a subset $k' < k$ of them may give better results.
- Number of nodes in the hidden layers (usually selected by trial and error)
- Number of training iterations (not too many, to avoid over-fitting)
- Size of the training/testing split (there are known rules of thumb, like 80:20; a split sketch follows this list)
- The type of function used at the nodes (neurons), e.g. $f(x) = 1/(1+e^{-x})$ or $f(x) = \tanh(x)$; usually the first is sufficient.
- An important issue is the pre-processing and post-processing of the data (this is common to all pattern-recognition techniques). You may, for instance, transform your data by applying a certain function $f$ and then run your experiments (a standardization sketch follows this list).
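To make the items above concrete, here is a minimal sketch (in Python/NumPy, not from the original answer) of a network with one hidden layer, the sigmoid function above, and plain batch gradient descent on a toy XOR problem. The layer sizes, learning rate, and iteration count are illustrative choices, not recommendations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # note the minus sign: 1/(1+e^-x)

rng = np.random.default_rng(0)

# Toy data: XOR, 2 inputs -> 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

n_hidden = 4                  # number of hidden nodes: trial and error
W1 = rng.normal(0, 1, (2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 1, (n_hidden, 1))
b2 = np.zeros(1)
lr = 0.5                      # learning rate for gradient descent

for epoch in range(5000):     # training iterations: too many can overfit
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (squared-error loss), using sigmoid'(z) = s * (1 - s)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Predictions should approach [0, 1, 1, 0]
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```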
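For the 80:20 rule of thumb, a shuffled split might look like this sketch (`X` and `y` here are hypothetical placeholder data, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical feature matrix
y = rng.integers(0, 2, size=100)       # hypothetical labels

idx = rng.permutation(len(X))          # shuffle before splitting
cut = int(0.8 * len(X))                # 80% of the data for training
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
```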
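As one example of pre-processing, here is a sketch of standardizing each feature to zero mean and unit variance, one common choice of the function $f$. Fitting the statistics on the training data only and reusing them on the test data is an assumption of standard practice, not something stated above:

```python
import numpy as np

def fit_standardizer(X_train):
    # Compute per-feature statistics on the training data only
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12   # avoid division by zero
    return lambda X: (X - mu) / sigma

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
f = fit_standardizer(X_train)
print(f(X_train))                          # standardized features
```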
Note: given the many parameters you need to deal with, it is a good approach to use a search algorithm to select the best parameters for you. It should be a heuristic search algorithm (e.g., a genetic algorithm) if the set of parameters you will deal with is very large, which is usually the case; a sketch of such a search follows below.

Note: consider the Matlab NN toolbox or Weka (open source). They expose all these parameters for you. In fact, Weka has many other learning algorithms as well.

Note: perhaps you may want to use other algorithms altogether. If that is the case, try support vector machines. There was a historical battle between these two methods (in the 1990s), and SVMs won it! (I am not being very scientific here.)
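Here is a sketch of what such a heuristic search could look like, written as a toy genetic-algorithm loop over two hyper-parameters (hidden-layer size and learning rate). The `fitness` function below is a hypothetical stand-in; in practice you would replace it with "train a network with these settings and return its validation score":

```python
import random

random.seed(0)

def fitness(n_hidden, lr):
    # Placeholder: replace with train-and-validate on your own data.
    return -((n_hidden - 12) ** 2) / 100.0 - (lr - 0.1) ** 2

def mutate(ind):
    # Small random perturbations of one parameter setting
    n_hidden, lr = ind
    n_hidden = max(1, n_hidden + random.choice([-2, -1, 0, 1, 2]))
    lr = max(1e-4, lr * random.uniform(0.5, 2.0))
    return (n_hidden, lr)

# Initial random population of parameter settings
pop = [(random.randint(1, 50), random.uniform(1e-3, 1.0)) for _ in range(20)]

for gen in range(30):
    pop.sort(key=lambda ind: fitness(*ind), reverse=True)
    survivors = pop[:5]                      # selection: keep the best
    pop = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

best = max(pop, key=lambda ind: fitness(*ind))
print("best (n_hidden, lr):", best)
```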
Best Answer from Stack Exchange
Question Source : http://cs.stackexchange.com/questions/7714