Word classification algorithms

Problem Detail: I am looking for algorithms to classify words in a paragraph of text. I am particularly interested in a classification that determines whether a certain word is a noun, verb, etc., but I am also open to any kind of word-classification algorithm. I am given a small input (e.g. 200 words), so there are not many options for machine learning, but I would really appreciate algorithms involving machine learning.

Asked By : gen

Answered By : Franck Dernoncourt

As D.W. said, the task of classifying words as noun, verb, etc. is called part-of-speech tagging. The machine learning model typically used for this purpose is the log-linear model (a.k.a. maximum entropy, or MaxEnt for short). It is actually pretty straightforward:

  • Assume you read the sentence sequentially (left to right, or right to left)
  • Let $h$ be the history: the tags you assigned to the 2 previous words in the sentence, as well as the list of all the words in your sentence.
  • Let $t$ be the tag you assign to the current word.
  • You define many features $f_j(h, t)$, typically indicator functions such as $f_j(h, t) = 1$ if the current word ends in 'ing', $f_j(h, t) = 0$ otherwise.
  • $v$ is a parameter vector with one element per feature $f_j(h, t)$; it is used to weight the features.
  • You define $p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n) = \frac{\exp \left( \sum_{j} v_j f_j(h, t_i) \right) }{\sum_{t'} \exp \left( \sum_{j} v_j f_j(h, t') \right) }$ (this model is known as multinomial logistic regression, a.k.a. maximum entropy (MaxEnt) classifier, multiclass logistic regression, polytomous logistic regression, or softmax regression.)
  • Use the Markov assumption: $p(t_1,\dots,t_n|w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n)$ (that’s the probability that your sentence $w_1,\dots,w_n$ has tags $t_1,\dots,t_n$)
  • You find the $t_1,\dots,t_n$ that maximize $\prod_{i=1}^{n} p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n)$ using the Viterbi algorithm: that’s the final tagging of your sentence.
  • You train the model ($v$) so that it maximizes the likelihood of your training set (adding some regularization to improve generalization).
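The steps above can be sketched in Python. The tag set, feature templates, and hand-set weights below are toy assumptions for illustration only; a real tagger would learn the weights from an annotated corpus:

```python
import math

TAGS = ["DET", "NOUN", "VERB", "ADJ"]  # toy tag set
START = "*"  # padding tag for the two positions before the sentence

def features(prev2, prev, words, i, tag):
    """Indicator features of the history (two previous tags + words) and candidate tag."""
    w = words[i]
    return [
        f"word={w}|{tag}",
        f"suffix3={w[-3:]}|{tag}",
        f"prev={prev}|{tag}",
        f"trigram={prev2},{prev}|{tag}",
    ]

def log_p(weights, prev2, prev, words, i, tag):
    """log p(tag | history) under the log-linear (MaxEnt) model."""
    scores = {t: sum(weights.get(f, 0.0) for f in features(prev2, prev, words, i, t))
              for t in TAGS}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[tag] - log_z

def viterbi(weights, words):
    """Highest-probability tag sequence; dynamic-programming states are pairs of adjacent tags."""
    n = len(words)
    pi = {(START, START): 0.0}   # best log-prob of a prefix ending in these two tags
    back = []                    # back-pointers, one dict per position
    for i in range(n):
        new_pi, bp = {}, {}
        for (u, v), lp in pi.items():
            for t in TAGS:
                cand = lp + log_p(weights, u, v, words, i, t)
                if cand > new_pi.get((v, t), float("-inf")):
                    new_pi[(v, t)] = cand
                    bp[(v, t)] = u
        pi = new_pi
        back.append(bp)
    u, v = max(pi, key=pi.get)   # best final pair of tags
    tags = [u, v]
    for i in range(n - 1, 1, -1):
        tags.insert(0, back[i][(tags[0], tags[1])])
    return tags[-n:]             # drop START padding for very short sentences

# Toy hand-set weights (stand-ins for what training would produce).
weights = {
    "word=the|DET": 3.0,
    "prev=DET|NOUN": 2.0,
    "word=is|VERB": 3.0,
    "suffix3=ing|VERB": 2.5,
}

print(viterbi(weights, ["the", "dog", "is", "running"]))
# → ['DET', 'NOUN', 'VERB', 'VERB']
```

Note that the normalization in `log_p` makes the conditional probabilities at each position sum to 1 over the tag set, exactly as in the softmax formula above.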

Most of the work to improve this main idea is about:

  • Finding better features
  • Doing some reranking (e.g. Global Linear Models, which let you define global features that take the entire tagging history as input, not just the last 2 tags as before, and whose feature weights are trained with a gradient-descent-like algorithm)
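As an illustration of the first point, here are a few feature templates that are commonly added on top of word identity; the exact templates below are my own illustrative choices, not a canonical set:

```python
def word_shape_features(words, i, tag):
    """Extra indicator features often used to improve a MaxEnt tagger:
    capitalization, digits, hyphenation, and short prefixes/suffixes.
    These templates are illustrative, not a fixed standard set."""
    w = words[i]
    feats = []
    if w[0].isupper():
        feats.append(f"capitalized|{tag}")
    if any(c.isdigit() for c in w):
        feats.append(f"has-digit|{tag}")
    if "-" in w:
        feats.append(f"has-hyphen|{tag}")
    for k in (1, 2, 3):          # short prefixes and suffixes
        feats.append(f"prefix{k}={w[:k]}|{tag}")
        feats.append(f"suffix{k}={w[-k:]}|{tag}")
    return feats

print(word_shape_features(["Low-cost"], 0, "ADJ"))
```

Shape features like these help especially on unknown words, where the word-identity feature has never been seen in training.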

As usual, the more data, the better. For more details see Coursera’s Natural Language Processing by Michael Collins (Coursera’s Natural Language Processing by Dan Jurafsky and Christopher Manning is great too). Note that the task of locating and classifying elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. is called named-entity recognition.
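The training step mentioned in the list above (maximize the likelihood of the training set, plus regularization) corresponds to the standard L2-regularized objective for log-linear models, with $m$ training examples $(h^{(k)}, t^{(k)})$:

```latex
L(v) = \sum_{k=1}^{m} \log p(t^{(k)} \mid h^{(k)}; v) \;-\; \frac{\lambda}{2} \lVert v \rVert^2
\qquad
\frac{\partial L}{\partial v_j} = \sum_{k=1}^{m} \Big( f_j(h^{(k)}, t^{(k)}) - \sum_{t'} p(t' \mid h^{(k)}; v)\, f_j(h^{(k)}, t') \Big) - \lambda v_j
```

Each gradient component compares the empirical count of feature $f_j$ with its expected count under the model; gradient ascent (or a quasi-Newton method such as L-BFGS) on $L(v)$ is the usual training procedure.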


Question Source : http://cs.stackexchange.com/questions/14037
