
Model Training

The model used is a multinomial pdf (bag-of-words) computed from the counts of unique words for each topic:

\begin{displaymath}
P(word_i \vert c) = \frac{N_{c}(word_i)}{\sum\limits_{j = 1}^{w} N_{c}(word_j)}
\end{displaymath} (1)

where P(word_i | c) is the probability of word_i under class c, N_c(word_i) is the number of times word_i was encountered in the training corpus for class c, and w is the total number of unique words. This is a word-frequency model which is guaranteed (in the maximum-likelihood sense) to converge to the true word probabilities given a large enough training corpus. We train the system on 12 different conversation topics using the web and newsgroup text documents shown in Fig. 2.
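This maximum-likelihood estimate is simply a normalized word count per class. A minimal sketch in Python (the function name and the whitespace-tokenized corpus format are illustrative assumptions, not part of the original system):

```python
from collections import Counter

def train_topic_model(documents):
    """Estimate P(word | c) for one class c as the maximum-likelihood
    word frequencies over that class's training documents."""
    counts = Counter()
    for doc in documents:
        # N_c(word): accumulate counts of each unique word for this class
        counts.update(doc.split())
    total = sum(counts.values())  # denominator: sum over all w unique words
    return {word: n / total for word, n in counts.items()}

# Toy corpus for one topic; probabilities sum to one over the vocabulary.
probs = train_topic_model(["the cat sat", "the dog ran"])
```

One such model is trained per topic; classification then compares a new document's words against each topic's probability table.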



Tony Jebara
2000-08-17