
Model Training

The model used is a multinomial pdf (bag-of-words) computed from the counts of unique words for each topic:

\begin{displaymath}
P(word_i \vert c) = \frac{N_{c}(word_i)}{\sum\limits_{j = 1}^{w} N_{c}(word_j)}
\end{displaymath} (1)

where P(word_i | c) is the probability of word_i under class c, N_c(word_i) is the number of times word_i was encountered in the training corpus for class c, and w is the total number of unique words. This is a word-frequency model which is guaranteed (in the maximum-likelihood sense) to converge to the true word probabilities given a large enough training corpus. We train the system on 12 different conversation topics using the web and newsgroup text documents shown in Fig. 2.
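This maximum-likelihood estimate is simply a normalized word count per class. A minimal sketch in Python (the function name and the whitespace-tokenized corpus format are illustrative assumptions, not part of the original system):

```python
from collections import Counter

def train_topic_model(documents):
    """Estimate P(word | c) for one class c as the maximum-likelihood
    word frequencies over that class's training documents."""
    counts = Counter()
    for doc in documents:
        # N_c(word): accumulate counts of each unique word for this class
        counts.update(doc.split())
    total = sum(counts.values())  # denominator: sum over all w unique words
    return {word: n / total for word, n in counts.items()}

# Toy corpus for one topic; probabilities sum to one over the vocabulary.
probs = train_topic_model(["the cat sat", "the dog ran"])
```

One such model is trained per topic; classification then compares a new document's words against each topic's probability table.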



Tony Jebara
2000-08-17