/*
Package bayesian is a Naive Bayesian Classifier.

Jake Brukhman <[email protected]>

BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set
of classes (e.g. categories) C := {C_1, ..., C_n}, and a
document D consisting of words D := {W_1, ..., W_k}.
We wish to ascertain the probability that the document
belongs to some class C_j given some set of training data
associating documents and classes.

By Bayes' Theorem, we have that

    P(C_j|D) = P(D|C_j)*P(C_j)/P(D).

The LHS is the probability that the document belongs to class
C_j given the document itself (by which is meant, in practice,
the word frequencies occurring in this document), and our program
will calculate this probability for each j and spit out the
most likely class for this document.
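
In code, this amounts to computing a score proportional to P(C_j|D)
for every class and taking the argmax. A minimal sketch of that final
step (the scores slice here is hypothetical and stands in for whatever
per-class scores have been computed):

    // bestClass returns the index of the highest-scoring class.
    // It assumes len(scores) > 0.
    func bestClass(scores []float64) int {
        best := 0
        for j, s := range scores {
            if s > scores[best] {
                best = j
            }
        }
        return best
    }
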
P(C_j) is referred to as the "prior" probability, or the
probability that a document belongs to C_j in general, without
seeing the document first. P(D|C_j) is the probability of seeing
such a document, given that it belongs to C_j. Here, by assuming
that words appear independently in documents (this being the
"naive" assumption), we can estimate
P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)
where P(W_i|C_j) is the probability of seeing the given word
in a document of the given class. Finally, P(D) can be seen as
merely a scaling factor and is not strictly relevant to
classification, unless you want to normalize the resulting
scores and actually see probabilities. In this case, note that

    P(D) = SUM_j(P(D|C_j)*P(C_j))

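As a small illustration, normalizing a slice of per-class joint
scores P(D|C_j)*P(C_j) into posterior probabilities is just a
division by their sum (a hypothetical helper, shown only to make
the formula concrete):

    // normalize divides each joint score P(D|C_j)*P(C_j) by
    // P(D) = SUM_j(P(D|C_j)*P(C_j)), so the results sum to 1.
    func normalize(joint []float64) []float64 {
        var pD float64
        for _, s := range joint {
            pD += s
        }
        posteriors := make([]float64, len(joint))
        for j, s := range joint {
            posteriors[j] = s / pD
        }
        return posteriors
    }
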
One practical issue with performing these calculations is the
possibility of float64 underflow when calculating P(D|C_j), as
individual word probabilities can be arbitrarily small, and
a document can have an arbitrarily large number of them. A
typical method for dealing with this case is to transform the
probability to the log domain and perform additions instead
of multiplications:

    log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))

where i = 1, ..., k. Note that by doing this, we are discarding
the scaling factor P(D) and our scores are no longer
probabilities; however, the monotonic relationship of the
scores is preserved by the log function.
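
As an illustration, the log-domain score for a single class can be
computed as below; the prior and wordProb arguments are hypothetical
stand-ins for the trained frequencies, not the package's actual
internal representation (the snippet uses the standard math package):

    // logScore returns log(P(C_j)) + SUM_i(log P(W_i|C_j)) for one class.
    // It assumes every wordProb[w] is positive, e.g. via smoothing of
    // unseen words; otherwise math.Log(0) yields -Inf.
    func logScore(prior float64, wordProb map[string]float64, doc []string) float64 {
        score := math.Log(prior)
        for _, w := range doc {
            score += math.Log(wordProb[w])
        }
        return score
    }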
*/
package bayesian