US008412747B1

(12) United States Patent
     Harik et al.
(10) Patent No.: US 8,412,747 B1
(45) Date of Patent: Apr. 2, 2013

(54) METHOD AND APPARATUS FOR LEARNING A PROBABILISTIC GENERATIVE MODEL FOR TEXT

(75) Inventors: Georges Harik, Mountain View, CA (US); Noam M. Shazeer, Stanford, CA (US)

(73) Assignee: Google Inc., Mountain View, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 13/237,861

(22) Filed: Sep. 20, 2011

Related U.S. Application Data

(63) Continuation of application No. 11/796,383, filed on Apr. 27, 2007, now Pat. No. 8,024,372, which is a continuation of application No. 10/788,837, filed on Feb. 26, 2004, now Pat. No. 7,231,393, which is a continuation-in-part of application No. 10/676,571, filed on Sep. 30, 2003, now Pat. No. 7,383,258.

(60) Provisional application No. 60/416,144, filed on Oct. 3, 2002.

(51) Int. Cl.
     G06F 17/30 (2006.01)
(52) U.S. Cl. ........ 707/803
(58) Field of Classification Search ........ 707/803
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,794,050 A 8/1998 Dahlgren et al.
5,815,830 A 9/1998 Anthony
6,078,914 A 6/2000 Redfern
6,108,662 A 8/2000 Hoskins et al.
6,137,911 A 10/2000 Zhilyaev
6,651,054 B1 11/2003 Judicibus
6,684,205 B1 1/2004 Modha et al.
6,751,611 B2 6/2004 Krupin et al.
6,820,093 B2 11/2004 De la Huerga
6,868,525 B1 3/2005 Szabo
7,013,298 B1 3/2006 De la Huerga
7,231,393 B1 * 6/2007 Harik et al. ........ 1/1
7,363,308 B2 4/2008 Dillon et al.
7,383,258 B2 * 6/2008 Harik et al. ........ 1/1
8,024,372 B2 * 9/2011 Harik et al. ........ 707/803
2002/0087310 A1 7/2002 Lee et al.
2002/0120619 A1 8/2002 Marso et al.
2003/0037041 A1 2/2003 Hem
2004/0088308 A1 5/2004 Bailey et al.
2007/0156677 A1 * 7/2007 Szabo ........ 707/5
2008/0004904 A1 1/2008 Tran
2009/0318779 A1 12/2009 Tran
2010/0332583 A1 * 12/2010 Szabo ........ 709/202

OTHER PUBLICATIONS

Geiger et al., “Asymptotic Model Selection for Directed Networks with Hidden Variables,” May 1996, Technical Report MSR-TR-96-07, Microsoft Research, Advanced Technology Division.

(Continued)

Primary Examiner: Diane Mizrahi
(74) Attorney, Agent, or Firm: Fish & Richardson P.C.

(57) ABSTRACT

One embodiment of the present invention provides a system that learns a generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and cluster nodes representing clusters of conceptually related words. Within the current model, nodes are coupled together by weighted links, so that if a cluster node in the probabilistic model fires, a weighted link from the cluster node to another node causes the other node to fire with a probability proportionate to the link weight. The system also receives a set of training documents, wherein each training document contains a set of words. Next, the system applies the set of training documents to the current model to produce a new model.

20 Claims, 17 Drawing Sheets
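The firing behavior described in the abstract can be illustrated with a minimal sketch. It assumes that a link's firing probability equals its weight (the abstract says only that the probability is proportionate to the weight), that weights lie in [0, 1], and that a hypothetical `ROOT` node starts each sample. The cluster names, weights, and function names are illustrative and are not taken from the patent; the training step that produces a new model is not shown.

```python
import random

# Hypothetical toy model. Node names, weights, and the ROOT node are
# illustrative assumptions, not values from the patent.
LINKS = {
    "ROOT": [("C_ANIMALS", 0.5), ("C_STATES", 0.5)],
    "C_ANIMALS": [("elephant", 0.6), ("grey", 0.3)],
    "C_STATES": [("alaska", 0.5), ("california", 0.7), ("wyoming", 0.2)],
}


def fire(links, node, rng, fired=None):
    """Fire `node`, then follow each outgoing weighted link.

    Assumption: a link fires its target with probability equal to the
    link weight, which is taken to lie in [0, 1].
    """
    if fired is None:
        fired = set()
    if node in fired:
        return fired
    fired.add(node)
    for target, weight in links.get(node, []):
        if rng.random() < weight:
            fire(links, target, rng, fired)
    return fired


def sample_words(links, root, rng):
    """Return the terminal (word) nodes that fired in one sample.

    A node counts as a terminal if it has no outgoing links.
    """
    fired = fire(links, root, rng)
    return sorted(n for n in fired if n not in links)
```

Calling `sample_words(LINKS, "ROOT", random.Random(0))` returns a sorted list of the words generated in one pass; repeated calls with different seeds yield different word sets, with words under the same cluster tending to co-occur.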
[Front-page drawing. Labels: SPARSE LINK MESSAGE; ACTIVATION OF C IS a (PROBABILITY OF C IS p); EXTRA ADJUSTMENT MESSAGE]
OTHER PUBLICATIONS

Graham, I., “The HTML Sourcebook,” John Wiley & Sons, 1995 (ISBN 0471118494) (pages on “partial URLs” and “BASE element,” e.g., pp. 22-27; 87-88; 167-168).

Heckerman, David, “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data,” Mar. 1994, Technical Report MSR-TR-94-09, Microsoft Research, Advanced Technology Division.

Jordan et al., “Hidden Markov Decision Trees,” 1997, Center for Biological and Computational Learning, Massachusetts Institute of Technology, and Department of Computer Science, University of Toronto, Canada.

Meila et al., “Estimating Dependency Structure as a Hidden Variable,” Massachusetts Institute of Technology, A.I. Memo No. 1648, C.B.C.L. Memo No. 165, Sep. 1998.

Mills, T., “Providing world wide access to historical sources,” Computer Networks and ISDN Systems, vol. 29, Nos. 8-13, Sep. 1997, pp. 1317-1325.

Myka, A., “Automatic Hypertext Conversion of Paper Document Collections,” (ch. 6), Digital Libraries Workshop DL ’94, Newark NJ, May 1994 (selected papers), pp. 65-90.

Notice of Allowance for related U.S. Appl. No. 10/676,571, mailed from USPTO on Sep. 30, 2003.

Notice of Allowance for related U.S. Appl. No. 10/788,837, mailed from USPTO on Feb. 26, 2004.

Office Action for related U.S. Appl. No. 10/788,837, mailed from USPTO on Feb. 26, 2004.

Office Action for related U.S. Appl. No. 10/676,571, mailed from USPTO on Sep. 30, 2003.

Thistlewaite, P., “Automatic construction and management of large open webs,” Information Processing and Management: an International Journal, vol. 33, Issue 2, Mar. 1997, pp. 161-173 (ISSN 0306-4573).

* cited by examiner
U.S. Patent    Apr. 2, 2013    Sheet 1 of 17    US 8,412,747 B1
[FIG. 3. Labels: ELEPHANT, GREY, SKIES; ALASKA, CALIFORNIA, WYOMING]

Sheet 2 of 17
[FIG. 4. Labels: SESSION 1, SESSION 2; ACTIVE TERMINALS]

Sheet 3 of 17
[FIG. 6]

Sheet 4 of 17
[FIG. 7A and FIG. 8. Labels: 0.1, 0.3]

Sheet 5 of 17
[FIG. 10]

Sheet 6 of 17
[FIG. 12. Labels: P(C1 | REST OF NETWORK) = P1; P(C2 | REST OF NETWORK) = P2; NEW NEW YORK YORK]

Sheet 7 of 17
[FIG. 14. Labels: SPARSE LINK MESSAGE e'ax; ACTIVATION OF C IS a (PROBABILITY OF C IS p); EXTRA ADJUSTMENT MESSAGE = e'ax; C->CALIFORNIA, C->BERKELEY, C->SAN FRANCISCO; CALIFORNIA, BERKELEY, SAN FRANCISCO]

Sheet 8 of 17
[FIG. 15.1. Labels: O1(elephant)-#terminals, O1(grey)-#terminals; ELEPHANT, GREY]
[FIG. 15.2A. Labels: 01(02); ELEPHANT, GREY, MOUSE]
[FIG. 15.2B. Labels: 01(02)]