Linköping Studies in Science and Technology
Dissertation No. 1045
Generalized Hebbian Algorithm for Dimensionality
Reduction in Natural Language Processing
by
Genevieve Gorrell
Department of Computer and Information Science
Linköpings universitet
SE581 83 Linköping, Sweden
Linköping 2006
Abstract

The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of well-developed tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks.
This thesis focuses on two main questions. Firstly, in what ways can eigen decomposition and related techniques be extended to larger datasets? Secondly, this having been achieved, of what value is the resulting approach to information retrieval and to statistical language modelling at the n-gram level? The applicability of eigen decomposition is shown to be extendable through the use of an extant algorithm, the Generalized Hebbian Algorithm (GHA), and the novel extension of this algorithm to paired data, the Asymmetric Generalized Hebbian Algorithm (AGHA). Several original extensions to these algorithms are also presented, improving their applicability in various domains. The applicability of GHA to Latent Semantic Analysis-style tasks is investigated. Finally, AGHA is used to investigate the value of singular value decomposition, an eigen decomposition variant, to n-gram language modelling. A sizeable perplexity reduction is demonstrated.
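The core of the GHA mentioned above is Sanger's incremental update rule, which estimates the leading eigenvectors of a data covariance matrix from one sample at a time, without ever forming the matrix itself. The following is a minimal illustrative sketch of that rule, not the implementation used in this thesis; the data, learning rate and epoch count are invented for the example:

```python
import numpy as np

def gha(X, k, n_epochs=200, eta=0.01):
    """Generalized Hebbian Algorithm (Sanger's rule): estimate the top-k
    eigenvectors of the covariance of the rows of X, sample by sample,
    without constructing the covariance matrix explicitly."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    W = rng.normal(size=(k, d)) * 0.1        # rows: eigenvector estimates
    for _ in range(n_epochs):
        for x in X:
            y = W @ x                        # outputs y_i = w_i . x
            # Sanger update: dw_i = eta * y_i * (x - sum_{j<=i} y_j w_j)
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Toy usage: data whose covariance has a clearly dominant direction.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(3), np.diag([5.0, 1.0, 0.1]), size=500)
W = gha(X, k=2, n_epochs=100)
```

The lower-triangular term deflates each input by the components already captured by earlier rows, which is what makes the rows converge to successive eigenvectors rather than all collapsing onto the dominant one.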
Parts of this doctoral thesis appear in other publications:

Gorrell, G., 2006. Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. In the Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento.

Gorrell, G. and Webb, B., 2005. Generalized Hebbian Algorithm for Latent Semantic Analysis. In the Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech 2005), Lisbon.
Also by this author:

Gorrell, G. 2004. Language Modelling and Error Handling in Spoken Dialogue Systems. Licentiate thesis, Linköping University, 2004.

Rayner, M., Boye, J., Lewin, I. and Gorrell, G. 2003. Plug and Play Spoken Dialogue Processing. In "Current and New Directions in Discourse and Dialogue." Eds. Jan van Kuppevelt and Ronnie W. Smith. Kluwer Academic Publishers.

Gorrell, G. 2003. Recognition Error Handling in Spoken Dialogue Systems. Proceedings of the 2nd International Conference on Mobile and Ubiquitous Multimedia, Norrköping, 2003.

Gorrell, G. 2003. Using Statistical Language Modelling to Identify New Vocabulary in a Grammar-Based Speech Recognition System. Proceedings of Eurospeech 2003.

Gorrell, G., Lewin, I. and Rayner, M. 2002. Adding Intelligent Help to Mixed Initiative Spoken Dialogue Systems. Proceedings of ICSLP 2002.

Knight, S., Gorrell, G., Rayner, M., Milward, D., Koeling, R. and Lewin, I. 2001. Comparing Grammar-Based and Robust Approaches to Speech Understanding: A Case Study. Proceedings of Eurospeech 2001.

Rayner, M., Lewin, I., Gorrell, G. and Boye, J. 2001. Plug and Play Speech Understanding. Proceedings of SIGDial 2001.

Rayner, M., Gorrell, G., Hockey, B. A., Dowding, J. and Boye, J. 2001. Do CFG-Based Language Models Need Agreement Constraints? Proceedings of NAACL 2001.

Korhonen, A., Gorrell, G. and McCarthy, D., 2000. Statistical Filtering and Subcategorisation Frame Acquisition. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora 2000.

Gorrell, G. 1999. Acquiring subcategorisation from textual corpora. M.Phil. thesis, Cambridge University, 1999.
Thanks to:

Arne Jönsson, my supervisor, who has been a great supporter and an outstanding role model for me; Robin Cooper, second supervisor, for keeping me sane, on which all else depended; Joakim Nivre, third supervisor, for stepping in with additional support later on; Manny Rayner, Ian Lewin and Brandyn Webb, major professional and intellectual influences; Robert Andersson, the world's best systems administrator; everyone in GSLT, NLPLAB, KEDRI and Lingvistiken (GU) for providing such rich working environments; and finally, my beloved family, who mean more to me than anything else.
Contents

1 Introduction 9
  1.1 Eigen Decomposition 12
  1.2 Applications of Eigen Decomposition in NLP 17
  1.3 Generalized Hebbian Algorithm 18
  1.4 Research Issues 19

2 Matrix Decomposition Techniques and Applications 23
  2.1 The Vector Space Model 24
  2.2 Eigen Decomposition 25
  2.3 Singular Value Decomposition 27
  2.4 Latent Semantic Analysis 29
  2.5 Summary 38

3 The Generalized Hebbian Algorithm 39
  3.1 Hebbian Learning for Incremental Eigen Decomposition 41
  3.2 GHA and Incremental Approaches to SVD 44
  3.3 GHA Convergence 46
  3.4 Summary 49

4 Algorithmic Variations 51
  4.1 GHA for Latent Semantic Analysis 52
    4.1.1 Inclusion of Global Normalisation 53
    4.1.2 Epoch Size and Implications for Application 54
  4.2 Random Indexing 55
  4.3 GHA and Singular Value Decomposition 57
  4.4 Sparse Implementation 62
  4.5 Convergence 64
    4.5.1 Staged Training 65
    4.5.2 Convergence Criteria 65
    4.5.3 Learning Rates 68
    4.5.4 Asymmetric GHA Convergence Evaluation 69
  4.6 Summary 74

5 GHA for Information Retrieval 77
  5.1 GHA for LSA 78
    5.1.1 Method 78
    5.1.2 Results 79
  5.2 LSA and Large Training Sets 83
    5.2.1 Memory Usage 84
    5.2.2 Time 85
    5.2.3 Scaling LSA 86
  5.3 Summary 88

6 SVD and Language Modelling 89
  6.1 Modelling Letter and Word Co-Occurrence 92
    6.1.1 Word Bigram Task 93
    6.1.2 Letter Bigram Task 94
  6.2 SVD-Based Language Modelling 96
    6.2.1 Large Corpus N-Gram Language Modelling using Sparse AGHA 97
    6.2.2 Small Corpus N-Gram Language Modelling using LAS2 100
  6.3 Interpolating SVDLM with Standard N-gram 102
    6.3.1 Method 102
    6.3.2 Small Bigram Corpus 104
    6.3.3 Medium-Sized Trigram Corpus 105
    6.3.4 Improving Tractability 106
    6.3.5 Large Trigram Corpus 111
  6.4 Summary 115

7 Conclusion 119
  7.1 Detailed Overview 119
  7.2 Summary of Contributions 121
  7.3 Discussion 122
    7.3.1 Eigen Decomposition – A Panacea? 122
    7.3.2 Generalized Hebbian Algorithm – Overhyped or Underrated? 124
    7.3.3 Wider Perspectives 126
  7.4 Future Work 130