Towards a Unified Approach to Memory- and Statistical-Based Machine Translation

Daniel Marcu
Information Sciences Institute and Department of Computer Science
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
[email protected]
Abstract

We present a set of algorithms that enable us to translate natural language sentences by exploiting both a translation memory and a statistical-based translation model. Our results show that an automatically derived translation memory can be used within a statistical framework to often find translations of higher probability than those found using solely a statistical model. The translations produced using both the translation memory and the statistical model are significantly better than translations produced by two commercial systems: our hybrid system translated perfectly 58% of the 505 sentences in a test collection, while the commercial systems translated perfectly only 40-42% of them.

1 Introduction

Over the last decade, much progress has been made in the fields of example-based (EBMT) and statistical machine translation (SMT). EBMT systems work by modifying existing, human-produced translation instances, which are stored in a translation memory (TMEM). Many methods have been proposed for storing translation pairs in a TMEM, finding translation examples that are relevant for translating unseen sentences, and modifying and integrating translation fragments to produce correct outputs. Sato (1992), for example, stores complete parse trees in the TMEM and selects and generates new translations by performing similarity matchings on these trees. Veale and Way (1997) store complete sentences; new translations are generated by modifying the TMEM translation that is most similar to the input sentence. Others store phrases; new translations are produced by optimally partitioning the input into phrases that match examples from the TMEM (Maruyana and Watanabe, 1992), or by finding all partial matches and then choosing the best possible translation using a multi-engine translation system (Brown, 1999).

With a few exceptions (Wu and Wong, 1998), most SMT systems are couched in the noisy-channel framework (see Figure 1). In this framework, the source language, let's say English, is assumed to be generated by a noisy probabilistic source.[1] Most of the current statistical MT systems treat this source as a sequence of words (Brown et al., 1993). (Alternative approaches exist, in which the source is taken to be, for example, a sequence of aligned templates/phrases (Wang, 1998; Och et al., 1999) or a syntactic tree (Yamada and Knight, 2001).) In the noisy-channel framework, a monolingual corpus is used to derive a statistical language model that assigns a probability to a sequence of words or phrases, thus enabling one to distinguish between sequences of words that are grammatically correct and sequences that are not.

[1] For the rest of this paper, we use the terms source and target language according to the jargon specific to the noisy-channel framework. In this framework, the source language is the language into which the machine translation system translates.
Report Documentation Page Form Approved
OMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and
maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,
including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington
VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it
does not display a currently valid OMB control number.
1. REPORT DATE 3. DATES COVERED
2001 2. REPORT TYPE 00-00-2001 to 00-00-2001
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Towards a Unified Approach to Memory- and Statistical-Based Machine
5b. GRANT NUMBER
Translation
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION
University of California,Information Sciences Institute ,4676 Admiralty REPORT NUMBER
Way,Marina del Rey,CA,90292
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT
NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited
13. SUPPLEMENTARY NOTES
14. ABSTRACT
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF
ABSTRACT OF PAGES RESPONSIBLE PERSON
a. REPORT b. ABSTRACT c. THIS PAGE 8
unclassified unclassified unclassified
Standard Form 298 (Rev. 8-98)
Prescribed by ANSI Std Z39-18
A sentence-aligned parallel corpus is then used in order to build a probabilistic translation model that explains how the source can be turned into the target and that assigns a probability to every way in which a source e can be mapped into a target f. Once the parameters of the language and translation models are estimated using traditional maximum likelihood and EM techniques (Dempster et al., 1977), one can take as input any string in the target language f and find the source e of highest probability that could have generated the target, a process called decoding (see Figure 1).

[Figure 1: The noisy channel model. A source generates e with probability P(e); the channel turns it into the observed f with probability P(f | e); the decoder recovers the best e as argmax_e P(e | f) = argmax_e P(f | e) P(e).]

It is clear that EBMT and SMT systems have different strengths and weaknesses. If a sentence to be translated, or a very similar one, can be found in the TMEM, an EBMT system has a good chance of producing a good translation. However, if the sentence to be translated has no close matches in the TMEM, then an EBMT system is less likely to succeed. In contrast, an SMT system may be able to produce perfect translations even when the sentence given as input does not resemble any sentence from the training corpus. However, such a system may be unable to generate translations that use idioms and phrases that reflect long-distance dependencies and contexts, which are usually not captured by current translation models.

This paper advances the state of the art in two respects. First, we show how one can use an existing statistical translation model (Brown et al., 1993) in order to automatically derive a statistical TMEM. Second, we adapt a decoding algorithm so that it can exploit information specific both to the statistical TMEM and to the translation model. Our experiments show that the automatically derived translation memory can be used within the statistical framework to often find translations of higher probability than those found using solely the statistical model. The translations produced using both the translation memory and the statistical model are significantly better than translations produced by two commercial systems.

2 The IBM Model 4

For the work described in this paper we used a modified version of the statistical machine translation tool developed in the context of the 1999 Johns Hopkins Summer Workshop (Al-Onaizan et al., 1999), which implements IBM translation model 4 (Brown et al., 1993).

IBM model 4 revolves around the notion of word alignment over a pair of sentences (see Figure 2). The word alignment is a graphical representation of a hypothetical stochastic process by which a source string e is converted into a target string f. The probability of a given alignment a and target sentence f, given a source sentence e, is given by

P(a, f | e) = prod_{i=1}^{l} n(phi_i | e_i)
            x prod_{i=1}^{l} prod_{k=1}^{phi_i} t(tau_{ik} | e_i)
            x prod_{i=1, phi_i>0}^{l} d_1(pi_{i1} - c_{rho_i} | class(e_{rho_i}), class(tau_{i1}))
            x prod_{i=1}^{l} prod_{k=2}^{phi_i} d_{>1}(pi_{ik} - pi_{i(k-1)} | class(tau_{ik}))
            x binom(m - phi_0, phi_0) (1 - p_1)^{m - 2 phi_0} p_1^{phi_0}
            x prod_{k=1}^{phi_0} t(tau_{0k} | NULL)

where the factors delineated by the x symbols correspond to hypothetical steps in the following generative process:

- Each English word e_i is assigned, with probability n(phi_i | e_i), a fertility phi_i, which corresponds to the number of French words into which e_i is going to be translated.

- Each English word e_i is then translated, with probability t(tau_{ik} | e_i), into a French word tau_{ik}, where k ranges from 1 to the number of words phi_i (the fertility of e_i) into which e_i is translated. For example, the English word "no" in Figure 2 is a word of fertility 2 that is translated into "aucun" and "ne".
- The rest of the factors denote distortion probabilities (d), which capture the probability that words change their position when translated from one language into another; the probability of some French words being generated from an invisible English NULL element (p_1); etc. See Brown et al. (1993) or Germann et al. (2001) for a detailed discussion of this translation model and a description of its parameters.

[Figure 2: Example of Viterbi alignment produced by IBM model 4, for the sentence pair "there is no one union involved ." / "aucun syndicat particulier ne est en cause .".]
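To make the factors above concrete, the sketch below scores an alignment using only the fertility, lexical translation, and NULL-generation terms of this decomposition; the distortion factors d_1 and d_{>1} are omitted for brevity, and the parameter tables are assumed to be plain Python dictionaries. It is an illustrative simplification, not the tool used in our experiments.

from math import comb

def alignment_probability(e, f, a, n, t, p1):
    """Score P(a, f | e) under a simplified Model-4-style decomposition
    (fertility, lexical translation, and NULL generation only).

    e  : English words, with e[0] = "NULL" (the invisible NULL word)
    f  : French words
    a  : a[j] = index of the English word that generates f[j] (0 = NULL)
    n  : fertility table, n[(phi, e_word)] -> probability
    t  : translation table, t[(f_word, e_word)] -> probability
    p1 : probability of generating a spurious French word from NULL
    """
    m = len(f)
    phi = [0] * len(e)                      # fertilities implied by the alignment
    for j in range(m):
        phi[a[j]] += 1

    prob = 1.0
    for i in range(1, len(e)):              # fertility factors for real English words
        prob *= n[(phi[i], e[i])]
    for j in range(m):                      # lexical translation factors
        prob *= t[(f[j], e[a[j]])]

    phi0 = phi[0]                           # number of NULL-generated French words
    prob *= comb(m - phi0, phi0) * (1 - p1) ** (m - 2 * phi0) * p1 ** phi0
    return prob

For the alignment in Figure 2, for instance, the factors would include n(2 | no), t(aucun | no), and t(ne | no).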
3 Building a statistical translation memory

Companies that specialize in producing high-quality human translations of documentation and news often rely on translation memory tools to increase their productivity (Sprung, 2000). Building a high-quality TMEM is an expensive process that requires many person-years of work. Since we are not in the fortunate position of having access to an existing TMEM, we decided to build one automatically.

We trained IBM translation model 4 on 500,000 English-French sentence pairs from the Hansard corpus. We then used the Viterbi alignment of each sentence, i.e., the alignment of highest probability, to extract tuples of the form ⟨e_i, e_{i+1}, ..., e_{i+k}; f_j, f_{j+1}, ..., f_{j+l}; a_j, a_{j+1}, ..., a_{j+l}⟩, where e_i, e_{i+1}, ..., e_{i+k} represents a contiguous English phrase, f_j, f_{j+1}, ..., f_{j+l} represents a contiguous French phrase, and a_j, a_{j+1}, ..., a_{j+l} represents the Viterbi alignment between the two phrases. We selected only "contiguous" alignments, i.e., alignments in which the words in the English phrase generated only words in the French phrase and each word in the French phrase was generated either by the NULL word or by a word from the English phrase. We extracted only tuples in which the English and French phrases contained at least two words. For example, in the Viterbi alignment of the two sentences in Figure 2, which was produced automatically, "there" and "." are words of fertility 0, NULL generates the French lexeme ".", "is" generates "est", "no" generates "aucun" and "ne", and so on. From this alignment we extracted the six tuples shown in Table 1, because they were the only ones that satisfied all the conditions mentioned above. For example, the pair ⟨no one; aucun syndicat particulier ne⟩ does not occur in the translation memory because the French word "syndicat" is generated by the word "union", which does not occur in the English phrase "no one".

By extracting all tuples of this form from the training corpus, we ended up with many duplicates and with French phrases that were paired with multiple English translations. We chose for each French phrase only one possible English translation equivalent. We tried out two distinct methods for choosing a translation equivalent, thus constructing two different probabilistic TMEMs:

- The Frequency-based Translation MEMory (FTMEM) was created by associating with each French phrase the English equivalent that occurred most often in the collection of phrases that we extracted.

- The Probability-based Translation MEMory (PTMEM) was created by associating with each French phrase the English equivalent that corresponded to the alignment of highest probability.

In contrast to other TMEMs, our TMEMs explicitly encode not only the mutual translation pairs but also their corresponding word-level alignments, which are derived according to a certain translation model (in our case, IBM model 4). The mutual translations can be anywhere from two words long to complete sentences. Both methods yielded translation memories that contained around 11.8 million word-aligned translation pairs. Due to efficiency considerations and memory limitations (the software we wrote loads a complete TMEM into memory), we used in our experiments only a fraction of the TMEMs: those that contained phrases at most 10 words long.
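The tuple extraction described above amounts to enumerating candidate phrase pairs and checking the contiguity condition against the Viterbi alignment links. The sketch below is one way to express that check; the representation of the alignment as a set of (i, j) links and the names used here are assumptions made for illustration, and the 10-word cap used in our experiments is folded in as a parameter.

def extract_contiguous_pairs(e_words, f_words, alignment, max_len=10):
    """Extract (English phrase, French phrase, alignment) tuples such that the
    English words generate only words inside the French span and every French
    word in the span is generated either by NULL or by a word in the English span.

    alignment: set of (i, j) links meaning English word i generates French word j;
               French words generated by NULL simply have no incoming link.
    Only phrases of at least two words on both sides are kept.
    """
    pairs = []
    n_e, n_f = len(e_words), len(f_words)
    for i1 in range(n_e):
        for i2 in range(i1 + 1, min(n_e, i1 + max_len)):          # >= 2 English words
            for j1 in range(n_f):
                for j2 in range(j1 + 1, min(n_f, j1 + max_len)):  # >= 2 French words
                    links = [(i, j) for (i, j) in alignment
                             if (i1 <= i <= i2) or (j1 <= j <= j2)]
                    if not links:
                        continue
                    # contiguity: every link touching either span must lie inside both
                    if all(i1 <= i <= i2 and j1 <= j <= j2 for (i, j) in links):
                        pairs.append((e_words[i1:i2 + 1],
                                      f_words[j1:j2 + 1],
                                      links))
    return pairs

An implementation run over 500,000 sentence pairs would prune the candidate spans rather than enumerate all of them, but the acceptance test is the same one that produced the entries in Table 1.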
English | French | Alignment
one union | syndicat particulier | one → particulier ; union → syndicat
no one union | aucun syndicat particulier ne | no → aucun, ne ; one → particulier ; union → syndicat
is no one union | aucun syndicat particulier ne est | is → est ; no → aucun, ne ; one → particulier ; union → syndicat
there is no one union | aucun syndicat particulier ne est | is → est ; no → aucun, ne ; one → particulier ; union → syndicat
is no one union involved | aucun syndicat particulier ne est en cause | is → est ; no → aucun, ne ; one → particulier ; union → syndicat ; involved → en cause
there is no one union involved | aucun syndicat particulier ne est en cause | is → est ; no → aucun, ne ; one → particulier ; union → syndicat ; involved → en cause
there is no one union involved . | aucun syndicat particulier ne est en cause . | is → est ; no → aucun, ne ; one → particulier ; union → syndicat ; involved → en cause ; NULL → .

Table 1: Examples of automatically constructed statistical translation memory entries.
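Collapsing the extracted tuples into the two memories then reduces to choosing which English equivalent to keep for each French phrase. The sketch below shows one way to implement the two selection strategies; the assumption that every tuple carries the probability of the Viterbi alignment it was extracted from is made only for illustration.

from collections import Counter, defaultdict

def build_tmems(tuples):
    """Collapse (french, english, alignment, viterbi_prob) tuples so that each
    French phrase keeps exactly one English equivalent.

    french/english are tuples of words; viterbi_prob is the probability of the
    Viterbi alignment the pair was extracted from (any comparable score would do).
    Returns (ftmem, ptmem), each mapping french -> (english, alignment).
    """
    counts = defaultdict(Counter)   # french -> how often each english equivalent occurred
    best = {}                       # french -> (prob, english, alignment) of best alignment
    example = {}                    # (french, english) -> a representative alignment

    for french, english, alignment, prob in tuples:
        counts[french][english] += 1
        example[(french, english)] = alignment
        if french not in best or prob > best[french][0]:
            best[french] = (prob, english, alignment)

    ftmem = {}
    for french, counter in counts.items():
        english, _ = counter.most_common(1)[0]          # most frequent equivalent
        ftmem[french] = (english, example[(french, english)])

    ptmem = {french: (english, alignment)               # highest-probability equivalent
             for french, (prob, english, alignment) in best.items()}
    return ftmem, ptmem

The two dictionaries differ only in the selection criterion, matching the FTMEM and PTMEM definitions above.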
This yielded a working FTMEM of 4.1 million and a PTMEM of 5.7 million phrase translation pairs aligned at the word level using IBM statistical model 4.

To evaluate the quality of both TMEMs we built, we extracted randomly 200 phrase pairs from each TMEM. These phrases were judged by a bilingual speaker as:

- perfect translations if she could imagine contexts in which the aligned phrases could be mutual translations of each other;

- almost perfect translations if the aligned phrases were mutual translations of each other and one phrase contained one single word with no equivalent in the other language[2];

- incorrect translations if the judge could not imagine any contexts in which the aligned phrases could be mutual translations of each other.

[2] For example, the translation pair "final , le secrétaire de" and "final act , the secretary of" was labeled as almost perfect because the English word "act" has no French equivalent.

TMEM | Perfect | Almost perfect | Incorrect | Unable to judge
FTMEM | 62.5% | 8.5% | 27.0% | 2.0%
PTMEM | 57.5% | 7.5% | 33.5% | 1.5%

Table 2: Accuracy of automatically constructed TMEMs.

The results of the evaluation are shown in Table 2. A visual inspection of the phrases in our TMEMs and the judgments made by the evaluator suggests that many of the translations labeled as incorrect make sense when assessed in a larger context. For example, "autres régions de le pays que" and "other parts of Canada than" were judged as incorrect. However, when considered in a context in which it is clear that "Canada" and "pays" corefer, it would be reasonable to assume that the translation is correct. Table 3 shows a few examples of phrases from our FTMEM and their corresponding correctness judgments.

Although we found our evaluation to be extremely conservative, we decided nevertheless to stick to it, as it adequately reflects constraints specific to high-standard translation environments in which TMEMs are built manually and constantly checked for quality by specialized teams (Sprung, 2000).

4 Statistical decoding using both a statistical TMEM and a statistical translation model

The results in Table 2 show that about 70% of the entries in our translation memory are correct or almost correct (very easy to fix). It is, though, an empirical question to what extent such TMEMs can be used to improve the performance of current translation systems. To determine this, we modified an existing decoding algorithm so that it can exploit information specific both to a statistical translation model and a statistical TMEM.
English | French | Judgment
, but I cannot say | , mais je ne puis dire | correct
how did this all come about? | comment est-ce arrivée? | correct
but, I humbly believe | mais, à mon humble avis | correct
final act, the secretary of | final, le secrétaire de | almost correct
other parts of Canada than | autres régions de le pays que | incorrect
what is the total amount accumulated | à combien se élève la | incorrect
that party present this | ce parti présent aujourd'hui | incorrect
the aircraft company to present further studies | de autre études | incorrect

Table 3: Examples of TMEM entries with correctness judgments.
The decoding algorithm that we use is a greedy one; see Germann et al. (2001) for details. The decoder first guesses an English translation for the French sentence given as input and then attempts to improve it by greedily exploring alternative translations from the immediate translation space. We modified the greedy decoder described by Germann et al. (2001) so that it attempts to find a good translation starting from two distinct points in the space of possible translations: one point corresponds to a word-for-word "gloss" of the French input; the other point corresponds to a translation that most closely resembles translations stored in the TMEM.

As discussed by Germann et al. (2001), the word-for-word gloss is constructed by aligning each French word f_j with its most likely English translation e_{f_j} (e_{f_j} = argmax_e t(e | f_j)). For example, in translating the French sentence "Bien entendu , il parle de une belle victoire .", the greedy decoder initially assumes that a good translation of it is "Well heard , it talking a beautiful victory" because the best translation of "bien" is "well", the best translation of "entendu" is "heard", and so on. A word-for-word gloss results (at best) in English words written in French word order.

The translation that most closely resembles translations stored in the TMEM is constructed by deriving a "cover" for the input sentence using phrases from the TMEM. The derivation attempts to cover with translation pairs from the TMEM as much of the input sentence as possible, using the longest phrases in the TMEM. The words in the input that are not part of any phrase extracted from the TMEM are glossed. For example, this approach may start the translation process from the phrase "well , he is talking a beautiful victory" if the TMEM contains the pairs ⟨well ,; bien entendu ,⟩ and ⟨he is talking; il parle⟩ but no pair with the French phrase "belle victoire".

If the input sentence is found "as is" in the translation memory, its translation is simply returned and there is no further processing. Otherwise, once an initial alignment is created, the greedy decoder tries to improve it, i.e., it tries to find an alignment (and implicitly a translation) of higher probability by locally modifying the initial alignment. The decoder attempts to find alignments and translations of higher probability by employing a set of simple operations, such as changing the translation of one or two words in the alignment under consideration, inserting into or deleting from the alignment words of fertility zero, and swapping words or segments.

In a stepwise fashion, starting from the initial gloss or initial cover, the greedy decoder iterates exhaustively over all alignments that are one such simple operation away from the alignment under consideration. At every step, the decoder chooses the alignment of highest probability, until the probability of the current alignment can no longer be improved.
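Both starting points admit a compact sketch. The gloss seed follows the argmax rule above; the cover seed shown here matches TMEM phrases greedily from left to right, longest first, which is one simple way to realize the covering strategy just described (the actual decoder may derive the cover differently). Here inv_t[f] is assumed to map English words to t(e | f), and tmem maps French phrases to word-aligned English phrases.

def gloss_seed(french, inv_t):
    """Word-for-word gloss: each French word is replaced by its most likely
    English translation, e_f = argmax_e t(e | f)."""
    return [max(inv_t[f], key=inv_t[f].get) for f in french]

def cover_seed(french, tmem, inv_t):
    """TMEM cover: greedily match the longest TMEM phrase starting at each
    position, left to right; uncovered words are glossed.
    tmem maps a tuple of French words to (English phrase, word-level alignment)."""
    english, j = [], 0
    while j < len(french):
        match = None
        for k in range(len(french), j + 1, -1):   # try the longest span first
            phrase = tuple(french[j:k])
            if phrase in tmem:
                match = (k, tmem[phrase][0])
                break
        if match is not None:
            j, e_phrase = match                   # jump past the covered span
            english.extend(e_phrase)
        else:
            english.append(max(inv_t[french[j]], key=inv_t[french[j]].get))
            j += 1
    return english

For "Bien entendu , il parle de une belle victoire .", the cover seed would combine the TMEM translations of "bien entendu ," and "il parle" with word-for-word glosses of the remaining words, as in the example above.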
5 Evaluation

We extracted from the test corpus a collection of 505 French sentences, uniformly distributed across the lengths 6, 7, 8, 9, and 10. For each French sentence, we had access to the human-generated English translation in the test corpus, and to translations generated by two commercial systems. We produced translations using three versions of the greedy decoder: one used only the statistical translation model, one used the translation model and the FTMEM, and one used the translation model and the PTMEM.

We initially assessed how often the translations obtained from TMEM seeds had higher probability
than the translations obtained from simple glosses. Tables 4 and 5 show that the translation memories significantly help the decoder find translations of high probability. In about 30% of the cases, the translations are simply copied from a TMEM, and in about 13% of the cases the translations obtained from a TMEM seed have higher probability than the best translations obtained from a simple gloss. In 40% of the cases both seeds (the TMEM and the gloss) yield the same translation. Only in about 15-18% of the cases are the translations obtained from the gloss better than the translations obtained from the TMEM seeds. It appears that both TMEMs help the decoder find translations of higher probability consistently, across all sentence lengths.

Sent. length | Found in FTMEM | Higher prob. from FTMEM | Same result | Higher prob. from gloss
6 | 33 | 9 | 43 | 16
7 | 27 | 9 | 48 | 17
8 | 29 | 16 | 42 | 14
9 | 31 | 15 | 28 | 27
10 | 31 | 9 | 43 | 18
All (%) | 30% | 12% | 40% | 18%

Table 4: The utility of the FTMEM.

Sent. length | Found in PTMEM | Higher prob. from PTMEM | Same result | Higher prob. from gloss
6 | 33 | 9 | 43 | 16
7 | 27 | 10 | 50 | 14
8 | 30 | 16 | 41 | 14
9 | 31 | 15 | 36 | 19
10 | 31 | 15 | 31 | 13
All (%) | 31% | 13% | 41% | 15%

Table 5: The utility of the PTMEM.

In a second experiment, a bilingual judge scored the human translations extracted from the automatically aligned test corpus; the translations produced by a greedy decoder that uses both TMEM and gloss seeds; the translations produced by a greedy decoder that uses only the statistical model and the gloss seed; and translations produced by two commercial systems (A and B).

- If an English translation had the very same meaning as the French original, it was considered semantically correct. If the meaning was just a little different, the translation was considered semantically incorrect. For example, "this is rather provision disturbing" was judged as a semantically correct translation of "voilà une disposition plutôt inquiétante", but "this disposal is rather disturbing" was judged as incorrect.

- If a translation was perfect from a grammatical perspective, it was considered to be grammatical. Otherwise, it was considered incorrect. For example, "this is rather provision disturbing" was judged as ungrammatical, although one may very easily make sense of it.

We decided to use such harsh evaluation criteria because, in previous experiments, we repeatedly found that harsh criteria can be applied consistently. To ensure consistency during evaluation, the judge used a specialized interface: once the correctness of a translation produced by a system S was judged, the same judgment was automatically recorded with respect to the other systems as well. This way, it became impossible for a translation to be judged as correct when produced by one system and incorrect when produced by another.

Table 6, which summarizes the results, displays the percentage of perfect translations (both semantically and grammatically) produced by a variety of systems. Table 6 shows that translations produced using both TMEM and gloss seeds are much better than translations that do not use TMEMs. The translation systems that use both a TMEM and the statistical model significantly outperform the two commercial systems. The figures in Table 6 also reflect the harshness of our evaluation metric: only 82% of the human translations extracted from the test corpus were considered perfect translations. A few of the errors were genuine and could be explained by failures of the sentence alignment program that was used to create the corpus (Melamed, 1999). Most of the errors were judged as semantic, reflecting directly the harshness of our evaluation metric.
Sentence length | Humans | Greedy with FTMEM | Greedy with PTMEM | Greedy without TMEM | Commercial system A | Commercial system B
6 | 92 | 72 | 70 | 52 | 55 | 59
7 | 73 | 58 | 52 | 37 | 42 | 43
8 | 80 | 53 | 52 | 30 | 38 | 29
9 | 84 | 53 | 53 | 37 | 40 | 35
10 | 85 | 57 | 60 | 36 | 40 | 37
All (%) | 82% | 58% | 57% | 38% | 42% | 40%

Table 6: Percent of perfect translations produced by various translation systems and algorithms.
6 Discussion

The approach to translation described in this paper is quite general. It can be applied in conjunction with other statistical translation models. And it can be applied in conjunction with existing translation memories. To do this, one would simply have to train the statistical model on the translation memory provided as input, determine the Viterbi alignments, and enhance the existing translation memory with word-level alignments as produced by the statistical translation model. We suspect that using manually produced TMEMs can only increase performance, as such TMEMs undergo periodic checks for quality assurance.

The work that comes closest to using a statistical TMEM similar to the one we propose here is that of Vogel and Ney (2000), who automatically derive a hierarchical TMEM from a parallel corpus. The hierarchical TMEM consists of a set of transducers that encode a simple grammar. The transducers are automatically constructed: they reflect common patterns of usage at levels of abstraction that are higher than the words. Vogel and Ney (2000) do not evaluate their TMEM-based system, so it is difficult to empirically compare their approach with ours. From a theoretical perspective, it appears though that the two approaches are complementary: Vogel and Ney (2000) identify abstract patterns of usage and then use them during translation. This may address the data sparseness problem that is characteristic of any statistical modeling effort and produce better translation parameters.

In contrast, our approach attempts to steer the statistical decoding process into directions that are difficult to reach when one relies only on the parameters of a particular translation model. For example, the two phrases "il est mort" and "he kicked the bucket" may appear in only one sentence in an arbitrarily large corpus. The parameters learned from the entire corpus will very likely assign very low probability to the words "kicked" and "bucket" being translated into "est" and "mort". Because of this, a statistical-based MT system will have trouble producing a translation that uses the phrase "kick the bucket", no matter what decoding technique it employs. However, if the two phrases are stored in the TMEM, producing such a translation becomes feasible.

If optimal decoding algorithms capable of exhaustively searching the space of all possible translations existed, using TMEMs in the style presented in this paper would never improve the performance of a system. Our approach works because it biases the decoder to search in subspaces that are likely to yield translations of high probability, subspaces which otherwise may not be explored. The bias introduced by TMEMs is a practical alternative to finding optimal translations, which is NP-complete (Knight, 1999).

It is clear that one of the main strengths of the TMEM is its ability to encode contextual, long-distance dependencies that are incongruous with the parameters learned by current context-poor, reductionist channel models. Unfortunately, the criterion used by the decoder in order to choose between a translation produced starting from a gloss and one produced starting from a TMEM is biased in favor of the gloss-based translation. It is possible for the decoder to produce a perfect translation using phrases from the TMEM and yet discard that perfect translation in favor of an incorrect translation of higher probability that was obtained from a gloss (or from the TMEM). It would be desirable to develop alternative ranking techniques that would permit one to prefer, in some instances, a TMEM-based translation even though that translation is not the best according to the probabilistic channel model. The examples in Table 7 show, though, that this is not trivial: it is not always the case that the translation of highest probability is the perfect one.
Translations | Does this translation use TMEM phrases? | Is this translation correct? | Is this the translation of highest probability?
French input: monsieur le président, je aimerais savoir.
mr. speaker, i would like to know. | yes | yes | yes
mr. speaker, i would like to know. | no | yes | yes
French input: je ne peux vous entendre, brian.
i cannot hear you, brian. | yes | yes | yes
i can you listen, brian. | no | no | no
French input: alors, je termine là-dessus.
therefore, i will conclude my remarks. | yes | yes | no
therefore, i conclude - over. | no | no | yes

Table 7: Examples of system outputs, obtained with or without TMEM help.
The first French sentence in Table 7 is correctly translated with or without help from the translation memory. The second sentence is correctly translated only when the system uses a TMEM seed; and fortunately, the translation of highest probability is the one obtained using the TMEM seed. The translation obtained from the TMEM seed is also correct for the third sentence. But unfortunately, in this case, the TMEM-based translation is not the most probable.

Acknowledgments. This work was supported by DARPA-ITO grant N66001-00-1-9814.

References

Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final Report, JHU Summer Workshop.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Ralph D. Brown. 1999. Adding linguistic knowledge to a lexical example-based translation system. In Proceedings of TMI'99, pages 22-32, Chester, England.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(Ser B):1-38.

Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proceedings of ACL'01, Toulouse, France.

Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4).

H. Maruyana and H. Watanabe. 1992. Tree cover search algorithm for example-based translation. In Proceedings of TMI'92, pages 173-184.

Dan Melamed. 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(1):107-130.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the EMNLP and VLC, pages 20-28, University of Maryland, Maryland.

S. Sato. 1992. CTM: an example-based translation aid system using the character-based match retrieval method. In Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France.

Robert C. Sprung, editor. 2000. Translating Into Success: Cutting-Edge Strategies For Going Multilingual In A Global Age. John Benjamins Publishers.

Tony Veale and Andy Way. 1997. Gaijin: A template-based bootstrapping approach to example-based machine translation. In Proceedings of "New Methods in Natural Language Processing", Sofia, Bulgaria.

S. Vogel and Hermann Ney. 2000. Construction of a hierarchical translation memory. In Proceedings of COLING'00, pages 1131-1135, Saarbrücken, Germany.

Ye-Yi Wang. 1998. Grammar Inference and Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University. Also available as CMU-LTI Technical Report 98-160.

Dekai Wu and Hongsing Wong. 1998. Machine translation with a stochastic grammatical channel. In Proceedings of ACL'98, pages 1408-1414, Montreal, Canada.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of ACL'01, Toulouse, France.