SMT(Statistical Machine Translation)

고요한하늘... 2011. 10. 7. 00:43

-- under construction --

Moses basic install guide

http://www.statmt.org/wmt11/baseline.html

Download SRILM

http://www.speech.sri.com/projects/srilm/download.html

Fill in the short registration form, then download.

tar -zxvf srilm.tgz

vi Makefile

SRILM = <directory to install into>

make MACHINE_TYPE=i686-m64 World

export PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/data2/jchern/bin:/data2/jchern/SMT/bin/i686-m64:/data2/jchern/SMT/bin

Add $SRILM/bin/$MACHINE_TYPE and $SRILM/bin to PATH (the export line above is the author's own setting).

Add $SRILM/man to MANPATH.

Download GIZA++

http://code.google.com/p/giza-pp/downloads/detail?name=giza-pp-v1.0.5.tar.gz&can=2&q=

make

mkdir -p bin

cp GIZA++-v2/GIZA++ bin/

cp GIZA++-v2/snt2cooc.out bin/

cp mkcls-v2/mkcls bin/

The SRILM build complained that TCL was missing, so download TCL

tar -zxvf tcl8.5.10-src.tar.gz

cd tcl8.5.10/unix

./configure

make

make install

Install Moses

http://sourceforge.net/projects/mosesdecoder/files/mosesdecoder/2010-08-09/

unzip moses-2010-08-13.zip

./configure

make

The build complained that the Boost header files were missing, so install Boost

Download boost_1_47_0.tar.gz

tar -zxvf boost_1_47_0.tar.gz

cd boost_1_47_0

./bootstrap.sh

./b2 install

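The build then stopped with a BINDIR setting error. Looking at the script, it seemed to be looking for the files in giza++/bin, so I changed the BINDIR setting to giza++/bin and the compile succeeded.

The europarl-v6.fr-en corpus was missing, so download it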

http://www.mail-archive.com/opennlp-issues@incubator.apache.org/msg00371.html

Parallel corpus 1 (German)

wiederaufnahme der sitzungsperiode

ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .

begruessung

Parallel corpus 2 (English)

resumption of the session

i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .

welcome

For processing with Unix tools, set the environment variable LC_ALL=C.

One sentence per line, with no empty lines.

Each sentence has at most 100 words.

All words lowercased (use lowercase.perl).

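Training data format

word0_factor0|word0_factor1|word0_factor2 word1_factor0|word1_factor1|word1_factor2 ...

Words are separated by spaces, and the factors of one word by |.

Without factors

word0 word1 word2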
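
The format above can be read back with two splits. The following is only a sketch of the idea; the function name and the example tokens (surface form, lemma, part-of-speech tag) are made up for illustration and do not come from Moses itself.

```python
def parse_factored_line(line):
    """Split a training line into words, and each word into its factors.

    Words are separated by whitespace; the factors of one word by '|'.
    A line without factors yields one-element factor lists.
    """
    return [token.split("|") for token in line.split()]


# Hypothetical factored input: surface form | lemma | part-of-speech tag
factored = parse_factored_line("the|the|DT houses|house|NNS")
# Non-factored input, as in the example above
plain = parse_factored_line("word0 word1 word2")
```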

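Cleaning the corpus

Use clean-corpus-n.perl.

What it does

1. Removes empty lines

2. Removes runs of consecutive spaces

3. Removes lines that are too long or too short

How to run it

clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX

ex> clean-corpus-n.perl raw de en clean 1 50

The input files are raw.de and raw.en.

Sentence pairs in which either side has fewer than 1 or more than 50 words are removed (the limits count words, not characters).

The final output files are clean.de and clean.en.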
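
The three cleaning rules can be sketched in a few lines. This is not the Perl script itself, only an illustration of the rules listed above; the function names are hypothetical, and any further checks the real script performs are not reproduced here.

```python
def clean_pair(src, tgt, min_len=1, max_len=50):
    """Return the normalized sentence pair, or None if it should be dropped."""
    src_tokens = src.split()  # split() also collapses repeated whitespace
    tgt_tokens = tgt.split()
    for tokens in (src_tokens, tgt_tokens):
        # An empty line has 0 tokens, so min_len=1 removes it as well.
        if not (min_len <= len(tokens) <= max_len):
            return None
    return " ".join(src_tokens), " ".join(tgt_tokens)


def clean_corpus(src_lines, tgt_lines, min_len=1, max_len=50):
    """Keep only the pairs where both sides pass the length limits."""
    cleaned = (clean_pair(s, t, min_len, max_len)
               for s, t in zip(src_lines, tgt_lines))
    return [pair for pair in cleaned if pair is not None]
```

A pair is dropped as a whole, so the two output files stay aligned line by line.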

Training Step 3: Align Words

==> model/aligned.de <==

wiederaufnahme der sitzungsperiode

ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .

begruessung

==> model/aligned.en <==

resumption of the session

i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .

welcome

==> model/aligned.grow-diag-final <==

0-0 0-1 1-2 2-3

0-0 1-1 2-3 3-10 3-11 4-11 5-12 7-13 8-14 9-15 10-2 11-4 12-5 12-6 13-7 14-8 15-9 16-9 17-16

0-0

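Each link is written as (German position)-(English position), counting from 0. In the second sentence pair, 'ich' is word 0 of the German sentence and 'i' is word 0 of the English sentence, so the link is written 0-0.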
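
To make the index notation concrete, here is a small sketch that turns an alignment line back into word pairs. The function names are hypothetical; the sentences and the alignment line are the first entries of the three files shown above.

```python
def parse_alignment(line):
    """Turn '0-0 0-1 1-2' into [(0, 0), (0, 1), (1, 2)]."""
    return [tuple(int(i) for i in link.split("-")) for link in line.split()]


def aligned_words(src, tgt, alignment_line):
    """Pair up the words that each source-target link connects."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    return [(src_tokens[s], tgt_tokens[t])
            for s, t in parse_alignment(alignment_line)]


pairs = aligned_words(
    "wiederaufnahme der sitzungsperiode",   # model/aligned.de, line 1
    "resumption of the session",            # model/aligned.en, line 1
    "0-0 0-1 1-2 2-3",                      # model/aligned.grow-diag-final, line 1
)
```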

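Training Step 4: Get Lexical Translation Table

Build the word translation probability tables w(e|f) and w(f|e). The entries for the German word europa, highest probability first: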

europe europa 0.8874152

european europa 0.0542998

union europa 0.0047325

it europa 0.0039230

we europa 0.0021795

eu europa 0.0019304

europeans europa 0.0016190

euro-mediterranean europa 0.0011209

europa europa 0.0010586

continent europa 0.0008718

Training Step 5: Extract Phrases

All extracted phrases are written to a single file; its first entries look like this:

> head model/extract

wiederaufnahme ||| resumption ||| 0-0

wiederaufnahme der ||| resumption of the ||| 0-0 1-1 1-2

wiederaufnahme der sitzungsperiode ||| resumption of the session ||| 0-0 1-1 1-2 2-3

der ||| of the ||| 0-0 0-1

der sitzungsperiode ||| of the session ||| 0-0 0-1 1-2

sitzungsperiode ||| session ||| 0-0

ich ||| i ||| 0-0

ich erklaere ||| i declare ||| 0-0 1-1

erklaere ||| declare ||| 0-0

sitzungsperiode ||| session ||| 0-0

Training Step 6: Score Phrases

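The phrase translation table can become enormous, so it is built on disk rather than held in memory.

To estimate the phrase translation probabilities, the extract file is sorted first.

Sorting puts all English translations of the same foreign phrase on adjacent lines,

so the counts can be collected and the translation probability computed one foreign phrase at a time. For the inverse direction, the inverted extract file is sorted and processed the same way.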
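
The sort-then-count idea can be sketched as follows. This is an illustration only, not the Moses scorer: the function name is hypothetical, the input lines are invented counts in the extract format, and an in-memory sort stands in for the on-disk sort.

```python
from itertools import groupby


def score_phrases(extract_lines):
    """Yield (source, target, probability) for every distinct phrase pair.

    probability = count(source, target) / count(source), computed one
    source phrase at a time after sorting, so only the lines of the
    current source phrase have to be looked at together.
    """
    pairs = sorted(
        tuple(part.strip() for part in line.split("|||")[:2])
        for line in extract_lines
    )
    for source, group in groupby(pairs, key=lambda pair: pair[0]):
        targets = [target for _, target in group]
        for target in sorted(set(targets)):
            yield source, target, targets.count(target) / len(targets)


# Hypothetical extract lines (made-up counts, not from the corpus above)
EXTRACT_LINES = [
    "der ||| the ||| 0-0",
    "der ||| of the ||| 0-0 0-1",
    "der ||| the ||| 0-0",
    "ich ||| i ||| 0-0",
    "der ||| the ||| 0-0",
]
scores = list(score_phrases(EXTRACT_LINES))
```

Running the same function on lines with the two phrases swapped gives the inverse direction.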

moses.ini

distortion (reordering) weight,

language model weights,

translation model weights,

word penalty
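
These weights are the coefficients of a log-linear model: the decoder scores a translation hypothesis by a weighted sum of its feature scores, and tuning searches for the weights that give the best translations on a development set. A minimal sketch, in which the feature names, feature values, and weights are all made up for illustration (they are not moses.ini keys or real model scores):

```python
def model_score(features, weights):
    """Weighted sum of feature scores for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical weights, one per feature group listed above
WEIGHTS = {
    "distortion": 0.3,
    "language_model": 0.5,
    "translation_model": 0.2,
    "word_penalty": -1.0,
}
# Hypothetical feature scores (log-probabilities and a word-count feature)
FEATURES = {
    "distortion": -2.0,
    "language_model": -10.0,
    "translation_model": -4.0,
    "word_penalty": -3.0,
}
score = model_score(FEATURES, WEIGHTS)
```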

<Listing 14> Tuning

mkdir -p $WDIR/tuning

$SCRIPTS/tokenizer.perl -l en < $WDIR/dev.ko > $WDIR/tuning/input.pretok

$SCRIPTS/tokenizer.perl -l en < $WDIR/dev.en > $WDIR/tuning/reference.tok

cat $WDIR/tuning/input.pretok | sed 's/[0-9a-zA-Z][0-9a-zA-Z]*/ & /g' > $WDIR/tuning/input.tok

$SCRIPTS/lowercase.perl < $WDIR/tuning/input.tok > $WDIR/tuning/input

$SCRIPTS/lowercase.perl < $WDIR/tuning/reference.tok > $WDIR/tuning/reference

echo "%% Run the tuning script"

$BIN/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl $WDIR/tuning/input $WDIR/tuning/reference moses/moses-cmd/src/moses $WDIR/model/moses.ini --working-dir $WDIR/tuning --rootdir $BIN/moses-scripts/scripts-YYYYMMDD-HHMM

echo "%% Create the new configuration file"

$SCRIPTS/reuse-weights.perl $WDIR/tuning/moses.ini < $WDIR/model/moses.ini > $WDIR/tuning/moses.weight-reused.ini

<Listing 15> Running the decoder (translating)

mkdir -p $WDIR/evaluation

$SCRIPTS/tokenizer.perl -l en < $WDIR/devtest.ko > $WDIR/evaluation/devtest.input.tok

$SCRIPTS/tokenizer.perl -l en < $WDIR/devtest.en > $WDIR/evaluation/devtest.reference.tok

$SCRIPTS/lowercase.perl < $WDIR/evaluation/devtest.input.tok > $WDIR/evaluation/devtest.input

$SCRIPTS/lowercase.perl < $WDIR/evaluation/devtest.reference.tok > $WDIR/evaluation/devtest.reference

$BIN/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl $WDIR/evaluation/filtered.devtest $WDIR/tuning/moses.weight-reused.ini $WDIR/evaluation/devtest.input

$WDIR/smt/moses/moses-cmd/src/moses -config $WDIR/evaluation/filtered.devtest/moses.ini -input-file $WDIR/evaluation/devtest.input > $WDIR/evaluation/devtest.output

Reference 1 : http://www.statmt.org/wmt11/baseline.html

Reference 2 : http://www.imaso.co.kr/?doc=bbs/gnuboard.php&bo_table=article&keywords=%C0%D0%C0%BB%B0%C5%B8%AE&page=11&wr_id=36469

Reference 3 : http://leona.springnote.com/pages/578960

A site with a good explanation : http://www.guardiani.us/index.php/Moses_Language_Model_Howto_v2

#!/bin/bash

set -o nounset # Treat unset variables as an error when performing parameter expansion.
set -o errexit # Exit immediately if a simple command exits with a non-zero status.

# ANSI color escapes (the raw ESC character is written as \e here).
red_color=$'\e[31m'
green_color=$'\e[32m'
reset_color=$'\e[0m'

# Print the failing line when any command exits with an error.
function error_message()
{
    echo "Exits abnormally at line " $red_color `caller 0` $reset_color;
}

trap "error_message" ERR

NGRAM=5
SMT_HOME=/data2/jchern/smt
DEV=$SMT_HOME/dev
SMT_BIN=$SMT_HOME/bin
SRILM=$SMT_HOME/srilm
SCRIPTS=$SMT_HOME/scripts
TUNNING=$SMT_HOME/tunning
TRAIN_DIR=$SMT_HOME/training
WORK_DIR=$SMT_HOME/working-dir
MOSES_SCRIPT=$SMT_HOME/bin/moses-scripts/scripts-20111006-1552
TRAIN_BIG_DIR=$SMT_HOME/training-monolingual
EXAMPLE_DIR=$SMT_HOME/example
MOSES=$SMT_HOME/moses/moses-cmd/src/moses

export SCRIPTS_ROOTDIR=$MOSES_SCRIPT

usage()
{
    echo `caller`;
    echo "./smt.sh --in-file-prefix=corpus --filename-for-lang-model=europarl-v6.en --step=0 --debug";
    echo " --in-file-prefix : parallel corpus name";
    echo " --filename-for-lang-model : big data file to build the language model";
    echo " --step : start step";
    echo " --debug(d) : debug";
    exit
}

BIG_FILE="europarl-v6.en"

set -- `getopt -n$0 -u -a --longoptions="in-file-prefix: filename-for-lang-model: step: help debug" "dh" "$@"` || usage

[ $# -eq 0 ] && usage
[ $# -eq 1 ] && usage

debug="0";
STEP=0;

# Parse the options normalized by getopt.
while [ $# -gt 0 ]
do
    case "$1" in
        --in-file-prefix) CORPUS_NAME=$2; shift 2;;
        --filename-for-lang-model) BIG_FILE=$2; shift 2;;
        --step) STEP=$(( $2 - 1 )); shift 2;;
        --debug) debug=1; shift;;
        --help) usage; break;;
        -d) debug=1; shift;;
        -h) usage; break;;
        --) break;;
        -*) echo "unknown option : $1"; usage; shift; break;;
        *) echo "unknown option : $1"; usage; break;;
    esac
done

LANG1=ko
LANG2=en

echo "LANGUAGE 1 --> "$LANG1;
echo "LANGUAGE 2 --> "$LANG2;

if [[ $STEP -lt 1 ]]
then
    echo "==============================="
    echo "$red_color [1. PREPARE DATA]... $reset_color "
    echo "==============================="

    echo "------------------------"
    echo "$green_color [1.1 TOKENIZING]... $reset_color"
    echo "------------------------"
    # The output directory must exist before the redirections below.
    mkdir -p $WORK_DIR/corpus
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/tokenizer.perl -l en < $TRAIN_DIR/$CORPUS_NAME.$LANG1 > $WORK_DIR/corpus/$CORPUS_NAME.tok.$LANG1"
        echo "$SCRIPTS/tokenizer.perl -l en < $TRAIN_DIR/$CORPUS_NAME.$LANG2 > $WORK_DIR/corpus/$CORPUS_NAME.tok.$LANG2"
    fi
    $SCRIPTS/tokenizer.perl -l en < $TRAIN_DIR/$CORPUS_NAME.$LANG1 > $WORK_DIR/corpus/$CORPUS_NAME.tok.$LANG1 2>/dev/null
    $SCRIPTS/tokenizer.perl -l en < $TRAIN_DIR/$CORPUS_NAME.$LANG2 > $WORK_DIR/corpus/$CORPUS_NAME.tok.$LANG2 2>/dev/null

    echo "------------------------"
    echo "$green_color [1.2 CLEANING]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$MOSES_SCRIPT/training/clean-corpus-n.perl $WORK_DIR/corpus/$CORPUS_NAME.tok $LANG1 $LANG2 $WORK_DIR/corpus/$CORPUS_NAME.clean 1 40"
    fi
    $MOSES_SCRIPT/training/clean-corpus-n.perl $WORK_DIR/corpus/$CORPUS_NAME.tok $LANG1 $LANG2 $WORK_DIR/corpus/$CORPUS_NAME.clean 1 40 >/dev/null 2>/dev/null

    echo "------------------------"
    echo "$green_color [1.3 LOWERCASE]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/corpus/$CORPUS_NAME.clean.$LANG1 > $WORK_DIR/corpus/$CORPUS_NAME.lowcased.$LANG1"
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/corpus/$CORPUS_NAME.clean.$LANG2 > $WORK_DIR/corpus/$CORPUS_NAME.lowcased.$LANG2"
    fi
    $SCRIPTS/lowercase.perl < $WORK_DIR/corpus/$CORPUS_NAME.clean.$LANG1 > $WORK_DIR/corpus/$CORPUS_NAME.lowcased.$LANG1 2>/dev/null
    $SCRIPTS/lowercase.perl < $WORK_DIR/corpus/$CORPUS_NAME.clean.$LANG2 > $WORK_DIR/corpus/$CORPUS_NAME.lowcased.$LANG2 2>/dev/null
fi

if [[ $STEP -lt 2 ]]
then
    echo "==============================="
    echo "$red_color [2.BUILD LANGUAGE MODEL]... $reset_color"
    echo "==============================="

    echo "------------------------"
    echo "$green_color [2.1 MAKE DIC:lm]... $reset_color"
    echo "------------------------"
    mkdir -p $WORK_DIR/lm
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/tokenizer.perl -l $LANG2 "
        echo " < $TRAIN_BIG_DIR/$BIG_FILE "
        echo " > $WORK_DIR/lm/$BIG_FILE.tok"
    fi
    $SCRIPTS/tokenizer.perl -l $LANG2 < $TRAIN_BIG_DIR/$BIG_FILE > $WORK_DIR/lm/$BIG_FILE.tok 2>/dev/null

    echo "------------------------"
    echo "$green_color [2.2 LOWCASE]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/lm/$BIG_FILE.tok > $WORK_DIR/lm/$BIG_FILE.lowercased"
    fi
    $SCRIPTS/lowercase.perl < $WORK_DIR/lm/$BIG_FILE.tok > $WORK_DIR/lm/$BIG_FILE.lowercased

    echo "------------------------"
    echo "$green_color [2.3 BUILD LANGUAGE MODEL USING SRILM]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SRILM/bin/i686-m64/ngram-count -order $NGRAM -interpolate -kndiscount -text $WORK_DIR/lm/$BIG_FILE.lowercased -lm $WORK_DIR/lm/$BIG_FILE.lm"
    fi
    $SRILM/bin/i686-m64/ngram-count -order $NGRAM -interpolate -kndiscount -text $WORK_DIR/lm/$BIG_FILE.lowercased -lm $WORK_DIR/lm/$BIG_FILE.lm
fi

if [[ $STEP -lt 3 ]]
then
    echo "==============================="
    echo "$red_color [3. TRAIN MODEL]... $reset_color"
    echo "==============================="

    echo "------------------------"
    echo "$green_color [3.1 RUN TRAINING SCRIPT]... $reset_color"
    echo "------------------------"
    cd $WORK_DIR
    rm -rf model
    if [[ "$debug" -eq "1" ]]
    then
        echo "$MOSES_SCRIPT/training/train-model.perl \\"
        echo " -scripts-root-dir $MOSES_SCRIPT \\"
        echo " -root-dir $WORK_DIR \\"
        echo " -corpus $WORK_DIR/corpus/$CORPUS_NAME.lowcased \\"
        echo " -f $LANG1 \\"
        echo " -e $LANG2 \\"
        echo " -alignment grow-diag-final-and \\"
        echo " -reordering msd-bidirectional-fe \\"
        echo " -lm 0:3:$WORK_DIR/lm/$BIG_FILE.lm:0"
    fi
    $MOSES_SCRIPT/training/train-model.perl \
        -scripts-root-dir $MOSES_SCRIPT \
        -root-dir $WORK_DIR \
        -corpus $WORK_DIR/corpus/$CORPUS_NAME.lowcased \
        -f $LANG1 \
        -e $LANG2 \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        -lm 0:3:$WORK_DIR/lm/$BIG_FILE.lm:0
fi

if [[ $STEP -lt 4 ]]
then
    echo "==============================="
    echo "$red_color [4. TUNING]... $reset_color"
    echo "==============================="

    echo "------------------------"
    echo "$green_color [4.1 TOKENIZE TUNING SETS]... $reset_color"
    echo "------------------------"
    cd $WORK_DIR
    rm -rf tuning
    mkdir -p $WORK_DIR/tuning
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/tokenizer.perl -l $LANG2 < $DEV/dev.$LANG1 > $WORK_DIR/tuning/input.tok"
        echo "$SCRIPTS/tokenizer.perl -l $LANG2 < $DEV/dev.$LANG2 > $WORK_DIR/tuning/reference.tok"
    fi
    $SCRIPTS/tokenizer.perl -l $LANG2 < $DEV/dev.$LANG1 > $WORK_DIR/tuning/input.tok 2>/dev/null
    $SCRIPTS/tokenizer.perl -l $LANG2 < $DEV/dev.$LANG2 > $WORK_DIR/tuning/reference.tok 2>/dev/null

    echo "------------------------"
    echo "$green_color [4.2 Lowercase tuning sets]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/tuning/input.tok > $WORK_DIR/tuning/input"
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/tuning/reference.tok > $WORK_DIR/tuning/reference"
    fi
    $SCRIPTS/lowercase.perl < $WORK_DIR/tuning/input.tok > $WORK_DIR/tuning/input 2>/dev/null
    $SCRIPTS/lowercase.perl < $WORK_DIR/tuning/reference.tok > $WORK_DIR/tuning/reference 2>/dev/null

    echo "------------------------"
    echo "$green_color [4.3 Run tuning script]... $reset_color"
    echo "------------------------"
    export SCRIPTS_ROOTDIR=$MOSES_SCRIPT
    if [[ "$debug" -eq "1" ]]
    then
        echo "$MOSES_SCRIPT/training/mert-moses.pl $WORK_DIR/tuning/input \\"
        echo " $WORK_DIR/tuning/reference \\"
        echo " $MOSES \\"
        echo " $WORK_DIR/model/moses.ini \\"
        echo " --working-dir $WORK_DIR/tuning \\"
        echo " --rootdir $MOSES_SCRIPT "
        echo "$MOSES_SCRIPT/training/mert-moses.pl $WORK_DIR/tuning/input $WORK_DIR/tuning/reference $MOSES $WORK_DIR/model/moses.ini --working-dir $WORK_DIR/tuning --rootdir $MOSES_SCRIPT 2>/dev/null ";
    fi
    $MOSES_SCRIPT/training/mert-moses.pl $WORK_DIR/tuning/input $WORK_DIR/tuning/reference $MOSES $WORK_DIR/model/moses.ini --working-dir $WORK_DIR/tuning --rootdir $MOSES_SCRIPT 2>/dev/null

    echo "------------------------"
    echo "$green_color [4.4 Insert weights into configuration file]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/reuse-weights.perl $WORK_DIR/tuning/moses.ini < $WORK_DIR/model/moses.ini > $WORK_DIR/tuning/moses.weight-reused.ini"
    fi
    $SCRIPTS/reuse-weights.perl $WORK_DIR/tuning/moses.ini < $WORK_DIR/model/moses.ini > $WORK_DIR/tuning/moses.weight-reused.ini
fi

if [[ $STEP -lt 5 ]]
then
    echo "==============================="
    echo "$red_color [5. Run System on Development Test Set]... $reset_color"
    echo "==============================="

    echo "------------------------"
    echo "$green_color [5.1 Tokenize test set]... $reset_color"
    echo "------------------------"
    cd $WORK_DIR
    rm -rf evaluation
    mkdir -p $WORK_DIR/evaluation
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/tokenizer.perl -l $LANG2 < $EXAMPLE_DIR/test.$LANG1 > $WORK_DIR/evaluation/test.input.tok"
        echo "$SCRIPTS/tokenizer.perl -l $LANG2 < $EXAMPLE_DIR/test.$LANG2 > $WORK_DIR/evaluation/test.reference.tok"
    fi
    $SCRIPTS/tokenizer.perl -l $LANG2 < $EXAMPLE_DIR/test.$LANG1 > $WORK_DIR/evaluation/test.input.tok 2>/dev/null
    $SCRIPTS/tokenizer.perl -l $LANG2 < $EXAMPLE_DIR/test.$LANG2 > $WORK_DIR/evaluation/test.reference.tok 2>/dev/null

    echo "------------------------"
    echo "$green_color [5.2 Lowercase test set]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/evaluation/test.input.tok > $WORK_DIR/evaluation/test.input"
        echo "$SCRIPTS/lowercase.perl < $WORK_DIR/evaluation/test.reference.tok > $WORK_DIR/evaluation/test.reference"
    fi
    $SCRIPTS/lowercase.perl < $WORK_DIR/evaluation/test.input.tok > $WORK_DIR/evaluation/test.input 2>/dev/null
    $SCRIPTS/lowercase.perl < $WORK_DIR/evaluation/test.reference.tok > $WORK_DIR/evaluation/test.reference 2>/dev/null

    echo "------------------------"
    echo "$green_color [5.3 Filter the model to fit into memory]... $reset_color"
    echo "------------------------"
    export SCRIPTS_ROOTDIR=$MOSES_SCRIPT
    if [[ "$debug" -eq "1" ]]
    then
        echo "$MOSES_SCRIPT/training/filter-model-given-input.pl $WORK_DIR/evaluation/filtered.test $WORK_DIR/tuning/moses.weight-reused.ini $WORK_DIR/evaluation/test.input"
    fi
    $MOSES_SCRIPT/training/filter-model-given-input.pl $WORK_DIR/evaluation/filtered.test $WORK_DIR/tuning/moses.weight-reused.ini $WORK_DIR/evaluation/test.input

    echo "------------------------"
    echo "$green_color [5.4 Decode with Moses]... $reset_color"
    echo "------------------------"
    if [[ "$debug" -eq "1" ]]
    then
        echo "$MOSES -config $WORK_DIR/evaluation/filtered.test/moses.ini -input-file $WORK_DIR/evaluation/test.input > $WORK_DIR/evaluation/test.output"
    fi
    $MOSES -config $WORK_DIR/evaluation/filtered.test/moses.ini -input-file $WORK_DIR/evaluation/test.input > $WORK_DIR/evaluation/test.output
fi

echo ""
echo "$green_color job completed............. $reset_color"
echo ""

Part of the input file

“ 네이버와 같은 검색엔진은 사용자가 만든 콘텐츠를 이용하여 검색 트래픽을 가동해 왔으나 그런 콘텐츠에 대해 그들에게 법적인 권리가 있다고 말하기는 어렵다 . ”

산자부는 새 시스템의 강력한 검색 엔진을 통해 보다 풍부한 통관 관련 정보를 이용할 수 있다는 점도 자랑했다 .

컴퓨터 조사를 하다

엠파스는 트래픽량에서 한국의 검색엔진 중 5위다 .

밀수품 [ 무기 ] 이 있나 몸 [ 소지품 ] 수색하다

장물 [ 도난품 ] 을 찾다

“ 마이크로소프트와의 제휴관계를 통해 우리는 당사의 검색엔진 서비스를 다른 플랫폼까지 확대하게 되었다 .

수색 영장을 가져왔단다 얘야

Part of the output file

search engines such as engine have been using content created but their search traffic it is by their to say that they have any legal rights over them .

it also boasts a stronger search engine and access to more abundant information related to trade .

do a computer search

엠파스는 트래픽량에서 korean of an integrated search the 5위다 the local elections .

search a person for smuggled goods [ weapons ]

search for stolen goods

마이크로소프트와의 2004 제휴관계를 to search 우리는 당사의 search engine services to different platforms .

a search warrant , son .

------------------------------------------------------------------------------------------------------------------------

sample.en (to check whether matching the number of tokens on both sides improves accuracy, the article a is dropped and the nouns are made plural instead)

I am students

I am boys

You are girls

sample.ko

나는 학생 이다

나는 소년 이다

너는 소녀 이다

aligned.grow-diag-final-and

0-0 1-1 3-1 2-2

I(0) am(1) a(2) student(3)

나는(0) 학생(1) 이다(2)

I(0) am(1) a(2) boy(3)

나는(0) 소년(1) 이다(2)

You(0) are(1) a(2) girl(3)

너는(0) 소녀(1) 이다(2)

extract.gz

i ||| 나는 ||| 0-0

i am a student ||| 나는 학생 이다 ||| 0-0 1-1 3-1 2-2

am a student ||| 학생 이다 ||| 0-0 2-0 1-1

a ||| 이다 ||| 0-0

i ||| 나는 ||| 0-0

i am a boy ||| 나는 소년 이다 ||| 0-0 1-1 3-1 2-2

am a boy ||| 소년 이다 ||| 0-0 2-0 1-1

a ||| 이다 ||| 0-0

you ||| 너는 ||| 0-0

you are a girl ||| 너는 소녀 이다 ||| 0-0 1-1 3-1 2-2

are a girl ||| 소녀 이다 ||| 0-0 2-0 1-1

a ||| 이다 ||| 0-0
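
The entries above can be reproduced with the standard consistency rule for phrase extraction: a source span and the target span it links to form a phrase pair only if no word inside the target span is aligned to a word outside the source span. The sketch below is a simplified version written for this note, not the Moses extractor. The function name and the `max_len` default are assumptions, and it does not extend phrases over unaligned boundary words.

```python
def extract_phrases(src_tokens, tgt_tokens, links, max_len=7):
    """Return the phrase pairs that are consistent with the word alignment.

    links is a list of (source_index, target_index) pairs.
    """
    phrases = []
    for s_start in range(len(src_tokens)):
        for s_end in range(s_start, min(s_start + max_len, len(src_tokens))):
            # Target positions linked to the current source span
            tgt_points = [t for s, t in links if s_start <= s <= s_end]
            if not tgt_points:
                continue
            t_start, t_end = min(tgt_points), max(tgt_points)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the span if a target word in it links outside the source span
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for s, t in links):
                continue
            phrases.append((" ".join(src_tokens[s_start:s_end + 1]),
                            " ".join(tgt_tokens[t_start:t_end + 1])))
    return phrases


# First sentence pair and its alignment line from the sample above
links = [(0, 0), (1, 1), (3, 1), (2, 2)]
phrases = extract_phrases("i am a student".split(), "나는 학생 이다".split(), links)
```

For this sentence pair the sketch yields the same four phrase pairs as the first four lines of extract.gz. "am" alone is not extracted because 학생 is also linked to "student", which lies outside that span.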

extract.o.gz

i ||| 나는 ||| mono mono

i am a student ||| 나는 학생 이다 ||| mono mono

am a student ||| 학생 이다 ||| mono mono

a ||| 이다 ||| other other

i ||| 나는 ||| mono mono

i am a boy ||| 나는 소년 이다 ||| mono mono

am a boy ||| 소년 이다 ||| mono mono

a ||| 이다 ||| other other

you ||| 너는 ||| mono mono

you are a girl ||| 너는 소녀 이다 ||| mono mono

are a girl ||| 소녀 이다 ||| mono mono

a ||| 이다 ||| other other

extract.inv.gz

나는 ||| i ||| 0-0

나는 학생 이다 ||| i am a student ||| 0-0 1-1 1-3 2-2

학생 이다 ||| am a student ||| 0-0 0-2 1-1

이다 ||| a ||| 0-0

나는 ||| i ||| 0-0

나는 소년 이다 ||| i am a boy ||| 0-0 1-1 1-3 2-2

소년 이다 ||| am a boy ||| 0-0 0-2 1-1

이다 ||| a ||| 0-0

너는 ||| you ||| 0-0

너는 소녀 이다 ||| you are a girl ||| 0-0 1-1 1-3 2-2

소녀 이다 ||| are a girl ||| 0-0 0-2 1-1

이다 ||| a ||| 0-0

lex.e2f

you 너는 1.0000000

a 이다 1.0000000

girl 소녀 0.5000000

are 소녀 0.5000000

am 소년 0.5000000

am 학생 0.5000000

student 학생 0.5000000

boy 소년 0.5000000

i 나는 1.0000000

lex.f2e

너는 you 1.0000000

이다 a 1.0000000

소녀 girl 1.0000000

소녀 are 1.0000000

소년 am 0.5000000

학생 am 0.5000000

학생 student 1.0000000

소년 boy 1.0000000

나는 i 1.0000000
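
The numbers in lex.e2f and lex.f2e are consistent with plain relative frequencies over the alignment links: each line gives the probability of the first word given the second, that is count(pair) / count(second word). A sketch that recomputes them from the three sample sentence pairs follows. The function and variable names are hypothetical, and it assumes every pair uses the alignment line shown earlier, which matches the extract files.

```python
from collections import Counter

SENTENCES = [
    ("i am a student", "나는 학생 이다"),
    ("i am a boy", "나는 소년 이다"),
    ("you are a girl", "너는 소녀 이다"),
]
ALIGNMENT = "0-0 1-1 3-1 2-2"  # English index - Korean index


def lexical_tables(sentences, alignment):
    """Relative-frequency word translation tables in both directions."""
    links = [tuple(map(int, link.split("-"))) for link in alignment.split()]
    pair_count, en_count, ko_count = Counter(), Counter(), Counter()
    for en, ko in sentences:
        en_tok, ko_tok = en.split(), ko.split()
        for e, k in links:
            pair_count[en_tok[e], ko_tok[k]] += 1
            en_count[en_tok[e]] += 1
            ko_count[ko_tok[k]] += 1
    # p(english | korean), keyed like the lines of lex.e2f
    en_given_ko = {(e, k): c / ko_count[k] for (e, k), c in pair_count.items()}
    # p(korean | english), keyed like the lines of lex.f2e
    ko_given_en = {(k, e): c / en_count[e] for (e, k), c in pair_count.items()}
    return en_given_ko, ko_given_en


en_given_ko, ko_given_en = lexical_tables(SENTENCES, ALIGNMENT)
```

For example, 소녀 is linked to both "are" and "girl", so p(girl | 소녀) is 0.5, while "girl" is only ever linked to 소녀, so p(소녀 | girl) is 1.0.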

phrase-table.gz

a ||| 이다 ||| 1 1 1 1 2.718 ||| ||| 3 3

am a boy ||| 소년 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1

am a student ||| 학생 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1

are a girl ||| 소녀 이다 ||| 1 0.25 1 1 2.718 ||| ||| 1 1

i am a boy ||| 나는 소년 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1

i am a student ||| 나는 학생 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1

i ||| 나는 ||| 1 1 1 1 2.718 ||| ||| 2 2

you are a girl ||| 너는 소녀 이다 ||| 1 0.25 1 1 2.718 ||| ||| 1 1

you ||| 너는 ||| 1 1 1 1 2.718 ||| ||| 1 1

reordering-table.wbe-msd-bidirectional-fe

a ||| 이다 ||| 0.111111 0.111111 0.777778 0.111111 0.111111 0.777778

am a boy ||| 소년 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

am a student ||| 학생 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

are a girl ||| 소녀 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

i am a boy ||| 나는 소년 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

i am a student ||| 나는 학생 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

i ||| 나는 ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857

you are a girl ||| 너는 소녀 이다 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

you ||| 너는 ||| 0.600000 0.200000 0.200000 0.600000 0.200000 0.200000

/giza.en-ko/en-ko.A3.final

# Sentence pair (1) source length 3 target length 4 alignment score : 0.015202

i am a student

NULL ({ }) 나는 ({ 1 }) 학생 ({ 2 4 }) 이다 ({ 3 })

# Sentence pair (2) source length 3 target length 4 alignment score : 0.015202

i am a boy

NULL ({ }) 나는 ({ 1 }) 소년 ({ 2 4 }) 이다 ({ 3 })

# Sentence pair (3) source length 3 target length 4 alignment score : 0.0143526

you are a girl

NULL ({ }) 너는 ({ 1 }) 소녀 ({ 2 4 }) 이다 ({ 3 })

/giza.en-ko/ko-en.A3.final

# Sentence pair (1) source length 4 target length 3 alignment score : 0.31348

나는 학생 이다

NULL ({ }) i ({ 1 }) am ({ }) a ({ 3 }) student ({ 2 })

# Sentence pair (2) source length 4 target length 3 alignment score : 0.229885

나는 소년 이다

NULL ({ }) i ({ 1 }) am ({ }) a ({ 3 }) boy ({ 2 })

# Sentence pair (3) source length 4 target length 3 alignment score : 0.0728145

너는 소녀 이다

NULL ({ }) you ({ 1 }) are ({ }) a ({ 3 }) girl ({ 2 })

After the script has run, the files that are finally used for decoding have been generated.

The generated files are listed below.

info

moses.ini

phrase-table.0-0.1.1(1)

reordering-table.wbe-msd-bidirectional-fe

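If you open the phrase-table file among these,

it can differ from the phrase-table.gz (2) file in the model directory.

To be precise, (1) is a subset of (2):

filter-model-given-input.pl takes the file to be translated (input.txt) as an argument, and

every phrase in (2) whose source side does not occur in input.txt is left out of (1).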
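
The filtering idea can be sketched as follows: collect every word sequence that occurs in the input, then keep only the phrase-table lines whose source side is one of them. This is an illustration of the idea only, not the Perl script; the function names and the `max_len` default are assumptions.

```python
def source_ngrams(sentences, max_len=7):
    """All word n-grams (up to max_len words) that occur in the input."""
    ngrams = set()
    for sentence in sentences:
        tokens = sentence.split()
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
                ngrams.add(" ".join(tokens[start:end]))
    return ngrams


def filter_phrase_table(table_lines, input_sentences, max_len=7):
    """Keep the lines whose source phrase can occur in the input."""
    needed = source_ngrams(input_sentences, max_len)
    return [line for line in table_lines
            if line.split("|||")[0].strip() in needed]


# A few lines from the sample phrase table, filtered for one input sentence
TABLE = [
    "a ||| 이다 ||| 1 1 1 1 2.718 ||| ||| 3 3",
    "am a boy ||| 소년 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1",
    "am a student ||| 학생 이다 ||| 1 0.25 1 0.75 2.718 ||| ||| 1 1",
    "i ||| 나는 ||| 1 1 1 1 2.718 ||| ||| 2 2",
    "you ||| 너는 ||| 1 1 1 1 2.718 ||| ||| 1 1",
]
filtered = filter_phrase_table(TABLE, ["i am a boy"])
```

With the input "i am a boy", the entries for "am a student" and "you" are dropped because neither phrase occurs in the input.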

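Comparing English and Korean:

First, Korean has no counterpart of the English articles a and the.

Conversely, English has no counterpart of the Korean postpositional particles.

So, when preprocessing the parallel corpus for training, let's try attaching each English article to the noun that follows it,

and attaching each Korean particle to the noun that precedes it, or deleting the particle.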
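
The English half of this idea could be tried with a very small preprocessing step. The sketch below is only an assumption about how one might do it: the function name, the article list, and the "_" joiner are made up, and without a part-of-speech tagger it attaches the article to whatever token follows, which is not always a noun.

```python
ARTICLES = {"a", "an", "the"}


def attach_articles(sentence, joiner="_"):
    """Glue each article to the token that follows it."""
    out, pending = [], []
    for token in sentence.split():
        if token.lower() in ARTICLES:
            pending.append(token)  # hold the article until the next token
        else:
            out.append(joiner.join(pending + [token]))
            pending = []
    out.extend(pending)  # an article at the very end is kept as-is
    return " ".join(out)


merged = attach_articles("i am a student")
```

After this step "i am a student" has three tokens, the same number as 나는 학생 이다.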
