get tf and df using suffix array

알고리즘

get tf and df using suffix array

고요한하늘... 2013. 7. 30. 18:15

paper : Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus.pdf

suffix array를 구하는 것은 간단하다.

문자열이 "to be$or$not to be$"라고 주어졌을때( $를 기준으로 3개의 문서로 간주 )

string =
[ 0] [to be$or$not to be$]
[ 1] [o be$or$not to be$]
[ 2] [ be$or$not to be$]
[ 3] [be$or$not to be$]
[ 4] [e$or$not to be$]
[ 5] [$or$not to be$]
[ 6] [or$not to be$]
[ 7] [r$not to be$]
[ 8] [$not to be$]
[ 9] [not to be$]
[10] [ot to be$]
[11] [t to be$]
[12] [ to be$]
[13] [to be$]
[14] [o be$]
[15] [ be$]
[16] [be$]
[17] [e$]
[18] [$]
위처럼 문자열을 잘라서

정렬을 하면 suffix array가 만들어진다.

정렬후에는

string =

[15] [ be$]
[ 2] [ be$or$not to be$]
[12] [ to be$]
[18] [$]
[ 8] [$not to be$]
[ 5] [$or$not to be$]
[16] [be$]
[ 3] [be$or$not to be$]
[17] [e$]
[ 4] [e$or$not to be$]
[ 9] [not to be$]
[14] [o be$]
[ 1] [o be$or$not to be$]
[ 6] [or$not to be$]
[10] [ot to be$]
[ 7] [r$not to be$]
[11] [t to be$]
[13] [to be$]
[ 0] [to be$or$not to be$]

위처럼 될것이다.

문제는 여기서

TF와 IDF를 계산하는 방법인데

lcp(longest common prefixes)를 먼저 계산하면

[18] [$]
[ 8] [$not_to_be$]
[ 5] [$or$not_to_be$]
[15] [_be$]
[ 2] [_be$or$not_to_be$]
[12] [_to_be$]

lcp[0] = 0

lcp[1] = 4

lcp[2] = 1

lcp[3] = 0

lcp[4] = 1

lcp[1] = 4를 보면

string[0]과

string[1]

string[0] = [15] [ be$]
string[1] = [ 2] [ be$or$not to be$]

에서 동일한 문자열길이( " be$" ) 즉 4가 된다.

이렇게 구해진 lcp를 가지고 TF, IDF를 계산하면 되는데

TF는 j - i + 1로 계산하고

DF의 경우

각 문자열의 몇번째 문서인지를 확인할수 있는 배열을 두고 카운팅 한다.

즉 "to be$or$not to be"

t o b e $ o r $ n o t t o b e $

docid = 0 0 0 0 0 0 1 1 1 2 2 2 2 2 2 2 2 2 2

자세한 설명은 excel 파일을 참고

suffix_array.xlsx

0.01MB

Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus.pdf

1.68MB

저작자표시