<<  Windows Media Photo A new format for end-to-end digital imaging Reseach for lauching a New Product line  >>
Suffix Arrays: A new method for on-line string searches
Suffix Arrays: A new method for on-line string searches
Introduction - Problem definition
Introduction - Problem definition
Introduction  what is a suffix array
Introduction what is a suffix array
Introduction  what is a suffix array
Introduction what is a suffix array
Introduction  what is a suffix tree
Introduction what is a suffix tree
The Article Overview
The Article Overview
The Search algorithm - Definitions
The Search algorithm - Definitions
The Search algorithm  how does the array help us know if W is a
The Search algorithm how does the array help us know if W is a
Example:
Example:
Why finding LW, RW == Finding the matches: If LW > RW => W is not a
Why finding LW, RW == Finding the matches: If LW > RW => W is not a
The Search algorithm  The easy way - O(PlogN)
The Search algorithm The easy way - O(PlogN)
The Search algorithm using lcp values in O(P+logN)  Definitions:
The Search algorithm using lcp values in O(P+logN) Definitions:
The Search algorithm using lcp values in O(P+logN) Example: A=abcx
The Search algorithm using lcp values in O(P+logN) Example: A=abcx
So how do we use l,r,Llcp[M]
So how do we use l,r,Llcp[M]
Example: W=abcx (cont
Example: W=abcx (cont
Example: W=abcx (cont
Example: W=abcx (cont
The Search algorithm using lcp values- complexity
The Search algorithm using lcp values- complexity
The Article Overview
The Article Overview
Construction of suffix array in O(NlogN)
Construction of suffix array in O(NlogN)
Construction of suffix array  The general idea
Construction of suffix array The general idea
Construction of suffix array  The general idea (cont
Construction of suffix array The general idea (cont
Construction of suffix array  The algorithm
Construction of suffix array The algorithm
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  The algorithm Example:
Construction of suffix array The algorithm Example:
Construction of suffix array  Complexity Summary
Construction of suffix array Complexity Summary
The Article Overview
The Article Overview
How to find Longest Common Prefixes  the general idea
How to find Longest Common Prefixes the general idea
How to find Longest Common Prefixes  the general idea
How to find Longest Common Prefixes the general idea
How to find Longest Common Prefixes  the general idea
How to find Longest Common Prefixes the general idea
How to find lcp  algorithm and data structures  Hgt[]
How to find lcp algorithm and data structures Hgt[]
How to find lcp  Hgt[] example:
How to find lcp Hgt[] example:
How to find lcp  Hgt[] example (cont
How to find lcp Hgt[] example (cont
How to find lcp  data structures
How to find lcp data structures
How to find lcp  example of Interval tree
How to find lcp example of Interval tree
How to find lcp  Complexity
How to find lcp Complexity
The Article Overview
The Article Overview
Time Expected-case Improvement of the construction of pos[]
Time Expected-case Improvement of the construction of pos[]
Time Expected-case Improvement of the construction of pos[]
Time Expected-case Improvement of the construction of pos[]
Example of the mapping
Example of the mapping
Time Expected-case Improvement of the construction of pos[]
Time Expected-case Improvement of the construction of pos[]
So is a suffix array better then a suffix tree
So is a suffix array better then a suffix tree
The End
The End

: Suffix Arrays: A new method for on-line string searches. : orenw. : Suffix Arrays: A new method for on-line string searches.ppt. zip-: 615 .

Suffix Arrays: A new method for on-line string searches

Suffix Arrays: A new method for on-line string searches.ppt
1 Suffix Arrays: A new method for on-line string searches

Suffix Arrays: A new method for on-line string searches

Udi Manber Gene Myers May 1989 Presented by: Oren Weimann

1

2 Introduction - Problem definition

Introduction - Problem definition

W= badgfbb

A= abccbbadgfbbcahgjf

A= abccbbadgfbbcahgjf

Is W a substring of A? |A|=N and |W|=P A = a0a1aN-1 Ai = suffix beginning at index i = aiai+1aN-1

2

3 Introduction  what is a suffix array

Introduction what is a suffix array

Example:

A = assassin

Pos[2] = 6 (A6 = in)

Pos

3

0

1

2

3

4

5

6

7

4 Introduction  what is a suffix array

Introduction what is a suffix array

A lexicographically sorted array- Pos[N], of all the suffixes of A: Pos[k] = i ? Ai is the kth smallest suffix in the set {A0, A1, A2 AN-1}

4

5 Introduction  what is a suffix tree

Introduction what is a suffix tree

Example:

A = assassin

A trie that contains all suffixes of A:

5

s

a

s

s

a s s i n

s

1

i

i

a

n

i

n

s

a

n

i

s

s

6

s

5

4

i

n

i

n

n

0

3

2

0

1

2

3

4

5

6

7

6 The Article Overview

The Article Overview

A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement.

6

7 The Search algorithm - Definitions

The Search algorithm - Definitions

For any string u, up = u1u2u3.up (or u if |u| p) Let denote a Lexicographical order, We say u v ? up vp Note that for any choice of p: Note that W is a substring of A ? there is an i such that W

7

8 The Search algorithm  how does the array help us know if W is a

The Search algorithm how does the array help us know if W is a

substring of A?

We define a search interval: LW = min {k | W APos[k] or k = N} RW = max {k | W APos[k] or k = -1} W matches ai ai+1 ...ai+P-1 ? i=Pos[k] for some k [LW, RW]

8

9 Example:

Example:

A = assassin

Pos

9

Option 1

Option 2

Option 3

0

1

2

3

4

5

6

7

10 Why finding LW, RW == Finding the matches: If LW > RW => W is not a

Why finding LW, RW == Finding the matches: If LW > RW => W is not a

substring of A. Else: there are (RW-LW+1) matches - APos[LW],, APos[RW]

10

Pos

W>APos[k]

W<APos[k]

LW

RW

11 The Search algorithm  The easy way - O(PlogN)

The Search algorithm The easy way - O(PlogN)

W=abcx

11

Log(N) iterations, each iteration sets new L,R bonds (initially L=0, R=N-1) according to a comparison of W with APos[M] , where M=(L+R)/2. In the end LW R

M

R

L

Pos

abcde...

abcdf...

abd...

12 The Search algorithm using lcp values in O(P+logN)  Definitions:

The Search algorithm using lcp values in O(P+logN) Definitions:

Speedup using precomputed lcp Values, for now We assume lcp is known. Each iteration We define: l = lcp(APos[L], W) r=lcp(W, APos[R]) Llcp[M] = lcp(APos[L] APos[M]) Rlcp[M] = lcp(APos[M], APos[R])

12

13 The Search algorithm using lcp values in O(P+logN) Example: A=abcx

The Search algorithm using lcp values in O(P+logN) Example: A=abcx

Note that Llcp[M] is well defined because every midpoint M has one LM and one RM

13

M

R

L

l = 3

r = 2

Pos

Llcp[M]=4

Rlcp[M]=2

abcde...

abcdf...

abd...

14 So how do we use l,r,Llcp[M]

So how do we use l,r,Llcp[M]

Example: W=abcx

14

Case 1: Llcp[M] > l (Llcp[M]=4 and l=3 ) W>APos[L] W>APos[M] Go right l is unchanged = 3

R

L

M

Llcp[M]=4

l=3

r=2

abcde...

abc...

abc...

abcdf

abd

15 Example: W=abcx (cont

Example: W=abcx (cont

15

Case 2: Llcp[M] < l (Llcp[M]=2 and l=3 ) APos[L] <APos[M] W<APos[M] Go left r = Llcp[M] = 2

M

L

R

Llcp[M]=2

l=3

r=2

abcde...

abdf

abd

16 Example: W=abcx (cont

Example: W=abcx (cont

16

Case 3: Llcp[M] = l (Llcp[M]=3 and l=3 ) Compare Wl and APos[M]l until Wl+j APos[M]l+j Go right or left according to Wl+j, APos[M]l+j new l or r = (l+j) Number of comparisons = j+1

L

M

R

Llcp[M]=3

r=2

l=3

abcde...

abc...

abc...

abcp

abd

17 The Search algorithm using lcp values- complexity

The Search algorithm using lcp values- complexity

In each iteration there are maximum j+1 comparisons, when in total Total comparisons (P + #Iterations) O(P+logN) running time Requires only 3N-sized arrays

17

18 The Article Overview

The Article Overview

A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement.

18

19 Construction of suffix array in O(NlogN)

Construction of suffix array in O(NlogN)

Sorting the suffixes in a unique Radix sort We Will have O(logN) stages (numbered 1,2,4,8,16) In stage H the suffixes are sorted in buckets called H Buckets, according to the first H characters. (next stage is 2H thus, in stage H the suffixes are sorted by )

19

20 Construction of suffix array  The general idea

Construction of suffix array The general idea

If Ai, Aj H-bucket we Sort them by the Next H symbols, but: Their next H symbols = first H symbols of Ai+H and Aj+H which are already sorted in phase H.

20

H=2:

first bucket

second bucket

third bucket

fourth bucket

Ai

Aj

Aj+H

Ai+H

abef

abcd

ab

bb...

bb

cd

cd

ef

21 Construction of suffix array  The general idea (cont

Construction of suffix array The general idea (cont

Let Ai be in first H-bucket after stage H Ai starts with smallest H-symbol string Ai-H should be first in its H-bucket

21

H=2:

Ai

Ai-H

abef

abcd

ab

bb...

bb

cdef

cdab

ef

22 Construction of suffix array  The algorithm

Construction of suffix array The algorithm

22

Go over the suffix array: For each Ai: Move Ai-H to next available place in its H-bucket The suffixes are now sorted according to -order Go over the array again, and decide which suffix opens a new 2H-bucket, use lcs knowledge (described later)

23 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

23

A2

A3

Ai sets Ai-1

assin

assassin

in

n

sin

ssin

sassin

ssassin

0

1

2

3

4

5

6

7

24 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

24

A0

Ai sets Ai-1

assin

assassin

in

n

sassin

ssin

sin

ssassin

0

1

2

3

4

5

6

7

25 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

25

A6

A5

Ai sets Ai-1

assin

assassin

in

n

sassin

ssin

sin

ssassin

0

1

2

3

4

5

6

7

26 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

26

A6

A7

Ai sets Ai-1

assin

assassin

in

n

sassin

sin

ssin

ssassin

0

1

2

3

4

5

6

7

27 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

27

A2

A1

Ai sets Ai-1

assin

assassin

in

n

sassin

sin

ssin

ssassin

0

1

2

3

4

5

6

7

28 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

28

A4

A5

Ai sets Ai-1

assin

assassin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

29 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

29

A0

A1

Ai sets Ai-1

assin

assassin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

30 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

30

A3

A4

Ai sets Ai-1

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

31 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=1

Go over array to get new 2-buckets

31

lcs(sassin,sin)= 1+ lcs(assin,in)= 1+0=1 so sin opens a new 2-bucket

Ai sets Ai-1

back

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

32 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

32

A0

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

33 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

33

A1

A3

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

34 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

34

A4

A6

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

35 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

35

A7

A5

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

36 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

36

A2

A0

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

37 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

37

A3

A5

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

38 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

38

A1

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

39 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

39

A2

A4

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

40 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=2

Go over array to get new 4-buckets

40

Ai sets Ai-2

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

41 Construction of suffix array  The algorithm Example:

Construction of suffix array The algorithm Example:

A = assassin

H=4

Thats it, we are sorted!

41

assassin

assin

in

n

sassin

sin

ssassin

ssin

0

1

2

3

4

5

6

7

42 Construction of suffix array  Complexity Summary

Construction of suffix array Complexity Summary

42

Sorting by first char O(N) O(logN) stages of O(N) operations = O(NlogN) Total - time: O(NlogN) - space: 2 integer arrays of size N back

43 The Article Overview

The Article Overview

A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement.

43

44 How to find Longest Common Prefixes  the general idea

How to find Longest Common Prefixes the general idea

We dont care what is the lcp between suffixes in the same H-bucket. For Ap, Aq in the same H-bucket but different 2H-buckets: H lcp(Ap, Aq) < 2H lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) lcp(Ap+H, Aq+H) < H ? that is why Ap+H,Aq+H Are in different H-buckets, but which ones?

44

45 How to find Longest Common Prefixes  the general idea

How to find Longest Common Prefixes the general idea

If Ap+H and Aq+H were in adjacent H-buckets then lcp is known. how? If not, Then: lcp(APos[i], APos[j]) = {lcp(APos[k],APos[k+1])}

45

46 How to find Longest Common Prefixes  the general idea

How to find Longest Common Prefixes the general idea

lcp(Ap+H, Aq+H) = min{1,1,2} = 1

46

H=2

Ap+h

Aq+h

Notice that if 2 neighbors are in the same H-bucket we can consider there lcp to be H, since lcp(Ap+H, Aq+H) < H

1 1 2

assassin

assin

in

n

sassin

sin

ssassin

ssin

47 How to find lcp  algorithm and data structures  Hgt[]

How to find lcp algorithm and data structures Hgt[]

During the construction stage, we build an array Called Hgt[N]: Hgt(i)=lcp(APos[i-1], APos[i]), initialized so that Hgt[i]=N+1 for every i. In stage H=1: Hgt(i)=0 for APos[i] that are first in their buckets. In stage 2H: we update every Hgt(i) that APos[i] is the first in a newly created 2H bucket

47

48 How to find lcp  Hgt[] example:

How to find lcp Hgt[] example:

48

lcp(ssin,sin)=1+lcp(sin,in)=1+min{lcp(in,n),lcp(sin, n)}=1

49 How to find lcp  Hgt[] example (cont

How to find lcp Hgt[] example (cont

lcp(assassin,assin)=2+lcp(sin, sassin)=2+1=3

lcp(ssin, ssassin)=2+lcp(in, assin)=2+0=2

49

50 How to find lcp  data structures

How to find lcp data structures

We need a data structure that will contain lcp(APos[j], APos[i]) between any i and j (not just i and i+1 which Hgt[] supplies) Hgt[] will become the leaves of a binary balanced tree called the Interval tree.

50

51 How to find lcp  example of Interval tree

How to find lcp example of Interval tree

51

0

9

0

0

0

9

9

0

0

0

9

9

9

52 How to find lcp  Complexity

How to find lcp Complexity

Each time a leaf opens a new bucket we change Hgt[i] for that leaf. That change requires O(logN) changes in the interval tree There are O(N) leaves opening new bucket In total we get O(NlogN) to get all lcp values

52

53 The Article Overview

The Article Overview

A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement.

53

54 Time Expected-case Improvement of the construction of pos[]

Time Expected-case Improvement of the construction of pos[]

54

Assumptions: - All N-symbol strings are equally likely. Under this assumption: Expected length of longest repeated substring = O(log| |N) This immediately implies that construction of pos[] is reduced to O(NLogLogN). why? Next is a way to reduce it to O(N).

55 Time Expected-case Improvement of the construction of pos[]

Time Expected-case Improvement of the construction of pos[]

55

Let T = We encode each possible T length string to an integer with the isomorphism IntT(u) Map each AP to IntT(AP) [0,| |T-1] : IntT(AP) = ap| |T-1 +

56 Example of the mapping

Example of the mapping

56

IntT(AP) = ap| |T-1 +

0*4^0 + 0

0

3*4^0 + 0

3

3*4^0 + 0

3

0*4^0 + 0

0

3*4^0 + 0

3

3*4^0 + 0

3

1*4^0 + 0

1

2*4^0 + 0

2

| |= 4 , a=0, i=1, n=2, s=3 N=8 T= =1

57 Time Expected-case Improvement of the construction of pos[]

Time Expected-case Improvement of the construction of pos[]

57

By the definition of IntT(AP) it takes O(N) to compute all IntT(AP) values of all suffixes. So now instead of starting with H=1 we start with H= But since the longest repeated substring length is O(log| |N) we will have O(1) stages of the radix sort. Thus, the total time for constructing pos[] = O(N)

58 So is a suffix array better then a suffix tree

So is a suffix array better then a suffix tree

Suffix array

Suffix tree

58

Construction time

O(NlogN) - for small | | O(N) needs additional space

O(N)

Time Complexity

O(P+logN) good for large alphabets

O(Plog| |)

Space Complexity

requires 2N integers this is the main advantage.

O(N)

dependent on | | ?

No

Yes

59 The End

The End

59

Suffix Arrays: A new method for on-line string searches
http://900igr.net/prezentacija/anglijskij-jazyk/suffix-arrays-a-new-method-for-on-line-string-searches-240140.html
c

29
900igr.net > > > Suffix Arrays: A new method for on-line string searches