Курсы английского
<<  Natural Language Processing for the Web Create the Internet of Your Things: The Microsoft Vision for IoT  >>
Estimating the Selectivity of XML Path Expressions for Internet Scale
Estimating the Selectivity of XML Path Expressions for Internet Scale
Motivation
Motivation
What is XML
What is XML
Querying XML
Querying XML
Goal of this Work
Goal of this Work
Outline of Presentation
Outline of Presentation
Path Trees
Path Trees
Summarizing Path Trees
Summarizing Path Trees
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Sibling-* Summarization
Original Path Tree
Original Path Tree
Sibling-* Summarization
Sibling-* Summarization
Level-* Summarization
Level-* Summarization
Level-* Summarization
Level-* Summarization
Level-* Summarization
Level-* Summarization
Global-* Summarization
Global-* Summarization
Global-* Summarization
Global-* Summarization
Global-* Summarization
Global-* Summarization
No-* Summarization
No-* Summarization
No-* Summarization
No-* Summarization
No-* Summarization
No-* Summarization
Outline
Outline
Markov Tables
Markov Tables
Markov Tables
Markov Tables
Summarizing Markov Tables
Summarizing Markov Tables
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Suffix-* Summarization
Global-*, No-* Summarization
Global-*, No-* Summarization
Outline
Outline
Data Sets for Experiments
Data Sets for Experiments
Query Workloads
Query Workloads
Best Summarization Methods
Best Summarization Methods
Path Trees vs
Path Trees vs
Synthetic Data – Random Paths
Synthetic Data – Random Paths
Synthetic Data – Random Tags
Synthetic Data – Random Tags
DBLP Data – Random Paths
DBLP Data – Random Paths
DBLP Data – Random Tags
DBLP Data – Random Tags
When are Markov Tables Better
When are Markov Tables Better
Conclusions
Conclusions

Презентация на тему: «Estimating the Selectivity of XML Path Expressions for Internet Scale Applications». Автор: Ashraf Aboulnaga. Файл: «Estimating the Selectivity of XML Path Expressions for Internet Scale Applications.ppt». Размер zip-архива: 333 КБ.

Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

содержание презентации «Estimating the Selectivity of XML Path Expressions for Internet Scale Applications.ppt»
СлайдТекст
1 Estimating the Selectivity of XML Path Expressions for Internet Scale

Estimating the Selectivity of XML Path Expressions for Internet Scale

Applications

Ashraf Aboulnaga Alaa R. Alameldeen Jeffrey F. Naughton Computer Sciences Department University of Wisconsin - Madison

2 Motivation

Motivation

XML enables Internet scale applications that query data from many sources Niagara, Xyleme, … Queries over XML data use path expressions Optimizing these queries requires estimating the selectivity of the path expressions Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

3 What is XML

What is XML

<readings> <play> <title>Pygmalion</title> <author>Bernard Shaw</author> </play> <novel> <title>David Copperfield</title> <author>Charles Dickens</author> </novel> </readings>

4 Querying XML

Querying XML

FOR $n_auth IN document("*")//novel/author $p_auth IN document("*")//play/author WHERE $n_auth/text() = $p_auth/text() RETURN $n_auth Optimizing this query requires estimating the selectivity of the path expressions This requires information about the structure of the XML data

5 Goal of this Work

Goal of this Work

Build database statistics that capture the structure of XML data Ensure that the statistics fit in a small amount of memory For efficient query optimization Important for Internet scale applications Use the statistics to estimate the selectivity of simple XML path expressions //t1/t2/…/tn

6 Outline of Presentation

Outline of Presentation

Introduction Path Trees Markov Tables Performance Evaluation Conclusions

7 Path Trees

Path Trees

<A> <B> </B> <B> <D> </D> </B> <C> <D> </D> <E> </E> <E> </E> <E> </E> </C> </A>

8 Summarizing Path Trees

Summarizing Path Trees

Path trees contain all the information needed for selectivity estimation Problem: May not fit in available memory Small available memory Internet scale Remove low frequency nodes Removed nodes replaced with *-nodes Tag name: * meaning "any tag" Frequency: Average frequency of replaced nodes Sibling-*, Level-*, Global-*, No-*

9 Sibling-* Summarization

Sibling-* Summarization

10 Sibling-* Summarization

Sibling-* Summarization

A

1

11 Sibling-* Summarization

Sibling-* Summarization

A

1

I

2

12 Sibling-* Summarization

Sibling-* Summarization

A

1

I

J

2

4

13 Sibling-* Summarization

Sibling-* Summarization

*-nodes represent deleted sibling nodes Memory saved by coalescing nodes

A

1

14 Sibling-* Summarization

Sibling-* Summarization

A

1

E

5

*

f=6 n=2

15 Sibling-* Summarization

Sibling-* Summarization

A

1

E

H

5

6

*

f=6 n=2

16 Sibling-* Summarization

Sibling-* Summarization

A

1

D

E

H

7

5

6

*

f=6 n=2

17 Sibling-* Summarization

Sibling-* Summarization

A

1

*

f=6 n=2

18 Sibling-* Summarization

Sibling-* Summarization

A

1

C

9

*

f=12 n=2

H

6

*

f=6 n=2

19 Sibling-* Summarization

Sibling-* Summarization

A

1

C

9

*

f=12 n=2

G

H

10

6

*

f=6 n=2

20 Sibling-* Summarization

Sibling-* Summarization

A

1

C

9

*

f=12 n=2

*

f=6 n=2

21 Sibling-* Summarization

Sibling-* Summarization

A

1

C

9

*

f=12 n=2

*

f=16 n=2

*

f=6 n=2

K

f=23 n=2

22 Sibling-* Summarization

Sibling-* Summarization

A

1

C

9

*

*

6

8

*

K

f=23 n=2

3

23 Original Path Tree

Original Path Tree

24 Sibling-* Summarization

Sibling-* Summarization

Try to retain as much information as possible about the deleted nodes

A

1

C

9

*

*

6

8

*

K

f=23 n=2

3

25 Level-* Summarization

Level-* Summarization

26 Level-* Summarization

Level-* Summarization

A

1

D

E

H

7

5

6

I

J

2

4

27 Level-* Summarization

Level-* Summarization

Less information about deleted nodes than sibling-* Deletes fewer nodes than sibling-*

A

1

28 Global-* Summarization

Global-* Summarization

29 Global-* Summarization

Global-* Summarization

A

1

D

E

H

7

5

6

I

J

2

4

30 Global-* Summarization

Global-* Summarization

3

*

D

H

7

6

31 No-* Summarization

No-* Summarization

32 No-* Summarization

No-* Summarization

A

1

D

E

H

7

5

6

I

J

2

4

33 No-* Summarization

No-* Summarization

Memory savings similar to global-* Conservative assumption about deleted nodes

D

E

H

7

5

6

34 Outline

Outline

Introduction Path Trees Markov Tables Performance Evaluation Conclusions

35 Markov Tables

Markov Tables

A table of all distinct paths of length up to m and their frequencies For paths of length greater than m, combine paths from the Markov table Example: Uses "short memory" or "Markov" property

36 Markov Tables

Markov Tables

Path

Freq

Path

Freq

A

1

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

37 Summarizing Markov Tables

Summarizing Markov Tables

Exact selectivities for paths of length up to m Approximate selectivities for paths longer than m Problem: May not fit in available memory Remove low frequency paths Discard removed paths of length > 2 Replace removed paths of length 1 or 2 with *-paths Suffix-*, Global-*, No-*

38 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A

1

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

39 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A

1

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

0

**

0

40 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A

1

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

0

**

0

41 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

42 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

AC

6

B

11

AD

4

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { }

43 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

AC

6

B

11

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { (AD,4) }

44 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

AC

6

B

11

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { (AD,4) }

45 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

AC

6

B

11

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { (AD,4) }

46 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { }

47 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

BD

7

AB

11

CD

8

*

f=1,n=1

**

0

SD= { }

48 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

AB

11

CD

8

*

f=1,n=1

**

0

SD= { (BD,7) }

49 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

AB

11

CD

8

*

f=1,n=1

**

0

SD= { (BD,7) }

50 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

AB

11

*

f=1,n=1

**

0

SD= { (BD,7), (CD,8) }

51 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

AB

11

*

f=1,n=1

**

0

SD= { (BD,7), (CD,8) }

52 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

BC

9

D

19

AB

11

*

f=1,n=1

**

0

SD= { (BD,7), (CD,8) }

53 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

B*

f=16,n=2

D

19

AB

11

*

f=1,n=1

**

0

SD= { (CD,8) }

54 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

A*

f=10,n=2

B

11

C

15

B*

f=16,n=2

D

19

AB

11

*

f=1,n=1

**

0

SD= { (CD,8) }

55 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

B

11

C

15

B*

f=16,n=2

D

19

AB

11

*

f=1,n=1

**

f=10,n=2

SD= { (CD,8) }

56 Suffix-* Summarization

Suffix-* Summarization

Path

Freq

Path

Freq

B

11

C

15

B*

8

D

19

AB

11

*

1

**

6

SD= { }

57 Global-*, No-* Summarization

Global-*, No-* Summarization

Global-* Two *-paths, * and ** Deletes fewer paths than suffix-* to summarize the Markov table No-* No *-paths Conservatively assumes that paths not in the Markov table do not exist in the data

58 Outline

Outline

Introduction Path Trees Markov Tables Performance Evaluation Conclusions

59 Data Sets for Experiments

Data Sets for Experiments

Synthetic data set 100,000 XML elements Path tree: 3197 nodes, 6 levels, 38 KB Element frequencies: Zipfian (z=1) DBLP data set 1,399,765 XML elements Path tree: 5883 nodes, 6 levels, 69 KB

60 Query Workloads

Query Workloads

1,000 paths of length between 1 and 4 Random paths All query paths exist in the data Random tags Most query paths of length 2 or more do not exist in the data Available memory between 5 and 50 KB

61 Best Summarization Methods

Best Summarization Methods

Path trees Query paths in data: Global-* Query paths not in data: No-* Markov tables m = 2 is best Query paths in data: Suffix-* Query paths not in data: No-*

62 Path Trees vs

Path Trees vs

Markov Tables

When to use path trees and when to use Markov tables? Also compared against Pruned Suffix Trees (PSTs) [Chen et al, ICDE 2001] Can handle branching path expressions Can handle conditions on element values

63 Synthetic Data – Random Paths

Synthetic Data – Random Paths

64 Synthetic Data – Random Tags

Synthetic Data – Random Tags

65 DBLP Data – Random Paths

DBLP Data – Random Paths

66 DBLP Data – Random Tags

DBLP Data – Random Tags

67 When are Markov Tables Better

When are Markov Tables Better

DBLP Repeated sub-structures effectively captured by Markov tables

<sigmod> <inproceedings> <author>…</author> … </inproceedings> … </sigmod>

<vldb> <inproceedings> <author>…</author> … </inproceedings> … </vldb>

68 Conclusions

Conclusions

Novel statistics for estimating the selectivity of XML path expressions Scale to "all the XML data on the Internet" More accurate than best previously known alternative Repeated sub-structures: Markov tables No repeated sub-structures: Path trees Query paths exist in the data: Global-*, Suffix-* Query paths do not exist in the data: No-* To appear in VLDB 2001

«Estimating the Selectivity of XML Path Expressions for Internet Scale Applications»
http://900igr.net/prezentacija/anglijskij-jazyk/estimating-the-selectivity-of-xml-path-expressions-for-internet-scale-applications-238688.html
cсылка на страницу
Урок

Английский язык

29 тем
Слайды
900igr.net > Презентации по английскому языку > Курсы английского > Estimating the Selectivity of XML Path Expressions for Internet Scale Applications