
Presentation: «Applying Syntactic Similarity Algorithms for Enterprise Information Management». Author: Lucy Cherkasova. File: «Applying Syntactic Similarity Algorithms for Enterprise Information Management.ppt». Zip archive size: 276 KB.

Applying Syntactic Similarity Algorithms for Enterprise Information Management

Contents of the presentation «Applying Syntactic Similarity Algorithms for Enterprise Information Management.ppt»
1 Applying Syntactic Similarity Algorithms for Enterprise Information Management

Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch (Hewlett-Packard Labs)

2 New Applications in the Enterprise

- Document deletion and compliance rules: how do you identify all the users who might have a copy of these files?
- E-Discovery: identify and retrieve a complete set of related documents (all earlier or later versions of the same document).
- Simplify the review process: within the set of semantically similar documents returned to the expert, identify clusters of syntactically similar documents.
- Keep document repositories up to date: identify and filter out documents that are largely duplicates of newer versions, in order to improve the quality of the collection.

3 Syntactic Similarity

- Syntactic similarity is useful for identifying documents with a large textual intersection.
- Syntactic similarity algorithms are defined entirely by the syntactic (text) properties of the documents.
- Shingling technique (Broder et al.): originally aimed at identifying near-duplicates on the web; document A is represented by its set of shingles (sequences of adjacent words).

4 Shingling technique

S(A) = {w1, w2, …, wj, …, wN} is the set of all shingles in document A.
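As an illustration only (not from the original slides), here is a minimal Python sketch of word-level shingling; the shingle length of 4 words and the helper name `shingles` are assumed for the example:

```python
import re

def shingles(text, k=4):
    """Return S(A): the set of all k-word shingles (sequences of adjacent words)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

# Two toy documents that share most of their text
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
print(shingles(doc_a) & shingles(doc_b))  # shingles common to both documents
```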

5 Basic Metrics

- Similarity metric: documents A and B are ~similar.
- Containment metric: document A is ~contained in B.
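In Broder's standard formulation the two metrics are r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| and c(A, B) = |S(A) ∩ S(B)| / |S(A)|. A minimal Python sketch over toy shingle sets (the sets are illustrative values only):

```python
def resemblance(sa, sb):
    """Similarity metric: |S(A) & S(B)| / |S(A) | S(B)| -- near 1 when A and B are ~similar."""
    return len(sa & sb) / len(sa | sb)

def containment(sa, sb):
    """Containment metric: |S(A) & S(B)| / |S(A)| -- near 1 when A is ~contained in B."""
    return len(sa & sb) / len(sa)

# Toy shingle sets standing in for S(A) and S(B)
sa = {"w1", "w2", "w3", "w4"}
sb = {"w2", "w3", "w4", "w5", "w6"}
print(resemblance(sa, sb))  # 3 / 6 = 0.5
print(containment(sa, sb))  # 3 / 4 = 0.75
```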

6 Shingling-Based Approach

- Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles.
- 64-bit Rabin fingerprints are used because of their fast software implementation.
- To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature, e.g., instead of 1000 shingles take a sample of 100 shingles.
- Different ways of sampling the shingles lead to different syntactic similarity algorithms.
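A minimal sketch of the fingerprinting step; the paper uses 64-bit Rabin fingerprints, while here a truncated SHA-1 digest stands in purely for illustration:

```python
import hashlib

def fingerprint(shingle):
    """64-bit fingerprint of a shingle (a stand-in for a Rabin fingerprint)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fingerprinted_shingles(shingle_set):
    """S(A) as a set of 64-bit fingerprints instead of raw word sequences."""
    return {fingerprint(s) for s in shingle_set}

print(fingerprinted_shingles({("the", "quick", "brown", "fox")}))
```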

7 Four Algorithms

We compare the performance and properties of four syntactic similarity algorithms:
- three shingling-based algorithms (Minn, Modn, Sketchn), and
- one chunking-based algorithm (BSWn).
The three shingling-based algorithms differ in how they sample the set of document shingles to build the document signature.

8 Minn Algorithm

Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles of document A. Minn selects the n numerically smallest fingerprinted shingles. Documents are therefore represented by fixed-size signatures.
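A minimal Python sketch of Minn over fingerprinted shingles; the signature size n = 100 and the random toy fingerprints are illustrative values only:

```python
import random

def min_n_signature(fingerprints, n=100):
    """Min_n: keep the n numerically smallest fingerprinted shingles (fixed-size signature)."""
    return set(sorted(fingerprints)[:n])

# Toy 64-bit "fingerprints" standing in for the fingerprinted shingles of a document
random.seed(0)
fps = {random.getrandbits(64) for _ in range(1000)}
signature = min_n_signature(fps, n=100)
print(len(signature))  # 100
```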

9 Modn Algorithm

Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles of A. Modn selects all fingerprints whose value modulo n is zero. Example: if n = 100 and A is 1000 bytes, then Mod100(A) is represented by approximately 10 fingerprints. Documents are represented by variable-size signatures (proportional to the document size).
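A corresponding sketch of Modn, again over illustrative random fingerprints; with n = 100, roughly one fingerprint in a hundred is retained, matching the example above:

```python
import random

def mod_n_signature(fingerprints, n=100):
    """Mod_n: keep every fingerprint whose value is 0 modulo n (variable-size signature)."""
    return {f for f in fingerprints if f % n == 0}

random.seed(0)
fps = {random.getrandbits(64) for _ in range(1000)}
signature = mod_n_signature(fps, n=100)
print(len(signature))  # on the order of 1000 / 100 = 10 fingerprints
```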

10 Sketchn Algorithm

- Each shingle is fingerprinted with a family of independent hash functions f1, …, fn.
- For each fi, the smallest fingerprint value over all shingles is retained in the sketch: min {fi(w1), fi(w2), …, fi(wN)}.
- Documents are represented by fixed-size signatures: {min f1(A), min f2(A), …, min fn(A)}.
- This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
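A minimal Python sketch of Sketchn; simulating the family of independent hash functions by salting a single hash with the index i is an implementation shortcut for illustration, not the construction used in the paper:

```python
import hashlib

def fi(i, shingle):
    """The i-th hash function f_i of the family (simulated by salting SHA-1 with i)."""
    data = f"{i}|{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def sketch_n(shingle_set, n=100):
    """Sketch_n: { min f_1(A), min f_2(A), ..., min f_n(A) } -- a fixed-size signature."""
    return [min(fi(i, s) for s in shingle_set) for i in range(n)]

def estimated_similarity(sketch_a, sketch_b):
    """Fraction of matching sketch entries, which approximates the fraction of common shingles."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

a = {("the", "quick", "brown", "fox"), ("quick", "brown", "fox", "jumps")}
b = {("the", "quick", "brown", "fox"), ("quick", "brown", "fox", "leaps")}
print(estimated_similarity(sketch_n(a, n=64), sketch_n(b, n=64)))
```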

11 BSWn (Basic Sliding Window) Algorithm

- The document is represented by its chunks.
- Documents are represented by variable-size signatures (the signature size is proportional to the document size).
- Example: if n = 100 and A is 1000 bytes, then BSW100(A) is represented by approximately 10 fingerprints.
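A sketch of the basic sliding-window chunking idea, under the usual content-defined-chunking assumption that a chunk boundary is declared whenever the fingerprint of the current window is 0 modulo n; the 16-byte window and the SHA-1 stand-in for a rolling Rabin fingerprint are illustrative choices, not the paper's settings:

```python
import hashlib

def window_fp(buf):
    """Fingerprint of the current window (SHA-1 stands in for a rolling Rabin fingerprint)."""
    return int.from_bytes(hashlib.sha1(buf).digest()[:8], "big")

def bsw_chunks(data, n=100, window=16):
    """BSW_n: slide a fixed-size window over the bytes and cut a chunk whenever the
    window fingerprint is 0 modulo n; the list of chunks (and hence the signature
    built from the chunk fingerprints) grows with the document size."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if window_fp(data[i - window:i]) % n == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk
    return chunks

text = ("Applying syntactic similarity algorithms to enterprise document collections. " * 20).encode("utf-8")
print(len(bsw_chunks(text, n=100)))  # expected average chunk size is about n bytes for random content
```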

12 Algorithm’s Properties and Parameters

Algorithm parameters:
- sliding window size
- sampling frequency
Published papers use very different values for these parameters.
Questions:
- How sensitive is the similarity metric to different values of the algorithm parameters?
- How do the four algorithms compare?

13 Objective and Fair Comparison

- How can the algorithms be compared objectively? One document collection might favor a particular algorithm, while another collection might show better results for a different algorithm.
- Can we design a framework for a fair comparison?
- Can the same framework be used for sensitivity analysis of the parameters?

14 Methodology

Apply a controlled set of modifications to a given document set: add or remove words in the documents a predefined number of times.

15 Methodology

- Research corpus RCorig: 100 different HP Labs technical reports from 2007, converted to a text format.
- Modifications are introduced to the documents in a controlled way: words are added to or removed from each document a predefined number of times, either at random positions or uniformly spread through the document (see the sketch below).
- RCia = {RCorig, where the word "a" is inserted into each document i times}.
- A new average similarity metric is then computed over the modified collection.
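A minimal sketch of the controlled-modification step; the function name and parameter values are our own, chosen for illustration:

```python
import random

def insert_word(text, word="a", times=50, uniform=False, seed=0):
    """Return a modified copy of the document with `word` inserted `times` times,
    either uniformly spread through the text or at random positions."""
    words = text.split()
    if uniform:
        positions = [len(words) * k // times for k in range(times)]
    else:
        rnd = random.Random(seed)
        positions = sorted(rnd.randrange(len(words) + 1) for _ in range(times))
    for offset, pos in enumerate(positions):
        words.insert(pos + offset, word)
    return " ".join(words)

original = "word " * 200
modified = insert_word(original, word="a", times=50, uniform=True)
print(len(original.split()), len(modified.split()))  # 200 vs 250 words
```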

16 Sensitivity to Sliding Window Size

A window size of 20 (about 4 words) is a good choice; a larger window significantly decreases the similarity metric.

17 Frequency Sampling

- Results for the RCa50 collection show a large variance in similarity metric values across documents when the sampling frequency is lower.
- The frequency sampling parameter depends on the document length distribution and should be tuned accordingly.
- There is a trade-off between accuracy and storage requirements.

18 Comparison of Similarity Algorithms

Sketchn and BSWn are more sensitive to the number of changes in the documents (especially in short ones) than Modn and Minn.

19 Case study using Enterprise Collections

Two enterprise collections: Collection_1 with 5040 documents; Collection_2 with 2500 documents.


20 Results

- The Modn and Minn algorithms identified a higher number of similar documents (with Modn being the leader). However, Modn also has a higher number of false positives.
- For longer documents the difference between the algorithms is smaller. Moreover, for long documents (larger than 100 KB) BSWn and related chunking-based algorithms might be a better choice, both accuracy-wise and storage-wise.

21 Runtime Comparison

Executing Sketchn is more expensive, especially for larger window sizes.

22 Conclusion

- Syntactic similarity is useful for identifying documents with a large textual intersection.
- We designed a framework for fair algorithm comparison: we compared the performance of four syntactic similarity algorithms and identified a useful range of their parameters.
- Future work: modify, refine, and optimize the BSW algorithm, since chunking-based algorithms are actively used for deduplication in enterprise backup and storage solutions.

23 Sensitivity to Sliding Window Size

Potentially, the Modn algorithm might have a higher rate of false positives.
