Applying Syntactic Similarity Algorithms for Enterprise Information Management

Contents of the presentation «Applying Syntactic Similarity Algorithms for Enterprise Information Management.ppt»

Slide 1. Applying Syntactic Similarity Algorithms for Enterprise Information Management
Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch. Hewlett-Packard Labs.

Slide 2. New Applications in the Enterprise
Document deletion and compliance rules: how do you identify all the users who might have a copy of these files? E-Discovery: identify and retrieve a complete set of related documents (all earlier or later versions of the same document). Simplify the review process: in the set of semantically similar documents (returned to the expert), identify clusters of syntactically similar documents. Keep the document repositories up to date: identify and filter out documents that are largely duplicates of newer versions in order to improve the quality of the collection.

Slide 3. Syntactic Similarity
Syntactic similarity is useful for identifying documents with a large textual intersection. Syntactic similarity algorithms are entirely defined by the syntactic (text) properties of the document. The shingling technique (Broder et al.) was designed to identify near-duplicates on the web: document A is represented by its set of shingles (sequences of adjacent words).

Slide 4. Shingling Technique
S(A) = {w1, w2, …, wj, …, wN} is the set of all shingles in document A.
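The shingle set S(A) is easy to make concrete. The sketch below is not taken from the slides: it assumes word-level shingles of a fixed length (k = 4 words, in line with the window-size discussion later in the deck) and represents each shingle as a plain word tuple.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a document.

    Each shingle is a tuple of k adjacent words, so the result is the
    set S(A) from the slides for document A.
    """
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

# Two near-duplicate sentences share most of their shingles.
a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
```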
Slide 5. Basic Metrics
Similarity metric (documents A and B are ~similar). Containment metric (document A is ~contained in B).
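The slide names the two metrics without spelling them out. In the shingling literature (Broder et al.) they are usually the Jaccard ratio and the containment ratio over the shingle sets, which is the assumption behind this sketch.

```python
def similarity(sa, sb):
    """Similarity of A and B: shared shingles over the union (Jaccard ratio)."""
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def containment(sa, sb):
    """Containment of A in B: shared shingles over the shingles of A."""
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)
```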
Slide 6. Shingling-Based Approach
Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles; 64-bit Rabin fingerprints are used because of their fast software implementation. To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature, i.e., instead of 1000 shingles take a sample of 100 shingles. Different ways of sampling the shingles lead to different syntactic similarity algorithms.

Slide 7. Four Algorithms
We compare the performance and properties of four syntactic similarity algorithms: three shingling-based algorithms (Min_n, Mod_n, Sketch_n) and a chunking-based algorithm (BSW_n). The three shingling-based algorithms differ in how they sample the set of document shingles and build the document signature.

Slide 8. Min_n Algorithm
Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for document A. Min_n selects the n numerically smallest fingerprinted shingles. Documents are represented by fixed-size signatures.
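A minimal sketch of Min_n under the same assumptions: shingles are hashed to 64-bit integers (an ordinary hash standing in for Rabin fingerprints), and the n numerically smallest values form the fixed-size signature.

```python
import hashlib


def fingerprint(shingle):
    """64-bit fingerprint of a shingle; a stand-in for Rabin fingerprints."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def min_n_signature(shingle_set, n=100):
    """Min_n: keep the n numerically smallest fingerprinted shingles."""
    return sorted(fingerprint(s) for s in shingle_set)[:n]
```

Signatures are then compared with the same intersection-based metrics as above, but over at most n fingerprints per document instead of the full shingle sets.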
Slide 9. Mod_n Algorithm
Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for A. Mod_n selects all fingerprints whose value modulo n is zero. Example: if n = 100 and A is 1000 bytes, then Mod_100(A) is represented by approximately 10 fingerprints. Documents are represented by variable-size signatures (proportional to the document size).
Slide 10. Sketch_n Algorithm
Each shingle is fingerprinted with a family of independent hash functions f1, …, fn. For each fi, the fingerprint with the smallest value is retained in the sketch. Documents are represented by fixed-size signatures: {min f1(A), min f2(A), …, min fn(A)}. This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
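Sketch_n needs a family of n independent hash functions. A common trick, assumed here rather than taken from the slides, is to derive them by salting a single base hash with the function index; the sketch is then the per-function minimum, which gives an exactly n-entry signature.

```python
import hashlib


def sketch_n_signature(shingle_set, n=100):
    """Sketch_n: for each hash function f_i, keep the minimum over all shingles."""
    def f_i(i, shingle):
        # Simulate n independent hash functions by salting with the index i.
        data = (str(i) + " " + " ".join(shingle)).encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    if not shingle_set:
        return []
    return [min(f_i(i, s) for s in shingle_set) for i in range(n)]
```

The fraction of positions where two sketches agree then estimates the fraction of shared shingles, which is the theoretical property the slide refers to.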
Slide 11. BSW_n (Basic Sliding Window) Algorithm
The document is represented by chunks. Documents are represented by variable-size signatures (the signature size is proportional to the document size). Example: if n = 100 and A is 1000 bytes, then BSW_100(A) is represented by approximately 10 fingerprints.
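The slides do not spell out the chunking rule, so the sketch below assumes the standard content-defined chunking approach used by sliding-window algorithms: slide a w-byte window over the text and declare a chunk boundary wherever the window's fingerprint is 0 modulo n, which yields roughly one chunk per n bytes and a signature proportional to the document size.

```python
import hashlib


def h64(data):
    """64-bit hash of a byte string (stand-in for a rolling Rabin fingerprint)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bsw_n_signature(text, n=100, w=20):
    """BSW_n sketch: content-defined chunking with a w-byte sliding window."""
    data = text.encode("utf-8")
    signature, start = [], 0
    for i in range(w, len(data) + 1):
        if h64(data[i - w:i]) % n == 0:      # boundary: close the current chunk
            signature.append(h64(data[start:i]))
            start = i
    if start < len(data):                     # trailing chunk
        signature.append(h64(data[start:]))
    return signature
```

A production implementation would use a rolling fingerprint so each window costs O(1) to update; recomputing the hash per window here keeps the sketch short.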
Slide 12. Algorithm's Properties and Parameters
Algorithm parameters: sliding window size and sampling frequency; published papers use very different values. Questions: how sensitive is the similarity metric to different values of the algorithm's parameters, and how do the four algorithms compare?

Slide 13. Objective and Fair Comparison
How can the algorithms be compared objectively? While one document collection might favor a particular algorithm, another collection might show better results for a different algorithm. Can we design a framework for fair comparison? Can the same framework be used for sensitivity analysis of the parameters?

Slide 14. Methodology
A controlled set of modifications over a given document set: add/remove words in the documents a predefined number of times.

Slide 15. Methodology
Research corpus RCorig: 100 different HP Labs technical reports from 2007, converted to a text format. Modifications are introduced in a controlled way: words are added to or removed from each document a predefined number of times, either at random positions or uniformly spread through the document. RCia = {RCorig, where the word "a" is inserted into each document i times}. The comparison uses a new, corpus-averaged similarity metric.
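A sketch of how such a modified corpus and the averaged metric could be produced; the insertion positions here are uniformly spread through the document (one of the two modes the slide mentions), and the shingles() and similarity() helpers from the earlier sketches are reused.

```python
import random


def insert_word(text, word="a", times=50, uniform=True):
    """Insert `word` into `text` the given number of times."""
    words = text.split()
    if uniform:
        step = max(1, len(words) // times)
        positions = [min(j * step, len(words)) for j in range(times)]
    else:
        positions = [random.randint(0, len(words)) for _ in range(times)]
    for pos in sorted(positions, reverse=True):  # insert back-to-front
        words.insert(pos, word)
    return " ".join(words)


def average_similarity(corpus, times=50):
    """Average similarity between each document and its modified copy."""
    scores = [similarity(shingles(doc), shingles(insert_word(doc, times=times)))
              for doc in corpus]
    return sum(scores) / len(scores)
```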
Slide 16. Sensitivity to Sliding Window Size
Window = 20 is a good choice (about 4 words); a significantly larger window decreases the similarity metric.

Slide 17. Frequency Sampling
On RCa50 there is a large variance in similarity metric values across documents under smaller sampling frequencies. The frequency sampling parameter depends on the document length distribution and should be tuned accordingly; it is a trade-off between accuracy and storage requirements.

Slide 18. Comparison of Similarity Algorithms
Sketch_n and BSW_n are more sensitive to the number of changes in the documents (especially short ones) than Mod_n and Min_n.

Slide 19. Case Study Using Enterprise Collections
Two enterprise collections: Collection_1 with 5040 documents and Collection_2 with 2500 documents.

Slide 20. Results
Mod_n and Min_n identified a higher number of similar documents (with Mod_n being the leader); however, Mod_n has a higher number of false positives. For longer documents the difference between the algorithms is smaller. Moreover, for long documents (> 100 KB) BSW_n and related chunking-based algorithms might be a better choice, both accuracy-wise and storage-wise.

Slide 21. Runtime Comparison
Executing Sketch_n is more expensive, especially for larger window sizes.

Slide 22. Conclusion
Syntactic similarity is useful for identifying documents with a large textual intersection. We designed a useful framework for a fair algorithm comparison: we compared the performance of four syntactic similarity algorithms and identified a useful range of their parameters. Future work: modify, refine, and optimize the BSW algorithm, since chunking-based algorithms are actively used for deduplication in backup and storage enterprise solutions.

Slide 23. Sensitivity to Sliding Window Size
Potentially, the Mod_n algorithm might have a higher rate of false positives.
Source: http://900igr.net/kartinka/anglijskij-jazyk/applying-syntactic-similarity-algorithms-for-enterprise-information-management-243308.html
