ComputerScienceExpert

(11)

$18 per page

About ComputerScienceExpert

Levels Taught:
Elementary, Middle School, High School, College, University, PhD

Expertise:
Applied Sciences, Calculus, Chemistry, Computer Science, Environmental Science, Information Systems, Science
Teaching Since: Apr 2017
Last Sign In: 103 weeks, 3 days ago
Questions Answered: 4870
Tutorials Posted: 4863

Education

  • MBA IT, Master in Science and Technology
    DeVry
    Jul-1996 - Jul-2000

Experience

  • Professor
    DeVry University
    Mar-2010 - Oct-2016

Category: Programming | Posted: 05 May 2017 | My Price: 11.00

a standard IR evaluation data set

I'm working on an assignment for an information retrieval class. Please view the attachment.

 

 

1. Suppose that we have a standard IR evaluation data set containing 1000 documents. Assume that a particular query in this data set is deemed to be relevant to the following 25 documents in the collection:
REL = { d1, d5, d6, d10, d88, d150, d200, d210, d250, d300, d400, d405, d450, d472, d500, d501, d530, d545, d590, d600, d635, d700, d720, d800, d900 }
Two different retrieval systems S1 and S2 are used to retrieve ranked lists of documents from this collection using the above query. The top 10 retrieved documents for these two systems are given below (each list is in decreasing order of relevance).
RET(S1) = d2, d5, d150, d250, d11, d33, d50, d600, d500, d520
RET(S2) = d250, d400, d150, d210, d999, d3, d501, d800, d205, d300
a. Compute the Precision and Recall graphs for each system as a function of the number of documents returned (for 1 document returned, 2 documents returned, etc.). First, compile your calculations in a table showing the Precision and Recall of S1 and S2 for these query results as a function of the number of documents returned (from 1 to 10). Then create graphs comparing the Precision and Recall of the two systems.
b. A single metric that combines precision and recall is the F measure (see Section 8.3 of the IR Book). Using the F1 measure (the F measure with β = 1), create a graph similar to the above comparing the two systems.
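For parts (a) and (b), a minimal Python sketch of the bookkeeping, using the REL and RET lists given above (the variable names are illustrative, not part of the assignment):

    # Precision, Recall and F1 at each cutoff k = 1..10 for the two ranked lists.
    REL = {"d1", "d5", "d6", "d10", "d88", "d150", "d200", "d210", "d250", "d300",
           "d400", "d405", "d450", "d472", "d500", "d501", "d530", "d545", "d590",
           "d600", "d635", "d700", "d720", "d800", "d900"}
    RET = {
        "S1": ["d2", "d5", "d150", "d250", "d11", "d33", "d50", "d600", "d500", "d520"],
        "S2": ["d250", "d400", "d150", "d210", "d999", "d3", "d501", "d800", "d205", "d300"],
    }

    for system, ranked in RET.items():
        print(system)
        for k in range(1, len(ranked) + 1):
            hits = len(REL.intersection(ranked[:k]))   # relevant docs in the top k
            precision = hits / k
            recall = hits / len(REL)
            f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall > 0 else 0.0)
            print(f"  k={k:2d}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")

The printed values can then be plotted (e.g., with matplotlib) to produce the comparison graphs.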
c. Which system is better? Explain your answer.

2. Use the query "jaguar" on three search engines: Google, Yahoo, and Bing. Assume that you are interested in the sense of "jaguar" as an animal (so a document is relevant if it is about the animal jaguar). For each of your searches, record the top 10 results (do not count the "sponsored" results, and count only the main results, not sub-categories of results, which are often other documents from the same site as the main result). Make a precision plot for the first 10 returned links as a function of the number of returned documents, comparing the precision of the three search engines for this query (use a similar approach as in part (b) of problem 1, but only for precision). Which of the search engines did better? Which would have done better if you were interested in "jaguar" as a car? Why do you think there might be a difference among these search engines?
1. Inverted Index Construction:
Consider the following three short documents:
Doc #1
Glimpse is an indexing and query system that allows for search through a
file system or document collection quickly. Glimpse is the default search
engine in a larger information retrieval system. It has also been used as
part of some web based search engines.
Doc #2
The main processes in an retrieval system are document indexing, query
processing, query evaluation and relevance feedback. Among these, efficient
updating of the index is critical in large scale systems.
Doc #3
Clusters are created from short snippets of documents retrieved by web search engines which are as good as clusters created from the full text of web documents.

a. First remove stop words and punctuation, and apply Porter's stemming algorithm to the 3 documents (Note: use the online stemming application for this purpose).
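If a local stemmer is more convenient than the online tool, a hedged alternative is NLTK's implementation of the Porter algorithm (NLTK is an assumption here, not something the assignment requires):

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    for word in ("indexing", "query", "engines", "retrieved", "clusters", "documents"):
        print(word, "->", stemmer.stem(word))  # e.g. indexing -> index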
b. Create an inverted index of the three documents, including the dictionary
and the postings. The dictionary should also contain (for each term) statistics such as total number of occurrences in the collection and the document frequency. The postings for each term should contain the document ids and the term frequencies (depict multiple postings for a term as a linked list, similar to Figure 1.3 in the IR Book).
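As a rough illustration of parts (b) and (c), a Python sketch that builds a small in-memory index over the three documents; stemming is skipped here (so "indexing" and "index" remain distinct terms), the stop-word list is an assumption, and a plain dict stands in for the linked postings list of Figure 1.3:

    import re
    from collections import Counter

    docs = {
        1: "Glimpse is an indexing and query system that allows for search through a "
           "file system or document collection quickly. Glimpse is the default search "
           "engine in a larger information retrieval system. It has also been used as "
           "part of some web based search engines.",
        2: "The main processes in an retrieval system are document indexing, query "
           "processing, query evaluation and relevance feedback. Among these, efficient "
           "updating of the index is critical in large scale systems.",
        3: "Clusters are created from short snippets of documents retrieved by web "
           "search engines which are as good as clusters created from the full text "
           "of web documents.",
    }

    # A small stop-word list (an assumption; any standard list will do)
    STOP = {"is", "an", "a", "and", "that", "for", "the", "in", "it", "has", "been",
            "as", "of", "some", "are", "by", "which", "also", "these", "or", "through"}

    index = {}  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP]
        for term, tf in Counter(tokens).items():
            index.setdefault(term, {})[doc_id] = tf

    # Dictionary statistics: document frequency (df) and collection frequency (cf)
    for term, postings in sorted(index.items()):
        print(f"{term:12s} df={len(postings)} cf={sum(postings.values())} "
              f"postings={sorted(postings.items())}")

    # A part (c)-style Boolean AND: intersect two posting lists
    def query_and(t1, t2):
        return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))

    print("index AND query ->", query_and("index", "query"))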
c. What are the search results for the following Boolean queries? (In each case, explain how you obtained the result from the inverted index.)
i. index AND query
ii. index OR query
iii. index AND (NOT query)
iv. (search AND query) OR (search AND retrieve)
v. (index OR cluster) AND (web OR system)

2. Finite State Automata and Lexical Analysis (Note: each FSA should accept exactly the terms matching the specified regular expression or pattern, and no other terms).
a. Construct a Finite State Automaton recognizing the set of all binary strings with an odd number of 1's. Examples include: 1, 010, 111, 10110, 0010100111, etc.
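Only two states are needed (the parity of the number of 1's seen so far); a quick Python simulation to sanity-check the examples above (the state names are illustrative):

    # Two-state DFA: start in EVEN, accept in ODD; a '1' flips the state, a '0' keeps it.
    def accepts_odd_ones(s: str) -> bool:
        state = "EVEN"
        for ch in s:
            if ch == "1":
                state = "ODD" if state == "EVEN" else "EVEN"
        return state == "ODD"

    for example in ("1", "010", "111", "10110", "0010100111", "0", "11"):
        print(example, accepts_odd_ones(example))  # the last two should be rejected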
b. Suppose that during document indexing we are interested in identifying all dates in the format mm/dd/yyyy, where mm is a 2-digit string representing the month (ranging from 01 through 12), dd is a 2-digit string representing the day (01 through 31), and yyyy is a 4-digit string representing the year. Furthermore, we are only interested in the years between 1990 and 2019. Example: 02/30/2012. Construct a Finite State Automaton that exactly recognizes date tokens as described above. Notes: You do not need to match the month with the correct number of days. You may assume that dig is a special token representing all digits (0-9). Also, specify a regular expression that corresponds to the finite automaton for this language.
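One possible regular expression for this language, as a hedged sketch (the corresponding FSA can be read directly off the alternations; per the note above, no month/day consistency is enforced):

    import re

    # months 01-12, days 01-31, years 1990-2019
    DATE = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(199\d|200\d|201\d)$")

    for token in ("02/30/2012", "12/31/2019", "13/01/2000", "02/28/1989"):
        print(token, bool(DATE.match(token)))  # the last two should be rejected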
3. Ngrams: Using bigrams, which of the following terms is more likely to be considered "equivalent" to the term "informational"? Justify your answer by using Dice's coefficient to measure the similarity of each term to "informational".
• informal
• formalization
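A hedged sketch of the bigram/Dice computation (character bigrams treated as a set, no padding; Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)):

    def bigrams(term: str) -> set:
        # Character bigrams of a term, e.g. 'info' -> {'in', 'nf', 'fo'}
        return {term[i:i + 2] for i in range(len(term) - 1)}

    def dice(a: str, b: str) -> float:
        x, y = bigrams(a), bigrams(b)
        return 2 * len(x & y) / (len(x) + len(y))

    target = "informational"
    for candidate in ("informal", "formalization"):
        print(candidate, round(dice(target, candidate), 3))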
4. The following table shows the raw occurrence frequencies of some words in a hypothetical collection. If x and y represent two distinct words, f(x) and f(y) denote the raw frequencies of words x and y, respectively. Assume that the frequency of co-occurrence of x and y has already been computed and is denoted by f(x,y). [Note that the probability of occurrence is estimated as the raw frequency divided by the total number of terms.]

x        y         f(x)   f(y)   f(x,y)
---------------------------------------
color    blue       20     40     10
color    green      20     10      5
blue     green      40     10      5
united   states     25     20     15
united   airline    25     10      5
states   airline    20     10      2

Assuming that the total number of unique words, N, in the collection is 1000, rank these pairs of words in decreasing order of expected mutual information. Show your detailed calculations for at least one case and the final values for all cases.
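A hedged Python sketch of the ranking, assuming the four-event (term present/absent) expected mutual information from Chapter 13 of the IR Book; if the course uses only the single co-occurrence term P(x,y) log2[P(x,y) / (P(x)P(y))], keep just the first cell of the sum:

    from math import log2

    N = 1000  # total number of unique words in the collection (given)

    # (x, y, f(x), f(y), f(x,y)) copied from the table above
    pairs = [
        ("color",  "blue",    20, 40, 10),
        ("color",  "green",   20, 10,  5),
        ("blue",   "green",   40, 10,  5),
        ("united", "states",  25, 20, 15),
        ("united", "airline", 25, 10,  5),
        ("states", "airline", 20, 10,  2),
    ]

    def expected_mi(fx, fy, fxy, n):
        # Counts and marginals for the four presence/absence combinations of x and y
        cells = [(fxy, fx, fy),                        # x present, y present
                 (fx - fxy, fx, n - fy),               # x present, y absent
                 (fy - fxy, n - fx, fy),               # x absent,  y present
                 (n - fx - fy + fxy, n - fx, n - fy)]  # x absent,  y absent
        return sum((nij / n) * log2(nij * n / (px * py))
                   for nij, px, py in cells if nij > 0)

    for x, y, fx, fy, fxy in sorted(pairs, key=lambda p: expected_mi(*p[2:], N),
                                    reverse=True):
        print(f"{x}-{y}: I = {expected_mi(fx, fy, fxy, N):.4f}")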
11. WordNet is an advanced lexical ontology and thesaurus for the English language, with many applications in information retrieval and Web information agents. Write a one-page document about WordNet containing the following elements: (1) your own description of WordNet; (2) the unique features of WordNet that distinguish it from other types of standard thesauri; and (3) some of the specific ways WordNet can be (or has been) used in the context of information retrieval.

 

Attachments:

Answers

(11)
Status: NEW | Posted: 05 May 2017 01:05 AM | My Price: 11.00
