Approximately how many hours did you spend on this homework?
1. (5/5 points) Locate the article "Reimagining Search" by Alex Wright (CACM, June 2016).
a. What type of study was conducted by researchers at Microsoft?
b. How many people participated in the Google Daily Information Needs study?
2. (15/15 points) Compute the hit list for ((paris AND NOT france) OR lear). Show each step.
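As a hint on what "show each step" could look like, here is a minimal sketch of evaluating a query of this shape by merging sorted postings lists. The postings lists below are hypothetical (they are not the ones for this assignment), and the helper names and_not and union are illustrative only.

```python
def and_not(p1, p2):
    """Return doc IDs in sorted list p1 that do not occur in sorted list p2."""
    i = j = 0
    out = []
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            out.append(p1[i])   # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1              # excluded by the NOT operand
            j += 1
        else:
            j += 1              # advance p2 until it catches up
    return out


def union(p1, p2):
    """Return the sorted union of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])
    out.extend(p2[j:])
    return out


# Hypothetical postings lists, for illustration only.
postings = {"paris": [1, 3, 5, 8], "france": [3, 4, 8], "lear": [2, 5, 9]}

step1 = and_not(postings["paris"], postings["france"])  # paris AND NOT france
hits = union(step1, postings["lear"])                   # (...) OR lear
print(step1, hits)
```

Each intermediate list (here step1, then hits) corresponds to one step of the evaluation.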
3. (15/0 points) List alphabetically all case-normalized tokens and terms in the text below:
In late 1991 The Dallas Morning News became the lone major newspaper in the Dallas market when the Dallas Times-Herald was closed. The closure was after several years of circulation wars between the two papers especially over the then-burgeoning classified advertising market.
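The distinction between tokens (every occurrence) and terms (distinct entries) can be sketched as follows. The sample sentence is made up for illustration, and the tokenization policy (keep hyphenated words whole, split on everything else, lowercase) is an assumption; other policies are defensible as long as you state them.

```python
import re

# Assumed policy: a token is a run of letters/digits, optionally joined by
# internal hyphens; everything else is treated as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Return all case-normalized tokens (duplicates kept), sorted alphabetically."""
    return sorted(tok.lower() for tok in TOKEN_RE.findall(text))


# Made-up sample sentence, not the assignment text.
sample = "The News and the Times-Herald: the two papers."

tokens = tokenize(sample)
terms = sorted(set(tokens))   # terms are the distinct tokens

print(tokens)
print(terms)
```

On the sample sentence, "the" appears three times among the tokens but only once among the terms.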
4. (20/20 points) Use an implementation of the Porter Stemmer from http://tartarus.org/martin/PorterStemmer/. Run your selected implementation on the text below:
The New York Times (NYT) is an American daily newspaper founded and continuously published in New York City since 1851. It has won 108 Pulitzer Prizes more than any other news organization. Its website nytimes.com is Americas most popular newspaper site receiving more than 30 million unique visitors per month.
Organ transplantation is the moving of an organ from one body to another or from a donor site to another location on the patients own body for the purpose of replacing the recipients damaged or absent organ. The emerging field of regenerative medicine is allowing scientists and engineers to create organs to be re-grown from the patients own cells (stem cells or cells extracted from the failing organs). Organs and/or tissues that are transplanted within the same persons body are called autografts. Transplants that are recently performed between two subjects of the same species are called allografts. Allografts can either be from a living or cadaveric source.
a. Take a look at the implementation. Indicate which rule(s) are used (spell out the lines of code that implement the rule(s) and what they do) to transform
website into websit, engineers into engine, and continuously into continu.
b. Find two words from the above text that are stemmed into the same sequence of characters even though (theoretically) they should not.
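Many Porter rules are conditioned on the measure m of the stem, the number of vowel-consonant sequences in the form [C](VC)^m[V]. The sketch below is not taken from any implementation on the site above; it only illustrates how m can be computed and how one branch of a rule (drop a final "e" when m > 1) depends on it. The function names are illustrative, and the other branch of that rule (m = 1 with a further condition on the stem ending) is deliberately omitted.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_final_e(word):
    """Partial sketch: drop a final 'e' only when the remaining stem has m > 1."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        return word[:-1]
    return word


print(measure("websit"), strip_final_e("website"), strip_final_e("tree"))
```

When you read your chosen implementation, look for the equivalent of measure and note which suffix rule calls it with which threshold.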
5. (15/15 points) Consider the phrase "pizza with pepperoni".
a. One technique to implement this is to use the two biword phrases "pizza with" and "with pepperoni". Consider just the first page of results from Bing and from Yahoo. How many results (not ads) are on each page? How many are relevant? How many results are common to both search engines?
b. Another technique is to use the exact phrase "pizza with pepperoni". Consider just the first page of results from Bing and from Yahoo. How many results (not ads) are on each page? How many are relevant? How many results are common to both search engines?
c. In your opinion, which search engine produced the better results? Why?
6. (10/10 points) Web search engines A and B each crawl a random subset of the same size of the Web. Some of the pages crawled are duplicates that appear under different URLs. Assume that duplicates are distributed uniformly amongst the pages crawled by A and B. Assume a duplicate is a page that has exactly two copies. A indexes pages without duplicate elimination, whereas B indexes only one copy of each duplicate page. If 45% of A's indexed URLs are present in B's index and 50% of B's indexed URLs are present in A's index, what fraction of the Web consists of pages that do not have a duplicate?
7. (10/10 points) Why is it better to partition hosts (rather than individual URLs) between the nodes of a distributed crawl system?
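For reference, host-based partitioning is commonly done by hashing the host name rather than the full URL. The sketch below assumes a fixed number of crawler nodes and uses SHA-1 only as a stable hash; the function name node_for_url and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a crawler node using only its host name."""
    host = urlsplit(url).hostname or ""          # hostname is lowercased by urlsplit
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Two pages on the same (hypothetical) host always land on the same node.
a = node_for_url("http://example.com/index.html", 4)
b = node_for_url("http://example.com/about?lang=en", 4)
print(a, b)
```

Because every URL of a host maps to one node, that node alone holds the state that is naturally per host, which is a useful starting point for your answer.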
8. (10/10 points) Consider the token "lke". Since this does not match a word in the dictionary, it must be misspelled. What do you consider to be the correct word? Explain your reasoning!
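One way to ground the reasoning is to rank dictionary words by edit distance to the token. The sketch below uses plain Levenshtein distance (insertions, deletions, and substitutions, each at cost 1; no transpositions); the candidate list is a small hypothetical dictionary, not a real one.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical candidate words, for illustration only.
candidates = ["like", "lake", "luke", "look"]
ranked = sorted(candidates, key=lambda w: (levenshtein("lke", w), w))
print([(w, levenshtein("lke", w)) for w in ranked])
```

Several candidates can tie at the same distance, so edit distance alone does not settle the question; your explanation should say what breaks the tie (for example word frequency, keyboard layout, or context).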
9. (0/15 points) Telephone numbers can be expressed in various formats, such as +1 (800) 123-4567, (800) 123-4567, and 123-4567.
Write and implement a regular expression that detects telephone numbers (if you find one online, cite the source). Does your implementation handle formats other than the examples provided above? Provide the source code and the results of your testing.
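As a starting point only, the sketch below shows one possible pattern for North American style numbers. The choice of separators (space, dot, hyphen), the optional "+1" prefix, and the digit-boundary checks are all assumptions; it is not a complete or authoritative phone-number grammar, and your own submission should document its design and test cases.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<!\d)                          # do not start in the middle of a digit run
    (?:\+1[\s.-]?)?                  # optional country code +1
    (?:\(\d{3}\)\s?|\d{3}[\s.-])?    # optional area code: (800) or 800-
    \d{3}[\s.-]?\d{4}                # local number: 123-4567
    (?!\d)                           # do not stop in the middle of a digit run
    """,
    re.VERBOSE,
)


def find_phone_numbers(text):
    """Return every substring of text that matches PHONE_RE."""
    return PHONE_RE.findall(text)


sample = "Call +1 (800) 123-4567 or (800) 123-4567 or 123-4567."
print(find_phone_numbers(sample))
```

Under these assumptions the pattern also accepts variants such as 800-123-4567 and 800.123.4567, while rejecting an unbroken run of eleven digits.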