
How feasible is it for a non-distributed web crawler running on consumer hardware to search the internet?

Problem Detail: 

I am looking for an automated way to answer the question: what are the URLs on the world wide web that contain at least two strings from a set of strings.

So if I have a set of strings {"A", "B", "C"}, I want to know which pages on the world wide web contain "A" and "B", "A" and "C", "B" and "C", or "A", "B", and "C".
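For the "at least two strings" condition itself, a tiny Python check would do; the page text and strings below are placeholders, purely for illustration:

```python
# Hypothetical example: does this page's text contain at least two of the strings?
terms = {"A", "B", "C"}           # the set of proper names to look for
page_text = "... A ... C ..."     # placeholder for a downloaded page's text

hits = {t for t in terms if t in page_text}
matches = len(hits) >= 2          # True when at least two distinct strings appear
```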

Obviously, for this simple example: Google it!

But I want a scalable, automated, and free solution. Google does not permit automated queries, and Yahoo makes you pay.

One idea I have is to (1) start with a URL, (2) check the text at that URL for the search strings, (3) parse out the links from the text, (4) record that you have checked the page and whether it contains the strings, and then (5) search the links from the initial URL. Repeat until you have searched the whole link graph.
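Here is a minimal sketch of that loop in Python, using only the standard library; the seed URL, search strings, page limit, and error handling are placeholder assumptions added for illustration, not part of the original question:

```python
# Rough sketch of the crawl loop described above (steps 1-5).
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser
from collections import deque

TERMS = {"A", "B", "C"}            # placeholder search strings (proper names)
SEED = "http://example.com/"       # placeholder starting URL

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, terms, max_pages=100):
    seen = set()                   # step 4: record pages already checked
    queue = deque([seed])          # frontier of URLs still to visit
    matches = []
    while queue and len(seen) < max_pages:
        url = queue.popleft()      # step 1: take the next URL
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip pages that fail to download
        # step 2: does the page contain at least two of the strings?
        if sum(1 for t in terms if t in html) >= 2:
            matches.append(url)
        # step 3: parse out the links, step 5: enqueue them for later visits
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))
    return matches

if __name__ == "__main__":
    print(crawl(SEED, TERMS))
```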

How feasible is this in terms of time and space on a single commodity machine -- given the size of the internet? The internet is really, really big -- but only comparatively few pages will contain these strings (they are proper names).

I don't want to index the whole web as if my laptop were Google!

Most of the crawler's time will be spent confirming that the pages don't contain the strings.

I'm trying to get a rough ballpark to understand if this is even remotely feasible.

Asked By : bernie2436

Answered By : Aaron

This is not remotely feasible. The number of pages indexed by Google and Bing is in the tens of billions. To get anything close to what they are doing you are going to need to process terabytes of data. It will take you days to download all of these webpages using your home internet connection at full speed, and you will probably hit a bandwidth cap from your ISP before you finish.

Even if there were no bandwidth cap from your ISP, you would have to be careful in how you request the pages. If you request too many from one server in too short a time, you'll probably get your IP blocked by that server. So you'll need some kind of program to optimize the order in which you visit these hundreds of millions of URLs.

You will never be able to finish visiting all of the URLs in your queue: new content will be added to the internet faster than you can keep up with it.
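As a rough back-of-envelope (all the numbers here are assumptions picked for illustration, not measurements from the answer): even a small slice of the web, say 100 million pages at roughly 100 KB of HTML each, comes to about 10 TB, which already takes over a week to pull down at a sustained 100 Mbit/s:

```python
# Back-of-envelope estimate of download time alone (all numbers are assumptions
# chosen for illustration; real page sizes and link speeds vary widely).
pages      = 100_000_000          # 100 million URLs -- a small fraction of the web
page_size  = 100 * 1024           # ~100 KB of HTML per page
link_speed = 100_000_000 / 8      # 100 Mbit/s home connection, in bytes per second

total_bytes = pages * page_size                   # ~10 TB of raw HTML
seconds     = total_bytes / link_speed
print(f"{total_bytes / 1e12:.1f} TB, {seconds / 86400:.1f} days at full speed")
# -> roughly 10 TB and about 9-10 days, before any politeness delays, retries,
#    parsing, or ISP bandwidth caps are taken into account.
```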


Question Source : http://cs.stackexchange.com/questions/19898
