Anime∙Dohyou

    WebCrawlers - Bots / Spiders

    Kyouri Kai
    Founder


    Post by Kyouri Kai on Wed 20 Jan 2010, 2:57 pm

    Basic info:
    Wikipedia wrote: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.

    This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

    A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
    http://en.wikipedia.org/wiki/Web_crawler
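The seed list and crawl frontier described in the Wikipedia excerpt above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real search engine crawler is written: the names `crawl`, `get_links`, `FAKE_WEB`, and the example.com URLs are all made up for the example, and the "web" here is an in-memory dictionary so nothing is actually downloaded. The only policy shown is breadth-first order with a page limit.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds.

    get_links is any function that returns the outgoing links of a URL.
    Returns the URLs in the order they were visited.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # every URL ever queued, so nothing is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web": each URL maps to the links found on that page.
# Note the cycle between "/" and "/a".
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}


def fake_get_links(url):
    return FAKE_WEB.get(url, [])


order = crawl(["http://example.com/"], fake_get_links)
print(order)
```

The `seen` set is what stops the crawler from going around the "/" and "/a" cycle forever. Dynamically generated pages can still produce an endless supply of new URLs, which is why the sketch also carries a `max_pages` limit.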

    ahfx.net wrote: Googlebot, Yahoo Slurp, MSNbot, and similar spiders, bots, and crawlers are the programs that harvest information for search engines.

    For anyone tracking statistics on their website, Googlebot, MSNbot, and Yahoo Slurp can be welcome guests. These three search engine bots gather (harvest) information about your pages for their respective search engines. Seeing these spiders more often is also desirable, because it means that you are being indexed more often and are more likely to show up quickly in the SERPs (search engine results pages).

    A spider is nothing more than a computer program that follows certain links on the web and gathers information as it goes. For example, Googlebot will follow HREF or SRC attributes to find pages and images that are associated with any given site. Because these crawlers are merely computer programs, they aren't always the smartest of creatures and may get caught in endless loops built by dynamically created webpages.
    Robots.txt

    While having Googlebot index your site more quickly is almost always a good thing, there are times when you don't want certain pages or images indexed. Most "reputable" spiders will obey the directives given in the robots.txt file. This file is a plain-text document that tells spiders which parts of the site they may and may not crawl. You can also explicitly instruct a robot not to follow any of the links on a page with the following meta tag: <meta name="googlebot" content="nofollow">.
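To make the robots.txt idea concrete, here is a small sketch using Python's standard `urllib.robotparser` module, which is roughly what a well-behaved bot would consult before fetching a page. The robots.txt content, the bot name "OtherBot", and the example.com URLs are invented for illustration. The rules are parsed from a string so the example runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot is kept out of /private/,
# and Googlebot has its own group that keeps it out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("OtherBot", "http://example.com/private/notes.html"))
print(allowed("OtherBot", "http://example.com/images/logo.png"))
print(allowed("Googlebot", "http://example.com/images/logo.png"))
```

Keep in mind that this check is voluntary on the bot's side. Nothing in robots.txt prevents a crawler from fetching a disallowed page if it chooses to ignore the file.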

    Because of how these bots work and the importance they place on text links, many people have begun placing keyword-filled text links to their websites in their signatures on blogs and other comment sections. To reduce the impact of these links, you can instruct spiders not to follow one specific link by placing the following in the anchor tag: rel="nofollow". This will reduce the number of outgoing links that spiders count and help you to maintain your PageRank.
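From the crawler's side, honoring rel="nofollow" might look something like the sketch below. It uses Python's standard `html.parser` module. The class name `LinkCollector`, the sample HTML, and the URLs are made up for the example, and real crawlers will differ in how they treat such links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect anchor links, separating those marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.follow = []  # links a polite spider would follow
        self.skip = []    # links marked nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel may hold several space-separated values, e.g. "external nofollow"
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.skip.append(href)
        else:
            self.follow.append(href)


# Made-up page fragment: a comment with a spammy signature link and a normal link.
PAGE = """
<p>Nice post! <a href="http://example.com/spam" rel="nofollow">cheap widgets</a></p>
<p>See also <a href="/about">about us</a>.</p>
"""

collector = LinkCollector()
collector.feed(PAGE)
print("follow:", collector.follow)
print("skip:", collector.skip)
```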
    Bad SPAM bots

    Now, just as in life, not all bots are good. There are "bad" bots that don't care about your robots.txt and are only out there to harvest your email address. To fight these "bad" SPAM bots, some people use JavaScript to "hide" their email addresses. However, anything that can be written to avoid a bad bot can be broken by an even worse bot. One company is fighting bots by giving them just what they want: email addresses, and lots of them. However, they are all email addresses of known spammers. I found the site to be quite clever.

    Hopefully this will clear up some confusion as to what a bot, crawler, or spider is and how it goes about collecting information. If you have any questions, post them below and we will try to answer as quickly as possible. If you need help with SEO (search engine optimization), we would love to show you ways to increase how often Googlebot, Yahoo Slurp, and MSNbot index your site.
    http://www.ahfx.net/weblog/39


    Last edited by Kyouri Kai on Thu 20 May 2010, 10:45 pm; edited 1 time in total (Reason for editing : pinned)
    KageSenko
    go'dan


    Re: WebCrawlers - Bots / Spiders

    Post by KageSenko on Fri 22 Jan 2010, 1:40 pm

    great info kyo