
Robots.txt: Difference between revisions


The standard was proposed by [[Martijn Koster]],<ref>{{cite web |url=http://www.greenhills.co.uk/historical.html |title=Historical |website=Greenhills.co.uk |access-date=2017-03-03 |archive-url=https://web.archive.org/web/20170403152037/http://www.greenhills.co.uk/historical.html |archive-date=2017-04-03 |url-status=live }}</ref><ref>{{cite web |title=Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web |first=Roy |last=Fielding |work=First International Conference on the World Wide Web |year=1994 |place=Geneva |url=http://www94.web.cern.ch/WWW94/PapersWWW94/fielding.ps |access-date=September 25, 2013 |format=PostScript |archive-url=https://web.archive.org/web/20130927093658/http://www94.web.cern.ch/WWW94/PapersWWW94/fielding.ps |archive-date=2013-09-27 |url-status=live }}</ref> when working for [[Nexor]]<ref name=":0">{{cite web |url=http://www.robotstxt.org/orig.html#status |title=The Web Robots Pages |publisher=Robotstxt.org |date=1994-06-30 |access-date=2013-12-29 |archive-url=https://web.archive.org/web/20140112090633/http://www.robotstxt.org/orig.html#status |archive-date=2014-01-12 |url-status=live }}</ref> in February 1994<ref>{{cite web|title=Important: Spiders, Robots and Web Wanderers |first=Martijn |last=Koster |work=www-talk mailing list |date=25 February 1994 |url=http://inkdroid.org/tmp/www-talk/4113.html |format=] archived message |url-status=dead |archive-url=https://web.archive.org/web/20131029200350/http://inkdroid.org/tmp/www-talk/4113.html |archive-date=October 29, 2013 }}</ref> on the ''www-talk'' mailing list, the main communication channel for WWW-related activities at the time. [[Charles Stross]] claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a [[denial-of-service attack]] on Koster's server.<ref>{{cite web |url=http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html |title=How I got here in the end, part five: "things can only get better!" |work=Charlie's Diary |date=19 June 2006 |access-date=19 April 2014 |archive-url=https://web.archive.org/web/20131125220913/http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html |archive-date=2013-11-25 |url-status=live }}</ref>
===Maximum size of a robots.txt file===
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512000 bytes) of robots.txt files,{{Ref RFC|9309|section=2.5: Limits}} which Google maintains as a 500 kibibyte file size restriction for robots.txt files.<ref>{{Cite web |title=How Google Interprets the robots.txt Specification {{!}} Documentation |url=https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |access-date=2022-10-17 |website=Google Developers |language=en |archive-date=2022-10-17 |archive-url=https://web.archive.org/web/20221017101925/https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |url-status=live }}</ref>
The standard, initially RobotsNotWanted.txt, allowed [[web developer]]s to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; [[Server (computing)|server]] overload was a primary concern. By June 1994 it had become a [[de facto standard]];<ref name="Verge">{{cite web|url=https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders|title=The text file that runs the internet|work=[[The Verge]]|last=Pierce|first=David|date=14 February 2024|accessdate=16 March 2024}}</ref> most complied, including those operated by search engines such as [[WebCrawler]], [[Lycos]], and [[AltaVista]].<ref name="sear_Robo">{{cite web |title=Robots.txt Celebrates 20 Years Of Blocking Search Engines |author=Barry Schwartz |work=Search Engine Land |date=30 June 2014 |access-date=2015-11-19 |url=http://searchengineland.com/robots-txt-celebrates-20-years-blocking-search-engines-195479 |archive-url=https://web.archive.org/web/20150907000430/http://searchengineland.com/robots-txt-celebrates-20-years-blocking-search-engines-195479 |archive-date=2015-09-07 |url-status=live }}</ref>
The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.
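For instance, a site that wanted BingBot to wait between requests could publish the following (the 10-second delay is an illustrative value):
<syntaxhighlight lang="text">
User-agent: bingbot
Allow: /
Crawl-delay: 10
</syntaxhighlight>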
Example demonstrating multiple user-agents:
<syntaxhighlight lang="text">
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
</syntaxhighlight>
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Malicious web robots are unlikely to honor robots.txt; some may even use robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of [[security through obscurity]] is discouraged by standards bodies. The [[National Institute of Standards and Technology]] (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" directory.
'''robots.txt''' is the [[filename]] used for implementing the '''Robots Exclusion Protocol''', a standard used by [[website]]s to indicate to visiting [[web crawler]]s and other [[Internet bot|web robots]] which portions of the website they are allowed to visit.
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
Archive Team founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."
In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as [[Google]], support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating how comments can be used:
<syntaxhighlight lang="text">
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
</syntaxhighlight>
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using [[.htaccess]] and httpd.conf files.
Many robots also pass a special [[user agent]] to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the [[web robot]]; it cannot enforce any of what is stated in the file.
This example tells two specific robots not to enter one specific directory:
<syntaxhighlight lang="text">
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
</syntaxhighlight>
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for [[generative artificial intelligence]]. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and ''The New York Times''.
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the [[Internet Engineering Task Force]]. A proposed standard was published in September 2022 as RFC 9309.
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but ''The Verge''{{'}}s David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
When a site owner wishes to give instructions to web robots they place a text file called <code>robots.txt</code> in the root of the web site hierarchy (e.g. <code>https://www.example.com/robots.txt</code>). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file covers one [[Domain name|origin]]. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If ''example.com'' had a robots.txt file but ''a.example.com'' did not, the rules that would apply for ''example.com'' would not apply to ''a.example.com''. In addition, each protocol and port needs its own robots.txt file; ''http://example.com/robots.txt'' does not apply to pages under ''http://example.com:8080/'' or ''https://example.com/''.
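Compliant crawlers automate this check before fetching pages. A minimal sketch using Python's standard-library <code>urllib.robotparser</code> (the bot name, rules, and URLs here are illustrative, not from the standard itself):

```python
from urllib import robotparser

# parse() accepts the robots.txt file's lines directly,
# so no network access is needed for this illustration.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler consults can_fetch() before requesting a URL.
print(rp.can_fetch("ExampleBot", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("ExampleBot", "https://www.example.com/index.html"))  # True
```

In normal use, <code>set_url()</code> and <code>read()</code> would fetch the live file from the site's root instead of parsing hard-coded lines.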
This example tells one specific bot not to enter a website:
<syntaxhighlight lang="text">
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
</syntaxhighlight>
The standard, developed in 1994, relies on [[voluntary compliance]]. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s, many websites began denying bots that collect information for [[generative artificial intelligence]].
This example tells all robots that they can visit all files because the wildcard <code>*</code> stands for all robots and the <code>Disallow</code> directive has no value, meaning no pages are disallowed.
<syntaxhighlight lang="text">
User-agent: *
Disallow:
</syntaxhighlight>
In 2017, the [[Internet Archive]] announced that it would stop complying with robots.txt directives. According to ''Digital Trends'', this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed. Some web archiving projects ignore robots.txt; Archive Team uses the file to discover more links, such as sitemaps.
The same result can be accomplished with an empty or missing robots.txt file.
Some websites, such as Google, host a <code>humans.txt</code> file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page. Previously, Google had a joke file hosted at <code>/killer-robots.txt</code> instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
This example tells all robots to stay away from one specific file:
<syntaxhighlight lang="text">
User-agent: *
Disallow: /directory/file.html
</syntaxhighlight>
All other files in the specified directory will be processed.
This example tells all robots not to enter three directories:
<syntaxhighlight lang="text">
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
</syntaxhighlight>
Koster, M.; Illyes, G.; Zeller, H.; Sassman, L. (September 2022). ''Robots Exclusion Protocol''. IETF. doi:10.17487/RFC9309. RFC 9309. Proposed Standard.
This example tells all robots to stay out of a website:
<syntaxhighlight lang="text">
User-agent: *
Disallow: /
</syntaxhighlight>
The "robots.txt" file can be used in conjunction with [[Sitemaps|sitemaps]], another robot inclusion standard for websites.
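Some crawlers support a <code>Sitemap</code> directive, allowing multiple sitemaps to be listed in the same robots.txt, in the form:
<syntaxhighlight lang="text">
Sitemap: http://www.example.com/sitemap.xml
</syntaxhighlight>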
Scarfone, K. A.; Jansen, W.; Tracy, M. (July 2008). ''Guide to General Server Security''. National Institute of Standards and Technology. doi:10.6028/NIST.SP.800-123.
1057: 1055: 1052: 1050: 1047: 1046: 1041: 1039: 1034: 1031: 1030: 1027: 1025: 1023: 1020: 1018: 1015: 1014: 1011: 1009: 1007: 1004: 1002: 999: 998: 995: 993: 991: 988: 986: 983: 982: 977: 975: 970: 967: 966: 963: 961: 959: 956: 954: 951: 950: 947: 945: 943: 940: 938: 935: 934: 931: 929: 927: 924: 922: 919: 918: 913: 911: 909: 907: 902: 900: 898: 895: 894: 889: 887: 885: 883: 878: 876: 874: 871: 870: 867: 865: 863: 860: 858: 855: 854: 851: 849: 847: 844: 842: 839: 838: 833: 831: 829: 827: 822: 820: 818: 815: 814: 809: 807: 802: 799: 798: 795: 793: 791: 788: 786: 783: 782: 779: 777: 774: 771: 770: 766: 764: 762: 758: 756: 753: 752: 750: 747: 744: 743: 740: 738: 736: 733: 731: 728: 727: 724: 722: 719: 716: 715: 712: 708: 707: 704: 702: 700: 697: 695: 692: 691: 688: 686: 684: 681: 679: 676: 675: 673: 670: 667: 666: 663: 661: 659: 656: 654: 651: 650: 647: 645: 642: 639: 638: 635: 631: 630: 627: 625: 623: 620: 618: 615: 614: 611: 609: 607: 604: 602: 599: 598: 596: 593: 590: 589: 586: 584: 582: 579: 577: 574: 573: 570: 568: 565: 562: 561: 558: 554: 553: 549: 547: 545: 541: 539: 536: 535: 532: 530: 527: 524: 523: 521: 518: 516: 513: 512: 506: 503: 501: 499: 495: 493: 490: 489: 486: 484: 482: 479: 477: 474: 473: 470: 468: 465: 462: 461: 458: 454: 453: 450: 448: 445: 442: 441: 438: 436: 434: 431: 429: 426: 425: 422: 420: 417: 416: 413: 411: 409: 406: 404: 401: 400: 397: 395: 393: 390: 388: 385: 384: 381: 377: 376: 373: 371: 369: 366: 364: 361: 360: 357: 355: 353: 350: 348: 345: 344: 339: 337: 335: 333: 328: 326: 324: 321: 320: 315: 313: 311: 309: 304: 302: 300: 297: 296: 291: 289: 287: 285: 280: 278: 276: 273: 272: 267: 265: 263: 261: 256: 254: 252: 249: 248: 243: 241: 239: 237: 232: 230: 228: 225: 224: 221: 217: 216: 198: 189: 188: 173: 152: 131: 130: 115: 84: 78: 76: 58: 50: 41: 38: 37: 35: 23: 22: 14: 9: 6: 4: 3: 2: 3540: 3529: 3526: 3524: 3521: 3519: 3516: 3514: 3511: 3510: 3508: 3497: 3492: 3491: 3473: 3469: 3465: 3464: 3459: 3454: 3453: 3434: 3430: 3426: 3420: 3413: 
3408: 3405: 3400: 3395: 3391: 3387: 3386: 3378: 3363: 3359: 3353: 3351: 3334: 3330: 3326: 3320: 3306:on 2009-03-05 3305: 3301: 3295: 3279: 3275: 3271: 3265: 3249: 3245: 3239: 3225:on 2018-01-10 3224: 3220: 3214: 3199: 3195: 3191: 3184: 3168: 3164: 3163: 3158: 3152: 3136: 3132: 3126: 3111: 3107: 3101: 3086: 3082: 3076: 3061: 3057: 3051: 3036: 3032: 3030:9780470857472 3026: 3022: 3021: 3013: 2994: 2990: 2986: 2982: 2975: 2968: 2952: 2948: 2944: 2938: 2923: 2919: 2913: 2897: 2893: 2892: 2887: 2880: 2865: 2861: 2857: 2851: 2835: 2831: 2827: 2821: 2805: 2801: 2797: 2791: 2775: 2771: 2767: 2761: 2759: 2742: 2738: 2732: 2716: 2712: 2708: 2702: 2700: 2683: 2679: 2675: 2669: 2653: 2649: 2645: 2639: 2623: 2619: 2615: 2614:"Baiduspider" 2609: 2593: 2589: 2585: 2579: 2563: 2559: 2558:About.ask.com 2555: 2549: 2534: 2530: 2524: 2518: 2513: 2510: 2505: 2500: 2496: 2492: 2491: 2483: 2468: 2464: 2460: 2454: 2439: 2435: 2431: 2424: 2409: 2408: 2403: 2396: 2394: 2392: 2390: 2388: 2371: 2367: 2363: 2357: 2347: 2342: 2338: 2334: 2327: 2312: 2308: 2302: 2291:September 25, 2286: 2282: 2275: 2268: 2253: 2249: 2245: 2239: 2235: 2225: 2222: 2220: 2219:Web archiving 2217: 2215: 2212: 2210: 2207: 2205: 2202: 2200: 2197: 2195: 2192: 2189: 2186: 2183: 2180: 2177: 2176:Meta elements 2174: 2172: 2169: 2167: 2164: 2162: 2159: 2156: 2153: 2150: 2147: 2143: 2139: 2135: 2131: 2130: 2125: 2119: 2114: 2107: 2099: 2049: 2047: 2043: 2038: 2028: 2022: 2009: 2006: 1996: 1979: 1977: 1973: 1953: 1950: 1948: 1940: 1934: 1928: 1922: 1916: 1910: 1907: 1898: 1882: 1880: 1876: 1872: 1863: 1861: 1857: 1849: 1844: 1842: 1838: 1828: 1826: 1822: 1818: 1808: 1802: 1801: 1795: 1793: 1789: 1788: 1783: 1779: 1775: 1771: 1770:generative AI 1761: 1759: 1758: 1753: 1749: 1746:. 
Co-founder 1745: 1741: 1731: 1729: 1719: 1711: 1696:a.example.com 1688:a.example.com 1681: 1676: 1672: 1669: 1667: 1649: 1647: 1642: 1640: 1636: 1632: 1628: 1626: 1621: 1617: 1612: 1610: 1606: 1602: 1598: 1594: 1584: 1582: 1577: 1575: 1571: 1567: 1563: 1558: 1556: 1552: 1548: 1544: 1540: 1536: 1527: 1521: 1515: 1511: 1504: 1501: 1500: 1498: 1494: 1490: 1486: 1482: 1478: 1470: 1465: 1460: 1454: 1450: 1439: 1433: 1431: 1428: 1426: 1425: 1419: 1417: 1412: 1410: 1409: 1398: 1395: 1393: 1390: 1389: 1383: 1380: 1376: 1374: 1371: 1369: 1368: 1362: 1360: 1355: 1353: 1352: 1346: 1343: 1337: 1335: 1330: 1328: 1327: 1321: 1319: 1314: 1312: 1311: 1303: 1297: 1295: 1290: 1288: 1287: 1283: 1281: 1278: 1276: 1275: 1269: 1264: 1258: 1253: 1249: 1244: 1238: 1236: 1231: 1229: 1228: 1222: 1220: 1215: 1213: 1212: 1208: 1206: 1203: 1201: 1200: 1189: 1186: 1184: 1181: 1180: 1173: 1170: 1168: 1165: 1164: 1155: 1152: 1150: 1147: 1144: 1138: 1136: 1131: 1129: 1128: 1122: 1120: 1115: 1113: 1112: 1106: 1104: 1099: 1097: 1096: 1090: 1088: 1083: 1081: 1080: 1073: 1070: 1068: 1065: 1064: 1058: 1056: 1051: 1049: 1048: 1043: 1040: 1038: 1035: 1032: 1026: 1019: 1016: 1010: 1008: 1003: 1001: 1000: 994: 992: 987: 985: 984: 979: 976: 974: 971: 968: 962: 960: 955: 953: 952: 946: 944: 939: 937: 936: 930: 928: 923: 921: 920: 910: 899: 896: 886: 875: 872: 866: 864: 859: 857: 856: 850: 848: 843: 841: 840: 830: 819: 816: 811: 808: 806: 803: 800: 794: 792: 787: 785: 784: 780: 778: 775: 773: 772: 765: 763: 757: 755: 754: 748: 745: 739: 737: 732: 730: 729: 725: 723: 720: 718: 717: 709: 703: 701: 696: 694: 693: 687: 685: 680: 678: 677: 671: 668: 662: 660: 655: 653: 652: 648: 646: 643: 641: 640: 632: 626: 624: 619: 617: 616: 610: 608: 603: 601: 600: 594: 591: 585: 583: 578: 576: 575: 571: 569: 566: 564: 563: 555: 548: 546: 540: 538: 537: 533: 531: 528: 526: 525: 517: 514: 502: 494: 491: 485: 483: 478: 476: 475: 471: 469: 466: 464: 463: 455: 451: 449: 446: 444: 443: 437: 435: 430: 428: 427: 423: 418: 412: 410: 
405: 403: 402: 396: 394: 389: 387: 386: 378: 372: 370: 365: 363: 362: 356: 354: 349: 347: 346: 336: 325: 322: 312: 301: 298: 288: 277: 274: 264: 253: 250: 240: 229: 226: 218: 214: 196: 192: 185: 181: 176: 167: 162: 158: 150: 127: 123: 118: 109: 105: 87: 72: 65: 55:Content added 47: 44: 36: 34: 33: 20: 3523:Web scraping 3476:. Retrieved 3461: 3437:. Retrieved 3428: 3419: 3411: 3384: 3377: 3366:. Retrieved 3339:February 15, 3337:. Retrieved 3328: 3319: 3308:. Retrieved 3304:the original 3294: 3282:. Retrieved 3273: 3264: 3252:. Retrieved 3238: 3227:. Retrieved 3223:the original 3213: 3202:. Retrieved 3193: 3183: 3171:. Retrieved 3160: 3151: 3139:. Retrieved 3125: 3114:. Retrieved 3100: 3089:. Retrieved 3075: 3064:. Retrieved 3050: 3039:. Retrieved 3019: 3012: 3000:. Retrieved 2980: 2967: 2955:. Retrieved 2947:The Register 2946: 2937: 2926:. Retrieved 2912: 2900:. Retrieved 2889: 2879: 2868:. Retrieved 2859: 2850: 2838:. Retrieved 2820: 2808:. Retrieved 2799: 2790: 2778:. Retrieved 2769: 2745:. Retrieved 2731: 2719:. Retrieved 2710: 2686:. Retrieved 2677: 2668: 2656:. Retrieved 2647: 2638: 2626:. Retrieved 2617: 2608: 2596:. Retrieved 2592:the original 2587: 2578: 2566:. Retrieved 2557: 2548: 2537:. Retrieved 2523: 2516: 2489: 2482: 2471:. Retrieved 2462: 2453: 2442:. Retrieved 2433: 2423: 2411:. Retrieved 2405: 2374:. Retrieved 2365: 2356: 2341:the original 2336: 2326: 2315:. Retrieved 2301: 2289:. Retrieved 2280: 2277:(PostScript) 2267: 2256:. 
Retrieved 2247: 2244:"Historical" 2238: 2142:security.txt 2105: 2096: 2034: 2020: 2018: 2004: 2001:in the form 1997:in the same 1988: 1968: 1951: 1944: 1938: 1932: 1926: 1920: 1914: 1908: 1905: 1888: 1869:instructing 1864: 1859: 1845: 1834: 1831:Alternatives 1814: 1798: 1796: 1785: 1767: 1755: 1740:Archive Team 1737: 1725: 1717: 1677: 1673: 1670: 1655: 1643: 1624: 1613: 1600: 1590: 1578: 1559: 1551:web crawlers 1542: 1534: 1533: 1453: 1012:* ] (NDIIPP) 1005:* ] (NDIIPP) 796:==See also== 789:==See also== 2840:18 February 2826:Jason Scott 2780:16 February 2747:16 February 2721:16 February 2658:16 February 2628:16 February 2598:16 February 2568:16 February 2224:Web crawler 2214:Spider trap 2027:statement. 1879:Sergey Brin 1748:Jason Scott 1692:example.com 1684:example.com 1403:|Internet}} 487:==History== 480:==History== 213:Next edit → 174:DocWatson42 46:Next edit → 3528:Text files 3507:Categories 3439:2022-10-17 3368:2013-08-17 3310:2009-03-23 3284:22 October 3254:9 February 3229:2018-05-25 3204:2019-10-03 3173:October 3, 3141:October 3, 3116:2013-12-29 3091:2013-12-29 3066:2013-12-29 3041:2015-08-12 3002:August 12, 2957:August 12, 2928:2015-08-10 2870:2018-12-01 2810:10 October 2539:2013-12-29 2473:2019-07-10 2444:2015-11-19 2317:2013-12-29 2283:. Geneva. 
2258:2017-03-03 2231:References 2046:httpd.conf 1999:robots.txt 1956:directory 1875:Larry Page 1852:humans.txt 1837:user-agent 1714:Compliance 1658:robots.txt 1631:WebCrawler 1555:web robots 1553:and other 1535:robots.txt 1457:robots.txt 996:* ] (NDLP) 989:* ] (NDLP) 2618:Baidu.com 2407:The Verge 2346:Hypermail 2042:.htaccess 2025:Disallow: 2003:Sitemap: 1976:Googlebot 1850:, host a 1817:web robot 1800:The Verge 1639:AltaVista 1518:robotstxt 1308:Line 234: 1305:Line 233: 1194:col end}} 1160:col end}} 714:Line 195: 711:Line 197: 637:Line 188: 634:Line 189: 560:Line 183: 557:Line 183: 268:Lowercase 257:lowercase 3518:Websites 3472:Archived 3433:Archived 3362:Archived 3333:Archived 3278:Archived 3248:Archived 3198:Archived 3167:Archived 3135:Archived 3110:Archived 3085:Archived 3060:Archived 3035:Archived 2993:Archived 2951:Archived 2922:Archived 2896:Archived 2864:Archived 2834:Archived 2804:Archived 2774:Archived 2741:Archived 2715:Archived 2688:25 April 2682:Archived 2652:Archived 2622:Archived 2562:Archived 2533:Archived 2467:Archived 2438:Archived 2413:16 March 2376:19 April 2370:Archived 2311:Archived 2285:Archived 2252:Archived 2209:Sitemaps 2204:Perma.cc 2194:nofollow 2190:(NDIIPP) 2110:See also 2005:full-url 1995:Sitemaps 1895:Disallow 1885:Examples 1811:Security 1744:sitemaps 1652:Standard 1627:standard 1625:de facto 1601:www-talk 1581:sitemaps 1547:websites 1539:filename 1526:RFC 9309 1399:{{Portal 1392:⚫ 1183:⚫ 1167:⚫ 1149:⚫ 1067:⚫ 1037:⚫ 973:⚫ 805:⚫ 460:Line 46: 457:Line 45: 383:Line 38: 380:Line 38: 184:contribs 126:contribs 70:Wikitext 2199:noindex 2155:BotSeer 2134:ads.txt 2073:content 2048:files. 1991:Sitemap 1985:Sitemap 1728:engines 1666:website 1587:History 1537:is the 1513:Website 1496:Authors 1474:folder. 510:ref> 292:Selfref 281:selfref 223:Line 1: 220:Line 1: 200:212,198 3478:6 July 3162:GitHub 3027:  2184:(NDLP) 1947:Google 1862:page. 
1856:GitHub 1848:Google 1792:Medium 1778:Google 1774:OpenAI 1680:origin 1637:, and 1620:server 1570:server 1480:Status 1339:--> 1332:--> 912:* ] – 901:* ] – 888:* ] – 877:* ] – 318:-pc1}} 307:-pc1}} 81:Inline 63:Visual 2996:(PDF) 2977:(PDF) 2902:8 May 2082:/> 1860:About 1804:' 1635:Lycos 1597:Nexor 244:Short 233:short 202:edits 135:edits 3480:2024 3407:9309 3390:IETF 3341:2020 3286:2018 3256:2016 3175:2019 3143:2019 3025:ISBN 3004:2015 2959:2015 2904:2017 2842:2017 2812:2022 2782:2013 2749:2013 2723:2013 2690:2017 2660:2013 2630:2013 2600:2013 2570:2013 2512:9309 2495:IETF 2415:2024 2378:2014 2293:2013 2064:name 2061:meta 2058:< 2044:and 2019:The 1877:and 1784:and 1520:.org 180:talk 166:undo 161:edit 122:talk 108:edit 3468:NPR 3404:RFC 3394:doi 2985:doi 2509:RFC 2499:doi 1782:BBC 1706:or 1401:bar 1192:div 1174:* ] 1158:Div 1140:* ] 1133:* ] 1124:* ] 1117:* ] 1108:* ] 1101:* ] 1092:* ] 1085:* ] 1060:* ] 1053:* ] 1044:* ] 1028:* ] 1021:* ] 964:* ] 957:* ] 948:* ] 941:* ] 932:* ] 925:* ] 914:Now 903:now 834:Div 823:div 3509:: 3470:. 3466:. 3460:. 3431:. 3427:. 3402:. 3392:. 3388:. 3360:. 3349:^ 3331:. 3327:. 3276:. 3272:. 3196:. 3192:. 3165:. 3159:. 3133:. 3033:. 2991:. 2983:. 2979:. 2949:. 2945:. 2920:. 2894:. 2888:. 2858:. 2828:. 2798:. 2772:. 2768:. 2757:^ 2739:. 2713:. 2709:. 2698:^ 2680:. 2676:. 2650:. 2646:. 2620:. 2616:. 2586:. 2560:. 2556:. 2507:. 2497:. 2493:. 2465:. 2461:. 2436:. 2432:. 2404:. 2386:^ 2364:. 2335:. 2279:. 2250:. 2246:. 2008:: 1881:. 1710:. 1641:. 1633:, 1576:. 1523:, 1190:{{ 1156:{{ 832:{{ 821:{{ 414:}} 407:}} 340:Pp 338:{{ 329:pp 327:{{ 316:Pp 314:{{ 305:pp 303:{{ 290:{{ 279:{{ 270:}} 266:{{ 259:}} 255:{{ 242:{{ 231:{{ 193:, 182:| 124:| 3482:. 3442:. 3409:. 3396:: 3371:. 3343:. 3313:. 3288:. 3258:. 3232:. 3207:. 3177:. 3145:. 3119:. 3094:. 3069:. 3044:. 3006:. 2987:: 2961:. 2931:. 2906:. 2873:. 2844:. 2814:. 2784:. 2751:. 2725:. 2692:. 2662:. 2632:. 2602:. 2572:. 2542:. 2514:. 2501:: 2476:. 2447:. 2417:. 2380:. 2344:( 2320:. 2295:. 2261:. 
2076:= 2067:= 1891:* 1451:. 890:A 879:a 186:) 178:( 141:m 133:5 128:) 120:(

Revision as of 01:27, 6 July 2024 by CitationsRuleTheNation (talk | contribs)
Revision as of 06:29, 6 July 2024 by DocWatson42 (talk | contribs; extended confirmed user, pending changes reviewer)
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.