305:|url=https://www.wired.com/2000/07/ebay-fights-spiders-on-the-web/ |access-date=2024-08-02 |work=] |language=en-US |issn=1059-1028}}</ref> where eBay attempted to block a bot, and the company operating the crawler was ordered to stop crawling eBay's servers using any automatic means, by ] the basis of ].<ref name="case">{{cite court|litigants=eBay v. Bidder's Edge|vol=100|reporter=F. Supp. 2d|opinion=1058|pinpoint=|court=]|date=2000|quote=|url=http://www.cand.uscourts.gov/cand/tentrule.nsf/3979517dd11390ce8825690a007c1b9e/d0fc1406324de0cd882568e90081ebf4/$ FILE/Ebay.pdf|archive-url=https://web.archive.org/web/20000817173849/http://www.cand.uscourts.gov/cand/tentrule.nsf/3979517dd11390ce8825690a007c1b9e/d0fc1406324de0cd882568e90081ebf4/$ FILE/Ebay.pdf|url-status=dead|accessdate=2000-08-17}}</ref><ref>{{Cite web |last=Hoffmann |first=Jay |date=2020-09-15 |title=Chapter 4: Search |url=https://thehistoryoftheweb.com/book/search/ |access-date=2024-08-02 |website=The History of the Web |language=en-US}}</ref><ref name=":1" />-->
''[[404 Media]]'' reported that companies like [[Anthropic]] and [[Perplexity.ai]] circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular [[blocklist]]s.<ref>{{Cite web |last=Koebler |first=Jason |date=2024-07-29 |title=Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones) |url=https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/ |access-date=2024-07-29 |website=404 Media}}</ref>
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for [[generative artificial intelligence]]. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked [[OpenAI]]'s GPTBot in their robots.txt file and 85 blocked [[Google]]'s Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the [[BBC]] and ''[[The New York Times]]''. In 2023, blog host [[Medium (website)|Medium]] announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".<ref name="Verge"/>

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but ''[[The Verge]]''{{'}}s David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.<ref name="Verge"/>
The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the [[Googlebot]]'s subsequent visits.
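These differing interpretations can be exercised with Python's standard-library <code>urllib.robotparser</code>, which parses the nonstandard Crawl-delay field; the robots.txt body below is a hypothetical example, not any particular site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, parsed locally instead of fetched over HTTP.
rules = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# crawl_delay() returns the value for a matching group, or None when no group matches.
print(parser.crawl_delay("bingbot"))   # 10
print(parser.crawl_delay("otherbot"))  # None
```

How the returned delay is applied (seconds between requests, or a once-per-window budget) is still up to the crawler, as described above.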
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of [[security through obscurity]] is discouraged by standards bodies. The [[National Institute of Standards and Technology]] (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" directory.
The robots.txt protocol is widely complied with by bot operators.<ref name="Verge"/> <!--It entered the court as part of '']'',<ref name=":1">{{Cite news |last= |first= |date=2000-07-31 |title=EBay Fights Spiders on the Web
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500 kibibyte file size restriction for robots.txt files.
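A crawler that applies this limit can simply truncate an oversized file before parsing; a minimal sketch, in which the constant and function name are illustrative:

```python
# RFC 9309 requires parsers to handle at least 500 kibibytes (512,000 bytes);
# bytes beyond that limit may be ignored, so a crawler can safely truncate there.
MAX_ROBOTS_BYTES = 500 * 1024  # 512,000

def clamp_robots_body(body: bytes) -> bytes:
    return body[:MAX_ROBOTS_BYTES]

print(len(clamp_robots_body(b"a" * 600_000)))  # 512000
```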
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as [[Google]], support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.<ref name="Verge"/>
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
* <code>[[security.txt]]</code>, a file to describe the process for security researchers to follow in order to report security vulnerabilities
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using [[.htaccess]] and httpd.conf files.
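Both mechanisms can be checked from crawler-side code; the sketch below uses only the Python standard library, and the helper names are illustrative rather than taken from any real crawler:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the content values of <meta name="robots" ...> tags.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

def noindexed(html: str, headers: dict) -> bool:
    # "noindex" may arrive in the X-Robots-Tag header (works for any file type)
    # or in a robots meta tag (HTML only); honor either.
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

print(noindexed('<meta name="robots" content="noindex" />', {}))  # True
print(noindexed("<p>hello</p>", {"X-Robots-Tag": "noindex"}))     # True
print(noindexed("<p>hello</p>", {}))                              # False
```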
[[Charles Stross]] claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a [[denial-of-service attack]] on Koster's server.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the [[Internet Engineering Task Force]]. A proposed standard was published in September 2022 as RFC 9309.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
When a site owner wishes to give instructions to web robots they place a text file called <code>robots.txt</code> in the root of the web site hierarchy (e.g. <code>https://www.example.com/robots.txt</code>). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
This example tells all robots that they can visit all files because the wildcard <code>*</code> stands for all robots and the <code>Disallow</code> directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

User-agent: *
Allow: /
The same result can be accomplished with an empty or missing robots.txt file.
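The equivalence of these allow-all forms can be verified with Python's standard-library <code>urllib.robotparser</code>; the helper below parses a robots.txt body locally, and its name is illustrative:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    # Parse a robots.txt body and test whether `agent` may fetch `url`.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# An empty Disallow (or an empty file) blocks nothing ...
print(allowed("User-agent: *\nDisallow:", "SomeBot", "http://example.com/page"))    # True
print(allowed("", "SomeBot", "http://example.com/page"))                            # True
# ... whereas "Disallow: /" excludes the entire site.
print(allowed("User-agent: *\nDisallow: /", "SomeBot", "http://example.com/page"))  # False
```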
This example tells two specific robots not to enter one specific directory:
Some sites, such as Google, host a <code>humans.txt</code> file that displays information meant for humans to read. Some sites, such as [[GitHub]], redirect their humans.txt to an About page.
* <code>[[ads.txt]]</code>, a standard for listing authorized ad sellers
This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html
All other files in the specified directory will be processed.
This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
This example tells all robots to stay out of a website:

User-agent: *
Disallow: /
The "robots.txt" file can be used in conjunction with [[sitemaps]], another robot inclusion standard for websites.
The standard, initially ''RobotsNotWanted.txt'', allowed [[web developer]]s to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a [[de facto standard]]; most complied, including those operated by search engines such as [[WebCrawler]], [[Lycos]], and [[AltaVista]].
* [[BotSeer]] – Now inactive search engine for robots.txt files
Example demonstrating how comments can be used:
Some web archiving projects ignore robots.txt. [[Archive Team]] uses the file to discover more links, such as [[sitemap]]s. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the [[Internet Archive]] announced that it would stop complying with robots.txt directives. According to ''[[Digital Trends]]'', this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
User-agent: bingbot
Allow: /
Crawl-delay: 10
Previously, Google had a joke file hosted at <code>/killer-robots.txt</code> instructing [[the Terminator]] not to kill the company founders [[Larry Page]] and [[Sergey Brin]].
Example demonstrating multiple user-agents:
* [[Automated Content Access Protocol]] – A failed proposal to extend robots.txt
Sitemap: http://www.example.com/sitemap.xml
The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with [[security through obscurity]]. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s, many websites began denying bots that collect information for [[generative artificial intelligence]].
The Robot Exclusion Standard does not mention the "*" character in the <code>Disallow:</code> statement.
For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.
Many robots also pass a special [[user agent]] to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
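The per-origin rule means the governing robots.txt URL can be derived from any page URL by keeping only the scheme, host, and port; a short sketch in which the function name is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt is per origin: the scheme, host, and port together
    # determine which /robots.txt file governs a page.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com:8080/a/b?q=1"))  # http://example.com:8080/robots.txt
print(robots_url("https://a.example.com/page"))       # https://a.example.com/robots.txt
```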
The standard was proposed by [[Martijn Koster]], when working for [[Nexor]], in February 1994 on the ''www-talk'' mailing list, the main communication channel for WWW-related activities at the time.
'''robots.txt''' is the filename used for implementing the '''Robots Exclusion Protocol''', a standard used by websites to indicate to visiting [[web crawler]]s and other [[web robot]]s which portions of the website they are allowed to visit.
Some crawlers support a <code>Sitemap</code> directive, allowing multiple [[Sitemaps]] in the same robots.txt in the form <code>Sitemap: ''full-url''</code>:

Sitemap: http://www.example.com/sitemap.xml
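Because the Sitemap field sits outside the user-agent groups, it can be collected with a simple line scan; a sketch in which the function name is illustrative:

```python
def sitemaps(robots_txt: str) -> list:
    # The Sitemap field is independent of user-agent groups, so scan every line.
    urls = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

body = "User-agent: *\nDisallow: /private/\nSitemap: http://www.example.com/sitemap.xml"
print(sitemaps(body))  # ['http://www.example.com/sitemap.xml']
```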
X-Robots-Tag: noindex
===Search engines===
2653:Web scraping
2606:. Retrieved
2591:
2567:. Retrieved
2558:
2549:
2541:
2514:
2507:
2496:. Retrieved
2469:February 15,
2467:. Retrieved
2458:
2449:
2438:. Retrieved
2434:the original
2424:
2412:. Retrieved
2403:
2394:
2382:. Retrieved
2368:
2357:. Retrieved
2353:the original
2343:
2332:. Retrieved
2323:
2313:
2301:. Retrieved
2290:
2281:
2269:. Retrieved
2255:
2244:. Retrieved
2230:
2219:. Retrieved
2205:
2194:. Retrieved
2180:
2169:. Retrieved
2149:
2142:
2130:. Retrieved
2110:
2097:
2085:. Retrieved
2077:The Register
2076:
2067:
2056:. Retrieved
2042:
2031:. Retrieved
2027:
2017:
2005:. Retrieved
1994:
1984:
1973:. Retrieved
1964:
1955:
1943:. Retrieved
1925:
1913:. Retrieved
1904:
1895:
1883:. Retrieved
1874:
1850:. Retrieved
1836:
1824:. Retrieved
1815:
1791:. Retrieved
1782:
1773:
1761:. Retrieved
1752:
1743:
1731:. Retrieved
1722:
1713:
1701:. Retrieved
1697:the original
1692:
1683:
1671:. Retrieved
1662:
1653:
1642:. Retrieved
1628:
1621:
1594:
1587:
1576:. Retrieved
1567:
1558:
1547:. Retrieved
1538:
1528:
1516:. Retrieved
1510:
1479:. Retrieved
1470:
1461:
1446:the original
1441:
1431:
1420:. Retrieved
1406:
1394:. Retrieved
1385:
1382:(PostScript)
1372:
1361:. Retrieved
1352:
1349:"Historical"
1343:
1242:security.txt
1205:
1196:
1134:
1120:
1118:
1104:
1101:in the form
1097:in the same
1088:
1068:
1051:
1044:
1038:
1032:
1026:
1020:
1014:
1008:
1005:
988:
969:instructing
964:
959:
945:
934:
931:Alternatives
914:
890:
880:
878:
867:
849:
837:
822:Archive Team
819:
807:
799:
759:
755:
752:
737:
725:
706:
695:
682:
672:
660:
641:
633:web crawlers
624:
616:
615:
535:
432:==Security==
425:==Security==
1945:18 February
1931:Jason Scott
1885:16 February
1852:16 February
1826:16 February
1763:16 February
1733:16 February
1703:16 February
1673:16 February
1329:Web crawler
1319:Spider trap
1127:statement.
979:Sergey Brin
830:Jason Scott
774:example.com
766:example.com
206:Rollbackers
113:DocWatson42
2658:Text files
2637:Categories
2569:2022-10-17
2498:2013-08-17
2440:2009-03-23
2414:22 October
2384:9 February
2359:2018-05-25
2334:2019-10-03
2303:October 3,
2271:October 3,
2246:2013-12-29
2221:2013-12-29
2196:2013-12-29
2171:2015-08-12
2132:August 12,
2087:August 12,
2058:2015-08-10
2033:2024-07-29
1975:2018-12-01
1915:10 October
1644:2013-12-29
1578:2019-07-10
1549:2015-11-19
1422:2013-12-29
1388:. Geneva.
1363:2017-03-03
1336:References
1146:httpd.conf
1099:robots.txt
1056:directory
975:Larry Page
952:humans.txt
937:user-agent
905:blocklists
796:Compliance
740:robots.txt
713:WebCrawler
637:web robots
635:and other
617:robots.txt
539:robots.txt
2028:404 Media
1723:Baidu.com
1512:The Verge
1451:Hypermail
1142:.htaccess
1125:Disallow:
1103:Sitemap:
1076:Googlebot
950:, host a
917:web robot
897:Anthropic
892:404 Media
882:The Verge
721:AltaVista
600:robotstxt
441:Line 202:
438:Line 202:
232:(RW 16.1)
221:Seo168168
2648:Websites
2602:Archived
2563:Archived
2492:Archived
2463:Archived
2408:Archived
2378:Archived
2328:Archived
2297:Archived
2265:Archived
2240:Archived
2215:Archived
2190:Archived
2165:Archived
2123:Archived
2081:Archived
2052:Archived
2001:Archived
1969:Archived
1939:Archived
1909:Archived
1879:Archived
1846:Archived
1820:Archived
1793:25 April
1787:Archived
1757:Archived
1727:Archived
1667:Archived
1638:Archived
1572:Archived
1543:Archived
1518:16 March
1481:19 April
1475:Archived
1416:Archived
1390:Archived
1357:Archived
1314:Sitemaps
1309:Perma.cc
1299:nofollow
1295:(NDIIPP)
1210:See also
1105:full-url
1095:Sitemaps
995:Disallow
985:Examples
911:Security
826:sitemaps
734:Standard
709:standard
707:de facto
683:www-talk
663:sitemaps
629:websites
621:filename
608:RFC 9309
353:Line 73:
350:Line 73:
266:Line 62:
263:Line 62:
195:contribs
123:contribs
67:Wikitext
1304:noindex
1260:BotSeer
1234:ads.txt
1173:content
1148:files.
1091:Sitemap
1085:Sitemap
810:engines
748:website
669:History
619:is the
595:Website
578:Authors
556:folder.
139:212,208
2608:6 July
2292:GitHub
2157:
1289:(NDLP)
1047:Google
962:page.
956:GitHub
948:Google
874:Medium
860:Google
856:OpenAI
762:origin
719:, and
702:server
652:server
562:Status
78:Inline
60:Visual
2126:(PDF)
2107:(PDF)
2007:8 May
1182:/>
960:About
886:'
717:Lycos
679:Nexor
213:edits
211:3,805
141:edits
2610:2024
2537:9309
2520:IETF
2471:2020
2416:2018
2386:2016
2305:2019
2273:2019
2155:ISBN
2134:2015
2089:2015
2009:2017
1947:2017
1917:2022
1887:2013
1854:2013
1828:2013
1795:2017
1765:2013
1735:2013
1705:2013
1675:2013
1617:9309
1600:IETF
1520:2024
1483:2014
1398:2013
1164:name
1161:meta
1158:<
1144:and
1119:The
977:and
899:and
866:and
602:.org
404:>
249:Undo
239:Tags
225:talk
191:talk
177:undo
172:edit
119:talk
105:edit
2598:NPR
2534:RFC
2524:doi
2115:doi
1614:RFC
1604:doi
864:BBC
788:or
483:* ]
2639::
2600:.
2596:.
2590:.
2561:.
2557:.
2532:.
2522:.
2518:.
2490:.
2479:^
2461:.
2457:.
2406:.
2402:.
2326:.
2322:.
2295:.
2289:.
2263:.
2163:.
2121:.
2113:.
2109:.
2079:.
2075:.
2050:.
2026:.
1999:.
1993:.
1963:.
1933:.
1903:.
1877:.
1873:.
1862:^
1844:.
1818:.
1814:.
1803:^
1785:.
1781:.
1755:.
1751:.
1725:.
1721:.
1691:.
1665:.
1661:.
1612:.
1602:.
1598:.
1570:.
1566:.
1541:.
1537:.
1509:.
1491:^
1469:.
1440:.
1384:.
1355:.
1351:.
1108::
981:.
907:.
792:.
723:.
715:,
658:.
605:,
244:RW
241::
204:,
193:|
152::
132:,
121:|
2612:.
2572:.
2539:.
2526::
2501:.
2473:.
2443:.
2418:.
2388:.
2362:.
2337:.
2307:.
2275:.
2249:.
2224:.
2199:.
2174:.
2136:.
2117::
2091:.
2061:.
2036:.
2011:.
1978:.
1949:.
1919:.
1889:.
1856:.
1830:.
1797:.
1767:.
1737:.
1707:.
1677:.
1647:.
1619:.
1606::
1581:.
1552:.
1522:.
1485:.
1449:(
1425:.
1400:.
1366:.
1176:=
1167:=
991:*
533:.
223:(
197:)
189:(
125:)
117:(