robots.txt

For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.

Robots Exclusion Protocol
Status: Proposed Standard (RFC 9309)
First published: 1994; formally standardized in 2022
Authors: Martijn Koster (original author); Gary Illyes, Henner Zeller, Lizzi Sassman (IETF contributors)
Website: robotstxt.org

[Image caption: Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds and are not allowed to crawl the "secret" folder.]

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload; in the 2020s, many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
History

The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; however, server overload was a primary concern. By June 1994 it had become a de facto standard; most bots complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard was published in September 2022 as RFC 9309.
Standard

When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, out of the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
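For illustration, here is a minimal sketch of how a compliant crawler might consult robots.txt before fetching pages, using Python's standard-library urllib.robotparser; the URL and the "ExampleBot" user-agent are placeholders, not part of the standard:

from urllib import robotparser

# Fetch and parse the robots.txt of the target origin (placeholder URL).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A polite crawler checks permission before requesting each page.
# If robots.txt does not exist, RobotFileParser treats every URL as
# allowed, mirroring the behaviour described above.
url = "https://www.example.com/private/page.html"
if rp.can_fetch("ExampleBot", url):
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt:", url)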
Compliance

A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.

Search engines

Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.

Archival sites

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.

Artificial intelligence

Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers because "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
Security

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against the practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Alternatives

Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots' user-agent strings; a sketch of this approach follows below.
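A minimal sketch of such server-side filtering, written here as a Python WSGI application; the "BadBot" name is a hypothetical robot, and a production setup would more likely do this in the web server's own configuration:

from wsgiref.simple_server import make_server

BLOCKED_AGENTS = ("BadBot",)  # hypothetical user-agent substrings to refuse

def app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "")
    if any(bot in agent for bot in BLOCKED_AGENTS):
        # Automatically return failure when a blocked robot is detected.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular content for everyone else."]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()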
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.

Examples

This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

The same effect can be achieved with an Allow rule:

User-agent: *
Allow: /

The same result can also be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells one specific bot not to enter the website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
Nonstandard extensions

Crawl-delay directive

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10
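Python's urllib.robotparser exposes this nonstandard directive via crawl_delay(), which returns None when the directive is absent; a small sketch, with placeholder URLs and an assumed fallback delay:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Crawl-delay value for this agent, or None if the directive is absent;
# how the value is interpreted varies by crawler, as noted above.
delay = rp.crawl_delay("bingbot") or 1.0  # assumed polite default

for url in ("https://www.example.com/a", "https://www.example.com/b"):
    if rp.can_fetch("bingbot", url):
        print("fetching", url)  # a real crawler would issue the request here
        time.sleep(delay)       # wait between subsequent visits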
Sitemap

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml

Universal "*" match

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.

Meta tags and headers

In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex
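As a sketch of the header variant, here is a small Python HTTP server that attaches X-Robots-Tag to every response, one way to cover non-HTML files that cannot carry a meta tag (the port and handler names are illustrative):

from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoIndexHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Ask compliant crawlers not to index anything this server returns.
        self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), NoIndexHandler).serve_forever()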
The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus, if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.

Maximum size of a robots.txt file

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500-kibibyte file size restriction for robots.txt files.
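A sketch of how a crawler might apply this limit when parsing, assuming it simply ignores any bytes beyond the required 512,000; the parse_robots helper is hypothetical, not part of any library:

from urllib import robotparser
from urllib.request import urlopen

MAX_ROBOTS_BYTES = 512_000  # 500 kibibytes, the minimum crawlers must parse

def parse_robots(url):
    # Read at most MAX_ROBOTS_BYTES; anything past the cap is discarded.
    with urlopen(url) as resp:
        body = resp.read(MAX_ROBOTS_BYTES)
    rp = robotparser.RobotFileParser(url)
    rp.parse(body.decode("utf-8", errors="replace").splitlines())
    return rp

rp = parse_robots("https://www.example.com/robots.txt")
print(rp.can_fetch("ExampleBot", "https://www.example.com/page"))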
See also

ads.txt – a standard for listing authorized ad sellers
security.txt – a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now-inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Information Infrastructure and Preservation Program (NDIIPP)
National Digital Library Program (NDLP)
Nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler
External links

Official website: robotstxt.org