The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

The protocol is advisory; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged: the National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
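A crawler that chooses to honor such requests checks each URL against the parsed rules before fetching it. A minimal sketch using Python's standard-library `urllib.robotparser` (the rules shown are illustrative, not from any real site):

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# A compliant crawler consults the rules before each request.
print(parser.can_fetch("ExampleBot", "https://www.example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "https://www.example.com/index.html"))         # True
```

Note that this check is entirely on the crawler's side: nothing stops a client that skips it.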
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, that other user-agents cannot crawl more than one page every 20 seconds, and that they are not allowed to crawl the "secret" directory.
The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.
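For a crawler that chooses to honor the directive, the value can be read with Python's standard-library parser (the file content here is a made-up example):

```python
from urllib import robotparser

# Hypothetical robots.txt using the nonstandard Crawl-delay extension.
RULES = """\
User-agent: *
Crawl-delay: 10
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# Under the Yandex reading, wait this many seconds between visits.
print(parser.crawl_delay("ExampleBot"))  # 10
```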
This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website.
and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.
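Because the header travels with the HTTP response rather than inside an HTML document, a crawler can honor it for any file type. A sketch of how a crawler might interpret the header (the helper function and header values below are our own illustration):

```python
def may_index(headers: dict) -> bool:
    """Return False when an X-Robots-Tag response header forbids indexing.

    `headers` maps HTTP response header names to values; matching is
    case-insensitive, and comma-separated directives are supported.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = {d.strip().lower() for d in value.split(",")}
            if "noindex" in directives or "none" in directives:
                return False
    return True

# The header can accompany a PDF or an image, where a robots meta tag cannot:
print(may_index({"Content-Type": "application/pdf",
                 "X-Robots-Tag": "noindex, nofollow"}))  # False
print(may_index({"Content-Type": "image/png"}))          # True
```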
According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of a robots.txt file; Google enforces a 500 kibibyte maximum file size for robots.txt files.
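A crawler honoring that limit can simply stop reading at the 500 kibibyte mark before parsing; this sketch (the helper is our own) truncates the fetched body first:

```python
from urllib import robotparser

MAX_ROBOTS_BYTES = 500 * 1024  # 512,000 bytes, the protocol's parsing minimum

def parse_robots(raw: bytes) -> robotparser.RobotFileParser:
    """Parse at most MAX_ROBOTS_BYTES of a fetched robots.txt body."""
    text = raw[:MAX_ROBOTS_BYTES].decode("utf-8", errors="replace")
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Rules within the first 500 KiB are honored; anything beyond is ignored.
body = b"User-agent: *\nDisallow: /private/\n" + b"# padding\n" * 100_000
parser = parse_robots(body)
print(parser.can_fetch("ExampleBot", "https://www.example.com/private/"))  # False
```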
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the
Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times.
to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots;
Medium announced it would deny access to all artificial intelligence web crawlers, as "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of
Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity.
to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or
Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative artificial intelligence.
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular blocklists.
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but
X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
For websites with multiple subdomains, each subdomain must have its own robots.txt file. If
When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt).
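Because the file lives at a fixed, well-known path, the URL a crawler must fetch can be derived from any page URL's scheme and host. A small sketch (the helper name is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Derive the robots.txt URL for the origin serving `page_url`.

    Each scheme, host, and port combination is served its own
    robots.txt, so only those components are kept.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/some/deep/page.html"))
# https://www.example.com/robots.txt
print(robots_url("http://example.com:8080/index.html"))
# http://example.com:8080/robots.txt
```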
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload.
379:. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked
The standard was proposed in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time.
In the 2020s many websites began denying bots that collect information for generative artificial intelligence.
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed.
announced that it would stop complying with robots.txt directives. According to
The same result can be accomplished with an empty or missing robots.txt file.
This example tells two specific robots not to enter one specific directory:
file that displays information meant for humans to read. Some sites such as
In addition, each protocol and port needs its own robots.txt file;
National
Digital Information Infrastructure and Preservation Program
236:; most complied, including those operated by search engines such as
This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
The robots.txt protocol is widely complied with by bot operators.
255:. A proposed standard was published in September 2022 as RFC 9309.
All other files in the specified directory will be processed.
This example tells all robots not to enter three directories:
Server overload was a primary concern. By June 1994 it had become a de facto standard.
Gary Illyes, Henner Zeller, Lizzi
Sassman (IETF contributors)
directive has no value, meaning no pages are disallowed.
which portions of the website they are allowed to visit.
This example tells all robots to stay out of a website:
468:) when it detects a connection using one of the robots.
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
The standard, initially RobotsNotWanted.txt, allowed
OpenAI's GPTBot in their robots.txt file and 85 blocked
BotSeer – Now inactive search engine for robots.txt files
190:, another robot inclusion standard for websites.
564:Example demonstrating how comments can be used:
Some web archiving projects ignore robots.txt.
ads.txt, a standard for listing authorized ad sellers
450:National Institute of Standards and Technology
Archive Team uses the file to discover more links, such as sitemaps.
User-agent: bingbot
Allow: /
Crawl-delay: 10
User-agent: *
Disallow: /directory/file.html
Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
98:1994 published, formally standardized in 2022
577:Example demonstrating multiple user-agents:
267:in the root of the web site hierarchy (e.g.
636:Sitemap: http://www.example.com/sitemap.xml
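Python's standard-library parser exposes any such Sitemap lines after parsing (RobotFileParser.site_maps(), available since Python 3.8; the file content below is illustrative):

```python
from urllib import robotparser

RULES = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed URLs, or None when no Sitemap line exists.
print(parser.site_maps())  # ['http://www.example.com/sitemap.xml']
```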
The standard, developed in 1994, relies on voluntary compliance.
The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.
Automated Content Access Protocol – A failed proposal to extend robots.txt
is discouraged by standards bodies. The
297:did not, the rules that would apply for
54:For Knowledge's robots.txt file, see
56:https://en.wikipedia.org/robots.txt
269:https://www.example.com/robots.txt
181:generative artificial intelligence
779:Automated Content Access Protocol
727:Maximum size of a robots.txt file
498:not to kill the company founders
340:
328:
2129:from the original on 6 July 2024
812:National Digital Library Program
741:
713:A "noindex" HTTP response header
460:Many robots also pass a special
109:Martijn Koster (original author)
2090:from the original on 2022-10-17
2072:
2019:from the original on 2013-08-08
1972:
1947:
1935:from the original on 2018-11-18
1917:
1905:from the original on 2016-02-03
1866:
1855:from the original on 2018-11-18
1843:Newman, Lily Hay (2014-07-03).
1836:
1804:
1778:
1767:from the original on 2014-01-01
1753:
1742:from the original on 2013-12-29
1728:
1717:from the original on 2014-01-07
1703:
1692:from the original on 2016-04-01
1665:
1653:from the original on 2011-10-08
1620:
1608:from the original on 2015-08-21
1590:
1579:from the original on 2015-08-14
1565:
1540:
1528:from the original on 2017-05-16
1496:from the original on 2018-12-04
1478:
1466:from the original on 2017-02-18
1448:
1418:
1406:from the original on 2013-01-25
1373:from the original on 2013-01-21
1347:from the original on 2013-01-15
1284:from the original on 2014-08-18
1165:from the original on 2014-01-06
1151:
1099:from the original on 2019-07-10
1070:from the original on 2015-09-07
1058:Barry Schwartz (30 June 2014).
1002:from the original on 2013-11-25
943:from the original on 2014-01-12
917:from the original on 2013-09-27
884:from the original on 2017-04-03
599:for webmasters, to control the
455:
253:Internet Engineering Task Force
1460:"Robots.txt is a suicide note"
1081:
984:
954:
895:
866:
518:stands for all robots and the
309:does not apply to pages under
13:
1:
1547:Koebler, Jason (2024-07-29).
1514:Jones, Brad (24 April 2017).
939:. Robotstxt.org. 1994-06-30.
860:
618:directive, allowing multiple
420:reported that companies like
320:
307:http://example.com/robots.txt
285:A robots.txt file covers one
198:The standard was proposed by
2111:Allyn, Bobby (5 July 2024).
1876:. 2018-01-10. Archived from
7:
1980:"Robots.txt Specifications"
1184:"About Ask.com: Webmasters"
734:
509:
435:
258:
10:
2199:
2168:Search engine optimization
1426:"ArchiveBot: Bad behavior"
609:
User-agent: *
Disallow: /
483:redirect humans.txt to an
446:security through obscurity
293:had a robots.txt file but
193:
173:security through obscurity
148:used for implementing the
53:
2040:Robots Exclusion Protocol
1672:Sverre H. Huseby (2004).
1120:Robots Exclusion Protocol
User-agent: *
Disallow:
150:Robots Exclusion Protocol
119:
102:
94:
86:
73:
69:Robots Exclusion Protocol
68:
1161:. YouTube. Oct 5, 2009.
791:Distributed web crawling
716:
680:
646:Robot Exclusion Standard
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url.
466:pass alternative content
311:http://example.com:8080/
216:denial-of-service attack
206:in February 1994 on the
156:to indicate to visiting
1763:. Iis.net. 2013-11-06.
1644:10.6028/NIST.SP.800-123
User-agent: *
Allow: /
371:Artificial intelligence
937:"The Web Robots Pages"
902:Fielding, Roy (1994).
X-Robots-Tag: noindex
603:'s subsequent visits.
585:Nonstandard extensions
2118:All Things Considered
967:www-talk mailing list
774:eBay v. Bidder's Edge
656:Meta tags and headers
590:Crawl-delay directive
397:. In 2023, blog host
152:, a standard used by
1874:"/killer-robots.txt"
1738:. Httpd.apache.org.
1430:wiki.archiveteam.org
980:on October 29, 2013.
677:A "noindex" meta tag
471:Some sites, such as
315:https://example.com/
218:on Koster's server.
169:voluntary compliance
1812:"Github humans.txt"
1786:"Google humans.txt"
1713:. User-agents.org.
1224:on 13 December 2012
704:"noindex"
640:Universal "*" match
301:would not apply to
202:, when working for
65:
2067:Proposed Standard.
1929:support.google.com
1396:"Using robots.txt"
1214:"About AOL Search"
1147:Proposed Standard.
1064:Search Engine Land
808:for search engines
695:"robots"
492:/killer-robots.txt
394:The New York Times
333:Some major search
63:
27:
2084:Google Developers
1984:Google Developers
1492:. 17 April 2017.
1341:Google Developers
978:archived message)
139:
138:
90:Proposed Standard
51:Internet protocol
18:
801:Internet Archive
662:Robots meta tags
359:Internet Archive
796:Focused crawler
95:First published
654:
641:
638:
635:
611:
608:
605:
597:search console
496:the Terminator
364:Digital Trends
342:
341:Archival sites
339:
330:
329:Search engines
327:
322:
319:
260:
257:
223:web developers
212:Charles Stross
200:Martijn Koster
850:
849:Web archiving
847:
845:
842:
840:
837:
835:
832:
830:
827:
825:
822:
819:
816:
813:
810:
807:
806:Meta elements
804:
802:
799:
797:
794:
792:
789:
786:
783:
780:
777:
775:
772:
768:
764:
760:
756:
755:
750:
744:
739:
732:
724:
674:
672:
668:
663:
653:
647:
634:
631:
621:
604:
602:
598:
578:
575:
573:
565:
559:
553:
547:
541:
535:
532:
523:
507:
505:
501:
497:
488:
486:
482:
474:
469:
467:
463:
453:
451:
447:
443:
433:
431:
427:
426:Perplexity.ai
423:
419:
418:
409:
408:
402:
400:
396:
395:
390:
386:
382:
378:
377:generative AI
368:
366:
365:
360:
356:
353:. Co-founder
352:
348:
338:
336:
326:
318:
303:a.example.com
295:a.example.com
288:
283:
279:
276:
274:
256:
254:
249:
247:
243:
239:
235:
233:
228:
224:
219:
217:
213:
209:
205:
201:
191:
189:
184:
182:
178:
174:
170:
165:
163:
159:
155:
151:
147:
143:
134:
128:
122:
118:
111:
108:
107:
105:
101:
97:
93:
89:
85:
77:
72:
67:
61:
57:
2178:Web scraping
2131:. Retrieved
2116:
2092:. Retrieved
2083:
2074:
2066:
2039:
2032:
2021:. Retrieved
1994:February 15,
1992:. Retrieved
1983:
1974:
1963:. Retrieved
1959:the original
1949:
1937:. Retrieved
1928:
1919:
1907:. Retrieved
1893:
1882:. Retrieved
1878:the original
1868:
1857:. Retrieved
1848:
1838:
1826:. Retrieved
1815:
1806:
1794:. Retrieved
1780:
1769:. Retrieved
1755:
1744:. Retrieved
1730:
1719:. Retrieved
1705:
1694:. Retrieved
1674:
1667:
1655:. Retrieved
1635:
1622:
1610:. Retrieved
1602:The Register
1601:
1592:
1581:. Retrieved
1567:
1556:. Retrieved
1552:
1542:
1530:. Retrieved
1519:
1509:
1498:. Retrieved
1489:
1480:
1468:. Retrieved
1450:
1438:. Retrieved
1429:
1420:
1408:. Retrieved
1399:
1375:. Retrieved
1361:
1349:. Retrieved
1340:
1316:. Retrieved
1307:
1298:
1286:. Retrieved
1277:
1268:
1256:. Retrieved
1247:
1238:
1226:. Retrieved
1222:the original
1217:
1208:
1196:. Retrieved
1187:
1178:
1167:. Retrieved
1153:
1146:
1119:
1112:
1101:. Retrieved
1092:
1083:
1072:. Retrieved
1063:
1053:
1041:. Retrieved
1035:
1004:. Retrieved
995:
986:
971:the original
966:
956:
945:. Retrieved
931:
919:. Retrieved
910:
907:(PostScript)
897:
886:. Retrieved
877:
874:"Historical"
868:
767:security.txt
730:
721:
659:
645:
643:
629:
626:in the form
622:in the same
613:
593:
576:
569:
563:
557:
551:
545:
539:
533:
530:
513:
494:instructing
489:
484:
470:
459:
456:Alternatives
439:
415:
405:
403:
392:
374:
362:
347:Archive Team
344:
332:
324:
284:
280:
277:
262:
250:
231:
220:
207:
197:
185:
166:
158:web crawlers
149:
141:
140:
60:
38:
29:This is the
23:
1470:18 February
1456:Jason Scott
1410:16 February
1377:16 February
1351:16 February
1288:16 February
1258:16 February
1228:16 February
1198:16 February
854:Web crawler
844:Spider trap
652:statement.
504:Sergey Brin
355:Jason Scott
299:example.com
291:example.com
861:References
671:httpd.conf
624:robots.txt
581:directory
500:Larry Page
477:humans.txt
462:user-agent
430:blocklists
321:Compliance
265:robots.txt
238:WebCrawler
162:web robots
160:and other
142:robots.txt
64:robots.txt
1553:404 Media
1248:Baidu.com
1037:The Verge
976:Hypermail
667:.htaccess
650:Disallow:
628:Sitemap:
601:Googlebot
475:, host a
442:web robot
422:Anthropic
417:404 Media
407:The Verge
246:AltaVista
125:robotstxt
839:Sitemaps
834:Perma.cc
824:nofollow
820:(NDIIPP)
735:See also
630:full-url
620:Sitemaps
520:Disallow
510:Examples
436:Security
351:sitemaps
259:Standard
234:standard
232:de facto
208:www-talk
188:sitemaps
154:websites
146:filename
133:RFC 9309
35:reviewed
829:noindex
785:BotSeer
759:ads.txt
698:content
673:files.
616:Sitemap
610:Sitemap
335:engines
273:website
194:History
144:is the
120:Website
103:Authors
81:folder.
2133:6 July
1817:GitHub
1682:
814:(NDLP)
572:Google
487:page.
481:GitHub
473:Google
399:Medium
385:Google
381:OpenAI
287:origin
244:, and
227:server
177:server
87:Status
1651:(PDF)
1632:(PDF)
1532:8 May
707:/>
485:About
411:'
242:Lycos
204:Nexor
2135:2024
2062:9309
2045:IETF
1996:2020
1941:2018
1911:2016
1830:2019
1798:2019
1680:ISBN
1659:2015
1614:2015
1534:2017
1472:2017
1442:2022
1412:2013
1379:2013
1353:2013
1320:2017
1290:2013
1260:2013
1230:2013
1200:2013
1142:9309
1125:IETF
1045:2024
1008:2014
923:2013
689:name
686:meta
683:<
669:and
644:The
502:and
424:and
391:and
127:.org
2123:NPR
2059:RFC
2049:doi
1640:doi
1139:RFC
1129:doi
389:BBC
313:or
37:on