The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits.
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application operate only on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components."
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, that other user-agents cannot crawl more than one page every 20 seconds, and that they are not allowed to crawl the "secret" folder.
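A file matching that description might look like the following sketch (the 20-second throttle uses the nonstandard Crawl-delay extension described below, and the "/secret/" path name is illustrative):

```
User-agent: Mallorybot
Disallow: /

User-agent: *
Crawl-delay: 20
Disallow: /secret/
```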
The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its Search Console for webmasters to control the Googlebot's subsequent visits.
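Python's standard urllib.robotparser module exposes this extension, which gives a quick way to see how a crawler library might consume it (a sketch; the bingbot rules mirror the example elsewhere in this article):

```python
from urllib import robotparser

# Parse a robots.txt body directly (no network fetch) and read the
# nonstandard Crawl-delay extension, which urllib reports per user-agent.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: bingbot",
    "Allow: /",
    "Crawl-delay: 10",
])
delay = rp.crawl_delay("bingbot")  # 10
```

Crawlers that do not recognize the directive simply skip it, which is consistent with its status as a nonstandard extension.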
This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website.
The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.
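As a rough illustration of how a crawler might honor the header side of this mechanism, the helper below (a hypothetical function, not part of any standard API) checks a response's X-Robots-Tag value for an indexing prohibition; it deliberately ignores user-agent-scoped forms such as "googlebot: noindex":

```python
def x_robots_forbids_indexing(headers):
    """Return True if an X-Robots-Tag header asks crawlers not to index.

    `headers` is a plain mapping of header names to values. Real values
    are comma-separated directive lists, e.g. "noindex, nofollow"; this
    simplified sketch skips user-agent-scoped forms.
    """
    value = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in value.split(",")}
    return "noindex" in directives or "none" in directives
```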
According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500-kibibyte file size restriction for robots.txt files.
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."
Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times.
The internet was small enough in 1994 to maintain a complete list of all bots.
In 2023, the blog host Medium announced it would deny access to all artificial intelligence web crawlers, saying "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers.
Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity.
A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative artificial intelligence.
security.txt – a file to describe the process for security researchers to follow in order to report security vulnerabilities
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
For websites with multiple subdomains, each subdomain must have its own robots.txt file.
When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt).
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload.
In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended.
In the 2020s many websites began denying bots that collect information for generative artificial intelligence.
This example tells all robots that they can visit all files, because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed.
In 2017, the Internet Archive announced that it would stop complying with robots.txt directives.
The same result can be accomplished with an empty or missing robots.txt file.
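Rules like the examples above can be evaluated without writing a parser; Python's standard urllib.robotparser module accepts a robots.txt body directly (the rules and URLs here are illustrative):

```python
from urllib import robotparser

# Feed a robots.txt body straight to the parser (no network fetch needed).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
])

# A path under a disallowed directory vs. an unlisted page.
blocked = rp.can_fetch("SomeBot", "https://example.com/tmp/archive.zip")  # False
allowed = rp.can_fetch("SomeBot", "https://example.com/page.html")        # True
```

An empty rule set (or a missing file, which `RobotFileParser.read()` treats as allow-all on a 404) makes `can_fetch` return True for every URL.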
This example tells two specific robots not to enter one specific directory:
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page.
In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
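That per-origin scoping is mechanical enough to express in a few lines; the helper below (an illustrative function, not a published API) derives the governing robots.txt URL for any page URL using only the scheme, host, and port:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL for the origin that governs page_url.

    Each scheme/host/port combination has its own robots.txt, so only
    those components are kept; path, query, and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```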
National Digital Information Infrastructure and Preservation Program (NDIIPP)
Most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.
This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
A proposed standard was published in September 2022 as RFC 9309.
All other files in the specified directory will be processed.
This example tells all robots not to enter three directories:
Server overload was a primary concern; by June 1994 robots.txt had become a de facto standard.
Gary Illyes, Henner Zeller, Lizzi Sassman (IETF contributors)
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
This example tells all robots to stay out of a website:
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
The standard, initially called RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access.
BotSeer – a now-inactive search engine for robots.txt files
Example demonstrating how comments can be used:
Some web archiving projects ignore robots.txt.
980:, a standard for listing authorized ad sellers
Archive Team uses the file to discover more links, such as sitemaps.
User-agent: bingbot
Allow: /
Crawl-delay: 10
User-agent: *
Disallow: /directory/file.html
Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
First published 1994; formally standardized in 2022
Example demonstrating multiple user-agents:
Sitemap: http://www.example.com/sitemap.xml
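Parsed Sitemap lines can be read back out programmatically; a minimal sketch with Python's urllib.robotparser (the site_maps method requires Python 3.8+), assuming a directive like the one above:

```python
from urllib import robotparser

# Sitemap lines are user-agent independent; urllib collects them all.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: http://www.example.com/sitemap.xml",
])
maps = rp.site_maps()  # ['http://www.example.com/sitemap.xml']
```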
The standard, developed in 1994, relies on voluntary compliance.
The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.
Automated Content Access Protocol – a failed proposal to extend robots.txt
Maximum size of a robots.txt file
National Digital Library Program (NDLP)
A "noindex" HTTP response header:
Many robots also pass a special user-agent string to the web server when fetching content.
Martijn Koster (original author)
A robots.txt file covers one origin. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time.
User-agent: *
Disallow: /
User-agent: *
Disallow:
Robots Exclusion Protocol
Distributed web crawling
Some crawlers support a Sitemap directive, allowing multiple sitemaps in the same robots.txt in the form "Sitemap: full-url".
User-agent: *
Allow: /
Artificial intelligence
X-Robots-Tag: noindex
Nonstandard extensions
Meta tags and headers
Crawl-delay directive
A "noindex" meta tag:
<meta name="robots" content="noindex" />
Universal "*" match
Status: Proposed Standard