robots.txt

For Wikipedia's robots.txt file, see https://en.wikipedia.org/robots.txt.

Robots Exclusion Protocol
Internet protocol
Status: Proposed Standard
First published: 1994 (formally standardized in 2022)
Authors: Martijn Koster (original author); Gary Illyes, Henner Zeller, Lizzi Sassman (IETF contributors)
Website: robotstxt.org; RFC 9309

Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds and are not allowed to crawl the "secret" folder.

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s, many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.

==History==

The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most bots complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard was published in September 2022 as RFC 9309.
==Standard==

When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website.
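A minimal sketch of this fetch-and-check step, using Python's standard urllib.robotparser module; the site URL and the bot name "ExampleBot" are placeholders:

from urllib import robotparser

# Fetch and parse the site's robots.txt before requesting any other page.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether this crawler may fetch a given URL under the parsed rules.
if rp.can_fetch("ExampleBot", "https://www.example.com/secret/page.html"):
    print("allowed by robots.txt")
else:
    print("disallowed by robots.txt")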
If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.

==Compliance==

A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.

===Search engines===

Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.

===Archival sites===

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.

===Artificial intelligence===

Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, the blog host Medium announced it would deny access to all artificial intelligence web crawlers, saying "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.

==Security==

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against the practice: "System security should not depend on the secrecy of the implementation or its components."

==Alternatives==

Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
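A minimal sketch of such a server-side block using Python's standard http.server; the robot names are hypothetical, and a production site would normally express this in the web server's own configuration instead:

from http.server import HTTPServer, SimpleHTTPRequestHandler

BLOCKED = ("badbot", "evilscraper")  # hypothetical user-agent substrings

class BlockingHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        if any(bot in ua for bot in BLOCKED):
            self.send_error(403)  # automatically return failure to these robots
        else:
            super().do_GET()      # serve everyone else normally

if __name__ == "__main__":
    HTTPServer(("", 8000), BlockingHandler).serve_forever()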
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect their humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.

==Examples==

This example tells all robots that they can visit all files, because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

or, equivalently:

User-agent: *
Allow: /

The same result can be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells a specific robot to stay out of a website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
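For illustration, a reduced version of the rules above can be checked programmatically with Python's standard urllib.robotparser, which matches user-agent names as substrings, in file order:

from urllib import robotparser

rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /something/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "/private/page.html"))   # False: googlebot group applies
print(rp.can_fetch("googlebot", "/public/page.html"))    # True: not disallowed
print(rp.can_fetch("otherbot", "/something/page.html"))  # False: falls to the * group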
==Nonstandard extensions==

===Crawl-delay directive===

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10
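Python's urllib.robotparser also exposes this nonstandard value; a short sketch, with a placeholder URL:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Returns the Crawl-delay value applying to this agent, or None if absent.
print(rp.crawl_delay("bingbot"))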
===Sitemap===

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml
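The same standard-library parser can report these entries; a short sketch (RobotFileParser.site_maps requires Python 3.8 or later; the URL is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Returns a list of Sitemap URLs found in robots.txt, or None if there are none.
print(rp.site_maps())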
===Universal "*" match===

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.

==Meta tags and headers==

In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex

The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
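A minimal sketch of a server attaching this header to every response, using Python's standard http.server; the port is arbitrary, and production setups would typically use .htaccess or httpd.conf as described above:

from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoIndexHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Add the exclusion header to every response, including
        # non-HTML files such as images and PDF documents.
        self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), NoIndexHandler).serve_forever()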
==Maximum size of a robots.txt file==

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512000 bytes) of robots.txt files,{{Ref RFC|9309|section=2.5: Limits}} which Google maintains as a 500 kibibyte file size restriction for robots.txt files.<ref>{{Cite web |title=How Google Interprets the robots.txt Specification {{!}} Documentation |url=https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |access-date=2022-10-17 |website=Google Developers |language=en |archive-date=2022-10-17 |archive-url=https://web.archive.org/web/20221017101925/https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |url-status=live }}</ref>
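A sketch of how a crawler might honor this limit, reading at most 500 kibibytes and assuming anything beyond that may be disregarded; the URL is a placeholder:

import urllib.request

MAX_ROBOTS_BYTES = 500 * 1024  # minimum parse requirement from RFC 9309, section 2.5

with urllib.request.urlopen("https://www.example.com/robots.txt") as resp:
    body = resp.read(MAX_ROBOTS_BYTES)  # bytes past the limit may be ignored
print(f"parsed {len(body)} bytes of robots.txt")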
==See also==

* ads.txt – a standard for listing authorized ad sellers
* security.txt – a file to describe the process for security researchers to follow in order to report security vulnerabilities
* Automated Content Access Protocol – a failed proposal to extend robots.txt
* BotSeer – now-inactive search engine for robots.txt files
* Distributed web crawling
* Focused crawler
* Internet Archive
* Meta elements for search engines
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* nofollow
* noindex
* Perma.cc
* Sitemaps
* Spider trap
* Web archiving
* Web crawler

==References==

{{Reflist}}

==External links==

* Official website

Categories: Search engine optimization | Web scraping | Websites | Internet | Text files