
Robots.txt: Difference between revisions

===Maximum size of a robots.txt file===

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512000 bytes) of robots.txt files,{{Ref RFC|9309|section=2.5: Limits}} which Google maintains as a 500 kibibyte file size restriction for robots.txt files.<ref>{{Cite web |title=How Google Interprets the robots.txt Specification {{!}} Documentation |url=https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |access-date=2022-10-17 |website=Google Developers |language=en |archive-date=2022-10-17 |archive-url=https://web.archive.org/web/20221017101925/https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt |url-status=live }}</ref>
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit. The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence. The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.
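A typical crawl-delay rule, here asking Bing's crawler to wait ten seconds between visits, looks like this:

User-agent: bingbot
Allow: /
Crawl-delay: 10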
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" folder.
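The illustration itself is not reproduced here; a file matching that description would look roughly like the following sketch (the path /secret/ is assumed for the "secret" folder):

User-agent: Mallorybot
Disallow: /

User-agent: *
Crawl-delay: 20
Disallow: /secret/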
When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard was published in September 2022 as RFC 9309.
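A short sketch of the scoping rules described in this section, using the same hypothetical hosts:

https://example.com/robots.txt      covers pages served from https://example.com/
https://a.example.com/robots.txt    must be provided separately for the a.example.com subdomain
http://example.com:8080/robots.txt  must be provided separately for that protocol and port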
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus, if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
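For reference, the robots meta tag and the X-Robots-Tag header discussed in this section take the following standard "noindex" forms:

<meta name="robots" content="noindex" />

X-Robots-Tag: noindex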
Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.

Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites such as GitHub redirect humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
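A server configured as described in this section refuses requests whose User-Agent header matches a known robot; as a rough sketch (hypothetical bot name and path):

GET /private/page.html HTTP/1.1
Host: www.example.com
User-Agent: BadBot/1.0

HTTP/1.1 403 Forbidden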
The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.
A robots.txt has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators. Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
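For example, a rule meant to keep crawlers out of an administrative area also advertises that area's location to anyone who reads the file (hypothetical path):

User-agent: *
Disallow: /admin/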
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
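A robots.txt fragment of the kind discussed in this section, blocking the GPTBot and Google-Extended crawlers site-wide, would look like this sketch:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /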
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

The same result can be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells one specific robot to stay out of the website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.

See also

ads.txt, a standard for listing authorized ad sellers
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Library Program (NDLP)
National Digital Information Infrastructure and Preservation Program (NDIIPP)
nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler


Revision as of 07:47, 4 July 2024 by 2a00:1e88:b0b3:7900:6717:dc15:703e:a02d (→External links)
Revision as of 01:27, 6 July 2024 by CitationsRuleTheNation (remove an extra space)
For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.

External links

robotstxt.org – official website
RFC 9309 – Robots Exclusion Protocol

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
