
robots.txt

For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.

Robots Exclusion Protocol
Status: Proposed Standard
First published: 1994; formally standardized in 2022
Authors: Martijn Koster (original author); Gary Illyes, Henner Zeller, Lizzi Sassman (IETF contributors)
Website: robotstxt.org

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.

History

The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard was published in September 2022 as RFC 9309.

Standard

When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file: if example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.

One illustration from the article is a simple robots.txt file indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, that other user-agents cannot crawl more than one page every 20 seconds, and that they are not allowed to crawl the "secret" folder.
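
The file behind that caption is not reproduced in the text; a minimal sketch matching the description, assuming the nonstandard Crawl-delay directive (covered below) expresses the 20-second limit, might be:

User-agent: Mallorybot
Disallow: /

User-agent: *
Crawl-delay: 20
Disallow: /secret/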

Compliance

A robots.txt has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.

Search engines

Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.

Artificial intelligence

Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
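
For instance, a site that wanted to block only the two AI crawlers named above, while leaving the rest of the site open, could serve a robots.txt along these lines (GPTBot and Google-Extended are the user-agent tokens discussed in this section):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /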

Archival sites

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.

Security

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.

Alternatives

Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.

Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites such as GitHub redirect humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.

Examples

This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

The same result can be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells one specific robot to stay out of a website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory

Nonstandard extensions

Crawl-delay directive

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10

Sitemap

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml

Universal "*" match

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.

Meta tags and headers

In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex

The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.

Maximum size of a robots.txt file

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500 kibibyte file size restriction for robots.txt files.
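
As an illustration of the header route, an Apache configuration sketch (assuming mod_headers is enabled) that attaches the header to every PDF file might be:

# Serve X-Robots-Tag: noindex with all PDF documents
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>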

See also

ads.txt, a standard for listing authorized ad sellers
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now-inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Information Infrastructure and Preservation Program (NDIIPP)
National Digital Library Program (NDLP)
noindex
nofollow
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler

External links

Official website: robotstxt.org
