
robots.txt


The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be
It cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" folder.
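A file matching that description might look like the following sketch; the folder name and the exact layout are assumptions based on the description above, and Crawl-delay is the nonstandard directive covered later:

User-agent: Mallorybot # disallow this bot entirely
Disallow: /

User-agent: * # all other bots
Crawl-delay: 20 # wait at least 20 seconds between requests
Disallow: /secret/ # do not crawl the "secret" folder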
The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.
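For example, a robots.txt entry asking a crawler to pause between requests might look like this (the 10-second value is illustrative; as noted, each crawler interprets the directive in its own way):

User-agent: bingbot
Allow: /
Crawl-delay: 10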
375:). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the 750:
and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.
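As a sketch, assuming Apache with the mod_headers module enabled, an .htaccess or httpd.conf entry along these lines would attach the header to every PDF the server sends:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>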
This followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512000 bytes) of a robots.txt file, and Google maintains a 500 kibibyte file size restriction for robots.txt files.
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the
Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times.
to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots;
announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as
Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of
Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity.
to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
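As an illustration of that server-side approach, assuming Apache with mod_rewrite enabled and "BadBot" as a placeholder user-agent string, a configuration like the following returns a 403 failure instead of content:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]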
claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI.
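A robots.txt written for that purpose typically names the AI crawlers explicitly; for example (the pair of user-agents shown is illustrative):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /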
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but
X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
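For reference, the two page-level mechanisms look like this, a robots meta tag in the page's HTML and the equivalent X-Robots-Tag HTTP response header:

<meta name="robots" content="noindex" />

X-Robots-Tag: noindex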
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
1552:"Robots.txt meant for search engines don't work well for web archives | Internet Archive Blogs" 581: 2054:"Robots meta tag and X-Robots-Tag HTTP header specifications - Webmasters — Google Developers" 393:. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If 2205: 1714: 682: 139: 2099: 1965: 1204: 272: 367:
When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt).
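A compliant crawler checks this file before requesting anything else; a minimal sketch using Python's standard urllib.robotparser module (the bot name and URLs are placeholders):

from urllib import robotparser

# Fetch and parse the site's robots.txt once, before crawling.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given user-agent may fetch a given URL.
if rp.can_fetch("ExampleBot", "https://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")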
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
279:. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate 1613: 497: 483:. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked 390: 330: 280: 82: 33: 1776: 1638: 2200: 1720: 1224: 334: 314:
mailing list, the main communication channel for WWW-related activities at the time.
283:
overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence.
21: 2174: 2089: 1680: 1249: 1194: 866: 502: 462: 1309: 2179: 893: 861: 747: 600:
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value:
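Either of the following equivalent files has that effect:

User-agent: *
Allow: /

User-agent: *
Disallow: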
2102: 2079: 1207: 1184: 465:
announced that it would stop complying with robots.txt directives. According to
236: 1586: 1461: 467: 438: 315: 303: 970:"Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web" 939: 617:
The same result can be accomplished with an empty or missing robots.txt file.
2189: 1340:"Robots Exclusion Protocol: joining together to provide better documentation" 1036: 909: 326: 1684: 1002: 644:
This example tells two specific robots not to enter one specific directory:
565:
file that displays information meant for humans to read. Some sites such as GitHub redirect humans.txt to an About page.
1918: 837: 450: 265: 1802:"Deny Strings for Filtering Rules : The Official Microsoft IIS Site" 1521: 914: 904: 589: 458: 261: 1582:"The Internet Archive Will Ignore Robots.txt Files to Maintain Accuracy" 756: 585: 547: 341: 114: 2094: 1996:"Yahoo! Search Blog - Webmasters can now auto-discover with Sitemaps" 1369: 1199: 1102: 1058:"How I got here in the end, part five: "things can only get better!"" 1041: 752: 686: 527: 510: 409:. In addition, each protocol and port needs its own robots.txt file; 349: 2121:"How Google Interprets the robots.txt Specification | Documentation" 1826: 878:
National Digital Information Infrastructure and Preservation Program
340:; most complied, including those operated by search engines such as 899: 888: 883: 705: 632:
This example tells all robots to stay away from one specific file:
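For example (the file path is illustrative):

User-agent: *
Disallow: /directory/file.html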
629:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
551: 359:. A proposed standard was published in September 2022 as RFC 9309. 291: 249: 2078:
Koster, M.; Illyes, G.; Zeller, H.; Sassman, L. (September 2022).
1287: 1183:
1639:"Robots.txt tells hackers the places you don't want them to look" 919: 850: 829: 454: 376: 257: 1852: 1857: 657: 566: 558: 488: 484: 1751: 638:
All other files in the specified directory will be processed.
626:
This example tells all robots not to enter three directories:
798: 345: 333:
overload was a primary concern. By June 1994 it had become a de facto standard;
307: 216:
Gary Illyes, Henner Zeller, Lizzi Sassman (IETF contributors)
179: 2153: 2085: 1190: 1126:"Robots.txt Celebrates 20 Years Of Blocking Search Engines" 1716:
Innocent Code: A Security Wake-Up Call for Web Programmers
1614:"Block URLs with robots.txt: Learn about robots.txt files" 64:
2077: 1397: 1395: 1182: 1155:"Formalizing the Robots Exclusion Protocol Specification" 608:
directive has no value, meaning no pages are disallowed.
492: 268:
which portions of the website they are allowed to visit.
620:
This example tells all robots to stay out of a website:
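For example:

User-agent: *
Disallow: /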
554:) when it detects a connection using one of the robots. 228: 1940:"To crawl or not to crawl, that is BingBot's question" 1392: 290:
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
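Some crawlers also support a Sitemap directive inside robots.txt, pointing at the sitemap's URL; a typical entry (the URL is a placeholder) looks like this:

Sitemap: http://www.example.com/sitemap.xml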
95: 52: 2163: 325:
The standard, initially RobotsNotWanted.txt, allowed web developers
1668:
Scarfone, K. A.; Jansen, W.; Tracy, M. (July 2008).
1425: 977:
First International Conference on the World Wide Web
812: 159: 66: 1966:"Change Googlebot crawl rate - Search Console Help" 1667: 487:'s GPTBot in their robots.txt file and 85 blocked 853:– now inactive search engine for robots.txt files 762: 294:, another robot inclusion standard for websites. 2187: 1932: 1752:"List of User-Agents (Spiders, Robots, Browser)" 1712: 1573: 1456: 1454: 1332: 1272: 725: 650:Example demonstrating how comments can be used: 449:Some web archiving projects ignore robots.txt. 2048: 2046: 1677:National Institute of Standards and Technology 1362: 1242: 1123: 1029:"Important: Spiders, Robots and Web Wanderers" 833:, a standard for listing authorized ad sellers 536:National Institute of Standards and Technology 453:uses the file to discover more links, such as 1451: 692:User-agent: bingbot Allow: / Crawl-delay: 10 635:User-agent: * Disallow: /directory/file.html 576:Previously, Google had a joke file hosted at 202:1994 published, formally standardized in 2022 1302: 663:Example demonstrating multiple user-agents: 371:in the root of the web site hierarchy (e.g. 2043: 722:Sitemap: http://www.example.com/sitemap.xml 271:The standard, developed in 1994, relies on 2110:sec. 2.5: Limits. 1433:"Submitting your website to Yahoo! Search" 734:does not mention the "*" character in the 474: 178: 136: 2093: 1719:. John Wiley & Sons. pp. 91–92. 1198: 670: 1117: 967: 847:– a failed proposal to extend robots.txt 741: 675: 534:is discouraged by standards bodies. The 401:did not, the rules that would apply for 47: 1520: 1403:"Webmasters: Robots.txt Specifications" 1091: 1089: 1087: 1085: 1083: 73: 14: 2188: 1883: 1176: 1159:Official Google Webmaster Central Blog 1098:"The text file that runs the internet" 1095: 1026: 2031:from the original on November 2, 2019 1833:from the original on January 24, 2017 1777:"Access Control - Apache HTTP Server" 1579: 1380:from the original on 16 February 2017 158:For Knowledge's robots.txt file, see 74:Revision as of 07:07, 4 July 2024 by 44: 25: 1502:from the original on 10 October 2022 1260:from the original on 27 January 2013 1080: 1027:Koster, Martijn (25 February 1994). 17: 160:https://en.wikipedia.org/robots.txt 135: 104: 96:→‎Maximum size of a robots.txt file 53:→‎Maximum size of a robots.txt file 2071: 1670:"Guide to General Server Security" 1320:from the original on 6 August 2013 1225:"Uncrawled URLs in search results" 1096:Pierce, David (14 February 2024). 373:https://www.example.com/robots.txt 285:generative artificial intelligence 153: 2222: 2145: 1865:from the original on May 30, 2016 995: 845:Automated Content Access Protocol 813:Maximum size of a robots.txt file 584:not to kill the company founders 444: 432: 60:. The present address (URL) is a 2173: 872:National Digital Library Program 799:A "noindex" HTTP response header 546:Many robots also pass a special 213:Martijn Koster (original author) 2131:from the original on 2022-10-17 2113: 2060:from the original on 2013-08-08 2013: 1988: 1976:from the original on 2018-11-18 1958: 1946:from the original on 2016-02-03 1907: 1896:from the original on 2018-11-18 1884:Newman, Lily Hay (2014-07-03). 
1877: 1845: 1819: 1808:from the original on 2014-01-01 1794: 1783:from the original on 2013-12-29 1769: 1758:from the original on 2014-01-07 1744: 1733:from the original on 2016-04-01 1706: 1694:from the original on 2011-10-08 1661: 1649:from the original on 2015-08-21 1631: 1620:from the original on 2015-08-14 1606: 1594:from the original on 2017-05-16 1562:from the original on 2018-12-04 1544: 1532:from the original on 2017-02-18 1514: 1484: 1472:from the original on 2013-01-25 1439:from the original on 2013-01-21 1413:from the original on 2013-01-15 1350:from the original on 2014-08-18 1231:from the original on 2014-01-06 1217: 1165:from the original on 2019-07-10 1136:from the original on 2015-09-07 1124:Barry Schwartz (30 June 2014). 1068:from the original on 2013-11-25 1009:from the original on 2014-01-12 983:from the original on 2013-09-27 950:from the original on 2017-04-03 685:for webmasters, to control the 541: 357:Internet Engineering Task Force 1526:"Robots.txt is a suicide note" 1147: 1050: 1020: 961: 932: 604:stands for all robots and the 413:does not apply to pages under 13: 1: 1580:Jones, Brad (24 April 2017). 1005:. Robotstxt.org. 1994-06-30. 926: 704:directive, allowing multiple 424: 411:http://example.com/robots.txt 389:A robots.txt file covers one 302:The standard was proposed by 150:, was based on this revision. 1917:. 2018-01-10. Archived from 7: 2021:"Robots.txt Specifications" 1250:"About Ask.com: Webmasters" 821: 595: 521: 362: 24:of this page, as edited by 10: 2227: 2196:Search engine optimization 1492:"ArchiveBot: Bad behavior" 695: 623:User-agent: * Disallow: / 569:redirect humans.txt to an 532:security through obscurity 397:had a robots.txt file but 297: 277:security through obscurity 252:used for implementing the 157: 93: 50: 2081:Robots Exclusion Protocol 1713:Sverre H. Huseby (2004). 1186:Robots Exclusion Protocol 611:User-agent: * Disallow: 254:Robots Exclusion Protocol 223: 206: 198: 190: 177: 173:Robots Exclusion Protocol 172: 1227:. YouTube. Oct 5, 2009. 857:Distributed web crawling 802: 766: 732:Robot Exclusion Standard 700:Some crawlers support a 552:pass alternative content 415:http://example.com:8080/ 320:denial-of-service attack 310:in February 1994 on the 260:to indicate to visiting 1804:. Iis.net. 2013-11-06. 1685:10.6028/NIST.SP.800-123 614:User-agent: * Allow: / 475:Artificial intelligence 1003:"The Web Robots Pages" 968:Fielding, Roy (1994). 804:X-Robots-Tag: noindex 689:'s subsequent visits. 671:Nonstandard extensions 1033:www-talk mailing list 742:Meta tags and headers 676:Crawl-delay directive 501:. In 2023, blog host 256:, a standard used by 1915:"/killer-robots.txt" 1779:. Httpd.apache.org. 1496:wiki.archiveteam.org 1046:on October 29, 2013. 763:A "noindex" meta tag 557:Some sites, such as 419:https://example.com/ 322:on Koster's server. 273:voluntary compliance 1853:"Github humans.txt" 1827:"Google humans.txt" 1754:. User-agents.org. 1290:on 13 December 2012 790:"noindex" 726:Universal "*" match 405:would not apply to 306:, when working for 169: 111:← Previous revision 2108:Proposed Standard. 1970:support.google.com 1462:"Using robots.txt" 1280:"About AOL Search" 1213:Proposed Standard. 1130:Search Engine Land 896:for search engines 781:"robots" 578:/killer-robots.txt 498:The New York Times 437:Some major search 167: 45:07:07, 4 July 2024 2125:Google Developers 2025:Google Developers 1558:. 17 April 2017. 
1407:Google Developers 1044:archived message) 243: 242: 194:Proposed Standard 155:Internet protocol 2218: 2178: 2177: 2169: 2157: 2156: 2154:Official website 2140: 2139: 2137: 2136: 2117: 2111: 2106: 2097: 2095:10.17487/RFC9309 2075: 2069: 2068: 2066: 2065: 2050: 2041: 2040: 2038: 2036: 2017: 2011: 2010: 2008: 2007: 1998:. Archived from 1992: 1986: 1985: 1983: 1981: 1962: 1956: 1955: 1953: 1951: 1936: 1930: 1929: 1927: 1926: 1911: 1905: 1904: 1902: 1901: 1881: 1875: 1874: 1872: 1870: 1849: 1843: 1842: 1840: 1838: 1823: 1817: 1816: 1814: 1813: 1798: 1792: 1791: 1789: 1788: 1773: 1767: 1766: 1764: 1763: 1748: 1742: 1741: 1739: 1738: 1710: 1704: 1703: 1701: 1699: 1693: 1674: 1665: 1659: 1658: 1656: 1654: 1635: 1629: 1628: 1626: 1625: 1610: 1604: 1603: 1601: 1599: 1577: 1571: 1570: 1568: 1567: 1556:blog.archive.org 1548: 1542: 1541: 1539: 1537: 1528:. Archive Team. 1518: 1512: 1511: 1509: 1507: 1498:. Archive Team. 1488: 1482: 1481: 1479: 1477: 1458: 1449: 1448: 1446: 1444: 1429: 1423: 1422: 1420: 1418: 1399: 1390: 1389: 1387: 1385: 1370:"DuckDuckGo Bot" 1366: 1360: 1359: 1357: 1355: 1336: 1330: 1329: 1327: 1325: 1306: 1300: 1299: 1297: 1295: 1286:. Archived from 1276: 1270: 1269: 1267: 1265: 1246: 1240: 1239: 1237: 1236: 1221: 1215: 1211: 1202: 1200:10.17487/RFC9309 1180: 1174: 1173: 1171: 1170: 1151: 1145: 1144: 1142: 1141: 1121: 1115: 1114: 1112: 1110: 1093: 1078: 1077: 1075: 1073: 1064:. 19 June 2006. 1054: 1048: 1047: 1045: 1035:. Archived from 1024: 1018: 1017: 1015: 1014: 999: 993: 992: 990: 988: 974: 965: 959: 958: 956: 955: 944:Greenhills.co.uk 936: 867:Internet Archive 840: 832: 794: 791: 788: 785: 782: 779: 776: 773: 770: 748:Robots meta tags 737: 718: 711: 703: 607: 603: 579: 564: 516: 463:Internet Archive 420: 416: 412: 408: 404: 400: 396: 374: 370: 239: 233: 230: 182: 170: 166: 140:accepted version 123:Newer revision → 101: 99: 98: 90: 69: 67:current revision 59: 58: 56: 55: 46: 42: 41: 2226: 2225: 2221: 2220: 2219: 2217: 2216: 2215: 2186: 2185: 2184: 2172: 2164: 2161: 2152: 2151: 2148: 2143: 2134: 2132: 2119: 2118: 2114: 2076: 2072: 2063: 2061: 2052: 2051: 2044: 2034: 2032: 2019: 2018: 2014: 2005: 2003: 1994: 1993: 1989: 1979: 1977: 1964: 1963: 1959: 1949: 1947: 1938: 1937: 1933: 1924: 1922: 1913: 1912: 1908: 1899: 1897: 1882: 1878: 1868: 1866: 1851: 1850: 1846: 1836: 1834: 1825: 1824: 1820: 1811: 1809: 1800: 1799: 1795: 1786: 1784: 1775: 1774: 1770: 1761: 1759: 1750: 1749: 1745: 1736: 1734: 1727: 1711: 1707: 1697: 1695: 1691: 1672: 1666: 1662: 1652: 1650: 1637: 1636: 1632: 1623: 1621: 1612: 1611: 1607: 1597: 1595: 1578: 1574: 1565: 1563: 1550: 1549: 1545: 1535: 1533: 1519: 1515: 1505: 1503: 1490: 1489: 1485: 1475: 1473: 1466:Help.yandex.com 1460: 1459: 1452: 1442: 1440: 1431: 1430: 1426: 1416: 1414: 1401: 1400: 1393: 1383: 1381: 1368: 1367: 1363: 1353: 1351: 1338: 1337: 1333: 1323: 1321: 1308: 1307: 1303: 1293: 1291: 1278: 1277: 1273: 1263: 1261: 1248: 1247: 1243: 1234: 1232: 1223: 1222: 1218: 1181: 1177: 1168: 1166: 1153: 1152: 1148: 1139: 1137: 1122: 1118: 1108: 1106: 1094: 1081: 1071: 1069: 1062:Charlie's Diary 1056: 1055: 1051: 1039: 1025: 1021: 1012: 1010: 1001: 1000: 996: 986: 984: 972: 966: 962: 953: 951: 938: 937: 933: 929: 924: 862:Focused crawler 836: 828: 824: 815: 806: 805: 801: 796: 795: 792: 789: 786: 783: 780: 777: 774: 771: 768: 765: 744: 735: 728: 723: 713: 709: 701: 698: 693: 678: 673: 668: 654: 648: 642: 636: 630: 624: 615: 612: 605: 601: 598: 577: 562: 544: 524: 514: 477: 447: 435: 427: 418: 414: 410: 406: 402: 398: 394: 372: 368: 365: 300: 235: 227: 219: 
199:First published 186: 163: 156: 152: 151: 134: 133: 132: 131: 130: 115:Latest revision 103: 102: 94: 91: 80: 78: 65: 51: 48: 31: 29: 12: 11: 5: 2224: 2214: 2213: 2208: 2203: 2198: 2183: 2182: 2159: 2158: 2147: 2146:External links 2144: 2142: 2141: 2112: 2070: 2042: 2012: 1987: 1957: 1942:. 3 May 2012. 1931: 1906: 1890:Slate Magazine 1876: 1844: 1818: 1793: 1768: 1743: 1725: 1705: 1660: 1630: 1605: 1587:Digital Trends 1572: 1543: 1513: 1483: 1450: 1424: 1391: 1374:DuckDuckGo.com 1361: 1344:Blogs.bing.com 1331: 1301: 1284:Search.aol.com 1271: 1241: 1216: 1175: 1146: 1116: 1079: 1049: 1019: 994: 960: 930: 928: 925: 923: 922: 917: 912: 907: 902: 897: 891: 886: 881: 875: 869: 864: 859: 854: 848: 842: 834: 825: 823: 820: 814: 811: 803: 800: 797: 767: 764: 761: 743: 740: 727: 724: 721: 697: 694: 691: 683:search console 677: 674: 672: 669: 665: 652: 646: 640: 634: 628: 622: 613: 610: 597: 594: 582:the Terminator 543: 540: 523: 520: 476: 473: 468:Digital Trends 446: 445:Archival sites 443: 434: 433:Search engines 431: 426: 423: 364: 361: 327:web developers 316:Charles Stross 304:Martijn Koster 299: 296: 241: 240: 225: 221: 220: 218: 217: 214: 210: 208: 204: 203: 200: 196: 195: 192: 188: 187: 183: 175: 174: 154: 142:of this page, 137: 76: 62:permanent link 27: 16: 15: 9: 6: 4: 3: 2: 2223: 2212: 2209: 2207: 2204: 2202: 2199: 2197: 2194: 2193: 2191: 2181: 2176: 2171: 2170: 2167: 2162: 2155: 2150: 2149: 2130: 2126: 2122: 2116: 2109: 2104: 2101: 2096: 2091: 2087: 2083: 2082: 2074: 2059: 2055: 2049: 2047: 2030: 2026: 2022: 2016: 2002:on 2009-03-05 2001: 1997: 1991: 1975: 1971: 1967: 1961: 1945: 1941: 1935: 1921:on 2018-01-10 1920: 1916: 1910: 1895: 1891: 1887: 1880: 1864: 1860: 1859: 1854: 1848: 1832: 1828: 1822: 1807: 1803: 1797: 1782: 1778: 1772: 1757: 1753: 1747: 1732: 1728: 1726:9780470857472 1722: 1718: 1717: 1709: 1690: 1686: 1682: 1678: 1671: 1664: 1648: 1644: 1640: 1634: 1619: 1615: 1609: 1593: 1589: 1588: 1583: 1576: 1561: 1557: 1553: 1547: 1531: 1527: 1523: 1517: 1501: 1497: 1493: 1487: 1471: 1467: 1463: 1457: 1455: 1438: 1434: 1428: 1412: 1408: 1404: 1398: 1396: 1379: 1375: 1371: 1365: 1349: 1345: 1341: 1335: 1319: 1315: 1311: 1310:"Baiduspider" 1305: 1289: 1285: 1281: 1275: 1259: 1255: 1254:About.ask.com 1251: 1245: 1230: 1226: 1220: 1214: 1209: 1206: 1201: 1196: 1192: 1188: 1187: 1179: 1164: 1160: 1156: 1150: 1135: 1131: 1127: 1120: 1105: 1104: 1099: 1092: 1090: 1088: 1086: 1084: 1067: 1063: 1059: 1053: 1043: 1038: 1034: 1030: 1023: 1008: 1004: 998: 987:September 25, 982: 978: 971: 964: 949: 945: 941: 935: 931: 921: 918: 916: 913: 911: 910:Web archiving 908: 906: 903: 901: 898: 895: 894:Meta elements 892: 890: 887: 885: 882: 879: 876: 873: 870: 868: 865: 863: 860: 858: 855: 852: 849: 846: 843: 839: 835: 831: 827: 826: 819: 810: 760: 758: 754: 749: 739: 733: 720: 717: 707: 690: 688: 684: 664: 661: 659: 651: 645: 639: 633: 627: 621: 618: 609: 593: 591: 587: 583: 574: 572: 568: 560: 555: 553: 549: 539: 537: 533: 529: 519: 513: 512: 506: 504: 500: 499: 494: 490: 486: 482: 481:generative AI 472: 470: 469: 464: 460: 457:. Co-founder 456: 452: 442: 440: 430: 422: 407:a.example.com 399:a.example.com 392: 387: 383: 380: 378: 360: 358: 353: 351: 347: 343: 339: 337: 332: 328: 323: 321: 317: 313: 309: 305: 295: 293: 288: 286: 282: 278: 274: 269: 267: 263: 259: 255: 251: 247: 238: 232: 226: 222: 215: 212: 211: 209: 205: 201: 197: 193: 189: 181: 176: 171: 165: 161: 149: 145: 141: 128: 124: 120: 116: 112: 108: 97: 88: 84: 79: 72: 71: 68: 63: 54: 39: 35: 30: 23: 2206:Web scraping 2160: 2133:. 
Retrieved 2124: 2115: 2107: 2080: 2073: 2062:. Retrieved 2035:February 15, 2033:. Retrieved 2024: 2015: 2004:. Retrieved 2000:the original 1990: 1978:. Retrieved 1969: 1960: 1948:. Retrieved 1934: 1923:. Retrieved 1919:the original 1909: 1898:. Retrieved 1889: 1879: 1867:. Retrieved 1856: 1847: 1835:. Retrieved 1821: 1810:. Retrieved 1796: 1785:. Retrieved 1771: 1760:. Retrieved 1746: 1735:. Retrieved 1715: 1708: 1696:. Retrieved 1676: 1663: 1651:. Retrieved 1643:The Register 1642: 1633: 1622:. Retrieved 1608: 1596:. Retrieved 1585: 1575: 1564:. Retrieved 1555: 1546: 1534:. Retrieved 1516: 1504:. Retrieved 1495: 1486: 1474:. Retrieved 1465: 1441:. Retrieved 1427: 1415:. Retrieved 1406: 1382:. Retrieved 1373: 1364: 1352:. Retrieved 1343: 1334: 1322:. Retrieved 1313: 1304: 1292:. Retrieved 1288:the original 1283: 1274: 1262:. Retrieved 1253: 1244: 1233:. Retrieved 1219: 1212: 1185: 1178: 1167:. Retrieved 1158: 1149: 1138:. Retrieved 1129: 1119: 1107:. Retrieved 1101: 1070:. Retrieved 1061: 1052: 1037:the original 1032: 1022: 1011:. Retrieved 997: 985:. Retrieved 976: 973:(PostScript) 963: 952:. Retrieved 943: 940:"Historical" 934: 838:security.txt 816: 807: 745: 731: 729: 715: 712:in the form 708:in the same 699: 679: 662: 655: 649: 643: 637: 631: 625: 619: 616: 599: 580:instructing 575: 570: 556: 545: 542:Alternatives 525: 509: 507: 496: 478: 466: 451:Archive Team 448: 436: 428: 388: 384: 381: 366: 354: 335: 324: 311: 301: 289: 270: 262:web crawlers 253: 245: 244: 164: 147: 22:old revision 19: 18: 1536:18 February 1522:Jason Scott 1476:16 February 1443:16 February 1417:16 February 1354:16 February 1324:16 February 1294:16 February 1264:16 February 915:Web crawler 905:Spider trap 738:statement. 590:Sergey Brin 459:Jason Scott 403:example.com 395:example.com 148:4 July 2024 20:This is an 2211:Text files 2190:Categories 2135:2022-10-17 2064:2013-08-17 2006:2009-03-23 1980:22 October 1950:9 February 1925:2018-05-25 1900:2019-10-03 1869:October 3, 1837:October 3, 1812:2013-12-29 1787:2013-12-29 1762:2013-12-29 1737:2015-08-12 1698:August 12, 1653:August 12, 1624:2015-08-10 1566:2018-12-01 1506:10 October 1235:2013-12-29 1169:2019-07-10 1140:2015-11-19 1013:2013-12-29 979:. Geneva. 954:2017-03-03 927:References 757:httpd.conf 710:robots.txt 667:directory 586:Larry Page 563:humans.txt 548:user-agent 425:Compliance 369:robots.txt 342:WebCrawler 266:web robots 264:and other 246:robots.txt 168:robots.txt 1314:Baidu.com 1103:The Verge 1042:Hypermail 753:.htaccess 736:Disallow: 714:Sitemap: 687:Googlebot 561:, host a 528:web robot 511:The Verge 350:AltaVista 229:robotstxt 2201:Websites 2180:Internet 2129:Archived 2058:Archived 2029:Archived 1974:Archived 1944:Archived 1894:Archived 1863:Archived 1831:Archived 1806:Archived 1781:Archived 1756:Archived 1731:Archived 1689:Archived 1647:Archived 1618:Archived 1592:Archived 1560:Archived 1530:Archived 1500:Archived 1470:Archived 1437:Archived 1411:Archived 1384:25 April 1378:Archived 1348:Archived 1318:Archived 1258:Archived 1229:Archived 1163:Archived 1134:Archived 1109:16 March 1072:19 April 1066:Archived 1007:Archived 981:Archived 948:Archived 900:Sitemaps 889:Perma.cc 884:Nofollow 880:(NDIIPP) 822:See also 716:full-url 706:Sitemaps 606:Disallow 596:Examples 522:Security 455:sitemaps 363:Standard 338:standard 336:de facto 312:www-talk 292:sitemaps 258:websites 250:filename 237:RFC 9309 144:accepted 87:contribs 77:Pcrooker 38:contribs 28:Pcrooker 920:noindex 851:BotSeer 830:ads.txt 784:content 759:files. 
702:Sitemap 696:Sitemap 439:engines 377:website 298:History 248:is the 224:Website 207:Authors 185:folder. 2166:Portal 1858:GitHub 1723:  874:(NDLP) 658:Google 573:page. 567:GitHub 559:Google 503:Medium 489:Google 485:OpenAI 391:origin 348:, and 331:server 281:server 191:Status 1692:(PDF) 1673:(PDF) 1598:8 May 793:/> 571:About 515:' 346:Lycos 308:Nexor 2103:9309 2086:IETF 2037:2020 1982:2018 1952:2016 1871:2019 1839:2019 1721:ISBN 1700:2015 1655:2015 1600:2017 1538:2017 1508:2022 1478:2013 1445:2013 1419:2013 1386:2017 1356:2013 1326:2013 1296:2013 1266:2013 1208:9309 1191:IETF 1111:2024 1074:2014 989:2013 775:name 772:meta 769:< 755:and 730:The 588:and 495:and 231:.org 127:diff 121:) | 119:diff 107:diff 83:talk 34:talk 2100:RFC 2090:doi 1681:doi 1205:RFC 1195:doi 493:BBC 417:or 146:on 138:An 43:at 2192:: 2127:. 2123:. 2098:. 2088:. 2084:. 2056:. 2045:^ 2027:. 2023:. 1972:. 1968:. 1892:. 1888:. 1861:. 1855:. 1829:. 1729:. 1687:. 1679:. 1675:. 1645:. 1641:. 1616:. 1590:. 1584:. 1554:. 1524:. 1494:. 1468:. 1464:. 1453:^ 1435:. 1409:. 1405:. 1394:^ 1376:. 1372:. 1346:. 1342:. 1316:. 1312:. 1282:. 1256:. 1252:. 1203:. 1193:. 1189:. 1161:. 1157:. 1132:. 1128:. 1100:. 1082:^ 1060:. 1031:. 975:. 946:. 942:. 719:: 592:. 421:. 352:. 344:, 287:. 234:, 113:| 109:) 85:| 36:| 2168:: 2138:. 2105:. 2092:: 2067:. 2039:. 2009:. 1984:. 1954:. 1928:. 1903:. 1873:. 1841:. 1815:. 1790:. 1765:. 1740:. 1702:. 1683:: 1657:. 1627:. 1602:. 1569:. 1540:. 1510:. 1480:. 1447:. 1421:. 1388:. 1358:. 1328:. 1298:. 1268:. 1238:. 1210:. 1197:: 1172:. 1143:. 1113:. 1076:. 1040:( 1016:. 991:. 957:. 787:= 778:= 602:* 162:. 129:) 125:( 117:( 105:( 100:) 92:( 89:) 81:( 70:. 57:) 49:( 40:) 32:(
