robots.txt

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit. The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload; in the 2020s, many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
History

The standard was proposed by Martijn Koster, when working for Nexor in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard, authored by Koster together with Gary Illyes, Henner Zeller, and Lizzi Sassman, was published in September 2022 as RFC 9309.
Standard

When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
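For illustration, the following sketch shows how a compliant crawler might consult the file before fetching anything else, using Python's standard-library urllib.robotparser module; the site URL and the crawler name are hypothetical and are not taken from this article or the standard.

from urllib import robotparser

# Hypothetical crawler and site, for illustration only.
robots_url = "https://www.example.com/robots.txt"

parser = robotparser.RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetch and parse the robots.txt file

# A well-behaved robot checks permission before requesting any other URL.
page = "https://www.example.com/private/page.html"
if parser.can_fetch("ExampleBot", page):
    print("allowed:", page)
else:
    print("disallowed by robots.txt:", page)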
Compliance

The robots.txt protocol is widely complied with by bot operators.

Search engines

Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.

Archival sites

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.

Artificial intelligence

Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options. 404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular blocklists.
Security

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Alternatives

Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
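As a hedged sketch of that approach (not part of the robots.txt standard), an Apache httpd server with mod_rewrite enabled could refuse requests whose User-Agent header matches a chosen string; the bot name here is a placeholder borrowed from the examples later in this article.

# Illustrative Apache configuration: return 403 Forbidden to a named bot.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
RewriteRule .* - [F,L]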
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites such as GitHub redirect humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
Examples

Example of a simple robots.txt file: a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and other user-agents cannot crawl more than one page every 20 seconds and are not allowed to crawl the "secret" folder.
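A minimal sketch consistent with that description is shown below; the exact file from the original illustration is not reproduced here, and Crawl-delay is a nonstandard extension covered later in this article.

User-agent: Mallorybot   # this bot may not crawl anything
Disallow: /

User-agent: *            # all other bots
Crawl-delay: 20          # at most one page every 20 seconds
Disallow: /secret/       # and keep out of the "secret" folder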
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed.

User-agent: *
Disallow:

The same result can be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells one specific robot to stay out of the website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
Nonstandard extensions

Crawl-delay directive

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10

Sitemap

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml

Universal "*" match

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.
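Because these extensions are nonstandard, each crawler decides whether and how to honor them. As a hedged sketch, Python's urllib.robotparser exposes whatever Crawl-delay and Sitemap values it finds; the host below is hypothetical, and the sleep mirrors one possible interpretation of the delay value.

from urllib import robotparser
import time

parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # hypothetical host
parser.read()

# Crawl-delay for this user-agent, or None if the file does not set one.
delay = parser.crawl_delay("bingbot") or 0

# List of Sitemap URLs declared in the file, or None if there are none.
sitemaps = parser.site_maps() or []
print("sitemaps:", sitemaps)

for url in ("https://www.example.com/", "https://www.example.com/private/"):
    if parser.can_fetch("bingbot", url):
        print("fetching", url)
        time.sleep(delay)  # wait between visits, as Yandex interprets the value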
388: 164:
Meta tags and headers

In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files (see the sketch below).

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex
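As a hedged illustration of the .htaccess approach mentioned above (it assumes Apache's mod_headers module is enabled; the PDF pattern is an arbitrary example, not one prescribed by the article):

# Illustrative .htaccess snippet: attach an X-Robots-Tag header to PDF files.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>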
The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.

Maximum size of a robots.txt file

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500 kibibyte file size restriction for robots.txt files.

See also

ads.txt, a standard for listing authorized ad sellers
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
eBay v. Bidder's Edge
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Library Program (NDLP)
National Digital Information Infrastructure and Preservation Program (NDIIPP)
nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.