Robots.txt

For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s, many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
History

The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.

The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard, credited to Martijn Koster, Gary Illyes, Henner Zeller, and Lizzi Sassman, was published in September 2022 as RFC 9309.
Standard

Image caption: Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" folder.

When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
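The fetch-then-check sequence described above can be exercised with Python's standard-library urllib.robotparser. The following is a minimal sketch, assuming a placeholder site and a hypothetical crawler name, "ExampleBot":

# Fetch and parse robots.txt before requesting any other file,
# then ask whether a given page may be crawled.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder URL
rp.read()  # a missing (404) robots.txt is treated as "no limitations"

print(rp.can_fetch("ExampleBot", "https://www.example.com/private/page.html"))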
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.

Compliance

A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.

Search engines

Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.

Archival sites

Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.

Artificial intelligence

Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.

Security

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.

Alternatives

Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
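As a sketch of that server-side approach (not any particular server's configuration), Python's standard-library http.server can refuse requests whose User-Agent matches a blocked robot; "BadBot" is a placeholder name:

# Minimal, non-production sketch: return failure (HTTP 403) to a
# hypothetical robot identified by its User-Agent request header.
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_AGENTS = ("BadBot",)  # placeholder user-agent substrings to refuse

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bad in agent for bad in BLOCKED_AGENTS):
            self.send_error(403)  # automatically return failure to the robot
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human visitor.\n")

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()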
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page.

Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.

Examples

This example tells all robots that they can visit all files, because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

User-agent: *
Allow: /

The same result can be accomplished with an empty or missing robots.txt file.

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells one specific robot to stay out of a website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory

Nonstandard extensions

Crawl-delay directive

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10
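On the crawler side, the directive can be read with Python's standard-library urllib.robotparser, which also parses nonstandard Crawl-delay lines (Python 3.6+). A minimal sketch, with placeholder URL and user-agent:

# Honor a site's Crawl-delay, if one is declared for this user-agent.
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("ExampleBot")  # None when no Crawl-delay applies
if delay is not None:
    time.sleep(delay)  # e.g. seconds between visits, as Yandex interprets it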
698: 506:. A proposed standard was published in September 2022 as RFC 9309. 438: 396: 2225:
Sitemap

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml
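The same standard-library parser can report Sitemap directives as well; a minimal sketch (site_maps() is available from Python 3.8, and the URL is a placeholder):

# List any Sitemap URLs declared in robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

print(rp.site_maps())  # e.g. ['http://www.example.com/sitemap.xml'], or None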
Universal "*" match

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.

Meta tags and headers

In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex

The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
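Because the header travels with the HTTP response, a client only sees it after requesting the resource. A minimal sketch of inspecting it with Python's standard-library urllib.request (the URL is a placeholder):

# Check whether a fetched resource carries an X-Robots-Tag header.
from urllib.request import urlopen

with urlopen("https://www.example.com/report.pdf") as response:
    # None means the server sent no X-Robots-Tag for this resource.
    print(response.headers.get("X-Robots-Tag"))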
639: 415:
Maximum size of a robots.txt file

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500 kibibyte file size restriction for robots.txt files.
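A crawler can apply this limit by reading at most 512,000 bytes before parsing. A minimal sketch using the Python standard library (URL and user-agent are placeholders):

# Parse no more than the 500 KiB that RFC 9309 obliges crawlers to accept.
from urllib import robotparser
from urllib.request import urlopen

MAX_ROBOTS_BYTES = 512_000  # 500 KiB

with urlopen("https://www.example.com/robots.txt") as response:
    raw = response.read(MAX_ROBOTS_BYTES)  # content past the limit may be ignored

rp = robotparser.RobotFileParser()
rp.parse(raw.decode("utf-8", errors="replace").splitlines())
print(rp.can_fetch("ExampleBot", "https://www.example.com/page.html"))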
See also

ads.txt, a standard for listing authorized ad sellers
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol, a failed proposal to extend robots.txt
BotSeer, a now-inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Information Infrastructure and Preservation Program (NDIIPP)
National Digital Library Program (NDLP)
Nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler

External links

Official website: https://www.robotstxt.org