
Robots.txt: Difference between revisions


The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters to control the Googlebot's subsequent visits.
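For example, a ten-second crawl-delay for Bing's crawler (BingBot):
<pre>
User-agent: bingbot
Allow: /
Crawl-delay: 10
</pre>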
For some years, the Internet Archive did not crawl sites with robots.txt, but in April 2017 it announced that it would no longer honour directives in robots.txt files: "Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes." This was in response to entire domains being tagged with robots.txt when the content became obsolete.
Example demonstrating multiple user-agents:
<pre>
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
</pre>
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
The volunteering group Archive Team explicitly ignores robots.txt directives, using it instead for discovering more links, such as sitemaps. The group views it as an obsolete standard that hinders web archival efforts. According to project leader Jason Scott, "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."

Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Some crawlers (Yandex) support a <code>Host</code> directive, allowing websites with multiple mirrors to specify their preferred domain:<ref>{{cite web |url=http://help.yandex.com/webmaster/?id=1113851 |title=Yandex - Using robots.txt |access-date=2013-05-13 |archive-url=https://web.archive.org/web/20130509230548/http://help.yandex.com/webmaster/?id=1113851 |archive-date=2013-05-09 |url-status=live }}</ref>
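For example:
<pre>Host: hosting.example.com</pre>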
The nonstandard 'Host' directive/keyword is no longer used by any well-known search engine, including the previously mentioned Yandex. As it also saw little use in practice, it no longer makes sense to mention it here.
Some crawlers support a <code>Sitemap</code> directive, allowing multiple Sitemaps in the same <samp>robots.txt</samp> in the form <code>Sitemap: ''full-url''</code>:<ref>{{cite web |url=http://ysearchblog.com/2007/04/11/webmasters-can-now-auto-discover-with-sitemaps/ |title=Yahoo! Search Blog - Webmasters can now auto-discover with Sitemaps |access-date=2009-03-23 |archive-url=https://web.archive.org/web/20090305061841/http://ysearchblog.com/2007/04/11/webmasters-can-now-auto-discover-with-sitemaps/ |archive-date=2009-03-05 |url-status=dead }}</ref>
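<pre>Sitemap: http://www.example.com/sitemap.xml</pre>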
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.
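A "noindex" meta tag:
<pre><meta name="robots" content="noindex" /></pre>
A "noindex" HTTP response header:
<pre>X-Robots-Tag: noindex</pre>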
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (KiB) of a robots.txt file; accordingly, Google maintains a 500-kibibyte file size restriction for robots.txt files.
Some major search engines following this standard include Ask, AOL, Baidu, DuckDuckGo, Google, Yahoo!, and Yandex. Bing is still not fully compatible with the standard, as it cannot inherit settings from the wildcard character (<code>*</code>).
This relies on voluntary compliance. Not all robots comply with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may very well start with the portions of the website they have been asked (by the Robots Exclusion Protocol) to stay out of.
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating how comments can be used:
<pre>
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: *   # match all bots
Disallow: /     # keep them out
</pre>
Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
It quickly became a de facto standard that present and future web crawlers were expected to follow; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.
Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly-behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
This example tells two specific robots not to enter one specific directory:
<pre>
User-agent: BadBot   # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
</pre>
When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard was published in September 2022 as RFC 9309.
This example tells one specific bot to stay out of the website:
<pre>
User-agent: BadBot   # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
</pre>
The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time.
This example tells all robots that they can visit all files, because the wildcard <code>*</code> stands for all robots and the <code>Disallow</code> directive has no value, meaning no pages are disallowed:
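<pre>
User-agent: *
Disallow:
</pre>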
The same result can be accomplished with an empty or missing robots.txt file.
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page. Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
This example tells all robots to stay away from one specific file:
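<pre>
User-agent: *
Disallow: /directory/file.html
</pre>
All other files in the specified directory will be processed.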
This example tells all robots not to enter three directories:
<pre>
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
</pre>
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
This example tells all robots to stay out of a website:
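<pre>
User-agent: *
Disallow: /
</pre>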
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.


Revision as of 19:01, 15 January 2024 by 777burger (talk | contribs). Tags: Twinkle, Undo
Revision as of 20:28, 21 January 2024 by Jortvl (talk | contribs). Tag: Visual edit: Switched
