305:|url=https://www.wired.com/2000/07/ebay-fights-spiders-on-the-web/ |access-date=2024-08-02 |work=] |language=en-US |issn=1059-1028}}</ref> where eBay attempted to block a bot, and the company operating the crawler was ordered to stop crawling eBay's servers using any automatic means, by ] the basis of ].<ref name="case">{{cite court|litigants=eBay v. Bidder's Edge|vol=100|reporter=F. Supp. 2d|opinion=1058|pinpoint=|court=]|date=2000|quote=|url=http://www.cand.uscourts.gov/cand/tentrule.nsf/3979517dd11390ce8825690a007c1b9e/d0fc1406324de0cd882568e90081ebf4/$ FILE/Ebay.pdf|archive-url=https://web.archive.org/web/20000817173849/http://www.cand.uscourts.gov/cand/tentrule.nsf/3979517dd11390ce8825690a007c1b9e/d0fc1406324de0cd882568e90081ebf4/$ FILE/Ebay.pdf|url-status=dead|accessdate=2000-08-17}}</ref><ref>{{Cite web |last=Hoffmann |first=Jay |date=2020-09-15 |title=Chapter 4: Search |url=https://thehistoryoftheweb.com/book/search/ |access-date=2024-08-02 |website=The History of the Web |language=en-US}}</ref><ref name=":1" />-->
''[[404 Media]]'' reported that companies like [[Anthropic]] and [[Perplexity.ai]] circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular [[blocklist]]s.<ref>{{Cite web |last=Koebler |first=Jason |date=2024-07-29 |title=Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones) |url=https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/ |access-date=2024-07-29 |website=404 Media}}</ref>
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for [[generative artificial intelligence]]. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked [[OpenAI]]'s GPTBot in their robots.txt file and 85 blocked [[Google]]'s Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the [[BBC]] and ''[[The New York Times]]''. In 2023, blog host [[Medium (website)|Medium]] announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".<ref name="Verge"/>

GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but ''[[The Verge]]''{{'}}s David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.<ref name="Verge"/>
The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the [[Googlebot]]'s subsequent visits.
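These differing interpretations can be exercised with Python's standard-library <code>urllib.robotparser</code>, which parses the nonstandard Crawl-delay field; the robots.txt body below is a hypothetical example, not any particular site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, parsed locally instead of fetched over HTTP.
rules = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# crawl_delay() returns the value for a matching group, or None when no group matches.
print(parser.crawl_delay("bingbot"))   # 10
print(parser.crawl_delay("otherbot"))  # None
```

How the returned delay is applied (seconds between requests, or a once-per-window budget) is still up to the crawler, as described above.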
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of [[security through obscurity]] is discouraged by standards bodies. The [[National Institute of Standards and Technology]] (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds, and are not allowed to crawl the "secret" directory.
The robots.txt protocol is widely complied with by bot operators.<ref name="Verge"/> <!--It entered the court as part of '']'',<ref name=":1">{{Cite news |last= |first= |date=2000-07-31 |title=EBay Fights Spiders on the Web
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as a 500 kibibyte file size restriction for robots.txt files.
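A crawler that applies this limit can simply truncate an oversized file before parsing; a minimal sketch, in which the constant and function name are illustrative:

```python
# RFC 9309 requires parsers to handle at least 500 kibibytes (512,000 bytes);
# bytes beyond that limit may be ignored, so a crawler can safely truncate there.
MAX_ROBOTS_BYTES = 500 * 1024  # 512,000

def clamp_robots_body(body: bytes) -> bytes:
    return body[:MAX_ROBOTS_BYTES]

print(len(clamp_robots_body(b"a" * 600_000)))  # 512000
```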
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as [[Google]], support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
A robots.txt file has no enforcement mechanism in law or in technical protocol, despite widespread compliance by bot operators.<ref name="Verge"/>
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
* <code>[[security.txt]]</code>, a file to describe the process for security researchers to follow in order to report security vulnerabilities
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using [[.htaccess]] and httpd.conf files.
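Both mechanisms can be checked from crawler-side code; the sketch below uses only the Python standard library, and the helper names are illustrative rather than taken from any real crawler:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the content values of <meta name="robots" ...> tags.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

def noindexed(html: str, headers: dict) -> bool:
    # "noindex" may arrive in the X-Robots-Tag header (works for any file type)
    # or in a robots meta tag (HTML only); honor either.
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

print(noindexed('<meta name="robots" content="noindex" />', {}))  # True
print(noindexed("<p>hello</p>", {"X-Robots-Tag": "noindex"}))     # True
print(noindexed("<p>hello</p>", {}))                              # False
```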
[[Charles Stross]] claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a [[denial-of-service attack]] on Koster's server.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the [[Internet Engineering Task Force]]. A proposed standard was published in September 2022 as RFC 9309.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
When a site owner wishes to give instructions to web robots they place a text file called <code>robots.txt</code> in the root of the web site hierarchy (e.g. <code>https://www.example.com/robots.txt</code>). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
This example tells all robots that they can visit all files because the wildcard <code>*</code> stands for all robots and the <code>Disallow</code> directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:

User-agent: *
Allow: /
The same result can be accomplished with an empty or missing robots.txt file.
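The equivalence of these allow-all forms can be verified with Python's standard-library <code>urllib.robotparser</code>; the helper below parses a robots.txt body locally, and its name is illustrative:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    # Parse a robots.txt body and test whether `agent` may fetch `url`.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# An empty Disallow (or an empty file) blocks nothing ...
print(allowed("User-agent: *\nDisallow:", "SomeBot", "http://example.com/page"))    # True
print(allowed("", "SomeBot", "http://example.com/page"))                            # True
# ... whereas "Disallow: /" excludes the entire site.
print(allowed("User-agent: *\nDisallow: /", "SomeBot", "http://example.com/page"))  # False
```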
This example tells two specific robots not to enter one specific directory:
Some sites, such as Google, host a <code>humans.txt</code> file that displays information meant for humans to read. Some sites, such as [[GitHub]], redirect their humans.txt to an About page.
* <code>[[ads.txt]]</code>, a standard for listing authorized ad sellers
This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html
All other files in the specified directory will be processed.
This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
This example tells all robots to stay out of a website:

User-agent: *
Disallow: /
The "robots.txt" file can be used in conjunction with [[sitemaps]], another robot inclusion standard for websites.
The standard, initially ''RobotsNotWanted.txt'', allowed [[web developer]]s to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a [[de facto standard]]; most complied, including those operated by search engines such as [[WebCrawler]], [[Lycos]], and [[AltaVista]].
* [[BotSeer]] – Now inactive search engine for robots.txt files
Example demonstrating how comments can be used:
Some web archiving projects ignore robots.txt. [[Archive Team]] uses the file to discover more links, such as [[sitemap]]s. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the [[Internet Archive]] announced that it would stop complying with robots.txt directives. According to ''[[Digital Trends]]'', this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
User-agent: bingbot
Allow: /
Crawl-delay: 10
Previously, Google had a joke file hosted at <code>/killer-robots.txt</code> instructing [[the Terminator]] not to kill the company founders [[Larry Page]] and [[Sergey Brin]].
Example demonstrating multiple user-agents:
* [[Automated Content Access Protocol]] – A failed proposal to extend robots.txt
Sitemap: http://www.example.com/sitemap.xml
The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with [[security through obscurity]]. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s, many websites began denying bots that collect information for [[generative artificial intelligence]].
The Robot Exclusion Standard does not mention the "*" character in the <code>Disallow:</code> statement.
For Knowledge's robots.txt file, see https://en.wikipedia.org/robots.txt.
Many robots also pass a special [[user agent]] to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
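The per-origin rule means the governing robots.txt URL can be derived from any page URL by keeping only the scheme, host, and port; a short sketch in which the function name is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt is per origin: the scheme, host, and port together
    # determine which /robots.txt file governs a page.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com:8080/a/b?q=1"))  # http://example.com:8080/robots.txt
print(robots_url("https://a.example.com/page"))       # https://a.example.com/robots.txt
```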
The standard was proposed by [[Martijn Koster]], when working for [[Nexor]], in February 1994 on the ''www-talk'' mailing list, the main communication channel for WWW-related activities at the time.
'''robots.txt''' is the filename used for implementing the '''Robots Exclusion Protocol''', a standard used by websites to indicate to visiting [[web crawler]]s and other [[web robot]]s which portions of the website they are allowed to visit.
Some crawlers support a <code>Sitemap</code> directive, allowing multiple [[Sitemaps]] in the same robots.txt in the form <code>Sitemap: ''full-url''</code>:

Sitemap: http://www.example.com/sitemap.xml
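Because the Sitemap field sits outside the user-agent groups, it can be collected with a simple line scan; a sketch in which the function name is illustrative:

```python
def sitemaps(robots_txt: str) -> list:
    # The Sitemap field is independent of user-agent groups, so scan every line.
    urls = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

body = "User-agent: *\nDisallow: /private/\nSitemap: http://www.example.com/sitemap.xml"
print(sitemaps(body))  # ['http://www.example.com/sitemap.xml']
```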
X-Robots-Tag: noindex
===Search engines===
2653:Web scraping
2606:. Retrieved
2591:
2567:. Retrieved
2558:
2549:
2541:
2514:
2507:
2496:. Retrieved
2469:February 15,
2467:. Retrieved
2458:
2449:
2438:. Retrieved
2434:the original
2424:
2412:. Retrieved
2403:
2394:
2382:. Retrieved
2368:
2357:. Retrieved
2353:the original
2343:
2332:. Retrieved
2323:
2313:
2301:. Retrieved
2290:
2281:
2269:. Retrieved
2255:
2244:. Retrieved
2230:
2219:. Retrieved
2205:
2194:. Retrieved
2180:
2169:. Retrieved
2149:
2142:
2130:. Retrieved
2110:
2097:
2085:. Retrieved
2077:The Register
2076:
2067:
2056:. Retrieved
2042:
2031:. Retrieved
2027:
2017:
2005:. Retrieved
1994:
1984:
1973:. Retrieved
1964:
1955:
1943:. Retrieved
1925:
1913:. Retrieved
1904:
1895:
1883:. Retrieved
1874:
1850:. Retrieved
1836:
1824:. Retrieved
1815:
1791:. Retrieved
1782:
1773:
1761:. Retrieved
1752:
1743:
1731:. Retrieved
1722:
1713:
1701:. Retrieved
1697:the original
1692:
1683:
1671:. Retrieved
1662:
1653:
1642:. Retrieved
1628:
1621:
1594:
1587:
1576:. Retrieved
1567:
1558:
1547:. Retrieved
1538:
1528:
1516:. Retrieved
1510:
1479:. Retrieved
1470:
1461:
1446:the original
1441:
1431:
1420:. Retrieved
1406:
1394:. Retrieved
1385:
1382:(PostScript)
1372:
1361:. Retrieved
1352:
1349:"Historical"
1343:
1242:security.txt
1205:
1196:
1134:
1120:
1118:
1104:
1101:in the form
1097:in the same
1088:
1068:
1051:
1044:
1038:
1032:
1026:
1020:
1014:
1008:
1005:
988:
969:instructing
964:
959:
945:
934:
931:Alternatives
914:
890:
880:
878:
867:
849:
837:
822:Archive Team
819:
807:
799:
759:
755:
752:
737:
725:
706:
695:
682:
672:
660:
641:
633:web crawlers
624:
616:
615:
535:
432:==Security==
425:==Security==
1945:18 February
1931:Jason Scott
1885:16 February
1852:16 February
1826:16 February
1763:16 February
1733:16 February
1703:16 February
1673:16 February
1329:Web crawler
1319:Spider trap
1127:statement.
979:Sergey Brin
830:Jason Scott
774:example.com
766:example.com
206:Rollbackers
113:DocWatson42
2658:Text files
2637:Categories
2569:2022-10-17
2498:2013-08-17
2440:2009-03-23
2414:22 October
2384:9 February
2359:2018-05-25
2334:2019-10-03
2303:October 3,
2271:October 3,
2246:2013-12-29
2221:2013-12-29
2196:2013-12-29
2171:2015-08-12
2132:August 12,
2087:August 12,
2058:2015-08-10
2033:2024-07-29
1975:2018-12-01
1915:10 October
1644:2013-12-29
1578:2019-07-10
1549:2015-11-19
1422:2013-12-29
1388:. Geneva.
1363:2017-03-03
1336:References
1146:httpd.conf
1099:robots.txt
1056:directory
975:Larry Page
952:humans.txt
937:user-agent
905:blocklists
796:Compliance
740:robots.txt
713:WebCrawler
637:web robots
635:and other
617:robots.txt
539:robots.txt
2028:404 Media
1723:Baidu.com
1512:The Verge
1451:Hypermail
1142:.htaccess
1125:Disallow:
1103:Sitemap:
1076:Googlebot
950:, host a
917:web robot
897:Anthropic
892:404 Media
882:The Verge
721:AltaVista
600:robotstxt
441:Line 202:
438:Line 202:
232:(RW 16.1)
221:Seo168168
2648:Websites
2602:Archived
2563:Archived
2492:Archived
2463:Archived
2408:Archived
2378:Archived
2328:Archived
2297:Archived
2265:Archived
2240:Archived
2215:Archived
2190:Archived
2165:Archived
2123:Archived
2081:Archived
2052:Archived
2001:Archived
1969:Archived
1939:Archived
1909:Archived
1879:Archived
1846:Archived
1820:Archived
1793:25 April
1787:Archived
1757:Archived
1727:Archived
1667:Archived
1638:Archived
1572:Archived
1543:Archived
1518:16 March
1481:19 April
1475:Archived
1416:Archived
1390:Archived
1357:Archived
1314:Sitemaps
1309:Perma.cc
1299:nofollow
1295:(NDIIPP)
1210:See also
1105:full-url
1095:Sitemaps
995:Disallow
985:Examples
911:Security
826:sitemaps
734:Standard
709:standard
707:de facto
683:www-talk
663:sitemaps
629:websites
621:filename
608:RFC 9309
353:Line 73:
350:Line 73:
266:Line 62:
263:Line 62:
195:contribs
123:contribs
67:Wikitext
1304:noindex
1260:BotSeer
1234:ads.txt
1173:content
1148:files.
1091:Sitemap
1085:Sitemap
810:engines
748:website
669:History
619:is the
595:Website
578:Authors
556:folder.
139:212,208
2608:6 July
2292:GitHub
2157:
1289:(NDLP)
1047:Google
962:page.
956:GitHub
948:Google
874:Medium
860:Google
856:OpenAI
762:origin
719:, and
702:server
652:server
562:Status
78:Inline
60:Visual
2126:(PDF)
2107:(PDF)
2007:8 May
1182:/>
960:About
886:'
717:Lycos
679:Nexor
213:edits
211:3,805
141:edits
2610:2024
2537:9309
2520:IETF
2471:2020
2416:2018
2386:2016
2305:2019
2273:2019
2155:ISBN
2134:2015
2089:2015
2009:2017
1947:2017
1917:2022
1887:2013
1854:2013
1828:2013
1795:2017
1765:2013
1735:2013
1705:2013
1675:2013
1617:9309
1600:IETF
1520:2024
1483:2014
1398:2013
1164:name
1161:meta
1158:<
1144:and
1119:The
977:and
899:and
866:and
602:.org
404:>
249:Undo
239:Tags
225:talk
191:talk
177:undo
172:edit
119:talk
105:edit
2598:NPR
2534:RFC
2524:doi
2115:doi
1614:RFC
1604:doi
864:BBC
788:or
483:* ]
2639::
2600:.
2596:.
2590:.
2561:.
2557:.
2532:.
2522:.
2518:.
2490:.
2479:^
2461:.
2457:.
2406:.
2402:.
2326:.
2322:.
2295:.
2289:.
2263:.
2163:.
2121:.
2113:.
2109:.
2079:.
2075:.
2050:.
2026:.
1999:.
1993:.
1963:.
1933:.
1903:.
1877:.
1873:.
1862:^
1844:.
1818:.
1814:.
1803:^
1785:.
1781:.
1755:.
1751:.
1725:.
1721:.
1691:.
1665:.
1661:.
1612:.
1602:.
1598:.
1570:.
1566:.
1541:.
1537:.
1509:.
1491:^
1469:.
1440:.
1384:.
1355:.
1351:.
1108::
981:.
907:.
792:.
723:.
715:,
658:.
605:,
244:RW
241::
204:,
193:|
152::
132:,
121:|
2612:.
2572:.
2539:.
2526::
2501:.
2473:.
2443:.
2418:.
2388:.
2362:.
2337:.
2307:.
2275:.
2249:.
2224:.
2199:.
2174:.
2136:.
2117::
2091:.
2061:.
2036:.
2011:.
1978:.
1949:.
1919:.
1889:.
1856:.
1830:.
1797:.
1767:.
1737:.
1707:.
1677:.
1647:.
1619:.
1606::
1581:.
1552:.
1522:.
1485:.
1449:(
1425:.
1400:.
1366:.
1176:=
1167:=
991:*
533:.
223:(
197:)
189:(
125:)
117:(