Bag-of-words model

Text represented as an unordered collection of words

For the bag-of-words model in computer vision, see Bag-of-words model in computer vision.

The bag-of-words model (BoW) is a model of text which uses a representation of text that is based on an unordered collection (a "bag") of words. It is used in natural language processing and information retrieval (IR). It disregards word order (and thus most of syntax or grammar) but captures multiplicity.

The bag-of-words model is commonly used in methods of document classification, where, for example, the (frequency of) occurrence of each word is used as a feature for training a classifier. It has also been used for computer vision.

An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on Distributional Structure:

And this stock of combinations of elements becomes a factor in the way later choices are made ... for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use.

Definition

The following models a text document using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.

(2) Mary also likes to watch football games.

Based on these two text documents, a list is constructed as follows for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object, and attributing to the respective JavaScript variable:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text document.

The order of elements is free, so, for example, {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is also equivalent to BoW1. It is also what we expect from a strict JSON object representation.

Note: if another document is like a union of these two,

(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.

its JavaScript representation will be:

BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};

So, as we see in the bag algebra, the "union" of two documents in the bag-of-words representation is, formally, the disjoint union, summing the multiplicities of each element:

BoW3 = BoW1 ⨄ BoW2

Word order

The BoW representation of a text removes all word ordering. For example, the BoW representations of "man bites dog" and "dog bites man" are the same, so any algorithm that operates on a BoW representation of text must treat them in the same way. Despite this lack of syntax or grammar, BoW representation is fast and may be sufficient for simple tasks that do not require word order. For instance, for document classification, if the words "stocks", "trade", and "investors" appear multiple times, then the text is likely a financial report, even though it would be insufficient to distinguish between

Yesterday, investors were rallying, but today, they are retreating.

and

Yesterday, investors were retreating, but today, they are rallying.

and so the BoW representation would be insufficient to determine the detailed meaning of the document.

Implementations

Implementations of the bag-of-words model might involve using frequencies of words in a document to represent its contents. The frequencies can be "normalized" by the inverse of document frequency, or tf–idf. Additionally, for the specific purpose of classification, supervised alternatives have been developed to account for the class label of a document. Lastly, binary (presence/absence or 1/0) weighting is used in place of frequencies for some problems (e.g., this option is implemented in the WEKA machine learning software system).
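
The three weightings mentioned above (raw counts, tf–idf, and binary presence/absence) can be sketched as follows; this is a hand-rolled illustration rather than WEKA's implementation, and the unsmoothed idf formula is only one common variant:

import math
from collections import Counter

corpus = [["john", "likes", "movies", "movies"],
          ["mary", "likes", "football"]]

def raw_counts(doc):
    # Plain term frequencies.
    return Counter(doc)

def binary(doc):
    # Presence/absence (1/0) weighting.
    return {t: 1 for t in set(doc)}

def tf_idf(doc, corpus):
    # Scale each count by log(N / document frequency of the term).
    n = len(corpus)
    return {t: c * math.log(n / sum(1 for d in corpus if t in d))
            for t, c in Counter(doc).items()}

print(tf_idf(corpus[0], corpus))  # "likes" gets weight 0.0: it occurs in every document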

Python implementation

# Make sure to install the necessary packages first
# pip install --upgrade pip
# pip install tensorflow
from tensorflow import keras
from typing import List
from keras.preprocessing.text import Tokenizer

sentence = ["John likes to watch movies. Mary likes movies too."]

def print_bow(sentence: List[str]) -> None:
    # Fit the tokenizer on the input, mapping each distinct word to an integer index.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentence)
    sequences = tokenizer.texts_to_sequences(sentence)
    word_index = tokenizer.word_index

    # Count how often each word's index occurs in the encoded sentence.
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of word sentence 1:\n{bow}")
    print(f"We found {len(word_index)} unique tokens.")

print_bow(sentence)
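
With the Tokenizer's default settings (lowercasing and punctuation stripping), this should report a bag such as {'likes': 2, 'movies': 2, 'john': 1, 'to': 1, 'watch': 1, 'mary': 1, 'too': 1} and 7 unique tokens. Note that keras.preprocessing.text is deprecated in recent TensorFlow releases, which offer tf.keras.layers.TextVectorization for the same purpose.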

Hashing trick

A common alternative to using dictionaries is the hashing trick, where words are mapped directly to indices with a hashing function, so no memory is required to store a dictionary. Hash collisions are typically dealt with by using the freed-up memory to increase the number of hash buckets. In practice, hashing simplifies the implementation of bag-of-words models and improves scalability.

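A minimal sketch of the trick (illustrative; Python's built-in hash is used for brevity, although it is randomized per process for strings, so a stable hash from hashlib would be preferable in practice):

def hashed_bow(tokens, num_buckets=8):
    # Map each token straight to a bucket index; no vocabulary dictionary is kept.
    vec = [0] * num_buckets
    for tok in tokens:
        vec[hash(tok) % num_buckets] += 1  # colliding words simply share a bucket
    return vec

print(hashed_bow("john likes to watch movies mary likes movies too".split()))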

See also

Additive smoothing
Feature extraction
Machine learning
MinHash
Vector space model
w-shingling

Notes

1. McTear et al 2016, p. 167.

References

Harris, Zellig (1954). "Distributional Structure". Word. 10 (2/3): 146–62. doi:10.1080/00437956.1954.11659520.
Ko, Youngjoong (2012). "A study of term weighting schemes using class information for text classification". SIGIR'12. ACM.
McTear, Michael; et al. (2016). The Conversational Interface. Springer International Publishing.
Sivic, Josef (April 2009). "Efficient visual search of videos cast as text retrieval". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (4): 591–605.
Weinberger, K. Q.; Dasgupta, A.; Langford, J.; Smola, A.; Attenberg, J. (2009). "Feature hashing for large scale multitask learning". Proceedings of the 26th Annual International Conference on Machine Learning. pp. 1113–1120. arXiv:0902.2206. doi:10.1145/1553374.1553516. ISBN 9781605585161.
