Site reliability engineering

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations. SRE aims to create highly reliable and scalable IT systems. Although the two are closely related, SRE is slightly different from DevOps.

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003. By 2016, Google employed more than 1,000 site reliability engineers. The concept then spread into the broader software development industry, and other companies began to employ site reliability engineers. Dedicated SRE teams are common at larger web companies; in some midsize and many smaller companies, it is not uncommon for a DevOps team to double as the SRE team. Organizations that have adopted the concept include Airbnb, Dropbox, IBM, LinkedIn, Netflix, and Wikimedia. According to a 2021 report by the DevOps Institute, 22% of respondents in a survey of 2,000 IT professionals worldwide had adopted the SRE model, compared to 15% the previous year.

Definition

Site reliability engineering, as a job role, may be performed by individual contributors or organized in teams, responsible for a combination of the following within a broader engineering organization: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration. Focuses of SRE include automation, system design, and improvements to system resilience.

Site reliability engineering, as a set of principles and practices, can be performed by anyone. Though everyone should contribute to good practices, as occurs in security engineering, a company may eventually hire specialists and engineers for the job.

Site reliability engineering is considered a specific implementation of DevOps; SRE focuses specifically on building reliable systems, whereas DevOps focuses more broadly. Although they have different focuses, some companies have rebranded their operations teams to SRE teams with little meaningful change.

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. Consensus is lacking, but the following characteristics are included in most definitions:

- Automation or elimination of anything repetitive in a cost-effective way.
- Avoiding the pursuit of more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
- Systems designed with a bias toward the reduction of risks to availability, latency, and efficiency.
- Observability, as in the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.

Site reliability engineering practices also vary widely, but the following are commonly seen as at least partially implemented:

- Toil management, as the implementation of the first principle outlined above.
- Defining and measuring reliability goals: SLIs, SLOs, and error budgets.
- Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
- Designing for and implementing observability.
- Defining, testing, and running an incident management process.
- Capacity planning.
- Change and release management, including CI/CD.
- Chaos engineering.

Implementations

SRE teams collaborate with other departments within organizations to implement SRE principles effectively. Below is an overview of common practices.

Kitchen Sink, a.k.a. "Everything SRE"

In site reliability engineering, "Kitchen Sink" refers to the expansive and often unbounded scope of services and workflows that SRE teams oversee. Unlike traditional roles with clearly defined boundaries, SREs take on a wide range of responsibilities, from system design and performance optimization to incident management and automation. This broad scope lets SREs address many kinds of challenges, so that systems run efficiently and evolve in response to changing demands and complexity. Teams that take this comprehensive view can foster a culture of continuous improvement and resilience, which in turn improves the overall reliability of services.

Infrastructure

Infrastructure SRE teams focus on maintaining and improving the reliability of key systems that support other teams' workflows. While they sometimes collaborate with platform engineering teams, their primary responsibility is ensuring uptime, performance, and efficiency. Platform teams, on the other hand, primarily develop the software and systems used across the organization. While reliability is a goal for both, platform teams prioritize creating and maintaining the tools and services used by internal stakeholders, whereas Infrastructure SRE teams are tasked with ensuring those systems run smoothly and meet reliability standards.

Tools

Teams use a variety of tools to measure, maintain, and enhance system reliability. These tools support performance monitoring, issue identification, and proactive maintenance. For instance, Nagios Core is widely used for system monitoring and alerting, while Prometheus is popular for collecting and querying metrics in cloud-native environments. With these tools, SRE teams can track performance and respond quickly to potential reliability problems.

Product or application

SRE teams dedicated to specific products or applications are common in large organizations. These teams are responsible for ensuring the reliability, scalability, and performance of key services. In larger companies, it is typical to have multiple SRE teams, each focusing on different products or applications, so that each area receives specialized attention to meet performance and availability targets.

Embedded

In an embedded model, individual SREs or small SRE pairs are integrated directly within software engineering teams. These SREs work closely with developers, applying core SRE principles, such as automation, monitoring, and incident response, directly to the software development lifecycle. This approach helps improve reliability and performance while fostering collaboration between SREs and developers.

Consulting

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with experience across various implementations, these teams provide guidance tailored to specific organizational needs. When working directly with clients, these SREs are often referred to as "Customer Reliability Engineers".

In large organizations that have adopted SRE, a hybrid model is common. This model includes various implementations, such as multiple Product/Application SRE teams dedicated to addressing the unique reliability needs of different products. An Infrastructure SRE team may collaborate with a Platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, the USENIX organization has hosted the annual SREcon conference, bringing together site reliability engineers from various industries. The conference serves as a platform for professionals to share knowledge, explore best practices, and discuss the latest trends in site reliability engineering.

See also

- Chaos engineering
- Cloud computing
- Data center
- Disaster recovery
- High availability software
- Infrastructure as code
- Operations, administration and management
- Operations management
- Reliability engineering
- System administration
- Backup site

Further reading

- Limoncelli, Tom; Chalup, Strata R.; Hogan, Christina J. (September 2014). The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services. Vol. 2. Upper Saddle River, NJ: Addison-Wesley. ISBN 978-0133478549. OCLC 891786231.
- Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall Richard, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. ISBN 978-1491929124.
- Blank-Edelman, David N., ed. (2018). Seeking SRE: Conversations About Running Production Systems at Scale (1 ed.). Sebastopol, CA: O'Reilly. ISBN 978-1491978863. OCLC 1052565720.
- Beyer, Betsy; Murphy, Niall; Kawahara, Kent; Rensin, David; Thorne, Stephen (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. ISBN 978-1492029502.
- Welch, Nat (2018). Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime. Packt. ISBN 978-1788628884.
- Adkins, Heather; Beyer, Betsy; Blankinship, Paul; Lewandowski, Piotr; Oprea, Ana; Stubblefield, Adam (2020). Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. O'Reilly. ISBN 978-1-4920-8312-2. OCLC 1129470292.
- Rosenthal, Casey; Jones, Nora (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. ISBN 978-1492043867.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
