
Observability (software)

Logs, or log lines, are generally free-form, unstructured text blobs that are intended to be human readable. Modern logging is structured to enable machine parsability. As with metrics, an application developer must instrument the application upfront and ship new code if different logging information is required.
Furthermore, the cardinality of metrics can quickly make the storage size of telemetry data prohibitively expensive. Since metrics are cardinality-limited, they are often used to represent aggregate values (for example: average page load time, or 5-second average of the request rate).
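As an illustration of such an aggregate value, the following minimal sketch computes a request rate averaged over a 5-second window. The `RequestRateWindow` class and its method names are hypothetical and are not taken from any particular metrics library.

```python
from collections import deque


class RequestRateWindow:
    """Keeps request timestamps and reports the average rate over a window."""

    def __init__(self, window_seconds=5.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()  # request times in seconds, oldest first

    def record(self, timestamp):
        """Record one request observed at `timestamp` (in seconds)."""
        self._timestamps.append(timestamp)

    def rate(self, now):
        """Return requests per second over the last `window_seconds`."""
        cutoff = now - self.window_seconds
        # Discard requests that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds


window = RequestRateWindow(window_seconds=5.0)
for t in [0.5, 1.0, 2.5, 4.0, 6.0]:
    window.record(t)
print(window.rate(now=6.0))  # 3 requests remain in the window: prints 0.6
```

Only the single aggregated number is stored per interval, which is why the individual requests behind it can no longer be recovered from the metric alone.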
A cloud native application is typically made up of distributed services which together fulfill a single request. A distributed trace is an interrelated series of discrete events (also called spans) that track the progression of a single user request.
To be able to observe an application, telemetry about the application's behavior needs to be collected or exported. Instrumentation means generating telemetry alongside the normal operation of the application. Telemetry is then collected by an independent backend for later analysis.
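A minimal sketch of custom instrumentation, assuming an in-memory list as a stand-in for the telemetry backend. The names `TELEMETRY`, `instrumented` and `handle_request` are hypothetical and chosen only for illustration.

```python
import functools
import time

# Stand-in for a telemetry backend; a real system would export these records.
TELEMETRY = []


def instrumented(func):
    """Record the duration and outcome of every call to `func`."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "error"  # overwritten below if the call succeeds
        try:
            result = func(*args, **kwargs)
            outcome = "ok"
            return result
        finally:
            # Telemetry is generated alongside the normal operation of the call.
            TELEMETRY.append({
                "operation": func.__name__,
                "duration_seconds": time.perf_counter() - start,
                "outcome": outcome,
            })

    return wrapper


@instrumented
def handle_request(user_id):
    return f"hello, user {user_id}"


handle_request(42)
print(TELEMETRY[-1]["operation"], TELEMETRY[-1]["outcome"])  # handle_request ok
```

The decorated function behaves exactly as before; the telemetry record is a side effect that a separate collector could later read and analyze.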
Instrumenting an application with traces means sending span information to a tracing backend. The tracing backend correlates the received spans to generate presentable traces. To be able to follow a request as it traverses multiple services, spans are labeled with unique identifiers.
Self monitoring is a practice where observability stacks monitor each other, in order to reduce the risk of inconspicuous outages. Self monitoring may be put in place in addition to high availability and redundancy to further avoid correlated failures.
In fast-changing systems, instrumentation itself is often the best possible documentation, since it combines intention (what are the dimensions that an engineer named and decided to collect?) with the real-time, up-to-date information of live status in production.
Application developers choose what kind of metrics to instrument their software with, before it is released. As a result, when a previously unknown issue is encountered, it is impossible to add new metrics without shipping new code.
Instrumentation can be automatic or custom. Automatic instrumentation offers blanket coverage and immediate value; custom instrumentation brings higher value but requires more intimate involvement with the instrumented application.
Observability and monitoring are sometimes used interchangeably. As tooling, commercial offerings and practices evolved in complexity, "monitoring" was re-branded as observability in order to differentiate new tools from the old.
In control theory, the "observability" of a system measures how well its state can be determined from its outputs. Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling).
Majors et al. suggest that engineering teams that only have monitoring tools end up relying on expert foreknowledge (seniority), whereas teams that have observability tools rely on exploratory analysis (curiosity).
Metrics, logs and traces are most commonly listed as the pillars of observability. Majors et al. suggest instead that the pillars of observability are high cardinality, high dimensionality, and explorability.
Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience.
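Threshold-based alerting of this kind can be sketched as follows. The rule structure, the metric names and the threshold values are assumptions made for the example, not the configuration format of any specific monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str       # name of the metric to watch
    threshold: float  # alert when the current value exceeds this


def evaluate(rules, metrics):
    """Return an alert message for every metric value above its threshold."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # Metrics that were not reported are skipped rather than treated as zero.
        if value is not None and value > rule.threshold:
            alerts.append(f"{rule.metric}={value} exceeds {rule.threshold}")
    return alerts


rules = [
    ThresholdRule("query_failures_total", 100),
    ThresholdRule("database_size_bytes", 5e9),
]
current = {"query_failures_total": 250, "database_size_bytes": 1.2e9}
print(evaluate(rules, current))  # ['query_failures_total=250 exceeds 100']
```

The thresholds have to be chosen in advance, which is where knowledge of normal operating conditions comes in: a failure mode that no rule anticipates raises no alert.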
"software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application along with the hardware and network it runs on" (IBM Instana)
In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components.
Logs typically include a timestamp and severity level. An event (such as a user request) may be fragmented across multiple log lines and interweave with logs from concurrent events.
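A minimal sketch of structured logging with Python's standard `logging` module, emitting one JSON object per log line. The `JsonFormatter` class and the `request_id` field are hypothetical; the field illustrates one way to regroup log lines that belong to the same event.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "message": record.getMessage(),
            # Hypothetical correlation field, passed in through `extra=`.
            "request_id": getattr(record, "request_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

# Lines from concurrent requests interleave, but can be regrouped by request_id.
logger.info("payment authorized", extra={"request_id": "req-1"})
logger.warning("inventory low", extra={"request_id": "req-2"})
logger.info("receipt sent", extra={"request_id": "req-1"})
```

Because every line is machine-parsable, a log backend can filter on `request_id` to reassemble the fragments of one user request.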
"Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance." (Google Cloud)
proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces—so you can understand the behavior of your complex digital system
(New Relic's definition of observability)
The term is frequently referred to by its numeronym o11y (where 11 stands for the number of letters between the first letter and the last letter of the word). This is similar to other computer science abbreviations such as i18n, l10n and k8s.
Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.
The definition of observability varies by vendor:
"a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre without needing to ship new code" (Honeycomb)
Verifying new features in production by shipping them together with custom instrumentation is a practice called "observability-driven development".
Span identifiers enable constructing a parent-child relationship between spans. Span information is typically shared in the HTTP headers of outbound requests.
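The following sketch shows how span identifiers might travel in an HTTP header. It assumes the `traceparent` header layout of the W3C Trace Context recommendation (`version-traceid-parentid-flags`); the helper function names are hypothetical, and real tracing libraries handle this propagation automatically.

```python
import secrets


def new_traceparent():
    """Start a new trace: fresh trace ID and root span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes, 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Build the header for an outbound call made while handling `incoming`.

    The trace ID is kept, a new span ID is generated for the current span,
    and the caller's span ID is returned as the parent of that span.
    """
    version, trace_id, parent_span_id, flags = incoming.split("-")
    span_id = secrets.token_hex(8)
    return f"{version}-{trace_id}-{span_id}-{flags}", parent_span_id


incoming = new_traceparent()                    # header received by a service
outbound, parent = child_traceparent(incoming)  # header it sends downstream
headers = {"traceparent": outbound}
print(headers)
print("parent span:", parent)
```

Since every span reports its own ID together with its parent's ID, the tracing backend can rebuild the tree of calls that served one request.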
Observability relies on three main types of telemetry data: metrics, logs and traces. Those are often referred to as "pillars of observability".
Instrumentation can be native, done in-code (by modifying the code of the instrumented application), or out-of-code (e.g. sidecar, eBPF).
Continuous profiling is another telemetry type used to precisely determine how an application consumes resources.
"the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces" (Dynatrace)
To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it.
Majors et al. argue that runbooks and dashboards have little value because "modern systems rarely fail in precisely the same way twice."
"observability starts by shipping all your raw data to central service before you begin analysis" (Edge Delta)
A trace shows the causal and temporal relationships between the services that interoperate to fulfill a request.
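A metric is a point-in-time measurement (scalar) that represents some system state. Examples of common metrics include: number of HTTP requests per second; total number of query failures; database size in bytes; time in seconds since last garbage collection.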
Metrics are typically tagged to facilitate grouping and searchability.
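A minimal sketch of a tagged counter, showing how tags allow grouping and searching and how each distinct tag combination becomes its own series. The `TaggedCounter` class and the tag names are hypothetical.

```python
from collections import defaultdict


class TaggedCounter:
    """A counter whose value is kept per unique combination of tags."""

    def __init__(self, name):
        self.name = name
        self._series = defaultdict(int)

    def increment(self, **tags):
        # Sort the tags so the same combination always maps to the same series.
        self._series[tuple(sorted(tags.items()))] += 1

    def total(self, **match):
        """Sum all series whose tags include every key/value pair in `match`."""
        return sum(
            value
            for series, value in self._series.items()
            if match.items() <= dict(series).items()
        )

    def series_count(self):
        """Number of distinct tag combinations seen so far (the cardinality)."""
        return len(self._series)


requests = TaggedCounter("http_requests_total")
requests.increment(method="GET", status="200")
requests.increment(method="GET", status="500")
requests.increment(method="POST", status="200")
print(requests.total(method="GET"))  # 2
print(requests.total(status="200"))  # 2
print(requests.series_count())       # 3
```

Adding a tag with many possible values, such as a user ID, would multiply the number of stored series, which is the storage cost that high cardinality brings.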
Without external context, it is impossible to correlate between events (such as user requests) and distinct metric values.
The terms are commonly contrasted in that systems are monitored using predefined sets of telemetry, and monitored systems may be observable.
The term is borrowed from control theory.
See also
Application performance management (APM)
OpenTelemetry (OTel)
Real user monitoring (RUM)
Synthetic monitoring
DevOps
Site reliability engineering (SRE)
Sociotechnical system

Bibliography
Boten, Alex; Majors, Charity (2022). Cloud-Native Observability with OpenTelemetry. Packt Publishing. ISBN 978-1-80107-190-1. OCLC 1314053525.
Majors, Charity; Fong-Jones, Liz; Miranda, George (2022). Observability engineering : achieving production excellence (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
Sridharan, Cindy (2018). Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
Hausenblas, Michael (2023). Cloud Observability in Action. Manning. ISBN 9781633439597. OCLC 1359045370.

References
Fellows, Geoff (1998). "High-Performance Client/Server: A Guide to Building and Managing Robust Distributed Systems". Internet Research. 8 (5). doi:10.1108/intr.1998.17208eaf.007. ISSN 1066-2243.
Cantrill, Bryan (2006). "Hidden in Plain Sight: Improvements in the observability of software can help you diagnose your most crippling performance problems". Queue. 4 (1): 26–36. doi:10.1145/1117389.1117401. ISSN 1542-7730. S2CID 14505819.
Majors, Charity; Fong-Jones, Liz; Miranda, George (2022). Observability engineering : achieving production excellence (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
"What is observability". IBM. 15 October 2021. Retrieved 9 March 2023.
"How to Begin Observability at the Data Source". Cisco. 26 October 2023. Retrieved 26 October 2023.
Livens, Jay (October 2021). "What is observability?". Dynatrace. Retrieved 9 March 2023.
"DevOps measurement: Monitoring and observability". Google Cloud. Retrieved 9 March 2023.
Reinholds, Amy (30 November 2021). "What is observability?". New Relic. Retrieved 9 March 2023.
"How Are Structured Logs Different from Events?". 26 June 2018.
Hadfield, Ally (29 June 2022). "Observability vs. Monitoring: What's The Difference in DevOps?". Instana. Retrieved 15 March 2023.
Kidd, Chrissy. "Monitoring, Observability & Telemetry: Everything You Need To Know for Observable Work". Retrieved 15 March 2023.
"What is Observability? A Beginner's Guide". Splunk. Retrieved 9 March 2023.
Sridharan, Cindy (2018). "Chapter 4. The Three Pillars of Observability". Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
"Trace Context". W3C. 2021-11-23. Retrieved 2023-09-27.
"b3-propagation". openzipkin. Retrieved 2023-09-27.
"What is continuous profiling?". Cloud Native Computing Foundation. 31 May 2022. Retrieved 9 March 2023.

External links
CNCF Observability Technical Advisory Group (TAG)


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.