Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
The Current State of AWS Log Management Security professionals have used log data to detect cyber threats for many years. It was in the late 1990s when organizations first started to use Syslog data to detect attacks by identifying and tracking malicious activity. Security teams rely on log data to detect threats because it provides a wealth of information about what is happening on their networks and systems. By analyzing this data, they can identify patterns that may indicate an attack is taking place. Migration to the cloud has complicated how security teams use log data to protect their networks and systems. The cloud introduces new complexities into the environment, as well as new attack vectors. A cloud-centric infrastructure changes how data is accessed and stored, impacting how security teams collect and analyze log data. Finally, the cloud makes it more difficult to correlate log data with other data sources, limiting the effectiveness of security analysis. Today, security teams have hundreds of AWS-specific tools and services available to consider and potentially implement. Once an organization has chosen a set of services, the logs produced by those same services can be extensive—and the challenges associated with ingesting and normalizing cloud log data can tax the abilities of even experienced security professionals. Security teams must adapt their cloud log management approach to overcome these challenges. First, it can be difficult to redirect or copy logs out of AWS into an external log management solution. According to Panther's recent State of AWS Log Management survey and report, 48.8% of security practitioners find it challenging to do so. Additionally, each AWS environment produces unique data that can come from a variety of sources. This data can often be staggering in size and complexity. While the data coming from AWS is complicated enough, it is often siloed in the AWS environment, too — unlinked and uncorrelated with the rest of an organization's data. AWS customers often find their security teams overwhelmed with the amount of data they need to process in order to detect threats effectively. This data is spread across various AWS services, and teams have little guidance on implementing an effective and sustainable threat detection strategy. As a result, security teams can struggle to identify and respond to threats promptly. Last year a Google Cloud Blog post stated, "Developing cloud-based data ingestion pipelines that replicate data from various sources into your cloud data warehouse can be a massive undertaking that requires significant investment of staffing resources." This means that most organizations need an easy way to cost-effectively centralize organized AWS logs into a system that has visibility across the rest of their environment. They need a solution that will scale alongside a growing AWS footprint and perform quickly across massive amounts of log data. Why Continuous Monitoring Is Critical Organizations must monitor AWS log data to ensure their infrastructure runs securely and protects sensitive information. This is because the infrastructure that runs an organization's application or software may be on AWS and can reveal sensitive information, such as customer credit card data. And in the case of health technology companies, health records, and history are stored in AWS. Security teams must also continuously monitor their AWS log data in order to detect threats and prevent damage to their networks and systems. 
By identifying and analyzing patterns in the data, they can identify malicious activity before it causes damage. In addition to quickly identifying and responding to threats, continuous monitoring enables security teams to correlate AWS log data with other data sources for a complete view of an organization's security posture. The right log management solution will offer features specifically designed to address the challenges associated with AWS log data. It will also help teams ingest, normalize, and search their AWS logs quickly and effectively. Conclusion AWS has increasingly become the go-to provider for cloud infrastructure in the past decade, with more and more companies placing their crown jewels in its hands. This includes most of their regular IT operations, as the cloud provider has become a staple of modern business. Modern organizations need a cloud security platform that offers a log management solution specifically designed for AWS environments. They need a solution that can support a wide range of AWS data sources with the ability to quickly and effectively ingest and normalize large volumes of data.
As more and more companies move to the cloud, it’s becoming essential to keep track of their resource usage to ensure cost-effectiveness. Amazon Web Services (AWS) is a leading platform among cloud providers, but its extensive range of services can pose a challenge when monitoring resource consumption efficiently. This article delves into the significance of tracking AWS resource utilization for cost optimization and offers practical tips on accomplishing this. What Is AWS Resource Utilization? As an AWS professional, it’s essential to understand the concept of AWS resource utilization. Essentially, it refers to the computing resources that your website or application consumes on the AWS platform. These resources may include CPU, memory, disk I/O, and network usage, among others. Fortunately, AWS offers several tools you can utilize to monitor your resource utilization. These tools include Amazon CloudWatch, AWS Trusted Advisor, and AWS Cost Explorer. By leveraging these services, you can keep track of your resource consumption and optimize your AWS usage for maximum efficiency. Why Monitoring AWS Resource Utilization Is Critical for Cost Optimization Identify Underutilized Resources Maximizing resource efficiency is crucial to keeping AWS costs under control. Oversized EC2 instances can result in unnecessary expenses for compute resources, whereas storing data in an S3 bucket that isn’t required can lead to expenses for unused storage. It’s important to optimize resource usage to avoid paying more than necessary. Apart from directly impacting costs, inefficient resource utilization can also harm performance. For instance, using an RDS database that is not appropriately sized for your requirements can result in sluggish query response times or even downtime during peak traffic periods. Identifying Opportunities for Optimization Monitoring resource usage can aid in identifying optimization opportunities that can lead to cost reduction and enhanced performance. One potential optimization method is reducing the size of an EC2 instance or combining multiple instances into one, which can reduce costs. Additionally, utilizing S3 lifecycle policies to relocate data to lower-cost storage tiers as it becomes less frequently accessed can be another cost-saving option. It’s essential to note that optimization is an ongoing process. As your usage habits evolve, your resource consumption will also fluctuate. Consistent monitoring and optimization practices can guarantee your resource usage remains efficient at all times. Identify Overutilized Resources Keeping track of your resource usage can aid in detecting excessively utilized resources. Take, for instance, an RDS database that persistently operates at maximum CPU utilization. In such a scenario, it might be necessary to upgrade to a larger instance size to maintain the seamless functioning of your application and forestall any probable periods of system unavailability. Forecast Future Resource Needs Keeping a tab on resource utilization can also aid in predicting future resource requirements. By comprehending the rate at which your resource usage increases, you can forecast when you will have to scale up or down your resources. This proactive approach can prevent the risk of running out of resources when you require them the most, as well as steer clear of over-provisioning and paying excessively for resources that you do not essentially need. 
Leveraging Third-Party Tools There are many third-party tools available for monitoring AWS resource utilization. These tools can provide additional insights and analytics and automate AWS cloud cost optimization workflows. Some popular third-party tools include CloudCheckr, CloudHealth, and ParkMyCloud. As you choose a third-party tool, it’s paramount to consider your unique needs and financial limitations. While some tools focus exclusively on cost reduction, others provide a broader range of capabilities. Additionally, pricing models may vary depending on the number of resources you use or the level of functionality you require. Therefore, conducting a thorough assessment of your needs before selecting a tool is crucial to guarantee it satisfies your precise requirements while fitting within your budget. Efficient monitoring of AWS resource utilization is crucial for cost optimization and optimal efficiency. You can leverage native AWS tools, such as Cost Explorer, Trusted Advisor, and CloudWatch, along with third-party options, to obtain valuable insights into your resource utilization. Acting upon these insights can enable you to optimize costs and enhance overall performance. Tips for Monitoring AWS Resource Utilization To optimize the utilization of your AWS resources while keeping your expenses low, implementing a set of best practices for resource monitoring is crucial. To help you achieve this, below are some practical suggestions for effective resource utilization monitoring: Use AWS Cost Explorer As an AWS user, keeping track of your costs and usage can be daunting. However, AWS Cost Explorer is a robust solution to alleviate this concern. Utilizing this tool lets you delve into the intricacies of your AWS spending patterns and usage statistics. The feature-rich Cost Explorer provides detailed reports on various resource utilization, such as EC2 instances, S3 buckets, and RDS databases. It allows you to create personalized reports that cater to your specific needs. With cost alerts, you can proactively monitor your expenses and receive timely notifications to ensure you stay within your budget. Set Up AWS Trusted Advisor AWS Trusted Advisor could be an excellent choice if you are looking for a reliable resource monitoring tool. This tool offers real-time guidance and suggestions on enhancing cost optimization, security, and performance. By analyzing your resource usage, Trusted Advisor can help you identify opportunities to reduce costs and recommend AWS best practices accordingly. This feature-packed tool could be a valuable addition to your AWS toolkit. Use AWS CloudWatch CloudWatch, a monitoring and logging service offered by AWS, enables real-time monitoring of AWS resources. It allows monitoring of key metrics like CPU utilization and network traffic of EC2 instances, RDS databases, and other resources. Additionally, you can configure alarms to alert you when metrics cross set thresholds. This way, you can proactively address any issues and ensure the optimal performance of your AWS resources. Set Up Alerts One effective way to promptly detect and address issues in your system is by configuring alerts in CloudWatch. For instance, you can establish an alarm notifying you when your CPU usage surpasses a threshold. By doing this, you can take necessary measures to prevent any application downtime from occurring. Conclusion Efficiently managing AWS resource utilization is pivotal for achieving cost optimization goals. 
It involves analyzing which resources are most frequently utilized, identifying areas for optimization, and predicting future resource requirements, enabling informed decision-making regarding resource allocation and cost optimization. To obtain valuable insights into resource utilization and enhance overall efficiency, AWS-native tools, such as Cost Explorer, Trusted Advisor, and CloudWatch, as well as third-party tools, can be utilized. Leveraging these tools enables you to take necessary actions for optimizing costs and improving the overall efficiency of your AWS resources.
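To make the alerting guidance above concrete, the snippet below shows what a basic CPU alarm can look like when defined in CloudFormation. This is a minimal sketch rather than a recommended setup; the instance ID, threshold, and SNS topic are placeholders you would replace with your own values:

Resources:
  AlertTopic:
    Type: AWS::SNS::Topic            # placeholder notification target
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alert when average EC2 CPU utilization stays above 80% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0  # placeholder instance ID
      Statistic: Average
      Period: 300                     # five-minute periods
      EvaluationPeriods: 2            # two consecutive periods = 10 minutes
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic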
Monitoring data stream applications is a critical component of enterprise operations, as it allows organizations to ensure that their applications are functioning optimally and delivering value to their customers. In this article, we will discuss in detail the importance of monitoring data stream applications and why it is critical for enterprises. Data stream applications are those that handle large volumes of data in real-time, such as those used in financial trading, social media analytics, or IoT (Internet of Things) devices. These applications are critical to the success of many businesses, as they allow organizations to make quick decisions based on real-time data. However, these applications can be complex, and any issues or downtime can have significant consequences. By monitoring data stream applications, enterprises can proactively identify and address issues before they impact the business. This includes identifying performance issues, detecting errors and anomalies, and ensuring that the application is meeting its service level agreements (SLAs). Monitoring also allows organizations to track key metrics, such as data throughput, latency, and error rates, and to make adjustments to optimize the application's performance. For a reference data stream system, see Unlocking the Potential of IoT Applications. In addition to these benefits, monitoring data stream applications is critical for ensuring regulatory compliance. Many industries, such as finance and healthcare, have strict regulations governing data privacy and security. By monitoring these applications, organizations can ensure that they are meeting these regulatory requirements and avoid costly fines and legal penalties. Another key benefit of monitoring data stream applications is that it allows organizations to optimize their infrastructure and resource usage. By monitoring resource utilization, enterprises can identify areas of inefficiency, such as overprovisioned resources or bottlenecks, and make adjustments to improve performance and reduce costs. Prometheus: Prometheus is an open-source monitoring system that is designed for collecting and querying time-series data. It can be used to monitor metrics from a variety of sources, including data stream applications. Prometheus provides a range of tools for data visualization and alerting and integrates with a variety of popular tools and platforms. Splunk: Splunk is a popular data analytics and monitoring platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track metrics such as data volume, latency, and error rates. Splunk also includes a range of machine learning and data analysis tools that can be used to identify anomalies and optimize performance. Amazon CloudWatch: Amazon CloudWatch is a monitoring and management service offered by Amazon Web Services (AWS). It can be used to monitor a variety of AWS resources, including data stream applications running on AWS. CloudWatch provides a range of metrics, logs, and alerts and can be integrated with other AWS tools, such as AWS Lambda. If your data streams run on AWS, CloudWatch is the best option. DataDog: DataDog is a cloud-based monitoring and analytics platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track a wide range of metrics, including data volume, latency, and error rates.
DataDog also includes a range of visualization and collaboration tools that can be used to improve communication and collaboration across teams. Finally, monitoring data stream applications is critical for maintaining customer satisfaction. In today's fast-paced, digital world, customers expect instant responses and seamless experiences. Any issues or downtime can have a significant impact on customer satisfaction and brand reputation. By proactively monitoring these applications, organizations can ensure that their customers are receiving the expected level of service and address any issues quickly and efficiently. In conclusion, monitoring data stream applications is critical for enterprise success. It allows organizations to proactively identify and address issues, ensure regulatory compliance, optimize resource utilization, and maintain customer satisfaction. By investing in monitoring tools and processes, enterprises can ensure that their applications are delivering value to their customers and stay ahead of the competition in today's fast-paced digital landscape.
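Whichever tool you adopt, the underlying pattern is similar: track throughput, latency, and error rate, and alert when they drift outside acceptable bounds. As a rough illustration in Prometheus alerting-rule syntax (the metric names stream_errors_total and stream_records_processed_total are hypothetical placeholders for whatever counters your application actually exports):

groups:
  - name: data-stream-alerts
    rules:
      - alert: HighStreamErrorRate
        # fire when errors exceed 5% of processed records over the last 5 minutes
        expr: rate(stream_errors_total[5m]) / rate(stream_records_processed_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Data stream error rate is above 5%"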
When organizations move toward the cloud, their systems also lean toward distributed architectures. One of the most common examples is the adoption of microservices. However, this also creates new challenges when it comes to observability. You need to find the right tools to monitor, track, and trace these systems by analyzing outputs through metrics, logs, and traces. It enables teams to quickly pinpoint the root cause of issues, fix them, and optimize the application performance, giving them the confidence to deliver code faster. So, this article looks at the features, limitations, and important selling points of eleven popular observability tools to help you select the best one for your project. Helios Helios is a developer-observability solution that provides actionable insight into the end-to-end application flow. It incorporates OpenTelemetry's context propagation framework and provides visibility across microservices, serverless functions, databases, and third-party APIs. You can check out their sandbox or use it for free by signing up here. Key Features Provides a complete overview: Helios provides distributed tracing information in full context, showing how data flows through your entire application in any environment. Visualization: Enables users to collect and visualize trace data from multiple data sources to drill down and troubleshoot potential issues. Multi-language support: Supports multiple languages and frameworks, including Python, JavaScript, Node.js, Java, Ruby, .NET, Go, and C++, as well as the OpenTelemetry Collector. Share and reuse: You can easily collaborate with team members by sharing traces, tests, and triggers through Helios. In addition, Helios allows reusing requests, queries, and payloads with team members. Automatic test generation: Automatically generate tests based on trace data. Easy integrations: Integrates with your existing ecosystem, including logs, tests, error monitoring, and more. Workflow reproduction: Helios allows you to reproduce an exact workflow, including HTTP requests, Kafka and RabbitMQ messages, and Lambda invocations, in just a few clicks. Popular Use Cases Distributed tracing Multi-language application trace integration Serverless application observability Test troubleshooting API call automation Bottleneck analysis Prometheus Prometheus is an open-source tool broadly used to enable observability in cloud-native environments. It can collect and store time-series data and provides visualization tools to analyze and visualize the data collected. Key Features Data Collection: It can scrape metrics from various sources, including applications, services, and systems, and it supports multiple metric exposition formats out of the box. Data Storage: It stores the data collected in a time-series database, allowing efficient querying and aggregating of data over time. Alerting: Includes a built-in alerting system that can trigger alerts based on queries. Service Discovery: It can automatically detect and scrape metrics from services running in multiple environments, such as Kubernetes and other container orchestration systems. Grafana Integration: The tool has flexible integrations with Grafana, allowing it to create dashboards to display and analyze Prometheus metrics. Limitations Limited root cause analysis capabilities: The tool is primarily designed for monitoring and alerting. Therefore, it does not provide built-in root cause analysis.
Scaling: Although the tool can handle many metrics, a single Prometheus server keeps recent data in memory and is not designed to scale horizontally on its own, so it can become resource-intensive at scale. Data modeling: Contains a key-value pair-based data model and does not support nested fields and joins. Popular Use Cases Metrics collection and storage Alerting Service Discovery Grafana Grafana is an open-source tool predominantly used for data visualization and monitoring. It allows users to easily create and share interactive dashboards to visualize and analyze data from various sources. Key Features Data visualization: Creates customizable and interactive dashboards to visualize metrics and logs from various data sources. Alerting: Allows users to set up alerts based on the state of their metrics to indicate potential issues. Anomaly detection: Allows users to set up anomaly detection to automatically detect and alert based on abnormal behavior in their metrics. Root cause analysis: Allows users to drill down into the metrics to analyze the root cause by providing detailed information with historical context. Limitations Data storage: Its design does not support long-term storage and requires additional tools such as Prometheus or Elasticsearch to store metrics and logs. Data modeling: Grafana does not provide advanced data modeling capabilities. Hence, it is difficult to model specific data types and perform complicated queries. Data aggregation: Grafana does not include built-in data aggregation capabilities. Popular Use Cases Metrics visualization Alerting Anomaly detection Elasticsearch, Logstash, and Kibana (ELK) The ELK stack is a popular open-source solution that helps to manage logs and analyze data. It comprises three components: Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed search and analytics engine that can handle large volumes of structured and unstructured data, enabling users to store, index, and search large amounts of data. Logstash is a data collection and processing pipeline that allows users to collect, process, and enrich data from numerous sources, such as log files. Kibana is a data visualization and exploration tool that enables users to create interactive dashboards and visualizations based on the data within Elasticsearch. Key Features Log management: ELK allows users to collect, process, store, and analyze log data and metrics from multiple sources while providing a centralized console to search through the logs. Search and analysis: Allows users to search and analyze relevant log data, which is crucial in resolving and drilling down to the root cause of issues. Data visualization: Kibana allows users to create customizable dashboards which can visualize log data and metrics from multiple data sources. Anomaly detection: Kibana allows the creation of alerts for abnormal activity within the log data. Root cause analysis: The ELK stack allows users to drill down into the log data to better understand the root causes by providing detailed logs and historical context. Limitations Tracing: ELK does not natively support distributed tracing. Therefore, users may need to use additional tools such as Jaeger. Real-time monitoring: The design of ELK allows it to perform well as a log management and data analysis platform. But there is a slight delay in the log reporting, and users will experience minor latencies. Complicated setup and maintenance: The platform involves a complex setup and maintenance process. Also, it requires specific knowledge to manage large amounts of data and numerous data sources.
Popular Use Cases Log management Data visualization Compliance and security InfluxDB and Telegraf InfluxDB and Telegraf are open-source tools that are popular for their time-series data storage and monitoring capabilities. InfluxDB is a time-series database that stores and queries large amounts of time-series data using its SQL-like query language. On the other hand, Telegraf is a well-known data collection agent that can collect and send metrics and events to a wide range of receivers, such as InfluxDB. It also supports many data sources. Key Features The combination of InfluxDB and Telegraf brings in many features that benefit applications' observability. Metrics collection and storage: Telegraf allows users to collect metrics from many sources and sends them to InfluxDB for storage and analysis. Data visualization: InfluxDB can be integrated with third-party visualization tools such as Grafana to create interactive dashboards. Scalability: InfluxDB's design allows it to handle large amounts of time-series data and scale horizontally. Multiple data source support: Telegraf supports over 200 input plugins to collect metrics. Limitations Limited alerting capabilities: Both tools lack alerting capabilities and require a third-party integration to provide alerting. Limited root cause analysis: These tools lack native root cause analysis capabilities and require third-party integrations. Popular Use Cases Metrics collection and storage Monitoring Datadog Datadog is a popular cloud-based monitoring and analytics platform. It is widely used to get insights into the health and performance of distributed systems to troubleshoot issues beforehand. Key Features Multi-cloud support: Users can monitor applications running on multi-vendor cloud platforms such as AWS, Azure, GCP, etc. Service maps: Allows visualization of service dependencies, locations, services, and containers. Trace Analytics: Users can analyze traces while providing detailed information about application performance. Root cause analysis: Allows users to drill down into the metrics and traces to understand the root cause of the issues by providing detailed information with historical context. Anomaly detection: Can set up anomaly detection that can automatically detect and alert on abnormal behavior in metrics. Limitations Cost: Datadog is a cloud-based paid service, and charges are known to increase with large-scale deployments. Limited log ingestion, retention, and indexing support: Datadog does not provide log analysis support by default. You have to purchase log ingestion and indexing support for that separately. Hence, most organizations decide only to keep a limited number of logs retained, which can cause issues in troubleshooting since you can't access the complete history of the issue. Lack of control over data storage: Datadog stores data on its own servers and doesn't allow users to store data locally or in their own data centers. Popular Use Cases Observability pipelines Distributed tracing Container monitoring New Relic New Relic is a cloud-based monitoring and analytics platform that allows users to monitor applications and systems within a distributed environment. It uses the "New Relic Edge" service for distributed tracing and can observe 100% of an application's traces. Key Features Application performance monitoring: Provides a comprehensive APM solution to monitor and troubleshoot application performance. 
Multi-cloud support: Supports monitoring applications on multiple cloud platforms such as AWS, Azure, GCP, and more. Trace analytics: Enables users to analyze traces while providing detailed information about system and application performance. Root cause analysis: Allows users to drill down into the metrics and traces to analyze the root cause of issues. Log management: Collect, process, and analyze log data from various sources, providing a holistic view of the logs. Limitations Limited open-source integration: New Relic is a closed-source platform, and its integration with other open-source tools may be limited. Cost: New Relic can be costly compared to other solutions when working with large-scale deployments. Popular Use Cases Application performance monitoring Multi-cloud monitoring Trace analytics AppDynamics AppDynamics is a monitoring and analytics platform that allows you to observe, visualize, and manage each component of your application. In addition, it provides root cause analysis to identify underlying issues that may impact the application's performance. Key Features Data collection: Users can collect metrics and traces from numerous sources such as hosts, containers, cloud services, and applications. Anomaly detection: Enables users to set up anomaly detection, which can detect and alert on abnormal behavior. Trace Analytics: Users can analyze traces and provide detailed performance information. Application performance monitoring: Provides a comprehensive APM solution that allows users to monitor and troubleshoot the application's performance. Limitations Limited open-source integration: The vendor maintains the tool. Therefore, there may be limited open-source integrations. Limited customization: Customization options are not flexible compared to other tools since the users can not customize the solution themselves. Popular Use Cases Application performance monitoring Multi-cloud monitoring Business transaction management Selecting the Best Observability Tool Observability is an integral part of modern software development and operations. It helps organizations monitor the health and performance of their system and quickly solve problems before they become critical. This article discussed the 11 best observability tools developers should know when working with distributed systems. As you can see, each tool has its features and limitations. Therefore, evaluating them against your requirements is important to find the right fit for your organization. The best observability tool for your organization will depend on your specific needs, such as your environments, tech stack, developer experience, user profiles, monitoring and troubleshooting requirements, and workflow. I hope you have found this helpful. Thank you for reading!
I had the opportunity to catch up with Andi Grabner, DevOps Activist at Dynatrace, during day two of Dynatrace Perform. I've known Andi for seven years, and he's one of the people who have helped me understand DevOps since I began writing for DZone. We covered several topics that I'll share in a series of articles. Do Developers Want to Expand Beyond Just Coding? There will always be developers who just want to code and do what they're told. They're great at coding and that's perfect. But I think everyone that creates something, including developers, is a creative engineer. I think it's in every human's interest to see the impact they have with what they create. The impact can only be seen if that piece of code gets into the hands of the beneficiary. It could be an end user, or it could be a third party that is calling an API. In order to know if the beneficiary is actually getting the value out of the code, you need observability. I think we have the obligation to educate engineers to think about how they can create something that makes a positive impact on society. How can you get insights on your code in a fast feedback loop? In the end, if I'm a developer and I just write code, I never know if what I'm creating actually has any impact. This is a really boring life. When I spoke to Kelsey Hightower last year, he told me a story about when he was working for a company. They were managing SNAP payments for grocery stores. If this system goes down and a family's SNAP card is declined, people do not eat. Developers and engineers need to know when something bad is happening. That should be the main motivation. Put observability in to figure out if the stuff they are building has the desired impact. In this case, the desired impact is that everybody can purchase food when they need it. If I'm an artist, I want to know if anybody likes my painting. I would probably go to the museum to see whether people actually stop at my painting or walk past it; it's the same thing. I want to know if my code reaches the right people, and if it has the desired effect, because if not, if nobody looks at it, maybe I'm just wasting my time. We need to educate developers, because over the last 15 to 20 years, we educated them to evolve from just coding to test-driven development. I think now it's about observability-driven development. So whatever you do and whatever you build, you need to have observability in mind, because if you cannot observe the impact that you have with the software, then you're just flying blind.
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? If so, this workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. In this article, you'll be introduced to some basic concepts and learn what Prometheus is and is not before you start getting hands-on with it in the rest of the workshop. Introduction to Prometheus I'm going to get you started on your learning path with this first lab that provides a quick introduction to all things needed for metrics monitoring with Prometheus. Note this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: This lab introduces you to the Prometheus project and provides you with an understanding of its role in the cloud-native observability community. The start is with background on the beginnings of the Prometheus project and how it came to be part of the Cloud Native Computing Foundation (CNCF) as a graduated project. This leads to some basic outlining of what a data point is, how data points are gathered, and what makes them a metric, all using a high-level metaphor. You are then walked through what Prometheus is, why we are looking at this project as an open-source solution for your cloud-native observability needs, and, more importantly, what Prometheus cannot do for you. A basic architecture is presented, walking you through the most common usage and components of a Prometheus metrics deployment and closing with a final overview diagram of the Prometheus architecture. You are then presented with an overview of all the powerful features and tools you'll find in your new Prometheus toolbox: Dimensional data model - For multi-faceted tracking of metrics Query language - PromQL provides a powerful syntax to gather flexible answers across your gathered metrics data. Time series processing - Integration of metrics time series data processing and alerting Service discovery - Integrated discovery of systems and services in dynamic environments Simplicity and efficiency - Operational ease combined with implementation in the Go language Finally, you'll touch on the fact that Prometheus has a very simple design and functioning principle and that this has an impact on running it as a highly available (HA) component in your architecture. This aspect is only briefly touched upon, but don't worry: we cover this in more depth later in the workshop. At the end of each lab, including this one, you are presented with the end state (in this case, we have not yet done anything), a list of references for further reading, a list of ways to contact me for questions, and a link to the next lab. Missed Previous Labs? This is one lab in the more extensive free online workshop.
Feel free to start from the very beginning of this workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop; just stop and later pick up where you left off. Coming Up Next I'll be taking you through the next lab in this workshop, where you'll learn how to install and set up Prometheus on your own local machine. Stay tuned for more hands-on material to help you with your cloud-native observability journey. As a small taste of what's ahead, the snippet below shows the kind of configuration you'll be working with.
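This is a generic sketch rather than the workshop's own material: a minimal prometheus.yml that simply tells Prometheus to scrape its own metrics endpoint every 15 seconds.

global:
  scrape_interval: 15s              # how often Prometheus collects metrics
scrape_configs:
  - job_name: prometheus            # a label attached to every scraped series
    static_configs:
      - targets: ['localhost:9090'] # Prometheus scraping itself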
Monitoring is a small aspect of our operational needs; configuring, monitoring, and checking the configuration of tools such as Fluentd and Fluentbit can be a bit frustrating, particularly if we want to validate more advanced configuration that does more than simply lift log files and dump the content into a solution such as OpenSearch. Fluentd and Fluentbit provide us with some very powerful features that can make a real difference operationally. For example, the ability to identify specific log messages and send them to a notification service rather than waiting for the next log analysis cycle to be run by a log store like Splunk. If we want to test the configuration, we need to play log events in as if the system were really running, which means realistic logs at the right speed so we can make sure that our configuration prevents alerts or mail storms. The easiest way to do this is to either take a real log and copy the events into a new log file at the speed they occurred or create synthetic events and play them in at a realistic pace. This is what the open-source LogGenerator (aka LogSimulator) does. I created the LogGenerator a couple of years ago, having addressed the same challenges before and wanting something that would help demo Fluentd configurations for a book (Logging in Action with Fluentd, Kubernetes, and more). Why not simply copy the log file for the logging mechanism to read? There are several reasons for this. For example, if your logging framework can send the logs over the network without creating back pressure, then logs can be generated without being impacted by storage performance considerations, but there is nothing tangible to copy. If you want to simulate log events from a database in your monitoring environment, this becomes even harder, as the DB will store the logs internally. The other reason is that if you have alerting controls based on thresholds over time, you need the logs to be consumed at the correct pace. Just allowing logs to be ingested whole is not going to correctly exercise such time-based controls. Since then, I've seen similar needs to pump test events into other solutions, including OCI Queue and other Oracle Cloud services. The OCI service support has been implemented using a simple extensibility framework, so while I've focused on OCI, the same mechanism could be applied as easily to AWS' SQS, for example. A good practice for log handling is to treat each log entry as an event and think of log event handling as a specialized application of stream analytics. Given that the most common approach to streaming and stream analytics these days is based on Kafka, we're working on an adaptor for the LogSimulator that can send the events to a Kafka API point. We built the LogGenerator so it can be run as a script, so modifying it and extending its behavior is quick and easy. We started out developing with Groovy on top of Java 8, and if you want to create a JAR file, it will compile as Java. More recently, particularly for the extensions we've been working on, we've used Java 11 and its ability to run single-file classes straight from source. We've got plans to enhance the LogGenerator so we can inject OpenTelemetry events into Fluentbit and other services, but we'd love to hear about other use cases you see for this. For more on the utility, read the posts on my blog and see the documentation on GitHub.
Testing is a best-case-scenario exercise for validating a system's correctness, but it doesn't predict the failure cases that may occur in production. Experienced engineering teams would tell you that production environments are not uniform and are full of exciting deviations. The fun fact is that testing in production helps you test code changes on live user traffic, catch bugs early, and deliver a robust solution that increases customer satisfaction. But it doesn't help you detect the root cause of the failure. And that's why adopting observability in testing is critical. It gives you full-stack visibility inside the infrastructure and production to detect and resolve problems faster. As a result, people using observability are 2.1 times more likely to detect any issues and report a 69% better MTTR. Symptoms of a Lack of Observability The signs of not having proper observability showed up in the work of engineers every day. When there was a problem with production, which happened daily, the developers' attempts to find the cause of the problem would often hit a wall, and tickets would stay stuck in Jira. This happened because they didn't have enough information and details to figure out the root cause. To overcome these challenges, the developers sometimes used existing logs, which was not very helpful as they had to access logs for each service one at a time using Notepad++ and manually search through them. This made the developers feel frustrated and made it difficult for the company to clearly show customers how and when critical issues would be fixed, which could harm the company's reputation over time. Observability: What Does It Really Mean? For a tester, having proper observability means the ability to know what's happening within a system. This information is very valuable for testers. Although observability is commonly associated with reliability engineering, it helps testers better understand and investigate complex systems. This allows the tester and their team to enhance the system's quality, such as its security, reliability, and performance, to a greater extent. I found out about this problem through a challenging experience. Many others might have had a similar experience. While checking a product, I had trouble understanding the complexities of the product, which is common for testers. As I tried to understand the product by reading its instructions and talking to the people involved, I noticed that the information I gathered did not make sense. At the time, I was unfamiliar with the technical term for this, but in hindsight, it was evident that the system lacked observability. It was almost impossible to know what was happening inside the application. While testing concentrates on determining if a specific functionality performs as intended, observability concentrates on the system's overall health. As a result, they paint a complete picture of your system when taken as a whole. Traditional software testing, i.e., testing in pre-production or staging environments, focuses on validating the system's correctness. However, until you run your services inside the production environment, you won't be able to cover and predict every failure that may occur. Testing in production helps you discover all the possible failure cases of a system, thereby improving service reliability and stability. With observability, you can have an in-depth view of your infrastructure and production environments.
In addition, you can predict the failure in production environments through the telemetry data, such as logs, metrics, and traces. Observability in the production environment helps you deliver robust products to the customers. Is Observability Really Replacing Testing? From a tester's perspective, there's no replacement for the level of detail that a truly observable system can provide. Although on a practical level, observability has three pillars — logs (a record of an event that has happened inside a system), metrics (a value that reflects some particular behavior inside a system), and traces (a low-level record of how something has moved inside a system) — it is also more than those three elements. "It's not about logs, metrics, or traces," software engineer Cindy Sridharan writes in Distributed Systems Observability, "but about being data-driven during debugging and using the feedback to iterate on and improve the product." In other words, to do observability well, you not only need effective metrics, well-structured logs, and extensive tracing. You also need a mindset that is inquisitive, exploratory, and eager to learn and the processes that can make all of those things meaningful and impactful. This makes testers and observability natural allies. Testing is, after all, about asking questions about a system or application, being curious about how something works or, often, how something should work; observability is very much about all of those things. It's too bad, then, that too many testers are unaware of observability — not only will it help them do their job more effectively, but they're also exactly the sort of people in the software development lifecycle who can evangelize for building observable systems. To keep things simple, there are two key ways we should see observability as helping testers: It helps testers uncover granular details about system issues: During exploratory testing, observability can help testers find the root cause of any issues through telemetry data such as logs, traces, and metrics, helping in better collaboration among various teams and providing faster incident resolution. It helps testers ask questions and explore the system: Testers are curious and like to explore new things. With the observability tool, they can explore the system deeply and discover the issues. It helps them uncover valuable information that assists them in making informed decisions while testing. Conclusion Testing and observability go hand-in-hand in ensuring the robustness and reliability of a system. While traditional testing focuses on validating the system's correctness in pre-production environments, testing in production can uncover all the possible failure cases. On the other hand, Observability provides full-stack visibility into the infrastructure and production environments, helping detect and resolve problems faster. In addition, observability helps testers uncover granular details about system issues and enables them to ask questions and explore the system more deeply. Testers and observability are natural allies, and adopting observability can lead to better incident resolution, informed testing decisions, and increased customer satisfaction.
When exploring the capabilities of Blackbox Exporter and its role in monitoring and observability, I was eager to customize it to meet my specific production needs. Datadog is a powerful monitoring system that comes with pre-planned packages containing all the necessary services for your infrastructure. However, at times I need a more precise and intuitive solution for my infrastructure that allows me to seamlessly transition between multiple cloud monitoring systems. My use case involved the need to scrape metrics from endpoints using a range of protocols, including HTTP, HTTPS, DNS, TCP, and ICMP. That's where Blackbox Exporter came into play. It's important to note that there are numerous open-source exporters available for a variety of technologies, such as databases, message brokers, and web servers. However, for the purposes of this article, we will focus on Blackbox Exporter and how we can scrape metrics and send them to Datadog. If your system doesn't use Datadog, you can jump straight to Step 1 and Step 3. The following are the steps one takes in order to scrape custom metrics into Datadog:
Step 1: Install Blackbox Exporter using Helm, with guidance on how to use it locally or in a production environment.
Step 2: Extract custom metrics to Datadog from the Blackbox Exporter endpoints.
Step 3: Collect custom metrics from Blackbox Exporter endpoints and make them available in Prometheus, then use Grafana to visualize them for better monitoring.
Step 1: How to Install Blackbox Exporter Using Helm
We'll use Helm to install the Blackbox Exporter. If necessary, you can customize the Helm values to suit your needs. If you're running in a Kubernetes production environment, you could opt to create an ingress:

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: ingress-class
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "180"
  hosts:
    - host: blackbox-exporter.<organization_name>.com
      paths:
        - backend:
            serviceName: blackbox-exporter
            servicePort: 9115
          path: /
  tls:
    - hosts:
        - '*.<organization_name>.com'

We won't create an ingress in our tutorial, as we test the example locally. Our installation command is the following:

helm upgrade -i prometheus-blackbox-exporter prometheus-community/prometheus-blackbox-exporter --version 7.2.0

Let's try and see the Blackbox Exporter in action. We will expose the Blackbox Exporter service with port-forward:

export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=prometheus-blackbox-exporter,app.kubernetes.io/instance=prometheus-blackbox-exporter" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace default $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Visit http://127.0.0.1:9115 to use your application"
kubectl --namespace default port-forward $POD_NAME 9115:$CONTAINER_PORT

Let's visit the URL: http://localhost:9115/ Let's make a curl request to check that we get a 200 response from our Blackbox Exporter:

curl -I http://localhost:9115/probe\?target\=http://localhost:9115\&module\=http_2xx

If it passes successfully, we will see the probe recorded on the Blackbox Exporter dashboard. The probes in this tutorial rely on the http_2xx module, sketched below.
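The http_2xx module used in the probe above is defined in the exporter's configuration. The Helm chart ships a default for it; purely for orientation, a typical module definition looks roughly like this (a hedged sketch of common defaults, not necessarily the exact values in chart version 7.2.0):

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []        # empty list defaults to accepting any 2xx response
      method: GET
      follow_redirects: true
      preferred_ip_protocol: ip4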
Step 2: Extract Custom Metrics to Datadog
We'll be using the following version of the Helm chart to install the Datadog agent in our cluster. Once installed, we can specify the metrics we want to monitor by editing the configuration and adding our OpenMetrics block. The OpenMetrics check will enable us to extract custom metrics from any OpenMetrics endpoint. Our installation commands are the following:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

We are using the Prometheus integration with Datadog in order to retrieve metrics from applications. However, instead of configuring the Prometheus URL, we will set up the Blackbox Exporter endpoints. Our configuration in the Datadog Helm values looks like:

datadog:
  confd:
    openmetrics.yaml: |-
      instances:
        - prometheus_url: https://blackbox-exporter.<organization_name>.com/probe?target=https://jenkins.<organization_name>.com&module=http_2xx
          namespace: status_code
          metrics:
            - probe_success: 200
          min_collection_interval: 120
          prometheus_timeout: 120
          tags:
            - monitor_app:jenkins
            - monitor_env:production
            - service_name:blackbox-exporter
        - prometheus_url: https://blackbox-exporter.<organization_name>.com/probe?target=https://argocd.<organization_name>.com&module=http_2xx
          namespace: status_code
          metrics:
            - probe_success: 200
          min_collection_interval: 120
          prometheus_timeout: 120
          tags:
            - monitor_app:argocd
            - monitor_env:production
            - service_name:blackbox-exporter

We've selected "probe_success" as the metric to scrape and renamed it to "status_code:200" to make it more intuitive and easier to define alerts for later on. That's all. Once you log in to your Datadog dashboard, you can explore the custom metrics by filtering based on the service_name tag that we defined as "blackbox-exporter".
Step 3: Extract Custom Metrics and Visualize Them in Grafana Using Prometheus
We'll be using the following version of the Helm chart to install Prometheus in our cluster. First, we will create the values.yaml for our Helm configuration:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs: |
      - job_name: 'prometheus-blackbox-exporter'
        scrape_timeout: 15s
        scrape_interval: 15s
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - http://localhost:9115
              - http://localhost:8080
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115
alertmanager:
  enabled: false
nodeExporter:
  enabled: false

Now we can proceed with the installation of the Prometheus Stack:

helm upgrade -i prometheus prometheus-community/kube-prometheus-stack -f values.yaml --version 45.0.0

Let's utilize Kubernetes port-forwarding for Prometheus:

kubectl port-forward service/prometheus-kube-prometheus-prometheus -n default 9090:9090

To see that we're scraping metrics from the Blackbox Exporter, navigate to http://localhost:9090/metrics. You can search for the job_name "prometheus-blackbox-exporter" that we defined in the Helm values of the Prometheus Stack. Let's utilize Kubernetes port-forwarding for Grafana:

# Get the Grafana password
# Grafana username is: admin
kubectl get secrets -n default prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
kubectl port-forward service/prometheus-grafana -n default 3000:80

To import the Prometheus Blackbox Exporter dashboard, go to http://localhost:3000/dashboard/import and use dashboard ID 7587. With the data in Prometheus, you can also alert on failing probes directly; a small sketch follows below.
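This alerting rule is not part of the tutorial's Helm values; it is a minimal sketch of how a Prometheus alert on the scraped probes could look, with the alert name, duration, and labels chosen purely for illustration:

groups:
  - name: blackbox-exporter-alerts
    rules:
      - alert: EndpointDown
        # probe_success is 0 whenever the Blackbox Exporter probe fails for a target
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"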
To confirm that Prometheus is consistently collecting metrics from the specified URLs (localhost:9115, localhost:8080), you can check by visiting http://localhost:9115/ and verifying that the “recent probes” count is increasing. Summary As covered in our article, we provided a simple and manageable method for customizing your metrics according to your system monitoring requirements. Whether you are utilizing a paid monitoring system or an open-source one, it is crucial to have the ability to choose and accurately identify your production needs. A thorough understanding of this process will result in cost-effectiveness and the enhancement of team knowledge.
Disclaimer: All the views and opinions expressed in the blog belong solely to the author and not necessarily to the author's employer or any other group or individual. This is not a promotion of any service, feature, or platform. In my previous article on CloudWatch (CW) cross-account observability for AWS Organization, I provided a step-by-step guide on how to set up multi-account visibility and observability employing a newly released feature called CloudWatch cross-account observability using the AWS Console. In this article, I will provide a step-by-step guide on how you can automate CloudWatch cross-account observability for your AWS Organization using Terraform and a CloudFormation template. Please refer to my earlier article on this topic for a better understanding of concepts such as Monitoring Accounts and Source Accounts.
Monitoring Account Configuration
For monitoring account configuration, a combination of Terraform and CloudFormation is chosen, as the aws_oam_sink and aws_oam_link resources are yet to be available in terraform-provider-aws as of Feb 26, 2023. Please refer to the GitHub issue. Also, the terraform-provider-awscc has an open bug (as of Feb 26, 2023) that fails on applying the Sink policy. Please refer to the GitHub issue link for more details.
Terraform Code That Creates the OAM Sink in the Monitoring AWS Account
Feel free to customize the code to modify providers and tags or change the naming conventions as per your organization's standards.
provider.tf
Feel free to modify the AWS provider as per your AWS account, region, and authentication/authorization needs. Refer to the AWS provider documentation for more details on configuring the provider for the AWS platform.

provider "aws" {
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::MONITORING-ACCOUNT-NUMBER:role/YOUR-IAM-ROLE-NAME"
  }
}

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.53.0"
    }
  }
}

main.tf

/*
 AWS CloudFormation stack resource that runs the CFT - oam-sink-cft.yaml.
 The stack creates an OAM Sink in the current account and region as per the provider configuration.
 Please create the AWS provider configuration as per your environment. For AWS provider configuration,
 please refer to https://registry.terraform.io/providers/hashicorp/aws/2.43.0/docs
*/
resource "aws_cloudformation_stack" "cw_sink_stack" {
  name          = "example"
  template_body = file("${path.module}/oam-sink-cft.yaml")

  parameters = {
    OrgPath = var.org_path
  }

  tags = var.tags
}

/*
 SSM parameter resource puts the CloudWatch Cross Account Observability Sink ARN in the parameter store,
 so that the Sink ARN can be used from the source account while creating the Link.
*/
resource "aws_ssm_parameter" "cw_sink_arn" {
  name        = "cw-sink-arn"
  description = "CloudWatch Cross Account Observability Sink identifier"
  type        = "SecureString"
  value       = aws_cloudformation_stack.cw_sink_stack.outputs["ObservabilityAccessManagerSinkArn"]
  tags        = var.tags
}

variable.tf

variable "tags" {
  description = "Custom tags for AWS resources"
  type        = map(string)
  default     = {}
}

variable "org_path" {
  description = "AWS Organization path that will be allowed to send Metric and Log data to the monitoring account"
  type        = string
}

AWS CloudFormation Template That Is Used in the Terraform "aws_cloudformation_stack" Resource
The below CloudFormation template creates the OAM Sink resource in the Monitoring account. This template will be used to create the CloudFormation Stack in the Monitoring account.
Make sure to put the template and the Terraform files in the same directory.
oam-sink-cft.yaml

AWSTemplateFormatVersion: 2010-09-09
Description: 'AWS CloudFormation template that creates or updates a sink in the current account, so that it can be used as a monitoring account in CloudWatch cross-account observability. A sink is a resource that represents an attachment point in a monitoring account, which source accounts can link to in order to send observability data.'
Parameters:
  OrgPath:
    Type: String
    Description: 'Complete AWS Organization path for source account configuration for Metric data'
Resources:
  ObservabilityAccessManagerSink:
    Type: 'AWS::Oam::Sink'
    Properties:
      Name: "observability-access-manager-sink"
      Policy:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: "*"
            Resource: "*"
            Action:
              - "oam:CreateLink"
              - "oam:UpdateLink"
            Condition:
              ForAnyValue:StringLike:
                aws:PrincipalOrgPaths:
                  - !Ref OrgPath
              ForAllValues:StringEquals:
                oam:ResourceTypes:
                  - "AWS::CloudWatch::Metric"
                  - "AWS::Logs::LogGroup"
Outputs:
  ObservabilityAccessManagerSinkArn:
    Value: !GetAtt ObservabilityAccessManagerSink.Arn
    Export:
      Name: ObservabilityAccessManagerSinkArn

Apply the Changes in the AWS Provider Platform
Once you put all the above Terraform and CloudFormation template files in the same directory, run terraform init to install the provider and dependencies, and then terraform plan or terraform apply, depending on whether you want to only view the changes or view and apply the changes in your AWS account. Please refer to the HashiCorp website for more details on Terraform commands. When you run the terraform apply or terraform plan command, you need to input the org_path value. Make sure to provide the complete AWS Organization path to allow the AWS account(s) under that path to send the metric and log data to the monitoring account. For example, if you want to allow all the AWS accounts under the Organization Unit (OU) ou-0dsf-dasd67asd to send the metric and log data to the monitoring account (assuming the OU is directly under the Root account in the organization hierarchy), then the org_path value should look like ORGANIZATION_ID/ROOT_ID/ou-0dsf-dasd67asd/*. For more information on how to set the organization path, please refer to the AWS documentation. Once the org_path value is provided (you can also use the tfvars file to supply the variable values) and terraform apply is successful, you should see that the AWS account is designated as a Monitoring account by navigating to CloudWatch settings in the CloudWatch console.
Source Account Configuration
For source account configuration, we can use the terraform-provider-awscc, as the link resource works perfectly. Also, the aws_oam_sink and aws_oam_link resources are yet to be available in the terraform-provider-aws as of Feb 26, 2023. Please refer to the GitHub issue.
Terraform Code That Creates the OAM Link in the Source AWS Account
Feel free to customize the code to modify the provider and tags or change the naming conventions as per your organization's standards.
provider.tf
Feel free to modify the AWSCC provider as per your AWS account, region, and authentication/authorization needs. Refer to the AWSCC provider documentation for more details on configuring the provider.
provider "aws" { region = "us-east-1" assume_role { role_arn = "arn:aws:iam::MONITORING-ACCOUNT-NUMBER:role/IAM-ROLE-NAME" } } provider "awscc" { region = "us-east-1" assume_role = { role_arn = "arn:aws:iam::SOURCE-ACCOUNT-NUMBER:role/IAM-ROLE-NAME" } } terraform { required_providers { aws = { source = "hashicorp/aws" version = "4.53.0" } awscc = { source = "hashicorp/awscc" version = "0.45.0" } } } main.tf /* Link resource to create the link between the source account and the sink in the monitoring account */ resource "awscc_oam_link" "cw_link" { provider = awscc label_template = "$AccountName" resource_types = ["AWS::CloudWatch::Metric", "AWS::Logs::LogGroup"] sink_identifier = data.aws_ssm_parameter.cw_sink_arn.value } /* SSM parameter data block retrieves the CloudWatch Cross Account Observability Sink ARN from the parameter store, So that the Sink arn can be associated with the source account while creating the Link */ data "aws_ssm_parameter" "cw_sink_arn" { provider = aws name = "cw-sink-arn" } Put both the terraform files in the same directory and run the terraform init and then terraform apply commands to create the link between the source and monitoring accounts. Steps To Validate the CloudWatch Cross-Account Observability Changes Now that changes are applied in both source and monitoring accounts, it's time to validate that CloudWatch log groups and metric data are showing up in the monitoring account. Navigate to CloudWatch Console > Settings > Manage source accounts in the monitoring account. You should see the new source account is listed, and it should show that CloudWatch log and metric are being shared with the monitoring account If you navigate to CloudWatch log groups in the monitoring account, you should now see some of the log groups from the source account. Also, if you navigate to CloudWatch Metrics > All Metrics in the monitoring account, now you should see some of the Metric data from the source account.
Joana Carvalho
Performance Engineer,
Postman
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep