Many of our customers ask us how they can calculate the ROI of their observability platforms. It’s a tough question, and it comes up because decision-makers often feel they may be overpaying for observability when things are running smoothly, especially when it comes to their applications. What they may not realize is that, in the event of an incident, every additional data point becomes a crucial part of swiftly resolving issues and restoring proper application functionality.
In our recent webinar, “Spend on the Signals, Not on the Noise: Cost-Conscious Observability 101,” Navin Pai, Head of Engineering at OpsVerse, walked through five strategies companies can use to control rising observability costs. This blog delves deeper into those strategies and how they can benefit your organization.
1. Managing Data Ingestion
Observability tools generate a vast amount of data because they monitor applications in detail. This data comes from diverse sources and in different formats, such as logs, metrics, and traces, but not all of it is useful. To separate the data that matters from the rest, here are a few ingestion strategies that make managing large volumes of data much easier:
Data Sampling
Data Sampling refers to the practice of collecting and analyzing a subset of data from a larger dataset to gain insights into a system’s behavior or performance.
There are various sampling strategies, including:
- Random sampling: Selecting data points randomly from the entire dataset.
- Time-based sampling: Collecting data at regular intervals, such as every minute or every hour.
- Frequency-based sampling: Collecting data based on the frequency of events, such as sampling every nth event.
- Weighted sampling: Assigning different probabilities to different data points based on their importance or relevance.
Which sampling strategy to use depends on various factors, including the nature of the system being observed, the resources available for data storage and analysis, and the specific observability requirements of the organization.
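As a rough illustration, here is a minimal Python sketch of weighted sampling for trace spans; the span fields, routes, and keep-probabilities are illustrative assumptions, not recommended values:

```python
import random

# Illustrative spans; the field names and values are assumptions for this sketch.
spans = [
    {"route": "/checkout", "status": 500, "duration_ms": 840},
    {"route": "/health", "status": 200, "duration_ms": 3},
    {"route": "/search", "status": 200, "duration_ms": 120},
]

def keep_probability(span):
    """Weighted sampling: keep all errors, half of slow requests,
    and roughly 1 in 20 routine requests."""
    if span["status"] >= 500:
        return 1.0
    if span["duration_ms"] > 500:
        return 0.5
    return 0.05

kept = [s for s in spans if random.random() < keep_probability(s)]
print(f"Kept {len(kept)} of {len(spans)} spans")
```

Random sampling is simply the case where the keep-probability is the same fixed value for every data point.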
Tiered Logging
Tiered logging is a best practice for managing data ingestion in observability, helping to prioritize and filter log data based on its importance and urgency; a short configuration sketch follows the list below. The common logging levels are:
- DEBUG logging: This includes detailed information, typically of interest only when diagnosing problems. Debug logging can be enabled selectively or conditionally (e.g. only for specific modules or in development environments) to prevent overwhelming the system with excessive data.
- INFO logging: This includes informational messages that highlight the progress of the application at a coarse-grained level, confirming that things are working as expected. Use it for the regular operational logs that are essential for understanding system performance without excessive detail, and keep info logs concise and relevant to maintain low-latency data ingestion.
- WARNING logging: These logs indicate potentially harmful situations that still allow the application to continue running. They highlight issues that need attention but are not immediately critical. Prioritize logging warnings so that potential issues are captured without recording unnecessary information.
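Here is a minimal sketch of conditional, tiered logging using Python’s standard logging module; the module name app.payments and the APP_ENV variable are placeholders:

```python
import logging
import os

# Root configuration: INFO and above for regular operational logs.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# Enable DEBUG selectively, e.g. only for one module and only outside
# production, so verbose diagnostics never flood the ingestion pipeline.
if os.getenv("APP_ENV", "development") != "production":
    logging.getLogger("app.payments").setLevel(logging.DEBUG)

log = logging.getLogger("app.payments")
log.debug("Cart contents: %s", {"sku": "A-123", "qty": 2})   # diagnostic detail
log.info("Order submitted")                                  # coarse-grained progress
log.warning("Payment gateway latency above threshold")       # needs attention, not fatal
```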
Compressing Data
Using Protobuf for data compression in observability platforms can yield significant performance and cost benefits. Metrics can be encoded using Protobuf before being transmitted to storage or monitoring systems. This reduces the size of the payload, optimizing both network usage and storage requirements.
Logs, especially high-volume ones, can be serialized with Protobuf as well. The compact representation stores more logs in the same space and speeds up transmission.
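As a hedged sketch, here is how metric payloads might be serialized with Protobuf in Python, assuming a small metrics.proto schema compiled with protoc into a metrics_pb2 module; the message name and fields are illustrative assumptions:

```python
# Assumes a schema like the following, compiled with
# `protoc --python_out=. metrics.proto` into a metrics_pb2 module:
#
#   message Metric {
#     string name = 1;
#     double value = 2;
#     int64 timestamp_unix_ms = 3;
#     map<string, string> labels = 4;
#   }
import json

import metrics_pb2  # generated module; the schema above is an assumption

metric = metrics_pb2.Metric(
    name="http_requests_total",
    value=1042,
    timestamp_unix_ms=1718000000000,
)
metric.labels["service"] = "checkout"

payload = metric.SerializeToString()  # compact binary encoding

# The equivalent JSON payload, for a rough size comparison.
json_payload = json.dumps({
    "name": "http_requests_total",
    "value": 1042,
    "timestamp_unix_ms": 1718000000000,
    "labels": {"service": "checkout"},
}).encode()

print(len(payload), "bytes (Protobuf) vs", len(json_payload), "bytes (JSON)")
```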
Another way to optimize observability costs is to drop unused namespaces, metrics, and labels. Unused namespaces can clutter the observability system, making it harder to navigate and manage. By removing them, you streamline the data structure and reduce overhead. Dropping unused metrics can help you focus on relevant data – reducing costs and improving query performance, too.
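A minimal sketch of this kind of filtering, applied before data is forwarded to storage; the metric and label names in the deny lists are placeholders:

```python
# Metric names and label keys that nobody queries; both lists are placeholders.
DROPPED_METRICS = {"legacy_cache_hits_total", "debug_heap_sample_bytes"}
DROPPED_LABELS = {"pod_template_hash", "controller_revision_hash"}

def filter_metrics(metrics):
    """metrics: a list of dicts like {"name": ..., "labels": {...}, "value": ...}.
    Drops unused metrics entirely and strips unused labels from the rest."""
    kept = []
    for metric in metrics:
        if metric["name"] in DROPPED_METRICS:
            continue  # never shown on a dashboard or used in an alert
        metric["labels"] = {
            k: v for k, v in metric["labels"].items() if k not in DROPPED_LABELS
        }
        kept.append(metric)
    return kept
```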
Converting Logs to Metrics
The conversion of logs to metrics involves transforming raw log data into numerical measurements or aggregated statistics that provide insights into application behavior and performance. As logs contain extensive, detailed information, converting them to metrics helps distill this data into manageable, focused summaries. For example, instead of analyzing every single log entry for error messages, you can track the number of errors per minute.
Storing and querying metrics is generally more efficient than dealing with raw log data, especially as the volume of logs grows. This efficiency reduces the overhead on storage and processing resources.
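A small sketch of that idea, assuming simple space-delimited log lines with an ISO timestamp and a level field; the format is an assumption for illustration:

```python
from collections import Counter
from datetime import datetime

def errors_per_minute(log_lines):
    """Aggregate raw log lines such as
    '2024-06-10T12:01:33+00:00 ERROR payment failed'
    into a per-minute error counter."""
    counts = Counter()
    for line in log_lines:
        timestamp, level, _message = line.split(" ", 2)
        if level != "ERROR":
            continue
        minute = datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")
        counts[minute] += 1
    return counts  # e.g. {"2024-06-10T12:01": 2}; far cheaper to store than raw logs

print(errors_per_minute([
    "2024-06-10T12:01:33+00:00 ERROR payment failed",
    "2024-06-10T12:01:40+00:00 INFO request served",
    "2024-06-10T12:01:41+00:00 ERROR payment failed",
]))
```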
2. Rightsizing Infrastructure
Choosing Machines
Every organization has distinct requirements, constrained by memory and CPU usage, so engineers are advised to pick machines based on their storage and query needs. For Amazon Web Services (AWS) infrastructure, Navin Pai highly recommends analyzing and selecting machine types (T, R, M, or C) based on those requirements, as it makes a noticeable difference in optimizing observability costs.
Leveraging Spot Instances and Annual Commits
Spot instances are unused EC2 capacity that AWS sells at a discount, which can significantly reduce compute costs. They are a good fit for non-critical data processing tasks that can tolerate interruptions, such as batch processing of logs or metrics, and for intermediate storage of observability data before it is ingested into more stable systems.
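As a hedged example, here is a boto3 sketch that launches a one-time spot instance for an interruption-tolerant job such as batch log reprocessing; the AMI ID, instance type, region, and tag are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a spot instance for work that tolerates interruption, e.g. batch
# reprocessing of logs. All identifiers below are placeholders.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "workload", "Value": "log-batch-reprocessing"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```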
Another way to optimize observability costs is by committing to annual or longer-term contracts with cloud providers. Organizations can assess their historical data usage, storage, and compute needs to better understand their baselines and peaks, and then project future usage requirements from that information.
3. Optimizing Storage
Using Tiered Storage
Navin Pai emphasizes the importance of strategically distributing data across various storage mediums based on its age and usage frequency. According to Pai, the most recent data, typically from the past week or month, should be stored in memory or Solid State Drives (SSDs) due to their high-speed access, which aligns with the need for rapid querying and analysis. As data ages, its access frequency decreases, making it more cost-effective to transition older data to Hard Disk Drives (HDDs) and eventually to more economical storage solutions like Amazon S3 for data from several months ago.
For organizations in highly regulated industries such as finance and healthcare, the challenge extends beyond cost-efficiency: these sectors are often required to retain data for extended periods to comply with stringent regulatory standards. For such long-term storage needs, Pai suggests services like Amazon S3 Glacier. Designed specifically for data archiving, Glacier lets companies securely store infrequently accessed data for years, remaining compliant with regulatory mandates without incurring the high costs of faster storage tiers.
Identifying a Retention Policy
A retention policy based on the aforementioned tiered storage strategy provides the flexibility to scale storage needs up or down based on data volume and access patterns. This scalability ensures that organizations only pay for the storage they need, when they need it, without overcommitting resources. By systematically transitioning data across different storage tiers, organizations can optimize their storage infrastructure and avoid the inefficiencies of keeping all data on high-cost storage.
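One common way to express such a policy is an S3 lifecycle configuration. The sketch below (using boto3) transitions logs to cheaper storage classes as they age and eventually expires them; the bucket name, prefix, and day thresholds are placeholders to adapt to your own retention requirements:

```python
import boto3

s3 = boto3.client("s3")

# Tiered retention for archived observability data; all values are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="observability-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tiered-log-retention",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # long-term archive
            ],
            "Expiration": {"Days": 730},                      # delete after ~2 years
        }],
    },
)
```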
4. Selecting the Right Tools
Use Tools That Compress Data
Using tools like Grafana Loki with S3 for logging and ClickHouse for tracing can significantly help in optimizing observability costs through efficient data compression and scalable storage. These tools typically employ efficient data compression algorithms to minimize the storage footprint of logs and traces. Furthermore, S3 and ClickHouse are designed to provide scalable storage solutions. As your application grows and generates more logs and traces, these tools can seamlessly scale to accommodate the increasing volume of data without significant infrastructure changes.
S3 offers cost-effective storage options, such as infrequent access storage classes, which store less frequently accessed data at a lower price. By leveraging these storage classes for archived or historical logs, you can further optimize observability costs without sacrificing accessibility.
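As a hedged illustration of compression-friendly trace storage in ClickHouse, here is a sketch using the clickhouse-driver Python package; the table layout, codecs, and TTL are illustrative assumptions, not a recommended schema:

```python
from clickhouse_driver import Client  # the clickhouse-driver package; an assumed dependency

client = Client(host="localhost")

# Illustrative trace table: column codecs (ZSTD, Delta) plus MergeTree ordering
# keep the on-disk footprint small, and the TTL expires old rows automatically.
client.execute("""
    CREATE TABLE IF NOT EXISTS traces (
        trace_id    String                CODEC(ZSTD(3)),
        service     LowCardinality(String),
        operation   LowCardinality(String),
        start_time  DateTime              CODEC(Delta, ZSTD),
        duration_ms Float64               CODEC(ZSTD(3))
    )
    ENGINE = MergeTree
    ORDER BY (service, start_time)
    TTL start_time + INTERVAL 30 DAY
""")
```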
Downsampling Older Data
Tools that downsample older data instead of outright deleting it can be instrumental in optimizing observability costs while still retaining valuable insights from historical logs and traces. Downsampling involves reducing the granularity or frequency of data points, allowing organizations to store historical data in a more cost-efficient way. Instead of storing high-resolution data indefinitely, downsampling enables the retention of summarized or aggregated data points which require less storage space. This approach significantly reduces storage costs compared to storing all data at full fidelity.
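A minimal sketch of downsampling in Python, collapsing high-resolution points into five-minute min/avg/max summaries; the bucket size and data shape are illustrative assumptions:

```python
from collections import defaultdict

def downsample(points, bucket_seconds=300):
    """Collapse high-resolution (unix_timestamp, value) points into
    per-bucket min/avg/max summaries."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        bucket: {"min": min(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for bucket, vals in buckets.items()
    }

# 30 raw points at 10-second resolution collapse into a single 5-minute summary.
raw = [(1718000100 + i * 10, float(i)) for i in range(30)]
print(downsample(raw))
```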
5. Networking Options
VPC Peering
VPC peering, in the context of optimizing observability costs, can play a significant role by streamlining data flows and reducing unnecessary data transfers across network boundaries. Observability often involves sending telemetry data, logs, and metrics from various services and resources within a cloud environment to monitoring and analytics tools. Without VPC peering, this data transfer might require routing through the public internet or using costly interregional communication mechanisms. By establishing VPC peering connections between VPCs that host your observability tools and those that host the resources generating data, organizations can avoid these unnecessary data transfer costs.
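As a hedged sketch, here is how a peering connection between an observability VPC and an application VPC might be created with boto3; the VPC IDs and region are placeholders, and route table updates are still required on both sides:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Peer the VPC hosting observability tooling with the VPC hosting the
# workloads that emit telemetry. Both VPC IDs are placeholders.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0123456789abcdef0",      # observability VPC
    PeerVpcId="vpc-0fedcba9876543210",  # application VPC
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The peer side must accept the request before traffic can flow privately.
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Route tables on both sides still need routes pointing at pcx_id so that
# telemetry traffic stays off the public internet.
```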
Avoiding Cross-Availability-Zone (Cross-AZ) Data Access
Cloud providers typically charge for data transferred between different AZs and regions. When observability tools, such as logging or monitoring services, need to access data generated by resources deployed across multiple AZs, the resulting data transfer costs can be significant. By architecting the observability infrastructure so that data access stays within a single AZ or region, these cross-AZ and cross-region transfer charges can be reduced to a large extent.
Furthermore, storing and processing observability data within the same AZ or region as the resources that generate it improves data locality, leading to faster data access and reduced network overhead.
Conclusion
Optimizing observability costs isn’t just about trimming expenses; it’s also about ensuring the efficient utilization of resources while maintaining the agility needed to respond swiftly to incidents. The strategies outlined here, originally provided by Navin Pai, offer a comprehensive approach to achieving this cost-conscious observability. To dive deeper into these strategies and other essential details, watch the webinar here.