Google Managed Service for Prometheus Exporter

Status
  • Stability: beta (metrics)
  • Distributions: contrib
  • Code Owners: @aabmass, @dashpole, @jsuereth, @punya, @damemi, @psx95

This exporter can be used to send metrics (including trace exemplars) to Google Cloud Managed Service for Prometheus. It is one of several supported approaches for sending metrics to Google Cloud Managed Service for Prometheus.

Configuration Reference

The following configuration options are supported (a combined example sketch appears below the list):

  • project (optional): GCP project identifier.
  • user_agent (optional): Override the user agent string sent on requests to Cloud Monitoring (currently only applies to metrics). Specify {{version}} to include the application version number. Defaults to opentelemetry-collector-contrib {{version}}.
  • metric (optional): Configuration for sending metrics to Cloud Monitoring.
    • endpoint (optional): Endpoint to which metric data is sent, overriding the default Cloud Monitoring endpoint.
    • compression (optional): Compression format for Metrics gRPC requests. Supported values: [gzip]. Defaults to no compression.
    • grpc_pool_size (optional): Sets the size of the connection pool in the GCP client. Defaults to a single connection.
    • use_insecure (optional): If true, disables gRPC client transport security. Only applies if endpoint is not "".
    • add_metric_suffixes (default=true): Add type and unit suffixes to metrics.
    • extra_metrics_config (optional): Enable or disable additional metrics.
      • enable_target_info (default=true): Add target_info metric based on resource.
      • enable_scope_info (default=true): Add otel_scope_info metric and scope_name/scope_version attributes to all other metrics.
    • resource_filters (optional): Provides a list of filters to match resource attributes which will be included in metric labels.
      • prefix (optional): Match resource attribute keys by prefix.
      • regex (optional): Match resource attribute keys by regex.
  • sending_queue (optional): Configuration for how to buffer metrics before sending.
    • enabled (default = true)
    • num_consumers (default = 10): Number of consumers that dequeue batches; ignored if enabled is false.
    • queue_size (default = 1000): Maximum number of batches kept in memory before dropping data; ignored if enabled is false. Calculate this as num_seconds * requests_per_second where:
      • num_seconds is the number of seconds to buffer in case of a backend outage
      • requests_per_second is the average number of requests per second.

Note: The sending_queue is provided (and documented) by the Exporter Helper
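
For reference, a minimal sketch combining several of these options; all values shown are illustrative, not defaults:

exporters:
  googlemanagedprometheus:
    project: my-project        # hypothetical project ID
    metric:
      compression: gzip
      add_metric_suffixes: false
      extra_metrics_config:
        enable_target_info: true
        enable_scope_info: false
      resource_filters:
        - prefix: "k8s."
    sending_queue:
      enabled: true
      num_consumers: 10
      # e.g. to buffer a 60 second outage at ~5 requests per second: 60 * 5 = 300
      queue_size: 300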

Example Configuration

receivers:
    prometheus:
        config:
          scrape_configs:
            # Add your prometheus scrape configuration here.
            # Using kubernetes_sd_configs with namespaced resources (e.g. pod)
            # ensures the namespace is set on your metrics.
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: true
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                action: replace
                target_label: __metrics_path__
                regex: (.+)
              - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                action: replace
                regex: (.+):(?:\d+);(\d+)
                replacement: $$1:$$2
                target_label: __address__
              - action: labelmap
                regex: __meta_kubernetes_pod_label_(.+)
processors:
    batch:
        # batch metrics before sending to reduce API usage
        send_batch_max_size: 200
        send_batch_size: 200
        timeout: 5s
    memory_limiter:
        # drop metrics if memory usage gets too high
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20
    resourcedetection:
        # detect cluster name and location
        detectors: [gcp]
        timeout: 10s
    transform:
      # "location", "cluster", "namespace", "job", "instance", and "project_id" are reserved, and 
      # metrics containing these labels will be rejected.  Prefix them with exported_ to prevent this.
      metric_statements:
      - context: datapoint
        statements:
        - set(attributes["exported_location"], attributes["location"])
        - delete_key(attributes, "location")
        - set(attributes["exported_cluster"], attributes["cluster"])
        - delete_key(attributes, "cluster")
        - set(attributes["exported_namespace"], attributes["namespace"])
        - delete_key(attributes, "namespace")
        - set(attributes["exported_job"], attributes["job"])
        - delete_key(attributes, "job")
        - set(attributes["exported_instance"], attributes["instance"])
        - delete_key(attributes, "instance")
        - set(attributes["exported_project_id"], attributes["project_id"])
        - delete_key(attributes, "project_id")

exporters:
    googlemanagedprometheus:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch, memory_limiter, transform, resourcedetection]
      exporters: [googlemanagedprometheus]

Resource Attribute Handling

The Google Managed Prometheus exporter maps metrics to the prometheus_target monitored resource. The logic for mapping to monitored resources is designed to be used with the prometheus receiver, but can be used with other receivers as well. To avoid collisions (i.e. "duplicate timeseries encountered" errors), you need to ensure the prometheus_target resource uniquely identifies the source of metrics. The exporter uses the following resource attributes to determine the monitored resource:

  • location: [location, cloud.availability_zone, cloud.region]
  • cluster: [cluster, k8s.cluster.name]
  • namespace: [namespace, k8s.namespace.name]
  • job: [service.name + service.namespace]
  • instance: [service.instance.id]

In the configuration above, cloud.availability_zone, cloud.region, and k8s.cluster.name are detected using the resourcedetection processor with the gcp detector. The prometheus receiver sets service.name to the configured job_name, and service.instance.id is set to the scrape target's instance. The prometheus receiver sets k8s.namespace.name when using role: pod.
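
As an illustration, here is a hypothetical set of resource attributes (values invented for this example) and the prometheus_target labels they would map to, following the precedence above:

# OpenTelemetry resource attributes on incoming metrics (hypothetical values):
#   cloud.region:        us-central1
#   k8s.cluster.name:    demo-cluster
#   k8s.namespace.name:  default
#   service.name:        kubernetes-pods
#   service.instance.id: 10.4.0.12:8080
#
# Resulting prometheus_target monitored resource (illustrative):
#   location:  us-central1
#   cluster:   demo-cluster
#   namespace: default
#   job:       kubernetes-pods   # combined with service.namespace when it is set
#   instance:  10.4.0.12:8080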

Manually Setting location, cluster, or namespace

In GMP, the attributes above are used to identify the prometheus_target monitored resource. It is therefore recommended to avoid writing metric or resource labels that match these keys, since doing so can cause errors when exporting metrics to GMP or when querying them from GMP. The recommended way to set them is with the resourcedetection processor.

If you still need to set location, cluster, or namespace labels (such as when running in non-GCP environments), you can do so with the resource processor like so:

processors:
  resource:
    attributes:
    - key: "location"
      value: "us-east1"
      action: upsert

Setting cluster, location or namespace using metric labels

The transform processor in the example configuration above copies each reserved metric attribute (for example, location) to a new exported_* attribute (exported_location), then deletes the original. It is recommended to use the exported_* prefix, which is consistent with GMP's behavior. A minimal sketch covering only location is shown below.
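
A minimal sketch, reusing the statements from the full configuration above but restricted to the location attribute:

processors:
  transform:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["exported_location"], attributes["location"])
      - delete_key(attributes, "location")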

You can also use the groupbyattrs processor to move metric labels to resource labels. This is useful in situations where, for example, an exporter monitors multiple namespaces (with each namespace exported as a metric label). One such example is kube-state-metrics.

Using groupbyattrs will promote that label to a resource label and associate those metrics with the new resource. For example:

processors:
  groupbyattrs:
    keys:
    - namespace
    - cluster
    - location

Feature-gates

  • exporter.googlemanagedpromethues.intToDouble (default=false): Changes all metric data point types to double to prevent Value type for metric <metric name> conflicts with the existing value type errors:
"--feature-gates=exporter.googlemanagedpromethues.intToDouble"

Troubleshooting

Conflicting Value Types

Error: Value type for metric <metric name> conflicts with the existing value type

Google Managed Service for Prometheus (and Google Cloud Monitoring) have fixed value types (INT and DOUBLE) for metrics. Once a metric has been written as an INT or DOUBLE, attempting to write the other type will fail with the error above. This commonly occurs when a metric's value type has changed, or when a mix of INT and DOUBLE for the same metric is being written to the same project. The recommended way to fix this is to convert all metrics to DOUBLE to prevent collisions using the exporter.googlemanagedpromethues.intToDouble feature gate, documented above.

Once you enable the feature gate, you will likely see new errors indicating type collisions, as some existing metrics will be changed from int to double. To fix this, you need to delete the metric descriptor. This will delete all existing data for the metric, but will allow it to be written as a double going forward. The simplest way to do this is by using the "Try this method" tab in the API reference for DeleteMetricDescriptor.

Points Written Too Frequently

Error: One or more points were written more frequently than the maximum sampling period configured for the metric.

Google Managed Service for Prometheus (and Google Cloud Monitoring) limit the rate at which points can be written to one point every 5 seconds. If you try to write points more frequently, you will encounter the error above. If you know that you aren't writing points more frequently than every 5 seconds, this can be a symptom of the Timeseries Collision problem below.

Timeseries Collision

Error: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.

Error: Points must be written in order. One or more of the points specified had an older start time than the most recent point.

Explanation for Errors

The errors above, and sometimes the points were written more frequently than the maximum sampling period error, can indicate that two metric data points are being written without any resource or metric attributes to distinguish them from each other. We refer to this as a "Timeseries Collision".

Duplicate TimeSeries encountered is the clearest indication of a timeseries collision. It means that two timeseries in a single request had identical monitored resource and metric labels.

Points must be written in order often indicates that two different collectors are writing the same timeseries, since they can race to deliver the same metric with slightly different timestamps. If the later timestamp is delivered first, it triggers this error. The duplicates don't appear in the same request, so it doesn't trigger the Duplicate TimeSeries encountered error, but they do still collide.

points were written more frequently than the maximum sampling period also often indicates that two different collectors are writing the same timeseries, but happens when the first timestamp is delivered first, and the later timestamp is delivered second. In this case, the points are in order, but are rejected because they are too close together.

Root-causing Timeseries Collisions

There are three main root causes for timeseries collisions:

  1. Resource attributes don't distinguish applications.
  2. Resource attributes are dropped by the exporter.
  3. Metric data point attributes don't distinguish timeseries (very rare).

The most common reason is (1), which means that it can be fixed by adding resource information. If you are running on GCP, you can use the resourcedetection processor with the gcp detector. If you are running on Kubernetes (including GKE), we recommend also using the k8sattributes processor to at least add k8s.namespace.name and k8s.pod.name. Finally, it is important to make sure service.name and service.instance.id are set by applications in a way that uniquely identifies each instance.
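
A sketch of such a processor setup (minimal and illustrative; adjust detectors and extracted metadata to your environment):

processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.pod.name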

The next most common reason is (2), which means that the exporter's mapping logic from the OpenTelemetry resource to Google Cloud's prometheus_target monitored resource didn't preserve a resource attribute that was needed to distinguish timeseries. This can be mitigated by adding resource attributes as metric labels using the resource_filters configuration in the exporter:

  googlemanagedprometheus:
    metric:
      resource_filters:
        - regex: ".*"

If you need to troubleshoot errors further, start by filtering down to a single metric from the error message using the filter or transform processors, and using the debug exporter with detailed verbosity:

processors:
  filter:
    error_mode: ignore
    metrics:
      metric:
        - name != "problematic.metric.name"
exporters:
  debug:
    verbosity: detailed

That can help identify which metric sources are colliding, so you know which applications or metrics need additional attributes to distinguish them from one another.
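
For completeness, a hypothetical way to wire that troubleshooting setup into a temporary pipeline (assuming the prometheus receiver from the example above):

service:
  pipelines:
    metrics/debug:
      receivers: [prometheus]
      processors: [filter]
      exporters: [debug]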