
Observability

Chinmina produces traces and metrics via OpenTelemetry, and logs to stdout via zerolog.

For audit log details, see the auditing reference. For complete telemetry technical details, see the telemetry reference.

Set OBSERVE_ENABLED=true to enable telemetry collection.

Choose an exporter type with OBSERVE_TYPE:

  • "grpc" (default): Send to an OpenTelemetry collector via gRPC
  • "stdout": Write to standard output (development only)

For gRPC export to a collector:

OBSERVE_ENABLED=true
OBSERVE_TYPE=grpc
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317

For stdout export during development:

OBSERVE_ENABLED=true
OBSERVE_TYPE=stdout

See the configuration reference for all OBSERVE_* variables, including collector settings, batch timeouts, and metric read intervals.
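The settings above can be sanity-checked before deployment. The following is a minimal sketch, not part of Chinmina: the helper name `read_observe_settings` is hypothetical, and it assumes that telemetry is disabled when OBSERVE_ENABLED is unset and that only the literal value `true` enables it. The service's own parsing may accept other values.

```python
import os

# Exporter types named in this guide.
VALID_TYPES = {"grpc", "stdout"}


def read_observe_settings(env=None):
    """Return (enabled, exporter_type) from OBSERVE_* variables.

    Hypothetical pre-deployment check for illustration only.
    """
    env = os.environ if env is None else env
    # Assumption: unset means disabled, and only "true" enables telemetry.
    enabled = env.get("OBSERVE_ENABLED", "false") == "true"
    # "grpc" is the documented default exporter type.
    exporter_type = env.get("OBSERVE_TYPE", "grpc")
    if exporter_type not in VALID_TYPES:
        raise ValueError(f"unsupported OBSERVE_TYPE: {exporter_type!r}")
    return enabled, exporter_type
```

Calling `read_observe_settings()` with no argument reads the current process environment, so a typo in OBSERVE_TYPE is reported before the service starts.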

Critical user journeys (CUJs) define the key operations that affect users of the system. Each CUJ maps to a trace structure and a set of service level indicators (SLIs) to monitor.

Token generation produces a GitHub token for the pipeline’s repository. This is the primary operation and the critical path for pipeline execution.

Endpoint: POST /token

Trace structure:

Server span: POST /token
├── Client span: GET api.buildkite.com/v2/.../pipelines/...
└── Client span: POST api.github.com/app/installations/.../access_tokens

The server span captures total request duration and HTTP status. The Buildkite API span shows pipeline lookup performance, and the GitHub API span shows token creation performance.
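One way to read this trace is to split the server span's duration into the two API calls and the remainder. The sketch below is illustrative only: the function name and the millisecond inputs are hypothetical, and it assumes the two client spans run sequentially without overlapping.

```python
def latency_breakdown(server_ms, buildkite_ms, github_ms):
    """Split a server span's duration into API time and local overhead.

    Assumes the client spans are sequential children of the server span,
    so the remainder is time spent inside the service itself.
    """
    overhead_ms = server_ms - buildkite_ms - github_ms
    return {
        "buildkite_ms": buildkite_ms,
        "github_ms": github_ms,
        "overhead_ms": overhead_ms,
    }


# Example with made-up durations: most of the time is spent in the GitHub call.
example = latency_breakdown(server_ms=850, buildkite_ms=200, github_ms=600)
```

A large overhead value relative to the API spans suggests looking inside the service rather than at the external dependencies.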

SLIs to monitor:

  • p95/p99 server span duration
  • HTTP 5xx error rate
  • Cache hit rate (cached requests skip both API calls)

Suggested SLO targets:

Metric | Objective | Rationale
Success rate | 99.9% | Critical path for pipeline execution
p99 latency | < 2s | Minimize delay in clone operations
p95 latency | < 1s | Typical case performance
Cache hit rate | > 70% | Reduce API load and latency
GitHub API p95 latency | < 500ms | Monitor external dependency health
Buildkite API p95 latency | < 300ms | Monitor external dependency health
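The server-side targets can be checked against a sample of requests. This is a rough sketch rather than a monitoring implementation: the function names are hypothetical, it uses the nearest-rank percentile method, and it counts any non-5xx response as a success. In practice these values would usually come from queries in the telemetry backend.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slos(requests):
    """Check the suggested server-side targets.

    requests: list of (duration_seconds, http_status) tuples.
    Assumption: a request succeeds unless it returns HTTP 5xx.
    """
    durations = [duration for duration, _ in requests]
    successes = sum(1 for _, status in requests if status < 500)
    return {
        "success_rate": successes / len(requests) >= 0.999,
        "p99_latency": percentile(durations, 99) < 2.0,
        "p95_latency": percentile(durations, 95) < 1.0,
    }
```

Each key in the result maps to `True` when the sample meets the corresponding objective in the table.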

Endpoint: POST /git-credentials

The trace structure is identical to token generation because both endpoints share the same underlying implementation. Git retries failed requests automatically, so slow responses directly delay clone operations. Monitor the same SLIs and SLO targets as token generation.

Endpoints: POST /organization/token/{profile}, POST /organization/git-credentials/{profile}

These endpoints generate tokens scoped to the repositories defined in an organization profile rather than to the pipeline’s own repository.

Trace structure:

Server span: POST /organization/token/{profile}
└── Client span: POST api.github.com/app/installations/.../access_tokens

No Buildkite API call occurs because the repository is determined by the profile configuration. Monitor the same SLIs as token generation, but expect lower latency on uncached requests due to the single API call.

Suggested SLO targets: Same as the token endpoints. Of the external API targets, only the GitHub API target applies:

Metric | Objective | Rationale
GitHub API p95 latency | < 500ms | Monitor external dependency health

The profile refresh periodically fetches organization profile configurations from the configuration source.

Trace structure:

Internal span: refresh_organization_profile
└── Client span: GET api.github.com/...

Attributes:

  • profile.digest_current: Previous configuration hash
  • profile.digest_updated: New configuration hash
  • profile.digest_changed: Whether content changed

SLIs to monitor:

  • Span error rate (fetch failures affect profile availability)
  • profile.digest_changed frequency (unexpected changes may indicate configuration issues)
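The relationship between the three digest attributes can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the function name is hypothetical, and the use of SHA-256 over the raw configuration bytes is an assumption, since this guide does not state which hash is used.

```python
import hashlib


def refresh_attributes(previous_config: bytes, fetched_config: bytes) -> dict:
    """Build the span attributes for one refresh.

    Assumption: digests are SHA-256 hex strings of the raw config bytes.
    """
    current = hashlib.sha256(previous_config).hexdigest()
    updated = hashlib.sha256(fetched_config).hexdigest()
    return {
        "profile.digest_current": current,
        "profile.digest_updated": updated,
        # True only when the fetched content differs from the previous one.
        "profile.digest_changed": current != updated,
    }
```

If `profile.digest_changed` is true more often than the configuration is actually edited, that points to the unexpected changes mentioned above.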

Symptoms: p95/p99 latency exceeds objectives

Investigation:

  1. Check external API span durations
  2. Verify cache hit rate meets objectives
  3. Review connection timing attributes
  4. Check for network issues between service and APIs

Remediation:

  • Increase token TTL to improve cache hit rate
  • Review network path to external APIs
  • Consider connection pooling configuration

Symptoms: HTTP 5xx error rate above threshold

Investigation:

  1. Filter traces by error status
  2. Examine error messages in span events
  3. Check audit logs for detailed error information
  4. Verify external API availability

Remediation:

  • Review GitHub App permissions
  • Verify Buildkite API token scopes
  • Check profile match conditions
  • Investigate panic recovery patterns

Symptoms: Cache hit rate below 70%

Investigation:

  1. Calculate hit/miss/mismatch ratio using token.cache.outcome
  2. Check token expiry times in audit logs
  3. Review repository access patterns
  4. Examine profile configurations

Remediation:

  • Increase token expiry duration (if GitHub App allows)
  • Consolidate repository access patterns
  • Review profile match conditions
  • Consider organizational endpoint usage
  • Enable the distributed cache to share tokens across replicas
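The first investigation step above, calculating the hit/miss/mismatch ratio, can be sketched like this. The function name is hypothetical, and the outcome strings `hit`, `miss`, and `mismatch` are assumed spellings of the token.cache.outcome values. The input is a list of outcomes exported from the telemetry backend.

```python
from collections import Counter

# Assumed spellings of the token.cache.outcome attribute values.
OUTCOMES = ("hit", "miss", "mismatch")


def outcome_ratios(outcomes):
    """Return the share of each cache outcome in a sample of requests."""
    counts = Counter(outcomes)
    total = sum(counts[name] for name in OUTCOMES)
    if total == 0:
        return {name: 0.0 for name in OUTCOMES}
    return {name: counts[name] / total for name in OUTCOMES}


def meets_hit_rate_target(outcomes, target=0.70):
    """Compare the hit share against the suggested objective (> 70%)."""
    return outcome_ratios(outcomes)["hit"] > target
```

A high mismatch share, as opposed to a high miss share, is worth separating out because the two point at different remediation items in the list above.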

Symptoms: The cache.encryption.total counter increases with encryption.outcome="error", or trace spans show cache.encrypt.outcome="error" or cache.decrypt.outcome="error". Decrypt failures surface as cache misses (the service falls back to API calls). Encrypt failures prevent caching of new tokens.

Investigation:

  1. Check error rate by operation type (encrypt vs decrypt) using cache.encryption.total
  2. Filter trace spans for cache.decrypt.outcome="error" or cache.encrypt.outcome="error" to correlate errors with specific requests
  3. Review service logs for specific error messages. Decrypt errors fall into three categories:
    • Missing cb-enc: prefix: the cached value is unencrypted, common during encryption rollout
    • Base64 decode failure: corrupted data in Valkey
    • Decryption failure: wrong key, incomplete key rotation, or corrupted ciphertext
  4. Check logs for keyset refresh warnings ("failed to refresh encryption keyset")

Remediation:

  • Prefix errors during encryption rollout are expected: unencrypted entries resolve as cached tokens expire (within 15 minutes)
  • Decryption failures after key rotation: verify the rotation procedure in the distributed cache guide and confirm the old primary key was not disabled before cached tokens expired
  • Keyset refresh warnings: verify IAM permissions for Secrets Manager and KMS, then check service health
  • Persistent errors with no configuration changes: check Valkey connectivity and data integrity
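The three decrypt error categories listed in the investigation steps can be told apart in the order a cached value is processed. The sketch below is a simplified model, not the service's code: the function name and the `decrypt` callable are hypothetical, and standard base64 encoding is assumed.

```python
import base64

# Prefix that marks an encrypted cache entry, as described above.
PREFIX = "cb-enc:"


def classify_cached_value(value: str, decrypt) -> str:
    """Classify a cached value into "ok" or one of three error categories.

    decrypt: hypothetical callable that takes ciphertext bytes and raises
    on failure (wrong key, incomplete rotation, corrupted ciphertext).
    """
    if not value.startswith(PREFIX):
        # Unencrypted entry, common during encryption rollout.
        return "missing_prefix"
    try:
        # Assumption: standard base64 alphabet with strict validation.
        ciphertext = base64.b64decode(value[len(PREFIX):], validate=True)
    except ValueError:
        # binascii.Error is a ValueError subclass; this indicates corrupted data.
        return "base64_decode_failure"
    try:
        decrypt(ciphertext)
    except Exception:
        return "decryption_failure"
    return "ok"
```

Grouping log lines by these categories makes it easier to pick the matching remediation item: prefix errors usually clear on their own, while decryption failures call for a review of key rotation.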