# Observability
Chinmina produces traces and metrics via OpenTelemetry, and logs to stdout via zerolog.
For audit log details, see the auditing reference. For complete telemetry technical details, see the telemetry reference.
## Enabling OpenTelemetry

Set `OBSERVE_ENABLED=true` to enable telemetry collection.
Choose an exporter type with `OBSERVE_TYPE`:

- `"grpc"` (default): send to an OpenTelemetry collector via gRPC
- `"stdout"`: write to standard output (development only)
## Minimal configuration

For gRPC export to a collector:

```sh
OBSERVE_ENABLED=true
OBSERVE_TYPE=grpc
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
```

For stdout export during development:

```sh
OBSERVE_ENABLED=true
OBSERVE_TYPE=stdout
```

See the configuration reference for all `OBSERVE_*` variables, including collector settings, batch timeouts, and metric read intervals.
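On the collector side, any OTLP-capable receiver works. The sketch below is a minimal OpenTelemetry Collector configuration that accepts OTLP traffic over gRPC on the default port 4317; the `debug` exporter is a placeholder, so substitute the exporters for your actual tracing and metrics backends:

```yaml
# Minimal collector sketch: accept OTLP/gRPC and print telemetry to the
# collector's own stdout. Replace "debug" with your real backend exporters.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
```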
## Critical user journeys

Critical user journeys (CUJs) define the key operations that affect users of the system. Each CUJ maps to a trace structure and a set of service level indicators (SLIs) to monitor.
### Token generation

Generates a GitHub token for the pipeline’s repository. This is the primary operation and the critical path for pipeline execution.
Endpoint: POST /token
Trace structure:

```
Server span: POST /token
├── Client span: GET api.buildkite.com/v2/.../pipelines/...
└── Client span: POST api.github.com/app/installations/.../access_tokens
```

The server span captures total request duration and HTTP status. The Buildkite API span shows pipeline lookup performance, and the GitHub API span shows token creation performance.
SLIs to monitor:
- p95/p99 server span duration
- HTTP 5xx error rate
- Cache hit rate (cached requests skip both API calls)
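These SLIs can be computed directly from exported span data. The following sketch assumes a flat list of `(duration, status, cache_hit)` records; the field layout is illustrative, not Chinmina's actual span attribute schema:

```python
from math import ceil

# Illustrative span records: (duration in seconds, HTTP status, cache hit?).
# The field layout is hypothetical; real data comes from your OTLP backend.
spans = [
    (0.08, 200, True),
    (0.90, 200, False),
    (1.40, 200, False),
    (0.05, 200, True),
    (2.10, 500, False),
    (0.07, 200, True),
]

def percentile(values, p):
    """Nearest-rank percentile of `values` (p in 0..100)."""
    ordered = sorted(values)
    rank = ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

durations = [d for d, _, _ in spans]
p95 = percentile(durations, 95)                                    # 2.1
error_rate = sum(1 for _, s, _ in spans if s >= 500) / len(spans)  # ~0.167
hit_rate = sum(1 for _, _, hit in spans if hit) / len(spans)       # 0.5
```

In practice these aggregations run in your metrics backend rather than in application code; the sketch just makes the arithmetic behind each SLI concrete.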
Suggested SLO targets:
| Metric | Objective | Rationale |
|---|---|---|
| Success rate | 99.9% | Critical path for pipeline execution |
| p99 latency | < 2s | Minimize delay in clone operations |
| p95 latency | < 1s | Typical case performance |
| Cache hit rate | > 70% | Reduce API load and latency |
| GitHub API p95 latency | < 500ms | Monitor external dependency health |
| Buildkite API p95 latency | < 300ms | Monitor external dependency health |
### Git credentials

Endpoint: POST /git-credentials
Identical trace structure to token generation (same underlying implementation). Git retries failed requests automatically, so slow responses directly delay clone operations. Monitor the same SLIs and SLO targets as token generation.
### Organization endpoints

Endpoints: POST /organization/token/{profile}, POST /organization/git-credentials/{profile}
Generates tokens scoped to repositories defined in an organization profile rather than the pipeline’s own repository.
Trace structure:

```
Server span: POST /organization/token/{profile}
└── Client span: POST api.github.com/app/installations/.../access_tokens
```

No Buildkite API call occurs because the repository is determined by the profile configuration. Monitor the same SLIs as token generation, but expect lower latency on uncached requests due to the single API call.
Suggested SLO targets: same as the token endpoints, except that only the GitHub API external dependency target applies:
| Metric | Objective | Rationale |
|---|---|---|
| GitHub API p95 latency | < 500ms | Monitor external dependency health |
### Background profile refresh

Periodically fetches organization profile configurations from the configuration source.
Trace structure:

```
Internal span: refresh_organization_profile
└── Client span: GET api.github.com/...
```

Attributes:

- `profile.digest_current`: previous configuration hash
- `profile.digest_updated`: new configuration hash
- `profile.digest_changed`: whether content changed
SLIs to monitor:
- Span error rate (fetch failures affect profile availability)
- `profile.digest_changed` frequency (unexpected changes may indicate configuration issues)
## Diagnostics

### High latency

Symptoms: p95/p99 latency exceeds objectives
Investigation:
- Check external API span durations
- Verify cache hit rate meets objectives
- Review connection timing attributes
- Check for network issues between service and APIs
Remediation:
- Increase token TTL to improve cache hit rate
- Review network path to external APIs
- Consider connection pooling configuration
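The first remediation can be quantified: mean request latency is a weighted average of the cached and uncached paths, so raising the cache hit rate lowers it directly. A back-of-envelope sketch with assumed path latencies:

```python
# Assumed path latencies in seconds; substitute the values your span data
# actually shows for cached and uncached requests.
CACHED = 0.01    # cache hit: no external API calls
UNCACHED = 0.8   # cache miss: Buildkite lookup + GitHub token creation

def mean_latency(hit_rate: float) -> float:
    """Expected request latency as a weighted average of the two paths."""
    return hit_rate * CACHED + (1 - hit_rate) * UNCACHED

# Moving from a 50% to a 90% hit rate cuts mean latency by roughly 4.5x:
print(round(mean_latency(0.5), 3))  # → 0.405
print(round(mean_latency(0.9), 3))  # → 0.089
```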
### High error rate

Symptoms: HTTP 5xx error rate above threshold
Investigation:
- Filter traces by error status
- Examine error messages in span events
- Check audit logs for detailed error information
- Verify external API availability
Remediation:
- Review GitHub App permissions
- Verify Buildkite API token scopes
- Check profile match conditions
- Investigate panic recovery patterns
### Cache inefficiency

Symptoms: cache hit rate below 70%
Investigation:
- Calculate the hit/miss/mismatch ratio using `token.cache.outcome`
- Check token expiry times in audit logs
- Review repository access patterns
- Examine profile configurations
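The hit/miss/mismatch split from the `token.cache.outcome` counter reduces to simple ratios. A sketch with hypothetical counter values scraped from a metrics backend over a fixed window:

```python
# Hypothetical counts per token.cache.outcome value over one window.
outcomes = {"hit": 620, "miss": 290, "mismatch": 90}

total = sum(outcomes.values())
rates = {k: round(v / total, 2) for k, v in outcomes.items()}

print(rates)  # → {'hit': 0.62, 'miss': 0.29, 'mismatch': 0.09}
# A 0.62 hit rate is below the 70% objective. A large mismatch share may
# indicate cached tokens that don't cover the requested repository
# (an interpretation to confirm against your profile configuration).
```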
Remediation:
- Increase token expiry duration (if GitHub App allows)
- Consolidate repository access patterns
- Review profile match conditions
- Consider organizational endpoint usage
- Enable the distributed cache to share tokens across replicas
### Cache encryption errors

Symptoms: the `cache.encryption.total` counter increases with `encryption.outcome="error"`, or trace spans show `cache.encrypt.outcome="error"` or `cache.decrypt.outcome="error"`. Decrypt failures surface as cache misses (the service falls back to API calls); encrypt failures prevent caching of new tokens.
Investigation:
- Check the error rate by operation type (encrypt vs decrypt) using `cache.encryption.total`
- Filter trace spans for `cache.decrypt.outcome="error"` or `cache.encrypt.outcome="error"` to correlate errors with specific requests
- Review service logs for specific error messages; decrypt errors fall into three categories:
  - Missing `cb-enc:` prefix: the cached value is unencrypted, common during encryption rollout
  - Base64 decode failure: corrupted data in Valkey
  - Decryption failure: wrong key, incomplete key rotation, or corrupted ciphertext
- Check logs for keyset refresh warnings (`"failed to refresh encryption keyset"`)
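The three decrypt-error categories suggest a natural triage order: prefix first, then base64 shape, then the decrypt itself. A hypothetical sketch of that triage (only the `cb-enc:` prefix is documented; the category names and logic here are illustrative):

```python
import base64

ENC_PREFIX = "cb-enc:"  # documented marker for encrypted cache entries

def classify_cached_value(raw: str) -> str:
    """Triage a cached value against the three decrypt-error categories."""
    if not raw.startswith(ENC_PREFIX):
        # Unencrypted entry, e.g. written before the encryption rollout.
        return "unencrypted"
    payload = raw[len(ENC_PREFIX):]
    try:
        base64.b64decode(payload, validate=True)
    except Exception:
        return "base64-corrupt"  # corrupted data in Valkey
    # Well-formed ciphertext; a real decrypt could still fail on a wrong
    # or rotated-away key, which this shape check cannot detect.
    return "ciphertext"

print(classify_cached_value("ghs_plaintexttoken"))  # → unencrypted
print(classify_cached_value("cb-enc:???"))          # → base64-corrupt
```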
Remediation:
- Prefix errors during encryption rollout are expected — unencrypted entries resolve as cached tokens expire (within 15 minutes)
- Decryption failures after key rotation: verify the rotation procedure in the distributed cache guide and confirm the old primary key was not disabled before cached tokens expired
- Keyset refresh warnings: verify IAM permissions for Secrets Manager and KMS, then check service health
- Persistent errors with no configuration changes: check Valkey connectivity and data integrity