LGTM Stack documentation #872

Merged · 16 commits · Jun 25, 2025
4 changes: 4 additions & 0 deletions docs/self_hosting/index.md
@@ -34,3 +34,7 @@ Step-by-step guides that cover the installation, configuration, and scaling of y
- [Week of January 29, 2024 - LangSmith v0.2](./self_hosting/release_notes#week-of-january-29-2024---langsmith-v02): Release notes for version 0.2 of LangSmith.
- [FAQ](./self_hosting/faq): Frequently asked questions about LangSmith.
- [Troubleshooting](./self_hosting/troubleshooting): Troubleshooting common issues with your Self-Hosted LangSmith instance.
- [Observability](./self_hosting/observability): How to access telemetry data for your self-hosted LangSmith instance.
- [Export LangSmith telemetry](./self_hosting/observability/export_backend): Export logs, metrics and traces to your collector and/or backend of choice.
- [Collector configuration](./self_hosting/observability/langsmith_collector): Example yaml configurations for an OTel collector to gather LangSmith telemetry data.
- [LangSmith Observability Stack](./self_hosting/observability/observability_stack): Have LangSmith deploy a basic observability stack for you to view logs, metrics and traces for your deployment.
70 changes: 70 additions & 0 deletions docs/self_hosting/observability/export_backend.mdx
@@ -0,0 +1,70 @@
---
sidebar_label: Export LangSmith Telemetry
sidebar_position: 9
---

# Exporting LangSmith telemetry to your observability backend

:::warning Important
**This section is only applicable for Kubernetes deployments.**
:::

Self-Hosted LangSmith instances produce telemetry data in the form of logs, metrics and traces. This section will show you how to access and export that data to
an observability collector or backend.

This section assumes that you either already have monitoring infrastructure set up, or plan to set it up and want to know how to configure it to gather data from LangSmith.

Infrastructure refers to:

- Collectors, such as [OpenTelemetry](https://opentelemetry.io/docs/collector/), [FluentBit](https://docs.fluentbit.io/manual) or [Prometheus](https://prometheus.io/).
- Observability backends, such as [Datadog](https://www.datadoghq.com/) or the [Grafana](https://grafana.com/) ecosystem.

# Logs: [OTel Example](./langsmith_collector#logs)

All services that are part of the LangSmith self-hosted deployment write logs to their node's filesystem and to stdout. To access these logs, set up your collector to read from either the filesystem or stdout. Most popular collectors support reading logs from the filesystem (a FluentBit sketch follows the list below):

- **OpenTelemetry**: [File Log Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver)
- **FluentBit**: [Tail Input](https://docs.fluentbit.io/manual/pipeline/inputs/tail)
- **Datadog**: [Kubernetes Log Collection](https://docs.datadoghq.com/containers/kubernetes/log/?tab=datadogoperator)
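For example, a minimal FluentBit tail input might look like the following. This is a sketch using FluentBit's YAML configuration format; the log path pattern and the `cri` parser are assumptions based on the default kubelet container log location, so adjust them to your cluster.

```yaml
pipeline:
  inputs:
    - name: tail
      # Assumed default kubelet container log path; narrow the glob to your pods as needed
      path: /var/log/containers/langsmith-*.log
      parser: cri
      tag: kube.*
```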

# Metrics: [OTel Example](./langsmith_collector#metrics)

## LangSmith Services
The following LangSmith services expose metrics in the Prometheus format at these endpoints:
- <b>Backend</b>: `http://<langsmith_release_name>-backend.<namespace>.svc.cluster.local:1984/metrics`
- <b>Platform Backend</b>: `http://<langsmith_release_name>-platform-backend.<namespace>.svc.cluster.local:1986/metrics`
- <b>Host Backend</b>: `http://<langsmith_release_name>-host-backend.<namespace>.svc.cluster.local:1985/metrics`
- <b>Playground</b>: `http://<langsmith_release_name>-playground.<namespace>.svc.cluster.local:1988/metrics`

You can use a [Prometheus](https://prometheus.io/docs/prometheus/latest/getting_started/#configure-prometheus-to-monitor-the-sample-targets) or
[OpenTelemetry](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver) collector to scrape the endpoints, and export metrics to the
backend of your choice.
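As a minimal sketch, a static Prometheus scrape configuration for the endpoints above could look like the following. The hostnames assume a release name of `langsmith` and a namespace of `langsmith`; substitute your own values.

```yaml
scrape_configs:
  - job_name: langsmith-services
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          # Assumed release name "langsmith" and namespace "langsmith"
          - "langsmith-backend.langsmith.svc.cluster.local:1984"
          - "langsmith-platform-backend.langsmith.svc.cluster.local:1986"
          - "langsmith-host-backend.langsmith.svc.cluster.local:1985"
          - "langsmith-playground.langsmith.svc.cluster.local:1988"
```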

## Frontend Nginx
The frontend service exposes its Nginx metrics at the following endpoint: `langsmith-frontend.langsmith.svc.cluster.local:80/nginx_status`. You can either scrape this endpoint yourself, or
bring up a Prometheus Nginx exporter using the [LangSmith Observability Helm Chart](./observability_stack).
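If you scrape it yourself, one option is to run the community [NGINX Prometheus exporter](https://github.com/nginxinc/nginx-prometheus-exporter) pointed at the status endpoint. The container snippet below is a sketch only; the image tag and flag syntax are assumptions, so check the exporter's documentation for your version.

```yaml
containers:
  - name: nginx-exporter
    image: nginx/nginx-prometheus-exporter:1.1.0 # assumed tag
    args:
      # Scrape the stub_status page exposed by the LangSmith frontend service
      - "--nginx.scrape-uri=http://langsmith-frontend.langsmith.svc.cluster.local:80/nginx_status"
    ports:
      - name: metrics
        containerPort: 9113 # the exporter's default metrics port
```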

:::warning Important
**The following sections apply to in-cluster databases only. If you are using external databases, you will need to configure metrics exposure and collection yourself.**
:::
## Postgres + Redis
If you are using in-cluster Postgres/Redis instances, you can use a Prometheus exporter to expose metrics from each instance. You can deploy your own exporter, or use the [LangSmith Observability Helm Chart](./observability_stack) to deploy one for you.
## ClickHouse
The in-cluster ClickHouse instance is configured to expose metrics without the need for an exporter.
You can use your collector to scrape metrics at `http://<langsmith_release_name>-clickhouse.<namespace>.svc.cluster.local:9363/metrics`.
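A minimal Prometheus scrape job for that endpoint could look like this (again assuming a release name and namespace of `langsmith`):

```yaml
scrape_configs:
  - job_name: langsmith-clickhouse
    metrics_path: /metrics
    static_configs:
      - targets:
          - "langsmith-clickhouse.langsmith.svc.cluster.local:9363"
```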

# Traces: [OTel Example](./langsmith_collector#traces)

The LangSmith Backend, Platform Backend, Playground and LangSmith Queue deployments have been instrumented to emit [OTel](https://opentelemetry.io/docs/concepts/signals/traces/)
traces. Tracing is disabled by default, and can be enabled for all LangSmith services with the following in your `langsmith_config.yaml` (or equivalent) file:

```yaml
config:
  tracing:
    enabled: true
    endpoint: "<your_collector_endpoint>"
    useTls: true # or false
    env: "ls_self_hosted" # This value will be set as an "env" attribute in your spans
    exporter: "http" # must be either http or grpc
```
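If you deploy the collector described in the [next section](./langsmith_collector), the endpoint would typically be your collector service's OTLP port, for example `http://<collector-service>.<namespace>.svc.cluster.local:4318` for the `http` exporter or port `4317` for `grpc`. The exact hostname depends on how you deploy the collector, so treat the example above as a placeholder.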
13 changes: 13 additions & 0 deletions docs/self_hosting/observability/index.md
@@ -0,0 +1,13 @@
---
sidebar_label: Self-Hosted Observability
sidebar_position: 11
description: "Observability guides for LangSmith"
---

# Self-Hosted Observability

This section contains guides for accessing telemetry data for your self-hosted LangSmith deployments.

- [Export LangSmith Telemetry](./observability/export_backend): Export logs, metrics and traces to your collector and/or backend of choice.
- [Configure a Collector for LangSmith Telemetry](./observability/langsmith_collector): Example yaml configurations for an OTel collector to gather LangSmith telemetry data.
- [LangSmith Observability Stack](./observability/observability_stack): Have LangSmith deploy a basic observability stack for you to view logs, metrics and traces for your deployment.
281 changes: 281 additions & 0 deletions docs/self_hosting/observability/langsmith_collector.mdx
@@ -0,0 +1,281 @@
---
sidebar_label: Collector Configuration
sidebar_position: 9
---

# Configure your Collector to gather LangSmith telemetry

As seen in the previous section, the various services in a LangSmith deployment emit telemetry data in the form of logs, metrics and traces.
You may already have telemetry collectors set up in your Kubernetes cluster, or you may want to deploy one to monitor your application.

This section will show you how to configure an [OTel Collector](https://opentelemetry.io/docs/collector/configuration/) to gather telemetry data from LangSmith.
Note that all of the concepts discussed below can be translated to other collectors such as [Fluentd](https://www.fluentd.org/) or [FluentBit](https://fluentbit.io/).

:::warning Important
**This section is only applicable for Kubernetes deployments.**
:::

# Receivers

## Logs

This is an example configuration for a <b><u>Sidecar</u></b> collector that reads logs from its own pod, excluding logs from the collector container (`otc-container`) itself.
A Sidecar configuration is useful here because we require access to every container's log files. A DaemonSet can also be used.

```yaml
filelog:
  exclude:
    - "**/otc-container/*.log"
  include:
    - /var/log/pods/${POD_NAMESPACE}_${POD_NAME}_${POD_UID}/*/*.log
  include_file_name: false
  include_file_path: true
  operators:
    - id: container-parser
      type: container
  retry_on_failure:
    enabled: true
  start_at: end

# Pod-level settings for the collector: the environment variables below are used to
# resolve the pod's log path, and the volume mount exposes /var/log/pods to the collector.
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_UID
    valueFrom:
      fieldRef:
        fieldPath: metadata.uid
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
volumeMounts:
  - name: varlogpods
    mountPath: /var/log/pods
    readOnly: true
```
:::info Note
**This configuration requires 'get', 'list', and 'watch' permissions on pods in the given namespace.**
:::
## Metrics
Metrics can be scraped from the Prometheus endpoints. A single-instance <b><u>Gateway</u></b> collector can be used to avoid
duplicate scrapes when fetching metrics. The following config scrapes all of the LangSmith services with default names:
```yaml
prometheus:
  config:
    scrape_configs:
      - job_name: langsmith-services
        metrics_path: /metrics
        scrape_interval: 15s
        # Only scrape endpoints in the LangSmith namespace
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names: [<langsmith-namespace>]
        relabel_configs:
          # Only scrape services with the name langsmith-.*
          - source_labels: [__meta_kubernetes_service_name]
            regex: "langsmith-.*"
            action: keep
          # Only scrape ports with the following names
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            regex: "(backend|platform|playground|redis-metrics|postgres-metrics|metrics)"
            action: keep
          # Promote useful metadata into regular labels
          - source_labels: [__meta_kubernetes_service_name]
            target_label: k8s_service
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: k8s_pod
          # Replace the default "host:port" as Prom's instance label
          - source_labels: [__address__]
            target_label: instance
```
:::info Note
**This configuration requires 'get', 'list', and 'watch' permissions on pods, services and endpoints in the given namespace.**
:::
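As a sketch, those permissions could be granted to the collector's service account with a namespaced Role and RoleBinding along these lines (the service account name `otel-collector` is an assumption; use whatever account your collector runs as):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-collector-scrape
  namespace: <langsmith-namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-collector-scrape
  namespace: <langsmith-namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-collector-scrape
subjects:
  - kind: ServiceAccount
    name: otel-collector # assumed service account name
    namespace: <langsmith-namespace>
```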
## Traces
For traces, you need to enable the OTLP receiver. The following configuration listens for OTLP traces over HTTP on port 4318 and gRPC on port 4317:
```yaml
otlp:
  protocols:
    grpc:
      endpoint: 0.0.0.0:4317
    http:
      endpoint: 0.0.0.0:4318
```
# Processors
## Recommended OTel Processors
The following processors are recommended when using the OTel collector; a combined configuration sketch follows the list:
- [Batch Processor](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md): Groups the data into batches before sending to exporters.
- [Memory Limiter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md): Prevents the collector from using too much memory and crashing. When the soft limit is crossed,
the collector stops accepting new data.
- [Kubernetes Attributes Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor): Adds Kubernetes metadata such as pod name into the telemetry data.
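The sketch below reuses the batch and memory limiter settings from the full examples later on this page and adds a minimal Kubernetes Attributes Processor; the extracted metadata fields are illustrative rather than prescriptive.

```yaml
processors:
  batch:
    send_batch_size: 8192
    timeout: 10s
  memory_limiter:
    check_interval: 1m
    limit_percentage: 90
    spike_limit_percentage: 80
  k8sattributes:
    extract:
      metadata:
        # Illustrative selection of Kubernetes metadata to attach to telemetry
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
```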
# Exporters
Exporters only need to point to an external endpoint of your choice. The following configuration lets you set a separate endpoint for logs, metrics and traces:
```yaml
otlphttp/logs:
  endpoint: <your_logs_endpoint>
otlphttp/metrics:
  endpoint: <your_metrics_endpoint>
otlphttp/traces:
  endpoint: <your_traces_endpoint>
```
:::note Note
**The OTel Collector also supports exporting directly to a [Datadog](https://docs.datadoghq.com/opentelemetry/setup/collector_exporter) endpoint.**
:::
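As a sketch, a Datadog exporter configuration would look roughly like the following; the environment variable name is an assumption, and the exporter supports more options than shown here.

```yaml
datadog:
  api:
    key: ${env:DD_API_KEY} # assumes the API key is injected via an environment variable
    site: datadoghq.com
```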
# Example Collector Configuration: Logs Sidecar
```yaml
mode: sidecar
image: otel/opentelemetry-collector-contrib

config:
  receivers:
    filelog:
      exclude:
        - "**/otc-container/*.log"
      include:
        - /var/log/pods/${POD_NAMESPACE}_${POD_NAME}_${POD_UID}/*/*.log
      include_file_name: false
      include_file_path: true
      operators:
        - id: container-parser
          type: container
      retry_on_failure:
        enabled: true
      start_at: end

  processors:
    batch:
      send_batch_size: 8192
      timeout: 10s
    memory_limiter:
      check_interval: 1m
      limit_percentage: 90
      spike_limit_percentage: 80

  exporters:
    otlphttp/logs:
      endpoint: <your-endpoint>

  service:
    pipelines:
      logs/langsmith:
        receivers: [filelog]
        processors: [batch, memory_limiter]
        exporters: [otlphttp/logs]

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_UID
    valueFrom:
      fieldRef:
        fieldPath: metadata.uid
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
volumeMounts:
  - name: varlogpods
    mountPath: /var/log/pods
    readOnly: true
```
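If this configuration is applied as an `OpenTelemetryCollector` resource managed by the [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/), which the `mode: sidecar` field suggests but is an assumption here, the sidecar is injected into LangSmith pods by annotating them, for example:

```yaml
metadata:
  annotations:
    # "true" selects the single collector in the namespace; you can also reference
    # a specific collector as "<namespace>/<collector-name>".
    sidecar.opentelemetry.io/inject: "true"
```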
# Example Collector Configuration: Metrics and Traces Gateway
```yaml
mode: deployment
image: otel/opentelemetry-collector-contrib

config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: langsmith-services
            metrics_path: /metrics
            scrape_interval: 15s
            # Only scrape endpoints in the LangSmith namespace
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names: [<langsmith-namespace>]
            relabel_configs:
              # Only scrape services with the name langsmith-.*
              - source_labels: [__meta_kubernetes_service_name]
                regex: "langsmith-.*"
                action: keep
              # Only scrape ports with the following names
              - source_labels: [__meta_kubernetes_endpoint_port_name]
                regex: "(backend|platform|playground|redis-metrics|postgres-metrics|metrics)"
                action: keep
              # Promote useful metadata into regular labels
              - source_labels: [__meta_kubernetes_service_name]
                target_label: k8s_service
              - source_labels: [__meta_kubernetes_pod_name]
                target_label: k8s_pod
              # Replace the default "host:port" as Prom's instance label
              - source_labels: [__address__]
                target_label: instance
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      send_batch_size: 8192
      timeout: 10s
    memory_limiter:
      check_interval: 1m
      limit_percentage: 90
      spike_limit_percentage: 80

  exporters:
    otlphttp/metrics:
      endpoint: <metrics_endpoint>
    otlphttp/traces:
      endpoint: <traces_endpoint>

  service:
    pipelines:
      metrics/langsmith:
        receivers: [prometheus]
        processors: [batch, memory_limiter]
        exporters: [otlphttp/metrics]
      traces/langsmith:
        receivers: [otlp]
        processors: [batch, memory_limiter]
        exporters: [otlphttp/traces]
```