Skip to content

MSSQL Modernization #1450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

schmikei
Copy link
Contributor

@schmikei schmikei commented Jun 11, 2025

  • Utilizing common-lib, logs-lib and grafonnet for most of the mixin generation
  • Following inspiration for more recently updated mixins like Vsphere

This involves separating out the MSSQL Overview dashboard to separate logs and metrics. This would require some new screenshots.

image image image

@schmikei schmikei marked this pull request as ready for review June 11, 2025 18:34
@schmikei schmikei requested a review from a team as a code owner June 11, 2025 18:34
Comment on lines 6 to 16
database:
[
g.panel.row.new('Database'),
this.grafana.panels.databaseWriteStallDurationPanel
+ g.panel.timeSeries.gridPos.withW(12),
this.grafana.panels.databaseReadStallDurationPanel
+ g.panel.timeSeries.gridPos.withW(12),
this.grafana.panels.transactionLogExpansionsPanel
+ g.panel.timeSeries.gridPos.withW(24),
],
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best if row panels are added not as array but inside
g.panel.row.withPanels() function. This way it is possible to toggle rows collapsed/uncollapsed.

example:
https://github.com/grafana/jsonnet-libs/blob/master/kafka-observ-lib/rows.libsonnet#L25-L33.
and then:
https://github.com/grafana/jsonnet-libs/blob/master/kafka-observ-lib/dashboards.libsonnet#L33-L52

Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor comments but overall this looks great to me!
Admittedly a bit rusty on this what with PTO and all so leaving final approval up to @v-zhuravlev

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing .json filetype on this and the mssql_overview files

"keepTime": true,
"title": "MSSQL Overview",
"type": "link",
"url": "/d/mssql_mssql_overview"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is generated, but it feels like a weird change to have mssql twice in the URL?

@schmikei
Copy link
Contributor Author

schmikei commented Jul 1, 2025

@v-zhuravlev just curious, do you think we should just default to using signals for this PR as well? I think the API is nice to use and gives us more flexibility probably moving forward.

@v-zhuravlev
Copy link
Contributor

@v-zhuravlev just curious, do you think we should just default to using signals for this PR as well? I think the API is nice to use and gives us more flexibility probably moving forward.

yes, it would be great using Signals where possible.

I also see asTarget() as first transitional step for easier migration,

using asTimeSeries() and asPanelMixin(), asTable(), asTableRow() (which propogate units, value mappings from signals) are even better for panel constructions from signals, I encourage you take a look in README or other observ-libs, like kafka, jvm etc.

But let's finish off Clickhouse first.

Copy link
Contributor

@v-zhuravlev v-zhuravlev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, some comments inline.

local this = self,
enableMultiCluster: false,
filteringSelector: 'job=~"integrations/mssql"',
groupLabels: ['job', 'cluster'], // Remove 'db' from here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like db is removed.


// logs lib related
enableLokiLogs: true,
logLabels: ['job', 'cluster', 'instance'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logLabels: ['job', 'cluster', 'instance'],
logLabels: self.groupLabels + self.instanceLabels,

would it be a better default?

"source": {
"git": {
"remote": "https://github.com/grafana/grafonnet.git",
"subdir": "gen/grafonnet-v10.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be bumped to 11.0.0 or 11.4.0?

Comment on lines 40 to 54
local dbVariable = {
label: 'Database',
name: 'db',
query: 'label_values(mssql_io_stall_seconds_total{%s}, db)' % this.config.filteringSelector,
datasource: {
type: 'prometheus',
uid: '${prometheus_datasource}',
},
refresh: 2,
sort: 1,
type: 'query',
includeAll: true,
allValue: '.*',
multi: true,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need special value here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this because only some targets need the db field as part of the filter in the original design.

Old Source:

'100 * mssql_server_total_memory_bytes{' + matcher + '} / clamp_min(mssql_available_commit_memory_bytes{' + matcher + '},1)',

vs
'increase(mssql_io_stall_seconds_total{' + matcher + ', db=~"$database", operation="write"}[$__rate_interval:])',

Dropdown:

template.new(
'database',
promDatasource,
'label_values(mssql_log_growths_total{%(mssqlSelector)s, instance=~"$instance"}, db)' % $._config,
label='Database',
refresh=1,
includeAll=true,
multi=true,
allValues='.+',
sort=2

As you can see we got around this last time by selectively using the filter on the specific targets we needed.

With signals it seems a tad bit harder to specify these kind of scenarios where we want a filter dropdown but also want all the filters aside from the db selector. So changing the default value to .* makes it so that the filter still works but does not break other panels when filtering.

Let me know if you think there's an easier way of accomplishing this but this was my solution thus far to keep this going forward!

Copy link
Contributor

@v-zhuravlev v-zhuravlev Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you just need db var to accept .* as all value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but its moreso that if any database filter is applied then it kind of breaks the other panels if I do it as an instanceLabel...

I suppose is this behavior fine from your perspective? I didn't want to remove the funcitionality of this being a single pane where a user could correlate database data with instance data.

Screen.Recording.2025-07-15.at.12.34.14.PM.mov

if that doesn't make sense let me know but I figured the currently written way was more user friendly, but happy to make the change either way!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok got it. You check this example with Kafka instance and Kafka instance topics, very similar to yours.

  1. First topic is added(hardcoded) as instance label to topics signals.
    https://github.com/grafana/jsonnet-libs/blob/master/kafka-observ-lib/signals/topic.libsonnet#L7
    note that for your case it could be instanceLabels + ['db'], there instanceLabels are ommitted for completely different reason.
  2. Second, union of variables generated is used give variables for topics dashboard:
    https://github.com/grafana/jsonnet-libs/blob/master/kafka-observ-lib/dashboards.libsonnet#L7-L13

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've implemented based off what you're saying. Let me know if you think its adequate!

Comment on lines 43 to 55
severeErrors: {
name: 'Severe errors',
nameShort: 'Severe Errors',
type: 'raw',
description: 'Rate of severe connection errors per second.',
unit: '/ sec',
sources: {
prometheus: {
expr: 'increase(mssql_kill_connection_errors_total{%(queriesSelector)s}[$__rate_interval:])',
legendCustomTemplate: '{{ instance }}',
},
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be simplier to use type=counter and simple expression?
Alos, AFAIK, increase doesn't work well with rate_interval. Best if combined just $__interval to avoid increments duplication and wrong "Total" aggregations in the legend etc.

Comment on lines 57 to 66
name: 'Deadlocks',
nameShort: 'Deadlocks',
type: 'raw',
description: 'Rate of deadlocks per second.',
unit: '/ sec',
sources: {
prometheus: {
expr: 'rate(mssql_deadlocks_total{%(queriesSelector)s}[$__rate_interval])',
legendCustomTemplate: '{{ instance }}',
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, why not to use type=counter?

unit: 'seconds',
sources: {
prometheus: {
expr: 'increase(mssql_io_stall_seconds_total{%(queriesSelector)s, db=~"$db", operation="write"}[$__rate_interval:])',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can db just added as the last element of instanceLabels? Can it be considered the smallest instnance you want to observ (single db within MSSQL host?) would it work?

},
databaseReadStallDuration: {
name: 'Database read stall duration',
nameShort: 'Read Stall',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nameShort: 'Read Stall',
nameShort: 'Read stall',

Please use Sentence case.

Comment on lines 34 to 39
type: 'raw',
description: 'Time spent waiting for database read operations.',
unit: 'seconds',
sources: {
prometheus: {
expr: 'increase(mssql_io_stall_seconds_total{%(queriesSelector)s, db=~"$db", operation="read"}[$__rate_interval:])',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type: 'raw',
description: 'Time spent waiting for database read operations.',
unit: 'seconds',
sources: {
prometheus: {
expr: 'increase(mssql_io_stall_seconds_total{%(queriesSelector)s, db=~"$db", operation="read"}[$__rate_interval:])',
type: 'counter',
description: 'Time spent waiting for database read operations.',
unit: 'seconds',
sources: {
prometheus: {
expr: 'mssql_io_stall_seconds_total{%(queriesSelector)s, db=~"$db", operation="read"})',

again, I would defined counters as type=counters and simple expression. See signals README, you can swap rate to increase() as the default function that is applied to the counter if needed.

},
bufferCacheHitPercentage: {
name: 'Buffer cache hit percentage',
nameShort: 'Buffer Cache Hit',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and similar, please check it is Sentence case.

"expr": "increase(mssql_kill_connection_errors_total{job=~\"integrations/mssql\",job=~\"$job\",cluster=~\"$cluster\",instance=~\"$instance\"}[$__rate_interval:])",
"expr": "rate(mssql_kill_connection_errors_total{job=~\"integrations/mssql\",job=~\"$job\",cluster=~\"$cluster\",instance=~\"$instance\"}[$__rate_interval])",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some expr got changed function from increase to rate at the end. Just to check that this is what you want. If not, you can change rateFunction to be increase for some signals. See Signals README.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants