dependence on scaling of data #26

stephens999 · 2018-10-23T14:59:01Z

I noticed that the scaling of the data matters, which seems undesirable (and unnecessary).

For example:

set.seed(51)
true_mean = rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
genomdat = list(x = rnorm(500, sd=0.2) + true_mean, true_mean=true_mean)

The cpt.mean default does not find any changepoints:

genomdat.cp = cpt.mean(genomdat$x,method="PELT")
plot(genomdat.cp)

But if we multiply the data by 10 we find many changepoints.

genomdat.cp = cpt.mean(10*genomdat$x,method="PELT")
plot(genomdat.cp)

I speculate that perhaps the cost function (log-likelihood) implicitly assumes the variance
is 1?

Incidentally to this, while digging around the code to see if I could understand the issue, I
noticed that some places in the code
use "norm.mean" whereas others use "mean.norm". I'm not sure that was intended?

Matthews-MacBook-Air-2:changepoint stephens$ grep norm.mean src/*
src/BinSeg_one_func_minseglen.c:     char **cost_func; //Descibe the cost function used i.e. norm.mean.cost (change in mean in normal distributed data)  
src/BinSeg_one_func_minseglen.c:  {"norm.mean", mll_mean},
src/BinSeg_one_func_minseglen.c:  {"norm.meanvar", mll_meanvar},
Matthews-MacBook-Air-2:changepoint stephens$ grep mean.norm src/*
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){

The text was updated successfully, but these errors were encountered:

rkillick · 2019-06-04T13:59:35Z

I agree that this is an undesirable "feature" but I haven't decided on a suitable "default" behaviour for estimating the variance in the presence of a mean change. If you have any suggestions, please let me know.

stephens999 · 2019-06-05T21:55:17Z

i'd suggest 0.5*median(abs(diff(y))/0.6745)^2 which is based on the fact that
y_t-y_{t+1} \sim N(0,2sigma^2) if there is no difference in mean between t and t+1
and E(median(abs(N(0,sigma^2)))) = 0.6745 sigma

rkillick · 2019-06-06T08:57:57Z

Thanks for the suggestion, i'll see how it behaves on different simulated data. The best one i've found at the moment is a sample variance on a rolling window of length 30 and then taking the median of those.
But this is undesirable when you have long time series.

rkillick added the enhancement label Jun 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dependence on scaling of data #26

dependence on scaling of data #26

stephens999 commented Oct 23, 2018

rkillick commented Jun 4, 2019

Uh oh!

stephens999 commented Jun 5, 2019

Uh oh!

rkillick commented Jun 6, 2019

Uh oh!

dependence on scaling of data #26

dependence on scaling of data #26

Comments

stephens999 commented Oct 23, 2018

rkillick commented Jun 4, 2019

Uh oh!

stephens999 commented Jun 5, 2019

Uh oh!

rkillick commented Jun 6, 2019

Uh oh!