Skip to content

dependence on scaling of data #26

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
stephens999 opened this issue Oct 23, 2018 · 3 comments
Open

dependence on scaling of data #26

stephens999 opened this issue Oct 23, 2018 · 3 comments

Comments

@stephens999
Copy link

I noticed that the scaling of the data matters, which seems undesirable (and unnecessary).

For example:

set.seed(51)
true_mean = rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
genomdat = list(x = rnorm(500, sd=0.2) + true_mean, true_mean=true_mean)

The cpt.mean default does not find any changepoints:

genomdat.cp = cpt.mean(genomdat$x,method="PELT")
plot(genomdat.cp)

But if we multiply the data by 10 we find many changepoints.

genomdat.cp = cpt.mean(10*genomdat$x,method="PELT")
plot(genomdat.cp)

I speculate that perhaps the cost function (log-likelihood) implicitly assumes the variance
is 1?

Incidentally to this, while digging around the code to see if I could understand the issue, I
noticed that some places in the code
use "norm.mean" whereas others use "mean.norm". I'm not sure that was intended?

Matthews-MacBook-Air-2:changepoint stephens$ grep norm.mean src/*
src/BinSeg_one_func_minseglen.c:     char **cost_func; //Descibe the cost function used i.e. norm.mean.cost (change in mean in normal distributed data)  
src/BinSeg_one_func_minseglen.c:  {"norm.mean", mll_mean},
src/BinSeg_one_func_minseglen.c:  {"norm.meanvar", mll_meanvar},
Matthews-MacBook-Air-2:changepoint stephens$ grep mean.norm src/*
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
@rkillick
Copy link
Owner

rkillick commented Jun 4, 2019

I agree that this is an undesirable "feature" but I haven't decided on a suitable "default" behaviour for estimating the variance in the presence of a mean change. If you have any suggestions, please let me know.

@stephens999
Copy link
Author

i'd suggest 0.5*median(abs(diff(y))/0.6745)^2 which is based on the fact that
y_t-y_{t+1} \sim N(0,2sigma^2) if there is no difference in mean between t and t+1
and E(median(abs(N(0,sigma^2)))) = 0.6745 sigma

@rkillick
Copy link
Owner

rkillick commented Jun 6, 2019

Thanks for the suggestion, i'll see how it behaves on different simulated data. The best one i've found at the moment is a sample variance on a rolling window of length 30 and then taking the median of those.
But this is undesirable when you have long time series.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants