Skip to content

Conversation

felipegb94
Copy link

RapidPT_min includes only the necessary files for RapidPT. Under RapidPT/include/ you can find grasta 1.2 which is the library we rely on for low-rank matrix completion.

The main files modified were:

  • snpm_cp.m --> Added an if statement saying that if nPerm >= 10000, and if type of test is 'TwoSampTTest' then use RapidPT. All the necessary files for snpm_pp to work are saved (SnPMt.mat,SnPMucp.mat, SnPM.mat, XYZ.mat). The MaxT only has the maximum test statistics and not the minimum ones. For the minimum ones I simply took the negative of the recovered MaxT (See line 811).
  • snpm_pi_TwoSampT.m --> I changed this file to stop forcing it to try to find different labelings for each permutation. For datasets with more than 20 subjects and when doing 10000 the chances of repeating a permutation should be low.
  • snpm_TwoSampleTGetLabelsMatrices., --> Added this function. This is called by snpm_pi_TwoSampT to get the label matrices for all permutations.

@felipegb94
Copy link
Author

I just realized that there are still a few files inside RapidPT_min that are not used. I will fix that.

Also the grasta library has a ton of files that are not really used by RapidPT but are part of the library. Should I remove them or just leave them there?

@nicholst
Copy link
Collaborator

nicholst commented Nov 8, 2016

@felipegb94 Thanks for this! I'm going to review this with @cmaumet who helps me maintain SnPM. If you can easily find the minimum set of files to use from grasta (whose library is that?) then, sure.

The one change I'm going to propose right away, though, is having a more direct switch so people can try this method. For example, imagine someone that has 1000+ subjects with high-resolution VBM data, 10,000 perms may cause memory problems but they still may want to try the method. Conversely, people may want to compare their results with 10,000 'real' permutations.

I'm not sure where to put it in the interface, but for now I'd use a defaults value. E.g. in snpm_defaults.m add a line like

% Use matrix completion acceleration?
%------------------------------------------------------------------------
SnPMdefs.RapidPT = 0; % 0 = Always off; 1 = On if nPerm>=10000; 2 = Always on.

And then you can check that and act correspondingly in snpm_cp.m accordingly.

Also, when the smoke clears, by the time snpm_cp is over, there should be a simple way to check whether the approximate method was used. E.g. Have a variable RapidPT that is saved, i.e., in your conditional add

s_SnPM_save = [s_SnPM_save ' RapidPT'];  % Save actual RaptidPT usage (0 or 1)

@cmaumet - How does all this look to you?

@felipegb94
Copy link
Author

@nicholst @cmaumet
Hi!
I added a RapidPT marker to SnPMdefs so that we can easily check if RapidPT is being used or not. It is always set to 0 and only set to 1 when nPerm >= 10000.

I removed all the unecessary RapidPT and GRASTA files from RapidPT_min. So right now that folder only has the code we need, some license files, and a README.

Just for Reference RapidPT can be found here: RapidPT
and GRASTA can be found here: GRASTA

Thanks!

@nicholst
Copy link
Collaborator

nicholst commented Nov 8, 2016

Thanks for that @felipegb94 ! But just to be clear, the user should always have the option to override, i.e. to (a) force RapidPT to be used or (b) force RapidPT to be not used (this is what I'm looking at https://github.com/nicholst/SnPM-devel/pull/28/files#diff-a366d644f7db2228d4719aa2e8b51dc1R295 )

@felipegb94
Copy link
Author

felipegb94 commented Nov 9, 2016

@nicholst
I made one more change to the options to use rapidpt. If you look at lines 298-304 in snpm_cp.m you will see the logic to see if we use RapidPT or not. There are four cases:

  1. User wants to force to use RapidPT (and the test stat is TwoSampT) --> SnPMdefs.RapidPT = 2
  2. User does not care if he uses RapidPT or regular perm testing, but nPerms >= 10000 (and the test stat is TwoSampT) --> SnPMdefs.RapidPT = 1
  3. User does not care and nPerm < 10000, or the test stat is not TwoSampT --> SnPMdefs.RapidPT can be either 0 or 1, in either case we do not use RapidPT.
  4. User does not want to use RapidPT --> SnPMdefs.RapidPT = 0

In the end, the user will specify if he wants to force to use RapidPT (SnPMdefs.RapidPT = 2), if he does not care (SnPMdefs.RapidPT = 1), or if he does not want to use it (SnPMdefs.RapidPT = 0). Therefore if the user wants to use RapidPT he has to change the SnPMdefaults, or we can have the default value be 1, i.e the user does not care and we pick when he uses it.

Is this what you were thinking?

Thanks!

@nicholst
Copy link
Collaborator

nicholst commented Nov 9, 2016

@felipegb94 - Yes! Thanks, exactly. Also, btw, don't change SnPMdefs.RapidPT... that expresses user's (invariant) desire; just record what happened with UseRapidPT.

I'll go over this with @cmaumet and merge it as soon as possible, but will probably take a week or so as we're both busy the rest of this week.

@cmaumet
Copy link
Member

cmaumet commented Nov 10, 2016

@felipegb94: thank you for this pull request! It looks like your branch cannot be merged at the moment as it has some conflicts with the master branch. Could you rebase your code to include our latest changes? Let me know if you need some guidance on how to do that. Thank you!

@nicholst
Copy link
Collaborator

Hi @felipegb94,

In reviewing your code I was checking your license and realised that SnPM has been utterly unclear about its license. I'm usually partial to MIT but L-GPL seems fine to me. HOWEVER, my post doc @cmaumet pointed out that since SPM is vanilla GPL, SnPM has to be GPL! This was news to me!

So this means you're in the same boat, I'm afraid... as we depend on SPM, and you depend on SnPM, the viral nature of the GPL means you must be GPL as well.

You should discuss this with Vikas but it would seem you have no choice but to switch to GPL.

@felipegb94
Copy link
Author

Hi @nicholst ,

I will look into this and talk to Vikas and Vamsi. But I think that GPL software can depend on software under the MIT license. What cannot happen is MIT licensed software to be distributed depending on software under the GPL license. For instance, RapidPT depends on Grasta, so if Grasta was GPL, RapidPT would have to be GPL.

Now the question is how do you define that a software depends on another one. In the case of spm and snpm, I would think that spm depends on snpm since snpm is a toolbox that can be added to extend functionality.

I am not an expert on this but I can look into this more. In the end, I am ok with making RapidPt GPL if that means that it can be added to snpm.

I will follow up on this after I talk with Vikas. Thanks for pointing this out!
Felipe

@nicholst
Copy link
Collaborator

nicholst commented Nov 13, 2016 via email

@cmaumet
Copy link
Member

cmaumet commented Nov 14, 2016

@felipegb94, @nicholst: just to add to this discussion, (I am not a licence expert but) here is what I would suggest:

  • From my understanding, RapidPT does not directly depend on SnPM (nor SPM) so I would say usage of the MIT licence is fine (as long as there are no other dependencies imposing certain licenses),
  • If we ship RapidPT with SnPM then we would release the full package (including SnPM, part of RapidPT and part of gratsa) under GPL. GPL is compatible with MIP and L-GPL so this should be fine as well.

Would this sound reasonable?

Felipe Gutierrez added 2 commits November 15, 2016 12:49
Conflicts:
	snpm_defaults.m
	snpm_pi_TwoSampT.m
remove GetLabelsMatrices (now done directly in snpm_pi_TwoSampT.m). switch SnPMdefs in snpm_cp to snpm_get_defaults
@felipegb94
Copy link
Author

felipegb94 commented Nov 15, 2016

@nicholst @cmaumet

Merge related: Resolved merge conflicts and removed a function I wrote that was not needed after the merge.

License Related: Regarding the license discussion. I think what @cmaumet is proposing makes sense. I read up on GPL and it says that a GPL-licensed package can use MIT and L-GPL licensed libraries (i.e they are compatible). The stand-alone version of RapidPT will have an MIT license since it will not depend on SnPM or SPM. Overall, the full SPM/SnPM package will be released with a GPL license. But stand-alone RapidPT will have an MIT License. Does this work?

Docs Related: I would like to add a short paragraph to explain when and how to enable RapidPT. Should I add that to the Readme?

Thanks!

@felipegb94
Copy link
Author

Hi @nicholst @cmaumet ,

License Followup: I talked to Vikas and he said that we will change the license to GPL.

Is there anything else I should modify in the PR?

Thanks!
Felipe

@cmaumet
Copy link
Member

cmaumet commented Jan 17, 2017

Hi @felipegb94,

I am very sorry for being so slow to reply to your messages. You are our first external contributor on GitHub and we are still working out how to best handle new contributions. Thank you again for this pull request! And very sorry for the delay.

Here are a few things that I think we should focus on in order to merge this PR:

  1. Non-regression testing (existing). SnPM has a test suite that we run on each new update to test for non-regression. We should run this test suite on your update.
  2. Non-regression testing (updated). We should also create non-regression tests (at least one) for your new feature. This means running an analysis that uses your code and save the corresponding results so that in future releases we can check we always get the very same results.
  3. Code review. Someone from our side should have a look at your update and send any feedback. I think @nicholst is probably the best person for that as he best knows the new method you are implementing.

So, I am planning to focus on (1) and (2): to make the test data publicly available and write up a contributing guideline.

@nicholst: would you be okay to do the code review (3)?

@felipegb94: if you want to get started on (2) maybe you could start designing/implementing a test for your additional feature (e.g. of code for existing tests is available in https://github.com/nicholst/SnPM-devel/tree/master/test)?

Let me know if this sounds okay. Thank you!

@felipegb94
Copy link
Author

felipegb94 commented Feb 21, 2017

Hi @cmaumet ,

I am trying to write the test for rapidpt, but it keeps crashing. I am not even able to run any of the regular tests such as: run(test_oneSample, 'test_onesample_1'). I was wondering if you have tried running the tests on Matlab 2016a? That is the version I am currently using and I think that that could be the problem. Because this is a snippet of the error I get:

Error occurred in test_rapidpt/test_rapidpt_maxt and it did not run to completion.

    ---------
    Error ID:
    ---------
    'MATLAB:UndefinedFunction'

    --------------
    Error Details:
    --------------
    Undefined function 'list' for input arguments of type 'cell'.
    
    Error in cfg_repeat/list (line 112)
                [id1, stop1] = list(citems{k}, spec, tropts);

Do you think that is a possibility? This is my unfinished test code: https://github.com/felipegb94/SnPM-devel/blob/master/test/test_rapidpt.m

I created a fork of the SnPM_test_data repository and added the test results to it. I will do a pull request there once my test is working.

Thanks!
Felipe

@cmaumet
Copy link
Member

cmaumet commented Feb 23, 2017

@felipegb94: Yes, that's right! We have not run the tests with Matlab 2016a yet and this is likely to be the cause of the issue you are experiencing. So, next step is for me to make sure the tests run on this version. I will keep you updated as soon as this is done.

@felipegb94
Copy link
Author

@cmaumet , Sounds good. I will try to test with other Matlab versions (hopefully I can get a hold of 2014a) since that seems to be the one that you are using to run the tests.

Thanks!

@cmaumet cmaumet mentioned this pull request Mar 14, 2017
Copy link
Collaborator

@nicholst nicholst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felipegb94 for these mods! In addition to felipegb94#1 please look at these

snpm_cp.m Outdated
save(strcat('timings',runInfo),'timings');

% Save variables for snpm_pp
MaxT = [MaxT',-MaxT'];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that MaxT should have the correctly labeled Max and Min T stat in the first row; presently it seems like MaxT(1,1) and MaxT(1,2) are just arbitrary permutations.
ALSO, the second column is the negation of the minimum T values, i.e., the MaxT(:,2) should be all positive.

Could you revise this so that MaxT is compatible with snpm_cp?

Copy link
Author

@felipegb94 felipegb94 Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised MaxT so that MaxT(1,1) and MaxT(1,2) contain the correct MaxT.
See: nicholst@2f3c7f9

snpm_cp.m Outdated
%-----------------------------------------------------------------------
nP = [];

if(UseRapidPT >= 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is bhPerms used at all? SnPM constructs a MaxT that has first column that is max T stat and second column that is minus the the min T. As a two-sample t-test does not assume distributional symmetry, you wouldn't necessarily want to automatically merge these two columns. Only under certain settings, specifically when all permutations are exhaustively computed does SnPM set bhPerm=1.

So when bhPerm=0, MaxT(:,1) is only used to make inferences on increases, MaxT(:,1) is only used to make inferences on decreases (multiplied by -1). WhenbhPerm=1, we allow ourselves to construct the max T distribution out of MaxT(:).

I assume you probably don't want to assume distributional symmetry (and you're not exhustive), so you should set bhPerm=0 at the top of this RapidPT code block.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added MinT computation to code. Second col of MaxT is -MinT.
See: nicholst@2f3c7f9

snpm_cp.m Outdated
MaxT = [MaxT',-MaxT'];
% Calculate uncorrected p-vals
[~, SnPMucp, ~, ~] = ttest2(X(1:params.nGroup1, :), X(params.nGroup1+1:end, :), 0.05, 'both', 'unequal');
save('SnPMucp.mat','SnPMucp')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for saving SnPMucp.mat, XYZ.mat, SnPMt.mat and updating SnPM.mat here. However, there is more bookkeeping that is done starting at the comment "write out lP+ and lP- images" below. In particular, when RapidPT=1, for the images lP{+,-}, lP_FDR{+,-}, and lP_FWE{+,-}, the header files are written out above and then the img files are never created.

I would prefer it if "end" for this "else" came higher up, so that the lP* images could be written out. Let me know if you need a hand with this.

Also note that the progress bar is not appropriately closed out nor the timing reported, as is done in the code below starting at if bWin, spm_progress_bar('Clear')....

Copy link
Author

@felipegb94 felipegb94 Mar 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicholst ,
Sorry for the delay. I will get to this this weekend.
Thanks!
Felipe

Copy link
Author

@felipegb94 felipegb94 Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nicholst,

I am trying to figure out where to end the if/else statement that decides to run RapidPT or SnPM, to make sure that all the bookkeeping done at the end is taken care of. I was thinking in adding it around line 1109 in snpm_cp:

https://github.com/felipegb94/SnPM-devel/blob/master/snpm_cp.m#L1099-L1111

So, right after the variable nP is divided by the number of permutations. Since I don't keep track of nP in my code I need to figure out how to compute it. When nP is defined:

nP    = zeros(1,WorkDim);	%-Nonparam P's

Does that mean that those are the nonparametric p-values? Are those the corrected p-values? I am not sure if they are because SnPMucp is set to nP and then saved in line 1109.

Thanks!

@nicholst
Copy link
Collaborator

Hi @felipegb94,

I had a chance to run some data through it, and I'd actually a bit worried... there seems to be a systematic underestimation of the tail. I did a 2-sample t-test on some data I have kicking around, and the compared it with code snippet below.

At least on this data, for two random runs, I find the RapidPT MaxT gives a 5% threshold that is lower (4.6 or 3.2% lower, as a percentage of SnPM's threshold). Also, if I take RapidPT's 95%ile, and use that as a threshold with SnPM's permutation distribution, I find it gives either a P-value of 0.0807 or 0.0675 in these two runs. Note that the 95% Monte Carlo CI's here are 0.05+[-1,1]*sqrt(0.05*0.95/10000) or [0.0478, 0.0522], so both of seem pretty far off to me.

What do you think is going on here?

Sn=load('TwoSampT-12-12-SnPM2/SnPM');
PT=load('TwoSampT-12-12-RPT2/SnPM');
figure;plot(sort(Sn.MaxT(:,1)),1e-4:1e-4:1,sort(PT.MaxT(:,1)),1e-4:1e-4:1);legend('SnPM','RapidPT');title('CDF comparison')
figure;plot(sort(Sn.MaxT(:,1)),sort(PT.MaxT(:,1)),'b+'); refline(1,0)
xlabel('SnPM MaxT');ylabel('RapidPT MaxT')
(prctile(Sn.MaxT(:,1),95)-prctile(PT.MaxT(:,1),95))/prctile(Sn.MaxT(:,1),95)*100  % 4.6% diff in threshold
(sum(Sn.MaxT(:,1)>=prctile(PT.MaxT(:,1),95))+1)/(size(Sn.MaxT,1)+1)  % SnPM's P-value for RapidPT's threshold: 0.0807

cdfcomp
qqcomp

@cmaumet
Copy link
Member

cmaumet commented Mar 24, 2017

@felipegb94: test data for MATLAB 2016b is now available at https://github.com/SnPM-toolbox/SnPM_test_data. Let us know if you have any issues using them.

@felipegb94
Copy link
Author

Hi @nicholst ,

Sorry for the delay in getting back to you regarding the underestimation of the tail. I will send you an email tomorrow with a dropbox with a similar analysis to the one you did, but on various datasets (7-7, 15-15, 25-25, 50-50, 100-100, and 200-200). I am still waiting for some of the runs to finish. The 7-7 dataset comes from the SnPM_test_data.

Thanks,
Felipe

@felipegb94
Copy link
Author

Hi @nicholst ,

I am sorry for taking so long to get back to you but I had a few delays due to travel, other commitments, and trying to do a good amount of experiments to explain what the issue was.

You can find all the experiments/results in a dropbox folder I shared with you on an email recently. You can also find the summarized results in this google doc..

Your experiments showed that RPT was consistently underestimating the tail of the max null (i.e less conservative test). In the paper, for datasets >= 50 subjects, we reported the opposite (i.e more conservative test). Our evaluations for the paper focused on the medium-larger size dataset (50-400 subjects) because that is where the speedup gains are valuable (a few minutes vs. a few hours, or a few hours vs. a few days). In smaller datasets (e.g 10-30 subjects) the speedup gains are just a few minutes difference, so at that point the user should be doing regular permutation testing. Hence, this regime was not thoroughly tested in the experiments

The small errors in the t-threshold estimation are a sub-sampling artifact. If we sub-sample at low rates there is a speedup/accuracy trade-off. In the paper what we show is that the accuracy trade-off is bounded and is small enough that for a 10x speedup (a few minutes vs. a few hours) it might be worth it.

I did more experiments following the same evaluation procedure that you shared with me, to illustrate the point. If you take a look at the google doc in the p-value tables you will see that for the smaller datasets (N = [14,24,30]]) RPT t-thresh is less conservative. On the other hand for larger datasets (N=[50,100,150,200,300,400]) RPT t-thresh is slightly more conservative.

Furthermore, I did a set of experiment to illustrate the fact that as we increase the sampling rate oscillates between the correct p-value like a monte carlo method would do. You can see the results in the google doc.

To address the above issues and give as much information to the user, we can make the plugin such that it tells the user that they should use RPT only when dealing with medium/larger datasets. Further, we could add a warning that suggests turning off RPT if the user is dealing with a small dataset (for instance, N < 30, V < 100,000). Would this be an ok plan?

Thanks,
Felipe

@felipegb94
Copy link
Author

In the last two commits I made changes that write out the correct images lP{+,-}, lP_FDR{+,-}, and lP_FWE{+,-}. The lP_FWE{+,-}, I believe that are being written correctly because I ran them through the SnPM inference the same MIP + Table report. The lP{+,-} and lP_FDR{+,-} images reports do not match. I believe it is because I am not keeping track of the nP correctly. Currently I update nP in the following way

  1. Calculate the test statistics with the relabelling.
  2. Compare them with the test statistics obtained from the original labels.
  3. If they are larger add 1 to corresponding index in nP.

This is all done in RapidPT.m in line 179. Is that correct?

Thanks!
Felipe

@felipegb94
Copy link
Author

Hi @cmaumet and @nicholst ,

It has been a while! I hope things are good with both of you.

I know it has been quite some time since this pull request, but I am getting close to graduation, and I wanted to try to tie any loose ends on my grad school projects.

I was wondering if there are any minor changes I can do that would make the PR go through, or if this is not possible due to some other major concerns.

Thank you!

Best,
Felipe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants