-
Notifications
You must be signed in to change notification settings - Fork 10
Add RapidPT to SnPM #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
I just realized that there are still a few files inside RapidPT_min that are not used. I will fix that. Also the grasta library has a ton of files that are not really used by RapidPT but are part of the library. Should I remove them or just leave them there? |
@felipegb94 Thanks for this! I'm going to review this with @cmaumet who helps me maintain SnPM. If you can easily find the minimum set of files to use from grasta (whose library is that?) then, sure. The one change I'm going to propose right away, though, is having a more direct switch so people can try this method. For example, imagine someone that has 1000+ subjects with high-resolution VBM data, 10,000 perms may cause memory problems but they still may want to try the method. Conversely, people may want to compare their results with 10,000 'real' permutations. I'm not sure where to put it in the interface, but for now I'd use a defaults value. E.g. in % Use matrix completion acceleration?
%------------------------------------------------------------------------
SnPMdefs.RapidPT = 0; % 0 = Always off; 1 = On if nPerm>=10000; 2 = Always on. And then you can check that and act correspondingly in Also, when the smoke clears, by the time snpm_cp is over, there should be a simple way to check whether the approximate method was used. E.g. Have a variable RapidPT that is saved, i.e., in your conditional add s_SnPM_save = [s_SnPM_save ' RapidPT']; % Save actual RaptidPT usage (0 or 1) @cmaumet - How does all this look to you? |
@nicholst @cmaumet I removed all the unecessary RapidPT and GRASTA files from RapidPT_min. So right now that folder only has the code we need, some license files, and a README. Just for Reference RapidPT can be found here: RapidPT Thanks! |
Thanks for that @felipegb94 ! But just to be clear, the user should always have the option to override, i.e. to (a) force RapidPT to be used or (b) force RapidPT to be not used (this is what I'm looking at https://github.com/nicholst/SnPM-devel/pull/28/files#diff-a366d644f7db2228d4719aa2e8b51dc1R295 ) |
@nicholst
In the end, the user will specify if he wants to force to use RapidPT (SnPMdefs.RapidPT = 2), if he does not care (SnPMdefs.RapidPT = 1), or if he does not want to use it (SnPMdefs.RapidPT = 0). Therefore if the user wants to use RapidPT he has to change the SnPMdefaults, or we can have the default value be 1, i.e the user does not care and we pick when he uses it. Is this what you were thinking? Thanks! |
@felipegb94 - Yes! Thanks, exactly. Also, btw, don't change I'll go over this with @cmaumet and merge it as soon as possible, but will probably take a week or so as we're both busy the rest of this week. |
@felipegb94: thank you for this pull request! It looks like your branch cannot be merged at the moment as it has some conflicts with the |
Hi @felipegb94, In reviewing your code I was checking your license and realised that SnPM has been utterly unclear about its license. I'm usually partial to MIT but L-GPL seems fine to me. HOWEVER, my post doc @cmaumet pointed out that since SPM is vanilla GPL, SnPM has to be GPL! This was news to me! So this means you're in the same boat, I'm afraid... as we depend on SPM, and you depend on SnPM, the viral nature of the GPL means you must be GPL as well. You should discuss this with Vikas but it would seem you have no choice but to switch to GPL. |
Hi @nicholst , I will look into this and talk to Vikas and Vamsi. But I think that GPL software can depend on software under the MIT license. What cannot happen is MIT licensed software to be distributed depending on software under the GPL license. For instance, RapidPT depends on Grasta, so if Grasta was GPL, RapidPT would have to be GPL. Now the question is how do you define that a software depends on another one. In the case of spm and snpm, I would think that spm depends on snpm since snpm is a toolbox that can be added to extend functionality. I am not an expert on this but I can look into this more. In the end, I am ok with making RapidPt GPL if that means that it can be added to snpm. I will follow up on this after I talk with Vikas. Thanks for pointing this out! |
In the case of spm and snpm, I would think that spm depends on snpm since snpm is a toolbox that can be added to extend functionality.
Did you flip this? Clearly SnPM depends on SPM... If SnPM vanished tomorrow it doesn't affect SPM at all. Likewise, this particular RapidPT implementation depends on SnPM and the virality of the GPL kicks in.
|
@felipegb94, @nicholst: just to add to this discussion, (I am not a licence expert but) here is what I would suggest:
Would this sound reasonable? |
Conflicts: snpm_defaults.m snpm_pi_TwoSampT.m
remove GetLabelsMatrices (now done directly in snpm_pi_TwoSampT.m). switch SnPMdefs in snpm_cp to snpm_get_defaults
Merge related: Resolved merge conflicts and removed a function I wrote that was not needed after the merge. License Related: Regarding the license discussion. I think what @cmaumet is proposing makes sense. I read up on GPL and it says that a GPL-licensed package can use MIT and L-GPL licensed libraries (i.e they are compatible). The stand-alone version of RapidPT will have an MIT license since it will not depend on SnPM or SPM. Overall, the full SPM/SnPM package will be released with a GPL license. But stand-alone RapidPT will have an MIT License. Does this work? Docs Related: I would like to add a short paragraph to explain when and how to enable RapidPT. Should I add that to the Readme? Thanks! |
Hi @felipegb94, I am very sorry for being so slow to reply to your messages. You are our first external contributor on GitHub and we are still working out how to best handle new contributions. Thank you again for this pull request! And very sorry for the delay. Here are a few things that I think we should focus on in order to merge this PR:
So, I am planning to focus on (1) and (2): to make the test data publicly available and write up a contributing guideline. @nicholst: would you be okay to do the code review (3)? @felipegb94: if you want to get started on (2) maybe you could start designing/implementing a test for your additional feature (e.g. of code for existing tests is available in https://github.com/nicholst/SnPM-devel/tree/master/test)? Let me know if this sounds okay. Thank you! |
Hi @cmaumet , I am trying to write the test for rapidpt, but it keeps crashing. I am not even able to run any of the regular tests such as:
Do you think that is a possibility? This is my unfinished test code: https://github.com/felipegb94/SnPM-devel/blob/master/test/test_rapidpt.m I created a fork of the Thanks! |
@felipegb94: Yes, that's right! We have not run the tests with |
@cmaumet , Sounds good. I will try to test with other Matlab versions (hopefully I can get a hold of Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @felipegb94 for these mods! In addition to felipegb94#1 please look at these
snpm_cp.m
Outdated
save(strcat('timings',runInfo),'timings'); | ||
|
||
% Save variables for snpm_pp | ||
MaxT = [MaxT',-MaxT']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that MaxT should have the correctly labeled Max and Min T stat in the first row; presently it seems like MaxT(1,1) and MaxT(1,2) are just arbitrary permutations.
ALSO, the second column is the negation of the minimum T values, i.e., the MaxT(:,2) should be all positive.
Could you revise this so that MaxT is compatible with snpm_cp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised MaxT so that MaxT(1,1) and MaxT(1,2) contain the correct MaxT.
See: nicholst@2f3c7f9
snpm_cp.m
Outdated
%----------------------------------------------------------------------- | ||
nP = []; | ||
|
||
if(UseRapidPT >= 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is bhPerms
used at all? SnPM constructs a MaxT
that has first column that is max T stat and second column that is minus the the min T. As a two-sample t-test does not assume distributional symmetry, you wouldn't necessarily want to automatically merge these two columns. Only under certain settings, specifically when all permutations are exhaustively computed does SnPM set bhPerm=1
.
So when bhPerm=0
, MaxT(:,1) is only used to make inferences on increases, MaxT(:,1) is only used to make inferences on decreases (multiplied by -1). WhenbhPerm=1
, we allow ourselves to construct the max T distribution out of MaxT(:).
I assume you probably don't want to assume distributional symmetry (and you're not exhustive), so you should set bhPerm=0
at the top of this RapidPT code block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added MinT computation to code. Second col of MaxT is -MinT.
See: nicholst@2f3c7f9
snpm_cp.m
Outdated
MaxT = [MaxT',-MaxT']; | ||
% Calculate uncorrected p-vals | ||
[~, SnPMucp, ~, ~] = ttest2(X(1:params.nGroup1, :), X(params.nGroup1+1:end, :), 0.05, 'both', 'unequal'); | ||
save('SnPMucp.mat','SnPMucp') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for saving SnPMucp.mat, XYZ.mat, SnPMt.mat and updating SnPM.mat here. However, there is more bookkeeping that is done starting at the comment "write out lP+ and lP- images" below. In particular, when RapidPT=1, for the images lP{+,-}, lP_FDR{+,-}, and lP_FWE{+,-}, the header files are written out above and then the img files are never created.
I would prefer it if "end" for this "else" came higher up, so that the lP* images could be written out. Let me know if you need a hand with this.
Also note that the progress bar is not appropriately closed out nor the timing reported, as is done in the code below starting at if bWin, spm_progress_bar('Clear')...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nicholst ,
Sorry for the delay. I will get to this this weekend.
Thanks!
Felipe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nicholst,
I am trying to figure out where to end the if/else statement that decides to run RapidPT or SnPM, to make sure that all the bookkeeping done at the end is taken care of. I was thinking in adding it around line 1109 in snpm_cp:
https://github.com/felipegb94/SnPM-devel/blob/master/snpm_cp.m#L1099-L1111
So, right after the variable nP
is divided by the number of permutations. Since I don't keep track of nP
in my code I need to figure out how to compute it. When nP
is defined:
nP = zeros(1,WorkDim); %-Nonparam P's
Does that mean that those are the nonparametric p-values? Are those the corrected p-values? I am not sure if they are because SnPMucp
is set to nP
and then saved in line 1109.
Thanks!
Hi @felipegb94, I had a chance to run some data through it, and I'd actually a bit worried... there seems to be a systematic underestimation of the tail. I did a 2-sample t-test on some data I have kicking around, and the compared it with code snippet below. At least on this data, for two random runs, I find the RapidPT MaxT gives a 5% threshold that is lower (4.6 or 3.2% lower, as a percentage of SnPM's threshold). Also, if I take RapidPT's 95%ile, and use that as a threshold with SnPM's permutation distribution, I find it gives either a P-value of 0.0807 or 0.0675 in these two runs. Note that the 95% Monte Carlo CI's here are What do you think is going on here? Sn=load('TwoSampT-12-12-SnPM2/SnPM');
PT=load('TwoSampT-12-12-RPT2/SnPM');
figure;plot(sort(Sn.MaxT(:,1)),1e-4:1e-4:1,sort(PT.MaxT(:,1)),1e-4:1e-4:1);legend('SnPM','RapidPT');title('CDF comparison')
figure;plot(sort(Sn.MaxT(:,1)),sort(PT.MaxT(:,1)),'b+'); refline(1,0)
xlabel('SnPM MaxT');ylabel('RapidPT MaxT')
(prctile(Sn.MaxT(:,1),95)-prctile(PT.MaxT(:,1),95))/prctile(Sn.MaxT(:,1),95)*100 % 4.6% diff in threshold
(sum(Sn.MaxT(:,1)>=prctile(PT.MaxT(:,1),95))+1)/(size(Sn.MaxT,1)+1) % SnPM's P-value for RapidPT's threshold: 0.0807 |
@felipegb94: test data for MATLAB 2016b is now available at https://github.com/SnPM-toolbox/SnPM_test_data. Let us know if you have any issues using them. |
Move RapidPT options interface to snpm_ui & the GUI.
Hi @nicholst , Sorry for the delay in getting back to you regarding the underestimation of the tail. I will send you an email tomorrow with a dropbox with a similar analysis to the one you did, but on various datasets (7-7, 15-15, 25-25, 50-50, 100-100, and 200-200). I am still waiting for some of the runs to finish. The 7-7 dataset comes from the SnPM_test_data. Thanks, |
Hi @nicholst , I am sorry for taking so long to get back to you but I had a few delays due to travel, other commitments, and trying to do a good amount of experiments to explain what the issue was. You can find all the experiments/results in a dropbox folder I shared with you on an email recently. You can also find the summarized results in this google doc.. Your experiments showed that RPT was consistently underestimating the tail of the max null (i.e less conservative test). In the paper, for datasets >= 50 subjects, we reported the opposite (i.e more conservative test). Our evaluations for the paper focused on the medium-larger size dataset (50-400 subjects) because that is where the speedup gains are valuable (a few minutes vs. a few hours, or a few hours vs. a few days). In smaller datasets (e.g 10-30 subjects) the speedup gains are just a few minutes difference, so at that point the user should be doing regular permutation testing. Hence, this regime was not thoroughly tested in the experiments The small errors in the t-threshold estimation are a sub-sampling artifact. If we sub-sample at low rates there is a speedup/accuracy trade-off. In the paper what we show is that the accuracy trade-off is bounded and is small enough that for a 10x speedup (a few minutes vs. a few hours) it might be worth it. I did more experiments following the same evaluation procedure that you shared with me, to illustrate the point. If you take a look at the google doc in the p-value tables you will see that for the smaller datasets (N = [14,24,30]]) RPT t-thresh is less conservative. On the other hand for larger datasets (N=[50,100,150,200,300,400]) RPT t-thresh is slightly more conservative. Furthermore, I did a set of experiment to illustrate the fact that as we increase the sampling rate oscillates between the correct p-value like a monte carlo method would do. You can see the results in the google doc. To address the above issues and give as much information to the user, we can make the plugin such that it tells the user that they should use RPT only when dealing with medium/larger datasets. Further, we could add a warning that suggests turning off RPT if the user is dealing with a small dataset (for instance, N < 30, V < 100,000). Would this be an ok plan? Thanks, |
In the last two commits I made changes that write out the correct images lP{+,-}, lP_FDR{+,-}, and lP_FWE{+,-}. The lP_FWE{+,-}, I believe that are being written correctly because I ran them through the SnPM inference the same MIP + Table report. The lP{+,-} and lP_FDR{+,-} images reports do not match. I believe it is because I am not keeping track of the nP correctly. Currently I update nP in the following way
This is all done in RapidPT.m in line 179. Is that correct? Thanks! |
It has been a while! I hope things are good with both of you. I know it has been quite some time since this pull request, but I am getting close to graduation, and I wanted to try to tie any loose ends on my grad school projects. I was wondering if there are any minor changes I can do that would make the PR go through, or if this is not possible due to some other major concerns. Thank you! Best, |
RapidPT_min includes only the necessary files for RapidPT. Under RapidPT/include/ you can find grasta 1.2 which is the library we rely on for low-rank matrix completion.
The main files modified were: