Repos for RexRocket, and how to apply ML to high-dimensional problems with a small sample count
Planned functions for the library:
Pre/post-processing
- Data coupling --> with genetic databases
- Data imputation
- Probeset cleaning
- Cohort/lab correction
- Gene prioritisation (top N genomes)
- Patient clustering
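
A minimal sketch of what the cohort/lab correction step could look like, assuming a naive per-cohort location/scale adjustment (each feature is standardised within its own cohort). The function name `center_scale_per_cohort` and the toy data are hypothetical and not part of the library. Note that this also removes any real biological difference that is confounded with the cohort label.

```python
import numpy as np

def center_scale_per_cohort(X, cohorts, eps=1e-8):
    """Standardise every feature within each cohort (naive location/scale adjustment)."""
    X = np.asarray(X, dtype=float)
    cohorts = np.asarray(cohorts)
    X_adj = np.empty_like(X)
    for cohort in np.unique(cohorts):
        mask = cohorts == cohort
        mean = X[mask].mean(axis=0)
        std = X[mask].std(axis=0)
        # eps guards against division by zero for constant features
        X_adj[mask] = (X[mask] - mean) / (std + eps)
    return X_adj

# Toy data: two hypothetical labs, the second with an offset and a larger spread.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 5)),
    rng.normal(3.0, 2.0, size=(30, 5)),
])
cohorts = np.array(["lab_a"] * 20 + ["lab_b"] * 30)
X_adj = center_scale_per_cohort(X, cohorts)
```
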
Analysis methods
- PCA, LDA, PLS, QDA, Autoencoding
- self-organising maps
- Hierarchical clustering
- t-SNE, isomap, mds, umap
- affinity propagation, community detection
- cancer similarity based on open data
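
For the PCA entry, a small sketch of how the number of components could be chosen from the cumulative explained variance. It uses a plain NumPy SVD; the function name `n_components_for_variance`, the 0.99 threshold and the toy data are assumptions for illustration only.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # PCA works on centred data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(s)), cumulative

# Toy high-dimensional, low-sample data: 10 samples, 50 features, 2 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(10, 2))
loadings = rng.normal(size=(2, 50))
X = latent @ loadings + rng.normal(scale=0.01, size=(10, 50))

k, cumulative = n_components_for_variance(X, threshold=0.99)
print(k, cumulative[:3])
```
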
Prediction
- ensemble learning
- deep learning, both for classification and regression.
- simple (but descriptive) methods: GPC, lSVM, LR etc.
- tree-based algorithms: extraTrees, random forest, C5.0, CART, XGB, LightGBM, EBM
- novel Cluster-enhanced extremely-biased estimator (CEBE)
- multi-omic modelling (networks/hierarchy of models)
Hyperlearning
- simulated annealing
- genetic algorithm
- Bayesian optimisation
- grid search
- random selection
- successive halving, hyperband
- neural architecture search
- active learning -> output difficult classes and output test samples that need labeling (interactive)
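
To make the successive halving entry concrete, here is a minimal pure-Python sketch: score all candidate configurations on a small budget, keep the best `1/eta` fraction, multiply the budget by `eta`, and repeat until one remains. `successive_halving`, `toy_evaluate` and the `lr` search space are hypothetical names chosen for this example; in practice `evaluate` would train a model for `budget` epochs or samples and return a validation score.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Return the configuration that survives repeated top-1/eta selection."""
    survivors = list(configs)
    if not survivors:
        raise ValueError("need at least one configuration")
    budget = min_budget
    while len(survivors) > 1:
        # Higher score is better; evaluate every survivor at the current budget.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

rng = random.Random(0)

def toy_evaluate(cfg, budget):
    # Hypothetical objective: best at lr = 0.1, with noise that shrinks as budget grows.
    return -abs(cfg["lr"] - 0.1) + rng.gauss(0.0, 0.01) / budget

candidates = [{"lr": 10 ** rng.uniform(-4, 0)} for _ in range(16)]
best = successive_halving(candidates, toy_evaluate)
print(best)
```

Hyperband runs several such brackets with different trade-offs between the number of configurations and the starting budget.
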
Visualisation
- gene importance using graphs
- gene cluster identification
- patient cluster identification
- addition of image analysis/classification and the combination with genomic expression profiles
- increase robustness: apply data augmentation such as affine transformations, after mapping the genome vector to a surface https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
- apply biased estimator using aggregations of genomic vectors
Science:
- Dr. Harry Groen (Lung cancer)
- Dr. Casper van Eijck (Pancreas cancer)
- Dr. Jules Meijerink (Leukemia)
- Dr. Mohammed El Kebir; computational biologist
- Dr. Gunnar W. Klau; computational biologist
- Dr. Marc Deisenroth; trust and transparency in ML
- Dr. Peter Hinrich; project bios/bbmri shared data storage/processing for diabetes
Technology:
- NLeScienceCenter: Dr. Adriënne Mendrik
- SURF-SARA: Dr. Peter Hinrich, can help to set-up the data infrastructure.
Facilitator:
- Tjebbe Tauber
Business angels:
- Helmuth van Es: multiple gentech companies
- Fred Smulders: entry point into the Rockstart accelerator?
Partners:
- hospitals?
- government?
- https://mytomorrows.com/nl/
Comparables/competitors:
- Precision profile: http://precisionprofiledx.squarespace.com/product-portfolio/
People:
- (m) Tjebbe Tauber (inspiration wizard/connector)
- (m) Bram van Es (data science/q.a./privacy)
- (m) Sebastiaan de Jong (machine learning)
- (m) Evgeny (devops and machine learning)
- (f) Nela Lekic (graph analysis/machine learning)
- (f) Elizaveta Bakaeva (data analysis/visualisation)
- xxx (data viz/UX)
- xxx (bio-statistician)
- (f) Bo (NLP)
- PwC, DNB, AirBnB
Sources: https://gdc.cancer.gov/ https://www.ncbi.nlm.nih.gov/ https://www.kaggle.com/c/santander-value-prediction-challenge --> high dimensionality, low sample count https://databricks.com/product/genomics
- XGBOOST
- DNN
- CNN
- RVM
- simple noise addition to increase robustness (uniform distribution, single value range for entire matrix)
- lightGBM
- generate table with classification per patient, per classification method => send to Jules
- top-genome selector => send to Jules
- ROC/confusion matrix visualiser
- patient similarity
- add false positive rate (sklearn.feature_selection.SelectFpr)
- n-repetitions and bagging of stochastic methods (i.e. varying seeds)
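
A sketch of the noise-addition idea from the list above: uniform noise with a single value range for the entire matrix, next to the element-wise relative variant that appears further down the roadmap. The function names, the `augment` helper and the toy data are hypothetical. Noisy copies should only be added to the training split, otherwise near-duplicates leak into validation.

```python
import numpy as np

def add_uniform_noise(X, half_range, rng):
    """Absolute noise: one uniform range [-half_range, half_range] for the whole matrix."""
    return X + rng.uniform(-half_range, half_range, size=X.shape)

def add_relative_noise(X, fraction, rng):
    """Element-wise noise bounded by `fraction` of each entry's absolute value."""
    return X + rng.uniform(-1.0, 1.0, size=X.shape) * fraction * np.abs(X)

def augment(X, y, n_copies, noise_fn):
    """Stack the original samples with n_copies noisy versions; labels are repeated."""
    X_aug = np.vstack([X] + [noise_fn(X) for _ in range(n_copies)])
    y_aug = np.concatenate([y] * (n_copies + 1))
    return X_aug, y_aug

rng = np.random.default_rng(42)
X = rng.normal(5.0, 2.0, size=(6, 4))   # toy expression matrix: 6 patients, 4 genes
y = np.array([0, 0, 0, 1, 1, 1])

X_aug, y_aug = augment(X, y, n_copies=3,
                       noise_fn=lambda M: add_uniform_noise(M, 0.1, rng))
X_rel = add_relative_noise(X, 0.05, rng)
```
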
Complexity: 1, 3, 5, 7, 13
- x, [ ] Functionality; cancer type detector
- x, [ ] Functionality; cancer phase detector
- x, [ ] Functionality; cancer pathway estimator: Similarity Network Fusion (SNF), JNMF, Selective Cross-correlation.
- x, [ ] Functionality; gene importance estimator and general factor importance tool: from weights, importance, variance explained to combinatoric importances (branch-wise importances)
- x, [ ] Functionality: Counter-factual explanations (what-if scenarios)
- x, [ ] Functionality: survival estimator
- 5 [ ] api, GEO DataSets lib integration
- 5 [ ] api, TCGA integration
- 7 [ ] ux, Make GEO datasets interactive
- 7, [ ] ux, user-friendly way to set-up pipelines
- x [ ] io, add support for .vcf mutation data
- 5, [ ] io, add genome/probeset/protein/miRNA/methyl mapping function, use docker with db (such as MonetDB, Druid or SparkSQL)
- x, [ ] io, add containers for Neo4j
- 113, [ ] ux/io/viz, build web interface around Superset/Druid
- 15 [ ] ml, add specific outcome uncertainty to estimate accuracy: from the validation set extract a relationship between precision and the uncertainty interval, also consider Conformal Predictions.
- 20 [ ] ml, add multi-omic combiner class: start with concatenation-based approaches
- 20 [ ] ml, add similarity class: intra and inter omic.
- 20 [ ] ml, multi-modal learner
- 10 [ ] ml, add single splitting method: split based on modes or median with simple accuracy check
- 10 [ ] ml, add super separation scorer: Combine normalised Wasserstein with classical statistical tests
- 10 [ ] ml, Denoising Autoencoder
- 10 [ ] ml, Bayesian deep learning https://www.youtube.com/watch?v=dj-FKXxy7HQ
- 10 [ ] ml, Factorisation machine for imputation
- 10 [ ] ml, DeepBagNet (see Approximating CNNs with Bag-of-local-features models..)
- 10 [ ] ml, search for differential pairs/triplets/quartets/etc.. : (1) for each n-level prune based on variance-minimum, (2) for each n+-level prune based on minimum differential expression
- 30 [ ] ml, Random Forest with Oblique splits (as opposed to orthogonal splits) (more accurate for numerical data)
- 30 [ ] ml, use DeepRec for treatment recommendation
- 30 [ ] ml, add Graph neural networks (GraphSAGE, DiffPool) for multi-omic analysis, Decagon lit
- 5 [ ] ml, add Generalised Additive Models (GAM)
- 5 [ ] ml, add Explainable Boosting Machine (EBM)
- 30 [ ] ml, add Neural Conditional Random Field (NCRF)
- 20 [ ] ml, add factorisation machines (FFM) for imputation, https://github.com/aksnzhy/xlearn
- 10 [ ] ml, Lasso, ElasticNet
- 20 [ ] ml, add Supersparse linear integer models (SLIM) https://arxiv.org/abs/1502.04269
- 10 [ ] ml, feature augmentation:
  - add transformations of the features
  - add cluster-id from UMAP on raw data
  - add cluster-id from graph clustering on similarity data
  - add feature combinations
- 3, [ ] ml, PCA/LDA number of components selector.
- 5, [ ] ml, add Generalised Additive Models --> only works for limited number of features. readme
- 21 [ ] ml, add support for AutoKeras
- 3, [ ] ml, add frequent item-set analysis: association rules, A-priori, PCY (multi-stage/hash)
- 3, [ ] ml, add factor analysis, gaussian random projection, sparse random projection
- 3, [ ] ml, add coefficient retrieval for LDA
- 7, [ ] ml, add hyperoptimisation routine: successive halving (hyperband), grid search, differential evolution (scipy), bayesian opt (optuna)
- 3, [ ] ml, FDR/MW-U loop function with noise addition to get top genomes without creating a model
- 3, [ ] ml, add tree-based cumulative importance threshold for top genome selection
- 20, [ ] ml, add significant factor extractor:
  - combine Kruskal-Wallis H with MW-U/FDR/FPR/KS
  - 2-sided Kolmogorov-Smirnov
  - PCA for variance explained --> sum (absolute) coefficients per feature
  - LDA for separation explained --> sum (absolute) coefficients per feature
  - linear SVM/Logistic Regression: sign of importances
  - tree methods for importances (use permutation importances (shap, rfpimp))
- 1, [ ] ml, add RFECV
- 30 [ ] ml/ux, add support for Snorkel
- 10, [ ] ml, add semi-supervised module (useful in case there is unlabeled data)
- 3, [ ] ml, element-wise noise addition using relative value range (n percentage of absolute value)
- 3, [ ] ml, add relative noise-level
- 3, [ ] ml, patient clustering ==> all genomes, reduced
- 7, [ ] ml, genome clustering/community detection ==> Sparse Affinity Propagation, Girvan-Newman Algorithm, Markov clustering, Edge Betweenness Centrality
- 10, [ ] ml, GAN to generate cancerous genomic profiles
- 7, [ ] ml, UMAP / Hierarchical t-SNE / HDBSCAN / Diffusion Maps / OPTICS / Sammon mapping / LTSA, source
- 3, [ ] ml, add other decision tree methods: FACT, C4.5, QUEST, CRUISE, GUIDE
- 13, [ ] ml, bias corrector class: COMBAT, PCA (EIGENSTRAT), DWD, L/S
- 20, [ ] ml, patient/sample similarity/clustering based bias detection
- 13, [ ] ml, bias detection class: between class KS/MW-U/Wasserstein/KL-divergence
- 13, [ ] ml, outlier detector/removal: isolation forest, one-class SVM
- 13, [ ] ml, add Kernel Discriminant Analysis as a non-linear feature reducer
- x, [ ] ml, add measuring bias detector (multiple datasets as inputs)
- 20, [ ] ml, CEBE: Cluster-enhanced extremely biased estimator
- 20, [ ] ml, HYCUB: Sparse hypercube probability map
- 13, [ ] ml, PAM method (bioinformatics) http://statweb.stanford.edu/~tibs/PAM/
- 5, [ ] ml, add ICA for genome separation, http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html; with ICA you can find commonalities in different groups.
- 20, [ ] ml, Sparse ICA/Sparse CCA/Sparse PLS/joint NMF module for multi-omic feature analysis
- 10, [ ] ml, polynomial expansion module for multi-omic feature combinations
- 7, [ ] ml, add SOM for genome separation
- 10, [ ] ml, add Occam factor function to extract approximation of model complexity
- 3, [ ] ml, multilayer sparse auto encoding for pre-processing and feature detection, and DAE for denoising
- x, [ ] ml, add iCluster(?), in R
- 5, [ ] ml, conditional survival estimator, i.e. add a Bayesian/GP regressor, Kaplan-Meier
- 13, [ ] ml, refactor/optimize: Cython, numba, static defs, parallelise, modularize
- x, [ ] ml, add healthy patient reference routine
- x, [ ] ml, healthy tissue/unhealthy tissue
- x, [ ] ml, add disease dependent measurement error detector/filter
- x, [ ] ml, add option for nested cross-validation
- x, [ ] ml, add a posteriori accuracy checker
- x, [ ] ml, add Automatic Relevance Determination (ARD), Bayesian Discriminative Modelling.
- x, [ ] ml, add support for image based classification: test on kaggle set, https://www.kaggle.com/c/data-science-bowl-2018/data
- x, [ ] ml, add support for time series based classification: test on EEG kaggle set, https://www.kaggle.com/c/grasp-and-lift-eeg-detection MyFly (CNN, LSTM): add TCN, GRU support
- x, [ ] ml, add "deep dreaming" or sample generator functionality: given a classification label, generate a representative sample.
- 15, [ ] ml, Add graph abstraction: source, source, MST (Kruskal)
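
One possible reading of the "tree-based cumulative importance threshold for top genome selection" item, sketched with NumPy only: sort features by importance and keep the smallest set whose normalised importances sum to at least the threshold. The function name, the gene labels and the importance values are made up for illustration; in practice the importances would come from a fitted tree ensemble.

```python
import numpy as np

def select_by_cumulative_importance(importances, names, threshold=0.9):
    """Return the top-ranked feature names whose cumulative share of the
    total importance first reaches `threshold`."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]            # most important first
    share = importances[order] / importances.sum()   # normalise to sum to 1
    cumulative = np.cumsum(share)
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    n_keep = min(n_keep, len(order))
    return [names[i] for i in order[:n_keep]]

# Hypothetical importances for five genes.
names = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
importances = [0.50, 0.05, 0.30, 0.10, 0.05]

selected = select_by_cumulative_importance(importances, names, threshold=0.85)
print(selected)
```
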
- 3, [ ] viz, add missing data visualizer, https://github.com/ResidentMario/missingno
- 3, [ ] viz, add tree visualiser, https://github.com/parrt/dtreeviz
- 5, [ ] viz, add parallel coordinates to visualise 'pathways': inflate height on dim axes by taking Hadamard power.
- ? [ ] viz, visualisation of training process
- 3, [ ] viz, add plot (expression value, importance/coefficient) group by classification, labelled with genome, use Bokeh
- 3, [ ] viz, add plot (number of genomes, versus accuracy)
- x, [ ] viz, add graph visualisation (intra-similarity of most prominent genomes, per label)
- x, [ ] viz, add quiver visualisation for genomes, also see https://distill.pub/2018/building-blocks/
- x, [ ] viz, add LIME/DeepLift visualisation for model explanations of neural nets (https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf)
- x, [ ] viz, add SHAP visualisation for the model explanation of tree methods
- x, [ ] viz, add visualisation of cumulative importance of tree branches.
- x, [ ] viz, add tree interpreter (permutation importances) ELI5 https://github.com/TeamHG-Memex/eli5
- x, [ ] viz, add model interpreter (Shapley values) SHAP https://github.com/slundberg/shap
- x, [ ] viz, add Additive feature attribution methods
- x, [ ] viz, model explainability: using L2X, QII and additive index models (xNN)
- x, [ ] viz, train simple model on complex model (GBT -> single DT regressor on predicted probabilities)
- x, [ ] viz, partial model dependence plots for clinical data and individual conditional expectation
- x, [ ] viz, add correlation graphs: corr --> networktools
- x, [ ] viz: https://www.kaggle.com/kanncaa1/rare-visualization-tools
- x, [ ] viz: https://www.kaggle.com/mirichoi0218/classification-breast-cancer-or-not-with-15-ml
- 5, [ ] viz, routine to generate heatmap tables
- 20, [ ] viz, Treat as 2D classification problem, and visualize with Quiver, get inspiration from this playground
- 5, [ ] viz, top-genome visualiser: top-N list -> hierarchical (agglomerative) clustering with seaborn
- 10, [ ] viz, genome/patient clustering using Vega, Altair or D3js
- x, [ ] viz, add wrapper for [circos](http://circos.ca/)
- x, [ ] viz, add lgbm/xgb/rf model visualisation
- x, [ ] viz, Datawrapper, LocalFocus, Flourish, Dash
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10558
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83744
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL97
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52581
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11863
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31586
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66499 !
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80796 !
- https://python-graph-gallery.com/405-dendrogram-with-heatmap-and-coloured-leaves/
- search engine for medical documents: hierarchical/DT based, human-in-the-loop
- use entity linking to fetch relevant journal papers
- build domain specific word embeddings for medical graph search
- use Siamese neural-network to get rid of the cohort bias
- use Kubeflow for pipelining
- add meta classifier: UMAP embedding+Convex hull+MSP+SVM,
- add meta classifier (see Matching Nets, and Relation Networks and Prototypical Networks): SAE or UMAP embedding+class-matching (Rank correlation, similarity, Wasserstein distance or softmax of distance) with Barycentered sample. Also see this overview.
- add MAML/Reptile to speed up learning
- add image-caption generator to evaluate images?
- add functionality for the practitioner to draw a decision plane to manually create a predictor
- add visualisation of phenotypical manifolds in omics-space and position of patient in that space.
- Neuro-conditional random field for tumor detection (research.baidu.com/Blog/index-view?id=104)
- contact https://turbine.ai/: they can simulate the effect of anti-tumour medication
- WBSO https://www.ugoo.nl/wbso-subsidie/wbso-subsidiecheck/?gclid=Cj0KCQiAzfrTBRC_ARIsAJ5ps0uImsv_6m-NiWK_jod-_XaW-8exS616zNvqDH_Pojs6MayyepqhT58aAgdiEALw_wcB
- SIDN https://www.sidnfonds.nl/aanvragen/internetprojecten
- KPN/Menzis/Monuta: https://fd.nl/economie-politiek/1239055/nieuw-fonds-met-durfkapitaal-voor-zorgstart-ups
- Blue Sparrows MedTech Fonds
- eScience https://www.esciencecenter.nl/funding/big-data-health