bramiozo/RexR

RexR

Repository for RexRocket: applying ML to high-dimensional problems with a small sample count

The library aims to provide the following functions:

Pre/post-processing

  • Data coupling --> with genetic databases
  • Data imputation
  • Probeset cleaning
  • Cohort/lab correction
  • Gene prioritisation (top N genomes)
  • Patient clustering

Analysis methods

  • PCA, LDA, PLS, QDA, Autoencoding
  • self-organising maps
  • Hierarchical clustering
  • t-SNE, isomap, mds, umap
  • affinity propagation, community detection
  • cancer similarity based on open data

Prediction

  • ensemble learning
  • deep learning, both for classification and regression.
  • simple (but descriptive) methods: GPC, lSVM, LR etc.
  • tree-based algorithms: extraTrees, random forest, C5.0, CART, XGB, LightGBM, EBM
  • novel Cluster-enhanced extremely-biased estimator (CEBE)
  • multi-omic modelling (networks/hierarchy of models)

Hyperlearning

  • simulated annealing
  • genetic algorithm
  • Bayesian optimisation
  • grid search
  • random selection
  • successive halving, hyperband
  • neural architecture search
  • active learning -> output difficult classes and output test samples that need labeling (interactive)
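
As an illustration of the successive halving item above, here is a minimal sketch in plain Python. It is not taken from the RexR code base: the function name, the `evaluate(config, budget)` callback and the `eta` and `min_budget` parameters are assumptions made for this example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Return the surviving configuration after repeated elimination rounds.

    Each round scores every surviving configuration at the current budget,
    keeps the best 1/eta of them, and multiplies the budget by eta.
    `evaluate(config, budget)` is a hypothetical callback; higher is better.
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("at least one configuration is required")
    budget = min_budget
    while len(survivors) > 1:
        # Rank by score at the current budget, best first.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta  # survivors get more resources in the next round
    return survivors[0]


# Toy usage: the "score" is highest for a learning rate of 0.3.
candidates = [i / 10 for i in range(10)]
best = successive_halving(candidates, lambda lr, budget: -(lr - 0.3) ** 2)
```

In a real setting `budget` would typically map to training epochs, boosting rounds or a sample fraction, so that weak configurations are discarded cheaply.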

Visualisation

  • gene importance using graphs
  • gene cluster identification
  • patient cluster identification

Possible upgrades

  • addition of image analysis/classification and the combination with genomic expression profiles

Possible techniques

Possible collaborations

Science:

  • Dr. Harry Groen (Lung cancer)
  • Dr. Casper van Eijck (Pancreas cancer)
  • Dr. Jules Meijerink (Leukemia)
  • Dr. Mohammed El Kebir; computational biologist
  • Dr. Gunnar W. Klau; computational biologist
  • Dr. Marc Deisenroth; trust and transparency in ML
  • Dr. Peter Hinrich; project bios/bbmri shared data storage/processing for diabetes

Technology:

  • NLeScienceCenter: Dr. Adriënne Mendrik
  • SURF-SARA: Dr. Peter Hinrich, can help to set up the data infrastructure.

Facilitator:

  • Tjebbe Tauber

Business angels:

  • Helmuth van Es: multiple gentech companies
  • Fred Smulders: entry point into the Rockstart accelerator?

Partners:

Comparables/competitors:

People:

  • (m) Tjebbe Tauber (inspiration wizard/connector)
  • (m) Bram van Es (data science/q.a./privacy)
  • (m) Sebastiaan de Jong (machine learning)
  • (m) Evgeny (devops and machine learning)
  • (f) Nela Lekic (graph analysis/machine learning)
  • (f) Elizaveta Bakaeva (data analysis/visualisation)
  • xxx (data viz/UX)
  • xxx (bio-statistician)
  • (f) Bo (NLP)
  • PwC, DNB, AirBnB

Sources:

  • https://gdc.cancer.gov/
  • https://www.ncbi.nlm.nih.gov/
  • https://www.kaggle.com/c/santander-value-prediction-challenge --> high dimensionality, low sample count
  • https://databricks.com/product/genomics

Done

  • XGBOOST
  • DNN
  • CNN
  • RVM
  • simple noise addition to increase robustness (uniform distribution, single value range for entire matrix)
  • lightGBM
  • generate table with classification per patient, per classification method => send to Jules
  • top-genome selector => send to Jules
  • ROC/confusion matrix visualiser
  • patient similarity
  • add false positive rate (sklearn.feature_selection.SelectFpr)
  • n-repetitions and bagging of stochastic methods (i.e. varying seeds)
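
The noise-addition and repeated-seed items in the list above could look roughly like the sketch below. It is an illustration, not the RexR implementation: the function names, the `level` parameter and the choice to derive one noise amplitude from the global value range of the matrix are assumptions.

```python
import numpy as np


def add_uniform_noise(X, level=0.05, seed=None):
    """Return a noisy copy of X with noise drawn from U(-a, a).

    A single amplitude a = level * (max(X) - min(X)) is used for the
    entire matrix (assumed reading of "single value range").
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    amplitude = level * (X.max() - X.min())
    return X + rng.uniform(-amplitude, amplitude, size=X.shape)


def noisy_replicates(X, n_repeats=5, level=0.05, base_seed=0):
    """Generate n_repeats noisy copies of X, each with a different seed."""
    return [add_uniform_noise(X, level, seed=base_seed + i) for i in range(n_repeats)]


# Toy expression matrix: 3 patients x 4 probesets.
X = np.arange(12, dtype=float).reshape(3, 4)
replicates = noisy_replicates(X, n_repeats=3, level=0.1)
```

Training one model per replicate and averaging the predictions is one way to combine this with bagging over seeds.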

TO DO

Complexity: 1, 3, 5, 7, 13

  • x, [ ] Functionality; cancer type detector

  • x, [ ] Functionality; cancer phase detector

  • x, [ ] Functionality; Image recognition, X-ray, MRI

  • x, [ ] Functionality; cancer pathway estimator: Similarity Network Fusion (SNF), JNMF, Selective Cross-correlation.

  • x, [ ] Functionality; gene importance estimator and general factor importance tool: from weights, importance, variance explained to combinatoric importances (branch-wise importances)

  • x, [ ] Functionality: Counterfactual explanations (what-if scenarios)

  • x, [ ] Functionality: survival estimator

  • 5 [ ] api, GEO DataSets lib integration

  • 5 [ ] api, TCGA integration

  • 7 [ ] ux, Make GEO datasets interactive

  • 7, [ ] ux, user-friendly way to set-up pipelines

  • x [ ] io, add support for .vcf mutation data

  • 5, [ ] io, add genome/probeset/protein/miRNA/methyl mapping function, use docker with db (such as MonetDB, Druid or SparkSQL)

  • x, [ ] io, add containers for Neo4j

  • 113, [ ] ux/io/viz, build web interface around Superset/Druid

  • 15 [ ] ml, add specific outcome uncertainty to estimate accuracy: from the validation set extract a relationship between precision and the uncertainty interval, also consider Conformal Predictions.

  • 20 [ ] ml, add multi-omic combiner class: start with concatenation-based approaches

  • 20 [ ] ml, add similarity class: intra and inter omic.

  • 20 [ ] ml, multi-modal learner

  • 10 [ ] ml, add single splitting method: split based on modes or median with simple accuracy check

  • 10 [ ] ml, add super separation scorer: Combine normalised Wasserstein with classical statistical tests

  • 10 [ ] ml, Denoising Autoencoder

  • 10 [ ] ml, Bayesian deep learning https://www.youtube.com/watch?v=dj-FKXxy7HQ

  • 10 [ ] ml, Factorisation machine for imputation

  • 10 [ ] ml, DeepBagNet (see Approximating CNNs with Bag-of-local-features models..)

  • 10 [ ] ml, search for differential pairs/triplets/quartets/etc.. : (1) for each n-level prune based on variance-minimum, (2) for each n+-level prune based on minimum differential expression

  • 30 [ ] ml, Random Forest with Oblique splits (as opposed to orthogonal splits) (more accurate for numerical data)

  • 30 [ ] ml, use DeepRec for treatment recommendation

  • 30 [ ] ml, add Graph neural networks (GraphSAGE, DiffPool) for multi-omic analysis, Decagon lit

  • 5 [ ] ml, add Generalised Additive Models (GAM)

  • 5 [ ] ml, add Explainable Boosting Machine (EBM)

  • 30 [ ] ml, add Neural Conditional Random Field (NCRF)

  • 20 [ ] ml, add factorisation machines (FFM) for imputation, https://github.com/aksnzhy/xlearn

  • 10 [ ] ml, Lasso, ElasticNet

  • 20 [ ] ml, add Supersparse linear integer models (SLIM) https://arxiv.org/abs/1502.04269

  • 10 [ ] ml, feature augmentation: - add transformations of the features - add cluster-id from UMAP on raw data - add cluster-id from graph clustering on similarity data. - add feature combinations

  • 3, [ ] ml, PCA/LDA number of components selector.

  • 5, [ ] ml, add Generalised Additive Models --> only works for limited number of features. readme

  • 21 [ ] ml, add support for AutoKeras

  • 3, [ ] ml, add frequent item-set analysis: association rules, A-priori, PCY (multi-stage/hash)

  • 3, [ ] ml, add factor analysis, gaussian random projection, sparse random projection

  • 3, [ ] ml, add coefficient retrieval for LDA

  • 7, [ ] ml, add hyperoptimisation routine: successive halving (hyperband), grid search, differential evolution (scipy), Bayesian opt (optuna)

  • 3, [ ] ml, FDR/MW-U loop function with noise addition to get top genomes without creating a model

  • 3, [ ] ml, add tree-based cumulative importance threshold for top genome selection

  • 20, [ ] ml, add significant factor extractor: -- combine Kruskal-H with MW-U/FDR/FPR/KS -- 2-sided Kolmogorov-Smirnov -- PCA for variance explained --> sum (absolute) coefficients per feature -- LDA for separation explained --> sum (absolute) coefficients per feature -- linear SVM/Logistic Regression: sign of importances -- tree methods for importances (use permutation importances (shap, rfpimp))

  • 1, [ ] ml, add RFECV

  • 30 [ ] ml/ux, add support for Snorkel

  • 10, [ ] ml, add semi-supervised module (useful in case there is unlabeled data)

  • 3, [ ] ml, element-wise noise addition using relative value range (n percentage of absolute value)

  • 3, [ ] ml, add relative noise-level

  • 3, [ ] ml, patient clustering ==> all genomes, reduced

  • 7, [ ] ml, genome clustering/community detection ==> Sparse Affinity Propagation, Girvan-Newman Algorithm, Markov clustering, Edge Betweenness Centrality

  • 10, [ ] ml, GAN to generate cancerous genomic profiles

  • 7, [ ] ml, UMAP / Hierarchical t-SNE / HDBSCAN / Diffusion Maps / OPTICS / Sammon mapping / LTSA , source

  • 3, [ ] ml, add other decision tree methods: FACT, C4.5, QUEST, CRUISE, GUIDE

  • 13, [ ] ml, bias corrector class: COMBAT, PCA (EIGENSTRAT), DWD, L/S

  • 20, [ ] ml, patient/sample similarity/clustering based bias detection

  • 13, [ ] ml, bias detection class: between class KS/MW-U/Wasserstein/KL-divergence

  • 13, [ ] ml, outlier detector/removal: isolation forest, one-class SVM

  • 13, [ ] ml, add Kernel Discriminant Analysis as a non-linear feature reducer

  • x, [ ] ml, add measuring bias detector (multiple datasets as inputs)

  • 20, [ ] ml, CEBE: Cluster-enhanced extremely biased estimator

  • 20, [ ] ml, HYCUB: Sparse hypercube probability map

  • 13, [ ] ml, PAM method (bioinformatics) http://statweb.stanford.edu/~tibs/PAM/

  • 5, [ ] ml, add ICA for genome separation, http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html (with ICA you can find commonalities in different groups)

  • 20, [ ] ml, Sparse ICA/Sparse CCA/Sparse PLS/joint NMF module for multi-omic feature analysis

  • 10, [ ] ml, polynomial expansion module for multi-omic feature combinations

  • 7, [ ] ml, add SOM for genome separation

  • 10, [ ] ml, add Occam factor function to extract approximation of model complexity

  • 3, [ ] ml, multilayer sparse auto encoding for pre-processing and feature detection, and DAE for denoising

  • x, [ ] ml, add iCluster(?), in R

  • 5, [ ] ml, conditional survival estimator. i.e. add a Bayesian/GP regressor, Kaplan-Meier

  • 13, [ ] ml, refactor/optimize: Cython, numba, static def's, parallelise, modularize

  • x, [ ] ml, add healthy patient reference routine

  • x, [ ] ml, healthy tissue/unhealthy tissue

  • x, [ ] ml, add disease dependent measurement error detector/filter

  • x, [ ] ml, add option for nested cross-validation

  • x, [ ] ml, add a posteriori accuracy checker

  • x, [ ] ml, add Automatic Relevance Determination (ARD), Bayesian Discriminative Modelling.

  • x, [ ] ml, add support for image based classification: test on kaggle set, https://www.kaggle.com/c/data-science-bowl-2018/data

  • x, [ ] ml, add support for time series based classification: test on EEG kaggle set, https://www.kaggle.com/c/grasp-and-lift-eeg-detection MyFly (CNN, LSTM): add TCN, GRU support

  • x, [ ] ml, add "deep dreaming" or sample generator functionality: given a classification label, generate a representative sample.

  • 15, [ ] ml, Add graph abstraction: source, source, MST (Kruskal)
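
To make the "tree-based cumulative importance threshold for top genome selection" item above concrete, here is a small sketch. It assumes the importances have already been obtained from some tree model and are passed in as a plain dict; the function name and the `threshold` parameter are hypothetical and not part of RexR.

```python
def select_by_cumulative_importance(importances, threshold=0.9):
    """Return the smallest top-ranked feature set reaching the threshold.

    `importances` maps feature name -> non-negative importance.
    Features are added in descending order of importance until their
    summed importance is at least `threshold` times the total.
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0
    for name, value in ranked:
        selected.append(name)
        cumulative += value
        if cumulative >= threshold * total:
            break
    return selected


# Toy importances (e.g. split counts from a tree ensemble).
toy = {"g1": 50, "g2": 30, "g3": 15, "g4": 5}
top = select_by_cumulative_importance(toy, threshold=0.75)
```

With the toy values, `g1` and `g2` together cover 80 of 100 importance units, so they are enough to pass a 0.75 threshold.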


  • 3, [ ] viz, add missing data visualizer, https://github.com/ResidentMario/missingno
  • 3, [ ] viz, add tree visualiser, https://github.com/parrt/dtreeviz
  • 5, [ ] viz, add parallel coordinates to visualise 'pathways': inflate height on dim axes by taking Hadamard power.
  • ? [ ] viz, visualisation of training process
  • 3, [ ] viz, add plot (expression value, importance/coefficient) group by classification, labelled with genome, use Bokeh
  • 3, [ ] viz, add plot (number of genomes, versus accuracy)
  • x, [ ] viz, add graph visualisation (intra-similarity of most prominent genomes, per label)
  • x, [ ] viz, add quiver visualisation for genomes, also see https://distill.pub/2018/building-blocks/
  • x, [ ] viz, add LIME/DeepLift visualisation for model explanations of neural nets (https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf)
  • x, [ ] viz, add SHAP visualisation for the model explanation of tree methods
  • x, [ ] viz, add visualisation of cumulative importance of tree branches.
  • x, [ ] viz, add tree interpreter (permutation importances) ELI5 https://github.com/TeamHG-Memex/eli5
  • x, [ ] viz, add model interpreter (Shapley values) SHAP https://github.com/slundberg/shap
  • x, [ ] viz, add Additive feature attribution methods
  • x, [ ] viz, model explainability: using L2X, QII and additive index models (xNN)
  • x, [ ] viz, train a simple surrogate model on the output of a complex model (GBT -> single DT regressor on predicted probabilities)
  • x, [ ] viz, partial model dependence plots for clinical data and individual conditional expectation
  • x, [ ] viz, add correlation graphs: corr --> networktools
  • x, [ ] viz: https://www.kaggle.com/kanncaa1/rare-visualization-tools
  • x, [ ] viz: https://www.kaggle.com/mirichoi0218/classification-breast-cancer-or-not-with-15-ml
  • 5, [ ] viz, routine to generate heatmap tables
  • 20, [ ] viz, Treat as 2D classification problem, and visualize with Quiver, get inspiration from this playground
  • 5, [ ] viz, top-genome visualiser: top-N list -> hierarchical (agglomerative) clustering with seaborn
  • 10, [ ] viz, genome/patient clustering using Vega, Altair or D3js
  • x, [ ] viz, add wrapper for circos, http://circos.ca/
  • x, [ ] viz, add lgbm/xgb/rf model visualisation
  • x, [ ] viz, Datawrapper, LocalFocus, Flourish, Dash

Datasets

future

  • https://python-graph-gallery.com/405-dendrogram-with-heatmap-and-coloured-leaves/
  • search engine for medical documents: hierarchical/DT based, human-in-the-loop
  • use entity linking to fetch relevant journal papers
  • build domain specific word embeddings for medical graph search
  • use Siamese neural-network to get rid of the cohort bias
  • use Kubeflow for pipelining
  • add meta classifier: UMAP embedding+Convex hull+MSP+SVM,
  • add meta classifier (see Matching Nets, and Relation Networks and Prototypical Networks): SAE or UMAP embedding+class-matching (Rank correlation, similarity, Wasserstein distance or softmax of distance) with Barycentered sample. Also see this overview.
  • add MAML/Reptile to speed up learning
  • add image-caption generator to evaluate images?
  • add functionality for the practitioner to draw a decision plane to manually create a predictor
  • add visualisation of phenotypical manifolds in omics-space and position of patient in that space.
  • Neuro-conditional random field for tumor detection (research.baidu.com/Blog/index-view?id=104)
  • contact https://turbine.ai/: they can simulate the effect of anti-tumour medication
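
The meta classifier idea above (class-matching against a barycentred sample with a softmax of distance) can be sketched as follows. This is only one possible reading of that item: it assumes the embedding (SAE or UMAP) has already been computed, uses Euclidean distance, and all function names are hypothetical.

```python
import math


def class_barycentres(samples, labels):
    """Mean vector (barycentre) of the embedded samples for each class label."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def match_probabilities(query, barycentres):
    """Softmax over negative Euclidean distances from query to each barycentre."""
    distances = {label: math.dist(query, centre) for label, centre in barycentres.items()}
    shift = min(distances.values())  # shift by the minimum for numerical stability
    weights = {label: math.exp(-(d - shift)) for label, d in distances.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Toy 2D embeddings for two classes.
embedded = [[0, 0], [0, 2], [10, 10], [10, 12]]
classes = ["healthy", "healthy", "tumour", "tumour"]
centres = class_barycentres(embedded, classes)
probabilities = match_probabilities([1, 1], centres)
```

Rank correlation or Wasserstein distance, both mentioned in the item, could replace `math.dist` without changing the rest of the sketch.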

funds

Data protection and distribution

https://oceanprotocol.com/#why
