RexR

Repository for RexRocket: applying ML to high-dimensional problems with a small sample count

The library aims to provide the following functions:

Pre/post-processing

Data coupling --> with genetic databases
Data imputation
Probeset cleaning
Cohort/lab correction
Gene prioritisation (top N genomes)
Patient clustering
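
As a rough illustration of two of the pre-processing steps above (imputation and top-N prioritisation), here is a minimal standard-library sketch. It assumes missing values are encoded as `None`, fills them with the per-feature median, and ranks features by variance. Both choices, the function names and the gene labels are illustrative assumptions, not the library's actual implementation.

```python
from statistics import median, pvariance


def impute_median(rows):
    """Replace None entries with the per-column median of the observed values."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 when a column has no observed value at all
        medians.append(median(observed) if observed else 0.0)
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def top_n_by_variance(rows, names, n):
    """Return the n feature names with the largest population variance."""
    variances = {name: pvariance([row[j] for row in rows])
                 for j, name in enumerate(names)}
    return sorted(names, key=lambda name: variances[name], reverse=True)[:n]


# Hypothetical toy data: 4 samples x 3 features, two missing values
expression = [
    [1.0, 5.0, None],
    [1.1, 9.0, 0.2],
    [0.9, None, 0.6],
    [1.0, 1.0, 0.4],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]

filled = impute_median(expression)
top = top_n_by_variance(filled, genes, 2)
print(top)
```

Variance is only one possible ranking criterion; the TO DO list mentions alternatives such as tree-based importances and statistical tests.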

Analysis methods

PCA, LDA, PLS, QDA, Autoencoding
self-organising maps
Hierarchical clustering
t-SNE, isomap, mds, umap
affinity propagation, community detection
cancer similarity based on open data

Prediction

ensemble learning
deep learning, both for classification and regression.
simple (but descriptive) methods: GPC, lSVM, LR etc.
tree-based algorithms: extraTrees, random forest, C5.0, CART, XGB, LightGBM, EBM
novel Cluster-enhanced extremely-biased estimator (CEBE)
multi-omic modelling (networks/hierarchy of models)

Hyperlearning

simulated annealing
genetic algorithm
Bayesian optimisation
grid search
random selection
successive halving, hyperband
neural architecture search
active learning -> output difficult classes and output test samples that need labeling (interactive)
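
To make one of the listed strategies concrete, successive halving can be sketched as follows: score all candidate configurations on a small budget, keep the best fraction, and repeat with a larger budget until one remains. The `evaluate` callback, the toy scoring function and the learning-rate candidates are hypothetical stand-ins for a real training and validation loop.

```python
import math


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Higher score is better; evaluate(config, budget) is user-supplied
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_score(config, budget):
    # Hypothetical validation score: peaks at lr = 0.1, and a larger
    # budget adds the same small bonus to every configuration
    return -abs(math.log10(config["lr"]) + 1.0) + 0.01 * budget


candidates = [{"lr": lr} for lr in (1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0001)]
best = successive_halving(candidates, toy_score)
print(best)
```

Hyperband, also listed above, runs several such brackets with different trade-offs between the number of configurations and the starting budget.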

Visualisation

gene importance using graphs
gene cluster identification
patient cluster identification

Possible upgrades

addition of image analysis/classification and the combination with genomic expression profiles

Possible techniques

increase robustness: apply data augmentation such as affine transformations, after mapping the genome vector to a surface, https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
apply biased estimator using aggregations of genomic vectors
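
A minimal sketch of the augmentation idea, under stated assumptions: the mapping from vector to surface is not specified here, so the example simply lays the vector out row by row on the smallest square grid (zero-padded) and applies an affine transform with nearest-neighbour lookup. The function names are hypothetical.

```python
import math


def vector_to_grid(vector, fill=0.0):
    """Lay a 1-D expression vector out row by row on the smallest square grid."""
    side = math.ceil(math.sqrt(len(vector)))
    padded = list(vector) + [fill] * (side * side - len(vector))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


def affine_transform(grid, matrix, offset, fill=0.0):
    """For each output cell, look up the source cell at matrix @ (row, col) + offset."""
    n = len(grid)
    (a, b), (c, d) = matrix
    out = [[fill] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            src_r = round(a * r + b * col + offset[0])
            src_c = round(c * r + d * col + offset[1])
            # Cells whose source falls outside the grid keep the fill value
            if 0 <= src_r < n and 0 <= src_c < n:
                out[r][col] = grid[src_r][src_c]
    return out


grid = vector_to_grid([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
# Identity matrix with a column offset of -1: shifts the image one cell to the right
shifted = affine_transform(grid, matrix=((1, 0), (0, 1)), offset=(0, -1))
print(shifted)
```

With a row-by-row layout, neighbouring cells have no particular biological relationship, so in practice the ordering of genes on the surface (for example by cluster or pathway) would probably matter for whether such transformations are meaningful.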

Possible collaborations

Science:

Dr. Harry Groen (Lung cancer)
Dr. Casper van Eijck (Pancreas cancer)
Dr. Jules Meijerink (Leukemia)
Dr. Mohammed El Kebir; computational biologist
Dr. Gunnar W. Klau; computational biologist
Dr. Marc Deisenroth; trust and transparency in ML
Dr. Peter Hinrich; project bios/bbmri shared data storage/processing for diabetes

Technology:

NLeScienceCenter: Dr. Adriënne Mendrik
SURF-SARA: Dr. Peter Hinrich, can help to set-up the data infrastructure.

Facilitator:

Tjebbe Tauber

Business angels:

Helmuth van Es: multiple gentech companies
Fred Smulders: entry point at the Rockstart accelerator?

Partners:

hospitals?
government?
https://mytomorrows.com/nl/

Comparables/competitors:

Precision profile: http://precisionprofiledx.squarespace.com/product-portfolio/

People:

(m) Tjebbe Tauber (inspiration wizard/connector)
(m) Bram van Es (data science/q.a./privacy)
(m) Sebastiaan de Jong (machine learning)
(m) Evgeny (devops and machine learning)
(f) Nela Lekic (graph analysis/machine learning)
(f) Elizaveta Bakaeva (data analysis/visualisation)
xxx (data viz/UX)
xxx (bio-statistician)
(f) Bo (NLP)
PwC, DNB, AirBnB

Sources:

https://gdc.cancer.gov/
https://www.ncbi.nlm.nih.gov/
https://www.kaggle.com/c/santander-value-prediction-challenge --> high dimensionality, low sample count
https://databricks.com/product/genomics

Done

TO DO

Complexity: 1, 3, 5, 7, 13

x, [ ] Functionality; cancer type detector
x, [ ] Functionality; cancer phase detector
x, [ ] Functionality; Image recognition, X-ray, MRI
x, [ ] Functionality; cancer pathway estimator: Similarity Network Fusion (SNF), JNMF, Selective Cross-correlation.
x, [ ] Functionality; gene importance estimator and general factor importance tool: from weights, importance, variance explained to combinatoric importances (branch-wise importances)
x, [ ] Functionality: Counter-factual explanations (what-if scenarios)
x, [ ] Functionality: survival estimator
5 [ ] api, GEO DataSets lib integration
5 [ ] api, TCGA integration
7 [ ] ux, Make GEO datasets interactive
7, [ ] ux, user-friendly way to set-up pipelines
x [ ] io, add support for .vcf mutation data
5, [ ] io, add genome/probeset/protein/miRNA/methyl mapping function, use docker with db (such as MonetDB, Druid or SparkSQL)
x, [ ] io, add containers for Neo4j
113, [ ] ux/io/viz, build web interface around Superset/Druid
15 [ ] ml, add specific outcome uncertainty to estimate accuracy: from the validation set extract a relationship between precision and the uncertainty interval, also consider Conformal Predictions.
20 [ ] ml, add multi-omic combiner class: start with concatenation-based approaches
20 [ ] ml, add similarity class: intra and inter omic.
20 [ ] ml, multi-modal learner
10 [ ] ml, add single splitting method: split based on modes or median with simple accuracy check
10 [ ] ml, add super separation scorer: Combine normalised Wasserstein with classical statistical tests
10 [ ] ml, Denoising Autoencoder
10 [ ] ml, Bayesian deep learning https://www.youtube.com/watch?v=dj-FKXxy7HQ
10 [ ] ml, Factorisation machine for imputation
10 [ ] ml, DeepBagNet (see Approximating CNNs with Bag-of-local-features models..)
10 [ ] ml, search for differential pairs/triplets/quartets/etc.. : (1) for each n-level prune based on variance-minimum, (2) for each n+-level prune based on minimum differential expression
30 [ ] ml, Random Forest with Oblique splits (as opposed to orthogonal splits) (more accurate for numerical data)
30 [ ] ml, use DeepRec for treatment recommendation
30 [ ] ml, add Graph neural networks (GraphSAGE, DiffPool) for multi-omic analysis, Decagon lit
5 [ ] ml, add Generalised Additive Models (GAM)
5 [ ] ml, add Explainable Boosting Machine (EBM)
30 [ ] ml, add Neural Conditional Random Field (NCRF)
20 [ ] ml, add factorisation machines (FFM) for imputation, https://github.com/aksnzhy/xlearn
10 [ ] ml, Lasso, ElasticNet
20 [ ] ml, add Supersparse linear integer models (SLIM) https://arxiv.org/abs/1502.04269
10 [ ] ml, feature augmentation: add transformations of the features; add cluster-id from UMAP on raw data; add cluster-id from graph clustering on similarity data; add feature combinations
3, [ ] ml, PCA/LDA number of components selector.
5, [ ] ml, add Generalised Additive Models --> only works for limited number of features. readme
21 [ ] ml, add support for AutoKeras
3, [ ] ml, add frequent item-set analysis: association rules, A-priori, PCY (multi-stage/hash)
3, [ ] ml, add factor analysis, gaussian random projection, sparse random projection
3, [ ] ml, add coefficient retrieval for LDA
7, [ ] ml, add hyperoptimisation routine: successive halving (hyperband), grid search, differential evolution (scipy), bayesian opt (optuna)
3, [ ] ml, FDR/MW-U loop function with noise addition to get top genomes without creating a model
3, [ ] ml, add tree-based cumulative importance threshold for top genome selection
20, [ ] ml, add significant factor extractor: combine Kruskal-H with MW-U/FDR/FPR/KS; 2-sided Kolmogorov-Smirnov; PCA for variance explained --> sum (absolute) coefficients per feature; LDA for separation explained --> sum (absolute) coefficients per feature; linear SVM/Logistic Regression: sign of importances; tree methods for importances (use permutation importances (shap, rfpimp))
1, [ ] ml, add RFECV
30 [ ] ml/ux, add support for Snorkel
10, [ ] ml, add semi-supervised module (useful in case there is unlabeled data)
3, [ ] ml, element-wise noise addition using relative value range (n percentage of absolute value)
3, [ ] ml, add relative noise-level
3, [ ] ml, patient clustering ==> all genomes, reduced
7, [ ] ml, genome clustering/community detection ==> Sparse Affinity Propagation, Girvan-Newman Algorithm, Markov clustering, Edge Betweenness Centrality
10, [ ] ml, GAN to generate cancerous genomic profiles
7, [ ] ml, UMAP / Hierarchical t-SNE / HDBSCAN / Diffusion Maps / OPTICS / Sammon mapping / LTSA , source
3, [ ] ml, add other decision tree methods: FACT, C4.5, QUEST, CRUISE, GUIDE
13, [ ] ml, bias corrector class: COMBAT, PCA (EIGENSTRAT), DWD, L/S
20, [ ] ml, patient/sample similarity/clustering based bias detection
13, [ ] ml, bias detection class: between class KS/MW-U/Wasserstein/KL-divergence
13, [ ] ml, outlier detector/removal: isolation forest, one-class SVM,
13, [ ] ml, add Kernel Discriminant Analysis as a non-linear feature reducer
x, [ ] ml, add measuring bias detector (multiple datasets as inputs)
20, [ ] ml, CEBE: Cluster-enhanced extremely biased estimator
20, [ ] ml, HYCUB: Sparse hypercube probability map
13, [ ] ml, PAM method (bioinformatics) http://statweb.stanford.edu/~tibs/PAM/
5, [ ] ml, add ICA for genome separation, http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html with ICA you can find commonalities in different groups.
20, [ ] ml, Sparse ICA/Sparse CCA/Sparse PLS/joint NMF module for multi-omic feature analysis
10, [ ] ml, polynomial expansion module for multi-omic feature combinations
7, [ ] ml, add SOM for genome separation
10, [ ] ml, add Occam factor function to extract approximation of model complexity
3, [ ] ml, multilayer sparse auto encoding for pre-processing and feature detection, and DAE for denoising
x, [ ] ml, add iCluster(?), in R
5, [ ] ml, conditional survival estimator. i.e. add a Bayesian/GP regressor, Kaplan-Meier
13, [ ] ml, refactor/optimize: Cython, numba, static def's, parallelise, modularize
x, [ ] ml, add healthy patient reference routine
x, [ ] ml, healthy tissue/unhealthy tissue
x, [ ] ml, add disease dependent measurement error detector/filter
x, [ ] ml, add option for nested cross-validation
x, [ ] ml, add a posteriori accuracy checker
x, [ ] ml, add Automatic Relevance Determination (ARD), Bayesian Discriminative Modelling.
x, [ ] ml, add support for image based classification: test on kaggle set, https://www.kaggle.com/c/data-science-bowl-2018/data
x, [ ] ml, add support for time series based classification: test on EEG kaggle set, https://www.kaggle.com/c/grasp-and-lift-eeg-detection MyFly (CNN, LSTM): add TCN, GRU support
x, [ ] ml, add "deep dreaming": or sample generator functionality given a classification label generate a representative sample.
15, [ ] ml, Add graph abstraction: source, source , MST (Kruskal)
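
The outcome-uncertainty item in the list above mentions conformal predictions. As a minimal sketch of what that could look like for a regressor, split conformal prediction takes the absolute residuals on a held-out calibration (validation) set and uses one of their quantiles as the half-width of the prediction interval. The residual values and function names below are made up for illustration.

```python
import math


def conformal_quantile(calibration_residuals, alpha=0.1):
    """Return the split-conformal quantile of the absolute calibration residuals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank with finite-sample correction
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]


def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return (point_prediction - q, point_prediction + q)


# Hypothetical residuals (y_true - y_pred) on a calibration set
residuals = [0.5, -1.0, 0.2, 2.0, -0.3, 0.8, -1.5, 0.1, 0.4]
q = conformal_quantile(residuals, alpha=0.2)
low, high = prediction_interval(10.0, q)
print(q, low, high)
```

Under the usual exchangeability assumption, such intervals cover the true value with probability of at least 1 - alpha. With the small sample counts targeted here, the calibration set limits how small alpha can be.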

3, [ ] viz, add missing data visualizer, https://github.com/ResidentMario/missingno
3, [ ] viz, add tree visualiser, https://github.com/parrt/dtreeviz
5, [ ] viz, add parallel coordinates to visualise 'pathways': inflate height on dim axes by taking Hadamard power.
? [ ] viz, visualisation of training process
3, [ ] viz, add plot (expression value, importance/coefficient) group by classification, labelled with genome, use Bokeh
3, [ ] viz, add plot (number of genomes, versus accuracy)
x, [ ] viz, add graph visualisation (intra-similarity of most prominent genomes, per label)
x, [ ] viz, add quiver visualisation for genomes, also see https://distill.pub/2018/building-blocks/
x, [ ] viz, add LIME/DeepLift visualisation for model explanations of neural nets (https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf)
x, [ ] viz, add SHAP visualisation for the model explanation of tree methods,
x, [ ] viz, add visualisation of cumulative importance of tree branches.
x, [ ] viz, add tree interpreter (permutation importances) ELI5 https://github.com/TeamHG-Memex/eli5
x, [ ] viz, add model interpreter (Shapley values) SHAP https://github.com/slundberg/shap
x, [ ] viz, add Additive feature attribution methods
x, [ ] viz, model explainability: using L2X, QII and additive index models (xNN)
x, [ ] viz, train simple model on complex model (GBT-> single DT regressor on proba's)
x, [ ] viz, partial model dependence plots for clinical data and individual conditional expectation
x, [ ] viz, add correlation graphs: corr --> networktools
x, [ ] viz: https://www.kaggle.com/kanncaa1/rare-visualization-tools
x, [ ] viz: https://www.kaggle.com/mirichoi0218/classification-breast-cancer-or-not-with-15-ml
5, [ ] viz, routine to generate heatmap tables
20, [ ] viz, Treat as 2D classification problem, and visualize with Quiver, get inspiration from this playground
5, [ ] viz, top-genome visualiser: top-N list -> hierarchical (agglomerative) clustering with seaborn
10, [ ] viz, genome/patient clustering using Vega, Altair or D3js
x, [ ] viz, add wrapper for circos, http://circos.ca/
x, [ ] viz, add lgbm/xgb/rf model visualisation
x, [ ] viz, Datawrapper, LocalFocus, Flourish, Dash

Datasets

future

https://python-graph-gallery.com/405-dendrogram-with-heatmap-and-coloured-leaves/
search engine for medical documents: hierarchical/DT based, human-in-the-loop
use entity linking to fetch relevant journal papers
build domain specific word embeddings for medical graph search
use Siamese neural-network to get rid of the cohort bias
use Kubeflow for pipelining
add meta classifier: UMAP embedding+Convex hull+MSP+SVM,
add meta classifier (see Matching Nets, and Relation Networks and Prototypical Networks): SAE or UMAP embedding+class-matching (Rank correlation, similarity, Wasserstein distance or softmax of distance) with Barycentered sample. Also see this overview.
add MAML/Reptile to speed up learning
add image-caption generator to evaluate images?
add functionality for the practitioner to draw a decision plane to manually create a predictor
add visualisation of phenotypical manifolds in omics-space and position of patient in that space.
Neuro-conditional random field for tumor detection (research.baidu.com/Blog/index-view?id=104)
contact https://turbine.ai/: they can simulate the effect of anti-tumour medication

funds

WBSO https://www.ugoo.nl/wbso-subsidie/wbso-subsidiecheck/?gclid=Cj0KCQiAzfrTBRC_ARIsAJ5ps0uImsv_6m-NiWK_jod-_XaW-8exS616zNvqDH_Pojs6MayyepqhT58aAgdiEALw_wcB
SIDN https://www.sidnfonds.nl/aanvragen/internetprojecten
KPN/Menzis/Monuta: https://fd.nl/economie-politiek/1239055/nieuw-fonds-met-durfkapitaal-voor-zorgstart-ups
Blue Sparrows MedTech Fonds
eScience https://www.esciencecenter.nl/funding/big-data-health

Data protection and distribution

https://oceanprotocol.com/#why
