This repo has as objective reduce the startup time for a prototype or EDA analysis.
Some common libraries installed:
- Jupyter Lab
- Bokeh
- gensim
- scikit-learn
- Pandas
- Dask
- Opencv
- S3 support with fsspect or boto3 library.
For more details look at pyproject.toml or requirements.txt
For more stacks or context about this checkout stacks
This project could be used as least in two ways:
- download and run the docker image from nuxion/datascience
docker run --rm -p 127.0.0.1:8888:8888 nuxion/datascience
Mounting a dir to keep state:
docker run --rm -p 127.0.0.1:8888:8889 -v <your_dir>:/app/notebooks nuxion/datascience
- clone/fork this repo and run by your own.
The entrypoint of the project is the jupyter lab environment but some batch task could run, dask and papermill are provided as dependencies for that purpose. Also, dask could be used inside the jupyter lab environment if some dataset is to big.
🚧 Be warned that is not secure to expose this service to the world, so be carefull. By default if you use make run
, docker only will be listening in your localhost interface.
The password by default is changeme
, you can change that from jupyter_conf:
# To generate, type in a python/IPython shell:
#
# from jupyter_server.auth import passwd; passwd()
#
# The string should be of the form type:salt:hashed-password.
# Default: ''
c.ServerApp.password = 'argon2:$argon2id$v=19$m=10240,t=10,p=8$IzweiP2xT1dI2D65ElHBDw$q52+kB/xVzK5F4/j4ZunBw'
Please, read the instructions
You can also clone or download from releases the code and install manually the depencies using poetry or pip if docker is not your use case.
In Windows a conda environment could be a better approach.
The data stack world of python move fast. Deprecation warnings are allways attended by me when possible. Part of the intention of this image is to be sure of the dependencies. Some features like dask, depends on each node version of the libs installed there if not dask will not work. And other libraries like pandas, deprecate ways of do some things like: groupby and so forth.
Related to the python version:
As a rule of thumb I stay two versions behind of the last release for Python. The last python version is ~3.10
, then I use ~3.8
Because all the python dependencies are installed in a intermediate image as root and then the packages downloaded or built in that image are copied to the final image with the correct user permissions, but pip
warns about this first install as root.
In this way if I change some code without adding or removing dependencies, then when I rebuilt the image I will not compile again each dependecy.
Look at use mult-stage build for more information.
- Pandas bumped to version 1.3.4, this allows the use of new datatype:
string[pyarrow]
- jupyter-text added to pair *ipynb files with markdown or *.py
- Jupyterlab bumped to 3.2.1
- Set a specific gid and uid for the app user inside of the docker image. This is to share the same uuid and group than a nginx fileserver.
- spacy small spanish corpus download inside of the docker image, look at
Dockerfile
, more models could be added. - Nodejs is installed, some more rich ui features could be expected.
- Dask cluster manager plugin. Dask allows you scale out of core easily.
- Evaluates Intake project for datasources or a custom solution using smartopen or fsspec (used by Intake and Kedro projects)
- Add are easy way to get models like fastext.
- Add checksums to download source like nodejs.
- Add Vaex support.
- Fix warning when jupyter is started the first time.