Laypa: A Novel Framework for Applying Segmentation Networks to Historical Documents
HIP'23 paper: https://doi.org/10.1145/3604951.3605520
ArXiv paper: Coming soon!
Part of the Loghi pipeline
Laypa is a segmentation network for finding regions (paragraph, page number, etc.) and baselines in documents. The current approach uses a ResNet backbone and a feature pyramid head, which together produce pixel-wise classifications. The models are built using the detectron2 framework. The baseline and region classifications are then made available for further processing. This post-processing turns the classifications into instances so that they can be used by other programs (OCR/HTR), either as masks or directly as PageXML.
Developed using the following software and hardware:
| Operating System | Python | PyTorch | Cudatoolkit | GPU | CPU | Success | 
|---|---|---|---|---|---|---|
| Ubuntu 22.04.4 LTS (Linux-6.5.0-28-generic-x86_64-with-glibc2.35) | 3.12.3 | 2.3.0 | 12.1 | NVIDIA GeForce RTX 3080 Ti Laptop GPU | 12th Gen Intel(R) Core(TM) i9-12900H | ✅ |
More coming soon
Run tooling/collect_env_info.py to retrieve your environment information, and add it via a pull request.
The recommended way of running Laypa is inside a conda environment. For easier compatibility, a way to build a docker image is also provided.
To start, clone the GitHub repo to your local machine using either HTTPS:
git clone https://github.com/stefanklut/laypa.git
Or using SSH:
git clone git@github.com:stefanklut/laypa.git
And make laypa the working directory:
cd laypa
If not already installed, install either conda or miniconda (install instructions), or mamba (install instructions).
The required packages are listed in the environment.yml file. The environment can be automatically created using the following commands.
Using conda/miniconda:
conda env create -f environment.yml
Using mamba:
mamba env create -f environment.yml
When running Laypa, always activate the conda environment:
conda activate laypa
If not already installed, install the Docker Engine (install instructions). The docker environment can most easily be built with the provided script.
Laypa now has a release on Docker Hub. Using the loghi/docker.laypa image should pull the corresponding Laypa docker directly from Docker Hub. If this fails for some reason, it can be pulled manually from here. If it is outdated or you need changes to the source code, please try the Manual Installation.
Building the docker using the provided script:
./buildImage.sh <PATH_TO_LAYPA>
Or the multistage build with some profiler tools taken out (might be smaller):
./buildImage.multistage.sh <PATH_TO_LAYPA>
Manual docker install instructions (not recommended)
First copy the Laypa directory to the temporary docker directory:
tmp_dir=$(mktemp -d)
cp -r -T <PATH_TO_LAYPA> $tmp_dir/laypa
cp Dockerfile $tmp_dir/Dockerfile
cp _entrypoint.sh $tmp_dir/_entrypoint.sh
cp .dockerignore $tmp_dir/.dockerignore
Then build the docker image using the following command:
docker build -t loghi/docker.laypa $tmp_dir
Minikube install instructions
Minikube is local Kubernetes, allowing you to test the Laypa tools in a Kubernetes environment. If not already installed, start by installing minikube (install instructions).
If the docker images have already been built, minikube can run them straight away. To do so, start minikube without any special arguments:
minikube start
Afterwards the docker for Laypa can be added to the running minikube instance using the following command (assuming the Laypa docker was built under the name loghi/docker.laypa):
minikube image load loghi/docker.laypa
It is also possible to build the Laypa docker using the minikube docker instance. This means minikube will need access to the Laypa code. As it stands, this is currently still done using a copy command from the local storage. In order to do so, start minikube with the mount argument:
minikube start --mount
This will make the machine's filesystem available to minikube. Then ssh into the running minikube:
minikube ssh
Within the minikube ssh session, go to the location of Laypa, where the host /home/<user> is mounted to minikube-host:
cd minikube-host/<PATH_TO_LAYPA>
And follow the instructions for installing a docker version of Laypa as described here.
When successful, the docker image should be available under the name loghi/docker.laypa. This can be verified using the following command:
docker image ls
And checking if loghi/docker.laypa is present in the list of built images.
Some initial pretrained models can be found here.
The dataset used for training requires images combined with ground truth PageXML. The PageXML files need to be inside a page directory one level down from the images. The dataset can be split over multiple directories, with the image paths specified in a .txt file. The structure should look as follows:
training_data
├── page
│   ├── image1.xml
│   ├── image2.xml
│   ├── image3.xml
│   └── ...
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
The image and PageXML filename stems should match (image1.jpg <-> image1.xml). For the .txt based dataset, absolute paths to the images are recommended. The structure for the data used as validation is the same as that for training.
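The layout described above can be checked before starting a training run. The following is a minimal sketch and not part of Laypa: the helper name find_missing_pagexml and the set of accepted image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical helper, not part of Laypa. The extension set is an assumption.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_missing_pagexml(data_dir):
    """Return images in data_dir that lack a matching page/<stem>.xml file."""
    data_dir = Path(data_dir)
    missing = []
    for image_path in sorted(data_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skips the page directory and any non-image files
        xml_path = data_dir / "page" / f"{image_path.stem}.xml"
        if not xml_path.is_file():
            missing.append(image_path)
    return missing
```

Calling find_missing_pagexml("training_data") returns an empty list when every image has its ground truth in place.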
When running inference, the images you want processed should be in a single directory, directly under the root folder, as follows:
inference_data
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
Some datasets that should work with Laypa are listed below; some preprocessing may be required:
Three things are required to train a model using train.py.
- A config file. See configs/segmentation for examples of config files and their contents.
- Ground truth training/validation data in the form of images and their corresponding PageXML. The training/validation data can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
Required arguments:
python train.py \
    -c/--config <CONFIG> \
    -t/--train <TRAIN [TRAIN ...]> \
    -v/--val <VAL [VAL ...]>
Optional arguments:
python train.py \
    -c/--config CONFIG \
    -t/--train TRAIN [TRAIN ...] \
    -v/--val VAL [VAL ...] \
    [--tmp_dir TMP_DIR] \
    [--keep_tmp_dir] \
    [--num-gpus NUM_GPUS] \
    [--num-machines NUM_MACHINES] \
    [--machine-rank MACHINE_RANK] \
    [--dist-url DIST_URL] \
    [--opts OPTS [OPTS ...]]
The optional arguments are shown using square brackets. The --tmp_dir parameter specifies a folder in which to store temporary files. The --keep_tmp_dir parameter prevents the temporary files from being deleted after a run (mostly for debugging).
The remaining arguments are all for training with multiple GPUs or on multiple nodes. --num-gpus specifies the number of GPUs per machine. --num-machines specifies the number of nodes in the network. --machine-rank gives a node a unique number. --dist-url is the URL for the PyTorch distributed backend. The final parameter --opts allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8.
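To illustrate what --opts does, the sketch below applies alternating key/value pairs to a nested dictionary. It only demonstrates the idea, assuming dotted keys address nested config sections. The helper apply_opts and the starting values are made up for this example; Laypa itself relies on the detectron2 config system rather than this code.

```python
import ast


def apply_opts(config, opts):
    """Apply ["KEY.SUBKEY", "value", ...] pairs to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("opts must contain KEY VALUE pairs")
    for dotted_key, raw_value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]  # KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            # Interpret numbers, booleans and lists; fall back to a plain string
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        node[leaf] = value
    return config


# Made-up starting values, for illustration only
config = {"SOLVER": {"IMS_PER_BATCH": 16, "BASE_LR": 0.0002}}
apply_opts(config, ["SOLVER.IMS_PER_BATCH", "8"])
print(config["SOLVER"]["IMS_PER_BATCH"])  # 8
```

Rejecting unknown keys mirrors the usual behaviour of config systems, where a typo in a key name should fail loudly instead of silently adding a new setting.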
As indicated by the trailing dots, multiple training sets can be passed to the training model at once. This can also be done by using the train argument multiple times. The .txt files can also be mixed with the directories. For example:
# Pass multiple directories at once
python train.py -c config.yml -t data/training_dir1 data/training_dir2 -v data/validation_set
# Pass multiple directories with multiple arguments
python train.py -c config.yml -t data/training_dir1 -t data/training_dir2 -v data/validation_set
# Mix training directory with txt file
python train.py -c config.yml -t data/training_dir -t data/training_file.txt -v data/validation_set
Tip: See the tips and tricks section below for more information on how to train a model.
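The two ways of passing several training sets can be reproduced with a small argparse sketch. This parser is illustrative only and is not taken from train.py; it assumes Python 3.8 or newer for the extend action.

```python
import argparse

# Illustrative parser only; Laypa's own argument parsing may differ.
parser = argparse.ArgumentParser()
parser.add_argument("-t", "--train", nargs="+", action="extend", required=True)

# One flag with several values
once = parser.parse_args(["-t", "data/training_dir1", "data/training_dir2"])
# The same flag repeated
twice = parser.parse_args(["-t", "data/training_dir1", "-t", "data/training_dir2"])

print(once.train == twice.train)  # True
```

Combining nargs="+" with action="extend" is what lets both spellings end up in the same flat list.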
Tips and Tricks
- When a model's output is close to what you want, but not quite there yet, training the model from scratch can be a waste of time. Instead, you can finetune the existing model with ground truth that better matches your use case. This can be done by changing the MODEL.WEIGHTS parameter in the config file to the path of the existing model, or by using the --opts parameter to change the weights path (for example --opts MODEL.WEIGHTS <PATH_TO_WEIGHTS>).
- If you notice a specific part of the data the model is failing on, you can add more of that data to the training set and run the training again.
- If a training was interrupted and you want to continue training from the last checkpoint, you can use the --opts parameter to change the TRAIN.WEIGHTS parameter to the path of the last checkpoint (for example --opts TRAIN.WEIGHTS <PATH_TO_WEIGHTS>). This can also be done by changing the TRAIN.WEIGHTS parameter in the config file.
- When a model does not fit on the GPU, the batch size can be reduced using the --opts parameter. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8. Or you can turn on AMP (Automatic Mixed Precision) using the --opts MODEL.AMP_TRAIN.ENABLED True parameter.
- When the model is not learning, the learning rate can be changed using the --opts parameter. For example, --opts SOLVER.BASE_LR 0.0001 sets the learning rate to 0.0001.
- When the loss during training becomes nan, inf or 0, there is something wrong with the training. Try changing the learning rate or the batch size.
- The configs directory contains some example config files. These can be used as a starting point for your own config file. Also see the defaults.py and extra_defaults.py files for more information on what can be set in the config file. Config files can inherit from other config files; this can be done by setting the _BASE_ parameter in the config file.
- Never include training examples in the validation set. The validation scores would then no longer be a good representation of the model's performance, which can hide overfitting.
- A good rule of thumb for a validation set is to have 10% of the training set. To turn your dataset into a training and validation set you can use the tooling/dataset_creation.py file. This file will split the dataset into a training, validation and test set. The split is done by taking the first 80% of the dataset as the training set, 10% as the validation set, and the last 10% as the test set. The test set is not used for training or validation. Use the --split parameter to change these percentages.
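The ordered 80/10/10 split from the last tip can be sketched as follows. This is an assumption-laden illustration, not the code in tooling/dataset_creation.py: the function name split_dataset is hypothetical and the real tool may round or order the data differently.

```python
def split_dataset(paths, train=0.8, val=0.1):
    """Split paths in order: first part train, next part val, remainder test."""
    if train < 0 or val < 0 or train + val > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n_train = int(len(paths) * train)
    n_val = int(len(paths) * val)
    return (
        paths[:n_train],
        paths[n_train : n_train + n_val],
        paths[n_train + n_val :],
    )


images = [f"image{i}.jpg" for i in range(1, 21)]
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))  # 16 2 2
```

Because the three slices are disjoint, no training image can end up in the validation or test set.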
To run the trained model on images without ground truth, the images need to be in a single directory. The output consists of either PageXML in the case of regions or a mask in the other cases. This mask can then be processed using other tools to turn the pixel predictions into valid PageXML (for example on baselines). As stated, the regions are turned into polygons for the PageXML within the program already.
How to run the Laypa inference individually will be explained first. How to run it with the full scripts, which include the conversion from images to PageXML, will come after.
To just run the Laypa inference in inference.py, you need three things:
- A config file. See configs/segmentation for examples of config files and their contents.
- The data can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
- A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments
python inference.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    -o/--output OUTPUT
Optional arguments:
python inference.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    -o/--output OUTPUT \
    [--opts OPTS [OPTS ...]]
The optional arguments are shown using square brackets. The final parameter --opts allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8.
List values have to be overridden by encapsulating the whole list with quotes like --opts PREPROCESS.REGION.RECTANGLE_REGIONS '["Photo"]'
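The reason for the quotes can be seen by splitting the command line the way a POSIX shell would. The sketch below uses shlex from the standard library. How Laypa parses the value afterwards is not covered here; json.loads is only used to show that the quoted form is still a valid list.

```python
import json
import shlex

quoted = """--opts PREPROCESS.REGION.RECTANGLE_REGIONS '["Photo"]'"""
unquoted = '--opts PREPROCESS.REGION.RECTANGLE_REGIONS ["Photo"]'

# Single quotes keep the list, including its inner double quotes, as one argument
quoted_value = shlex.split(quoted)[-1]
print(quoted_value)              # ["Photo"]
print(json.loads(quoted_value))  # ['Photo']

# Without them the shell strips the double quotes and the list is no longer valid
unquoted_value = shlex.split(unquoted)[-1]
print(unquoted_value)            # [Photo]
```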
To set what weights the model should use, the MODEL.WEIGHTS parameter in the config file should be set to the path of the weights file. If the weights are not in the config file, the weights can be set using the --opts parameter.
An example of how to call the inference.py command is given below:
python inference.py -c config.yml -i data/inference_dir -o results_dir
If setting the weights using the --opts parameter, the command would look as follows:
python inference.py -c config.yml -i data/inference_dir -o results_dir --opts MODEL.WEIGHTS <PATH_TO_WEIGHTS>
Tip: See the tips and tricks section below for more information on how to run the model.
Tips and Tricks
- You can run the model with lower GPU requirements by using AMP (Automatic Mixed Precision). This can be done by setting the MODEL.AMP_TEST.ENABLED parameter to True in the config file, or by using the --opts parameter (for example --opts MODEL.AMP_TEST.ENABLED True).
- Specify what GPU the model should run on using the environment variable CUDA_VISIBLE_DEVICES. This should be in front of the python inference.py command. For example, CUDA_VISIBLE_DEVICES=0 python inference.py -c config.yml -i data/inference_dir -o results_dir. This will run the model on GPU 0. To run on CPU use CUDA_VISIBLE_DEVICES="" python inference.py -c config.yml -i data/inference_dir -o results_dir.
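When inference is launched from another Python program instead of a shell, the same GPU selection can be passed through the child environment. The helper inference_env below is a hypothetical convenience function, not part of Laypa.

```python
import os


def inference_env(gpu_ids):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    An empty gpu_ids list hides all GPUs, so the run falls back to the CPU.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


command = ["python", "inference.py", "-c", "config.yml",
           "-i", "data/inference_dir", "-o", "results_dir"]
# To launch on GPU 0: subprocess.run(command, env=inference_env([0]), check=True)
```

Copying os.environ first keeps PATH and the active conda environment intact for the child process.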
Examples of running the full pipeline (with processing of baselines) are present in the scripts directory. These scripts assume that the docker images for both Laypa and loghi-tooling (Java post-processing) are available on your machine. The scripts will also try to verify this. The Laypa docker image needs to be built with the pretrained models included.
To run the scripts only two things are needed:
- A directory with images to be processed.
- A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments:
./scripts/pipeline.sh <input> <output>
Optional arguments:
./scripts/pipeline.sh \
        <input> \
        <output> \
        -g/--gpu GPU
The required arguments are shown using angle brackets. The --gpu parameter specifies which GPU(s) are accessible to the docker containers. The default is all.
The positional arguments input and output refer to the input and output directory. An example of running one of the pipelines is shown below:
./scripts/pipeline.sh inference_dir results_dir
The Flask Server is set up to run the inference code in a Kubernetes environment. To run the Flask API, run the start_flask.sh application with the environment variables set. These can generally be set when running a docker, which can set the environment variables beforehand depending on the docker internal file structure. To quickly test locally, you can run the start_flask_local.sh application, which sets the environment variables at runtime.
The Flask server will run on port 5000 and can be called from outside using a curl command. When testing on localhost the command will look as follows:
curl -X POST 'http://localhost:5000/predict' -F image=@<PATH_TO_IMAGE> -F identifier=<identifier> -F model=<MODEL_FOLDER_NAME>
The required form information is the image that should be processed (image), an identifier to differentiate multiple runs/tests (identifier), and which config and weights to use (model). The identifier can be any string, but it is recommended to use a UUID or a timestamp to ensure uniqueness. The config and weights are saved in a folder, and this folder name is what needs to be provided. The folder should be relative to LAYPA_MODEL_BASE_PATH, given as an environment variable. So if LAYPA_MODEL_BASE_PATH is set to /models and the model is stored in /models/version1, then the model name is version1. The model folder should contain the config and weights files. The config file should be named config.yml and the weights file should end in .pth.
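Before sending requests it can help to confirm that a model folder follows the layout just described. The sketch below is a hypothetical check, not code from the Flask server. In particular, picking the first .pth file in sorted order is an assumption, since the text does not say how multiple weight files are handled.

```python
from pathlib import Path


def check_model_folder(base_path, model_name):
    """Return (config, weights) paths for a model folder, or raise if incomplete."""
    folder = Path(base_path) / model_name
    config = folder / "config.yml"
    if not config.is_file():
        raise FileNotFoundError(f"missing {config}")
    weights = sorted(folder.glob("*.pth"))
    if not weights:
        raise FileNotFoundError(f"no .pth weights file in {folder}")
    # Assumption: with several weight files, take the first in sorted order
    return config, weights[0]
```

With LAYPA_MODEL_BASE_PATH set to /models, check_model_folder("/models", "version1") would validate the folder used in the example above.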
To monitor a specific request, the identifier can be used to check the status of the request. This can be done using the following commands:
curl -X GET 'http://localhost:5000/status_info/<identifier>'
curl -X GET 'http://localhost:5000/status_info' -F identifier=<identifier>
This will return information about the request, such as the status of the request, the time it took to process the request, and what error occurred (if any). This information will be returned in JSON format.
To view a more general overview of the history or performance of the server, the following command can be used:
curl -X GET 'http://localhost:5000/prometheus'
This will give back the standard Prometheus metrics, as well as the current number of images in the queue, the number of images processed, the number of exceptions encountered, and information about how long images are in the queue and how long it took to process them. If you just want the current number of images in the queue, you can use the following command:
curl -X GET 'http://localhost:5000/queue_size'
For Kubernetes checks there is a health check available. This can be done using the following command:
curl -X GET 'http://localhost:5000/health'
The health check will return a 200 OK if the server is running and a 500 if the server is not running. It can be used to check whether the server is running and ready to process requests.
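A client that starts alongside the server may want to wait for the health endpoint before sending images. The following is a minimal polling sketch using only the standard library. The function names, the timeout values and the injectable probe argument are choices made for this example, not something provided by Laypa.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # for example 500 from an unhealthy server
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(base_url, timeout=60.0, interval=2.0, probe=http_status):
    """Poll <base_url>/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(f"{base_url}/health") == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For a local test this would be called as wait_until_healthy("http://localhost:5000"). Passing a different probe makes the loop easy to exercise without a running server.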
To use the docker image as an API service, we recommend using docker compose. A compose file is provided as docker-compose.yml and can be run using the following command:
docker-compose up
Then request the API (in this example using curl) with the same arguments as the Flask server (see Flask Server).
The model base path is set in the docker-compose.yml file.
For a small tutorial using some concrete examples see the tutorial directory.
The Laypa repository also contains a few tools used to evaluate the results generated by the model.
The first tool is a visual comparison between the predictions of the model and the ground truth. This is done as an overlay of the classes over the original image. The overlay class names and colors are taken from the dataset catalog. The tool to do this is visualization.py. The visualization has almost the same arguments as the training command (train.py).
Required arguments:
python tooling/visualization.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...]
Optional arguments:
python tooling/visualization.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    [-o/--output OUTPUT] \
    [--tmp_dir TMP_DIR] \
    [--keep_tmp_dir] \
    [--opts OPTS [OPTS ...]] \
    [--sorted] \
    [--save SAVE]
The optional arguments are shown using square brackets. The -o/--output parameter specifies the output directory for the visualization masks. The --tmp_dir parameter specifies a folder in which to store temporary files. The --keep_tmp_dir parameter prevents the temporary files from being deleted after a run (mostly for debugging). The --opts parameter allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8. The --sorted parameter sorts the images based on the order in the operating system. The --save parameter specifies what type of file the visualization should be saved as. The options are "pred" for the prediction, "gt" for the ground truth, "both" for both the prediction and the ground truth and "all" for all of the previous. If just --save is given the default is "all".
Example of running visualization.py:
python tooling/visualization.py -c config.yml -i input_dir
The visualization.py tool will then open a window with the prediction and the ground truth side by side (if the ground truth exists), allowing for easier comparison. The visualization masks are created in the same way the preprocessing converts PageXML to masks.
The second tool validation.py is used to get the validation scores of a model. This is done by comparing the prediction of the model to the ground truth. The validation scores are the Intersection over Union (IoU) and Accuracy (Acc) scores. The tool requires the input directory (--input) where there is also a page folder inside the input folder. The page folder should contain the xmls with the ground truth baselines/regions. To run the validation tool use the following command:
Required arguments:
python tooling/validation.py \
    -c/--config CONFIG \
    -i/--input INPUT
Optional arguments:
python tooling/validation.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    [--opts OPTS [OPTS ...]]
The optional arguments are shown using square brackets. The final parameter --opts allows you to change values specified in the config files. For example, --opts MODEL.WEIGHTS <PATH_TO_WEIGHTS> sets the path to the weights file. This needs to be done if the weights are not in the config file. Without MODEL.WEIGHTS the weights are taken from the config file. If the weights are not in the config file and not specified with MODEL.WEIGHTS, the program will return results for an untrained model.
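For readers unfamiliar with the two scores, the sketch below computes IoU for a single class and overall pixel accuracy on two small flattened masks. It uses the standard definitions. How validation.py aggregates over classes and images is not described here, so the function iou_and_accuracy is purely illustrative.

```python
def iou_and_accuracy(prediction, ground_truth, class_id):
    """Compute IoU for one class and overall pixel accuracy on flat label lists."""
    if len(prediction) != len(ground_truth) or not prediction:
        raise ValueError("masks must be non-empty and of equal size")
    pairs = list(zip(prediction, ground_truth))
    # Pixels where both masks have the class, and where at least one has it
    intersection = sum(p == class_id and g == class_id for p, g in pairs)
    union = sum(p == class_id or g == class_id for p, g in pairs)
    correct = sum(p == g for p, g in pairs)
    iou = intersection / union if union else float("nan")
    return iou, correct / len(pairs)


pred = [0, 1, 1, 1, 0, 0]
gt = [0, 1, 1, 0, 1, 0]
print(iou_and_accuracy(pred, gt, class_id=1))  # (0.5, 0.6666666666666666)
```

Here two pixels are labelled class 1 in both masks and four in at least one, giving an IoU of 0.5, while four of the six pixels agree, giving an accuracy of about 0.67.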
The third tool is a program to compare the similarity of two sets of PageXML. This can mean either comparing ground truth to predicted PageXML, or determining the similarity of two annotations by different people. This tool is the xml_comparison.py file. The comparison allows you to specify how regions and baselines should be drawn when creating the pixel masks. The pixel masks are then compared based on their Intersection over Union (IoU) and Accuracy (Acc) scores. For the sake of the Accuracy metric, one of the two sets needs to be specified as the ground truth set. So one set is passed as the ground truth directory (--gt) argument and the other as the input directory (--input) argument.
Required arguments:
python tooling/xml_comparison.py \
    -g/--gt GT [GT ...] \
    -i/--input INPUT [INPUT ...]
Optional arguments:
python tooling/xml_comparison.py \
    -g/--gt GT [GT ...] \
    -i/--input INPUT [INPUT ...] \
    [-m/--mode {baseline,region,start,end,separator,baseline_separator}] \
    [--regions REGIONS [REGIONS ...]] \
    [--merge_regions [MERGE_REGIONS]] \
    [--region_type REGION_TYPE [REGION_TYPE ...]] \
    [-w/--line_width LINE_WIDTH]
The optional arguments are shown using square brackets. The --mode parameter specifies what type of prediction the model has to do. If the mode is region, the --regions argument specifies which regions need to be extracted from the PageXML (for example "page-number"). The --merge_regions argument then specifies if any of these regions need to be merged. For example, "insertion" can be merged into "resolution" when both refer to the same thing, written as resolution:insertion. The final region argument is --region_type, which can specify the region type of a region. In the other modes lines are used. The line arguments are --line_width, which specifies the line width, and --line_color, which specifies the line color.
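The resolution:insertion notation can be read as target:source. The sketch below turns such a specification into a lookup table and applies it to a list of region names. The function parse_merge_regions is hypothetical, and separating several pairs with commas is an assumption made for this example rather than documented behaviour.

```python
def parse_merge_regions(spec):
    """Map each source region name to the target it should be merged into."""
    mapping = {}
    for pair in spec.split(","):  # assumption: pairs are comma separated
        pair = pair.strip()
        if not pair:
            continue
        target, separator, source = pair.partition(":")
        if not separator or not target or not source:
            raise ValueError(f"expected target:source, got {pair!r}")
        mapping[source] = target
    return mapping


merges = parse_merge_regions("resolution:insertion")
regions = ["insertion", "resolution", "page-number"]
print([merges.get(name, name) for name in regions])
# ['resolution', 'resolution', 'page-number']
```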
The final tool is a program for showing the PageXML as mask images. This can help with showing how the PageXML regions and baselines look. This can be done in gray scale, color, or as a colored overlay over the original image. This tool is located in the xml_viewer.py file. It requires an input directory (--input) argument and an output directory (--output) argument.
Required arguments:
python tooling/xml_viewer.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    -o/--output OUTPUT [OUTPUT ...]
Optional arguments:
python tooling/xml_viewer.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    -o/--output OUTPUT [OUTPUT ...] \
    [--opts OPTS [OPTS ...]] \
    [-t/--output_type {gray,color,overlay}]
The optional arguments are shown using square brackets. The parameter --opts allows you to change values specified in the config files. The --output_type parameter specifies which type of
Distributed under the MIT License. See LICENSE for more information.
This project was made while working at the KNAW Humanities Cluster Digital Infrastructure
Please report any bugs or errors that you find to the issues page, so that they can be looked into. First check whether an issue for the same problem/bug is already open. Feature requests should also be made through the issues page.
If you discover a bug or missing feature that you would like to help with, please feel free to send a pull request.