Releases · Princeton-CDH/remarx · GitHub

27 Oct 16:11

tanhaow

v0.3 Latest

What's Changed

Sentence corpus creation

Sentence corpora generated from TEI now include a line number field (line_number) based on the n attribute of the line beginning tag (<lb>)
Support for ALTO XML input as a zip file with multiple pages
- Skips non-ALTO files, logs warnings for invalid or empty XML
- Yields sentence corpora indexed across pages; ordering based on natural sort of filenames
Improved logging output for remarx-create-corpus script, with optional verbose mode
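
The natural-sort ordering mentioned above for ALTO page filenames can be illustrated with a minimal sketch. The helper name natural_key and the sample filenames are hypothetical and are not taken from the remarx code base; the actual implementation may differ.

```python
import re


def natural_key(name: str) -> list:
    """Split a filename into text and integer chunks for natural ordering.

    Hypothetical helper for illustration only; not part of remarx.
    """
    # re.split with a capturing group keeps the digit runs in the result,
    # so "page_10.xml" becomes ["page_", "10", ".xml"].
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


# Sample page filenames (made up for this example)
pages = ["page_10.xml", "page_2.xml", "page_1.xml"]
ordered = sorted(pages, key=natural_key)
print(ordered)
```

With a plain lexicographic sort, page_10.xml would come before page_2.xml; comparing the digit runs as integers keeps the pages in reading order.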

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.3


15 Oct 16:22

tanhaow

v0.2

What's Changed

Application

The app now consists of two notebooks (Sentence Corpus Builder & Quote Finder)
Logging is now automatically configured by the application, and the log file location is reported to the user
Quote Finder notebook now supports quotation detection between two sentence corpus files (original and reuse)

Documentation

Add technical design document to MkDocs documentation

Sentence corpus creation

Add sentence id field (sent_id) to generated sentence corpora
Process TEI XML documents to yield separate chunks for body text and footnotes, with each footnote yielded as its own chunk

Quotation detection

Add a method for generating sentence embeddings from a list of sentences
Add a method for identifying likely quote sentence pairs
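
As a rough illustration of how sentence embeddings can be used to find likely quote pairs, the sketch below compares every original vector with every reuse vector. The release notes do not say how pairs are scored, so cosine similarity, the 0.8 threshold, the function names, and the toy vectors are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def likely_pairs(original, reuse, threshold=0.8):
    """Return (original_index, reuse_index, score) for pairs at or above threshold.

    Hypothetical helper; the threshold value is an assumption.
    """
    pairs = []
    for i, orig_vec in enumerate(original):
        for j, reuse_vec in enumerate(reuse):
            score = cosine(orig_vec, reuse_vec)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Toy two-dimensional "embeddings" (made up; real embeddings have many more dimensions)
original_vecs = [[1.0, 0.0], [0.0, 1.0]]
reuse_vecs = [[0.9, 0.1], [-1.0, 0.0]]
pairs = likely_pairs(original_vecs, reuse_vecs)
print(pairs)
```

An exhaustive comparison like this is quadratic in the number of sentences, so a real corpus would probably call for a vectorized or approximate nearest-neighbor search instead.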

Scripts

Add parse_html script for converting the manifesto HTML files to plain text for sentence corpus input (one-time use)

Misc

Add a utility method (configure_logging) to configure logging, supporting logging to a file or to stdout
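
A utility that logs either to a file or to stdout could look roughly like the sketch below, built on the standard library logging module. The function name configure_logging_sketch, its parameters, and the logger name are hypothetical; the real configure_logging signature is not given in these notes.

```python
import logging
import sys
from pathlib import Path
from typing import Optional


def configure_logging_sketch(
    log_file: Optional[Path] = None, level: int = logging.INFO
) -> logging.Logger:
    """Attach one handler: a file handler if log_file is given, else stdout.

    Illustrative only; not the remarx implementation.
    """
    logger = logging.getLogger("remarx_sketch")  # hypothetical logger name
    logger.setLevel(level)
    # Remove handlers from earlier calls so messages are not duplicated
    logger.handlers.clear()
    if log_file is not None:
        handler = logging.FileHandler(log_file)
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


logger = configure_logging_sketch()
logger.info("logging to stdout")
```

Passing a path, for example configure_logging_sketch(Path("remarx.log")), would send the same messages to that file instead.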

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.2


08 Sep 19:12

laurejt

v0.1

What's Changed

Initial release.

Sentence corpus creation

Add segment_text() function for splitting plain text into sentences with character-level indices
Add support for plain text files as input
Add preliminary support for TEI XML files as corpus input; includes page numbers, assumes MEGA TEI
Add factory method to initialize appropriate input class for supported file types
Add create_corpus() function to generate a sentence corpus CSV from a single supported input file
Add command line script remarx-create-corpus to input a supported file and generate a sentence corpus
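
To show what splitting plain text into sentences with character-level indices means in practice, here is a minimal sketch. It uses a naive regular expression as a stand-in for a real sentence segmenter; the function name split_sentences and its return format are hypothetical and do not reflect the actual segment_text() signature.

```python
import re


def split_sentences(text):
    """Return (start_index, sentence) tuples for each sentence in text.

    Naive stand-in: a sentence ends at '.', '!' or '?' followed by
    whitespace or the end of the text. Abbreviations such as "e.g."
    will be split incorrectly.
    """
    pattern = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", flags=re.DOTALL)
    return [(match.start(), match.group()) for match in pattern.finditer(text)]


text = "Workers unite. A spectre haunts Europe!"
sentences = split_sentences(text)
for start, sentence in sentences:
    # The index points back into the original text
    print(start, repr(sentence))
```

Keeping the start index alongside each sentence lets a later step map a detected quotation back to its exact position in the source text.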

Application

Add preliminary application with access to sentence corpus creation for supported file types
Add command line script to launch application

Documentation

Document package installation (README)
Set up MkDocs for code documentation
Add GitHub Actions workflow to build and deploy documentation to GitHub Pages for released versions (main branch)

Misc

Add GitHub Actions workflow to build and publish the Python package on PyPI when a new GitHub release is created

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.1
