Releases: Princeton-CDH/remarx

v0.3

27 Oct 16:11

What's Changed

Sentence corpus creation

  • Sentence corpora generated from TEI now include a line number field (line_number) based on the n attribute of the line-begin tag (<lb>)
  • Support for ALTO XML input as a zipfile with multiple pages
    • Skips non-ALTO files and logs warnings for invalid or empty XML files
    • Yields sentence corpora indexed across pages; ordering based on natural sort of filenames
  • Improved logging output for remarx-create-corpus script, with optional verbose mode
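
The natural-sort ordering of page filenames mentioned above can be illustrated with a minimal sketch. This is not the remarx implementation; the helper name natural_key and the example filenames are hypothetical, and it only shows why natural sort keeps multi-page input in reading order where plain string sort would not.

```python
import re


def natural_key(filename: str) -> list:
    """Hypothetical helper: split a filename into text and integer chunks.

    re.split with a capturing group always alternates text, digits, text, ...
    so keys compare element by element without mixing str and int.
    """
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", filename)
    ]


# Example filenames are assumptions, not taken from a real ALTO zipfile.
pages = ["page_10.xml", "page_2.xml", "page_1.xml"]
ordered = sorted(pages, key=natural_key)
print(ordered)
```

A plain sorted(pages) would place page_10.xml before page_2.xml, which is the case natural sort avoids.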

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.3

v0.2

15 Oct 16:22

What's Changed

Application

  • The app now consists of two notebooks (Sentence Corpus Builder & Quote Finder)
  • Logging is now automatically configured by the application, and the log file location is reported to the user
  • Quote Finder notebook now supports quotation detection between two sentence corpus files (original and reuse)

Documentation

  • Add technical design document to MkDocs documentation

Sentence corpus creation

  • Add sentence id field (sent_id) to generated sentence corpora
  • Process TEI/XML documents into separate chunks for body text and footnotes, with each footnote yielded as its own element

Quotation detection

  • Add a method for generating sentence embeddings from a list of sentences
  • Add a method for identifying likely quote sentence pairs
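
As a rough illustration of how sentence embeddings can be used to find likely quote pairs, here is a minimal sketch. The release notes do not say how remarx scores or selects pairs, so the cosine-similarity measure, the 0.8 threshold, the function names, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def likely_pairs(original, reuse, threshold=0.8):
    """Hypothetical helper: for each reuse vector, keep its best original match
    when the similarity meets the (assumed) threshold.

    Returns (original_index, reuse_index, score) tuples.
    """
    pairs = []
    if not original:
        return pairs
    for j, r in enumerate(reuse):
        best_i, best = max(
            ((i, cosine(o, r)) for i, o in enumerate(original)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            pairs.append((best_i, j, best))
    return pairs


# Toy vectors standing in for real sentence embeddings.
original_vecs = [[1.0, 0.0], [0.0, 1.0]]
reuse_vecs = [[0.9, 0.1], [-1.0, 0.0]]
pairs = likely_pairs(original_vecs, reuse_vecs)
print(pairs)
```

Only the first reuse vector clears the threshold here; the second points away from both originals and is dropped.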

Scripts

  • Add parse_html script for converting the manifesto HTML files to plain text for sentence corpus input (one-time use)

Misc

  • Add a utility method (configure_logging) to configure logging, supporting logging to a file or to stdout
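
A utility of this kind might look roughly like the sketch below, built on the standard library logging module. The real configure_logging signature is not given in these notes, so the function name configure_logging_sketch, its parameters, the logger name, and the format string are all hypothetical.

```python
import logging
import sys
from pathlib import Path
from typing import Optional


def configure_logging_sketch(
    log_file: Optional[Path] = None, level: int = logging.INFO
) -> logging.Logger:
    """Hypothetical stand-in: log to a file when a path is given, else to stdout."""
    if log_file is not None:
        handler: logging.Handler = logging.FileHandler(log_file)
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger("remarx_sketch")  # assumed logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


logger = configure_logging_sketch()
logger.info("logging to stdout")
```

Passing a path, for example configure_logging_sketch(Path("remarx.log")), would switch the same logger to file output.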

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.2

v0.1

08 Sep 19:12

What's Changed

Initial release.

Sentence corpus creation

  • Add segment_text() function for splitting plain text into sentences with character-level indices
  • Add support for plain text files as input
  • Add preliminary support for TEI XML files as corpus input; includes page numbers, assumes MEGA TEI
  • Add factory method to initialize appropriate input class for supported file types
  • Add create_corpus() function to generate a sentence corpus CSV from a single supported input file
  • Add command line script remarx-create-corpus to input a supported file and generate a sentence corpus
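
To show what splitting plain text into sentences with character-level indices means in practice, here is a minimal sketch. It uses a naive regular expression as a stand-in and is not how segment_text() is implemented; the function name segment_text_sketch and its return shape of (start index, sentence) tuples are assumptions for illustration.

```python
import re

# A sentence starts at a non-space character and runs lazily to the first
# ".", "!" or "?" that is followed by whitespace or end of text, or to the
# end of the text if no such punctuation is found.
SENTENCE_RE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", flags=re.DOTALL)


def segment_text_sketch(text: str) -> list:
    """Hypothetical stand-in: return (start_index, sentence) tuples for text."""
    return [(match.start(), match.group()) for match in SENTENCE_RE.finditer(text)]


text = "Workers unite. You have nothing to lose!"
sentences = segment_text_sketch(text)
for start, sentence in sentences:
    # Each index points back into the original text.
    assert text[start:start + len(sentence)] == sentence
print(sentences)
```

Keeping the start index alongside each sentence lets later steps map a sentence back to its position in the source file. A regex like this will mis-split abbreviations such as "e.g.", which is one reason a real segmenter is preferable.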

Application

  • Add preliminary application with access to sentence corpus creation for supported file types
  • Add command line script to launch application

Documentation

  • Document package installation (README)
  • Set up MkDocs for code documentation
  • Add GitHub Actions workflow to build and deploy documentation to GitHub Pages for released versions (main branch)

Misc

  • Add GitHub Actions workflow to build and publish the Python package on PyPI when a new GitHub release is created

Full Changelog: https://github.com/Princeton-CDH/remarx/commits/0.1