This tool naively tries to figure out which articles in which documents correspond to which articles in other documents.
Your system package manager should be able to install both pandoc and parallel.
git clone https://github.com/smucclaw/pdfdiff.git
cd pdfdiff
In the pdfdiff directory, we expect a directory FTA/ to contain directories like these:
┌─[20230306-17:47:56] [mengwong@rosegold:~/src/smucclaw/pdfdiff/FTA]
└─[0] <> ll
total 24
drwx------@ 13 mengwong staff 416 Mar 6 17:12 .
drwx------@  9 mengwong staff 288 Mar 2 13:55 ..
drwx------@  8 mengwong staff 256 Mar 6 17:45 1._Goods
drwx------@  6 mengwong staff 192 Mar 6 17:38 2._Services
drwx------@  6 mengwong staff 192 Mar 6 17:38 3._Investment
drwx------@  4 mengwong staff 128 Mar 6 17:13 4._HS_Classification_&_Annotation
In my case, I downloaded a local copy of the FTA/ directory from the shared Google Drive:
cp -rp ~/"Google-Drive/My Drive/CCLAW Centre for Computational Law/RPCL Research Team/Knowledge Management - Research & Working Docs/FTA Use Case/FTA" ./
This may be different for you based on how your Google Drive plugin is set up. You can use the Google Drive web UI to download the entire folder as a Zip file and work with a local copy.
Once that’s done, you can run:
make
This should produce matrix files that gauge the similarity of articles to each other.
┌─[20230306-18:09:31] [mengwong@rosegold:~/src/smucclaw/pdfdiff]
└─[0] <git:(main e77e631✱✈) > ll FTA/*/matrix*
-rw-r--r--@ 1 mengwong staff 137229 Mar 6 17:50 FTA/1._Goods/matrix-RulesOfOrigin.org
-rw-r--r--@ 1 mengwong staff  84130 Mar 6 17:46 FTA/1._Goods/matrix-TradeInGoods.org
-rw-r--r--@ 1 mengwong staff 106508 Mar 6 17:51 FTA/2._Services/matrix-TradeInServices.org
-rw-r--r--@ 1 mengwong staff  22146 Mar 6 17:51 FTA/3._Investment/matrix-Investment.org
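This README does not say how the similarity scores in the matrix files are computed. As a rough illustration only, here is a minimal Python sketch of one naive way to build such a matrix, assuming each document has already been split into a mapping from article name to article text. The function names and the use of Python's difflib are assumptions for illustration, not the repository's actual method.

```python
from difflib import SequenceMatcher


def similarity_matrix(rows, cols):
    """Return a dict of dicts holding similarity scores in [0, 1].

    rows and cols each map an article name to that article's text.
    (Hypothetical helper; not taken from the pdfdiff source.)
    """
    return {
        r_name: {
            c_name: SequenceMatcher(None, r_text, c_text).ratio()
            for c_name, c_text in cols.items()
        }
        for r_name, r_text in rows.items()
    }


def best_matches(matrix):
    """For each row article, pick the column article with the highest score."""
    return {r_name: max(scores, key=scores.get) for r_name, scores in matrix.items()}
```

Reading the highest score in each row gives a first guess at which article corresponds to which; a human should still review the guesses.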
From here, you can split the documents up by article, lay them out on a web page, and run diffs between them.
The PDF-to-Docx conversion pathway introduces spurious paragraph breaks. We need to ignore them.
If the input is highly structured, we could do the stitching by examination:
(1) one one one (2) two two two (3) three three three (4) four four four four (5) five five five
We could write some Lua to attempt to fix these cases, and run that Lua directly within Pandoc.
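To make the heuristic concrete, here is a minimal Python sketch of stitching by examination. It is not a Pandoc Lua filter; it only illustrates the logic, under the assumption that every real item starts with a marker such as "(1)" and that any line without a marker is a continuation of the previous item. The pattern and the function name are hypothetical.

```python
import re

# Assumed marker shape: "(1) ", "(2) ", ... at the start of a line.
ITEM_START = re.compile(r"^\(\d+\)\s")


def stitch(lines):
    """Merge continuation lines into the numbered item that precedes them."""
    items = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # treat a blank line as a spurious break
        if ITEM_START.match(text) or not items:
            items.append(text)  # a new numbered item (or the very first line)
        else:
            items[-1] += " " + text  # glue the fragment onto the previous item
    return items
```

For example, the input lines "(2) two two", "", "two" would come back as the single item "(2) two two two". Real documents will need a richer marker pattern (letters, roman numerals, nested levels).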
Another, more devil-may-care approach would take every diff block that a diffing algorithm reports between two articles and check whether that block would still differ once whitespace is deleted.
Consider two files, alpha and beta, which are the same except for paragraph breaks:
For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them.
For this invention will produce forgetfulness in the minds of those who learn to use it,
because they will not practice their memory. Their trust in writing,
produced by external characters which are no part of themselves, will discourage the use of their own memory within them.
We can squash newlines to create equivalence:
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:05:25]
└─[0] <git:(main fa53540✱✈) > perl -0000 -le '@all = <>; s/\n//g for @all; printf ("%s", join (" ", @all))' < alpha.txt > alpha-squashed.txt
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:06:08]
└─[0] <git:(main fa53540✱✈) > perl -0000 -le '@all = <>; s/\n//g for @all; printf ("%s", join (" ", @all))' < beta.txt > beta-squashed.txt
And we can then verify that diff is now more chill:
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:06:15]
└─[0] <git:(main fa53540✱✈) > diff alpha-squashed.txt beta-squashed.txt
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:18:15]
└─[0] <git:(main fa53540✱✈) >
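For readers who find the Perl one-liner opaque, here is a rough Python equivalent, written as a sketch rather than a drop-in replacement. Like the Perl version in paragraph mode, it splits the input on blank lines, deletes the newlines inside each paragraph, and joins the paragraphs with single spaces. The function name squash is hypothetical.

```python
import re


def squash(text):
    """Collapse paragraph breaks so that only the words themselves remain.

    Paragraphs are separated by one or more blank lines. Newlines inside a
    paragraph are deleted outright (not replaced by spaces), mirroring the
    s/\\n//g step of the Perl one-liner.
    """
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    return " ".join(p.replace("\n", "") for p in paragraphs)
```

Because in-paragraph newlines are deleted rather than turned into spaces, a single hard line break in the middle of a sentence would glue two words together. That is harmless when the only goal is to test two texts for equivalence, but the squashed output is not meant for reading.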
If we want to stay within Haskell, we could just use the Diff library.
- Given two text blocks, each of which could be an Article (a subsection of a file) or simply an entire file,
- we run the diff algorithm to identify chunks of differences.
- We try to equate the chunks by running the above newline-deletion algorithm.
- If that algorithm succeeds in equating the chunks, we rewrite the longer chunk with the shorter.
- Otherwise, we leave the chunk alone.
- When we’re done, we output a new version of the input files.
- Any subsequent diffs run against those files should be “real” differences.
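The steps above can be sketched in a few lines. This sketch uses Python's difflib as a stand-in for the Haskell Diff library, works on lists of lines, and treats "shorter" as "fewer lines"; all three choices are assumptions made for illustration, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def strip_ws(lines):
    """Concatenate lines and delete all whitespace, for comparison only."""
    return "".join("".join(lines).split())


def reconcile(a_lines, b_lines):
    """Return new versions of both inputs in which chunks that differ only
    in whitespace are made identical; every other chunk is left untouched."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        chunk_a, chunk_b = a_lines[i1:i2], b_lines[j1:j2]
        if tag != "equal" and strip_ws(chunk_a) == strip_ws(chunk_b):
            # Only whitespace differs: use the shorter chunk on both sides.
            chunk_a = chunk_b = min(chunk_a, chunk_b, key=len)
        out_a.extend(chunk_a)
        out_b.extend(chunk_b)
    return out_a, out_b
```

Given ["Title", "one two three four", "End"] and ["Title", "one two", "", "three four", "End"], both outputs become the first list, so a later diff reports nothing. A pair such as "foo bar" versus "foo baz" is left alone and still shows up as a real difference. One known weakness: if the diff happens to align an unchanged line in the middle of a re-broken paragraph, the fragments on either side are compared separately and may not be equated.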