pdfdiff

Naively tries to figure out which articles of one document correspond to which articles of the other documents.

Prerequisites

pandoc

Your system package manager should be able to install pandoc.

parallel

Your system package manager should be able to install parallel.

Install and set up

git clone https://github.com/smucclaw/pdfdiff.git
cd pdfdiff

Inside the pdfdiff directory, we expect an FTA/ directory containing subdirectories like these:

┌─[20230306-17:47:56]   [mengwong@rosegold:~/src/smucclaw/pdfdiff/FTA]
└─[0] <> ll
total 24
drwx------@ 13 mengwong  staff     416 Mar  6 17:12 .
drwx------@  9 mengwong  staff     288 Mar  2 13:55 ..
drwx------@  8 mengwong  staff     256 Mar  6 17:45 1._Goods
drwx------@  6 mengwong  staff     192 Mar  6 17:38 2._Services
drwx------@  6 mengwong  staff     192 Mar  6 17:38 3._Investment
drwx------@  4 mengwong  staff     128 Mar  6 17:13 4._HS_Classification_&_Annotation

In my case, I downloaded a local copy of the FTA/ directory from the shared Google Drive:

cp -rp ~/"Google-Drive/My Drive/CCLAW Centre for Computational Law/RPCL Research Team/Knowledge Management - Research & Working Docs/FTA Use Case/FTA" ./

This may be different for you based on how your Google Drive plugin is set up. You can use the Google Drive web UI to download the entire folder as a Zip file and work with a local copy.

Once that’s done, you can run

make

This should produce matrix files that gauge the similarity of articles to each other.

┌─[20230306-18:09:31]   [mengwong@rosegold:~/src/smucclaw/pdfdiff]
└─[0] <git:(main e77e631✱✈) > ll FTA/*/matrix*
-rw-r--r--@ 1 mengwong  staff  137229 Mar  6 17:50 FTA/1._Goods/matrix-RulesOfOrigin.org
-rw-r--r--@ 1 mengwong  staff   84130 Mar  6 17:46 FTA/1._Goods/matrix-TradeInGoods.org
-rw-r--r--@ 1 mengwong  staff  106508 Mar  6 17:51 FTA/2._Services/matrix-TradeInServices.org
-rw-r--r--@ 1 mengwong  staff   22146 Mar  6 17:51 FTA/3._Investment/matrix-Investment.org
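
This README does not say how the similarity scores in the matrix files are computed. Purely as an illustration of what a naive article-to-article similarity matrix could look like, here is a sketch in Python that scores word sequences with the standard library's difflib. The function names similarity_matrix and best_matches and the sample article texts are made up for this example, and the repository's own method may be quite different.

```python
from difflib import SequenceMatcher


def similarity_matrix(left, right):
    """Score every article in `left` against every article in `right`.

    Both arguments map an article label to its text. The result maps
    (left_label, right_label) to a ratio between 0.0 and 1.0.
    """
    scores = {}
    for left_label, left_text in left.items():
        for right_label, right_text in right.items():
            # Compare word sequences so that line wrapping does not matter.
            matcher = SequenceMatcher(None, left_text.split(), right_text.split())
            scores[(left_label, right_label)] = matcher.ratio()
    return scores


def best_matches(scores):
    """For each left-hand article, pick the right-hand article with the top score."""
    best = {}
    for (left_label, right_label), score in scores.items():
        if left_label not in best or score > best[left_label][1]:
            best[left_label] = (right_label, score)
    return best


if __name__ == "__main__":
    # Made-up article texts, for illustration only.
    fta_a = {
        "Article 1": "each party shall reduce customs duties on originating goods",
        "Article 2": "goods wholly obtained in a party are originating goods",
    }
    fta_b = {
        "Article 5": "goods wholly obtained in a party shall be originating goods",
        "Article 6": "each party shall eliminate customs duties on originating goods",
    }
    for label, (match, score) in best_matches(similarity_matrix(fta_a, fta_b)).items():
        print(f"{label} -> {match} ({score:.2f})")
```

Comparing words rather than raw characters keeps the score insensitive to where lines happen to wrap, which matters given the spurious breaks discussed below.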

From here, you can split the documents up by article, lay them out on a web page, and run diffs between them.

Some Preprocessing Required

The PDF-to-Docx conversion pathway introduces spurious paragraph breaks. We need to ignore them.

A bottom-up approach

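If the input is highly structured, we could stitch broken paragraphs back together by inspection. In the example below, the second "four four" paragraph carries no item number, so it must be a continuation of item (4):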

(1) one one one

(2) two two two

(3) three three three

(4) four four

four four

(5) five five five

We could write some Lua to attempt to fix these cases, and run that Lua directly within Pandoc.
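
To make the stitching rule concrete without committing to pandoc's Lua filter API, here is a minimal sketch in Python that works on a plain list of paragraph strings. It assumes that every genuine paragraph starts with a parenthesised number such as (4); the name stitch_paragraphs and the marker pattern are hypothetical, and a real filter would need to handle whatever numbering the source documents actually use.

```python
import re

# Assumption: a genuine paragraph starts with a marker such as "(4)".
ITEM_MARKER = re.compile(r"^\(\d+\)\s")


def stitch_paragraphs(paragraphs):
    """Merge unnumbered paragraphs into the numbered paragraph before them."""
    stitched = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # skip empty paragraphs
        if stitched and not ITEM_MARKER.match(text):
            # No marker: treat as a continuation split off by the conversion.
            stitched[-1] = stitched[-1] + " " + text
        else:
            stitched.append(text)
    return stitched


if __name__ == "__main__":
    broken = [
        "(1) one one one",
        "(2) two two two",
        "(3) three three three",
        "(4) four four",
        "four four",
        "(5) five five five",
    ]
    for paragraph in stitch_paragraphs(broken):
        print(paragraph)
```

The same rule (no marker means continuation) should carry over to a Lua filter, with the caveat that it will wrongly merge any legitimate paragraph that happens to be unnumbered.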

Eliminate newlines to produce equivalence classes

A more devil-may-care approach would take every diff block that a diffing algorithm reports between two articles, and check whether the block would still differ once whitespace is deleted.

Consider two files, alpha.txt and beta.txt, which are the same except for paragraph breaks.

alpha.txt:

For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them.

beta.txt:

For this invention will produce forgetfulness in the minds of those who learn to use it,

because they will not practice their memory. Their trust in writing,

produced by external characters which are no part of themselves, will discourage the use of their own memory within them.

We can squash newlines to create equivalence:

┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:05:25]
└─[0] <git:(main fa53540✱✈) > perl -0000 -le '@all = <>; s/\n//g for @all; printf ("%s", join (" ", @all))' < alpha.txt > alpha-squashed.txt
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:06:08]
└─[0] <git:(main fa53540✱✈) > perl -0000 -le '@all = <>; s/\n//g for @all; printf ("%s", join (" ", @all))' < beta.txt > beta-squashed.txt

And we can then verify that diff is now more chill; it prints nothing, meaning the squashed files are identical:

┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:06:15]
└─[0] <git:(main fa53540✱✈) > diff alpha-squashed.txt beta-squashed.txt
┌─[mengwong@solo-8] - [~/src/smucclaw/pdfdiff/test/01] - [2023-03-23 10:18:15]
└─[0] <git:(main fa53540✱✈) >
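
For readers who would rather not decode the Perl one-liner above, here is a rough Python equivalent. It treats blank lines as paragraph separators, deletes the newlines inside each paragraph, and joins the paragraphs with single spaces. The function name squash and the shortened sample texts are hypothetical; this is a sketch of the idea, not a drop-in replacement.

```python
import re


def squash(text):
    """Join paragraphs with single spaces, deleting newlines inside each one.

    Blank lines separate paragraphs, and each paragraph is flattened
    onto one line, as in the Perl one-liner.
    """
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    return " ".join(paragraph.replace("\n", "") for paragraph in paragraphs)


if __name__ == "__main__":
    # Shortened stand-ins for alpha.txt and beta.txt.
    alpha = "For this invention will produce forgetfulness, because they will not practice their memory.\n"
    beta = "For this invention will produce forgetfulness,\n\nbecause they will not practice their memory.\n"
    print(squash(alpha) == squash(beta))
```

One caveat applies to both versions: deleting newlines outright glues together words that were split by a hard line wrap inside a paragraph, so replacing them with a space may be safer for hard-wrapped input.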

Haskell

If we want to stay within Haskell, we could just use the Diff package.

  1. Given two text blocks, each of which could be an Article (a subsection of a file) or simply an entire file
  2. we run the diff algorithm to identify chunks of differences.
  3. We try to equate the chunks by running the above newline-deletion algorithm.
    1. If that algorithm succeeds in equating the chunks, we rewrite the longer chunk with the shorter.
    2. Otherwise, we leave the chunk alone.
  4. When we’re done, we output a new version of the input files.
  5. Any subsequent diffs run against those files should be “real” differences.
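
The steps above can be sketched as follows. This is not the repository's Haskell code: it swaps in Python's standard difflib for the Diff package, assumes each input has already been split into a list of paragraph strings, and uses the hypothetical names strip_whitespace and normalize. "Shorter" is taken to mean the chunk with fewer paragraphs.

```python
from difflib import SequenceMatcher


def strip_whitespace(chunk):
    """Concatenate a chunk's paragraphs with all whitespace deleted."""
    return "".join("".join(paragraph.split()) for paragraph in chunk)


def normalize(left, right):
    """Return copies of two paragraph lists with whitespace-only differences removed."""
    new_left, new_right = [], []
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    # Step 2: walk the chunks that the diff algorithm identifies.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left_chunk, right_chunk = left[i1:i2], right[j1:j2]
        if tag != "equal" and strip_whitespace(left_chunk) == strip_whitespace(right_chunk):
            # Step 3.1: the chunks differ only in whitespace, so both sides
            # take the chunk with fewer paragraphs.
            shorter = min(left_chunk, right_chunk, key=len)
            new_left.extend(shorter)
            new_right.extend(shorter)
        else:
            # Equal chunks, or step 3.2 (a real difference): keep each side as is.
            new_left.extend(left_chunk)
            new_right.extend(right_chunk)
    return new_left, new_right


if __name__ == "__main__":
    alpha = ["(3) three", "(4) four four four four", "(5) five five five", "(6) six"]
    beta = ["(3) three", "(4) four four", "four four", "(5) five five five", "(6) SIX"]
    new_alpha, new_beta = normalize(alpha, beta)
    # Step 5: only the "real" difference in item (6) should remain.
    print(new_alpha)
    print(new_beta)
```

A limitation worth knowing: when a real edit sits directly next to a spurious paragraph break, the diff reports them as one chunk, the whitespace test fails, and the spurious break is left in place.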

stack run produces the output file ./out.org
