diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..e50b515
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,33 @@
+ Preamble
+
+This Simple Public License 2.0 (SimPL-2.0 for short) is a plain language implementation of GPL 2.0. The words are different, but the goal is the same - to guarantee for all users the freedom to share and change software. If anyone wonders about the meaning of the SimPL, they should interpret it as consistent with GPL 2.0. Original text available on the web: http://opensource.org/licenses/SimPL-2.0.
+
+Simple Public License (SimPL) 2.0
+
+The SimPL applies to the software's source and object code and comes with any rights that I have in it (other than trademarks). You agree to the SimPL by copying, distributing, or making a derivative work of the software.
+
+You get the royalty free right to:
+
+ Use the software for any purpose;
+ Make derivative works of it (this is called a "Derived Work");
+ Copy and distribute it and any Derived Work.
+
+If you distribute the software or a Derived Work, you must give back to the community by:
+
+ Prominently noting the date of any changes you make;
+ Leaving other people's copyright notices, warranty disclaimers, and license terms in place;
+ Providing the source code, build scripts, installation scripts, and interface definitions in a form that is easy to get and best to modify;
+ Licensing it to everyone under SimPL, or substantially similar terms (such as GPL 2.0), without adding further restrictions to the rights provided;
+ Conspicuously announcing that it is available under that license.
+
+There are some things that you must shoulder:
+
+ You get NO WARRANTIES. None of any kind;
+ If the software damages you in any way, you may only recover direct damages up to the amount you paid for it (that is zero if you did not pay anything). You may not recover any other damages, including those called "consequential damages." (The state or country where you live may not allow you to limit your liability in this way, so this may not apply to you);
+
+The SimPL continues perpetually, except that your license rights end automatically if:
+
+ You do not abide by the "give back to the community" terms (your licensees get to keep their rights if they abide);
+ Anyone prevents you from distributing the software under the terms of the SimPL.
+
+If you have questions, please contact beidi.chen@rice.edu regarding the license or use of this for industrial purposes.
diff --git a/README.md b/README.md
index 21ab179..19dc3b8 100644
--- a/README.md
+++ b/README.md
@@ -21,10 +21,14 @@ This package is written in C++ and Python. We require at least g++ version 5 and
 3. Prerequisites
+Software:
+ + C++ compiler
+ + Python 2.7
+
 The following packages are needed in Python for the code to run:
 ```
-C++, Python 2, ngram, sklearn, numpy, scipy, matlib
+ngram, sklearn, numpy, scipy, matlib
 ```
 Remark: In order to install using pip, one will need to run the following commands if errors arise from the terminal due to recent changes with SSH in pip (Linux and MacOS)
@@ -38,8 +42,8 @@ pip2 install numpy scipy matplotlib
 ```
 cd C++Codes
-g++ -std=c++11 *.cpp -fopenmp (on Windows and Linux)
-g++ *.cpp -fopenmp (on MacOS)
+g++ -o minhash -std=c++11 *.cpp -fopenmp (on Windows and Linux)
+g++ -o minhash *.cpp -fopenmp (on MacOS)
 ```
 Remark: For mac users, the g++ version needs to be 5 or higher.
@@ -65,7 +69,7 @@ Use the C++ Package folder in this repository. This is a fast minhash package wh
 1. Update the Config file for minhash and run the program (Remember to change the outputfile name option to Restaurant_pair.csv or the particular name of your data set.) The second and third arguments are K and L respectively.
 ```
-./a.out Config.txt 1 10
+./C++Codes/minhash config_restaurant.txt 1 10
 ```
 The output is `Restaurant_pair.csv` where the output is candidate record pairs:
@@ -81,7 +85,7 @@ Rec1 Rec2
 where there are many customizable options.
 ```
-Python pipeline.py --input Restaurant_pair.csv --goldstan data/Restaurant.csv --output any_custom_file_name
+python pipeline.py --input Restaurant_pair.csv --goldstan data/Restaurant.csv --output any_custom_file_name
 ```
@@ -105,7 +109,7 @@ ID RR (reduction ratio) LSHE
 LSHE is the proposed estimator. RR is the reduction ratio of the number of sampled pairs used in the estimation out of total possible pairs.
-#### fasthash Script
+#### Unique Entity Estimation Script
 For better usabiity, an example script `run_script.sh` produces the estimation of our LSHE estimates very similar to our paper as well as our LSHE plots. This script will run all four data sets, assuming the user has access to the two public data sets and two private data sets. To run the script, simply change into the main directory and them run
@@ -128,5 +132,6 @@ Year = {2018},
 Journal = {Annals of Applied Statistics, To Appear}}
 ```
-#### Awknowledgements
-We would like to thank the Human Rights Data Analysis Group (HRDAG) for providing the data that has movitated this work. Specifically, we thank Megan Price and Patrick Ball for stimulating conversations and feedback that would have not made this work possible. This work would also have not been possible without the support and encouragement of Steve Fienberg and Lars Vilhuber.
+### Acknowledgements
+
+We would like to thank the Human Rights Data Analysis Group (HRDAG) for providing the data that has motivated this work. Specifically, we thank Megan Price and Patrick Ball for stimulating conversations and feedback, without which this work would not have been possible. This work would also not have been possible without the support and encouragement of Steve Fienberg and Lars Vilhuber.
\ No newline at end of file
diff --git a/config_restaurant.txt b/config_restaurant.txt
index dbea1d9..fb49891 100755
--- a/config_restaurant.txt
+++ b/config_restaurant.txt
@@ -23,10 +23,10 @@ Thresh=3
 #Give the input CSV file. First line will be ignored (assumed to be header). Every line will be treated as a
 #record.
 #The line number of record will be its ID. That is the fist line after header is treated as record with ID 1 etc.
-Input=data/restaurant.csv
+Input=data/Restaurant.csv
 #Output File: this will contain a pair of record IDs in each line indicating a possible match.
-Output=restaurant_pair.csv
+Output=Restaurant_pair.csv
 ##############################################################################
 #These are advanced parameters depending on memory
 ##############################################################################
diff --git a/run_script.sh b/run_script.sh
index 543c13a..1a39c2e 100644
--- a/run_script.sh
+++ b/run_script.sh
@@ -3,30 +3,30 @@
 #!/bin/bash
-g++-7 -std=c++11 C++Codes/*.cpp -o output -fopenmp
+g++-7 -std=c++11 C++Codes/*.cpp -o minhash -fopenmp
 For Restaurant
 for ((i=6;i<=25;i+=6)) ;
 do for ((j=1;j<=10; j++));
- do ./output config_restaurant.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.3 --input restaurant_pair.csv --goldstan data/restaurant.csv --output log-restaurant ;
+ do ./minhash config_restaurant.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.3 --input Restaurant_pair.csv --goldstan data/Restaurant.csv --output log-restaurant ;
 done
 done
 python plot.py --input log-restaurant --gt 753
 #For CD
- for ((i=6;i<=20;i+=4)) ;
- do for ((j=1;j<=3; j++));
- do ./output config_cd.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.5 --input cd_pair.csv --goldstan data/cd.csv --delimiter ';' --output log-cd ;
- done
- done
+# for ((i=6;i<=20;i+=4)) ;
+# do for ((j=1;j<=3; j++));
+# do ./minhash config_cd.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.5 --input cd_pair.csv --goldstan data/cd.csv --delimiter ';' --output log-cd ;
+# done
+# done
- python plot.py --input log-cd --gt 9508
+# python plot.py --input log-cd --gt 9508
 #For Voter
 # for ((i=25;i<=40;i+=5)) ;
 # do for ((j=1;j<=10; j++));
-# do ./output config_voter.txt 4 $i; python pipeline.py --flag 0 --id $i --trainsize 0.1 --input voter_pair.csv --goldstan data/voter.csv --delimiter ',' --c 0.0001 --output log-voter ;
+# do ./minhash config_voter.txt 4 $i; python pipeline.py --flag 0 --id $i --trainsize 0.1 --input voter_pair.csv --goldstan data/voter.csv --delimiter ',' --c 0.0001 --output log-voter ;
 # done
 # done
@@ -36,7 +36,7 @@ g++-7 -std=c++11 C++Codes/*.cpp -o output -fopenmp
 # python preprocess.py
 #for ((i=1;i<=10;i++)) ;
-# do ./output config_syria.txt 15 10; python pipeline_for_syria.py --input syria_pair.csv --output log-syria --rawdata data/syria.csv --goldstandpair data/syria_train.csv;
+# do ./minhash config_syria.txt 15 10; python pipeline_for_syria.py --input syria_pair.csv --output log-syria --rawdata data/syria.csv --goldstandpair data/syria_train.csv;
 #done
 #python count.py --input log-syria
diff --git a/setup.sh b/setup.sh
new file mode 100644
index 0000000..a806e71
--- /dev/null
+++ b/setup.sh
@@ -0,0 +1,18 @@
+# Setup script
+# Assumes presence of Anaconda
+
+# Create an environment
+conda create --name LSH python=2.7
+source activate LSH
+
+# Install packages from Anaconda
+conda install numpy
+conda install scipy
+
+# Install packages using pip
+pip install --pre subprocess32
+pip install ngram
+pip install sklearn
+pip install matlib
+
+# this fails due to dependency failure: matlib.h
\ No newline at end of file