CamiloDFM/weaver

weaver

A small Python library that helps you build word networks from text

This library provides two main components: the NetBuilder class, for generating word networks from within a Python script, and the weaver command, a command-line wrapper around NetBuilder.

Install instructions

Weaver isn't available on PyPI: pip install weaver downloads an unrelated library, so you need to install it directly from this repo.

You'll also need to download the NLTK data packages needed by the net builder.

git clone git@github.com:CamiloDFM/weaver.git
cd weaver
pip install .
python weaver/nltk_initial_setup.py

Command usage instructions

weaver [OPTIONS...] input_path

Options

-h, --help: Displays a help message, autogenerated by argparse.
-c, --edge-criterion: Enum that indicates the criterion used to create an edge. It can be distance1 (word co-occurrence), distance2 (co-occurrence, plus being adjacent to the neighbour word), or sentence (an edge is created if the two words appear in the same sentence anywhere in the text). Defaults to distance1.
-f, --from-line: Line number at which the actual text starts in the provided input file. Defaults to line 1.
-o, --output-path: Path where to write the output network as a Pajek file (.net). Defaults to a network.net file in the current directory.
--pos-whitelist: List of POS tags that will act as a POS whitelist. The builder will ignore all words not matching any of those tags.
--pos-blacklist: List of POS tags that will act as a POS blacklist. Makes the builder ignore any word matching any of those tags.
(These are NLTK POS tags, e.g. NNP and NNPS for proper nouns; see the NLTK documentation for the full list.)
--sentence: Boolean flag indicating whether to split the text into sentences before building the network. This is implicitly done with the sentence criterion, but can also be used with the other criteria to prevent the end of a sentence and the start of the next one from getting an edge.
-s, --stemming: Boolean flag indicating whether to perform stemming on the text.
-x, --stopwords: Boolean flag indicating whether stopwords are to be parsed.
-t, --to-line: Line number at which the actual text ends in the provided input file. Defaults to the last line.
--top: Integer that indicates how many of the most frequent words should be considered when building the network.
-w, --weighted: Boolean flag indicating whether to build a weighted network, using the number of edge appearances as weights.
-l, --whitelist: Path to a whitelist file. If provided, it will only consider the words present in the whitelist when building the network.
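
The three edge criteria and the --weighted flag can be illustrated with a short, self-contained sketch. This is not weaver's implementation: build_edges and its token-list input are hypothetical, and reading distance2 as "words up to two positions apart" is an assumption based on the description above.

```python
from collections import Counter
from itertools import combinations


def build_edges(sentences, criterion="distance1"):
    """Count undirected word pairs under one of the three documented criteria.

    `sentences` is a list of token lists. Hypothetical helper for illustration;
    it only mirrors the README's description of the criteria.
    """
    max_gap = {"distance1": 1, "distance2": 2}
    edges = Counter()
    for tokens in sentences:
        if criterion == "sentence":
            # Every pair of distinct words in the same sentence is linked once.
            pairs = combinations(sorted(set(tokens)), 2)
        else:
            gap = max_gap[criterion]
            # Link each word to the words at most `gap` positions after it.
            pairs = (
                tuple(sorted((tokens[i], tokens[j])))
                for i in range(len(tokens))
                for j in range(i + 1, min(i + gap + 1, len(tokens)))
                if tokens[i] != tokens[j]
            )
        edges.update(pairs)
    return edges


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
d1 = build_edges(sentences, "distance1")
d2 = build_edges(sentences, "distance2")
sent = build_edges(sentences, "sentence")
```

The counts in each Counter play the role of edge weights; an unweighted network would keep only the keys. Passing one token list per sentence corresponds to the --sentence behaviour, since no pair ever spans a sentence boundary.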

Example: We download a book from Project Gutenberg. The file starts and ends with metadata; the actual book text starts at line 40 and ends at line 300.
To build a network from this book, using co-occurrence as the edge criterion, taking only proper nouns, and making the edges weighted, and to save the result at nets/book.net:

weaver path/to/book/file.txt --weighted --pos-whitelist NNP NNPS -o nets/book.net -f 40 -t 300
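
For readers unfamiliar with Pajek files, the sketch below shows the general shape of a .net file: a numbered vertex list followed by an edge list with optional weights. The exact layout weaver writes is not documented here, so treat to_pajek and the sample data as hypothetical illustrations of the format rather than the library's output.

```python
def to_pajek(edges):
    """Serialise {(word_a, word_b): weight} as Pajek .net text.

    Hypothetical helper; weaver's own writer may order or format lines differently.
    """
    words = sorted({word for pair in edges for word in pair})
    # Pajek vertex ids are 1-based.
    index = {word: i for i, word in enumerate(words, start=1)}
    lines = [f"*Vertices {len(words)}"]
    lines += [f'{index[word]} "{word}"' for word in words]
    lines.append("*Edges")
    lines += [
        f"{index[a]} {index[b]} {weight}"
        for (a, b), weight in sorted(edges.items())
    ]
    return "\n".join(lines) + "\n"


# Made-up sample data, not taken from any real book.
pajek_text = to_pajek({("Alice", "Bob"): 3, ("Bob", "Carol"): 1})
print(pajek_text)
```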

Class signature

class weaver.wordnet.NetBuilder(
    criterion='distance1',
    sentence=False,
    stemming=False,
    stopwords=False,
    top_words=None,
    weighted=False,
    whitelist_path=None,
    pos_whitelist=None,
    pos_blacklist=None
)
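
The constructor parameters line up with the command options described earlier. The sketch below is a hypothetical argparse parser, written from the documented options and defaults only, that shows one way the command line could be turned into NetBuilder keyword arguments. It does not import weaver and is not the project's actual source; build_parser, IO_OPTIONS and builder_kwargs are illustrative names.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options (not weaver's source)."""
    p = argparse.ArgumentParser(prog="weaver")
    p.add_argument("input_path")
    p.add_argument("-c", "--edge-criterion", dest="criterion",
                   choices=["distance1", "distance2", "sentence"],
                   default="distance1")
    p.add_argument("-f", "--from-line", type=int, default=1)
    p.add_argument("-t", "--to-line", type=int, default=None)
    p.add_argument("-o", "--output-path", default="network.net")
    p.add_argument("--pos-whitelist", nargs="+", default=None)
    p.add_argument("--pos-blacklist", nargs="+", default=None)
    p.add_argument("--sentence", action="store_true")
    p.add_argument("-s", "--stemming", action="store_true")
    p.add_argument("-x", "--stopwords", action="store_true")
    p.add_argument("--top", dest="top_words", type=int, default=None)
    p.add_argument("-w", "--weighted", action="store_true")
    p.add_argument("-l", "--whitelist", dest="whitelist_path", default=None)
    return p


# Options that concern file handling rather than network construction.
IO_OPTIONS = {"input_path", "from_line", "to_line", "output_path"}

args = build_parser().parse_args(
    "path/to/book/file.txt --weighted --pos-whitelist NNP NNPS "
    "-o nets/book.net -f 40 -t 300".split()
)
# The remaining names match the NetBuilder signature one to one.
builder_kwargs = {k: v for k, v in vars(args).items() if k not in IO_OPTIONS}
print(builder_kwargs)
```

With weaver installed, a dictionary like builder_kwargs could presumably be passed as NetBuilder(**builder_kwargs), since its keys match the signature above.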
