Akin API Reference

MinHash

The Akin library offers two classes for generating the MinHash object: UniMinHash and MultiMinHash.

UniMinHash

Creates a MinHash object that contains a matrix of MinHash signatures, one for each text.

Texts are shingled and hashed using the bottom-k variant of the MinHash algorithm: each text is hashed once and the k smallest hash values are selected, where k is the number of permutations. This method is less computationally intensive than MultiMinHash, but its similarity estimates are also less stable.
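The bottom-k idea can be illustrated with a minimal sketch. This is not Akin's implementation: the helper names (char_shingles, hash_shingle, bottom_k_signature), the handling of texts shorter than one shingle, and the use of hashlib.blake2b as the hash function are all assumptions made for illustration.

```python
import hashlib


def char_shingles(text, n=9):
    """Return the set of overlapping character shingles of length n."""
    if len(text) < n:
        # Assumption: a text shorter than one shingle is kept whole.
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def hash_shingle(shingle, seed=0):
    """Hash a shingle to a 64-bit integer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def bottom_k_signature(text, k=100, n=9, seed=0):
    """Hash every shingle once and keep the k smallest hash values."""
    hashes = {hash_shingle(s, seed) for s in char_shingles(text, n)}
    return sorted(hashes)[:k]


signature = bottom_k_signature(
    "the quick brown fox jumps over the lazy dog", k=10, n=5
)
```

Because only one hash function is evaluated per shingle, the cost does not grow with the number of permutations in the way it does when every shingle is hashed once per permutation.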

UniMinHash(
    text, 
    n_gram=9, 
    n_gram_type='char', 
    permutations=100, 
    hash_bits=64, 
    seed=None
)

Parameters

text {list or ndarray}
Iterable of strings, one for each text in the corpus.

n_gram int, optional, default: 9
Size of each overlapping shingle that texts are broken into prior to hashing. Choose the shingle size according to the average text length: too small a shingle size will yield false similarities, whereas too large a shingle size will fail to return similar documents.

For character shingles, a size of 5 is recommended for shorter texts such as emails; the default size of 9 is recommended for longer texts or documents.

n_gram_type str, optional, default: 'char'
Type of n-gram to use for shingles. Must be 'char' to split text into character shingles or 'term' to split text into overlapping sequences of words.
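The difference between the two shingle types can be seen in a small sketch. The function below is hypothetical and only mirrors the n_gram and n_gram_type parameter names; splitting terms on whitespace and keeping short texts whole are assumptions, not documented Akin behaviour.

```python
def shingles(text, n_gram=9, n_gram_type="char"):
    """Split text into a set of overlapping shingles."""
    if n_gram_type == "char":
        units, join = text, "".join
    elif n_gram_type == "term":
        # Assumption: terms are whitespace-separated words.
        units, join = text.split(), " ".join
    else:
        raise ValueError("n_gram_type must be 'char' or 'term'")
    if len(units) < n_gram:
        # Assumption: a text shorter than one shingle is kept whole.
        return {join(units)}
    return {join(units[i:i + n_gram]) for i in range(len(units) - n_gram + 1)}


char_result = shingles("minhash", n_gram=5, n_gram_type="char")
# {'minha', 'inhas', 'nhash'}

term_result = shingles("the cat sat on the mat", n_gram=3, n_gram_type="term")
# {'the cat sat', 'cat sat on', 'sat on the', 'on the mat'}
```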

permutations int, optional, default: 100
Number of randomly sampled hash values to use for generating each text's minhash signature. Intuitively, the larger the number of permutations, the more accurate the estimated Jaccard similarity between texts, but the longer the algorithm will take to run.

hash_bits int, optional, default: 64
Hash value size to be used to generate minhash signatures from shingles. Must be 32, 64 or 128 bits.

Hash value size should be chosen based on text length and a trade-off between performance and accuracy. Smaller hash sizes risk hash collisions, leading to false similarities between documents in larger corpora of texts.
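To get a feel for the collision risk, the birthday approximation gives a rough estimate of how likely it is that at least two distinct shingles share a hash value. This is a sketch that assumes an ideal, uniformly distributed hash; the function name collision_probability is hypothetical and the numbers are not measurements of Akin itself.

```python
import math


def collision_probability(distinct_shingles, hash_bits):
    """Approximate chance that any two distinct shingles share a hash value."""
    pairs = distinct_shingles * (distinct_shingles - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed stably for very small probabilities.
    return -math.expm1(-pairs / 2 ** hash_bits)


for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

Under these assumptions, a corpus with a million distinct shingles is almost certain to contain at least one collision with 32-bit hashes, while the probability is tiny at 64 bits and negligible at 128 bits.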

seed int, optional, default: None
Seed from which to generate the random hash function. Necessary for reproducibility or to allow the LSH model to be updated with new minhash values later.

Properties

n_gram: int
Returns size of each overlapping text shingle used to create minhash signatures.

n_gram_type: str
Returns type of n-gram used for text shingling.

permutations: int
Returns number of permutations used to create signatures.

hash_bits: int
Returns hash value size used to create signatures.

seed: int
Returns seed value used to generate random hashes in minhash function.

signatures: numpy.ndarray
Returns the n x m matrix of text signatures generated by the minhash function, where n is the number of texts (one row per text) and m is the number of permutations.

MultiMinHash

Creates a MinHash object that contains a matrix of MinHash signatures, one for each text.

Texts are shingled, then each text's shingles are hashed once per permutation and the minimum hash value is selected each time to construct the signature.
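The multi-hash scheme described above can be sketched as follows. This is an illustration rather than Akin's code: the helper names, the use of a seeded hashlib.blake2b as the family of hash functions, and the position-wise estimate_jaccard function are assumptions.

```python
import hashlib


def char_shingles(text, n=5):
    """Return the set of overlapping character shingles of length n."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(shingle, seed):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def multi_hash_signature(text, permutations=100, n=5):
    """Hash all shingles once per permutation and keep each minimum."""
    shingle_set = char_shingles(text, n)
    return [
        min(seeded_hash(s, seed) for s in shingle_set)
        for seed in range(permutations)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = multi_hash_signature("the quick brown fox jumps over the lazy dog")
sig_b = multi_hash_signature("the quick brown fox jumps over the lazy cat")
estimate = estimate_jaccard(sig_a, sig_b)
```

Each signature position costs one pass over all shingles, which is why this approach is slower than the bottom-k variant but gives one independent comparison per permutation.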

MultiMinHash(
    text, 
    n_gram=9, 
    n_gram_type='char', 
    permutations=100, 
    hash_bits=64, 
    seed=None
)

Parameters

text {list or ndarray}
Iterable of strings, one for each text in the corpus.

n_gram int, optional, default: 9
Size of each overlapping shingle that texts are broken into prior to hashing. Choose the shingle size according to the average text length: too small a shingle size will yield false similarities, whereas too large a shingle size will fail to return similar documents.

For character shingles, a size of 5 is recommended for shorter texts such as emails; the default size of 9 is recommended for longer texts or documents.

n_gram_type str, optional, default: 'char'
Type of n-gram to use for shingles. Must be 'char' to split text into character shingles or 'term' to split text into overlapping sequences of words.

permutations int, optional, default: 100
Number of randomly sampled hash values to use for generating each text's minhash signature. Intuitively, the larger the number of permutations, the more accurate the estimated Jaccard similarity between texts, but the longer the algorithm will take to run.
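The accuracy gain from more permutations can be quantified with a small sketch. It assumes the standard result for MinHash with independent hash functions, where each signature position matches with probability equal to the true Jaccard similarity; the function name minhash_standard_error is hypothetical.

```python
import math


def minhash_standard_error(jaccard, permutations):
    """Standard error of the estimate when averaging independent matches."""
    return math.sqrt(jaccard * (1.0 - jaccard) / permutations)


for k in (50, 100, 400):
    print(k, round(minhash_standard_error(0.5, k), 4))
```

Under this assumption the error shrinks with the square root of the number of permutations, so quadrupling permutations from 100 to 400 only halves the standard error (from 0.05 to 0.025 at a true similarity of 0.5).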

hash_bits int, optional, default: 64
Hash value size to be used to generate minhash signatures from shingles. Must be 32, 64 or 128 bits. Hash value size should be chosen based on text length and a trade-off between performance and accuracy. Smaller hash sizes risk hash collisions, leading to false similarities between documents in larger corpora of texts.

seed int, optional, default: None
Seed from which to generate the random hash function. Necessary for reproducibility or to allow the LSH model to be updated with new minhash values later.

Properties

n_gram: int
Returns size of each overlapping text shingle used to create minhash signatures.

n_gram_type: str
Returns type of n-gram used for text shingling.

permutations: int
Returns number of permutations used to create signatures.

hash_bits: int
Returns hash value size used to create signatures.

seed: int
Returns seed value used to generate random hashes in minhash function.

signatures: numpy.ndarray
Returns the n x m matrix of text signatures generated by the minhash function, where n is the number of texts (one row per text) and m is the number of permutations.

LSH

Creates an LSH model of text similarity that can be used to return similar texts based on estimated Jaccard similarity.

LSH(permutations, no_of_bands=None, seed=1)

Parameters

permutations int
Number of permutations used to generate the minhash signatures with MultiMinHash or UniMinHash.

no_of_bands int, optional, default: permutations // 2
Number of bands to break each minhash signature into before hashing into buckets. A smaller number of bands results in a stricter algorithm, since larger sections of two signatures must match, possibly leading to false negatives that miss some similar texts. A higher number of bands may lead to false similarities.
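The effect of the number of bands can be made concrete with the textbook banding formula. This sketch assumes signatures are split into equal bands of permutations // no_of_bands rows and that two texts become candidates when at least one band matches exactly; whether Akin follows this scheme in every detail is an assumption, and both function names are hypothetical.

```python
def candidate_probability(jaccard, permutations, no_of_bands):
    """Chance that two texts share at least one identical band."""
    rows = permutations // no_of_bands
    return 1.0 - (1.0 - jaccard ** rows) ** no_of_bands


def approximate_threshold(permutations, no_of_bands):
    """Similarity at which the candidate probability rises most steeply."""
    rows = permutations // no_of_bands
    return (1.0 / no_of_bands) ** (1.0 / rows)


for bands in (10, 25, 50):
    print(
        bands,
        round(approximate_threshold(100, bands), 3),
        round(candidate_probability(0.8, 100, bands), 3),
    )
```

With 100 permutations, 10 bands puts the approximate threshold near 0.79, whereas 50 bands lowers it to roughly 0.14, so many more pairs are returned as candidates.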

seed int, optional, default: 1
Seed from which to generate the random hash function. Necessary for reproducibility or to allow the LSH model to be updated with new minhash values later.

Methods

.update(minhash_signatures, labels)

Updates model with minhash signatures and their corresponding labels.

minhash list
MinHash object containing signatures of new texts. Its parameters must match those of any previous MinHash objects.

new_labels list
List, array or Pandas series containing unique labels for each text.

.query(label, min_jaccard=None, sensitivity=1, include_similarity=False)

Returns list of near-duplicates for text with provided label.

label str
Label of text for which to return near-duplicates.

min_jaccard float, optional, default: None
Minimum Jaccard similarity for texts to be returned as near-duplicates. If specified, the Jaccard similarity for each candidate near-duplicate signature will be explicitly calculated, improving accuracy but also increasing run time.

sensitivity int, optional, default: 1
Number of unique buckets two ids must co-occur in to be considered a candidate near-duplicate pair.

include_similarity bool, optional, default: False
Return the similarity score alongside each estimated near-duplicate. If selected, results are returned as a list of (label, score) tuples. Note that a min_jaccard score must be provided.
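How sensitivity and min_jaccard interact can be sketched with a toy bucket index. This is not Akin's implementation: build_buckets, signature_similarity and the standalone query function are hypothetical, bands are used directly as bucket keys instead of being hashed, and the position-wise similarity is the estimate that applies to multi-hash style signatures.

```python
from collections import Counter, defaultdict


def build_buckets(signatures, no_of_bands):
    """Group labels by (band index, band contents)."""
    buckets = defaultdict(set)
    for label, signature in signatures.items():
        rows = len(signature) // no_of_bands
        for band in range(no_of_bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(label)
    return buckets


def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two texts agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def query(label, signatures, buckets, min_jaccard=None, sensitivity=1):
    """Return labels sharing at least `sensitivity` buckets with `label`."""
    shared = Counter()
    for members in buckets.values():
        if label in members:
            shared.update(members - {label})
    candidates = sorted(o for o, count in shared.items() if count >= sensitivity)
    if min_jaccard is None:
        return candidates
    # Explicitly check each candidate against the similarity threshold.
    return [
        other for other in candidates
        if signature_similarity(signatures[label], signatures[other]) >= min_jaccard
    ]


toy_signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 9, 9],
    "c": [1, 2, 7, 7, 8, 8],
    "d": [0, 0, 0, 0, 0, 0],
}
toy_buckets = build_buckets(toy_signatures, no_of_bands=3)

print(query("a", toy_signatures, toy_buckets))                   # ['b', 'c']
print(query("a", toy_signatures, toy_buckets, sensitivity=2))    # ['b']
print(query("a", toy_signatures, toy_buckets, min_jaccard=0.5))  # ['b']
```

Raising sensitivity filters candidates cheaply by bucket counts alone, while min_jaccard adds an explicit comparison of signatures for every remaining candidate.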

.remove(labels)

Removes labels and their associated text signatures from the model.

labels list
List of labels to remove from the LSH model.

.adjacency_list(labels=None, min_jaccard=None, sensitivity=1)

Returns an adjacency list dictionary mapping all labels to their estimated near duplicates. Can be used to create an undirected graph for texts in the LSH object.

labels list, optional, default: None
List of labels to limit the adjacency list to.

min_jaccard float, optional, default: None
Minimum Jaccard similarity for texts to be returned as near-duplicates. If specified, the Jaccard similarity for each candidate near-duplicate signature will be explicitly calculated, improving accuracy but also increasing run time.

sensitivity int, optional, default: 1
Number of unique buckets two ids must co-occur in to be considered a candidate near-duplicate pair.
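One use of an adjacency list is to group texts into clusters of near-duplicates by finding the connected components of the undirected graph. The sketch below assumes the dictionary maps each label to a list of near-duplicate labels, as the description of .adjacency_list suggests; the duplicate_clusters helper and the sample data are hypothetical.

```python
def duplicate_clusters(adjacency):
    """Group labels into connected components of the near-duplicate graph."""
    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency.get(node, []))
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hand-written stand-in for the output of .adjacency_list().
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
clusters = duplicate_clusters(adjacency)
print(clusters)  # [['a', 'b', 'c'], ['d']]
```

Here "a" and "c" end up in the same cluster even though they are not direct neighbours, because both are near-duplicates of "b".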

Properties

no_of_bands: int
Number of bands used in LSH model.

seed: int
Seed used to generate random hash function.

permutations: int
Number of permutations used to create minhash signatures used in LSH model.