v0.2.1
Buzzwords(params_dict: Dict[str, Dict[str, str]] = None)
Model capable of gathering topics from a collection of text or image documents
Parameters
- params_dict : Dict[str, Dict[str, str]], optional
  Custom parameters for the model, used to override the defaults. Has the format
  { model_type1: {parameter1: value, parameter2: value}, .. }
  with the following model types (see the example after this list):
  - Embedding - Parameters from your chosen model_type, e.g. SentenceTransformers or CLIP
  - UMAP - Parameters from UMAP
  - HDBSCAN - Parameters from HDBSCAN
  - Keywords
    - min_df : int min_df value for CountVectoriser
    - num_words : int Number of keywords to return per topic
    - num_word_candidates : int Number of top tf-idf candidates to consider as possible keywords
  - Buzzwords
    - lemmatise_sentences : bool Whether to lemmatise sentences before getting keywords
    - embedding_batch_size : int Batch size to embed sentences with
    - matryoshka_decay : float Multiplicative rate at which HDBSCAN.min_cluster_size decreases on each recursion
    - get_keywords : bool Whether or not to gather the keywords
    - keyword_model : str Which keyword model to use - 'ctfidf' or 'keybert'
    - model_type : str Type of encoding model to run, e.g. 'sentencetransformers' or 'clip'
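For instance, a params_dict that touches several model types at once might look like the sketch below. The parameter names come from the list above; the values are purely illustrative, and whether the Embedding dict is forwarded directly to your encoder's constructor is an assumption to verify for your chosen model_type.
params_dict = {
    'Embedding': {'model_name_or_path': 'all-MiniLM-L6-v2'},  # assumed to be passed to SentenceTransformers
    'UMAP': {'n_neighbors': 20, 'n_components': 5},
    'HDBSCAN': {'min_cluster_size': 300},
    'Keywords': {'num_words': 5, 'min_df': 2},
    'Buzzwords': {'keyword_model': 'ctfidf', 'matryoshka_decay': 0.8}
}
model = Buzzwords(params_dict)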
Attributes
- model_parameters : Dict[str, Dict[str, str]] Values for each model (params_dict parameter alters this)
- embedding_model : Union[SentenceTransformer, clip_encoder.CLIPEncoder] Chosen embedding model
- umap_model : cuml.manifold.umap.UMAP cuML’s UMAP model
- hdbscan_model : custom_hdbscan.HDBSCAN Custom HDBSCAN model
- keyword_model : keywords.Keywords Chosen model for keyword gathering
- topic_embeddings : np.ndarray The top n (num_word_candidates) words concatenated and embedded for each topic
- topic_descriptions : Dict[int, str] {Topic number:Topic Keywords} for each topic
Examples
from buzzwords import Buzzwords
params_dict = {'UMAP': {'n_neighbors': 20}}
model = Buzzwords(params_dict)
Run with default params, overriding UMAP’s n_neighbors value
model = Buzzwords()
docs = df['text_column']
topics = model.fit_transform(docs)
topic_keywords = [model.topic_descriptions[topic] for topic in topics]
Basic model training
model = Buzzwords()
train_df = df.iloc[:50000]
pred_df = df.iloc[50000:]
topics = model.fit_transform(train_df.text_col)
topics.extend(model.transform(pred_df.text_col.reset_index(drop=True)))
Train a model on a batch of data, predict on the rest
keyword = 'covid vaccine corona'
closest_topics = model.get_closest_topic(keyword, n=5)
Get 5 topics (from a trained model) similar to a given phrase
model = Buzzwords()
model.load('saved_model/')
Load a saved model from disk
fit(docs: List[str], recursions: int = 1) -> None
Fit model based on given data
Parameters
- docs : List[str] Text documents to get topics for
- recursions : int Number of times to recurse the model. See Notes
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
recursions is used as input for matryoshka_iteration(), the outlier reduction
method. When it's set to 1, the model is run once on the input data, which can leave a
significant number of outliers. To alleviate this, you can recurse the fit and run
another fit_transform on the outliers themselves. This considers the outliers a
separate set of data and trains a new model to cluster them, repeating recursions
times. The format of the output is the same, except as recursions increases, the
number of outliers in the final dataset decreases.
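For example, a minimal sketch of a deeper fit (the value 3 is arbitrary):
model = Buzzwords()
model.fit(docs, recursions=3)  # recurse twice more, each pass refitting on the previous pass's outliers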
fit_transform(docs: List[str], recursions: int = 1) -> List[int]
Fit model based on given data and return the transformations
Parameters
- docs : List[str] Text documents to get topics for
- recursions : int Number of times to recurse the model. See Notes
Returns
- topics : List[int] Topics for each document
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
recursions is used as input for matryoshka_iteration(), the outlier reduction
method. When it's set to 1, the model is run once on the input data, which can leave a
significant number of outliers. To alleviate this, you can recurse the fit and run
another fit_transform on the outliers themselves. This considers the outliers a
separate set of data and trains a new model to cluster them, repeating recursions
times. The format of the output is the same, except as recursions increases, the
number of outliers in the final dataset decreases.
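As a sketch of the effect, assuming outliers are labelled -1 (the usual HDBSCAN convention):
for recursions in (1, 2, 3):
    model = Buzzwords()
    topics = model.fit_transform(docs, recursions=recursions)
    outlier_rate = sum(t == -1 for t in topics) / len(topics)
    print(recursions, f'{outlier_rate:.1%}')  # the outlier rate should fall as recursions grows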
get_closest_topic(word: str, n: int = 5) -> List[Tuple[int, str]]
Return the top n closest topics a new document may fall into
Parameters
- word : str Keyword to gather closest topics for
- n : int Number of closest topics to show
Returns
- closest_topics : List[Tuple[int, str]] List with topic_index:topic_keywords of the n closest topics
Notes
This differs from transform() in that it returns multiple choices based on the full-size topic embedding, rather than using UMAP/HDBSCAN to return a single prediction.
Generating probabilities for HDBSCAN cluster selections is very inefficient, so this is a simpler alternative when you want to return multiple possible topics.
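A sketch of what the returned list looks like (the topic numbers and keywords here are invented):
closest_topics = model.get_closest_topic('covid vaccine corona', n=3)
# e.g. [(12, 'vaccine pfizer dose'), (4, 'covid corona virus'), (87, 'lockdown mask restrictions')]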
load(destination: str) -> None
Load model from local filesystem
Parameters
- destination : str Location of locally saved model
matryoshka_iteration(embeddings: np.array, recursions: int, min_cluster_size: int, highest_topic: int, topics: np.array) -> np.array
Iterate through a training loop of UMAP/HDBSCAN, recursing on outliers each time
Parameters
- embeddings : np.array Vector embeddings to cluster
- recursions : int Number of times to recursively run this function
- min_cluster_size : int HDBSCAN.min_cluster_size to use for this iteration
- highest_topic : int Highest topic number from previous recursion
- topics : np.array Topic list from previous recursion
Returns
- topics : np.array Every topic for the input data - 1 per datapoint in the input
Notes
This is a recursive function for adding more granularity to a model. It’s used to train a new UMAP/HDBSCAN model on the outliers of each previous recursion. e.g. if you run a model and the first recursion has 40% outliers with 100 topics, the next recursion would be run only on the 40% outliers and would start from topic 101. This keeps going recursions times, reducing the number of outliers and increasing the number of topics each time.
The final output will then be a stack of models which will cascade down when transforming new data. So for a given datapoint, the simplified process goes like:
for model in model_stack:
    topic = model.predict(datapoint)
    if topic is not an outlier:
        break
The key to getting a good distribution of topics is the matryoshka_decay parameter.
It reduces the minimum cluster size multiplicatively on each recursion, meaning you
get a smooth transition from large models to smaller models. To illustrate this, imagine
you set a minimum cluster size of 400 for a training dataset of 500k documents - the third
recursion's training set will be much smaller than 500k, so it doesn't necessarily make
sense to keep the min cluster size at 400 (that would lead to a very skewed topic
distribution, and outliers are often dealt with poorly). By multiplying by a matryoshka
decay of 0.8, the second recursion has a min cluster size of 400*0.8 = 320, the third
has 320*0.8 = 256, and so on.
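The schedule is a simple geometric decay, so the per-recursion minimum cluster size can be sketched as:
min_cluster_size = 400
matryoshka_decay = 0.8
for recursion in range(1, 4):
    print(recursion, round(min_cluster_size))  # 400, 320, 256
    min_cluster_size *= matryoshka_decay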
merge_topic(source_topic: int, destination_topic: int) -> None
Merge two similar topics into one. This is useful when you want to perform surgery on your topic model and curate the topics it has found
Parameters
- source_topic : int Topic to add to destination_topic
- destination_topic : int Topic to add source_topic to
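For example, after inspecting the keywords of two overlapping topics (the topic numbers and keywords here are hypothetical):
print(model.topic_descriptions[3])  # e.g. 'vaccine dose pfizer'
print(model.topic_descriptions[7])  # e.g. 'vaccine booster jab'
model.merge_topic(3, 7)             # topic 3 is folded into topic 7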
save(destination: str) -> None
Save model to local filesystem
Parameters
- destination : str Location to dump model to
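Mirroring the load example above:
model = Buzzwords()
topics = model.fit_transform(docs)
model.save('saved_model/')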
transform(docs: List[str]) -> numpy.ndarray
Predict topics with trained model on new data
Parameters
- docs : List[str] New data to predict topics for
Returns
- topics : numpy.ndarray[int] Topics for each document
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
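A short sketch of the Series caveat from the note above:
new_docs = df['text_column'].iloc[100:]  # a pandas Series whose index no longer starts at 0
topics = model.transform(new_docs.reset_index(drop=True))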
transform_iteration(embeddings: np.array, umap_models: List[UMAP], hdbscan_models: List[HDBSCAN], topics: np.array) -> np.array
Iterate through UMAP and HDBSCAN models to reduce outliers
Parameters
- embeddings : np.array Vector embeddings to get clusters for
- umap_models : List[UMAP] Trained UMAP models to iterate through
- hdbscan_models : List[HDBSCAN] Trained HDBSCAN models to iterate through
- topics : np.array List of topics from previous recursion
Returns
- topics : np.array Every topic for the input data - 1 per datapoint in the input
Notes
See matryoshka_iteration() Notes