
v0.2.2

Buzzwords(params_dict: Dict[str, Dict[str, Any]] = None)

Model capable of gathering topics from a collection of text or image documents

Parameters

Attributes

Examples

from buzzwords import Buzzwords
params_dict = {'UMAP': {'n_neighbors': 20}}
model = Buzzwords(params_dict)

Run with default params, overriding UMAP’s n_neighbors value

model = Buzzwords()
docs = df['text_column']
topics = model.fit_transform(docs)
topic_keywords = [model.topic_descriptions[topic] for topic in topics]

Basic model training

model = Buzzwords()
train_df = df.iloc[:50000]
pred_df = df.iloc[50000:]
topics = model.fit_transform(train_df.text_col)
topics.extend(model.transform(pred_df.text_col.reset_index(drop=True)))

Train a model on a batch of data, predict on the rest

keyword = 'covid vaccine corona'
closest_topics = model.get_closest_topic(keyword, n=5)

Get 5 topics (from a trained model) similar to a given phrase

model = Buzzwords()
model.load('saved_model/')

Load a saved model from disk


fit(docs: List[str], recursions: int = 1) -> None

Fit model based on given data

Parameters

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series

recursions is used as input for matryoshka_iteration(), the outlier reduction method. When it’s set to 1, the model is run once on the input data, which can leave a significant number of outliers. To alleviate this, you can recurse the fit and run another fit_transform on the outliers themselves. This treats the outliers as a separate set of data and trains a new model to cluster them, repeating recursions times. The format of the output is the same, except as recursions increases, the number of outliers in the final dataset decreases.
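As a minimal sketch of the above, the snippet below fits a model with several recursions and measures what fraction of documents remain outliers. The `docs` input and the -1 outlier label are assumptions here (the -1 convention comes from HDBSCAN, which Buzzwords clusters with), not confirmed by this page.

```python
def fit_with_recursions(docs, recursions=3):
    """Train a Buzzwords model, recursing on outliers `recursions` times."""
    from buzzwords import Buzzwords  # deferred import: needs the buzzwords/RAPIDS stack
    model = Buzzwords()
    model.fit(docs, recursions=recursions)
    return model

def outlier_share(topics):
    """Fraction of documents left unassigned (labelled -1, HDBSCAN convention)."""
    return topics.count(-1) / len(topics)
```

Raising recursions should shrink the outlier_share of the resulting topic assignments.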


fit_transform(docs: List[str], recursions: int = 1) -> List[int]

Fit model based on given data and return the transformations

Parameters

Returns

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series

recursions is used as input for matryoshka_iteration(), the outlier reduction method. When it’s set to 1, the model is run once on the input data, which can leave a significant number of outliers. To alleviate this, you can recurse the fit and run another fit_transform on the outliers themselves. This treats the outliers as a separate set of data and trains a new model to cluster them, repeating recursions times. The format of the output is the same, except as recursions increases, the number of outliers in the final dataset decreases.


get_closest_topic(word: str, n: int = 5) -> List[Tuple[int, str]]

Return the top n closest topics a new document may go in

Parameters

Returns

Notes

This differs from transform() in that it returns multiple choices based on the full-size topic embedding, rather than using UMAP/HDBSCAN to return a single prediction.

Generating probabilities for HDBSCAN cluster selections is very inefficient, so this is a simpler alternative when you want to return multiple possible topics
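Since the return value is a list of (topic_id, description) tuples, a small helper can render the candidates for inspection. The helper below is an illustration, not part of the library; the commented call assumes a trained model.

```python
def format_candidates(candidates):
    """Render (topic_id, description) tuples as returned by get_closest_topic()."""
    return [f"{topic_id}: {description}" for topic_id, description in candidates]

# With a trained model (requires the buzzwords/GPU stack to actually run):
# candidates = model.get_closest_topic('covid vaccine corona', n=5)
# print('\n'.join(format_candidates(candidates)))
```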


load(destination: str) -> None

Load model from local filesystem

Parameters


matryoshka_iteration(embeddings, recursions: int, min_cluster_size: int = None, highest_topic: int = -1, topics=None)

Iterate through a training loop of umap/hdbscan, recursing on outliers each time

Parameters

Returns

Notes

This is a recursive function for adding more granularity to a model. It’s used to train a new UMAP/HDBSCAN model on the outliers of each previous recursion. e.g. if you run a model and the first recursion has 40% outliers with 100 topics, the next recursion would be run only on the 40% outliers and would start from topic 101. This keeps going recursions times, reducing the number of outliers and increasing the number of topics each time.

The final output will then be a stack of models which will cascade down when transforming new data. So for a given datapoint, the simplified process goes like:

for model in model_stack:
    topic = model.predict(datapoint)

    if topic is not an outlier:
        break
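The cascade above can be sketched as runnable Python, with plain callables standing in for the UMAP/HDBSCAN submodels; the -1 outlier label is an assumption borrowed from HDBSCAN, not stated on this page.

```python
OUTLIER = -1  # HDBSCAN's conventional label for unclustered points

def cascade_predict(model_stack, datapoint):
    """Try each submodel in order; stop at the first non-outlier topic."""
    topic = OUTLIER
    for predict in model_stack:
        topic = predict(datapoint)
        if topic != OUTLIER:
            break
    return topic
```

A datapoint that every submodel rejects simply stays an outlier.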

The key to getting a good distribution of topics is the matryoshka_decay parameter. This reduces the minimum cluster size multiplicatively on each recursion, giving a smooth transition from larger models to smaller models. To illustrate this, imagine you set a minimum cluster size of 400 for a training dataset of size 500k - the third recursion’s training set will be much smaller than 500k, so it doesn’t necessarily make sense to keep the min cluster size at 400 (that would lead to a very skewed topic distribution, and outliers are often dealt with poorly). Multiplying by a matryoshka_decay of 0.8 means the second recursion has a min cluster size of 400*0.8=320, the third has 320*0.8=256, and so on
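The 400 → 320 → 256 schedule from the example can be computed directly. The function below only illustrates the decay arithmetic; it is not the library’s internal code.

```python
def decayed_cluster_sizes(min_cluster_size, matryoshka_decay, recursions):
    """Min cluster size used at each recursion: multiply by the decay each step."""
    sizes = []
    size = float(min_cluster_size)
    for _ in range(recursions):
        sizes.append(int(size))
        size *= matryoshka_decay
    return sizes
```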


merge_topic(source_topic: int, destination_topic: int) -> None

Merge two similar topics into one. This is useful when you want to perform surgery on your topic model and curate the topics found

Parameters
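Conceptually, merging relabels every document in the source topic as the destination topic. The helper below illustrates that on a plain list of assignments; the commented method call uses hypothetical topic ids on a trained model.

```python
def merge_labels(topics, source_topic, destination_topic):
    """Relabel source_topic as destination_topic in a list of topic assignments."""
    return [destination_topic if t == source_topic else t for t in topics]

# On a trained model (topic ids 12 and 5 are hypothetical):
# model.merge_topic(source_topic=12, destination_topic=5)
```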


save(destination: str) -> None

Save model to local filesystem

Parameters


transform(docs: List[str]) -> List[int]

Predict topics with trained model on new data

Parameters

Returns

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series


transform_iteration(embeddings, umap_models: List[cuml.manifold.umap.UMAP], hdbscan_models: List[buzzwords.models.custom_hdbscan.HDBSCAN], topics=None)

Iterate through UMAP and HDBSCAN models to reduce outliers

Parameters

Returns

Notes

See matryoshka_iteration() Notes