
v0.2.0

Buzzwords(params_dict: Dict[str, Dict[str, str]] = None)

Model capable of gathering topics from a collection of text documents

Parameters

Attributes

Examples

from buzzwords import Buzzwords
params_dict = {'UMAP': {'n_neighbors': 20}}
model = Buzzwords(params_dict)

Run with default params, overriding UMAP’s n_neighbors value

model = Buzzwords()
docs = df['text_column']
topics = model.fit_transform(docs)
topic_keywords = [model.topic_descriptions[topic] for topic in topics]

Basic model training

model = Buzzwords()
train_df = df.iloc[:50000]
pred_df = df.iloc[50000:]
topics = model.fit_transform(train_df.text_col)
topics.extend(model.transform(pred_df.text_col.reset_index(drop=True)))

Train a model on a batch of data, predict on the rest

keyword = 'covid vaccine corona'
closest_topics = model.get_closest_topic(keyword, n=5)

Get 5 topics (from a trained model) similar to a given phrase

model = Buzzwords()
model.load('saved_model/')

Load a saved model from disk


fit(docs: List[str], recursions: int = 1) -> None

Fit model based on given data

Parameters

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.

recursions is used as input for matryoshka_iteration(), the outlier reduction method. When it is set to 1, the model is run once on the input data, which can leave a significant number of outliers. To alleviate this, you can recurse the fit and run another fit_transform on the outliers themselves: the outliers are treated as a separate set of data and a new model is trained to cluster them, repeating recursions times. The format of the output stays the same, but as recursions increases, the number of outliers in the final dataset decreases.
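
For example, a minimal sketch of fitting with extra outlier-reduction passes (docs stands for any list of strings; the value 3 is an arbitrary choice):

model = Buzzwords()
model.fit(docs, recursions=3)  # recurse twice more on the outliers left by each pass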


fit_transform(docs: List[str], recursions: int = 1) -> List[int]

Fit model based on given data and return the transformations

Parameters

Returns

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.

recursions is used as input for matryoshka_iteration(), the outlier reduction method. When it is set to 1, the model is run once on the input data, which can leave a significant number of outliers. To alleviate this, you can recurse the fit and run another fit_transform on the outliers themselves: the outliers are treated as a separate set of data and a new model is trained to cluster them, repeating recursions times. The format of the output stays the same, but as recursions increases, the number of outliers in the final dataset decreases.
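
As an illustration, a hedged sketch comparing how many outliers remain for different recursion counts (it assumes outliers are labelled -1, as in HDBSCAN):

topics_single = Buzzwords().fit_transform(docs, recursions=1)
topics_recursed = Buzzwords().fit_transform(docs, recursions=3)

# Count remaining outliers; the recursed run should typically leave fewer
print(sum(t == -1 for t in topics_single))
print(sum(t == -1 for t in topics_recursed))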


get_closest_topic(word: str, n: int = 5) -> List[Tuple[int, str]]

Return the top n closest topics a new message may go in.

Parameters

Returns

Notes

This differs from transform() in that it returns multiple choices based on the full-size topic embeddings, rather than using UMAP/HDBSCAN to return a single prediction.

Generating probabilities for HDBSCAN cluster selections is very inefficient, so this is a simpler alternative when you want to return multiple possible topics.
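
For instance, a small usage sketch (the phrase is arbitrary; per the signature above, the return value is a list of (topic id, topic description) tuples):

matches = model.get_closest_topic('remote working from home', n=3)

for topic_id, description in matches:
    print(topic_id, description)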


load(destination: str) -> None

Load model from local filesystem

Parameters


matryoshka_iteration(embeddings, recursions: int, min_cluster_size: int = None, highest_topic: int = -1, topics=None)

Iterate through a training loop of umap/hdbscan, recursing on outliers each time

Parameters

Returns

Notes

This is a recursive function for adding more granularity to a model: it trains a new UMAP/HDBSCAN model on the outliers of each previous recursion. For example, if the first recursion leaves 40% of the data as outliers and finds 100 topics, the next recursion runs only on that 40% and starts numbering from topic 101. This repeats recursions times, reducing the number of outliers and increasing the number of topics each time.
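
A minimal, hypothetical sketch of that recursion (not the library's actual implementation; cluster_fn stands in for the UMAP/HDBSCAN step and -1 is assumed to mark outliers):

import numpy as np

def matryoshka_sketch(embeddings, recursions, cluster_fn, highest_topic=-1):
    topics = np.full(len(embeddings), -1)  # -1 = outlier
    labels = cluster_fn(embeddings)        # e.g. UMAP reduction + HDBSCAN clustering
    clustered = labels != -1

    # Number the new clusters after the topics found so far
    topics[clustered] = labels[clustered] + highest_topic + 1

    # Recurse on the outliers only, until the recursion budget runs out
    if recursions > 1 and (~clustered).any():
        topics[~clustered] = matryoshka_sketch(
            embeddings[~clustered], recursions - 1, cluster_fn, topics.max()
        )

    return topics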

The final output will then be a stack of models which will cascade down when transforming new data. So for a given datapoint, the simplified process goes like:

for model in model_stack:
    topic = model.predict(datapoint)

    # Stop at the first model that assigns a non-outlier topic
    # (outliers are assumed to be labelled -1, as in HDBSCAN)
    if topic != -1:
        break

merge_topic(source_topic: int, destination_topic: int) -> None

Merge two similar topics into one. This is useful when you want to perform surgery on your topic model and curate the topics found

Parameters
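
For example (the topic numbers are hypothetical; choose them after inspecting model.topic_descriptions):

# Fold topic 12 into topic 7 because they cover the same theme
model.merge_topic(source_topic=12, destination_topic=7)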


save(destination: str) -> None

Save model to local filesystem

Parameters
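
For example, mirroring the load() example above:

model.save('saved_model/')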


transform(docs: List[str])

Predict topics with trained model on new data

Parameters

Returns

Notes

Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
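
For example, a small sketch with a pandas Series (df and text_column follow the earlier examples; note the index reset):

new_docs = df['text_column'].iloc[50000:].reset_index(drop=True)
new_topics = model.transform(new_docs)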


transform_iteration(embeddings, umap_models: List[cuml.manifold.umap.UMAP], hdbscan_models: List[buzzwords.models.custom_hdbscan.HDBSCAN], topics=None)

Iterate through UMAP and HDBSCAN models to reduce outliers

Parameters

Returns

Notes

See matryoshka_iteration() Notes