Tutorial: Buzzwords Usage

Instantiating a Model

Instantiating the model is simple:

from buzzwords import Buzzwords
 
model = Buzzwords()

You can also change the parameters by passing your own parameter dictionary:

params_dict = {'UMAP': {'n_neighbors': 5}}
 
model = Buzzwords(params_dict=params_dict)

This overrides the defaults; for more information, see the API Docs. The parameter dictionary can be inspected through model_parameters:

model.model_parameters
 
>>> {'Embedding': {'model_name_or_path': 'paraphrase-MiniLM-L3-v2'}, 'UMAP': {'n_neighbors': 10, 'n_components': 5, 'min_dist': 0.0, 'random_state': 123}, 'HDBSCAN': {'min_cluster_size': 20, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}, 'CTFIDF': {'min_df': 1, 'num_words': 5, 'num_word_candidates': 30}, 'Buzzwords': {'similarity_threshold': 0.15, 'lemmatise_sentences': False, 'embedding_batch_size': 128}}
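
To illustrate how a partial params_dict can override only the keys it names, here is a minimal sketch of a nested dictionary merge. The helper merge_params is hypothetical and is not the library's implementation; the values in DEFAULTS are copied from the output above and trimmed to two sections for brevity.

```python
from copy import deepcopy

# Subset of the parameters shown above (trimmed for brevity)
DEFAULTS = {
    'UMAP': {'n_neighbors': 10, 'n_components': 5, 'min_dist': 0.0, 'random_state': 123},
    'HDBSCAN': {'min_cluster_size': 20, 'metric': 'euclidean', 'cluster_selection_method': 'eom'},
}


def merge_params(defaults, overrides):
    """Hypothetical helper: return a copy of defaults with overrides applied per section."""
    merged = deepcopy(defaults)
    for section, values in overrides.items():
        # Update only the keys named in the override and keep the rest
        merged.setdefault(section, {}).update(values)
    return merged


merged = merge_params(DEFAULTS, {'UMAP': {'n_neighbors': 5}})
print(merged['UMAP'])
```

Only n_neighbors changes in the merged result; the other UMAP keys and the whole HDBSCAN section keep their original values, and DEFAULTS itself is left unmodified.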

Training a Model

To train the model on a set of documents, call the fit_transform() function, which returns a topic for each document:

docs = df['text_column']
 
topics = model.fit_transform(docs)

Then use the model's topic descriptions to get the keywords for each topic:

first_doc_topic = topics[0]
 
model.topic_descriptions[first_doc_topic]
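
Once there is one topic per document, a natural next step is to see how many documents fall into each topic. The sketch below uses stand-in values for topics and topic_descriptions (made up for illustration, not real model output) and assumes topic_descriptions can be indexed by topic ID, as in the snippet above.

```python
from collections import Counter

# Stand-in data for illustration only; in practice these come from the model
topics = [0, 1, 0, 2, 1, 0]
topic_descriptions = {
    0: 'keywords for topic 0',
    1: 'keywords for topic 1',
    2: 'keywords for topic 2',
}

# Count documents per topic, most frequent first
topic_counts = Counter(topics)
summary = [
    (topic, count, topic_descriptions[topic])
    for topic, count in topic_counts.most_common()
]

for topic, count, description in summary:
    print(f'topic {topic}: {count} docs, {description}')
```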

When using large datasets, it's recommended to train the model on a sample and then predict in batches. This prevents memory issues, as the library is very memory-intensive:

train_docs = df.iloc[:500000]['text_column']
 
topics = model.fit_transform(train_docs)

# Convert to a plain list, which drops the pandas index, to prevent an error from SentenceTransformer
predict_docs = df.iloc[500000:1000000]['text_column'].values.tolist()
 
topics.extend(model.transform(predict_docs))
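
The example above predicts on a single slice. To predict on several batches, the same pattern can be wrapped in a loop. This is a sketch under assumptions: iter_batches and predict_in_batches are hypothetical helpers, and transform stands for any callable that takes a list of documents and returns one topic per document, such as model.transform.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items as plain lists."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_in_batches(transform, docs, batch_size):
    """Hypothetical helper: call transform on each batch and collect the results."""
    topics = []
    for batch in iter_batches(docs, batch_size):
        topics.extend(transform(batch))
    return topics


# Stand-in for model.transform so the sketch runs without a trained model
def fake_transform(batch):
    return [len(doc) % 3 for doc in batch]


docs = ['a', 'bb', 'ccc', 'dddd', 'eeeee']
print(predict_in_batches(fake_transform, docs, batch_size=2))
```

With a trained model, the call might look like predict_in_batches(model.transform, predict_docs, batch_size=100000), where the batch size is an arbitrary example to be tuned to the available memory.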

Saving/Loading a Model

Buzzwords objects offer built-in functions for saving and loading models.

model = Buzzwords()

topics = model.fit_transform(df['text_column'])

model.save('models/model.buzz')

Loading a pretrained model works similarly:

model = Buzzwords()

model.load('models/model.buzz')
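
The save() example in this section writes to a path inside a models directory. Whether save() creates missing directories is not covered here, so creating the target directory beforehand is a cautious choice. The sketch below shows one way to do that; prepare_model_path is a hypothetical helper, and a temporary directory is used only so the example runs without touching your project.

```python
import tempfile
from pathlib import Path


def prepare_model_path(path):
    """Hypothetical helper: make sure the parent directory of path exists."""
    model_path = Path(path)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    return str(model_path)


with tempfile.TemporaryDirectory() as tmp:
    save_path = prepare_model_path(Path(tmp) / 'models' / 'model.buzz')
    parent_exists = Path(save_path).parent.is_dir()
    print(parent_exists)
    # With a trained model, model.save(save_path) would follow here
```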

Inference

You can use a pretrained model to run inference on new data points:

model = Buzzwords()

model.load('models/model.buzz')

topics = model.transform(df['text_column'])

Image Topic Modelling

Topic modelling for images works much the same as for sentences: set model_type to 'clip' and use the paths to your images as input.

params_dict = {
    'Buzzwords': {
        'model_type': 'clip',
        'get_keywords': False
    },
    'Embedding': {
        'device': 'cuda',
        'model_name_or_path': 'ViT-B/32'
    }
}

model = Buzzwords(params_dict=params_dict)

# Pass image paths, not image objects
image_paths = df['image_path']

topics = model.fit_transform(image_paths)
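
Because the input is a collection of paths rather than loaded images, a missing file would presumably only surface once embedding starts. A quick check beforehand can catch that earlier. The sketch below is one way to do it; split_existing is a hypothetical helper, and the files are empty placeholders created in a temporary directory purely so the example runs.

```python
import tempfile
from pathlib import Path


def split_existing(paths):
    """Hypothetical helper: separate paths into existing files and missing ones."""
    existing, missing = [], []
    for path in paths:
        (existing if Path(path).is_file() else missing).append(str(path))
    return existing, missing


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / 'cat.jpg'
    real.write_bytes(b'')  # empty placeholder file, not a real image
    existing, missing = split_existing([real, Path(tmp) / 'gone.jpg'])
    print(len(existing), len(missing))
```

The existing list can then be passed to fit_transform() in place of the raw column, while missing can be logged or inspected.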