v0.2.1
Buzzwords(params_dict: Dict[str, Dict[str, str]] = None)
Model capable of gathering topics from a collection of text or image documents
Parameters
- params_dict : Dict[str, Dict[str, str]], optional
  Custom parameters for the model, used to override the defaults. Has the format
  { model_type1: {parameter1: value, parameter2: value}, .. }
  with the following model types (see the example after this list):
  - Embedding - Parameters from your chosen model_type, e.g. SentenceTransformers or CLIP
  - UMAP - Parameters from UMAP
  - HDBSCAN - Parameters from HDBSCAN
  - Keywords
    - min_df : int min_df value for CountVectoriser
    - num_words : int Number of keywords to return per topic
    - num_word_candidates : int Number of top tf-idf candidates to consider as possible keywords
  - Buzzwords
    - lemmatise_sentences : bool Whether to lemmatise sentences before getting keywords
    - embedding_batch_size : int Batch size to embed sentences with
    - matryoshka_decay : float Multiplicative rate at which HDBSCAN.min_cluster_size decreases on each recursion
    - get_keywords : bool Whether or not to gather the keywords
    - keyword_model : str Which keyword model to use - 'ctfidf' or 'keybert'
    - model_type : str Type of encoding model to run, e.g. 'sentencetransformers' or 'clip'
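For instance, a params_dict that touches several model types at once might look like the sketch below. The parameter names come from the list above; the values are purely illustrative, and whether the Embedding dict is forwarded directly to your encoder's constructor is an assumption to verify for your chosen model_type.
params_dict = {
    'Embedding': {'model_name_or_path': 'all-MiniLM-L6-v2'},  # assumed to be passed to SentenceTransformers
    'UMAP': {'n_neighbors': 20, 'n_components': 5},
    'HDBSCAN': {'min_cluster_size': 300},
    'Keywords': {'num_words': 5, 'min_df': 2},
    'Buzzwords': {'keyword_model': 'ctfidf', 'matryoshka_decay': 0.8}
}
model = Buzzwords(params_dict)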
Attributes
- model_parameters : Dict[str, Dict[str, str]] Values for each model (params_dict parameter alters this)
- embedding_model : Union[SentenceTransformer, clip_encoder.CLIPEncoder] Chosen embedding model
- umap_model : cuml.manifold.umap.UMAP cuML’s UMAP model
- hdbscan_model : custom_hdbscan.HDBSCAN Custom HDBSCAN model
- keyword_model : keywords.Keywords Chosen model for keyword gathering
- topic_embeddings : np.ndarray The top n (num_word_candidates) words concatenated and embedded for each topic
- topic_descriptions : Dict[int, str] {Topic number:Topic Keywords} for each topic
Examples
from buzzwords import Buzzwords
params_dict = {'UMAP': {'n_neighbors': 20}}
model = Buzzwords(params_dict)
Run with default params, overriding UMAP’s n_neighbors value
model = Buzzwords()
docs = df['text_column']
topics = model.fit_transform(docs)
topic_keywords = [model.topic_descriptions[topic] for topic in topics]
Basic model training
model = Buzzwords()
train_df = df.iloc[:50000]
pred_df = df.iloc[50000:]
topics = model.fit_transform(train_df.text_col)
topics.extend(model.transform(pred_df.text_col.reset_index(drop=True)))
Train a model on a batch of data, predict on the rest
keyword = 'covid vaccine corona'
closest_topics = model.get_closest_topic(keyword, n=5)
Get 5 topics (from a trained model) similar to a given phrase
model = Buzzwords()
model.load('saved_model/')
Load a saved model from disk
fit(docs: List[str], recursions: int = 1) -> None
Fit model based on given data
Parameters
- docs : List[str] Text documents to get topics for
- recursions : int Number of times to recurse the model. See Notes
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
recursions is used as input for matryoshka_iteration(), the outlier reduction
method. When it's set to 1, the model is run once on the input data, which can leave a
significant number of outliers. To alleviate this, you can recurse the fit and run
another fit_transform on the outliers themselves. This considers the outliers a
separate set of data and trains a new model to cluster them, repeating recursions
times. The format of the output is the same, except as recursions increases, the
number of outliers in the final dataset decreases.
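For example, a minimal sketch of a deeper fit (the value 3 is arbitrary):
model = Buzzwords()
model.fit(docs, recursions=3)  # recurse twice more, each pass refitting on the previous pass's outliers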
fit_transform(docs: List[str], recursions: int = 1) -> List[int]
Fit model based on given data and return the transformations
Parameters
- docs : List[str] Text documents to get topics for
- recursions : int Number of times to recurse the model. See Notes
Returns
- topics : List[int] Topics for each document
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
recursions is used as input for matryoshka_iteration(), the outlier reduction
method. When it's set to 1, the model is run once on the input data, which can leave a
significant number of outliers. To alleviate this, you can recurse the fit and run
another fit_transform on the outliers themselves. This considers the outliers a
separate set of data and trains a new model to cluster them, repeating recursions
times. The format of the output is the same, except as recursions increases, the
number of outliers in the final dataset decreases.
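As a sketch of the effect, assuming outliers are labelled -1 (the usual HDBSCAN convention):
for recursions in (1, 2, 3):
    model = Buzzwords()
    topics = model.fit_transform(docs, recursions=recursions)
    outlier_rate = sum(t == -1 for t in topics) / len(topics)
    print(recursions, f'{outlier_rate:.1%}')  # the outlier rate should fall as recursions grows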
get_closest_topic(word: str, n: int = 5) -> List[Tuple[int, str]]
Return the top n closest topics a new document may fall into
Parameters
- word : str Keyword to gather closest topics for
- n : int Number of closest topics to show
Returns
- closest_topics : List[Tuple[int, str]] List with topic_index:topic_keywords of the n closest topics
Notes
This differs from transform() in that it returns multiple choices based on the full-size topic embedding, rather than using UMAP/HDBSCAN to return a single prediction.
Generating probabilities for HDBSCAN cluster selections is very inefficient, so this is a simpler alternative when you want to return multiple possible topics.
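A sketch of what the returned list looks like (the topic numbers and keywords here are invented):
closest_topics = model.get_closest_topic('covid vaccine corona', n=3)
# e.g. [(12, 'vaccine pfizer dose'), (4, 'covid corona virus'), (87, 'lockdown mask restrictions')]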
load(destination: str) -> None
Load model from local filesystem
Parameters
- destination : str Location of locally saved model
matryoshka_iteration(embeddings: np.array, recursions: int, min_cluster_size: int, highest_topic: int, topics: np.array) -> np.array
Iterate through a training loop of UMAP/HDBSCAN, recursing on outliers each time
Parameters
- embeddings : np.array Vector embeddings to cluster
- recursions : int Number of times to recursively run this function
- min_cluster_size : int HDBSCAN.min_cluster_size to use for this iteration
- highest_topic : int Highest topic number from previous recursion
- topics : np.array Topic list from previous recursion
Returns
- topics : np.array Every topic for the input data - 1 per datapoint in the input
Notes
This is a recursive function for adding more granularity to a model. It’s used to train a new UMAP/HDBSCAN model on the outliers of each previous recursion. e.g. if you run a model and the first recursion has 40% outliers with 100 topics, the next recursion would be run only on the 40% outliers and would start from topic 101. This keeps going recursions times, reducing the number of outliers and increasing the number of topics each time.
The final output will then be a stack of models which will cascade down when transforming new data. So for a given datapoint, the simplified process goes like:
for model in model_stack:
    topic = model.predict(datapoint)
    if topic is not an outlier:
        break
The key to getting a good distribution of topics is the matryoshka_decay parameter.
It reduces the minimum cluster size multiplicatively on each recursion, meaning you
get a smooth transition from large models to smaller models. To illustrate this, imagine
you set a minimum cluster size of 400 for a training dataset of 500k documents - the third
recursion's training set will be much smaller than 500k, so it doesn't necessarily make
sense to keep the min cluster size at 400 (that would lead to a very skewed topic
distribution, and outliers are often dealt with poorly). By multiplying by a matryoshka
decay of 0.8, the second recursion has a min cluster size of 400*0.8 = 320, the third
has 320*0.8 = 256, and so on.
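The schedule is a simple geometric decay, so the per-recursion minimum cluster size can be sketched as:
min_cluster_size = 400
matryoshka_decay = 0.8
for recursion in range(1, 4):
    print(recursion, round(min_cluster_size))  # 400, 320, 256
    min_cluster_size *= matryoshka_decay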
merge_topic(source_topic: int, destination_topic: int) -> None
Merge two similar topics into one. This is useful when you want to perform surgery on your topic model and curate the topics it has found
Parameters
- source_topic : int Topic to add to destination_topic
- destination_topic : int Topic to add source_topic to
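For example, after inspecting the keywords of two overlapping topics (the topic numbers and keywords here are hypothetical):
print(model.topic_descriptions[3])  # e.g. 'vaccine dose pfizer'
print(model.topic_descriptions[7])  # e.g. 'vaccine booster jab'
model.merge_topic(3, 7)             # topic 3 is folded into topic 7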
save(destination: str) -> None
Save model to local filesystem
Parameters
- destination : str Location to dump model to
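Mirroring the load example above:
model = Buzzwords()
topics = model.fit_transform(docs)
model.save('saved_model/')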
transform(docs: List[str]) -> numpy.ndarray
Predict topics with trained model on new data
Parameters
- docs : List[str] New data to predict topics for
Returns
- topics : numpy.ndarray[int] Topics for each document
Notes
Also accepts numpy arrays/pandas Series as input. Make sure to reset_index(drop=True) if you use a Series.
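A short sketch of the Series caveat from the note above:
new_docs = df['text_column'].iloc[100:]  # a pandas Series whose index no longer starts at 0
topics = model.transform(new_docs.reset_index(drop=True))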
transform_iteration(embeddings: np.array, umap_models: List[UMAP], hdbscan_models: List[HDBSCAN], topics: np.array) -> np.array
Iterate through UMAP and HDBSCAN models to reduce outliers
Parameters
- embeddings : np.array Vector embeddings to get clusters for
- umap_models : List[UMAP] Trained UMAP models to iterate through
- hdbscan_models : List[HDBSCAN] Trained HDBSCAN models to iterate through
- topics : np.array List of topics from previous recursion
Returns
- topics : np.array Every topic for the input data - 1 per datapoint in the input
Notes
See matryoshka_iteration() Notes