Temporal Topic Modelling with Temporal Mapper and Toponymy

In this notebook, we will go through an example of how to use Temporal Mapper and Toponymy together to create a temporal topic model of a corpus of documents. Be warned that this is an experimental, evolving workflow, so what you’re about to see is not pretty.

The dataset we will use is the United Nations General Debate Corpus, which consists of transcripts of the United Nations general debate from 1970 to 2015. I’ve preprocessed the dataset by chunking the speeches, embedding the chunks with a sentence-transformer, and reducing the embeddings to 2D using UMAP. Let’s fetch the dataset from the HuggingFace Hub:

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ungdc_df = pd.read_parquet("https://huggingface.co/datasets/kalebr/un-general-debate-corpus-chunked/resolve/main/ungdc-all-chunked.pq")
ungdc_df.head()
[1]:
  | session | year | country | text | chunk | information_weight | embedding | reduced
0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... | It is indeed a pleasure for me and the member... | 29.816833 | [-0.009967008, 0.028972907, 0.014457686, 0.022... | [9.491389, 7.566777]
0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... | Developments in southern Africa, and more part... | 23.011437 | [0.050711717, 0.09013895, 0.0096756825, -0.016... | [6.820655, 4.6092267]
0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... | Positive strides have been taken towards the s... | 21.294486 | [0.07367871, 0.045660958, 0.020714706, -0.0277... | [1.6624191, 2.9213235]
0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... | The process of reunification of peoples should... | 21.153001 | [0.057730194, 0.083791696, 0.012951973, -0.019... | [-0.5313732, 0.14173953]
1 | 44 | 1989 | FIN | \nMay I begin by congratulating you. Sir, on ... | In the process of preparing both the developme... | 25.820817 | [-0.012931386, -0.017169893, 0.012649347, -0.0... | [3.1583965, 11.189846]

The information weight column is an importance metric similar to TF-IDF that I computed using InformationWeightTransform. Let’s filter the dataset by taking the top 50% of informative chunks - this is just to get a somewhat more manageable size of dataset.

[2]:
q = 50
cutoff = np.percentile(ungdc_df['information_weight'].values, q)
top_weighted = ungdc_df[ungdc_df['information_weight']>=cutoff]

embedding = np.stack(top_weighted['embedding'].values)
reduced = np.stack(top_weighted['reduced'].values)
time = top_weighted['year'].to_numpy()
text = top_weighted['chunk'].to_numpy()
print(len(top_weighted))
18654

Toponymy Parameters

Next, we set up our Toponymy parameters:

[3]:

from toponymy.toponymy import Toponymy, ToponymyClusterer, KeyphraseBuilder, ClusterLayerText
from toponymy.llm_wrappers import AzureAINamer
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

api_key_file = 'cohere.txt'
with open(api_key_file, 'r') as file:
    azure_api_key = file.read().strip()

llm_wrapper = AzureAINamer(
    azure_api_key,
    endpoint="https://azureaitimcuse5821437469.services.ai.azure.com/models",
    model="Cohere-command-r-08-2024",
)
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L3-v2
Key                     | Status     |  |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  |

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.

MapperClusterer class for Toponymy

In the next two cells, I create a MapperClusterer class that inherits from Toponymy’s Clusterer class. This will be the clusterer that we pass to Toponymy to create our topic model.

The MapperClusterer takes another (non-temporal) Toponymy Clusterer object as a parameter. It initializes a TemporalMapper and uses it to compute temporal density and slice the dataset, as usual for Mapper. However, instead of clustering and creating a graph, we run the Toponymy Clusterer on each time slice, generating a tree of clusters for each time step.
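
To make the slicing step concrete, here is a minimal sketch of cutting a time axis into overlapping windows. It is not TemporalMapper’s implementation: the function name `overlapping_slices`, the equal-width windows, and the way `overlap` widens each window are all assumptions made purely for illustration.

```python
def overlapping_slices(times, n_slices, overlap):
    """Return one list of point indices per time window.

    Hypothetical helper: equal-width windows over [min, max], each widened
    on both sides by overlap * width / 2 so that neighbours share points.
    """
    lo, hi = min(times), max(times)
    width = (hi - lo) / n_slices
    pad = overlap * width / 2
    slices = []
    for i in range(n_slices):
        start = lo + i * width - pad
        end = lo + (i + 1) * width + pad
        slices.append([j for j, t in enumerate(times) if start <= t <= end])
    return slices


# One toy "document" per year of the corpus
years = list(range(1970, 2016))
slices = overlapping_slices(years, n_slices=4, overlap=0.5)
for i, idx in enumerate(slices):
    # first year, last year, and number of points in each window
    print(i, years[idx[0]], years[idx[-1]], len(idx))
```

Because neighbouring windows share points, a cluster found in one slice can be linked to a cluster in the next slice through the points they have in common.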

For each layer l of the cluster hierarchy, we duplicate the initial TemporalMapper object, assign the layer-l clusters to it, and complete the TemporalMapper algorithm. This means we have a Mapper graph for each layer l of the Toponymy clusters. We use the TemporalMapper.assign_topics() method to define an equivalence relation on the nodes of the cluster trees across time slices, and glue the per-slice cluster trees into one big tree by identifying equivalent nodes. This big cluster tree is then the output of MapperClusterer.
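
The gluing rule can be sketched on toy data. In the merge_trees function in this notebook, two nodes of a layer end up in the same class when they share a topic and their parents are already in the same class; that function uses a union-find, while the simplified version here just groups by key. The node names, topic ids, and parent classes are made up for illustration.

```python
from collections import defaultdict


def merge_by_topic_and_parent(topics, parent_class):
    """Give nodes the same class id when they share (topic, parent class)."""
    groups = defaultdict(list)
    for node, topic in topics.items():
        groups[(topic, parent_class[node])].append(node)

    class_of = {}
    for class_id, members in enumerate(groups.values()):
        for node in members:
            class_of[node] = class_id
    return class_of


# Hypothetical nodes, named "slice:cluster" as in the notebook
topics = {"0:0": 7, "1:2": 7, "1:3": 7, "2:1": 4}
# Hypothetical classes already assigned to each node's parent in the layer above
parent_class = {"0:0": 0, "1:2": 0, "1:3": 1, "2:1": 0}

classes = merge_by_topic_and_parent(topics, parent_class)
print(classes)
```

Here "0:0" and "1:2" are merged because they agree on both topic and parent class, while "1:3" stays separate even though it has the same topic, since its parent belongs to a different class.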

Right now (as of 2026-03-12) the MapperClusterer class is ad hoc and not robust to the many things that can go wrong. However, my goal is to get the class to a usable state and contribute it to the Toponymy package so that it is readily available.

[4]:
import networkx as nx
from collections import defaultdict, deque

class UnionFind:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def add(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            self.parent[rx] = ry
        else:
            self.parent[ry] = rx
            if self.rank[rx] == self.rank[ry]:
                self.rank[rx] += 1


def convert(node, layer):
    t,c = node.split(":")
    return (layer, int(c))

def merge_trees(topic_trees, graphs):
    """
    graphs[0] = leaves
    graphs[-1] = nodes just below root
    Parent of node in layer l is in layer l+1.

    Returns:
        {
            layer: {
                node: equivalence_class_id
            }
        }
    """

    # ---------------------------------------------
    # Build parent lookup from topic_trees
    # ---------------------------------------------
    parent_lookup = []

    for tree in topic_trees:
        parent = {}
        for p, children in tree.items():
            for c in children:
                parent[c] = p

        # roots get parent None
        for node in tree:
            if node not in parent:
                parent[node] = None

        parent_lookup.append(parent)

    n_layers = len(graphs)
    result = {}

    # ---------------------------------------------
    # Process layers top-down
    # ---------------------------------------------
    for l in reversed(range(n_layers)):
        G = graphs[l]
        uf = UnionFind()

        topics = nx.get_node_attributes(G, "topic")
        slice_no = nx.get_node_attributes(G, "slice_no")

        nodes = list(G.nodes())

        for node in nodes:
            uf.add(node)

        groups = defaultdict(list)

        for node in nodes:
            tree_index = slice_no[node]
            parent_node = parent_lookup[tree_index][convert(node,l)]

            # Determine parent equivalence class
            if parent_node == (n_layers, 0):
                parent_class = None
            else:
                _,pc = parent_node
                parent_class = result[l + 1][f'{tree_index}:{pc}']

            key = (topics[node], parent_class)
            groups[key].append(node)

        # Merge nodes within each structural group
        for group_nodes in groups.values():
            base = group_nodes[0]
            for other in group_nodes[1:]:
                uf.union(base, other)

        # Assign final class IDs for this layer
        rep_to_class = {}
        class_counter = 0
        layer_map = {}

        for node in nodes:
            rep = uf.find(node)
            if rep not in rep_to_class:
                rep_to_class[rep] = class_counter
                class_counter += 1
            layer_map[node] = rep_to_class[rep]
        # add the noise point possibilities
        for t in range(len(topic_trees)):
            layer_map[f'{t}:{-1}'] = -1

        result[l] = layer_map

    return result
[5]:
from toponymy.clustering import Clusterer, build_cluster_tree, centroids_from_labels
from temporalmapper import TemporalMapper
from toponymy._utils import handle_verbose_params
from copy import deepcopy
import networkx as nx
import numpy as np
from scipy.sparse import issparse
from sklearn.utils.validation import check_is_fitted, check_array
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
class MapperClusterer(Clusterer):
    def __init__(
        self,
        base_clusterer: Clusterer,
        mapper_params: dict | None = None,
        verbose: bool = None,
        show_progress_bar: bool = None,
    ):
        self.base_clusterer = base_clusterer
        if mapper_params is None:
            mapper_params = {}
        self.mapper_params = mapper_params

        super().__init__()
        _, self.verbose = handle_verbose_params(
            verbose=verbose, show_progress_bar=show_progress_bar, default_verbose=False
        )

    def fit(
        self,
        clusterable_vectors: np.ndarray,
        embedding_vectors: np.ndarray,
        projection_index: int = -1,
        layer_class = ClusterLayerText,
        verbose: bool = None,
        show_progress_bar: bool = None,
        **layer_kwargs,
    ) -> Clusterer:
        _, verbose_output = handle_verbose_params(
            verbose=verbose if verbose is not None else self.verbose,
            show_progress_bar=show_progress_bar,
            default_verbose=False,
        )
        base_mapper = TemporalMapper(
            clusterer = None,
            **self.mapper_params,
        )
        lens = clusterable_vectors[:, projection_index]
        data = np.delete(
            clusterable_vectors,
            projection_index,
            axis=1
        )
        if issparse(clusterable_vectors):
            base_mapper._mapper.scaler_ = StandardScaler(copy=False, with_mean=False)
        else:
            base_mapper._mapper.scaler_ = StandardScaler(copy=False)
        base_mapper._mapper._compute_midpoints(lens)
        base_mapper._mapper._compute_density(data, lens)
        base_mapper._mapper._compute_weights(data, lens)
        n_layers = 0
        topic_trees = []
        graphs = []
        mappers = []
        slicewise_layers = []
        n_slices = len(base_mapper._mapper.slices_)
        for i, slice_ in enumerate(base_mapper._mapper.slices_):
            cvectors = data[slice_]
            evectors = embedding_vectors[slice_]
            cluster_layers, cluster_tree  = self.base_clusterer.fit_predict(
                clusterable_vectors = cvectors,
                embedding_vectors = evectors,
                layer_class=layer_class,
                verbose=verbose,
                show_progress_bar=show_progress_bar,
                **layer_kwargs,
            )
            if len(cluster_layers)>n_layers:
                n_layers = len(cluster_layers)
            topic_trees.append(cluster_tree)
            slicewise_layers.append(cluster_layers)
        print(f"Layers per slice: {[len(x) for x in slicewise_layers]}")
        for l in range(n_layers):
            sizes = []
            for clayers in slicewise_layers:
                n_clusters = np.unique(clayers[l].cluster_labels).size
                sizes.append(n_clusters)
            print(f"Layer {l} n_cluster: {sizes}")

        layer_clusters = []
        for l in range(n_layers):
            if l>=len(cluster_layers):
                break
            mapper = deepcopy(base_mapper)
            labels = np.full((n_slices, data.shape[0]), -2, dtype=int)
            for i, slice_ in enumerate(base_mapper._mapper.slices_):
                labels[i,slice_] = slicewise_layers[i][l].cluster_labels

            mapper._mapper.labels_ = np.array(labels)
            mapper._mapper._add_vertices()
            mapper._mapper._build_adjacency_matrix(lens)
            mapper._mapper._add_edges()
            mapper._mapper.is_fitted_ = True
            mapper.data = data
            mapper.time = lens
            mapper.n_samples = data.shape[0]
            mapper.n_components = data.shape[1]
            mapper.populate_node_attrs()
            t_attrs = nx.get_node_attributes(mapper.graph, "slice_no")
            mapper.populate_edge_attrs()
            mapper.is_fitted_ = True
            mapper.assign_topics()
            # Run the clustering logic from TemporalMapper.cluster
            # Use the lens (the time column split off above), not the notebook-global `time`
            dist = cdist(
                mapper._mapper.midpoints_.reshape(-1,1),
                lens.reshape(-1,1)
            )
            pt_max_cluster = np.argmin(
                dist,
                axis=0
            )
            topics = nx.get_node_attributes(mapper.graph, 'topic')
            clusters = []
            clrs = []
            for pt,t in enumerate(pt_max_cluster):
                c = mapper.clusters[t,pt]
                clrs.append(c)
                if c != -2:
                    clusters.append(f'{t}:{c}')
                elif c == -2:
                    clusters.append(f'{t}:{-1}')

            layer_clusters.append(clusters)
            mappers.append(mapper)
            graphs.append(mapper.graph)

        topic_map = merge_trees(topic_trees, graphs)
        # now assign each point its merged cluster val
        cluster_label_layers = []
        for l in range(n_layers):
            clusters = np.full(data.shape[0], -1, dtype=int)
            for node in graphs[l].nodes():
                indices = mappers[l].get_vertex_data(node)
                clusters[indices] = topic_map[l][node]
            cluster_label_layers.append(clusters)

        self.cluster_tree_ = build_cluster_tree(cluster_label_layers)
        self.cluster_layers_ = [
            layer_class(
                labels,
                centroids_from_labels(labels, embedding_vectors),
                layer_id=i,
                verbose=verbose,
                show_progress_bar=show_progress_bar,
                **layer_kwargs,
            )
            for i, labels in enumerate(cluster_label_layers)
        ]
        self.topic_map_ = topic_map
        self.mappers_ = mappers
        return self

    def fit_predict(
        self,
        clusterable_vectors: np.ndarray,
        embedding_vectors: np.ndarray,
        layer_class = ClusterLayerText,
        verbose: bool = None,
        show_progress_bar: bool = None,
        **layer_kwargs,
    ):
        self.fit(
            clusterable_vectors,
            embedding_vectors,
            layer_class=layer_class,
            verbose=verbose,
            show_progress_bar=show_progress_bar,
            **layer_kwargs,
        )
        return self.cluster_layers_, self.cluster_tree_

Now we can run our MapperClusterer to build our cluster hierarchy:

[6]:

reduced_vectors_with_time = np.hstack([reduced, time.reshape(-1,1)])

clusterer_params = {
    'min_clusters':2,
    'max_layers':2,
    'base_min_cluster_size':25,
    'verbose':False
}
base_clusterer = ToponymyClusterer(**clusterer_params)

toponymy_params = {
    'llm_wrapper':llm_wrapper,
    'text_embedding_model':embedding_model,
    'object_description':"excerpts from a speech",
    'corpus_description':"United Nations General Debate Transcripts",
    'exemplar_delimiters':["<EXAMPLE_TRANSCRIPT>\n","\n</EXAMPLE_TRANSCRIPT>\n\n"],
}

clusterer = MapperClusterer(
    base_clusterer,
    mapper_params = dict(
        n_slices = 8,
        n_neighbors = 1000,
        slice_method='data',
        overlap=0.8
    )
)
toponymy_params['clusterer'] = clusterer
clusterer.fit(reduced_vectors_with_time, embedding)
/work/home/kdrusci/winter2026/dbmapper/tm-dev/temporal-mapper/src/temporalmapper/temporal_mapper.py:128: UserWarning: You have not passed a clusterer, this TemporalMapper cannot be fit.
  warn("You have not passed a clusterer, this TemporalMapper cannot be fit.")
Layers per slice: [2, 2, 2, 2, 2, 2, 2, 2]
Layer 0 n_cluster: [53, 50, 42, 69, 58, 42, 59, 50]
Layer 1 n_cluster: [16, 16, 12, 21, 21, 16, 20, 17]
[6]:
<__main__.MapperClusterer at 0x7dfe9f0e2750>
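
Note that the cell above appends the time values as the last column of the clusterable vectors, and MapperClusterer.fit reads them back with its default projection_index of -1. The toy arrays below are made up, but the NumPy calls are the same ones used in the notebook, so this is a small sketch of that round trip.

```python
import numpy as np

# Made-up 2D coordinates and years standing in for `reduced` and `time`
reduced_toy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
years_toy = np.array([1970, 1985, 2015])

# Stack time on as the last column, as in the cell above
stacked = np.hstack([reduced_toy, years_toy.reshape(-1, 1)])

# Split it off again, as MapperClusterer.fit does
projection_index = -1
lens = stacked[:, projection_index]
data = np.delete(stacked, projection_index, axis=1)

print(stacked.shape, lens, data.shape)
```

One consequence of stacking is that the integer years are cast to floats, which is why the plotting code later converts the slice midpoints back with int().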

We adjust the parameters until we get:

  • The same number of layers per slice

  • Roughly comparable numbers of clusters per slice for each layer

Once both hold, we’re ready to fit the Toponymy model, using an LLM to generate topic summaries for each cluster.
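
As a rough way to check the two criteria above, one could summarize the counts printed by MapperClusterer.fit. The helper below is hypothetical, and using the max/min ratio as a yardstick for “roughly comparable” is my assumption rather than anything prescribed by Toponymy or Temporal Mapper. The numbers are copied from the output above.

```python
def summarize_slices(layers_per_slice, clusters_per_layer):
    """Return (all slices have equal depth?, max/min cluster-count ratio per layer)."""
    same_depth = len(set(layers_per_slice)) == 1
    spread = {
        layer: max(counts) / min(counts)
        for layer, counts in clusters_per_layer.items()
    }
    return same_depth, spread


# Values printed by the fit call in this notebook
layers_per_slice = [2, 2, 2, 2, 2, 2, 2, 2]
clusters_per_layer = {
    0: [53, 50, 42, 69, 58, 42, 59, 50],
    1: [16, 16, 12, 21, 21, 16, 20, 17],
}

same_depth, spread = summarize_slices(layers_per_slice, clusters_per_layer)
print(same_depth, spread)
```

A ratio close to 1 means the slices are similarly fine-grained; what counts as close enough is a judgment call for the dataset at hand.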

[7]:
toponymy = Toponymy(**toponymy_params)
toponymy.fit(
    text,
    embedding,
    reduced_vectors_with_time,
)
[7]:
<toponymy.toponymy.Toponymy at 0x7dfe7fd3a290>

Finally, all that’s left to do is make plots! Remember, MapperClusterer actually builds a Mapper for each layer of the cluster hierarchy. We’ll do a static plot of the highest layer, since it has fewer topics, and an interactive plot of the lower layer.

[8]:
from temporalmapper.plotting import squarify_text


## Function to extract the cluster names from Toponymy
def make_labels(toponymy, layer, filter_dupes=False):
    topic_map = toponymy.clusterer.topic_map_[layer]
    cluster_labels = {}
    for node in toponymy.clusterer.mappers_[layer].graph.nodes():
        label =  toponymy.topic_names_[layer][topic_map[node]]
        if filter_dupes:
            G = toponymy.clusterer.mappers_[layer].graph
            if G.in_degree(node)==1 and G.out_degree(node)!=0:
                prev = [e for e in G.in_edges(node)][0][0]
                prev_label = toponymy.topic_names_[layer][topic_map[prev]]
                if prev_label == label:
                    label = ''
        cluster_labels[node]=squarify_text(label)
    return cluster_labels

## Plotting
fig, ax = plt.subplots(figsize=(16,9))

layer = 1
clusterer.mappers_[layer].temporal_plot(
    ax=ax,
    cluster_labels=make_labels(toponymy,layer),
    layout='ordered',
    cluster_label_kwargs=dict(fontsize=8),
    node_scaling=5,
    node_size_bounds=(25,100),
    node_size_scale='linear',
    edge_scaling=2,
    edge_weight_bounds=(5,25),
)

label_times = clusterer.mappers_[layer].midpoints
label_text = [int(label) for label in label_times]
ax.set_xticks(label_times, labels=label_text)
ax.tick_params(axis='x', labelrotation=90)
ax.tick_params(bottom=True, labelbottom=True)
ax.set_title("Temporal Mapper plot of United Nations General Debate transcripts, 1970-2015")
plt.show()
[Figure: Temporal Mapper plot of United Nations General Debate transcripts, 1970-2015]
[9]:
import plotly.io as pio
from plotly.graph_objects import Layout
pio.renderers.default = 'sphinx_gallery'

layer = 0
fig = clusterer.mappers_[layer].interactive_temporal_plot(
    cluster_labels=make_labels(toponymy,layer),
    layout='ordered',
    layout_kwargs=dict(spacing=10),
    graph_layout=Layout(
        title=dict(text="Temporal Mapper plot of United Nations General Debate transcripts, 1970-2015"),
        width=1200,
        height=700,
        showlegend=False,
    ),
    node_scaling=5,
    node_size_bounds=(25,100),
    node_size_scale='linear',
    edge_scaling=1,
    edge_weight_bounds=(1,5),
)
fig.show()