AI Topic Modeling

Introduction

Topic modeling is a machine learning technique that takes a collection of documents and groups them by theme. It lets a data scientist cluster a set of unlabeled documents, such as news articles, into coherent categories, such as politics, sports, and international affairs. The problem with traditional topic modeling approaches is that non-technical stakeholders struggle to interpret the results unless a data scientist spends time working out and presenting what each topic represents. Incorporating generative AI into the topic modeling workflow can automate much of that interpretive work and make the model's insights far easier for stakeholders to digest.

Traditional Topic Modeling

Why might results be difficult to interpret? A topic model doesn’t tell you what each topic truly represents. It assigns each document a numeric topic identifier without explaining what that topic conveys or why the documents were grouped together. You can see the most representative words for each topic, but that still doesn’t provide much clarity to a non-technical stakeholder and still requires interpretive work. This work often involves manually reviewing the most relevant vocabulary terms and the most representative documents per topic. Traditionally, topic models are presented using a PCA visualization, as shown below:

On the left side of this visualization, each topic is represented as a circle. The size of the circle corresponds to how many documents fall into that topic, and the spatial overlap between circles shows how much the topics share themes and vocabulary. The right panel shows the most relevant vocabulary terms for each topic. This type of visualization might be useful for a data scientist trying to interpret the topic model, but it isn’t terribly useful for a non-technical stakeholder.

Interpreting Topic Models with AI

With generative AI, we can automate the interpretive work of determining what each topic represents and present those insights to stakeholders of any background. At Delphi Intelligence, we often use the BERTopic topic modeling algorithm. With BERTopic, we get back a set of three representative documents per topic, as well as the probability that each document belongs to its assigned topic. We feed those representative documents, along with AI-generated summaries of the other documents that fall into that topic, to a large-context-window LLM such as Gemini 2.5 Pro, and extract topic titles, descriptions, and other insights. Alternatively, depending on how large the corpus is, we can provide the full documents instead of summaries. We can also filter the document samples given to Gemini so that it only sees documents whose probability of belonging to the topic is above a certain threshold.
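The filtering and prompt-assembly step described above can be sketched roughly as follows. This is a minimal illustration and not our production code: `AssignedDoc`, `build_topic_prompt`, the `0.7` threshold, the sample cap, and the prompt wording are all hypothetical. It assumes the topic assignments, probabilities, and representative documents have already been obtained from the topic model (with BERTopic, these would typically come from `fit_transform` and `get_representative_docs`), and it leaves out the call to the LLM itself.

```python
from dataclasses import dataclass


@dataclass
class AssignedDoc:
    """Hypothetical container for one document and its topic assignment."""
    text: str           # full document, or an AI-generated summary of it
    topic: int          # topic id assigned by the topic model
    probability: float  # model's probability for that assignment


def build_topic_prompt(topic_id, representative_docs, assigned_docs,
                       min_probability=0.7, max_samples=50):
    """Assemble an LLM prompt for one topic from its most confident documents."""
    # Keep only documents in this topic whose probability clears the threshold.
    confident = [d for d in assigned_docs
                 if d.topic == topic_id and d.probability >= min_probability]
    # Most confident first, capped so the prompt stays a manageable size.
    confident.sort(key=lambda d: d.probability, reverse=True)
    samples = confident[:max_samples]

    lines = [
        "You are labeling one topic from a topic model.",
        "Return a short title, a one-paragraph description, and key bullet points.",
        "",
        "Representative documents:",
    ]
    lines += [f"- {doc}" for doc in representative_docs]
    lines += ["", "Additional documents (or summaries) in this topic:"]
    lines += [f"- {d.text}" for d in samples]
    return "\n".join(lines)


# Example usage with made-up chat snippets.
docs = [
    AssignedDoc("How do I balance a redox equation?", 0, 0.92),
    AssignedDoc("What is an oxidation state?", 0, 0.55),   # below threshold
    AssignedDoc("Explain Newton's second law.", 1, 0.88),  # different topic
]
prompt = build_topic_prompt(0, ["Balancing chemical equations step by step"], docs)
print(prompt)
```

The resulting string would then be sent to the LLM once per topic. Raising `min_probability` trades coverage for cleaner examples, which may matter when topics overlap heavily.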

Case Study

At Delphi Intelligence, we’ve created topic modeling dashboards for our clients, like Conceptual Academy. For Conceptual Academy, the topic model is trained on the anonymized set of chat transcripts from Ask Alia, an AI tutoring service we’ve developed for their textbooks. We’re able to ingest all of the chat transcripts for a specific classroom over a specific date range. The resulting model shows all of the topics the students are talking to Alia about during the semester, which helps teachers identify which areas their students might be struggling with and where to focus their attention.

In the dashboard shown below, we can set a specific timeframe and see a bar chart of the chat volume over that period for our demo classroom.

We can also see each of the topic titles and their proportion of the total chat volume in the second bar chart below:

If we scroll further down in the dashboard, we can also see an overview of each topic. In the image below, we can see the topic title, the percentage of conversations it represents, a summary of what the topic means, bullet points for the topic, common questions the students asked, as well as a word cloud visualizing the chat conversations for this topic. This dashboard enables the teachers to ascertain which subject areas their students might be struggling with by looking at what topics their students are talking to Alia about.
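As a rough illustration of the per-topic percentages mentioned above, each topic's share of conversations can be computed from the list of topic assignments. The function name `topic_shares` and the sample data are hypothetical, and this is only a sketch of the arithmetic, not the dashboard's actual code.

```python
from collections import Counter


def topic_shares(topic_ids):
    """Return each topic's percentage of all conversations, largest first."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    if total == 0:
        return {}  # no conversations in the selected date range
    return {topic: round(100 * n / total, 1)
            for topic, n in counts.most_common()}


# Six conversations assigned to three topics.
shares = topic_shares([0, 0, 1, 2, 0, 1])
print(shares)  # {0: 50.0, 1: 33.3, 2: 16.7}
```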

Conclusion

By integrating generative AI into our topic modeling workflows, we can automate the interpretive work for topic models and present digestible insights to non-technical stakeholders.

