Using R to Detect Communities of Correlated Topics

Creating a topic network

For Project Mosaic, I’m researching UNCC publications in social science and computing & informatics by analyzing the abstract text and the co-authorship social network.

For text mining, I’m running topic modeling (Latent Dirichlet Allocation or LDA for short) on five years of peer-reviewed publication abstracts to identify key research themes by university researchers. (If you’re not familiar with LDA, a good start is David Blei’s “Probabilistic Topic Models”.)

One problem I came across was how to measure the relationships (correlations) between topics. In particular, I want to create a network visualization that connects similar topics and helps users more easily navigate a large collection of topics (in this case, 100 topics).

In this tutorial, I accomplish this by combining code from two awesome resources.

Data preparation

Our first step is to load the topic matrices produced by LDA. LDA has two outputs: a word-topic matrix and a document-topic matrix. For this tutorial, I load previously saved LDA results that were stored as flat (csv) files.

As an alternative to loading flat files, you can use the output of the topicmodels package’s LDA function to create the word-topic and document-topic matrices: take the fitted model returned by LDA and run the posterior function on it.
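For example, here’s a minimal sketch of that route (assuming a document-term matrix dtm that you’ve already built; the object names are illustrative):

library(topicmodels)

# fit a 100-topic model; dtm is an illustrative DocumentTermMatrix
lda.fit <- LDA(dtm, k = 100, control = list(seed = 110))

# posterior() returns both LDA outputs:
# $terms is the topic-by-word matrix, $topics is the document-by-topic matrix
post <- posterior(lda.fit)
word.topic.mat <- t(post$terms) # transpose so rows are words, columns are topics
doc.topic.mat <- post$topics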

# load author-topic matrix; first column is the author name
author.topic <- read.csv("./author_topics.csv", stringsAsFactors = FALSE)

# load word-topic matrix; first column is the word
word.topic <- read.csv("./term_topics.csv", stringsAsFactors = FALSE)
num.col <- ncol(word.topic)

# create topic names from each topic's five most probable words
topic.names <- character(num.col - 1)
for (i in 2:num.col){
  top.words <- word.topic$word[order(word.topic[, i], decreasing = TRUE)]
  topic.names[i - 1] <- paste(top.words[1:5], collapse = " + ")
}

# rename topic columns
colnames(word.topic) <- c("word", topic.names)
colnames(author.topic) <- c("author_name", topic.names)

Unlike standard LDA, in which each abstract is a document, I ran an “author-centered” LDA in which each author’s abstracts were concatenated and treated as a single document. I did this because my ultimate goal is to use topic modeling as an information-retrieval process to determine researcher expertise by topic.
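As a rough sketch, building one document per author could look like this (pubs, author_name, and abstract are illustrative names for a table with one row per author-abstract pair):

library(dplyr)

# pubs: an illustrative data frame with one row per author-abstract pair
author.docs <- pubs %>%
  group_by(author_name) %>%
  summarise(text = paste(abstract, collapse = " "))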

Create static networks

In the next step, I create a network using the correlation between each topic’s word probabilities.

First, I keep only relationships (edges) with a significant correlation (0.2 or higher). I use 0.2 because it is approximately the critical value of Pearson’s r at the .05 significance level for a sample of 100 observations (see the table of critical values on Wikipedia).
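As a quick sanity check, you can compute that critical value directly in R from the t distribution:

# two-sided .05 critical value of Pearson's r for n = 100
n <- 100
t.crit <- qt(0.975, df = n - 2)
r.crit <- t.crit / sqrt(n - 2 + t.crit^2)
r.crit # about 0.197, which motivates the 0.2 threshold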

cor_threshold <- .2

# correlation between each pair of topics' word-probability columns
cor_mat <- cor(word.topic[, 2:num.col])

# zero out weak correlations and self-correlations
cor_mat[cor_mat < cor_threshold] <- 0
diag(cor_mat) <- 0

Next, we use the correlation matrix to create an igraph data structure. The cells zeroed out in the previous step (correlations below the 0.2 threshold) simply produce no edges.

library(igraph)

# mode = "lower" reads only the lower triangle, so each topic pair yields a single undirected edge
graph <- graph.adjacency(cor_mat, weighted = TRUE, mode = "lower")

E(graph)$edge.width <- E(graph)$weight
V(graph)$label <- as.character(1:vcount(graph)) # number each topic

Let’s plot a simple igraph network.

par(mar=c(0, 0, 3, 0))
set.seed(110)
plot.igraph(graph, edge.width = E(graph)$edge.width, 
            edge.color = "blue", vertex.color = "white", vertex.size = 1,
            vertex.frame.color = NA, vertex.label.color = "grey30")
title("Strength Between Topics Based On Word Probabilities", cex.main=.8)

Each node is a topic, labeled with its topic number.

My first observation is that there appear to be three main clusters.

Let’s use community detection, specifically the label propagation algorithm in igraph, to determine clusters within the network.

clp <- cluster_label_prop(graph)
class(clp)

plot(clp, graph, edge.width = E(graph)$edge.width, vertex.size = 2, vertex.label = "")
title("Community Detection in Topic Network", cex.main=.8)

Community detection found thirteen communities, plus a singleton community for each isolated topic (i.e., topics that do not have any connections).

Consistent with my initial observation, the algorithm found the three main clusters we spotted in the first plot, but it also identified smaller clusters that don’t fit neatly into any of the three.

Let’s save our community memberships and also calculate degree and betweenness centrality, which we’ll use in the next section.

# save community membership and centrality measures as node attributes
V(graph)$community <- clp$membership
V(graph)$betweenness <- betweenness(graph, v = V(graph), directed = FALSE)
V(graph)$degree <- degree(graph, v = V(graph))
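As a quick check on the numbers above, you can inspect the community count and sizes, and the most central topics:

# number of communities and the size of each
length(clp)
sizes(clp)

# topics with the highest betweenness
head(sort(V(graph)$betweenness, decreasing = TRUE))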

Dynamic Visualizations

In this section, we’ll use the visNetwork package, which enables interactive network graphs in R.

First, let’s load the library and run visIgraph, which draws an interactive network directly from an igraph object (graph) using igraph’s settings.

library(visNetwork)

visIgraph(graph)

This is a good start, but we need more details about the network.

Let’s go a different route and build the visNetwork data structures directly. We convert our igraph object into a visNetwork-ready list, then split that list into two data frames: nodes and edges.

data <- toVisNetworkData(graph)
nodes <- data$nodes
edges <- data$edges

Delete nodes (topics) that don’t have a connection (degree = 0).

nodes <- nodes[nodes$degree != 0,]

Let’s add colors and other network parameters to improve our network.

library(RColorBrewer)

# Set3 tops out at 12 colors; interpolate so every community gets its own color
n.comm <- length(unique(nodes$community))
pal <- colorRampPalette(brewer.pal(12, "Set3"))(n.comm)
col <- pal[as.factor(nodes$community)]

nodes$shape <- "dot"
nodes$shadow <- TRUE # Nodes will drop shadow
nodes$title <- nodes$id # Text on click
nodes$size <- ((nodes$betweenness / max(nodes$betweenness)) + .2) * 20 # Node size scaled by betweenness
nodes$borderWidth <- 2 # Node border width
nodes$color.background <- col
nodes$color.border <- "black"
nodes$color.highlight.background <- "orange"
nodes$color.highlight.border <- "darkred"
edges$title <- round(edges$edge.width, 3) # Correlation shown on hover

Finally, let’s render the interactive plot. You can zoom with your mouse scroll wheel.

visNetwork(nodes, edges) %>% 
    visOptions(highlightNearest = TRUE, selectedBy = "community", nodesIdSelection = TRUE)
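If you want to share the interactive network outside of R, one option is to save it as a standalone HTML file (a sketch using the htmlwidgets package; the file name is illustrative):

library(htmlwidgets)

net <- visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, selectedBy = "community", nodesIdSelection = TRUE)

# write a self-contained HTML file that opens in any browser
saveWidget(net, "topic_network.html", selfcontained = TRUE)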