A community in a graph is a subset of vertices that are densely connected to each other and are in turn less densely connected to the other vertices in the graph.
We analysed the properties of the English Wikipedia graph, that is, the graph with each vertex standing for a single Wikipedia article and an egde standing for a Wikipedia link. We discovered that the Wikipedia graph has a distinguished community structure. Moreover, we observed that Wikipedia articles falling into the same community are more semantically similar to each other than the ones randomly chosen across the corpus, that is, communities in the Wikipedia graph serve as grouping units for semantically similar Wikipedia articles.
Due to the Wikipedia graph being scale-free one, communities that comprise the Wikipedia graph have their sizes distributed closely to the power law, with a few large communities and a long tail of small ones. We discovered that larger communities inherit the properties of the whole graph, i.e., they possess community structure in turn. Dividing a large community into further sub-communities has the effect of obtaining a further division of the group of Wikipedia articles into nested sub-groups of more semantically similar articles.
Dividing each community while it exhibits good sub-community structure, we organize Wikipedia articles into the hierarchy. The hierarchy relies on solely the link structure of Wikipedia and is fully automatic.
Community titles are selected using PageRank, that is, using purely link-based analysis as well.
Properties of the community structure of the Wikipedia graph
researched by
Dmitry Lizorkin,
Olena Medelyan and
Maria Grineva.