As part of the Network Grand Challenge project, we applied parallel text analysis, community detection, and trend tracking algorithms to the 17 million abstracts in the PubMed repository. Our main goal was to test our tools for scalability on a body of real data: would we get meaningful results at all at that scale?
The answer turned out to be yes, which raised an even more interesting challenge: how do you communicate results about a data set that large? Trying to display 17 million pieces of anything is futile, so the answer is clearly to work with higher-order results. But which ones? We don't have conclusive answers yet. Until we do, we offer the following sketches as a starting point: ideas for further work rather than usable analytic tools.
We implemented parallel Latent Dirichlet Allocation in the Titan toolkit using ideas from Newman et al. (2007) and ran the resulting code on Red Sky with the number of topics set to 50. We took the top 100 words from each topic to a collaborator with a medical background and asked her to assign a subject heading to each one. We were surprised at how coherent the topics appeared even to those of us who were not domain experts.
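One idea from Newman et al. (2007) is an approximate distributed scheme: each worker runs Gibbs sampling on its own share of the documents against a private copy of the topic-word counts, and the copies are then reconciled by adding every worker's changes back into the global counts. The sketch below shows only that reconciliation step. The function name and the list-of-lists layout are illustrative assumptions, and the text above does not say which of the paper's ideas the Titan implementation actually uses.

```python
def merge_topic_word_counts(global_counts, local_counts_per_worker):
    """Reconcile per-worker topic-word counts after one sampling pass.

    global_counts[k][w] is the count of word w assigned to topic k before
    the pass. Each entry of local_counts_per_worker is one worker's copy
    after it resampled only its own documents. (Hypothetical data layout.)
    """
    merged = [row[:] for row in global_counts]
    for local in local_counts_per_worker:
        for k, row in enumerate(local):
            for w, value in enumerate(row):
                # Add only what this worker changed relative to the old global state.
                merged[k][w] += value - global_counts[k][w]
    return merged


# Two topics, two words, two workers.
before = [[2, 1], [0, 3]]
worker_a = [[2, 2], [0, 2]]  # moved one token of word 1 from topic 1 to topic 0
worker_b = [[1, 1], [1, 3]]  # moved one token of word 0 from topic 0 to topic 1
after = merge_topic_word_counts(before, [worker_a, worker_b])
```

Because every worker only moves its own tokens between topics, the total token count is unchanged by the merge.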
We constructed a co-authorship graph among the 9.3 million authors listed in the 17 million articles. We ran this graph through weighted Clauset-Newman-Moore community detection (Berry et al. 2009) to identify communities.
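As a rough illustration of these two steps, the sketch below builds a weighted co-authorship graph from author lists and then greedily merges communities by modularity gain, which is the core move in Clauset-Newman-Moore. It makes several assumptions: edge weights are simply the number of co-authored papers (the edge-weighting scheme of Berry et al. is not reproduced), the search for the best merge is a plain scan rather than the heap-based structure that makes the real algorithm scale, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coauthorship_edges(papers):
    """Count co-authored papers for every pair of authors."""
    edges = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def greedy_modularity_communities(edges):
    """Merge the pair of linked communities with the largest modularity gain
    until no merge improves modularity. Naive scan, small graphs only."""
    two_w = 2.0 * sum(edges.values())
    between = defaultdict(lambda: defaultdict(float))  # weight between communities
    degree = defaultdict(float)                        # weighted degree per community
    for (a, b), w in edges.items():
        between[a][b] += w
        between[b][a] += w
        degree[a] += w
        degree[b] += w
    members = {node: {node} for node in degree}

    while True:
        best_gain, best_pair = 0.0, None
        for c in between:
            for d, w in between[c].items():
                if c < d:
                    gain = 2.0 * (w / two_w - degree[c] * degree[d] / two_w ** 2)
                    if gain > best_gain:
                        best_gain, best_pair = gain, (c, d)
        if best_pair is None:
            break
        c, d = best_pair
        # Fold community d into community c.
        members[c] |= members.pop(d)
        degree[c] += degree.pop(d)
        for nbr, w in between.pop(d).items():
            del between[nbr][d]
            if nbr != c:
                between[c][nbr] += w
                between[nbr][c] += w
    return list(members.values())


papers = [["a", "b", "c"], ["a", "b"], ["d", "e", "f"], ["e", "f"], ["c", "d"]]
communities = greedy_modularity_communities(coauthorship_edges(papers))
```

In this toy input the two tightly knit groups are joined by a single paper between `c` and `d`, and the merge loop stops before joining them.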
This chart shows author communities with 10-150 members grouped by the most prevalent subject area in their publications. The exact coordinates are meaningless, as is the relative placement of any two author communities with respect to one another.
- Each community gets a single circle within its cluster.
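The grouping used in the chart of author communities can be sketched as follows: keep communities with 10 to 150 members and label each with the subject area that occurs most often among its members' publications. The `author_topics` mapping (author to the subject heading of each of that author's papers) and the function name are assumptions made for illustration; the text does not describe how the real data was stored.

```python
from collections import Counter


def label_communities(communities, author_topics, min_size=10, max_size=150):
    """Return (most prevalent subject, members) for each mid-sized community."""
    labeled = []
    for members in communities:
        if not (min_size <= len(members) <= max_size):
            continue  # outside the size range shown in the chart
        tally = Counter()
        for author in members:
            tally.update(author_topics.get(author, []))
        if tally:
            labeled.append((tally.most_common(1)[0][0], members))
    return labeled


big = ["a%d" % i for i in range(12)]
small = ["b0", "b1", "b2"]
topics = {name: ["cardiology"] for name in big}
topics["a0"] = ["cardiology", "oncology"]
topics.update({name: ["oncology"] for name in small})
labeled = label_communities([big, small], topics)
```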
About half of the articles listed in PubMed include lists of affiliations for their authors. We extracted city names from as many of those as we could and used the results to identify papers that were collaborations among authors from different institutions, cities and countries. We drew those links on a map.
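A minimal sketch of the affiliation step, assuming a lookup table of known city names and comma-separated affiliation strings. The tiny `KNOWN_CITIES` table, the sample affiliations, and both function names are hypothetical; real affiliation strings are far messier, which is why only some of them could be parsed.

```python
from itertools import combinations

# Hypothetical gazetteer: lowercase lookup key -> canonical city name.
KNOWN_CITIES = {
    "albuquerque": "Albuquerque",
    "boston": "Boston",
    "london": "London",
}


def extract_city(affiliation):
    """Return the first known city among the comma-separated fields, else None."""
    for field in affiliation.split(","):
        key = field.strip().lower().rstrip(".")
        if key in KNOWN_CITIES:
            return KNOWN_CITIES[key]
    return None


def collaboration_links(affiliations):
    """Return every pair of distinct cities found on one paper."""
    cities = {extract_city(text) for text in affiliations}
    cities.discard(None)  # affiliations we could not parse
    return list(combinations(sorted(cities), 2))


paper = [
    "Sandia National Laboratories, Albuquerque, NM, USA",
    "Dept. of Medicine, Harvard Medical School, Boston, MA.",
    "Unknown Institute",
]
links = collaboration_links(paper)
```

A paper whose parsed affiliations all fall in one city yields no links, so it never reaches the map.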
Each arc on the map represents a single collaboration link: that is, two authors in different places who worked on at least one paper together. The brightness of the arc indicates the number of times that link occurred. We filtered out links that occur fewer than 5 times to reduce clutter. The arcs themselves are segments of great circles.
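The filtering and arc drawing described above can be sketched like this. The threshold of 5 comes from the text; the logarithmic brightness scale, the interpolation by spherical linear blending, and all names are assumptions, since the text does not say how brightness was scaled or how the arcs were computed.

```python
import math
from collections import Counter

MIN_OCCURRENCES = 5  # links seen fewer times than this are dropped


def frequent_links(links, minimum=MIN_OCCURRENCES):
    """Count undirected links and keep those seen at least `minimum` times."""
    counts = Counter(tuple(sorted(link)) for link in links)
    return {link: n for link, n in counts.items() if n >= minimum}


def brightness(count, max_count):
    """Map a link count to (0, 1]. Log scaling is an assumption."""
    return math.log1p(count) / math.log1p(max_count)


def _unit_vector(lat_deg, lon_deg):
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def great_circle_arc(start, end, steps=32):
    """Return steps + 1 (lat, lon) points in degrees along the shorter
    great-circle segment from start to end."""
    a, b = _unit_vector(*start), _unit_vector(*end)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    omega = math.acos(dot)  # angle between the two points
    if omega < 1e-12:
        return [start] * (steps + 1)
    if math.pi - omega < 1e-12:
        raise ValueError("antipodal points do not define a unique great circle")
    points = []
    for i in range(steps + 1):
        t = i / steps
        wa = math.sin((1 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
        x, y, z = (wa * p + wb * q for p, q in zip(a, b))
        lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
        points.append((lat, math.degrees(math.atan2(y, x))))
    return points


raw = [("Boston", "London")] * 5 + [("London", "Boston")] + [("Boston", "Paris")] * 4
kept = frequent_links(raw)
arc = great_circle_arc((0.0, 0.0), (0.0, 90.0), steps=2)
```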
- Worldwide collaboration. Challenge: Identify the cluster of authors just off the coast of Nigeria.
- The US subset of the collaboration map. Note that the presence or absence of any given point of activity depends on two factors: first, whether journal articles published by its researchers are indexed within PubMed; second, whether the indexed articles include author affiliations that we were able to parse.
[Newman et al. 2007] David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling. "Distributed Inference for Latent Dirichlet Allocation". In Proceedings of NIPS, pp. 1081-1088.
[Berry et al. 2009] "Tolerating the community detection resolution limit with edge weighting". arXiv:0903.1072v2 [physics.soc-ph], March 2009.
[Wilson et al. 2010] Andrew Wilson, Michael Trahan, Jason Shepherd, Thomas Otahal, Steven Kempka, Mark Foehse, Nathan Fabian, Warren Davis, Gloria Casale. "Text Analysis Tools and Techniques of PubMed Data using the Titan Scalable Informatics Toolkit". Sandia Tech Report SAND2011-3335.