Abstract
Despite their relatively low sampling factor, the freely available, randomly sampled
status streams of Twitter are very useful sources of geographically embedded social
network data. To statistically analyze the information Twitter provides via these
streams, we have collected a year's worth of data and built a multi-terabyte relational
database from it. The database is designed for fast data loading and to support a
wide range of studies focusing on the statistics and geographic features of social
networks, as well as on the linguistic analysis of tweets. In this paper we present
the method of data collection, the database design, the data loading procedure and
the special treatment of geo-tagged and multi-lingual data. We also provide some SQL recipes
for computing network statistics.