James' infovis thinking blog: Digging into Data

The Digging into Data Challenge is a grant competition sponsored by JISC, NEH, NSF, and SSHRC that asks "how does the notion of scale affect humanities and social science research?", or, put differently, what huge repositories of digitized data mean for research.

I read two of the related readings: "What Do You Do with a Million Books?" by Gregory Crane and "The Unreasonable Effectiveness of Data" by Alon Halevy, Peter Norvig, and Fernando Pereira.

The first paper discusses the changes that digital libraries may face as more and more content is digitized. It focuses on scalability issues and on how scale introduces new research topics for domain specialists interacting with huge collections. The author argues that "document analysis, multilingual technology, and information extraction can be modeled in general terms, but these technologies acquire meaning when they are aligned with the needs of particular domains." The paper approaches scalability from a classic digital library perspective, looking at several large-scale digitization initiatives, such as Google Library, Making of America, and the Open Content Alliance, and at their impact on digital libraries. Integrating huge digitized collections with specific domains could surface rich and meaningful content for research, which traditional printed content can hardly achieve.

The second paper, to my mind, extends the audience for large digitized collections from researchers to a much broader range of users. It views the whole WWW as the largest digitized collection and a rich semantic source for unsupervised learning. The authors argue that instead of trying to create neat theories for natural language processing, researchers should turn to their best ally, the unreasonable effectiveness of data, because "for many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without generative rules." Hence semantic interpretation, the problem of understanding human speech and writing, could be done by "learning as much as possible about the context of content to correctly disambiguate it." This study may reveal the bright side of the data deluge: rapidly growing data can serve as a valuable source for tasks that require large training sets that were previously hard to obtain, such as parallel multilingual corpora.
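As a toy illustration of the "data over rules" idea (a sketch under stated assumptions, not the paper's method), the snippet below picks between two confusable words purely by counting which one followed the previous word more often in a corpus. The corpus sentences and the helper names `bigram_counts` and `pick_word` are invented for this example; the paper's argument is about web-scale counts, not four sentences.

```python
from collections import Counter

# Tiny made-up corpus; a stand-in for the web-scale text the paper has in mind.
CORPUS = [
    "we walked along the river bank",
    "he fished from the river bank",
    "the bank approved the loan",
    "she slept in the top bunk",
]


def bigram_counts(sentences):
    """Count every pair of adjacent words across the sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_word(previous_word, candidates, counts):
    """Return the candidate seen most often right after previous_word.

    No grammar or dictionary is consulted: the choice rests only on counts.
    Ties (including all-zero counts) go to the first candidate listed.
    """
    return max(candidates, key=lambda w: counts[(previous_word, w)])


counts = bigram_counts(CORPUS)
print(pick_word("river", ["bunk", "bank"], counts))
print(pick_word("top", ["bunk", "bank"], counts))
```

With so little data the counts are almost all zero, which is exactly the gap that a billion examples would fill.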

When facing such huge digitized collections, I wonder whether researchers who want to take advantage of them must become computing experts who can handle such large volumes of data. Also, what is the role of visualization in the whole process of exploiting the effectiveness of data?

Saturday, April 25, 2009