Sunday, April 26, 2009

Digging into Data (Cont)

On the Digging into Data homepage, there are several interesting readings. Here I would like to share my thoughts on the one from Wired Magazine: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, by Chris Anderson, editor-in-chief of Wired. The article drew 76 comments, which offer a variety of views on the idea. This article, along with The Unreasonable Effectiveness of Data, promotes Google's approach.

Chris is very optimistic about the "new" approach that leverages the power of huge amounts of data since, as The Unreasonable Effectiveness of Data says, for many tasks a billion or so examples might represent the whole picture of the world.

Reading the comments, I realized that maybe it is time to think about which tasks can leverage the power of data and let the data speak for itself. One comment tries to challenge the approach with "nuclear fusion"; another argues that some disciplines, like physics, still stand upon neat theories and models, while other disciplines, like biology and economics, where models come and go, could benefit more from the data deluge.

Going back to my own question: what are those tasks? It seems to me that data generated by human beings themselves, such as language and behaviour, might be able to speak for itself, whereas data generated from nature (earth, sky, universe; of course it is measured and sensed by human beings, but it is not about ourselves) might need real experiments and observations to speak for it. As one comment said, the models discovered by J. Craig Venter are not discoveries until they are confirmed with things found in the real world. Therefore models built from our own data could be validated more easily than models built from natural data, as in high-energy physics, where very little data will come out until new accelerators are built.

Saturday, April 25, 2009

Digging into Data

The Digging into Data Challenge is a grant competition sponsored by JISC, NEH, NSF, and SSHRC to answer "how does the notion of scale affect humanities and social science research?", or what huge repositories of digitized data mean for research.

I read two of the related readings: What Do You Do with a Million Books? by Gregory Crane and The Unreasonable Effectiveness of Data by Alon Halevy, Peter Norvig, and Fernando Pereira.

The first paper discusses the changes that digital libraries may face as more and more content is digitized. The discussion focuses on scalability issues and on how scale introduces new research topics for domain specialists interacting with huge collections. The author argues that "document analysis, multilingual technology, and information extraction can be modeled in general terms, but these technologies acquire meaning when they are aligned with the needs of particular domains." The paper approaches scalability from a classic digital library perspective, looking at several large-scale digitization initiatives, such as Google Library, Making of America, and the Open Content Alliance, and their impact on digital libraries. Integrating the huge digitized collections with specific domains could dig out rich and meaningful content for research, which traditional printed content can hardly achieve.

The second paper, in my mind, extends the audience or users of large digitized collections from researchers to a broader range. It views the whole WWW as the largest digitized collection and a rich semantic source for unsupervised learning. The authors argue that instead of trying to create neat theories for natural language processing, researchers might turn to their best ally, the unreasonable effectiveness of data, because "for many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without generative rules." Hence Semantic Interpretation, the problem of understanding human speech and writing, could be done by "learning as much as possible about the context of content to correctly disambiguate it." This study may reveal the bright side of the Data Deluge: the quickly growing data could serve as a valuable source for tasks that require large training data sets that were previously hard to get, such as parallel multilingual corpora.
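To make the idea of disambiguating from context a bit more concrete, here is a minimal sketch in Python. It is not the method from the paper, only a toy illustration: the four labeled sentences, the sense labels, and the function names (build_context_counts, disambiguate) are all made up for this example. It simply counts which words appear around each sense of "bank" and picks the sense whose counted context overlaps most with a new sentence.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (sense label, example sentence).
# A real system would rely on vastly more examples; these are invented.
LABELED_EXAMPLES = [
    ("bank/finance", "she deposited money at the bank"),
    ("bank/finance", "the bank approved the loan"),
    ("bank/river", "they fished from the river bank"),
    ("bank/river", "the boat drifted toward the muddy bank"),
]


def build_context_counts(examples):
    """Count how often each word occurs in the sentences of each sense."""
    counts = defaultdict(Counter)
    for sense, sentence in examples:
        counts[sense].update(sentence.split())
    return counts


def disambiguate(sentence, counts):
    """Return the sense whose observed context words best match the sentence."""
    words = sentence.split()
    scores = {
        sense: sum(word_counts[w] for w in words)
        for sense, word_counts in counts.items()
    }
    return max(scores, key=scores.get)
```

With these toy examples, disambiguate("he fished near the river", build_context_counts(LABELED_EXAMPLES)) picks "bank/river", because "fished" and "river" were only seen with that sense. There are no rules about rivers or finance anywhere in the code; the counts do all the work, which is the point the authors make at a much larger scale.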

Facing such huge digitized collections, I would ask whether researchers who want to take advantage of them must become experts in computing who can handle such large volumes of data. Also, what is the role of visualization in the whole process of using the effectiveness of data?

Saturday, April 18, 2009

Robert Spence's InfoVis

This week's book reading is Robert Spence's InfoVis 2001 (Amazon link).

Unlike the other two classic InfoVis books, The Craft of InfoVis, by Bederson and Shneiderman, and InfoVis Using Vision to Think, by Card, Mackinlay, and Shneiderman, both of which are collections of research papers, Spence's InfoVis book could be a nice textbook for InfoVis 101, in my opinion. A color-printed book of around 190 pages, it introduces nearly all the fundamental perspectives of InfoVis and is easy to read in a week. The terminology in this book is slightly different from that in the papers I read recently, such as treating Rearrangement at the same level as Interaction, and using Univariate, Bivariate, and so on to talk about 1D, 2D, 3D, and multi-D data.

Spence emphasizes that visualization is a cognitive activity and therefore, "The potential value of visualization--that of gaining INSIGHT and UNDERSTANDING--follows from these definitions but also, in view of the cognitive nature of visualization does the difficulty of its study" (p. 1). This emphasis on the cognitive perspective is the major strength of this book.

I skimmed several chapters, which cover the general ideas of InfoVis, and put my main effort into scrutinizing chapter 6, where Spence introduces a "Navigation Model" that forms a loop consisting of Internal Models, their Formation, and Interpretation. Spence considers navigation to be of great importance to infovis, and it includes the major concepts relevant to infovis, such as internal models and their characteristics, how models form in our minds, how people interpret externalizations of data, and how the browsing strategy forms.

These cognitive theories are relatively "old" compared with the Distributed Cognition theories, which were recently introduced to the InfoVis community by Liu (2008). It would be very interesting to read Liu's paper again so that new insights may come out.

Friday, April 17, 2009

Related Sites: InfoVis Wiki

I found this site (InfoVis Wiki) about a month ago. It is a good wiki portal for InfoVis, with a lot of information. I do hope it is updated frequently.


Tuesday, April 14, 2009

Insight-based Evaluation of Vis

(Plaisant, Fekete, Grinstein, 2008) Promoting Insight-Based Evaluation of Visualizations: From Contest to Benchmark Repository. (Link from ACM portal, link from authors)

A great summary of and reflection on the InfoVis Contests of 03-05. I wish I had read this paper before I drafted my submission to infovis09, since I could have gained more insight into the 04 contest data and made my argument stronger by comparing my findings with the insights discovered and evaluated by the judges.


Redefine Visual Analytics

Keim, Andrienko, Fekete, Gorg, Kohlhammer, and Melancon, (2008). Visual Analytics: Definition, Process, and Challenges. (pdf link from author).

An excellent summary paper on Visual Analytics (VA). It introduces the origins of the data overload challenge, describes the questions that need to be answered and the tasks that need to be performed efficiently and effectively, defines the concept of VA and its goals, differentiates scientific visualization, infovis, and VA, discusses the areas related to VA (Visualization, Data management, Data analysis, Perception and cognition, HCI, and Infrastructure and evaluation), offers a mantra for VA, "Analyze first, Show the Important, Zoom, filter and analyze further, Details on demand," lists Application challenges and Technical challenges, and ends with two vivid examples of VA.


Sunday, April 12, 2009

User experiments with five tree visual

A. Kobsa, "User Experiments with Tree Visualization Systems," in IEEE Symposium on Information Visualization, Austin, TX USA, 2004. (Link)

An empirical study that compared five infovis systems for tree hierarchies, using Windows Explorer as the baseline system. The five infovis systems are:
  1. Treemap 3.2
  2. Sequoia View 1.3
  3. BeamTrees
  4. Star Tree Studio 3.0
  5. Tree Viewer
The tasks in this study included 15 questions that could be divided into two categories: structure-related tasks and attribute-related tasks.

The study finds that the baseline, Windows Explorer, works well in terms of task correctness, completion time, and user satisfaction. Of the five tested systems, Treemap 3.2 achieved similar or better performance than Explorer. The other four were outperformed by Explorer. BeamTrees had comparatively the worst performance on the three evaluation metrics.

The author also used video recordings of the screen for further analysis, and gave detailed descriptions of each tested visualization system.

Strengths:
        1. A true empirical study of the usability and usefulness of infovis systems.
        2. Strong study design, based on both the previous study in the InfoVis 2003 Contest and the author's own requirements.
        3. Comprehensive analyses based on quantitative metrics and qualitative analysis.
        4. Interesting results: systems 2-5 performed poorly compared with Windows Explorer, which I didn't expect before reaching the results section.

Weaknesses:
        Theoretical studies tend to view InfoVis from two perspectives, visual representation and interaction, so I at first thought this paper would compare the unique features of each visual representation of tree hierarchies. The results of this study, however, mix the two. It is hard to tell whether the performance of each system comes from its unique visual features or from having or lacking proper interactive features to support the visual representation. I wonder how the impact of each system's visual representation differs from that of its interactive functions; or, for each visual representation, which interactive functions help improve performance the most. In the conclusion, the author suggests that further studies might look at functionality beyond the pure visualization, which could help increase the performance of systems.

Thursday, April 9, 2009

insights of Infovis

The first one is by Ji Soo Yi, talking about how people gain insights from infovis.

(Yi, Kang et al. 2008) Understanding and Characterizing Insights: How do people gain insights using information visualization? (link from ACM portal)

This paper is one of a series of papers from Yi that bring many insights into the theoretical foundation of InfoVis, presented at the InfoVis Conference, at BELIV, and in his dissertation.

In this paper, the authors identify four procedures through which people gain insights from infovis systems or tools, based on a review of previous studies on the benefits gained from infovis. The four procedures are:

  1. Provide overview
  2. Adjust
  3. Detect pattern
  4. Match mental model
Before going into detailed descriptions of the four procedures, the authors explain why they endeavour to find "how" people gain insights instead of "what" an insight is: the definition of "insight" is complex, and a uniform, widely accepted definition is still lacking. Knowing how people gain insights could help to define what insights are.

The authors also argue that "insights" are not only "end results" but also sources or stimuli of other insights, and often a by-product of exploration without an initial destination. This argument leads me to think about the mantra in Thomas and Cook's Illuminating the Path, "detect the expected and discover the unexpected." Are insights expected or unexpected? What are the differences between "insights" and "discoveries"?

The authors use the data/frame theory of sense making as a critical step for understanding how insights are gained, because "we found that understanding sense making is critical." I do hope to hear more about the relationship between the two concepts, which may lay a solid foundation for InfoVis and Visual Analytics.

In the conclusion, the authors raise an interesting question: "Is there something unique about visualization in the way it produces insight that is different from other ways of producing insights?" For this question, I would like to think about whether any of the four processes discussed above has this unique feature. Is visualization good at helping people match their mental models with the data?