Research Projects

Study of various fields in database (6/98 - now)

Ontology Management

Probabilistic Ontologies (9/2004 - 8/2005)

View Maintenance on RDF Databases (2/2004- 11/2004)

TOSS: An Extension of TAX with Ontologies and Similarity Queries (9/2003 - 3/2004)

Data Mining

Database

E-Commerce and User Interface
Theory
Distributed Computing


Research on "PNL: Parallel Mining of Outliers in Large Database" (7/98 - 9/99)

Survey on "Fault Tolerance and Checkpointing Schemes for Clusters of Workstations" (11/98 - 12/98) Research on "Data Cube System Design: An Optimization Problem" (11/98 - 5/2001) Study on software TM1, POWEROLAP (12/98 - 2/99)

Study on "Universal Usability in Practice: Blind and Low Vision Users" (2/2001 - 4/2001)

Research on "Establishing Trust among Online Customers" (2/2001 - 9/2001) Research on "Deduction of Procmail Recipes from Classified Emails" (2/2001 - 11/2001) Research on "Inapproximability of Materialized View Selection Problem and Non-metric K-medians Problem" (4/2001 - 5/2001) Research on "PXML, PIXML: Probabilistic Semistructured Database" (8/2001 - 7/2004)
Edward Hung is supported by Croucher Foundation Scholarships. Work is funded in part by the Army Research Lab under contract number DAAL-0197K0135, the Army Research Office under grant number DAAD190010484, by DARPA/RL contract number F306029910552, the ARL CTA on Advanced Decision Architectures, and NSF grant 0205489. Research on "Testing of Database Applications" (9/2001 - 12/2001) Research on "TOSS: An Extension of TAX with Ontologies and Similarity Queries" (9/2003 - 3/2004)
This work was supported in part by the Army Research Lab under contracts DAAL0197K0135 and DAAD190320026, the CTAs on Advanced Decision Architectures and Telecommunications, by ARO contract DAAD190310202, and by NSF grants 0205489 and IIS0329851.

Research on "View Maintenance on RDF Databases" (2/2004- 11/2004)

This work was supported in part by the Army Research Lab under contracts DAAL0197K0135 and DAAD190320026, the CTA on Advanced Decision Architectures, by ARO contracts DAAD190010484 and DAAD190310202, by DARPA/UC Berkeley contract number 03-000219, and by NSF grants 0205489 and IIS0329851.

Resource Description Framework (RDF) has been recommended by the World Wide Web Consortium as a standard for integrating web data. RDF specifications are increasingly being stored in RDF databases. Many applications will maintain views over RDF databases - these views may describe resources of interest to the owner of the views. As data about the resources available on the web is subject to frequent change, there is a need to incrementally maintain such views. We present algorithms to maintain materialized RDF views when data is (i) inserted into, (ii) deleted from, or (iii) modified in RDF database instances. We have developed a prototype of our algorithms. It is to be noted that the expressive power of RDF is a subset of relational databases. As a consequence, algorithms to maintain views over RDF instances can be far more powerful than relational database view maintenance algorithms. We show that the naive approach of storing RDF instances in a relational database and using classical view maintenance algorithms such as the well known algorithms of Gupta et. al. is far less efficient than using our algorithms. Our algorithms are 10 to 200 times faster than the relational view maintenance algorithms when applied to RDF-views.

In addition, we study the problem of aggregate queries. We develop an algorithm to compute answers to aggregate queries over RDF databases and algorithms to maintain views involving those aggregates. Though RDF data can be stored in a standard relational DBMS (and hence we can execute standard relational aggregate queries and view maintenance methods on them), we show experimentally that our algorithms that operate directly on the RDF representation exhibit significantly superior performance.

A paper was published in ICE 2005 and two technical reports were written.

Research on "Probabilistic Ontologies" (9/2004 - 8/2005)

Work supported in part by ARO grant DAAD190310202, NSF grants IIS0329851 and 0205489, and by a DARPA subcontract from the Univ. of California – Berkeley.

The relational algebra and calculus do not take the semantics of terms into account when answering queries. As a consequence, not all tuples that should be returned in response to a query are always returned, leading to low recall. In this paper, we propose the novel notion of a constrained probabilistic ontology (CPO). We developed the concept of a CPO-enhanced relation in which each attribute of a relation has an associated CPO. These CPOs describe relationships between terms occurring in the domain of that attribute. We show that the relational algebra can be extended to handle CPO-enhanced relations. This allows queries to yield sets of tuples, each of which has a probability of being correct. Using the well known Internet Movie Database (IMDB) as a testbed, we show that such \emph{probabilistic answers} yield a far higher ``quality of answer'' than ordinary relational DBs can provide today. We further develop methods to automatically infer probabilistic ontologies from relational data, and develop experimental results showing that the use of probabilistic ontologies is highly scalable.

A paper was published in OTM 2005 and a journal article was submitted.

Research on "Data Mining on Uncertain Data" (6/2005 - now)

There has been a large amount of research work done on mining on relational databases that store data in exact values. However, in many real-life applications, the raw data are usually uncertain when they are collected or produced. Sources of uncertain data include readings from sensors, classification results of image processing using statistical classifiers, results from predictive programs used for stock market, and weather predictions in meteorology, etc. However, since traditional databases only store exact values, uncertain data are usually transformed into exact data by, for example, taking the mean value (for quantitative attributes) or by taking the value with the highest frequency or possibility. The shortcomings are obvious: (1) by approximating the uncertain source data values, the results from the mining tasks will also be approximate and may be wrong; (2) useful probabilistic information may be omitted from the results.

Research on probabilistic databases began in 1980s. While there has been a great deal of work on supporting uncertainty in databases, there is little work on mining on such uncertain data. By classifying uncertain data into different categories, our main objective is to propose a framework to develop different probabilistic data mining techniques that can be applied directly on uncertain data in order to produce results that preserve the accuracy. In this article, we introduce the framework with a scheme to categorize uncertain data with different properties. We also propose a variety of definitions and approaches for different mining tasks on uncertain data with different properties.

A paper was submitted.