Ontology Management
Probabilistic Ontologies (9/2004 - 8/2005)
View Maintenance on RDF Databases (2/2004- 11/2004)
TOSS: An Extension of TAX with Ontologies and Similarity Queries (9/2003 - 3/2004)
Privacy Preserving Data Mining (7/2005 - now)
Data Mining on Uncertain Data (6/2005 - now)
PNL: Parallel Mining of Outliers in Large Database (7/98 - 9/99)
Comparative Evaluation of Methods in Induction of Procmail Recipes from Classified Emails (9/2001 - 11/2001)
Deduction of Procmail Recipes from Classified Emails (2/2001 - 11/2001)
Database
PXML, PIXML: Probabilistic Semistructured Database (8/2001 - 7/2004)
Data Cube System Design: An Optimization Problem (11/98 - 5/2001)
Testing of Database Applications (9/2001 - 12/2001)
Study on software TM1, POWEROLAP (12/98 - 2/99)
Universal Usability in Practice: Blind and Low Vision Users (2/2001 - 4/2001)
Research on "PNL: Parallel Mining of Outliers in Large Database" (7/98 - 9/99)
A more efficient version (ENL) and its parallel version (PNL) are introduced. In theory, the improvement of performance in PNL is linear to the number of processors, as shown in a performance comparison between ENL and PNL using Bulk Synchronization Parallel (BSP) model. The great improvement is further verified by experiments on a parallel computer system IBM 9076 SP2. The results show that it is a very good choice to mine outliers in a cluster of workstations with a low-cost interconnected by a commodity communication network.
A conference paper was published in IDC'99. A journal paper was published
in Distributed and Parallel Database, 2002.
A report was written.
In an OLAP system, data cubes (precomputed multidimensional views of data) can be used to support real-time queries. To reduce the maintenance cost, some data cubes can be merged, but the resulting larger data cubes will increase the response time of answering some queries. In order to satisfy the maintenance-cost bound and query-cost bound given by the user, some of the queries have to be sacrificed and not taken into our consideration. The optimization problem in the data cube system design is to optimize an initial set of cubes such that a maximum number of queries can be answered with the bounds satisfied. This is an NP-hard problem.
With a simplified data cube system model, a two-phase approach was proposed. First, an initial set of data cubes is derived, each for a specific query. Then, optimization is done on the initial data cube set so that the resultant set of data cubes satisfies the two bounds.
The optimization phase can be divided into two levels: the query level and the attribute level. Optimization at the query level tries to use the attribute-level optimization subroutine to obtain a data cube set to answer a maximum number of queries with the bounds satisfied. Optimization at the attribute level tries to optimize a given data cube set so that the new data cube set can answer the same set of queries answered by the former set, with the minimum query cost and the maintenance-cost bound satisfied. To the best of our knowledge, there are only a few papers related to our problem, but their approaches and algorithms have their own shortcomings.
At the query level, Greedy Removing (GR) algorithm was suggested, which chooses to remove one query from consideration at each time, so that the query cost of the data cube set is the lowest among others. At the attribute level, 2-Greedy Merging (2GM) algorithm was proposed to choose two data cubes to merge to become a larger data cube at each time. A multiple-path version of 2GM, 2-Greedy Merging with Multiple paths (2GMM), was also proposed with better performance than 2GM.
The performance of the algorithms was verified by experimental results. For the attribute-level optimization, the goodness of the solutions obtained from 2GM and 2GMM was very close to an optimal solution Moreover, 2GM and 2GMM were very efficient. For the query-level optimization, the number of queries removed was the same for both of a GR solution and an optimal solution for almost every query-cost bound. Moreover, the execution time of GR was only within several minutes. Therefore, it could be concluded that the combination of 2GMM (for the attribute-level optimization) and GR (for the query-level optimization) is a good choice for solving the data cube system design optimization problem.
A conference paper was published in PAKDD 2000. A M.Phil. thesis was
accepted. A journal paper was published in Journal of Intelligent Information
Systems. A journal paper is under review.
Study on "Universal Usability in Practice: Blind and Low Vision Users" (2/2001 - 4/2001)
The purpose of this article is to provide recommendations, guidelines, examples and resources to web site developers on how to develop web sites accessible to users that are blind or have low vision. People with visual disabilities are individuals who are blind, have low vision, or have color blindness. In this article, we will consider users with blindness and low vision.
A report was written. Available at http://www.otal.umd.edu/UUPractice/vision.
A report was written. A book chapter was published in the book: "E-Service:
New Directions in Theory and Practice", 2002.
The growth of the Internet and the increasing use of emails create the need of automatic classification of emails. To our best knowledge, previous research works were done on building classifiers that worked without letting users know by which rules emails were classified. We believe that it is necessary for users to be able to understand and modify the classification ways with full control. Therefore, we aim at generating rules that are easy to understand, possible to execute immediately and free for users to modify. We suggest deduction of procmail recipes, which are classification rules used by a widely used email-filtering software called procmail. We consider finely-divided domain-specific features and propose two approaches, namely, the naive approach (based on common features among all emails in the same class) and decision tree induction approach. With the experimental results, we show that the decision tree induction approach outperforms the naive approach and has a satisfactory accuracy of higher than 90% in most of the time, but higher than 80% even in worse cases.
A report was written.
We consider finely-divided domain-specific features and propose to comparatively evaluate the performance of two machine learning methods, namely, the decision tree induction method (C4.5 program) and the rule learning method (using AQ program). With the experimental results, we show that the decision tree induction method (C4.5) outperforms the rule learning method (AQ) and has a satisfactory accuracy of higher than 95% in most of the time, and higher than 88% even in worse cases.
A report was written.
A report was written.
A paper was published in ICDT 2003. A paper was published in ICDE 2003.
A journal paper was accepted by ACM TOCL.
A report was written.
A paper was published in SIGMOD 2004.
Research on "View Maintenance on RDF Databases" (2/2004- 11/2004)
This work was supported in part by the Army Research Lab under contracts DAAL0197K0135 and DAAD190320026, the CTA on Advanced Decision Architectures, by ARO contracts DAAD190010484 and DAAD190310202, by DARPA/UC Berkeley contract number 03-000219, and by NSF grants 0205489 and IIS0329851.
Resource Description Framework (RDF) has been recommended by the World Wide Web Consortium as a standard for integrating web data. RDF specifications are increasingly being stored in RDF databases. Many applications will maintain views over RDF databases - these views may describe resources of interest to the owner of the views. As data about the resources available on the web is subject to frequent change, there is a need to incrementally maintain such views. We present algorithms to maintain materialized RDF views when data is (i) inserted into, (ii) deleted from, or (iii) modified in RDF database instances. We have developed a prototype of our algorithms. It is to be noted that the expressive power of RDF is a subset of relational databases. As a consequence, algorithms to maintain views over RDF instances can be far more powerful than relational database view maintenance algorithms. We show that the naive approach of storing RDF instances in a relational database and using classical view maintenance algorithms such as the well known algorithms of Gupta et. al. is far less efficient than using our algorithms. Our algorithms are 10 to 200 times faster than the relational view maintenance algorithms when applied to RDF-views.
In addition, we study the problem of aggregate queries. We develop an algorithm to compute answers to aggregate queries over RDF databases and algorithms to maintain views involving those aggregates. Though RDF data can be stored in a standard relational DBMS (and hence we can execute standard relational aggregate queries and view maintenance methods on them), we show experimentally that our algorithms that operate directly on the RDF representation exhibit significantly superior performance.
A paper was published in ICE 2005 and two technical reports were written.
- Edward Hung, Yu Deng, V.S. Subrahmanian, "RDF Aggregate Queries and Views", in the Proceedings of the 21st International Conference on Data Engineering (ICDE), pages 717-728, Tokyo, Japan, April 5-8, 2005. (acceptance rate = 67/521 = 13%)
- Edward Hung, Yu Deng and V.S. Subrahmanian, "RDF Aggregate Queries and Views", Technical Report UMCP-CSD CS-TR-4611 (UMIACS-TR-2004-53). August 2004.
- Edward Hung, Yu Deng and V.S. Subrahmanian, "Maintaining RDF Views", Technical Report UMCP-CSD CS-TR-4612 (UMIACS-TR-2004-54). August 2004.
Research on "Probabilistic Ontologies" (9/2004 - 8/2005)
Work supported in part by ARO grant DAAD190310202, NSF grants IIS0329851 and 0205489, and by a DARPA subcontract from the Univ. of California – Berkeley.
The relational algebra and calculus do not take the semantics of terms into account when answering queries. As a consequence, not all tuples that should be returned in response to a query are always returned, leading to low recall. In this paper, we propose the novel notion of a constrained probabilistic ontology (CPO). We developed the concept of a CPO-enhanced relation in which each attribute of a relation has an associated CPO. These CPOs describe relationships between terms occurring in the domain of that attribute. We show that the relational algebra can be extended to handle CPO-enhanced relations. This allows queries to yield sets of tuples, each of which has a probability of being correct. Using the well known Internet Movie Database (IMDB) as a testbed, we show that such \emph{probabilistic answers} yield a far higher ``quality of answer'' than ordinary relational DBs can provide today. We further develop methods to automatically infer probabilistic ontologies from relational data, and develop experimental results showing that the use of probabilistic ontologies is highly scalable.
A paper was published in OTM 2005 and a journal article was submitted.
- Octavian Udrea, Yu Deng, Edward Hung, V.S. Subrahmanian, “Probabilistic Ontologies and Relational Databases”, in the Proceedings of the OTM Confederated International Conferences (CoopIS, DOA, and ODBASE) 2005, Agia Napa, Cyprus, Oct 31 - Nov 4, 2005.
Research on "Data Mining on Uncertain Data" (6/2005 - now)
There has been a large amount of research work done on mining on relational databases that store data in exact values. However, in many real-life applications, the raw data are usually uncertain when they are collected or produced. Sources of uncertain data include readings from sensors, classification results of image processing using statistical classifiers, results from predictive programs used for stock market, and weather predictions in meteorology, etc. However, since traditional databases only store exact values, uncertain data are usually transformed into exact data by, for example, taking the mean value (for quantitative attributes) or by taking the value with the highest frequency or possibility. The shortcomings are obvious: (1) by approximating the uncertain source data values, the results from the mining tasks will also be approximate and may be wrong; (2) useful probabilistic information may be omitted from the results.
Research on probabilistic databases began in 1980s. While there has been a great deal of work on supporting uncertainty in databases, there is little work on mining on such uncertain data. By classifying uncertain data into different categories, our main objective is to propose a framework to develop different probabilistic data mining techniques that can be applied directly on uncertain data in order to produce results that preserve the accuracy. In this article, we introduce the framework with a scheme to categorize uncertain data with different properties. We also propose a variety of definitions and approaches for different mining tasks on uncertain data with different properties.A paper was submitted.