Ph.D
Group : Artificial Intelligence and Inference Systems
Understanding the hidden web
Starts on
Advisor : ABITEBOUL, Serge
Funding :
Affiliation : INRIA
Laboratory :
Defended on 12/01/2007, committee :
Serge Abiteboul
François Bourdoncle
Patrick Gallinari
Georg Gottlob
Christine Paulin-Mohring
Val Tannen
Research activities :
- Semantic Web
Abstract :
The hidden Web (also known as the deep or invisible Web), that is, the part of the Web that is not directly accessible through hyperlinks but only through HTML forms or Web services, is of great value but difficult to exploit.
We discuss a process for the fully automatic discovery, syntactic and semantic analysis, and querying of hidden-Web services. We first propose a general architecture that relies on a semi-structured warehouse of imprecise (probabilistic) content, and we provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how a combination of heuristics and probing can be used to understand the structure of an HTML form. We present an original use of a supervised machine-learning method, namely conditional random fields, in an unsupervised manner, on an automatic, imperfect, and imprecise annotation based on domain knowledge, in order to extract relevant information from HTML result pages. To obtain semantic relations between the inputs and outputs of a hidden-Web service, we investigate the complexity of deriving a schema mapping between database instances, relying solely on the presence of constants in the two instances. Finally, we describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to process a user’s high-level query using such descriptions.