Spring 2010   CSE6339 (Section 002, #23887)

Special Topics in Advanced Database Systems

Web Search, Mining, and Integration


Resources: Google, Google Scholar, CiteSeer, DBLP Bibliography, ACM Digital Library, IEEE Xplore, Other Computer Science articles


Course Information:

Instructor: Chengkai Li

TA: Ning Yan

  • Office hours: Tue/Thu 3-5pm
  • Office: GeoScience Building 237
  • Phone: 682-227-9412
  • E-mail: ning.yan [AT] mavs [DOT] uta [DOT] edu

Course Description: We will study papers on Web Search, Mining, and Integration, covering topics in databases, data mining, information retrieval, and the intersections of these areas. The goals of the course are: to expose graduate students to the cutting edge of research in these areas; to equip them with the skill sets needed for finding jobs; to help them identify research topics and produce preliminary work through course projects; and to prepare new students for doing research with faculty in databases, data mining, and information retrieval. Detailed topics include:

Prerequisites: CSE 3330/5330 Database Systems I, or CSE 5334 Data Mining, or similar courses, or consent of instructor

Reference Textbook (Not Required)

We will mainly use research papers.

Grades

There is no exam. Grades are based on paper reviews, presentations, and the course project.


Announcements: Stay tuned and make sure to check WebCT frequently. Important announcements will be posted there.

Assignments and Deadlines

Regrading: Regrade requests must be made within 7 days after scores are posted on WebCT. The TA will handle regrade requests. If you are not satisfied with the regrading result, you have 7 days to submit a second request. The instructor will then regrade, and that decision is final.

WebCT: (WebCT is not ready yet. You will be notified when it is ready.)

Log in to the WebCT page http://www.uta.edu/webct with your NetID and password. We use WebCT for: (1) Announcements; (2) Assignment Submission; (3) Discussion;  (4) Releasing materials, assignments, scores and grades. Follow these steps exactly during electronic assignment submission.


Ethics Policies and Academic Integrity: The College cannot and will not tolerate any form of academic dishonesty by its students. This includes, but is not limited to, cheating on examinations, plagiarism, and collusion (explained in the document below). Students are required to read the following document carefully, sign it, return the signed copy to the instructor, and keep a copy for their own records. Hard copies of this document will be provided in the first class and can also be picked up in the instructor's office. If you print it yourself, please print it double-sided.

Statement on Ethics, Professionalism, and Conduct for Engineering Students

Miscellaneous: If you require accommodation based on disability, I would like to meet with you in the privacy of my office during the first week of the semester to ensure that you are appropriately accommodated. Please also read the web page of the Office for Students with Disabilities.


Schedule

Date   Lecture/Activities   Presenter   Due   Lecture Notes

Introduction
01/19 Course Overview Chengkai Li   [PDF]
01/21 Paper Review, Presentation, Research Resources Chengkai Li   [PDF]
Basics
01/26 Paper Review, Presentation, Research Resources (cont'd) Chengkai Li
01/28 Course Project Topics Chengkai Li   [slides in WebCT]
02/02 Boolean Query Model, Vector Space Model, Inverted Index and Distributed Index Chengkai Li   [PDF]
02/04 Web Search Basics
  • Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117 (1998)
  • Luiz André Barroso, Jeffrey Dean, Urs Hölzle: Web Search for a Planet: The Google Cluster Architecture. IEEE Micro 23(2): 22-28 (2003)
  Abdus Salam   [PPT]
02/09 Text Clustering
  • Steinbach, M., Karypis, G., and Kumar, V. A Comparison of Document Clustering Techniques. Technical Report #00-034. Department of Computer Science and Engineering, University of Minnesota, 2000
  Mahashweta Das   [PPT]
02/11 (Rescheduled due to snow)
02/16 Text Classification
  • Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl. Boilerplate Detection using Shallow Text Features. WSDM 2010.
  Rakesh Ramegowda   [PPT]
Semantic Web
02/18 SemTag
  • Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, Ramanathan V. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien: SemTag and seeker: bootstrapping the semantic web via automated semantic annotation. WWW 2003: 178-186
  Shahina Ferdous   Due: Proposal   [PPT]
02/23 YAGO
  • Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum: YAGO: A Large Ontology from Wikipedia and WordNet. J. Web Sem. 6(3): 203-217 (2008)
  Quazi Hasan   [PDF]
02/26 (make-up class for 02/11)
  • Simone Paolo Ponzetto, Michael Strube: Deriving a Large-Scale Taxonomy from Wikipedia. AAAI 2007: 1440-1445
  • Cäcilia Zirn, Vivi Nastase, Michael Strube: Distinguishing between Instances and Classes in the Wikipedia Taxonomy. ESWC 2008: 376-387
  Chengkai Li   [PDF]
Entity Recognition and Disambiguation
02/25
  • D. Milne and I. H. Witten. Learning to link with Wikipedia. In CIKM ’08, pages 509–518, 2008.
  • R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In CIKM ’07, pages 233–242, 2007.
  Abhijit Tendulkar   [PDF]
03/02
  • S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in Web text. In KDD ’09, pages 457–466, 2009.
  • X. Han and J. Zhao. Named entity disambiguation by leveraging Wikipedia semantic knowledge. In CIKM ’09, pages 215–224.
  Avinash Bharadwaj   [PDF]
Information Extraction
03/04 Machine Learning Approach, Wrapper
  • Andrew McCallum: Information Extraction: Distilling Structured Data from Unstructured Text. ACM Queue, Volume 3, Number 9, November 2005.
  • Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea: Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Eng. Bull. 23(4): 33-41 (2000)
  Shanshan Lu   [PDF]
03/09 KnowItAll
  • Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates: Unsupervised named-entity extraction from the Web: An experimental study. Artif. Intell. 165(1): 91-134 (2005)
  Chengkai Li   [PDF]
03/11 TextRunner and Open Information Extraction
  • Oren Etzioni, Michele Banko, Stephen Soderland, Daniel S. Weld: Open information extraction from the web. Commun. ACM 51(12): 68-74 (2008)
  • Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676
  • Michele Banko and Oren Etzioni: The Tradeoffs Between Open and Traditional Relation Extraction. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008)
  Ning Yan   Due: Essay 1 (03/12)   [PDF]
03/16 Spring break
03/18
Structured Querying Over the Web, Entity Search and Ranking
03/23 ExDB
  • Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni: Structured Querying of Web Text Data: A Technical Challenge. CIDR 2007: 225-234
  Shahina Ferdous   [PDF]
03/25 EntityRank
  • Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007: 387-398
  Abdus Salam   [PDF]
03/30
  • Soumen Chakrabarti, Kriti Puniyani, Sujatha Das: Optimizing scoring functions and indexes for proximity search in type-annotated corpora. WWW 2006: 717-726
  Xiaonan Li   Due: Progress Report   [PDF]
04/01 SQoUT
  • Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano: To search or to crawl?: towards a query optimizer for text-centric tasks. SIGMOD Conference 2006: 265-276
  Avinash Bharadwaj   [PDF]
04/02 Last day to drop class
Guest Lectures
04/06 Xiaonan Li   Due: Essay 2   [PDF]
04/08
  • Facetedpedia: Dynamic Generation of Query-Dependent Faceted Interfaces for Wikipedia. Chengkai Li, Ning Yan, Senjuti Basu Roy, Lekhendro Lisham, Gautam Das. To appear in Proceedings of the 19th International World Wide Web Conference (WWW 2010), Raleigh, North Carolina, April 2010.
  Ning Yan   [PDF]
Web Data Mining (cont'd)
04/13 Clustering Web Search Results
  • Douglas R. Cutting, Jan O. Pedersen, David R. Karger, John W. Tukey: Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. SIGIR 1992: 318-329
  Mahashweta Das   [PDF]
04/15
  • Gabrilovich, E. and Markovitch, S. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. IJCAI’07, p.1606–1611.
  Quazi Hasan   [PDF]
04/20
  • Milne, D. and Witten, I. H. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. WIKIAI'08
  Shanshan Lu   [PDF]
Social Networks
04/22 Tagging
  • Michael Benedikt, Sihem Amer Yahia, Laks Lakshmanan, Julia Stoyanovich. Efficient Network-Aware Search in Collaborative Tagging Sites. VLDB 2008.
  Abhijit Tendulkar   [PDF]
04/27
  • Shenghua Bao, Gui-Rong Xue, Xiaoyuan Wu, Yong Yu, Ben Fei, Zhong Su: Optimizing web search using social annotations. WWW 2007: 501-510
  Rakesh Ramegowda   [PDF]
04/29 Overflow lecture   Due: Essay 3   [PDF]
05/11 Project presentation and Demo
  Salam; Shanshan; Rakesh/Avinash
  Due: Final Report, Presentation and Demo Slides, and source code (05/10)   [PDF]
05/13 Project presentation and Demo
  Sunny/Shahina; Mahashweta/Abhijit   [PDF]

University calendar: Spring 2010