Date 10AM to 11:30AM 11:30AM to 1PM 1:00PM to 2:00PM 2:00PM to 3:30PM 3:30PM to 5:00PM
08th Dec. 2014 SAARC Charter Day Celebration
Lunch Break
Inaugural Ceremony Ian H. Witten
09th Dec. 2014 Ian H. Witten
Vivek Singh
10th Dec. 2014 David Eduardo Pinto Marc Franco Salvador
Marc Franco Salvador
11th Dec. 2014 Niladri Chatterjee
T J Siddiqui
12th Dec. 2014 Sharma Chakravarthy
Reshma Khemchandani Madhu Kumari
13th Dec. 2014 Asif Ekbal
Valedictory + Keynote-K R Murali Mohan

Topics and Abstracts

Ian H. Witten: The evolution of text mining: a personal view

The idea of formal analyses of the meaning of text dates back a century, to Wittgenstein’s Tractatus Logico-Philosophicus (published in 1921). We will take a whirlwind tour through the history of text mining, including my own experience with closely related problems over a period of many decades. Particular recent highlights are the knowledge revolution (facilitated by the emergence of Wikipedia) and the “linked data” movement. But links are only useful if they relate to meaning (rather than, say, to a URL), and, despite the hype, today’s linked data resources are shallow. This leads to current research on ontology building and its evaluation by crowd-sourcing.

Ian H. Witten: Semantic document representation: do it with Wikification

Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment. “Wikification” is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to train machine-learning algorithms for disambiguation and selection respectively. Wikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. I will show how it can be applied to document clustering and classification algorithms, and to produce back-of-the-book indexes, improving on the state of the art in each case.
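The three steps named above (associating phrases with concepts, disambiguating them, and selecting the most pertinent) can be illustrated with a toy sketch. This is not the speaker's system: the anchor counts, link probabilities, relatedness scores, and the function name wikify are all hypothetical, and hand-written scoring stands in for the machine-learned models the abstract describes.

```python
# Hypothetical anchor statistics: ANCHORS[phrase][concept] = how often the
# phrase is used as link text pointing to that concept.
ANCHORS = {
    "jaguar": {"Jaguar (animal)": 30, "Jaguar Cars": 70},
    "amazon": {"Amazon rainforest": 40, "Amazon (company)": 60},
    "rainforest": {"Rainforest": 100},
}
# Hypothetical fraction of a phrase's occurrences that appear as links.
LINK_PROB = {"jaguar": 0.6, "amazon": 0.5, "rainforest": 0.3}
# Hypothetical semantic relatedness between pairs of concepts (0 to 1).
RELATED = {
    frozenset(("Jaguar (animal)", "Rainforest")): 0.9,
    frozenset(("Jaguar Cars", "Rainforest")): 0.05,
    frozenset(("Amazon rainforest", "Rainforest")): 0.95,
    frozenset(("Amazon (company)", "Rainforest")): 0.05,
}


def relatedness(a, b):
    """Look up the toy relatedness score; unknown pairs score 0."""
    return RELATED.get(frozenset((a, b)), 0.0)


def wikify(text, min_link_prob=0.4):
    """Return {phrase: concept} for the phrases judged worth linking."""
    # Step 1: associate phrases with candidate concepts (single words only).
    phrases = [tok for tok in text.lower().split() if tok in ANCHORS]
    # Unambiguous phrases supply the context used for disambiguation.
    context = {next(iter(ANCHORS[p])) for p in phrases if len(ANCHORS[p]) == 1}
    links = {}
    for phrase in phrases:
        candidates = ANCHORS[phrase]
        total = sum(candidates.values())
        others = [c for c in context if c not in candidates]

        def score(concept):
            commonness = candidates[concept] / total
            if not others:
                return commonness
            avg_rel = sum(relatedness(concept, o) for o in others) / len(others)
            return commonness * avg_rel

        # Step 2: disambiguate by commonness weighted by context relatedness.
        best = max(candidates, key=score)
        # Step 3: select only phrases that are linked often enough.
        if LINK_PROB[phrase] >= min_link_prob:
            links[phrase] = best
    return links


print(wikify("The jaguar lives in the Amazon rainforest"))
```

With these made-up numbers, the unambiguous word "rainforest" steers both "jaguar" and "amazon" away from their more common commercial senses, while "rainforest" itself falls below the link-probability threshold and is not linked.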

Ian H. Witten: Data Mining with Weka

This is a practical introduction to data mining using the Weka machine learning workbench, aimed at end users who are new to Weka. I will introduce basic concepts such as classification, evaluation and overfitting. I will describe simple machine learning methods, including statistical modeling, decision trees and rules, association rules, various kinds of linear models, instance-based learning and clustering. Lecture material will be interspersed with practical demonstrations: I will show you the main features of the Weka Explorer, including filters, classifiers, and visualization. I will also introduce my two MOOCs (Massive Open Online Courses), through which you will be able to learn more about data mining with Weka.

Vivek Singh: Sentiment Analysis

Sentiment analysis is a language processing task that uses an algorithmic formulation to identify opinionated content and categorize it as having ‘positive’, ‘negative’ or ‘neutral’ polarity. It has been formally defined as an approach that works on a quintuple (Oi, Fij, Skijl, Hk, Tl), where Oi is the target object, Fij is a feature of the object Oi, Skijl is the sentiment polarity (+ve, -ve or neutral) of the opinion of holder k on the jth feature of object i at time l, Hk is the opinion holder, and Tl is the time when the opinion is expressed. It follows from this definition that sentiment analysis involves a number of tasks, ranging from identifying whether the target carries an opinion to, if it does, classifying the opinion as having ‘positive’ or ‘negative’ polarity. The sentiment analysis task may be done at different levels: document-level, sentence-level or aspect-level. There are broadly two kinds of approaches to sentiment analysis: those based on machine learning classifiers and those based on lexicons. The machine learning classifiers used for sentiment analysis usually follow a supervised learning paradigm: they are trained on labelled data before being applied to the actual sentiment classification task. Lexicon-based methods, on the other hand, extract some selected features, use a dictionary look-up to compute their sentiment polarities, and aggregate them in some way to find the overall polarity. The tutorial aims to introduce the sentiment analysis problem and characterize various approaches. Some standard datasets and application areas will also be discussed.
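As a rough illustration of the lexicon-based approach described above, the sketch below looks up each word in a small dictionary and sums the scores to obtain a document-level polarity. The lexicon entries, their weights, the negation rule (a negator flips only the word that immediately follows it), and the function name polarity are assumptions made for this example, not part of the tutorial.

```python
import re

# Hypothetical sentiment lexicon: word -> polarity score.
LEXICON = {
    "good": 1, "great": 2, "excellent": 2,
    "bad": -1, "poor": -1, "terrible": -2,
}
# Hypothetical negation words; each flips the sign of the next word only.
NEGATORS = {"not", "never", "no"}


def polarity(text):
    """Classify text as 'positive', 'negative' or 'neutral'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            value = LEXICON[tok]
            score += -value if negate else value
        negate = False  # negation applies to the next token only
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in (
    "The camera is great but the battery is poor",
    "The screen is not good",
    "It arrived on Monday",
):
    print(sentence, "->", polarity(sentence))
```

A plain sum is only one possible aggregation; the abstract leaves the aggregation method open.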

Niladri Chatterjee: Statistical Machine Translation

Machine Translation, or automated translation of text from one natural language into another, is a challenging task both technically and linguistically. Although machine translation was traditionally considered an Artificial Intelligence problem, Statistical Machine Translation (SMT) has gained popularity over the last decade or so. Given the availability of huge parallel corpora, SMT computes various probabilities from those corpora (e.g., unigram, bigram and trigram probabilities for language modelling; translation probabilities using alignment functions), which are used to generate the most probable translation of a given input sentence. The whole idea started with the five IBM models in 1993, which have since been extended by different researchers. In this tutorial we first look at machine translation as a subject and examine its difficulties, with a focus on English-to-Hindi (and some other Indian-language) translation. We also pay a short visit to the history of MT and the different MT paradigms. We then develop the technique of statistical MT systematically, step by step, starting from the IBM models. The tutorial also deals with some SMT software, such as Moses and GIZA++.
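To give a flavour of how translation probabilities can be estimated from a parallel corpus, here is a minimal sketch of expectation-maximisation training in the style of IBM Model 1. It is a simplification made for illustration: there is no NULL source word, the three-sentence German-English corpus is a made-up toy example, and the function name train_ibm_model1 is hypothetical. Real systems rely on tools such as GIZA++.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(e, f)] = P(target word e | source word f) with EM.

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {e for _, e_sent in pairs for e in e_sent}
    # Start from a uniform distribution over co-occurring word pairs.
    t = {
        (e, f): 1.0 / len(tgt_vocab)
        for f_sent, e_sent in pairs
        for e in e_sent
        for f in f_sent
    }
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (e, f) alignments
        total = defaultdict(float)  # expected count of f being aligned
        # E-step: distribute each target word over the source words.
        for f_sent, e_sent in pairs:
            for e in e_sent:
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: re-normalise the expected counts per source word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


# Toy parallel corpus (illustrative only).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
table = train_ibm_model1(corpus)
for (e, f), p in sorted(table.items(), key=lambda kv: -kv[1])[:4]:
    print(f"P({e} | {f}) = {p:.3f}")
```

Even on this tiny corpus, repeated iterations shift probability mass towards the intuitive word pairs, because "das" co-occurs with "the" in two sentences while "haus" and "buch" each explain their own target word.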

T J Siddiqui: Multi-Document Summarization

This tutorial will focus on automatic text summarization, in particular on multi-document summarization. The tutorial is in two parts. In the first part, I will introduce the basic concepts involved in creating a single-document summary. After a quick overview of what a multi-document summary is and how it differs from a single-document summary, I will give an overview of automatic evaluation of summarization systems. In the second part of the tutorial, I will discuss issues and challenges specific to multi-document summarization and its applications. During the course of the talk, I will attempt to survey existing statistical and shallow semantic approaches to multi-document summarization. Finally, future challenges will be discussed.

Sharma Chakravarthy: InfoSift: Adapting Graph Mining Techniques for Document Classification

I will briefly describe ongoing projects at the IT Lab before the main presentation.

Text classification is the problem of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a sample of pre-classified documents, which are used as a training corpus. A number of machine learning, probabilistic, and information retrieval based approaches have been proposed for text classification.

This talk proposes a novel graph-based mining approach for document classification. Our approach is based on the premise that representative – common and recurring – structures or patterns can be extracted from a pre-classified document class and then used effectively for classifying incoming documents. To the best of our knowledge, there is no existing work in the area of text, email or web page classification based on pattern inference and the utilization of the learned patterns for classification. A number of factors that influence representative structure extraction and classification are analyzed conceptually and validated experimentally. In our approach, the notion of inexact graph match is leveraged for deriving structures that provide coverage for characterizing the contents of a document class. The results of our approach are compared with those of the Naïve Bayes approach. We discuss both single and multi-folder classification for emails.
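For readers unfamiliar with the Naïve Bayes baseline mentioned above, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It illustrates only the baseline, not the graph-based InfoSift method; the training messages, folder names, and the function names train_nb and classify_nb are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs is a list of (text, label) pairs; returns the fitted counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify_nb(model, text):
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-classified emails, one folder name per message.
training = [
    ("meeting agenda project deadline", "work"),
    ("project budget meeting", "work"),
    ("family dinner weekend", "personal"),
    ("weekend trip family photos", "personal"),
]
model = train_nb(training)
print(classify_nb(model, "project meeting tomorrow"))
print(classify_nb(model, "family weekend"))
```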

This is joint work with my students Manu Aery and Aravind Venkatachalam.

Marc Franco Salvador: Knowledge Graph-based Natural Language Processing

A knowledge graph (KG) is a weighted and labeled graph that expands and relates the original concepts present in a set of words. A knowledge base (KB) is a weighted and directed graph, where nodes represent concepts (optionally in multiple languages), and edges represent semantic relations between them. KGs can be created as a subset of the original KB focused on the concepts belonging to a text, and on the intermediate concepts and relations between them. This knowledge graph representation is used as a replacement for traditional vector-based text representations to model the text context in a language-independent way, i.e. a concept in multiple languages is used as its multilingual representation. In this tutorial we first give an overview of the different resources that can be used to create KGs. Then we explain how to use a KB to create KGs. Finally we study how to use KGs to obtain state-of-the-art performance in different Natural Language Processing tasks: Word Sense Disambiguation, Cross-language (CL) Plagiarism Detection, CL Document Retrieval, CL Text Categorization and Cross-domain Polarity Categorization.

Asif Ekbal: Some Issues in Named Entity Recognition, Biotext Mining and Coreference Resolution

The talk starts with the various issues (definition, challenges, evaluation metrics, etc.) of named entity recognition and classification (NERC), followed by different approaches to NERC (for Indian as well as non-Indian languages), with special focus on ensemble learning and/or feature selection for NERC involving Indian languages and biomedical texts. The talk will also introduce the basic concepts of coreference resolution and the evolutionary optimization based approaches for solving the problem.

Reshma Khemchandani: Machine Learning Techniques for Text Documents Classification

With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic categorization of documents has become a key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, e-mails and digital libraries needs text mining and machine learning techniques to extract meaningful knowledge. The aim of this talk is to highlight the important techniques and methodologies that are employed in text document classification, while at the same time raising awareness of some of the interesting challenges that remain to be solved, focusing mainly on text representation and machine learning techniques.

Madhu Kumari: Social Computing in Online Social Media

The enormous growth of social media mandates the analysis and modelling of the phenomena that govern the whole landscape of individuals’ interactions and the design of this milieu. These phenomena arise from users’ intent, interaction and influence, which in turn affect the popularity, reputation and impact of any social forum. Therefore the first and most significant requirement is to group the different behaviours of users, as individuals as well as in a community, based on well-accepted notions and metrics. The outcomes of this classification can readily be exploited to develop a behavioural framework in terms of models that not only explain idiosyncratic behaviours of users but also capture the essence of the complex interplay of interactions among individuals, both in the open sphere and in a well-designated online community. This talk focuses on issues pertaining to the classification of users’ behaviour and its modelling, along with the dynamics of interactions in social media.

Jayadeva: Next generation machine learning with the MCM

Over the last decade and a half, support vector machines (SVMs) have become the paradigm of choice for most learning applications. The first part of this talk will focus on SVMs and how to use them. SVMs and their variants now provide state-of-the-art results for many applications.

Surprisingly, computational learning theory tells us that SVMs provide no guarantee for good generalization, and in fact, can do very poorly at times. It is known that the Vapnik-Chervonenkis or VC dimension of SVMs can be very large or infinite. In short, we do not have algorithms that can provide performance guarantees. In the last few years, new sources of data have emerged, ranging from high dimensional micro-array and bio-informatics data, to very large databases emanating from social networks and telecom service providers. The analysis of such big data sources demands performance guarantees.

In this talk, we introduce the Minimal Complexity Machine (MCM), which minimizes an exact bound on the VC dimension. This means that the VC dimension of an MCM classifier can be kept small, and it provides a radically new direction to learning. On a number of benchmark datasets, the MCM generalizes better than SVMs. The MCM typically uses one-third the number of support vectors used by an SVM; on many datasets, the ratio may be as large as 100, indicating that the MCM does indeed learn simpler representations.

David Eduardo Pinto: Narrow Domain Short Text Processing

Analysis of social media data is a rapidly growing area of research. People in the computational linguistics field are looking to extract a wide variety of information from these texts in order to address specific user needs, profile attitudes and intentions, target advertising, and so on, which may require application of the full range of natural language processing techniques.

However, many of the texts in question—including news feeds, document titles, FAQs, and tweets—exist as short, sometimes barely sentence-like snippets that do not always follow the lexical and syntactic conventions assumed by many language processing tools.

Many NLP analyses rely on the repetition of specific lexical items throughout the text in order to identify topic, genre, and other features; without sufficient context to enable such analyses, and because of their often eccentric grammatical style, short texts pose a new kind of challenge for language processing research. In this talk we will present different approaches for narrow domain short text processing. The discussion includes types of short texts, enrichment methods, machine learning approaches and applications.