Engineering and Computer Science, 2019

Sunita Sarawagi

Institute Chair Professor, Computer Science and Engineering, Indian Institute of Technology, Bombay

The Infosys Prize 2019 in Engineering and Computer Science is awarded to Prof. Sunita Sarawagi for her research in databases, data mining, machine learning and natural language processing, and for important applications of these research techniques. The prize recognizes her pioneering work in developing information extraction techniques for unstructured data.

Bio

Prof. Sunita Sarawagi is Institute Chair Professor in Computer Science and Engineering at IIT-Bombay. Prof. Sarawagi received her B.Tech. in Computer Science from the Indian Institute of Technology, Kharagpur in May 1991. She received her M.S. and Ph.D. in Computer Science from the University of California at Berkeley where she studied under Michael Stonebraker.

Following her Ph.D. Sarawagi did stints at IBM Almaden Research Center as research scholar, Carnegie Mellon as visiting faculty, and joined IIT-Bombay in 1999.

Between July 2014 and July 2016 Prof. Sarawagi was Visiting Scientist at Google Inc. in Mountain View where she worked on deep learning models for personalizing and diversifying YouTube and Play recommendations, improving Duo’s conversation assistance engine, and extracting attributes of classes from the Knowledge Graph.

Among her many awards are IBM Faculty award (2003 and 2008); Fellow of the Indian National Academy of Engineering (INAE) (2013); PAKDD Most Influential Paper Award 2014 for the paper: “Discriminative Methods for Multi-labeled Classification Shantanu Godbole and Sunita Sarawagi in PAKDD 2004”.

Prof. Sarawagi has several patents to her name. They include a patent for “Database System and Method Employing Data Cube Operator for Group-By Operations” and a patent for “Efficient evaluation of queries with mining predicates”.

Scope and impact of work

Prof. Sunita Sarawagi’s research is based on the development of fundamental principles and has had profound practical impact. Both these characteristics can be illustrated using just two examples from Prof. Sarawagi’s many papers.

Postal addresses have a structure: country, state, PIN, city, street, and so on. However, postal addresses that appear on the web and in other repositories are continuous text and often have some of these attributes missing. A challenge is to convert such unstructured text into structured information, which is much more efficient for handling queries and other applications.

Previous work on this problem had taken largely ad hoc approaches that were often labor-intensive. Prof. Sarawagi extended the theory of Hidden Markov Models (HMM) to solve this problem automatically. She and her colleagues created a software package, DATAMOLD, which has been used by many companies to improve address structuring in India.

The second example is Sarawagi’s work on extracting numerical information from unstructured text containing numbers on the web. Examples of queries with numerical answers are: “What is Microsoft’s revenue?” and “How many calories in a pizza?” The queries are imprecise: What size pizza and with what toppings? What are the units: calories or kilocalories? Nevertheless, users post many such questions to search engines and expect an answer.

Prof. Sarawagi and her colleagues developed QuTree to deal with units, and a probabilistic model for collective consensus. They implemented a family of algorithms, tested them, and reported the results of their algorithms comparing them with ground truth (such as values reported by the World Bank). These papers exhibit a systematic approach to build foundational models and theories, and then develop software and carry out testing on critical problems.

Citation by the Jury

Prof. Sunita Sarawagi was one of the earliest researchers to develop information extraction techniques that went beyond the world of structured databases to the kind of unstructured data one finds on the World Wide Web. This necessitated the use of novel machine learning techniques for extraction of information from natural language text. For example, Prof. Sarawagi and colleagues showed how one could extract and analyze unstructured numerical data in the web and other sources. She developed the formalism of semi-Markov conditional random fields for the task of segmenting out sequences of words which might correspond to “named entities” such as company names or job titles.

Sarawagi’s research has had valuable practical applications such as the development of software for cleaning and structuring Indian addresses, as well as de-duplicating them.

Congratulatory message from the Jury Chair—Arvind

“Congratulations Prof. Sunita Sarawagi for winning the Infosys Prize in Engineering and Computer Science. Your pioneering research in using machine learning to analyze and understand unstructured data makes it possible to use the wealth of information in the worldwide web and other sources for the betterment of society and for creating new businesses. You richly deserve this award.”