I. Introduction to Text Mining
    I.1 Defining Text Mining
    I.2 General Architecture of Text Mining Systems
II. Core Text Mining Operations
    II.1 Core Text Mining Operations
    II.2 Using Background Knowledge for Text Mining
    II.3 Text Mining Query Languages
III. Text Mining Preprocessing Techniques
    III.1 Task-Oriented Approaches
    III.2 Further Reading
IV. Categorization
    IV.1 Applications of Text Categorization
    IV.2 Definition of the Problem
    IV.3 Document Representation
    IV.4 Knowledge Engineering Approach to TC
    IV.5 Machine Learning Approach to TC
    IV.6 Using Unlabeled Data to Improve Classification
    IV.7 Evaluation of Text Classifiers
    IV.8 Citations and Notes
V. Clustering
    V.1 Clustering Tasks in Text Analysis
    V.2 The General Clustering Problem
    V.3 Clustering Algorithms
    V.4 Clustering of Textual Data
    V.5 Citations and Notes
VI. Information Extraction
    VI.1 Introduction to Information Extraction
    VI.2 Historical Evolution of IE: The Message Understanding Conferences and Tipster
    VI.3 IE Examples
    VI.4 Architecture of IE Systems
    VI.5 Anaphora Resolution
    VI.6 Inductive Algorithms for IE
    VI.7 Structural IE
    VI.8 Further Reading
VII. Probabilistic Models for Information Extraction
    VII.1 Hidden Markov Models
    VII.2 Stochastic Context-Free Grammars
    VII.3 Maximal Entropy Modeling
    VII.4 Maximal Entropy Markov Models
    VII.5 Conditional Random Fields
    VII.6 Further Reading
VIII. Preprocessing Applications Using Probabilistic and Hybrid Approaches
    VIII.1 Applications of HMM to Textual Analysis
    VIII.2 Using MEMM for Information Extraction
    VIII.3 Applications of CRFs to Textual Analysis
    VIII.4 TEG: Using SCFG Rules for Hybrid Statistical–Knowledge-Based IE
    VIII.5 Bootstrapping
    VIII.6 Further Reading
IX. Presentation-Layer Considerations for Browsing and Query Refinement
    IX.1 Browsing
    IX.2 Accessing Constraints and Simple Specification Filters at the Presentation Layer
    IX.3 Accessing the Underlying Query Language
    IX.4 Citations and Notes
X. Visualization Approaches
    X.1 Introduction
    X.2 Architectural Considerations
    X.3 Common Visualization Approaches for Text Mining
    X.4 Visualization Techniques in Link Analysis
    X.5 Real-World Example: The Document Explorer System
XI. Link Analysis
    XI.1 Preliminaries
    XI.2 Automatic Layout of Networks
    XI.3 Paths and Cycles in Graphs
    XI.4 Centrality
    XI.5 Partitioning of Networks
    XI.6 Pattern Matching in Networks
    XI.7 Software Packages for Link Analysis
    XI.8 Citations and Notes
XII. Text Mining Applications
    XII.1 General Considerations
    XII.2 Corporate Finance: Mining Industry Literature for Business Intelligence
    XII.3 A “Horizontal” Text Mining Application: Patent Analysis Solution Leveraging a Commercial Text Analytics Platform
    XII.4 Life Sciences Research: Mining Biological Pathway Information with GeneWays
Appendix A: DIAL: A Dedicated Information Extraction Language for Text Mining
    A.1 What Is the DIAL Language?
    A.2 Information Extraction in the DIAL Environment
    A.3 Text Tokenization
    A.4 Concept and Rule Structure
    A.5 Pattern Matching
    A.6 Pattern Elements
    A.7 Rule Constraints
    A.8 Concept Guards
    A.9 Complete DIAL Examples
Bibliography
Index