The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is to turn text into data for analysis via natural language processing (NLP).
The term also describes the application of text analysis to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text[1]. These techniques and processes discover and present knowledge (facts, business rules, and relationships) that is otherwise locked in textual form, impenetrable to automated processing.
A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
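As a rough illustration of scanning a document set and populating a search index, the sketch below builds a small inverted index and a word frequency table. The sample documents, the simple alphanumeric tokenizer, and the function names (`tokenize`, `build_index`, `term_frequencies`) are assumptions made for this example; real systems typically use far richer linguistic processing.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: lowercase alphanumeric runs only.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def term_frequencies(documents):
    """Count how often each token occurs across the whole document set."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


# Made-up sample corpus, for illustration only.
documents = {
    "doc1": "Ford reported strong quarterly sales.",
    "doc2": "Harrison Ford stars in a new film.",
    "doc3": "Quarterly sales fell at the film studio.",
}

index = build_index(documents)
freqs = term_frequencies(documents)

print(sorted(index["ford"]))      # ids of documents mentioning "ford"
print(freqs.most_common(3))       # most frequent tokens in the corpus
```

Note that a lookup for "ford" returns both the manufacturer document and the actor document, which is why later steps such as entity disambiguation are needed.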
Text Analysis Processes
Subtasks -- components of a larger text-analysis effort -- typically include:
- Information Retrieval or identification of a Corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content management system, for analysis.
- Named Entity Recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation, the use of contextual clues, may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star (Glenn or Harrison?), or some other entity.
- Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expressions or other pattern matching.
- Coreference: identification of noun phrases and other terms that refer to the same object. For example, anaphora is a type of coreference.
- Relationship, Fact, and Event Extraction: identification of associations among entities and other information in text.
- Sentiment Analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analysis techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object[2].
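The recognition of pattern-identified entities listed above can be sketched with regular expressions. This is a minimal example under stated assumptions: the patterns cover only a simple e-mail shape, a ten-digit phone format with separators, and a short hypothetical list of metric units. The `PATTERNS` table and `extract_entities` function are illustrative names, not part of any particular product.

```python
import re

# Illustrative patterns only; production systems need broader coverage.
PATTERNS = {
    # Simple e-mail shape: local part, "@", domain with a dotted suffix.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Ten-digit phone number written as 3-3-4 digits with -, ., or space.
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    # A number followed by one of a few assumed metric units.
    "quantity": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|mg|ml|cm|mm|g|m|l)\b"),
}


def extract_entities(text):
    """Return a dict mapping each entity type to the list of matches found."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = "Contact jane.doe@example.com or call 555-867-5309 about the 2.5 kg shipment."
entities = extract_entities(sample)
for name, matches in entities.items():
    print(name, matches)
```

Pattern matching of this kind suits features with a regular surface form; named entities such as "Ford" generally need gazetteers or statistical models plus contextual disambiguation instead.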
Applications
The earliest text-analysis applications, dating to the late 1990s, emerged in life-sciences research and government intelligence. The technology is now applied to a wide variety of government, research, and business needs. Applications can be sorted into categories by analysis type or by business function. Sorted this way, application categories include:
- Enterprise Business Intelligence/Data Mining, Competitive Intelligence
- E-Discovery, Records Management
- National Security/Intelligence
- Scientific Discovery, especially Life Sciences
- Sentiment Analysis Tools, Listening Platforms
- Natural Language/Semantic Toolkit or Service
- Publishing
- Search/Information Access
Software
There are many text analysis research, commercial, and open source software options. Some are comprehensive solutions; others handle particular subtasks.
Commercial Software
- AeroText - provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
- AlchemyAPI - web-based text analysis API: document categorization, language identification, term extraction, named entities, etc. Multi-lingual support.
- IBM LanguageWare is the IBM suite for Text Analysis (Tools and Runtime).
- Infonic provides commercial sentiment analysis of financial news feeds for the Thomson Reuters RMDS trading information system. The "sentiment scores" that this software provides are used within algorithmic trading systems by several major trading banks. Infonic also develops unique document summarization and textual navigation technologies that aid in Knowledge Management.
- Nstein Technologies - text mining solution that creates rich metadata to allow publishers to increase page views and site stickiness, optimize SEO, automate tagging, improve search experience, increase editorial productivity, decrease operational publishing costs, and increase online revenues. In combination with search engines it is used to create semantic search applications.
- SPSS - provider of PASW Text Analysis for Surveys and PASW Text Analysis, advanced NLP-based text analysis software (multi-lingual sentiment, event, and fact extraction) that can be used in conjunction with SPSS Predictive Analysis Solutions.
- Execware - publisher of Reason, a PC program with patented automated data tables for visually detecting connections in text and numeric data about objects, events, people, places, or anything else.
- Webropol - provides text mining built into its online survey software[3]
Open-Source Software
- GATE - General Architecture for Text Engineering, an open-source toolbox for natural language processing
- UIMA - Unstructured Information Management Architecture
- RapidMiner - open-source software for data and text mining
See also
- Noisy text analytics
- Information extraction
- Computational linguistics
- Natural language processing
- Named entity recognition
- Identity resolution
- Text mining
- News analytics
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)




