Issues in Text Mining Expository Essay

Table of Contents

Internet Application
Problems Associated with Text Mining
How Text Mining has affected the Internet
A Report on Text Mining Applications
Reference List

Internet Application

Advances in internet applications increasingly allow users to perform complex tasks that are network-centric, omnipresent, knowledge-intensive, and computationally demanding (Wen-Chung et al, 2010).

Today, more than ever before, internet users can use internet applications such as Weka4WS to mine data. According to Talia & Trunfio (2010), Weka4WS is a broadly used open source data mining tool that utilizes the distributed architecture to allow the mining of data in WS-Resource Framework (WSRF) enabled grids.

Problems Associated with Text Mining

Some of the problems associated with text mining, according to Lee et al (2010), include: word mismatch challenges; difficulties in determining the number of topics; the incapacity of some tools to model relations among topics; difficulties, in some applications, in labeling a topic using the words in the topic; and difficulties in interpreting loading values with probability meaning.

How Text Mining has affected the Internet

Text mining, also referred to as text analytics, has affected the internet in major ways. In its most basic form, text mining refers to the process of extracting important data and knowledge from unstructured text, employing strong sifting protocols that enable users to separate trivial information from important data (Weiss et al, 2010).

Owing to the importance of data in managing customer relationships, forecasting business scenarios, and making important decisions, organizations across the world are investing heavily in information and communication technologies to bring information closer to where it is most needed.

The internet happens to be a major carrier of contextual data, and hence it has been influenced in major ways by text mining processes and applications (Feldman & Sanger, 2006).

The internet is increasingly being used by people who need to discover new information. For example, many marketing executives use the internet to learn about new, industry-specific trends in fulfilling customer demands.

It is important to note that such discoveries require 100% recall; that is, these marketing executives cannot afford to miss any relevant information, thus the need to engage in text mining as opposed to conducting a search on the internet using a standard search engine (Weiss et al, 2010). This orientation underlines how text mining has affected the use of the internet.

In equal measure, it can be argued that the internet and the World Wide Web make enormous quantities of data easily accessible for analysis and consumption through the process of text mining.

Indeed, it is difficult to overstate the impact of the internet on the dissemination and sharing of critical information. This observation implies that text mining depends to a large extent on the internet, just as the internet has proved valuable to most professional users because text mining processes are available (Weiss et al, 2010).

It is common knowledge that the internet contains vast amounts of information, much of which may be of little practical value in the absence of text mining processes and applications. For example, a financial analyst may be unable to find value in the thousands upon thousands of finance-related documents on the internet; with the right text mining tools and applications, however, the analyst can locate the relevant information.

Lastly, it can be argued that the internet has been turned from a general repository of electronic information into a profit-making enterprise with a core function of dispensing specific knowledge and critical information, courtesy of the text mining processes and applications (Feldman & Sanger, 2006).

One only needs to look at the multiplier effects the internet has had on accredited research institutions, academic institutions, and the business fraternity to understand that the practice of text mining has changed how the internet is used in major ways.

For example, a nurse interested in making a clinical decision on the best practice for treating a heart condition only needs to undertake text mining to review what peer hospitals have been doing to manage the condition. In this perspective, the internet becomes a tool for knowledge actualization and empowerment (Weiss et al, 2010).

A Report on Text Mining Applications

Contemporary information systems allow organizations to capture and use vast amounts of data. There exists compelling evidence to the effect that much of this data or information is structured to the extent that it can be effectively and efficiently analyzed using traditional database software (Talia & Trunfio, 2010).

Increasingly, however, vast quantities of data, such as textual data, are unstructured and therefore defy simple efforts to synthesize and make sense of them.

In essence, traditional analysis of such unstructured textual data becomes an impractical endeavor (Lee et al, 2010). It is against this background that this report aims to discuss several text mining methods that could be used to analyze and make sense of such unstructured data, including discussing their benefits as well as their pitfalls.

According to Lee et al (2010), “…text mining has been developed from descriptive models such as latent semantic analysis to generative models such as probabilistic latent semantic analysis, latent dirichlet allocation, and correlated topic model” (p. 2).

Latent semantic analysis (LSA) models of text mining, according to Lee et al (2010) and Talia & Trunfio (2010), project an original vector space, or term-document matrix, into a small factor space, using singular value decomposition to achieve the dimensional reduction of the matrix.

One of the major benefits of the LSA models of text mining is that they have the capability to extract topics or factors from the unstructured textual information/data through the use of factor loadings or matrix rotation (Wei & Bo, 2009).

It should be noted that, by virtue of the orthogonal features of the factors, words contained in a particular factor have minimal relations with words in other factors but strongly positive associations with the other words in the same factor (Wei & Bo, 2009).
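
The LSA projection described above can be illustrated with a minimal sketch. It assumes NumPy is available; the six-word vocabulary, the term counts, and the helper names `lsa` and `top_terms` are invented for illustration and do not come from the cited studies. The sketch applies singular value decomposition to a toy term-document matrix, keeps two factors, and lists the terms with the largest absolute loadings on each factor.

```python
import numpy as np

# Hypothetical vocabulary and term-document counts (rows = terms, columns = documents).
terms = ["stock", "market", "bank", "heart", "patient", "clinic"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 0, 1, 2],
], dtype=float)


def lsa(term_doc, k):
    """Project a term-document matrix into a k-dimensional factor space via SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_loadings = U[:, :k] * s[:k]   # one row per term, one column per factor
    doc_coords = Vt[:k, :].T * s[:k]   # one row per document, one column per factor
    return term_loadings, doc_coords, s


def top_terms(factor, n=3):
    """Return the n terms with the largest absolute loading on a factor."""
    order = np.argsort(-np.abs(term_loadings[:, factor]))[:n]
    return [terms[i] for i in order]


term_loadings, doc_coords, singular_values = lsa(X, k=2)

for factor in range(2):
    print(f"factor {factor}:", top_terms(factor))
```

Because the two groups of documents in this toy matrix share no terms, each retained factor loads on only one group of words, which mirrors the point that words associate strongly within a factor and minimally across factors. Real corpora are rarely this clean, and choosing the number of factors `k` remains a judgment call.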

This particular orientation is beneficial to users because it not only allows them to sift and retain data that is specifically relevant to their undertaking, but it also facilitates easier arrangement of information in the folders. As such, LSA models of text mining are increasingly beneficial in search and retrieval, automatic essay grading, spam filtering, and topic detection.

Among the pitfalls, it is often difficult to determine the number of topics when using the LSA models, difficult to interpret loading values with probability meaning, and, in some instances, challenging to label a topic using the words in the topic (Lee et al, 2010).

The second text mining model that will be discussed in this report is referred to as the latent Dirichlet allocation (LDA) model. As observed by Lee et al (2010), the LDA model “…incorporates the generative process of documents with Dirichlet distribution” (p. 4). The LDA application operates using the following three steps.

In the initial step, the number of words or phrases in a document is determined by sampling from a Poisson distribution.

In the next step, a distribution over probable topics for a document is generated using the Dirichlet distribution. Lastly, topics are generated and then words or phrases for each topic are drawn based on the document-specific distribution (Lee et al, 2010).
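
The three steps above can be sketched in a few lines of Python. This is a simplified illustration that assumes NumPy; the two topics, their word probabilities, the Dirichlet parameter `alpha`, and the function name `generate_document` are hypothetical choices made for the example rather than values from Lee et al (2010). The sketch only generates a synthetic document; it does not fit a model to real text.

```python
import numpy as np


def generate_document(rng, topic_word, alpha, mean_length):
    """Draw one synthetic document following the three generative steps."""
    # Step 1: sample the document length from a Poisson distribution.
    n_words = rng.poisson(mean_length)
    # Step 2: sample document-specific topic proportions from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    # Step 3: for each word position, pick a topic according to theta,
    # then pick a word from that topic's distribution over the vocabulary.
    n_topics, vocab_size = topic_word.shape
    topics = rng.choice(n_topics, size=n_words, p=theta)
    words = np.array(
        [rng.choice(vocab_size, p=topic_word[z]) for z in topics], dtype=int
    )
    return theta, topics, words


# Hypothetical vocabulary and per-topic word probabilities (each row sums to 1).
vocab = ["stock", "market", "bank", "heart", "patient", "clinic"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # illustrative "finance" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # illustrative "health" topic
])
alpha = np.array([0.5, 0.5])  # assumed Dirichlet parameter

rng = np.random.default_rng(0)
theta, topics, words = generate_document(rng, topic_word, alpha, mean_length=12)

print("topic proportions:", np.round(theta, 3))
print("document:", " ".join(vocab[w] for w in words))
```

Since `theta` is drawn separately for every document, a single document can mix several topics in different proportions, which is consistent with the observation that LDA can model documents covering multiple topics.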

As observed by Misra et al (2011), the LDA application of text mining is not only beneficial in filtering information on specified meta-details when searching or viewing, but it can also be effectively used for modeling long-length documents that include multiple topics.

Another advantage of the LDA application is that it can be applied in wide-ranging areas, including automatic essay grading, automatic labeling, sentiment summarization, role discovery, emotion topic, and word sense disambiguation (Misra et al, 2011; Lee et al, 2010).

Lastly, the third text mining application under discussion is the correlated topic model (CTM). One of the underlying features of CTM, according to Shehata et al (2010), is that it explicitly addresses relations among topics by employing the logistic normal distribution.

Lee et al (2010) notes that “…in LDA, it is assumed that word in topics comprises a multinomial distribution, and multiple topics can be appeared in a document and the proportion of topics varies according to Dirichlet distribution” (p. 4).

CTM has distinct advantages. For polysemous words, which carry diverse meanings or connotations, CTM is able to represent these words in several topics simultaneously, implying that it is indeed possible to search the meta-details of any words or phrases from the already-found documents (Shehata et al, 2010).

To reflect topic relation, which is an important attribute in any text mining exercise, “…CTM employs more flexible logistic normal distribution by considering covariance structure among the components of proportions” (Lee et al, 2010, p. 4).
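
The role of the covariance structure can be made concrete with a small sketch. It assumes NumPy, and the mean vector, the covariance matrix, and the function name `sample_topic_proportions` are illustrative assumptions, not parameters reported in the cited work. The sketch draws Gaussian vectors with a chosen covariance and maps them onto topic proportions with a softmax, which is one common way of describing a logistic normal draw.

```python
import numpy as np


def sample_topic_proportions(rng, mu, sigma, n_docs):
    """Draw topic proportions from a logistic normal distribution."""
    # Gaussian draws carry the covariance structure among topics.
    eta = rng.multivariate_normal(mu, sigma, size=n_docs)
    # Softmax maps each Gaussian vector to proportions that sum to 1.
    eta = eta - eta.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(eta)
    return weights / weights.sum(axis=1, keepdims=True)


# Assumed parameters: topics 0 and 1 are strongly correlated, topic 2 is independent.
mu = np.zeros(3)
sigma = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

rng = np.random.default_rng(0)
theta = sample_topic_proportions(rng, mu, sigma, n_docs=5000)
corr = np.corrcoef(theta, rowvar=False)

print("corr(topic 0, topic 1):", round(float(corr[0, 1]), 3))
print("corr(topic 0, topic 2):", round(float(corr[0, 2]), 3))
```

Under these assumed parameters, the proportions of topics 0 and 1 should tend to rise and fall together more than those of topics 0 and 2. A Dirichlet distribution has no such covariance parameter, which is why LDA cannot express this kind of relation between topics directly.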

As such, CTM can also be used in wide-ranging areas, including topic detection, object tracking, image retrieval, and query classification. Among the pitfalls, CTM not only requires complex computations, but also tends to produce topics that contain overly general words (Lee et al, 2010).

To conclude, it is imperative to note that the models of text mining discussed above represent a fundamental shift from the traditional methods of data analysis to contemporary techniques capable of working with unstructured textual information to establish relationships (Wen-Chung et al, 2010). The task for users, particularly business users, is to select the text mining model they can rely on to achieve optimal results when analyzing a multiplicity of meta-details and topic discourses.

Reference List

Feldman, R., & Sanger, J. (2006). The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge: Cambridge University Press

Lee, S., Song, J., & Kim, Y.J. (2010). An empirical comparison of four text mining methods. Journal of Computer Information Systems, 51(1), 1-10. Retrieved from Business Source Premier Database

Shehata, S., Karray, F., & Kamel, M.S. (2010). An efficient model for enhancing text categorization using sentence semantics. Computational Intelligence, 26(3), 215-231. Retrieved from Academic Search Premier

Talia, D., & Trunfio, P. (2010). How distributed data mining tasks can thrive as knowledge services. Communications of the ACM, 53(7), 132-137. Retrieved from Business Source Premier Database

Wei, W., & Bo, Y. (2009). Text categorization based on combination of modified back propagation neural network and latent semantic analysis. Neural Computing & Applications, 18(8), 875-881. Retrieved from Academic Search Premier Database

Weiss, S.M., Indurkhya, N., & Zhang, T. (2010). Fundamentals of predictive text mining. Warren, MI: Springer

Wen-Chung, S., Chao-Tung, Y., & Shian-Shyong, T. (2010). Performance-based data distribution for data mining applications on grid computing. Journal of Supercomputing, 52(2), 171-198. Retrieved from Academic Search Premier Database