Textual analysis
Textual analysis is widely used to identify key concepts and clusters in large bodies of text, revealing patterns that are difficult to spot by simply reading the file. For this assignment, the data were compiled into a single .csv file with five columns covering consumer reviews of various products associated with Amazon’s Alexa voice assistant: Rating, Formation Date, Product Variation, Detailed Unstructured Review Transcript, and Review Classification as Feedback. The file contained 3,150 rows. The work was performed in SAS Visual Text statistical software, which identifies common trends in text through statistical analysis and machine learning techniques. The work is presented in two parts: conceptual analysis and categorical analysis.
Conceptual Analysis
Concept: Bad Reviews
The main goal of conceptual analysis in SAS for the selected file was to extract conceptual entities from unstructured text. The practical value of this approach is that it divides the total volume of text into a number of conceptual clusters, which creates a classification system. The first concept focused on finding bad reviews, in which customers expressed full or partial dissatisfaction with Alexa products.
This concept included eleven of the most common synonyms associated with the word ‘Bad’ that consumers could potentially use to express negative feedback:
- CLASSIFIER:bad
- CLASSIFIER:worse
- CLASSIFIER:poor
- CLASSIFIER:inferior
- CLASSIFIER:awful
- CLASSIFIER:sad
- CLASSIFIER:negative
- CLASSIFIER:wrong
- CLASSIFIER:incorrect
- CLASSIFIER:worthless
- CLASSIFIER:useless
The “CLASSIFIER:” rule searches every review for the designated term. Together, the eleven rules matched exactly 69 of the 3,150 reviews (2.19%), as shown in the screenshot below. It is important to clarify that the rules were not configured to search for combinations of words, so each occurrence of a designated word was counted as a separate detection.
The next question is how often each term occurs within the matched cluster. This step matters because it shows how relevant each code is: a word that is never found contributes nothing to the concept. Unused words do not complicate the procedure here, but with extensive data any unnecessary terms can slow down processing and create additional, useless work. Within the 69 matched reviews, the terms (case-insensitive) were detected the following number of times:
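The matching behavior described above can be approximated outside SAS. The following is a minimal Python sketch, not the SAS implementation: it assumes whole-word, case-insensitive matching, and the function names and sample reviews are hypothetical, invented purely for illustration.

```python
import re

# Terms taken from the CLASSIFIER rules listed above.
NEGATIVE_TERMS = ["bad", "worse", "poor", "inferior", "awful", "sad",
                  "negative", "wrong", "incorrect", "worthless", "useless"]

# Assumption: whole-word, case-insensitive matching approximates the rule.
PATTERN = re.compile(r"\b(" + "|".join(NEGATIVE_TERMS) + r")\b", re.IGNORECASE)

def find_matches(review):
    """Return every negative term detected in one review, lowercased."""
    return [m.lower() for m in PATTERN.findall(review)]

def count_matching_reviews(reviews):
    """Count reviews that contain at least one negative term."""
    return sum(1 for r in reviews if find_matches(r))

# Hypothetical reviews, made up for this sketch (not from the data set).
sample_reviews = [
    "Sound quality is poor and the app is useless.",
    "Love it, works great!",
    "Bad speaker.",
]

print(count_matching_reviews(sample_reviews))
```

In this sketch the first sample review produces two detections but counts as only one matching review, which is the distinction between detections and matched reviews that the per-term counts depend on.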
- Bad = 21
- Worse = 2
- Poor = 7
- Inferior = 1
- Awful = 4
- Sad = 4
- Negative = 6
- Wrong = 15
- Incorrect = 0
- Worthless = 3
- Useless = 13
As expected, the per-term detections do not add up to the number of matched reviews: there were 76 detections against 69 reviews. The reason is that two different codes could appear in the same review, so the number of matching reviews is smaller than the number of term detections. In addition, the term “Incorrect” does not occur in any review, so it would be appropriate to remove it in further work. It is fair to note that not all of the reviews found were strictly negative; some contained both positive and negative evaluations. For example, some of these reviews combined the codes “Good” and “Poor.” In other words, the feedback detected in the conceptual analysis was not necessarily negative overall, but at a minimum it included information about flaws in Alexa products and negative experiences in using them.
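The arithmetic behind this discrepancy can be checked directly. The short Python sketch below uses only the figures reported above; the variable names are illustrative.

```python
# Per-term detection counts as reported in the list above.
term_counts = {
    "bad": 21, "worse": 2, "poor": 7, "inferior": 1, "awful": 4, "sad": 4,
    "negative": 6, "wrong": 15, "incorrect": 0, "worthless": 3, "useless": 13,
}
matched_reviews = 69
total_reviews = 3150

total_detections = sum(term_counts.values())            # 76
extra_detections = total_detections - matched_reviews   # 7
share_percent = round(100 * matched_reviews / total_reviews, 2)  # 2.19

# Terms that never occur are candidates for removal from the concept.
unused_terms = [term for term, n in term_counts.items() if n == 0]

print(total_detections, extra_detections, share_percent, unused_terms)
```

The seven extra detections correspond to matches in reviews that had already been counted once.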
Concept: Good Reviews
The second concept was diametrically opposed to the first: it involved not only a different philosophy but also different rules for finding codes. Finding reviews that contained fully or partially good ratings was necessary as a basis for judging favorable characteristics and consumer experiences. The strategic value of this method is that it allows us to study the customer experience and adjust marketing strategies based on what users like best.
Since the first concept was based on the “CLASSIFIER:” rule, a different rule type was used to form the second concept. It was decided to use the “CONCEPT_RULE:” rule, which supports Boolean operators for searching combinations. The specific rule that was used has the following form:
CONCEPT_RULE:(OR, “_c”, “_c”, “_c”, “_c”, “_c”, “_c”, “_c”, “_c”)
In this case, “OR” is a Boolean operator: the rule matches if at least one of the listed codes occurs in the text. The rule also uses the “@” symbol, which matches the designated word in any of its English forms, if they exist. For example, “good” is also matched as “better.”
The total number of matches found was 2,188 of 3,150 (69.5%), which at first glance indicates a much higher number of positive consumer reviews than negative ones. It is necessary to examine how often each of the terms in this rule occurs (case-insensitive):
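The logic of an OR rule with form expansion can be illustrated with a minimal Python sketch. This is not how SAS implements “@”: the FORMS table is a hypothetical, hand-written stand-in for the expansion, and the function names are invented for this example.

```python
import re

# Hypothetical inflection table standing in for the "@" expansion.
# Assumption: the real expansion is performed by SAS, not by a lookup like this.
FORMS = {
    "love": {"love", "loves", "loved", "loving"},
    "like": {"like", "likes", "liked", "liking"},
    "good": {"good", "better", "best"},
}

def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def or_rule(text, terms):
    """Return True if any form of at least one term occurs in the text."""
    words = tokens(text)
    # Terms without an entry in FORMS are matched literally.
    return any(words & FORMS.get(term, {term}) for term in terms)

print(or_rule("She loved the speaker", ["love", "great"]))
```

Because the operator is OR, a single matching form is enough for the whole rule to match, which is one reason the output of this concept is so large.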
- Easy = 318
- Great = 661
- Love = 828
- Happy = 57
- Amazing = 81
- Good = 249
- Awesome = 78
- Like = 435
One can see that the words “love” and “great” are the most frequent, which allows us to characterize consumers’ feelings about Alexa’s product line as favorable. However, it is worth clarifying that not all of the 2,188 matches are actually positive reviews (although most are). For example, the word “like” could have been used in a negative statement, as shown in the example review from the screenshot below. The word “like” could also have been used for comparison, in the sense of “as,” which further limits the usefulness of the output.
Categorical Analysis
Category: Correction
Categories are a narrower set than concepts because they test each document against a statement that is either true or false under ordinary Boolean logic. The first category uses the “NOT” operator, which excludes an argument from the output. The purpose of this tool is to retrieve positive reviews that are free of negative evaluations. For this reason, the following rule is used:
(AND, (AND, (AND, (AND, (AND, (AND, (AND, (AND, (OR, “Easy”, “Great”, “Happy”, “Amazing”, “Good”, “Awesome”, “Love”), (NOT, “Bad”)), (NOT, “Poor”)), (NOT, “Worse”)), (NOT, “Inferior”)), (NOT, “Wrong”)), (NOT, “Awful”)), (NOT, “Sad”)), (NOT, “Negative”))
Notice that the output became smaller, as expected: the rule requires (AND) that a review contain any (OR) of the positive words and none (NOT) of the eight listed negative words. Note that the word “Like” was not added to this rule, as it could have been used in different contexts, which would have distorted the results. The reviews found therefore contain positive terms and none of the listed negative terms, which means that this category served as a text filter.
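The nested AND/NOT structure is logically equivalent to “at least one positive word and none of the excluded words.” The following Python sketch illustrates that equivalence under the assumption of exact, case-insensitive word matching; it is not the SAS category engine, and the function names are hypothetical.

```python
import re

# Word lists copied from the category rule above.
POSITIVE = {"easy", "great", "happy", "amazing", "good", "awesome", "love"}
EXCLUDED = {"bad", "poor", "worse", "inferior", "wrong", "awful", "sad", "negative"}

def words(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def passes_filter(text):
    """(OR of positive words) AND (NOT any excluded word)."""
    w = words(text)
    has_positive = bool(w & POSITIVE)
    has_excluded = bool(w & EXCLUDED)
    return has_positive and not has_excluded

print(passes_filter("Great sound and easy setup"))
print(passes_filter("Great sound but poor microphone"))
```

Flattening the chain of NOT clauses into a single set test keeps the condition readable and makes it easy to extend the excluded list.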
Category: Competition
Competitive analysis is essential for business, so it is appropriate to examine which reviews mention specific competing devices. A simple “OR” operator is sufficient here, since it returns only those reviews in which at least one of the designated codes occurs. The specific form of the rule is as follows:
(OR, “Google@”, “Apple@”, “Siri@”, “HomePod@”)
Only the specified codes were chosen for analysis, although they do not represent the entire global set of voice assistants and brands. Again, the “@” symbol indicates that any form of the designated code is matched. The final output contained 74 reviews, which is the number of reviews in which customers referred to competitors. It is worth clarifying that a mention of a competitor does not imply negative feedback, as the comparison can favor Amazon.