Business Analytics. Can We Trust Big Data? Case Study


Table of Contents

Big data: concept and implications
Hadoop analytic platforms
Works Cited

Big data: concept and implications

Today, working with data is one of the key concerns of corporate leaders. Almost nine out of ten financial professionals believe that data in particular has the potential to change the approach to doing business (Erl, Khattak, and Buhler 9). Other studies have shown that just over half (51%) of corporate leaders rank big data and analytics among their top ten corporate priorities (Ankam 14). All this suggests that, after the IT revolution of recent decades, business is entering an era in which data is becoming a driving force.

The advent of open source platforms such as Hadoop and, later, Spark has played a significant role in the spread of big data, as these tools simplify the processing of big data and reduce the cost of storage. Over the years, the volume of big data has grown by orders of magnitude. Sources of Big Data include measuring devices, message flows from social communities, weather data, Earth remote sensing data, GPS signals that mobile operators receive about the location of their subscribers, audio and video recording devices, and so on.

The development and widespread adoption of these sources is expected to deepen the penetration of Big Data technologies both in research and development and in the business, financial, and public administration sectors (Schroeder 8). The results of processing large amounts of information are used to identify trends and patterns.

Frank Akito, one of the leading Big Data experts, believes that the Internet is the strongest factor in expanding the range of Big Data applications (Akito as cited in Ankam 36). The more devices are connected to the Internet, the more information appears on the network that can be used to conduct business. By analyzing the data obtained, companies study what principles guide a consumer in choosing a product or service. Marketers then build a model of potential consumer behavior and launch an appropriate advertising campaign.

In essence, the concept of big data implies working with information of enormous volume and diverse composition, which is updated very often and located in different sources, in order to increase work efficiency, create new products, and increase competitiveness. According to the Oracle white paper Oracle Information Architecture: An Architect’s Guide to Big Data, one approaches information differently when working with big data than when doing business analysis (Oracle). Working with big data is not like the usual business analytics process, where simply adding up known values produces the result: for example, the sum of paid bills gives the sales volume for the year.

When working with big data, the result is obtained by refining the data through sequential modeling: first, a hypothesis is put forward; then a statistical, visual, or semantic model is built; the validity of the hypothesis is verified against that model; and then the next hypothesis is put forward. This process requires the researcher to interpret visual representations, make interactive queries based on knowledge, or develop adaptive machine learning algorithms that can produce the desired result. Moreover, the lifetime of such an algorithm can be quite short.

The Forrester consulting company provides a concise and reasonably clear statement: “Big data combines techniques and technologies that extract meaning from data at the extreme limit of practicality” (Forrester as cited in Yu and Guo 52). Today, Big Data is characterized by three features: Volume, meaning that the accumulated database represents a large amount of information; Velocity, meaning the growing speed at which data accumulates; and Variety, meaning the possibility of simultaneously processing structured and unstructured information in multiple formats.

Based on the definition of Big Data, one can formulate the basic principles of working with such data, which all modern big data tools follow in one way or another (Erl, Khattak, and Buhler 38-41):

Horizontal scalability. Since the amount of data can be arbitrarily large, any system that processes big data must be extensible.
Fault tolerance. The principle of horizontal scalability implies that there can be many machines in a cluster. For example, Yahoo’s Hadoop cluster has more than 42,000 machines. This means that some of these machines are certain to fail. Methods of working with big data should take into account the possibility of such failures and survive them without any significant consequences.
Locality of data. In large distributed systems, data is distributed across a large number of machines. If the data is physically located on one server and processed on another, the cost of data transfer may exceed the cost of the processing itself. Therefore, one of the most important principles for designing big data solutions is the principle of data locality: if possible, data should be processed on the same machine on which it is stored.
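
The fault-tolerance and data-locality principles above can be illustrated with a small scheduling sketch. It is not how Hadoop's own scheduler is implemented. The function name schedule_tasks, the node names, and the block-to-replica map are all hypothetical and serve only to show the idea: each block is stored on several machines, and each processing task is sent to a live machine that already holds the block.

```python
from typing import Dict, List, Set


def schedule_tasks(block_replicas: Dict[str, List[str]], live_nodes: Set[str]) -> Dict[str, str]:
    """Assign every block to a live node that already stores a replica of it."""
    load = {node: 0 for node in live_nodes}  # tasks assigned so far, per node
    plan: Dict[str, str] = {}
    for block in sorted(block_replicas):
        # Only replica holders that are still alive can process the block locally.
        holders = [n for n in block_replicas[block] if n in live_nodes]
        if not holders:
            raise RuntimeError(f"block {block!r} has no live replica")
        # Among the holders, pick the least loaded one (ties broken by name).
        node = min(holders, key=lambda n: (load[n], n))
        load[node] += 1
        plan[block] = node
    return plan


# Hypothetical three-node cluster in which node n3 has failed.
replicas = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n3", "n1"]}
plan = schedule_tasks(replicas, live_nodes={"n1", "n2"})
print(plan)
```

Because every block in this toy cluster has two replicas, the loss of one machine leaves each block with a live holder, and no data has to cross the network before processing.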

Today, Big Data is actively used by companies around the world. Companies such as Google, IBM, Visa, Mastercard, HSBC, AT&T, and Coca-Cola are already using Big Data resources. Procter & Gamble uses Big Data technologies to design modern products and develop global marketing campaigns. The company has opened Business Spheres, specialized offices for viewing information in real time, which give its management the opportunity to test hypotheses and conduct experiments instantly. Procter & Gamble believes that Big Data helps in predicting company performance (Finch 33-38).

Netflix also analyzes user data: employees of the service developed an algorithm that produces high-quality film recommendations. Moreover, the company used the accumulated information to create its own original content, which can compete with the best cable TV products.

Based on the huge amounts of data about users available from open sources (for example, what they read, how much time they spend on certain pages, what they do on social networks, and whether they tend to make purchases in online stores), businesses can build accurate “portraits” of their potential customers. This allows them to discover new target audiences and, more broadly, to conduct in-depth analysis of their own business and make more informed decisions.

Big Data makes it possible to build, in real time, statistical models that are close to reality and reveal previously hidden, non-obvious patterns; these correlations offer marketers unprecedented opportunities. An important distinctive feature of Big Data methods is the ability to answer the questions “why it happens this way,” “what should be expected in the future,” and even “what actions will help to achieve the desired result” (Schroeder 10-11). Previously, such answers could be given only by a whole staff of experts who analyzed the incoming information for weeks.

As the trading potential of companies develops, traditional databases cease to meet the growing requirements of the business, and the system can no longer provide adequate detail for management accounting. With the move to big data, new technologies make it possible to optimize goods distribution management, keep data relevant and process it quickly enough to assess the consequences of managerial decisions, and quickly generate management reports.

The total amount of accumulated data is more than 100 exabytes, and Walmart alone processes 2.5 petabytes of data per hour with the help of big data. At the same time, operational profitability increases by 60% due to the use of Big Data technologies, and according to Hadoop statistics, after the implementation of Big Data, analytics productivity rises to the point of processing 120 algorithms, and profit grows by 7-10% (Finch 43-44). Undoubtedly, the variety and volume of data resulting from all kinds of interactions is a powerful base for a business to build and refine forecasts, identify patterns, evaluate performance, and so on.

In a survey conducted by Accenture Analytics, representatives of organizations using big data today overwhelmingly report satisfaction with the results (Accenture Analytics as cited in Finch 60). They see big data as a catalyst for moving their company to the digital level. Large companies are increasingly aware of the extremely high importance of big data for the implementation of their digital strategy.

Big Data technologies involve working with huge amounts of information. There is no universal method for processing Big Data, but various methods can be combined to solve parts of the task. Successful application of the Big Data concept at any enterprise can seriously increase work efficiency and stimulate the creation of new products. At the same time, the development of Big Data processing technologies is a very promising area of activity.

Big data can be applied in a variety of areas, but it is important to understand the pros and cons of this tool. Despite the obvious and potential advantages of Big Data, its use has some drawbacks, which are primarily associated with large amounts of information, different methods of access to it, and often insufficient resources for the information security function in organizations. Although big data is a very promising technology, it is impossible to turn a blind eye to a number of problems that the widespread adoption of analytical software can easily lead to.

One of the most significant points is the ability to evaluate the data: to look at the relationships within it and then link them into a single coherent picture. However, one cannot always trust the correlations in the obtained data. In addition, many tools based on big data can be tricked (Hariri, Fredericks, and Bowers 34-39). Even IT giants and advocates of Big Data such as Google are not immune to errors. The company never managed to defeat the phenomenon of “search bombs,” and the Google Flu Trends project, which its developers claimed could predict disease outbreaks, was wrong much more often than the US Centers for Disease Control and Prevention (Nasser and Soomro 7).
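
The warning about untrustworthy correlations can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis from the cited sources: all series are pure random noise, and the function names, the number of candidate series, and the seed are hypothetical. It shows that when enough unrelated variables are compared against a target, some of them will correlate with it strongly by chance alone.

```python
import random
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_candidates: int, length: int, seed: int) -> float:
    """Largest |r| between a random target and unrelated random candidate series."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    target = [rng.gauss(0, 1) for _ in range(length)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(length)]  # independent noise
        best = max(best, abs(pearson(target, candidate)))
    return best


few = strongest_chance_correlation(n_candidates=5, length=20, seed=1)
many = strongest_chance_correlation(n_candidates=1000, length=20, seed=1)
print(f"best of 5 candidates:    |r| = {few:.2f}")
print(f"best of 1000 candidates: |r| = {many:.2f}")
```

Since none of the candidates is related to the target, any strong correlation found this way is an artifact of searching many variables, which is why a pattern discovered in a large data set should be checked on fresh data before it is acted upon.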

According to the data, 92% of companies working with big data have difficulty developing Big Data projects (Balachandran and Prasad 1113). The most serious obstacles are the underdevelopment of the existing infrastructure and organizational difficulties in introducing new approaches to data collection (Balachandran and Prasad 1112–1115). The problem may also lie in the notorious “human factor”: far from every analyst can work effectively in this field. To truly immerse oneself in the study of data, a person must be well versed in statistics and probability theory and be able to conduct experiments, test hypotheses, and visualize the data.

It is also important that the implementation of Big Data solutions can lead to the creation or discovery of previously confidential information. Therefore, companies must ensure that all data security requirements are monitored and maintained. In particular, the data must be masked in order to protect the original data sources. In addition, it should be noted that there are no dedicated legislative norms regarding Big Data anywhere in the world. All these problems lead many companies to introduce Big Data technologies cautiously. The reason is that, when working with third parties, they face the risk of disclosing insider information that could not have been disclosed had the company used only its own resources.
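
The text does not say how data should be masked, so the following is only one possible sketch. It assumes keyed hashing (HMAC-SHA256 from the Python standard library) as the masking technique, and the function name mask_record, the field names, and the key are hypothetical. Sensitive fields are replaced with tokens that stay consistent across records, so analysts can still group and join on them without seeing the original values.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def mask_record(record: Dict[str, str], sensitive: Iterable[str], key: bytes) -> Dict[str, str]:
    """Return a copy of the record with sensitive fields replaced by keyed-hash tokens."""
    masked = dict(record)  # leave the caller's record untouched
    for field in sensitive:
        if field in masked:
            digest = hmac.new(key, masked[field].encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]  # shortened token for readability
    return masked


# Hypothetical key and record, for illustration only; a real key must be stored securely.
secret_key = b"demo-key-keep-out-of-source-control"
customer = {"name": "Jane Doe", "email": "jane@example.com", "basket_total": "42.50"}
masked = mask_record(customer, sensitive=["name", "email"], key=secret_key)
print(masked)
```

The same input always yields the same token under the same key, while anyone without the key cannot recompute or verify the mapping. This ties the approach to the secure management of keys, which any real deployment would need to address.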

Securing and encrypting large amounts of data is a complex task. According to Gemalto’s 2015 Breach Level Index, an index of data breach severity, more and more organizations are unable to prevent data leaks and protect their information assets, regardless of the size of these assets (Srinuvasu and Santhosh 252). The increase in leaks is associated precisely with the transition of companies and departments to the centralized storage of Big Data.

The most “popular” targets for attackers were companies from the high-tech sector, as well as banks and insurers (Srinuvasu and Santhosh 253). The data that serve as a source for analysis usually contain business-sensitive information, such as trade secrets and personal data. A breach of confidentiality when working with such data can result in serious problems, including fines from regulators, an outflow of customers, a loss of market capitalization, and a critical loss of goodwill.

As with any other aspect of information security, Big Data security should take a multi-level approach to ensure maximum efficiency. Security should be seen as a set of layers that includes not only efforts to prevent leaks but also measures to mitigate their effects. Organizations should protect the data itself, not just the perimeter, and should also protect the users who work with this data. In addition, organizations should provide for secure storage and management of all encryption keys, as well as access control and user authentication.

Today, there are no clearly formulated methods that describe the systematic steps and actions needed to protect Big Data. Approaches focused on protecting critical data at all stages of its processing are required. In any case, security solutions should not affect system performance or cause delays, since high-speed data access is one of the key defining characteristics of Big Data.

Hadoop analytic platforms

The growing need to process and analyze large amounts of data in the IT industry (including analysis of web data, user movement on sites, network monitoring logs, social network data, and corporate data) served as an incentive for the development of new solutions. The scientific field (for example, analysis of data generated by large-scale modeling, sensor systems, and high-performance laboratory equipment) also contributed to the awareness of this need.

In particular, solutions are necessary for Web Mining, the application of Data Mining methods and algorithms to discover dependencies and knowledge on the Internet and to research and extract information from Web documents and services. Examples of Web Mining in use are Netflix and the world-famous Google search engine.

The following stages of Web Mining are distinguished: resource search, which means obtaining data from sources; information extraction, which means extracting information from the Web resources found; generalization, which means discovering common patterns in separate and intersecting sets of sites; and analysis and interpretation of results (Ankam 39). From the point of view of applying Data Mining algorithms to the search for patterns of user behavior, the following methods are most often used (Ankam 40-43):

Clustering: search for groups of similar visitors, sites, pages, etc.
Associations: search for jointly requested pages or jointly ordered goods.
Sequence analysis: search for sequences of actions. The most commonly used approach is a version of the Apriori algorithm, which was designed to analyze frequent sets but has been modified to identify frequent fragments of sequences and transitions.
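
The association search described in the list above can be sketched in a few lines. The example below is a deliberate simplification and not the full Apriori algorithm: it only counts pairs of pages that occur in the same session and keeps those that reach a minimum support. The function name frequent_page_pairs and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def frequent_page_pairs(sessions: List[List[str]], min_support: int) -> Dict[Tuple[str, str], int]:
    """Count page pairs that co-occur in a session; keep pairs seen in >= min_support sessions."""
    pair_counts: Counter = Counter()
    for pages in sessions:
        unique_pages = sorted(set(pages))  # count each page once per session
        pair_counts.update(combinations(unique_pages, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= min_support}


# Hypothetical click sessions from a web shop.
sessions = [
    ["home", "cart", "checkout"],
    ["home", "cart"],
    ["home", "search"],
    ["cart", "checkout"],
]
print(frequent_page_pairs(sessions, min_support=2))
```

Apriori extends this idea level by level: only itemsets whose subsets are all frequent are considered as candidates for the next, larger size, which keeps the search manageable on large logs.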

The Hadoop platform offers a simple but powerful programming model and runtime environment that simplify the creation of scalable applications for parallel processing of large amounts of data on large clusters of commodity computing systems. The open source Hadoop solution has many applications in science and industry. IBM, Oracle, Microsoft, and some successful newer companies, such as Cloudera, MapR, Platfora, and Trifacta, have developed Hadoop-based solutions (Bengfort and Kim 17). The terms Hadoop and big data have come to be used almost interchangeably.

Hadoop is an open source project managed by the Apache Software Foundation. It is used for reliable, scalable, distributed computing, but it can also serve as general-purpose file storage that accommodates petabytes of data; many companies use Hadoop for research and production. Hadoop consists of two key components (Sridnar 20-22):

Hadoop Distributed File System (HDFS), which is responsible for storing data on a Hadoop cluster;
MapReduce, a system designed for computing and processing large amounts of data on a cluster.
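
The MapReduce model named in the list above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use the Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical. It shows only the three logical steps that a cluster would distribute across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big cluster"]))
```

On a real cluster, the map calls run in parallel on the machines that store the input blocks, and each reduce call receives all values for its keys, so the same program scales by adding nodes rather than by changing the code.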

Based on these key components, several subprojects have been created, such as Pig, Hive, and HBase. Unfortunately, modern RDBMSs cannot accommodate the huge amount of data created in large companies, so a compromise is needed: the data is either only partially copied to the RDBMS or deleted after a certain time. The need for such trade-offs disappears if Hadoop is used as an intermediate layer between the interactive database and the data warehouse (Bengfort and Kim 50-51):

Data processing performance increases in proportion to the growth in data storage, whereas on high-performance servers the growth in data volume and the change in performance are disproportionate.
To increase processing performance with Hadoop, it is sufficient to add new nodes to the data warehouse.

However, Hadoop has a number of serious limitations, and, therefore, it cannot be used as an operational database. In particular, even the fastest task in Hadoop still takes a few seconds to complete. Data stored in HDFS cannot be modified in place; moreover, Hadoop does not support transactions. Nevertheless, Hadoop works on clusters of standard computers and represents a cost-effective solution for storing and processing structured, partially structured, and unstructured data without restrictions on format. This makes Hadoop ideal for creating data “lakes” in support of Big Data analytics initiatives.

Works Cited

Ankam, Venkat. Big Data Analytics. Packt Publishing, 2016.

Balachandran, Bala M., and Shivika Prasad. “Challenges and Benefits of Deploying Big Data Analytics in the Cloud for Business Intelligence.” Procedia Computer Science, vol. 112, 2017, pp. 1112–1122.

Bengfort, Benjamin, and Jenny Kim. Data Analytics with Hadoop: An Introduction for Data Scientists. O’Reilly Media, 2016.

Erl, Thomas, Wajid Khattak, and Paul Buhler. Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall, 2016.

Finch, Victor. Big Data for Business: Your Comprehensive Guide To Understand Data Science, Data Analytics and Data Mining To Boost More Growth and Improve Business. CreateSpace Independent Publishing Platform, 2017.

Hariri, Reihaneh H., Erik M. Fredericks, and Kate M. Bowers. “Uncertainty in Big Data Analytics: Survey, Opportunities, and Challenges.” Journal of Big Data, vol. 6, no. 44, 2019, pp. 34-42.

Nasser, Thabet, and Tariq Rahim Soomro. “Big Data Challenges.” Journal of Computer Engineering & Information Technology, vol. 4, no. 3, 2015, pp. 1-10.

Oracle. Oracle Information Architecture: An Architect’s Guide to Big Data. Oracle, 2013.

Schroeder, Ralph. “Big Data Business Models: Challenges and Opportunities.” Cogent Social Sciences, vol. 2, 2016, pp. 1-15.

Sridnar, Alla. Big Data Analytics with Hadoop 3: Build Highly Effective Analytics Solutions to Gain Valuable Insight into Your Big Data. Packt Publishing, 2018.

Srinuvasu, Muttipati Appala, and Egala Bhaskara Santhosh. “Big Data: Challenges and Solutions.” International Journal of Computer Sciences and Engineering, vol. 5, no. 10, 2017, pp. 250-255.

Yu, Shui, and Song Guo. Big Data Concepts, Theories, and Applications. Springer, 2016.