Data Warehouse Research Paper

Abstract

Data warehousing, a means of organizing enterprise information so that businesses can manage knowledge and benefit from analyzing it, is common practice in most firms today. Gone are the days when one large and expensive supercomputer would be used to manage an entire organization’s data.

Today, multiple Central Processing Units (CPUs) are at the disposal of the IT team. The beauty of this scenario is that the CPUs can work simultaneously on different but related parts of a major task, thus completing it in record time.

One of the many advantages of data warehousing is that, after consolidation, these systems become a central data source that is accessible to end users, and deriving information becomes simpler, if not straightforward. Consequently, this element increases the efficiency of business transactions, which eventually draws the line between the firms with business acumen and those without.

However, one inherent disadvantage follows data warehousing and it involves data mining. Ideally, data mining is the final stage of data warehousing because at this point, it is possible to gather all possible types of relational information from the system and determine links and relationships that were not decipherable before. As a result, the accuracy of queries increases and business output increases.

However, this is not the case in practice due to a few hitches attached to data mining. First, once the data-mining process is in place, only a few users in the entire enterprise can actually use it due to the high level of specialization that its application requires. In fact, the number is at present no more than five.

Given this scenario, it is unsurprising that most organizations do not see the point of paying a high price for a process that only five people in the firm would use. Therefore, they pay peanuts. On the other hand, data-warehouse builders know that they require a lot of upfront capital and a heavy investment of time before coming up with a data-mining algorithm, which is infamous for its complexity.

This aspect, coupled with the fact that it is virtually impossible to predict the resourcefulness of a data-mining infrastructure from the onset, which deprives the technician of a sales pitch, makes a very bad case for data mining, and yet its importance cannot be overemphasized.

This paper looks into several such salient features of data warehousing and closes with a few recommendations as well as forecasts on the future of data warehousing.

Introduction

Data warehousing is a rather new term for an old concept. In fact, it emerged in the 1990s, when it was initially referred to as a Decision Support System or an Executive Information System. The father of data warehousing is William Inmon, and a co-innovator usually lined up beside him in reviews is Ralph Kimball.

Several definitions exist for what has come to be accepted as data warehousing in the 21st century, and these include “A data warehouse is an organized system of enterprise data derived from multiple data sources designed primarily for decision making in the organization” (Bertman, 2005, p. 12).

This definition brings out the idea of a myriad of data sources, which is especially relevant because today, most organizations have multiple data sources.

Moreover, when customizing a data warehouse, it is essential to ensure that the infrastructure being set up, including the ETL tools (Extraction, Transformation, Transportation, and Loading solutions), is compatible with all the data sources. Additionally, the definition touches on the issue of decision making as a primary focus when establishing a data-warehousing project.

A second definition is briefer, viz. “…a data warehouse is a structured repository of historic data” (Kimball, Ross, Thornthwaite, Mundy, & Becker, 2008, p. 32).

The author of this definition adds that it is “…developed in an evolutionary process by integrating data from non-integrated legacy systems” (Kimball, Ross, Thornthwaite, Mundy, & Becker, 2008, p. 32).

This definition is attractive for its introduction of the term “integrated”, because the main idea behind data warehousing is that the information that was previously archived in a jumble is reorganized to make sense in the form of tables and even graphs depending on the presentation format preferred by the end user.

At this point, it is appropriate to introduce Inmon’s definition. Because he is regarded as the father of data warehousing, his definition carries legendary weight among data warehouse builders and other experts in the field, and it has even been used to divide data warehousing into branches.

He states, “A data warehouse is a subject-oriented, integrated, time variant, and a non-volatile collection of data used in strategic decision making” (Inmon, 2003, p. 34). It is important to note the usage of several definitive words that have since achieved the status of “mandatory” features of a data warehouse, including subject oriented, non-volatile, time variant, and integrated.

Another definition reads, “A data warehouse is an electronic storage of an organization’s historical data for the purpose of analysis and interpreting” (Prine, 1998, p. 54).

The interesting concept introduced by this final definition is the term “historical data”, which is a very important feature of data warehouses as shall be seen in the ensuing discourse. Additionally, the tasks of analysis and interpretation mentioned by this definition are crucial features in the business of data warehousing.

The next section provides a run through the definitions of other important terms outlined within this paper.

Definitions

OLAP: – Online Analytical Processing refers to the procedure through which multidimensional analysis occurs.

OLTP: – this term refers to a transaction system that collects business data and it is optimized for INSERT and UPDATE operations. It is highly normalized because the emphasis is on updating the system since transactions take precedence here and so the currency of the information is crucial for the relevance of the data.

Data Mart: – this term underscores a data structure designed for access. It is designed with the aim of enhancing end user access to information files stored in subject-order. For instance, in an organization there are numerous departments including IT, HR, Management, Finance, and Research among others.

However, an organization may set up data marts on top of the hardware platform for each department, so that after data warehousing, there exists the traditional centralized data storage envisioned by the creators, but in addition to this, a next section in the architecture provides for data marts (Hackney, 2007, p. 45). These elements would in effect separate the information into the relevant sub-sections based on the subject matter.

ER Model: – this term refers to an entity-relationship model, that is, a data modeling methodology whose aim is to normalize data by reducing redundancy.

Dimensional Model: – this model qualifies the data. Its main goal is to improve the data retrieval mechanism, and it is ideal for data warehousing that is operated based on queries. A typical example would be keying in “1kg” as a search term: the results that one is likely to get would be convoluted.

On the contrary, if one keys in: “1kg of soya (product) bought by Becker (customer) on 23rd November 2012 (date),” in effect, one has just introduced three dimensions: product, customer, and date.

These are mutually independent and non-overlapping classifications of data (Imhoff, Galemmo, & Geiger, 2003, p. 101). A fact denotes something that can be measured or quantified, conventionally, but not always, in numerical values that can be aggregated.

Star schema: – this term refers to a technique used in data warehousing models in which one centralized fact table is used as the reference for all the dimension tables so that the keys (primary keys) from the entirety of dimension tables can flow directly into the fact table (as foreign keys of course) in which the measures are stored. The entity relationship represented diagrammatically resembles a star, hence the name.
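The star schema and the three dimensions from the soya example can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names, and sample rows are hypothetical and were chosen only to mirror the product, customer, and date dimensions described above.

```python
import sqlite3

# Hypothetical schema: one central fact table referencing three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day  TEXT);

    -- The fact table stores the measure plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        quantity_kg REAL
    );

    INSERT INTO dim_product  VALUES (1, 'soya'), (2, 'rice');
    INSERT INTO dim_customer VALUES (1, 'Becker'), (2, 'Ross');
    INSERT INTO dim_date     VALUES (1, '2012-11-23'), (2, '2012-11-24');
    INSERT INTO fact_sales   VALUES (1, 1, 1, 1.0), (2, 1, 2, 5.0), (1, 2, 1, 3.0);
""")


def quantity_bought(product, customer, day):
    """Total kilograms for one (product, customer, date) combination."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.quantity_kg), 0)
        FROM fact_sales AS f
        JOIN dim_product  AS p ON p.product_id  = f.product_id
        JOIN dim_customer AS c ON c.customer_id = f.customer_id
        JOIN dim_date     AS d ON d.date_id     = f.date_id
        WHERE p.name = ? AND c.name = ? AND d.day = ?
        """,
        (product, customer, day),
    ).fetchone()
    return row[0]
```

Calling quantity_bought("soya", "Becker", "2012-11-23") qualifies the bare measure with all three dimensions, which is what separates it from the ambiguous search for "1kg" alone.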

Different Types of Data Warehousing Architectures

There are three main types of data warehousing architectures and these include:

Data Warehouse Architecture (basic)
Data Warehouse Architecture (with a Staging Area)
Data Warehouse Architecture (with a staging area and a data mart)

Data Warehouse Architecture (basic)

This structure comprises metadata, raw data, and summary data. Metadata and raw data are classical features of all operational systems, but the summary data is what makes the architecture distinctively a data warehouse.

Summaries pre-compute long operations in advance; for instance, they can supply the answer to a query on August sales (Imhoff & White, 2011, p. 25). In Oracle, a summary is also known as a materialized view, and in terms of granularity, the data may be atomic (transaction oriented), lightly summarized, or highly summarized.
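The idea of a summary can be illustrated with a small sketch. SQLite has no materialized views, so the example below emulates one with an ordinary table that is computed once in advance; the table names and the sample figures are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Atomic, transaction-level data (hypothetical sample rows).
    CREATE TABLE sales (sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2012-07-30', 40.0), ('2012-08-02', 100.0),
        ('2012-08-15', 250.0), ('2012-09-01', 75.0);

    -- Highly summarized data: one row per month, computed once in advance.
    CREATE TABLE monthly_sales_summary AS
        SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
        FROM sales
        GROUP BY month;
""")


def monthly_total(month):
    """Answer a monthly sales query from the summary, not the raw rows."""
    row = conn.execute(
        "SELECT total FROM monthly_sales_summary WHERE month = ?", (month,)
    ).fetchone()
    return row[0] if row else 0.0
```

A query on August sales, monthly_total("2012-08"), then reads a single pre-aggregated row instead of scanning every transaction. In a real warehouse, the summary would also need to be refreshed whenever new data is loaded.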

Data Warehouse Architecture (with a Staging Area)

This architectural type is relevant when there is a need to clean and process operational data before it is stored in the warehouse. This task can be done either programmatically, that is, with a program or using a staging area. A staging area simply refers to that “region of the architecture that simplifies building summaries and general warehouse management” (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2003, p. 67).
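The cleaning step that a staging area performs can be sketched as follows. The field names ("product", "amount") and the cleaning rules are hypothetical; an actual staging area would apply whatever rules the organization's operational data calls for.

```python
def clean(raw_rows):
    """Return only the rows fit for loading, in a normalized form."""
    staged = []
    for row in raw_rows:
        # Normalize the product name; a missing value becomes an empty string.
        name = (row.get("product") or "").strip().lower()
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue  # reject rows whose amount cannot be parsed
        if name:
            staged.append({"product": name, "amount": amount})
    return staged


# Hypothetical operational records, some of them dirty.
raw = [
    {"product": " Soya ", "amount": "12.5"},
    {"product": "rice", "amount": "n/a"},
    {"product": "", "amount": "3"},
]
staged_rows = clean(raw)
```

Only the first record survives, with its name trimmed and lower-cased, so the warehouse itself never sees the malformed rows.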

Data Warehouse Architecture (with a staging area and a data mart)

This architecture type is ideal for the customization of a data warehouse for different groups within an organization. It adds “data marts to the staging area, where data marts are systems that are designed for a particular line of business” (Hackney, 2007, p. 18). A good example is a case where a firm needs to separate inventories from sales and/or purchases.

At this point, it is important to introduce the concept of Business Intelligence for a better understanding of the working of data warehouses. Business intelligence covers information that is available for strategic decision making by businesses. In this setting, the data warehouse is simply the backbone or the infrastructural component (Prine, 1998, p. 39).

Business intelligence includes the insight that is obtained upon the execution of a data mining analysis and other unstructured data, and this aspect explains the significance of content management systems because in an unstructured context, they organize the information logically for better analysis.

When choosing a business intelligence tool, one needs to address the following considerations that inform the choice, viz. increasing costs, increasing functionality, increasing complexity of business intelligence, and a decreasing number of end users (Eliott, 2012). Interestingly, the most popular business intelligence tool is Microsoft Excel.

This assertion holds for several reasons, including the fact that MS Excel is cheap to acquire and conveniently simple to use.

In addition, the user does not have to worry whether the other user can decipher the information or figure out how the reports are to be interpreted (because the presentation is simple to interpret), and finally, Excel has all the functionalities that are necessary for the display of data (Barwick, 2012).

Other tools include a reporting tool, which can be either custom built or commercial and it is used for the running, creation, and scheduling of operations or reports (Kimball, Ross, Thornthwaite, Mundy, & Becker, 2008, p. 67).

Another tool is the OLAP tool, which is a favorite amongst advanced users because it features a multidimensional perspective of findings, and finally there is the data mining tool that is for specialized users, hence the limitation to fewer than five users in an entire enterprise.

Overall structure

The primary features of a data warehouse are better relayed in a graphical format, but this section hopes to provide a comprehensive textual explanation of the same. At the beginning end, there exist data sources, which are archived in different formats, but they are largely unorganized and very general.

The idea is to get the data to the other end where, in an ideal scenario, it is available to end users in data marts and the users are capable of retrieving this information on CDs, DVDs, or flash drives.

In a bid to get to that end, the data has to pass through data acquisition, which refers to retrieval of information from the data sources; that is, “a set of processes and programs that extract data for the data warehouse and operational data store from the operational systems” (Imhoff, Galemmo, & Geiger, 2003, p. 17).

At this stage, features touching on cleansing, integrating, and transformation of data stand out. Next, the data, through data delivery, is moved to the open marts and ready for harvesting.

Advantages of data warehousing

This process makes the data more accessible and accurate so that end users do not fumble through scores of unsorted data in order to answer their queries. Consequently, it makes the process of accessing that information cheaper and more efficient.

It reduces the costs of acquiring this data because the accessibility means that users do not need to spend additional resources on fruitless tasks; in addition, these resources can be expended elsewhere. Another advantage is that it increases the competitive advantage of the enterprise that integrates it into its infrastructure.

The data in a data warehouse can be used in multiple scenarios, including the production of reports for long-term analyses, reports meant to aggregate enterprise data, and finally reports that are multidimensional; for instance, a query can be lodged on the profits accrued by month, product, and branch.
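A multidimensional report of this kind can be sketched in a few lines. The records and the helper name profit_by are hypothetical; the point is only that the same profit figures can be aggregated along any combination of the month, product, and branch dimensions.

```python
from collections import defaultdict

# Hypothetical profit records, one per (month, product, branch) combination.
sales = [
    {"month": "2012-06", "product": "soya", "branch": "north", "profit": 120.0},
    {"month": "2012-06", "product": "soya", "branch": "south", "profit": 80.0},
    {"month": "2012-06", "product": "rice", "branch": "north", "profit": 50.0},
    {"month": "2012-07", "product": "soya", "branch": "north", "profit": 100.0},
]


def profit_by(rows, *dimensions):
    """Sum profit over every distinct combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["profit"]
    return dict(totals)
```

For example, profit_by(sales, "month") rolls the data up to one total per month, whereas profit_by(sales, "month", "product", "branch") keeps all three dimensions.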

The information stored in a warehouse provides a basis for strategic decision-making, it is available for access, and it is consistent. Additionally, it assists in introducing an organization to the continuous changes in information within the enterprise. Finally, it helps protect the data from abusers.

Disadvantages of data warehousing

Data warehousing is a very costly investment, which is bound to dig into the capital pool of the enterprise that is using it. Additionally, it takes a lot of time to get the project underway and finally see it to completion, which could take anywhere between two and six months. The time becomes relevant because the data-warehousing infrastructure being installed may end up obsolete by the time it gets into production.

The very volatile nature of business is vulnerable to this new risk because in contemporary times, even the formerly static fields like finance are susceptible to multiple changes within such a period in order to increase sales. In such a scenario, at the onset of installation, the data warehousing technique may be relevant, but at the end of the project, it may have become obsolete.

It is also very worrying that colleges and other institutions are churning out new experts in data warehousing every other day, and the effect that this has on the industry is horrifying because these new brains are eager to apply what they have learnt in school, yet they have not practiced and they lack quality experience.

Ultimately, they install data warehouses that are slow or ineffective because of sticking to ideals that may not be practical in real life scenarios.

Moreover, another disadvantage is the fact that due to the efficiency of the results of data warehousing, organizational users may be tempted to use the data warehouse inappropriately.

This scenario occurs when the data warehouse is used to replace the operational systems or the reports that are normally churned out by operational systems, or to analyze the current operational results. It is noteworthy that these two systems are not supposed to be used interchangeably; on the contrary, they should complement each other.

OLTP and Data Warehousing Environments

Before getting to the contrasts, it is important to create a background that is relevant to this discourse. With that in mind, a data warehouse “is a relational database, which is designed for queries and analyses rather than for transaction processing” (Imhoff, Galemmo, & Geiger, 2003, p. 111).

Consequently, it comprises historical data as well as data from other sources, which in most cases falls in the category of unstructured data. The surrounding environment features the following components:

ETL solution

This component comprises the extraction, transportation, transformation, and loading stages that are required for unstructured data to be cleaned and transformed into an integrated block of information.
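The ETL stages can be sketched in miniature as follows. The two sources, their field names (sku/qty and item/units), and the target table warehouse_sales are all hypothetical; the example only shows disparate inputs being extracted, transformed into one common shape, and loaded in bulk.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that describe the same thing in different formats.
CSV_SOURCE = "sku,qty\nA1,3\nB2,5\n"
DICT_SOURCE = [{"item": "A1", "units": "2"}]


def extract():
    """Read the raw records from each source as-is."""
    rows_a = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return rows_a, DICT_SOURCE


def transform(rows_a, rows_b):
    """Map both source layouts onto one (sku, qty) layout with integer quantities."""
    unified = [(r["sku"], int(r["qty"])) for r in rows_a]
    unified += [(r["item"], int(r["units"])) for r in rows_b]
    return unified


def load(conn, rows):
    """Bulk-insert the integrated rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS warehouse_sales (sku TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
totals = dict(conn.execute("SELECT sku, SUM(qty) FROM warehouse_sales GROUP BY sku"))
```

After loading, the quantities for item A1 from both sources are held in one table and can be totalled together, which is the integration that the ETL component exists to provide.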

Online Analytical Processing Engine (OLAP)

This component underscores the reporting and analyzing system that processes business data. It is deliberately de-normalized in order to ensure fast data retrieval. As a result, instead of the update and insert features that are commonplace for OLTP, this system features SELECT operations that are ideal for queries (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2003, p. 54).

A good example would be a department store scenario. At the point of sale, the cashier looks up the price list and charges the customer’s credit card; this amounts to a transaction, and so OLAP is not in play (Hackney, 2007, p. 39).

However, if the store manager were to require a list of out-of-stock products, he would turn to the OLAP operation to retrieve that data.

With the environment of a data warehouse thus surveyed, it is important to look into the founding father’s perspective, as it shall form the basis of the contrast between OLTP and Data Warehousing Environments. As per William Inmon’s definition of warehouses mentioned above, four distinguishing features come to mind:

Subject oriented. During operation, where operation refers to data analysis, it is possible for the data warehouse to be programmed to act based on a particular subject, for example, sale of Ferraris. In this line of thought, it is thus possible to arrive at the best customer for Ferraris in June 2012. This aspect is known as subject orientation.

Integrated. This feature is in reference to an organization and so it is safe to say that it is an organizational feature. At this point, it is apparent that in an organizational context, there exist various sources of data.

The cumulative effect of this aspect is that the bulk of the data will be disparate and inconsistent and thus the job of ensuring that this data goes through consolidation and alignment into a sensible platform belongs to the data warehouse (Bertman, 2005, p. 41).

In the course of executing this task, various challenges are expected to emerge. These challenges should meet resolution and if the data warehouse is capable of getting to such a state where they are resolved, it qualifies as an integrated data warehouse.

Time variant. The idea behind data warehousing is to carry out an analysis that spans a given period and the width of its scope may be infinite. This aspect explains why data warehouses contain historical data ranging back years or decades.

This element is very different from Online Transaction Processing (OLTP) systems, which move historical data into archives to make room for current data. On the contrary, data warehousing analysts need large bundles of data in order to glean change over time, which underscores the concept of time variance.

Non-volatile. This feature is in reference to the stability or performance of data once it has been loaded into the data warehouse. The data warehouse should have the ability to maintain the information in the state in which it was entered initially. There should not be any deletions or other alterations, or else the whole information would be jumbled and too inaccurate to use in the analysis of business intelligence.
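The non-volatile property can be illustrated with a small sketch. Here it is enforced with SQLite triggers that reject every update and deletion; this is only one possible mechanism, and the table name, trigger names, and error message are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_facts (sale_date TEXT, amount REAL);

    -- Reject any attempt to change or remove rows that are already loaded.
    CREATE TRIGGER forbid_update BEFORE UPDATE ON warehouse_facts
    BEGIN
        SELECT RAISE(ABORT, 'warehouse rows are read-only');
    END;
    CREATE TRIGGER forbid_delete BEFORE DELETE ON warehouse_facts
    BEGIN
        SELECT RAISE(ABORT, 'warehouse rows are read-only');
    END;

    INSERT INTO warehouse_facts VALUES ('2012-06-01', 10.0);
""")


def attempt(statement):
    """Run a statement and report whether the warehouse accepted it."""
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError:
        return False
    return True
```

Under this arrangement, new rows can still be appended by a load, but an UPDATE or DELETE against warehouse_facts is refused, so the data stays in the state in which it was entered.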

Contrast between OLTP and Data Warehousing Environments

Workload

Data warehouses accommodate ad hoc queries, which is to say that the queries they deal with are random and unexpected. The ideal system should have the capacity to perform well across a wide array of possible questions in various categories. On the other hand, OLTP systems rely on the predefinition of key concepts. It follows that applications should be specifically tuned or designed for preset operations.

Data modifications

Data warehouses feature a regular update of the system through the ETL process (offering extraction, transportation, transformation, and loading solutions). The same is set to run nightly or weekly depending on organizational preferences. In a bid to accomplish this goal, the enterprise employs bulk data modification techniques. However, the end users do not individually update the data warehouse.

On the contrary, in OLTP systems, “the end users are responsible for system updates and they do this by way of routinely issuing individual modification statements to the database warehouse; consequently, the database is always up to date” (Reddy, Rao, Srinivasu, & Rikkula, 2010, p.2869).

Schema design

Data warehouses “use fully or partially de-normalized schemas such as the star schema for optimal query performance” (Reddy, Rao, Srinivasu, & Rikkula, 2010, p.2870). On the other hand, OLTP systems use normalized schemas for optimum updates with insert and delete functionalities and data consistency because they are transactional and the accuracy of current information is very critical.

Typical operations

For data warehouses, the typical operation is querying. They need the capacity to scan thousands or even millions of rows in a single query to come up with the required result. A good example of such a demanding query is finding the total sales for all the cashiers for the last month.

On the other hand, OLTP systems have a lighter burden to contend with in terms of bulk. A transactional operation scans only a handful of records at a go. For instance, a typical operation would retrieve the current price for a customer’s order.
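The two kinds of operation can be set side by side in a brief sketch. The tables, names, and figures are hypothetical; the contrast to note is that the price lookup touches a single row through its key, whereas the sales total must read and aggregate every row for the month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (product TEXT PRIMARY KEY, price REAL);
    CREATE TABLE sales  (cashier TEXT, sale_date TEXT, amount REAL);
    INSERT INTO prices VALUES ('soya', 2.5), ('rice', 1.75);
    INSERT INTO sales VALUES
        ('ann', '2012-10-03', 20.0), ('ann', '2012-10-19', 30.0),
        ('bob', '2012-10-07', 15.0), ('bob', '2012-11-01', 99.0);
""")


def current_price(product):
    """OLTP-style operation: fetch one record by its primary key."""
    row = conn.execute(
        "SELECT price FROM prices WHERE product = ?", (product,)
    ).fetchone()
    return None if row is None else row[0]


def sales_per_cashier(month):
    """Warehouse-style operation: scan the month's rows and aggregate them."""
    rows = conn.execute(
        """
        SELECT cashier, SUM(amount)
        FROM sales
        WHERE substr(sale_date, 1, 7) = ?
        GROUP BY cashier
        ORDER BY cashier
        """,
        (month,),
    )
    return dict(rows)
```

With only four rows the difference is invisible, but on millions of rows the second query is the one whose cost grows with the volume of data.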

Historical data

Due to the nature and the intended use of data warehouses, it is relevant for them to store up to decades of information in a region that is easily accessible when queries are executed. Such a structure is ideal for historical analyses. On the contrary, OLTP systems are just the opposite.

They store up data for at most a few weeks or months and only retain historical data as is relevant for the current transaction. Moreover, this additional historical data is stored up in archives and a special retrieval process is necessary when it becomes relevant or necessary.

Hardware and I/O Considerations in Data Warehouses

Scalability

It is important to ensure that the data warehouse grows as the data storage grows. In a bid to warrant this element, it would be wise to choose the RDBMS and hardware platforms that are adequately structured to handle large volumes of data most efficiently (Kimball, Reeves, Ross, & Thornthwaite, 1998, p. 90).

However, this move may be a difficult task to embark on in advance when it is still not apparent what amount of data shall be stored in the data warehouse in its maturity. This realization explains why it is also advisable to approximate the amount and use it as a basis in setting up the data warehouse.
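Such an approximation can be as simple as a back-of-the-envelope calculation. The formula and every parameter below (row volume, row size, retention period, and an overhead factor for indexes and summaries) are assumptions made for illustration and are not figures from the sources cited here.

```python
def estimate_storage_bytes(rows_per_day, bytes_per_row, retention_days, overhead=1.5):
    """Rough size of the warehouse once it holds the full retention window.

    overhead is a hypothetical multiplier covering indexes, summaries, and
    free space; a real project would measure it rather than assume it.
    """
    return int(rows_per_day * bytes_per_row * retention_days * overhead)


# Example: one million 200-byte rows per day, kept for ten years.
ten_year_estimate = estimate_storage_bytes(1_000_000, 200, 3650)
```

An estimate of this kind is crude, but it gives the team a starting figure against which candidate RDBMS and hardware platforms can be compared.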

Parallel Processing Support

It is necessary to refrain from using one CPU as the main processor and instead use multiple CPUs each performing a related part of the task separately but simultaneously (South, 2012, p. 67).
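The division of one major task into related parts that run at the same time can be sketched as follows. The function names and the choice of four workers are hypothetical. Note also that the sketch uses threads to stand in for CPUs; in CPython, spreading computation across several physical CPUs would require processes (for example, ProcessPoolExecutor) rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_total(chunk):
    """The related sub-task: total one partition of the data."""
    return sum(chunk)


def parallel_total(amounts, workers=4):
    """Split the data into partitions, total each concurrently, then combine."""
    size = max(1, -(-len(amounts) // workers))  # ceiling division
    chunks = [amounts[i:i + size] for i in range(0, len(amounts), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_total, chunks))
```

The result is the same as a single sequential sum; what changes is that each worker handles only its own partition of the major task.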

RDBMS – Hardware combination

This consideration becomes relevant because the RDBMS sits on top of the hardware platform, and this combination may bring issues with bugs and bug fixing (Kimball & Ross, 2002, p. 26).

eBay data warehouse (structure)

Oliver Ratzesberger and his team at eBay are responsible for two of the world’s largest data warehouses. The Greenplum data warehouse, which is fully equipped with a data mart, comprises 6.5 petabytes of user data, which translates to more than 17 trillion records, and “each day, an additional 150 billion new records are added and this amounts to 100 days of event data” (Dignan, 2010, Para. 12).

The ultimate goal is to reach 90-180 days of event data. The I/O throughput is an impressive 200 MB/node/sec. This rate further improves due to a minimized number of concurrent end users.

The second data warehouse is “a Teradata warehouse with two (2) petabytes of user data, which is fed by tens of thousands of production databases” (Miller & Monash, 2009, Para. 6).

Its speed is 140 GB/sec of I/O, or 2 GB/node/sec. By aiming at resource partitions, eBay relies on workload management software to deliver on numerous Service-Level Agreements (SLAs) simultaneously.
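The figures quoted above allow a rough consistency check. The results below are derived purely by arithmetic on the quoted numbers; they are implied values and are not themselves reported by the cited sources.

```python
# Teradata warehouse: total I/O rate divided by the per-node rate.
TERADATA_TOTAL_GB_PER_SEC = 140
TERADATA_GB_PER_NODE_PER_SEC = 2
implied_teradata_nodes = TERADATA_TOTAL_GB_PER_SEC / TERADATA_GB_PER_NODE_PER_SEC

# Greenplum warehouse: daily intake multiplied by the days of event data held.
GREENPLUM_RECORDS_PER_DAY = 150_000_000_000
GREENPLUM_DAYS_RETAINED = 100
implied_greenplum_records = GREENPLUM_RECORDS_PER_DAY * GREENPLUM_DAYS_RETAINED
```

The two I/O figures imply roughly 70 nodes for the Teradata system, and 150 billion records per day over 100 days comes to 15 trillion records, which is of the same order as the 17 trillion quoted for the Greenplum system.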

Conclusion

This paper has addressed the topic of data warehousing at length. It has touched on the system’s definitions, characteristics, advantages and disadvantages, contrasts with OLTP, and even hardware considerations. Finally, it has concluded by looking into eBay’s data warehousing, which is the exemplary system that most organizations throughout the globe envy and would be wise to learn from.

References

Barwick, H. (2012). Security, Business Intelligence ‘critical’ for Australian CIOs in 2013: Telstyle. Web.

Bertman, J. (2005). Dispelling Myths and Creating Legends: Database Intelligence Groups. Web.

Dignan, L. (2010). eBay’s Teradata implementation headed to 20 petabytes. Web.

Eliott, T. (2012). Rethinking Business Intelligence: 3 Big New Old Ideas. Web.

Hackney, D. (2007). Picking a Data Mart Tool. Web.

Imhoff, C., Galemmo, N., & Geiger, J. (2003). Mastering Data Warehouse Design: Relational and Dimensional Technique. Indianapolis, IN: Oxford University Press.

Imhoff, C., & White, C. (2011). Self-Service Business Intelligence Empowering Users to Generate Insights. Web.

Inmon, W. (2005). Building the Data Warehouse. Indianapolis, IN: Wiley.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2003). Fundamentals of Data Warehousing (2nd edn.). New York, NY: Springer.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). Data Warehouse Lifecycle Toolkit: Expert methods for Designing, Developing, and Deploying Data Warehouses. Indianapolis, IN: Wiley.

Kimball, R., & Ross, M. (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2nd edn.). Indianapolis, IN: Wiley.

Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). Data Warehouse Toolkit: Practical Techniques for Building Data Warehouse and Business Intelligence Systems (2nd edn.). Indianapolis, IN: Wiley.

Miller, R., & Monash, C. (2009). eBay’s two enormous data warehouses. Web.

Prine, G. (1998). Coherent Data Warehouse Initiative. London, UK: Unisys Presentations.

Reddy, S., Rao, M., Srinivasu, R., & Rikkula, S. (2010). Data Warehousing, Data Mining, OLAP and OLTP Technologies are Essential Elements to Support Decision-Making Process in Industries. International Journal of Computer Science and Engineering, 2(9), 2865-73.

South, G. (2012). Small business: Savings lead to a Stellar business. New Zealand Herald, 67.