Data management is an important part of GIS operation. However, there is a major difference between information in general and GIS data organization in particular. Fazal offers an overview of data and its use in GIS in a chapter of his book GIS basics.
Although information and data are usually treated as synonyms, from a GIS perspective there is a significant difference between the two. The two notions are admittedly related in GIS, but they are technically not the same.
In the specified chapter of his book, Fazal explains that data is the source material, whereas information is the result of data analysis. Data can exist as a linguistic, symbolic, or mathematical expression, and even a signal. Information, however, exists only in the form of a message delivered to a recipient from a sender.
At one point, Fazal states plainly that “data are facts” (Fazal, 2008, p. 100). In other words, the concept of “data” embraces all elements known as facts that can be stored in a database, such as images, programs, rules, etc. This defines the function of an information system: it is supposed to turn data into information through conversion, organization, structuring and modeling.
Apart from the difference between data and information, there is a huge gap between geographic data and data in general. Geographic data concerns solely the characteristics of the features and resources of the Earth. A geographical reference is another important feature of geographic data – as a rule, latitude and longitude are available to the recipient of the geographic information instantly.
In addition, geographical data is supposed to be used for solving geography-related problems. Containing both a descriptive (non-spatial) and a graphical (spatial) element, geographic data provides a three-dimensional view of the current state of the Earth, as well as major geographical issues and dilemmas.
Among the issues that GIS data can help solve, a reasonable use of exhaustible resources and the means to restore renewable ones are the priority. When dealing with the use of GIS data, one must keep in mind that it has three dimensions: a temporal, a thematic and a spatial one. Temporal data concerns time-related information, thematic data concerns a particular problem, and spatial data allows locating the affected areas.
The difference between the spatial and non-spatial data is also to be kept in mind. Spatial data characterizes a particular area or object within this area by denoting its “location, shape, size, and orientation” (Fazal, 2008, p. 100). Non-spatial data concerns the information that is unrelated to geometric considerations.
It is typical to refer to an element of non-spatial, or descriptive, data as a “data item” (Fazal, 2008, p. 86). The organization of data items is rather specific and can be viewed from four major perspectives. The first one is the Data Perspective of Information Organization (DPIO); the second one is the Relationship Perspective of Information Organization (RPIO), which describes the logical links between objects.
The third one is the Operating System Perspective of Information Organization (OSPIO), which describes the link between directories. The fourth and last one, the Application Architecture Perspective of Information Organization (AAPIO), describes the link between the client and the server.
In particular, DPIO links descriptive and graphical elements. The two have different requirements concerning data storage and organization, and DPIO helps create the environment for preserving each. Descriptive data, or a data item, is considered the basic unit of information organization. A set of data items forms a record, and records are collected into a data file with a unique file name (a text file or ASCII file).
A numeric data file is called a binary file. These files may form a one-dimensional (vector) or two-dimensional (matrix) array. If the data are arranged in columns and rows, the file is called “a table” (Fazal, 2008, p. 88). If the data is structured in a complex system with branches, the file is called “a tree” (Fazal, 2008, p. 88), with “leaves” and “nodes.”
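The file organizations named above can be illustrated with a minimal Python sketch. All names and values (`elevations`, `grid`, `table`, `tree`, `count_leaves`) are hypothetical and chosen only for illustration; none of them come from Fazal's chapter.

```python
# Illustrative values only; none of these figures come from Fazal's chapter.

# One-dimensional array (vector): a single sequence of numeric values.
elevations = [120.5, 98.0, 143.2]

# Two-dimensional array (matrix): values addressed by row and column.
grid = [
    [1, 1, 2],
    [1, 3, 2],
]

# Table: data items arranged in rows (records) and named columns.
table = [
    {"id": 1, "name": "Lake A", "area_km2": 12.4},
    {"id": 2, "name": "Lake B", "area_km2": 3.1},
]

# Tree: nodes that branch into child nodes; nodes without children are leaves.
tree = {
    "value": "country",
    "children": [
        {"value": "state", "children": [
            {"value": "district", "children": []},
        ]},
        {"value": "territory", "children": []},
    ],
}


def count_leaves(node):
    """Count the nodes that have no children."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


print(grid[1][2])          # value stored at row index 1, column index 2
print(count_leaves(tree))  # number of leaves in the sample tree
```

The array and the table differ mainly in how a value is addressed: by position in the former, by record and column name in the latter. The tree adds branching, which neither of the other two can express directly.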
A binary tree in which the value of each node is larger than the values of its children is called “a heap,” and the heap sort algorithm uses this structure to sort data. More recently, a database approach has become associated with computing. The RPIO method classifies the data with the help of the scale of measurement.
The latter is split into four grades, which are nominal, ordinal, interval, and ratio. A nominal level is usually textual, and an ordinal one is numerical. An interval level is a list of numerical data linked to an arbitrary datum, and a ratio level is a list of numerical data linked to an absolute datum. Thus, in the RPIO system, the categories are rather broad. It is usually hard to include spatial relationships in the data set. However, it is possible with sufficient storage space.
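The four grades can be sketched in Python as follows. The mapping of each grade to the comparisons that are meaningful for it follows the standard account of measurement scales rather than Fazal's text, and the attribute names and values (`land_use`, `road_class`, `temperature_c`, `population`) are hypothetical.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Comparisons that are meaningful at each grade; each grade extends the previous one.
MEANINGFUL = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "ordering"},
    Scale.INTERVAL: {"equality", "ordering", "difference"},
    Scale.RATIO: {"equality", "ordering", "difference", "ratio"},
}

# Hypothetical attributes, each tagged with its scale of measurement.
ATTRIBUTES = {
    "land_use": (Scale.NOMINAL, ["forest", "urban", "water"]),
    "road_class": (Scale.ORDINAL, [1, 2, 3]),
    "temperature_c": (Scale.INTERVAL, [-5.0, 10.0, 20.0]),  # arbitrary zero
    "population": (Scale.RATIO, [1500, 3000, 12000]),        # absolute zero
}


def allows(attribute, operation):
    """Return True if the operation is meaningful for the attribute's scale."""
    scale, _values = ATTRIBUTES[attribute]
    return operation in MEANINGFUL[scale]


print(allows("temperature_c", "ratio"))  # False: 20 °C is not "twice" 10 °C
print(allows("population", "ratio"))     # True: 3000 is twice 1500
```

The difference between the interval and the ratio grade is the datum: because the Celsius zero is arbitrary, ratios of temperatures carry no meaning, whereas a population count starts from an absolute zero.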
The above-mentioned means of information organization might seem rather clear and relatively easy to utilize. However, without knowing the difference between a data file and a database, one might find the descriptions of the four types of information management quite confusing.
Therefore, the line between the two concepts must be drawn. Fazal stresses that there are three basic differences between a data file and a database. A data file traditionally includes a collection of the same or similar data records and description, while a database includes interrelated records with possibly different data and descriptions.
More importantly, a database may include several data files, though it does not necessarily have more than one. The processing procedure is another important distinction between a database and a data file. Fazal shows that data file processing is usually related to programming and computing, whereas database processing is always related to a database management system. Finally, a data file is used for spontaneous information check, acquisition, or analysis, while a database is traditionally used on a regular basis.
As a relatively more complicated concept than a data file, a database deserves closer consideration. Fazal explains that database models are the means to organize databases and defines four key types of these models: relational, network, hierarchical and object-oriented.
Relational data are represented in a manner resembling a table; network data are introduced as a set of records organized according to their types; hierarchical data are listed in accordance with a parent-child principle and based on one-to-many relations; object-oriented data are classified in accordance with unique characteristics of the objects included in the data set.
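A rough way to contrast the four models is to express the same small data set in each of them. The Python sketch below is an assumption-laden illustration, not a DBMS implementation: the record names (`River X`, `Basin Y`, `District Q`), the helper functions, and the reading of the network model as records that may link to several owners are all supplied here for illustration.

```python
from dataclasses import dataclass

# Relational: rows in tables, linked through shared key values.
rivers = [{"river_id": 1, "name": "River X", "basin_id": 10}]
basins = [{"basin_id": 10, "name": "Basin Y"}]


def basin_of(river_name):
    """Join the two tables on basin_id."""
    river = next(r for r in rivers if r["name"] == river_name)
    basin = next(b for b in basins if b["basin_id"] == river["basin_id"])
    return basin["name"]


# Hierarchical: parent-child, one-to-many; each child has a single parent.
hierarchy = {"Basin Y": ["River X", "River Z"]}


def parent_of(child):
    for parent, children in hierarchy.items():
        if child in children:
            return parent
    return None


# Network (assumed reading): a record may be linked to several owner records.
network_links = [("River X", "Basin Y"), ("River X", "District Q")]


def owners_of(record):
    return [owner for member, owner in network_links if member == record]


# Object-oriented: data and behavior are bundled in one object.
@dataclass
class River:
    name: str
    length_km: float

    def is_longer_than(self, other):
        return self.length_km > other.length_km


print(basin_of("River X"), parent_of("River Z"), owners_of("River X"))
```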
Spatial data is also often referred to as graphical data. Graphical data are split into basic graphical elements. Traditionally, three major graphic elements are distinguished: a point, a line (or arc), and a polygon (or area).
From a dimensional perspective, a point has zero dimensions (0), a line has one (1) and an area has two (2). The three elements in question depend on each other: areas consist of lines, lines consist of points, and points, in their turn, are defined by coordinates (latitude and longitude).
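The dependency among the three elements can be illustrated with a minimal Python sketch. The class names, the `DIMENSION` constant, the validity checks, and the sample coordinates are assumptions made for illustration and are not taken from Fazal's chapter.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    """Zero-dimensional element defined by a coordinate pair."""
    lon: float
    lat: float
    DIMENSION = 0  # un-annotated, so it is a class constant rather than a field


@dataclass(frozen=True)
class Line:
    """One-dimensional element (arc) made of an ordered sequence of points."""
    points: Tuple[Point, ...]
    DIMENSION = 1

    def __post_init__(self):
        if len(self.points) < 2:
            raise ValueError("a line needs at least two points")


@dataclass(frozen=True)
class Polygon:
    """Two-dimensional element (area) bounded by a closed line."""
    boundary: Line
    DIMENSION = 2

    def __post_init__(self):
        pts = self.boundary.points
        if len(pts) < 4 or pts[0] != pts[-1]:
            raise ValueError("a polygon boundary must be a closed ring")


a, b, c = Point(0.0, 0.0), Point(1.0, 0.0), Point(1.0, 1.0)
triangle = Polygon(Line((a, b, c, a)))
print(Point.DIMENSION, Line.DIMENSION, Polygon.DIMENSION)
```

Each class is built from the one below it, so an area cannot be created without a line, and a line cannot be created without points.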
The use of the above-mentioned elements in GIS to represent geographic characteristics of a particular region is usually referred to as a vector method (a vector data model). A vector method requires that the spatial component of a geographical database should be used in order to characterize a particular area. Because of its focus on the features of a physical object, the given type of data representation is defined as a method based on the “object view of the real world” (Fazal, 2008, p. 92).
Positioning a database element as an object represented the way it is viewed in the real world is not the only method adopted in GIS. Before the levels of data abstraction are defined, the concepts of data model and database model should be defined. A data model includes all raster and vector methods of reality representation, whereas a database model is a software implementation of a data model (Fazal, 2008, p. 93).
The highest level of data abstraction is denoted as a data structure. Several levels of a data structure (DS) are traditionally distinguished. A descriptive DS shows how non-spatial data was designed and used. A descriptive data structure exists either in the form of a relational DS, or in the form of an object-oriented data structure. A graphical DS may exist as a vector or a raster DS.
The latter is represented by such subcategories as “irregular tessellation (e.g., triangulated irregular network (TIN)), hierarchical tessellation (e.g., quad tree) and scan-line” (Fazal, 2008, p. 94). A vector DS can be implemented as a spaghetti DS, a hierarchical DS, etc. The third type, a georelational DS, is supposed to represent geographical information.
The existing spatial data (point, line (arc) and polygon (area)) is characterized by two types of spatial relationships (proximal and topological). The OSPIO is linked directly to computing. Unlike in DPIO or RPIO, in OSPIO, directories are the primary method of information organization. Directories are also known as folders, which help organize data in a hierarchy – there is a root (topmost) directory in a computer, a sub-directory (the one below the topmost directory), and a parent directory (the one above a sub-directory).
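The directory hierarchy described above can be sketched with Python's standard `pathlib` and `tempfile` modules. A temporary directory stands in for the root directory here, and the names `project_a`, `vector`, `roads.txt`, and `build_project_tree` are hypothetical.

```python
import tempfile
from pathlib import Path


def build_project_tree(root: Path) -> Path:
    """Create root/project_a/vector/roads.txt and return the file path."""
    sub_directory = root / "project_a" / "vector"
    sub_directory.mkdir(parents=True)
    data_file = sub_directory / "roads.txt"
    data_file.write_text("road records\n")
    return data_file


with tempfile.TemporaryDirectory() as tmp:
    data_file = build_project_tree(Path(tmp))
    # Walk upward from the file: its sub-directory, then that directory's parent.
    print(data_file.parent.name)         # vector
    print(data_file.parent.parent.name)  # project_a
```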
Directories were created for bookkeeping purposes, and each directory is assigned a unique name when it is created. The concept of workspace is used along with the notion of directory and is defined as the directory to which the files of a specific project belong. AAPIO, in contrast to DPIO, RPIO and OSPIO, is based on client–server relations.
Like OSPIO, AAPIO is usually associated with computer databases and hinges on the data management processes in a particular computer or a telecommunication network. A client is a process that requests services, while the server is the one that provides them. The client–server architecture works in many ways, yet file servers, database servers, transaction servers, web servers and groupware servers are the most popular ones.
With file servers, information is requested from a particular file; with database servers, a structured query language (SQL) request is sent from the client to the server; with transaction servers, a server transaction must be carried out to acquire information; with web servers, Hypertext Transfer Protocol (HTTP) is used as the primary means of communication between the client and the server. Finally, groupware servers are based on the interface, which facilitates communication and information sharing between users (clients).
A pure client–server setting provides immediate access to remote data sources and a subsequent transfer of the retrieved documents to the client. The data structure demands that information should be sorted according to the frequency of its usage. Data that is supposed to be utilized for specific purposes and data that is used on a regular basis should be stored in different databases with different security levels.
The choice of a database depends on the type of data. To be more exact, it is necessary to determine whether the data in question is spatial or not. Fazal claims that spatial data are “multi-dimensional and auto-correlated” (Fazal, 2008, p. 100), while non-spatial data are “one-dimensional and independent” (Fazal, 2008, p. 100). For spatial data, order is crucial; therefore, the traditional type of databases cannot be used to locate geographical data. With the invention of new types of databases, new means of coordinating them appeared, though.
The current DBMS (database management systems) provide two solutions for data arrangement in GIS: 1) providing the user the ability to access all data through DBMS (“total DBMS solution”); 2) leaving direct access to some of the data that is inaccessible otherwise (“mixed solution”).
A repository is viewed as an alternative solution for the problem of data storage. The functions of a repository are restricted to adding data, retrieving it, and deleting it from the system. In some cases, the data added previously can be changed. However, in most cases, changing data in a repository is prohibited for security reasons.
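The restricted set of repository functions can be sketched in Python as follows. The `Repository` class, its method names, and the choice to reject a second write to the same key are assumptions used to illustrate the "no changes" case; they do not describe any particular product.

```python
class Repository:
    """Store that allows adding, retrieving, and deleting items, but not changing them."""

    def __init__(self):
        self._items = {}

    def add(self, key, value):
        # Refusing to overwrite models the case in which changes are prohibited.
        if key in self._items:
            raise KeyError(f"{key!r} already exists; stored data cannot be changed")
        self._items[key] = value

    def retrieve(self, key):
        return self._items[key]

    def delete(self, key):
        del self._items[key]


repo = Repository()
repo.add("parcel-1", {"land_use": "forest"})
print(repo.retrieve("parcel-1"))
```

Under this design, the only way to alter a stored item is to delete it and add a new one, which leaves every modification visible as two explicit operations.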
Speaking of DBMSs, a user can navigate them by using commands in different languages, such as a data definition language (DDL), a data manipulation language (DML), a query, and a query language. Unlike data itself, DBMSs have three levels of abstraction: a physical level (database implementation), a conceptual level (expression of the model of the real world), and a view (a user group portion of a database) (Fazal, 2008, p. 103).
In order to structure data appropriately and define the relationship between its elements, data modeling is used. Traditionally, conceptual, logical and physical data modeling are distinguished. In the course of data modeling, a mathematical formalism known as a data model emerges.
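The distinction between DDL, DML, and a query can be illustrated with SQL statements run through Python's built-in `sqlite3` module. SQLite is used here only because it needs no server; the table `parcels`, its columns, the sample rows, and the SQL view standing in for the "view" level are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of the data.
conn.execute(
    "CREATE TABLE parcels (id INTEGER PRIMARY KEY, land_use TEXT, area_ha REAL)"
)

# DML: insert and modify the stored records.
conn.executemany(
    "INSERT INTO parcels (land_use, area_ha) VALUES (?, ?)",
    [("forest", 12.5), ("urban", 3.2), ("forest", 7.0)],
)
conn.execute("UPDATE parcels SET area_ha = 3.5 WHERE land_use = 'urban'")

# Query: retrieve information derived from the stored data.
rows = conn.execute(
    "SELECT land_use, SUM(area_ha) FROM parcels "
    "GROUP BY land_use ORDER BY land_use"
).fetchall()

# View: expose only the portion of the database one user group needs.
conn.execute(
    "CREATE VIEW forest_parcels AS "
    "SELECT id, area_ha FROM parcels WHERE land_use = 'forest'"
)
forest_count = conn.execute("SELECT COUNT(*) FROM forest_parcels").fetchone()[0]

print(rows)          # [('forest', 19.5), ('urban', 3.5)]
print(forest_count)  # 2
conn.close()
```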
A data model is composed of a notation for data description and a range of operations for data manipulation. Process modeling is also mentioned among the data modeling types. The given concept, however, is process-oriented and cannot technically be associated with the three data-oriented processes above.
Data storage is a relatively hard task that needs to be approached with due responsibility. Knowing the basics of GIS information management, one can use it to the full potential. Therefore, Fazal’s Spatial data structure and models is a crucial piece of information that helps understand the GIS principles better.
Reference
Fazal, S. (2008). Spatial data structure and models. In S. Fazal (Ed.), GIS basics. New Delhi, IN: New Age Publications.