Information Retrieval Methods Report

Table of Contents

Introduction
Different types of IR systems
Requirements for an IR system
Reference List

Introduction

The people attic trust is a complex storage and retrieval project. It features widely varying forms of media, holding several decades’ worth of information on different storage technologies.

Some items use a single mode of data storage, while others combine many forms of media. Some of the material is text, while much of it falls under the multimedia category. The challenge of organising it into retrievable formats and then making the information available to a wide audience through an Information Retrieval mechanism is daunting.

Fortunately, several tools already exist to tackle this kind of challenge. The primary goal of the project is to document the existence of the media, to describe them adequately, and to make their future retrieval possible. This report focuses on the retrieval issues of the project. It presents the range of options available for organising the retrieval system, evaluates them, and finally recommends the most appropriate configuration for the system.

Different types of IR systems

Components of an Information Retrieval (IR) system

An Information Retrieval system has four basic components. They are a database, a search mechanism, a language, and an interface to provide interaction between the user and the system. According to Chu (2005, p.15), databases “comprise information represented and organised in a certain manner”.

In other words, a database is an organised storage system that allows for the searching of items in it using preset criteria. The search mechanism is the system that allows for the searching of the database for the retrieval of the information stored in it. The degrees of complexity of query methods applied vary depending on the technical capacity of the user accessing the database. The third component of an Information Retrieval system is language, which can be either the “natural language or a controlled vocabulary” (Chu, 2005, p.16).

Chu (2005, p.16) notes that “information relies on language, spoken or written, when being processed, transferred, or communicated”. The final component of an Information Retrieval system is the user interface. This is the point of interaction between the user and the system. Its user-friendliness will, in many cases, determine how readily users adopt it. More than anything else, it determines the usefulness and eventual success of an Information Retrieval system.

Categorisation of items in attic

Items in the attic take various forms, which fall into four categories: text-based items, image-based media, streamed media applications, and multimedia applications. Text-based items use words as the basic mode of information storage. Text-based media in this collection include poems, manuscripts for performance art, and newspaper clippings. Image-based applications rely on picture elements to store information.

Each picture element, called a pixel, has its own values describing its color and intensity, which, when combined with those of other pixels, describe a given image. Image-based applications in the collection include photographic materials held on CD-ROMs and hard drives, and 35mm film negatives.

Others are paintings and old maps. These will require digitisation if they are to be retrieved through a computerized Information Retrieval system. Streamed media applications are those that have a time component, which is necessary for the correct interpretation of the data. Distorting the timeline distorts the information in them. Streamed media applications available in the collection include audio recordings such as music and sound clips in .wav and .mp3 format.

Speech and music on audio cassettes and vinyl records also exist in the collection. These forms will require digitisation if they are to be made available to a wide audience. Finally, multimedia applications use a combination of media to present information. In the collection, multimedia applications include video in digital format and on tape, and reels of film.

Text Based Retrieval Systems

A text-based retrieval system will aid the retrieval of the text-based media in the collection. Some of the media rely on analogue technologies, which complicates storing them on the media available to the public, who are the intended beneficiaries of the project. The text-based materials found in the collection will therefore require digitisation.

The key advantage of text-based retrieval is that the technology is mature, and hence enjoys a great degree of format standardisation. It presents fewer compatibility problems between different types of software. Where such problems arise, numerous options for conversion exist to allow retrieval in a desired format. Its weakness lies in its use of letters and words as the basic data storage and retrieval unit.

So far, many of the retrieval methods available for text retrieval do not take into account the semantic elements of a query. They rely on word match, and hence most search systems may not return relevant content based on its meaning, but will return content that closely matches the phrase used as the search query. Advanced systems allow for contextual search, which employs thesauri to identify words with closely related meanings, thereby improving the semantic elements of a search.
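The difference between plain word match and thesaurus-assisted search can be illustrated with a minimal sketch. The documents, the thesaurus entries, and the function names below are all hypothetical and chosen only for illustration; a real system would use a much larger, curated thesaurus.

```python
# Hypothetical mini-collection and thesaurus; none of these names come from the project.
DOCUMENTS = {
    "poem_01": "a poem about the sea at dawn",
    "clipping_07": "newspaper report on the ocean voyage",
    "script_03": "manuscript for a stage performance",
}

THESAURUS = {
    "sea": {"ocean"},
    "ocean": {"sea"},
}


def word_match(query, documents):
    """Return ids of documents that contain every query word exactly."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    )


def thesaurus_match(query, documents, thesaurus):
    """Return ids of documents that contain each query word or one of its synonyms."""
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A term is satisfied if the document holds the term itself or any listed synonym.
        if all(words & ({term} | thesaurus.get(term, set())) for term in terms):
            results.append(doc_id)
    return sorted(results)
```

With these sample data, a word-match query for "sea" finds only the poem, while the thesaurus-assisted query also finds the clipping that mentions "ocean".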

*Figure 1: Multimedia information retrieval system architecture.*

Multimedia Retrieval Systems

On the other hand, multimedia retrieval systems use different means of identifying information that matches a search query. A multimedia Information Retrieval system will comfortably handle search queries for image-based applications and for streamed media applications. Multimedia search queries employ elements usable for each of these two types of application.

Multimedia Information Retrieval is still relatively young. It has many compatibility problems owing to the different formats used to present media of the same type. For instance, the collection contains both .wav and .mp3 files, which are both audio formats. The reason for this is that each new format offers greater functionality. The newer formats regularly lack backward compatibility.

The main constraints that drive the use of different formats include maximisation of storage space and preservation of media quality. However, the design of many media players for streamed media applications and for image-based applications takes these constraints into account. They regularly include the capacity to handle different media types and a facility for converting between formats. The crux remains having the latest version of a media player, which will be able to present the latest file formats.

Requirements for an IR system

Comparison of Requirements for Text Based IR Systems and Multimedia IR Systems

Retrieval systems require a means to identify the information source, which a search mechanism can latch on to in order to identify the media from a database. This is about as far as the similarity between the two types of retrieval systems goes. Text-based Information Retrieval systems rely on matching the search query to the text in the files held in the database to identify a document, while multimedia Information Retrieval systems rely on a range of elements to identify relevant media carrying the required information. These include text elements such as an assigned name for the media in the database. It is possible to search for a film in a database using the film name, on condition that the name is on the file carrying the film. Other locators for multimedia files include the duration and the file format of the media. These are useful in narrowing down a search query.
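The use of name, duration, and file format as locators can be sketched as a simple filter over catalogue records. The record fields, sample entries, and the `find_media` function are assumptions made for illustration and are not part of any system described in this report.

```python
from dataclasses import dataclass


@dataclass
class MediaRecord:
    """Hypothetical catalogue entry holding the locators discussed above."""
    name: str
    file_format: str
    duration_seconds: int


# Illustrative sample data only.
CATALOGUE = [
    MediaRecord("Harvest Festival", "mp4", 1800),
    MediaRecord("Harvest Song", "mp3", 240),
    MediaRecord("Town Hall Speech", "wav", 900),
]


def find_media(records, name_contains=None, file_format=None, max_duration=None):
    """Return records that satisfy every locator that was supplied."""
    matches = []
    for record in records:
        if name_contains and name_contains.lower() not in record.name.lower():
            continue
        if file_format and record.file_format != file_format:
            continue
        if max_duration is not None and record.duration_seconds > max_duration:
            continue
        matches.append(record)
    return matches
```

Searching by name alone returns both "Harvest" items in this sample; adding the file format or a maximum duration narrows the result, which mirrors how the extra locators narrow a search query.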

Main Solutions Available to Designers of IR System

The availability of searchable information from the attic trust is dependent on the digitisation of all records currently in the collection, and to some degree, the standardisation of formats to ease retrieval. The text-based items in the collection will need to be digitised either by typing them out or by using scanning software.

Typing will allow for a greater degree of freedom in the presentation of the information because it makes it possible to format the text for the best possible output for users. It will, however, lead to a loss of authenticity, since the items are antiques whose appeal lies in appearing in their original format. For users seeking information for semantic purposes, reformatted text will meet their needs best because of its improved presentation.

Those who are seeking the information for sentimental purposes will best appreciate the original presentation. To retain the original look, a digitised image of the text will provide the best option. The quickest way to achieve this is through scanning without text recognition. This will actually transform the material into an ‘image’ presenting text, and not pictures. The implication is that text retrieval methods will not apply.

Preservation of the rest of the information will also require digitisation. The most crucial factor is the format to use in the process. Conversion technologies from tape to digital data exist for both audio and video tapes. Storing physical artifacts such as the sculptures for mass presentation will require taking digital pictures for storage in the database.

Another option is to present three-dimensional representations through animation of the photos, or to make short films of the objects, which opens the possibility of adding sound clips. Animation allows for greater user interaction, while filming allows for the addition of details through voice, enriching the experience. Again, the format to apply depends on the nature of the user. For the arts lover, an animated clip that lets the user control the image to get the desired views will be suitable.

For the inquisitive semantic user, a video clip with a sound clip giving background information on the artifact will be ideal. As Jalal (2001, p.6) observes, “Speech can introduce, give summary, stimulate, and tell”. Audio data presents the fewest presentation challenges since the auditory experience does not vary much between users. Provided the data presentation takes on a widely accessible format, there should be no major technical challenges.

Different Methods of Representation

There are two key types of Information Retrieval systems. Belkin (n.d.) identifies the first as “retrospective or ad-hoc” and the second as “Information filtering or routing”. Retrospective systems fulfill one-time information needs, which taper off once the need is met. These include information from e-books, news articles, online magazines, or information websites.

Information filtering systems are those accessed regularly because they have high utility levels. These include websites with changing information such as weather patterns, stock prices, and map services. There are some key issues to consider when setting up the database based on the methods available for representation.

The issue of what language to use across the database is critical. Two ways of approaching language exist. One of them is to use natural language of the users, which forms the basis for the search queries, while the second approach is to use a controlled vocabulary. If the trust adopts natural language for the Information Retrieval system, then users will have an easier time interacting with the database since they do not have to learn the controlled vocabulary of the database.

They will, however, be faced with ambiguity and irrelevance problems. If the trust adopts a controlled vocabulary, users will first have to learn the language, after which they will get better results for their search queries. Tedd et al. (2005, p.39) stress that “it is necessary for users to have the requisite skills to obtain relevant information quickly and effectively”.

The database will need indexing throughout. This involves assigning words or specific phrases to each item in the database. The trust may use descriptors or free indexing, depending on whether the language adopted is a controlled vocabulary or the natural language of the users.
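Assigning index terms to items can be sketched as building an inverted index, in which each term points to the items that carry it. The item identifiers, the assigned terms, and the function names below are hypothetical and serve only as an illustration of the idea.

```python
from collections import defaultdict

# Hypothetical index terms assigned to items by a cataloguer.
ASSIGNED_TERMS = {
    "map_1890": ["maps", "harbour"],
    "photo_12": ["photographs", "harbour"],
    "poem_01": ["poetry"],
}


def build_index(assigned_terms):
    """Invert the item-to-terms mapping into a term-to-items index."""
    index = defaultdict(set)
    for item_id, terms in assigned_terms.items():
        for term in terms:
            index[term.lower()].add(item_id)
    return dict(index)


def lookup(index, term):
    """Return the sorted ids of items indexed under the given term."""
    return sorted(index.get(term.lower(), set()))
```

With a controlled vocabulary, the terms in such an index would be drawn from the approved list of descriptors; with free indexing, they would be whatever words the indexer judges useful.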

Categorisation will involve developing categories for all the items in the collection. Chu (2005) proposes that useful categories must be “exhaustive” (p.29) and “mutually exclusive” (p. 29). This means that all items in the collection must have an assigned category and that no two categories should have an area of overlap.
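The two properties can be checked mechanically once items have been assigned to categories. The following is a minimal sketch under the assumption that each category is held as a collection of item identifiers; the function name and sample data are hypothetical.

```python
def check_categories(items, categories):
    """Return (items with no category, items placed in more than one category)."""
    counts = {item: 0 for item in items}
    for members in categories.values():
        # set() guards against the same item being listed twice within one category.
        for item in set(members):
            if item in counts:
                counts[item] += 1
    missing = sorted(item for item, n in counts.items() if n == 0)
    overlapping = sorted(item for item, n in counts.items() if n > 1)
    return missing, overlapping


# Illustrative data: "map_1890" is uncategorised and "film_03" sits in two categories.
ITEMS = ["poem_01", "photo_12", "map_1890", "film_03"]
CATEGORIES = {
    "text": ["poem_01"],
    "image": ["photo_12", "film_03"],
    "multimedia": ["film_03"],
}
```

A scheme is exhaustive when the first returned list is empty and mutually exclusive when the second is empty.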

Techniques for summarisation improve the query function for text-based applications. Summarisation involves providing a user with brief information relating to a body of text. The techniques include the use of abstracts, summaries, or extracts. An abstract provides the reader with a broad view of the text and can act as a substitute for it, lacking only the detail.

A summary assumes that the reader will read the whole document, so it excludes portions such as the background, the methodology, and the purpose. An extract, on the other hand, is an actual piece of the document, cut out to provide a snapshot of a portion of it. Each of these methods has its advantages and challenges and applies in different circumstances.
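Of the three techniques, the extract is the simplest to automate because it reuses the document's own text. The sketch below takes the opening sentences of a document as its extract; the choice of the opening sentences, the sentence-splitting rule, and the function name are assumptions made for illustration only.

```python
import re


def make_extract(text, max_sentences=2):
    """Return the first few sentences of a text as a simple extract."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```

Abstracts and summaries, by contrast, require rewriting the content, which normally calls for a human writer or a considerably more sophisticated tool.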

Querying refers to the interrogation of a database using a language. Nordbotten (2008) says, “Query language will always provide specification of the selection criteria for the desired information for the remaining processes” in the information retrieval process. The key aspect for designing a query system is to determine the degree of semantic querying necessary for ideal user experience.

Challenges include the management of synonyms, which might require processing of the query, hence reducing the speed of the system and increasing the design and management costs. A simplified query system that matches input to metadata and similar phrases produces large volumes of output, leaving the user with more data to sift through, which may compromise the user experience. Use of metadata can potentially improve search results, as it expands the possible ways of accessing a document.

Implications of Using IR systems

The most appropriate system for the project will include two sections. The first is the preservation, in a museum, of the physical artifacts that contain the information. The second is the development of a digital library or digital museum that will enable users from different parts of the world who are interested in the trust’s activities to interact with the materials.

Arms (2001, p.4) points out that, “a digital library brings the information to the user’s desk, either at work or at home”. The most appropriate Information Retrieval system will be one that uses natural language, since the trust targets a worldwide audience, as opposed to a limited vocabulary system. Keywords in the process assist in refining queries. The trust should also prefer to use methods of storage that will present the artifacts in their natural condition since this is the main appeal in viewing artifacts.

Later on, the trust may consider storage methods applicable to semantic users who are seeking meaning from the information, especially for educational purposes. In particular, the trust needs to digitise its entire collection. This involves conversion of audio files to multiple digital formats. The .mp3 format will be useful if the objective is to conserve storage space. It is also playable on most media players.

*Figure 2: Accessing a digital library (Techweb, n.d.).*

Discretion will be required for text-based media. Some of them will require preservation in digital format by scanning with text recognition, to allow formatting. This will apply to manuscripts and poems. Others may be stored as images through scanning without text recognition. These include the newspaper clippings and poems.

Digital photographs of physical artifacts such as sculptures will aid the development of animated collections. This is easier to handle compared to multimedia items. The multimedia items in the collection will require widely varying file formats for effective retrieval. The option of developing a unique media player for the trust requires consideration. This will solve the compatibility problems in the interim because it will use a single format and will potentially reduce administration costs.

Reference List

Arms, W. Y., 2000. Digital Libraries. USA: MIT Press.

Belkin, J. N., n.d. User Modeling in Information Retrieval. New Jersey: Rutgers University. Web.

Chu, H., 2003. Information Retrieval and Presentation in the Digital Age. New Jersey: Information Today, Inc.

Jalal, S. K., 2001. “Multimedia Database: content and structure”. Workshop on Multimedia and Internet Technologies. Documentation Research and Training Centre, Bangalore. Web.

Kang, K., 1999. Development of a Multimedia Information Retrieval Architecture with Integrated Image Information Retrieval Technique, digital image, Multimedia Technical lab, Korea Telecom. Web.

Nordbotten, J. C., 2008. Multimedia Information Retrieval Systems. Web.

Techweb, n.d. Accessing a digital library. Digital image, R.V. College of Engineering. Web.

Tedd, L. A., Large, A., & Large, J. A., 2005. Digital Libraries: principles and practice in a global environment. München: K.G.Saur Verlag GmbH.