[Part 2 - Computing Concepts] Something for the Public: A web accessible National Sites and Monuments Record // sweeting.org

PART 2 - COMPUTING CONCEPTS

DATABASES

Worboys (1997:345) defines a database as:

"A unified computer-based collection of data, shared by authorised users, with the capability for controlled definition, access, retrieval, manipulation and presentation of data within it."

To take this description a little further, a database can hold any sort of data, from numbers to characters and from images to sounds (e.g. the TerraServer). This data is useless unless it is given some sort of meaning - be that simply some sort of logical ordering, or by assigning relationships between some bits of data and other bits.

To give the data meaning, a database management system (DBMS) must be employed. The DBMS is the core of the database, and provides the tools to manage the data (Tare, 1989:124), for example the input of data, the verification, storage, retrieval and combination of data (Burrough and McDonnell, 1998:300).

There are many techniques used by the DBMS to accomplish these tasks, but these depend on the design or structure of the database. Database structures or models can be broken down into approximately five categories: flat file, hierarchical, network, relational and object-oriented. It is not necessary to understand all of these, so I shall consider only the two that are commonly used by archaeologists: flat files and relational databases.

Database Models

Flat File

The flat file database is the simplest of all, and the easiest to visualise. Consider a piece of paper divided up into rows and columns - for example an attendance register - see Table 2.

NAME	WEEK 1	WEEK 2	WEEK 3	WEEK 4	WEEK 5	WEEK 6	WEEK 7
Ahmed	Yes	Yes	Yes	Yes	Yes	No	Yes
Bret-Young	No	Yes	Yes	Yes	No	No	Yes
Brian	Yes	Yes	No	Yes	Yes	Yes	No
Davis	Yes	Yes	Yes	No	Yes	Yes	Yes

Table 2: A representation of a flat file database.

In "relational database terms" (to be discussed below) each row is called a tuple, or a "record", and each column heading is an attribute or field heading. Together, this row/column structure makes up a "table". A primary key uniquely identifies each row. For Table 2, this is just the student's surname. For larger files, where several students may share a surname, the primary key is often some sort of code, for example a student registration number, or an SMR record number.

Querying the file by the primary key is very simple for humans to do: we can quickly skim through until we find the record we are interested in. Likewise, a computer is able to read through the rows until it reaches the desired record. However, to speed this process up, many flat file databases employ a technique known as a binary search, which requires the file to be sorted by the primary key. With a binary search, the file is split into two equal halves. The halfway record is compared with the search string, and depending on whether it is equal to it, precedes it, or comes after it, one half of the file is discarded. The remaining half is again split in two, and the same comparison carried out. This process is continued until the record is found.
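The halving procedure can be sketched in Python as follows. This is a minimal illustration rather than the routine used by any particular flat file package. It assumes the records are already sorted by primary key, and the function name binary_search and the cut-down register (surname and week 1 only, from Table 2) are chosen purely for the example.

```python
def binary_search(records, key):
    """Return the record whose first field equals key, or None.

    Assumes records is a list of tuples sorted by their first field.
    """
    low, high = 0, len(records) - 1
    while low <= high:
        mid = (low + high) // 2      # the halfway record
        candidate = records[mid][0]
        if candidate == key:
            return records[mid]
        if candidate < key:
            low = mid + 1            # discard the lower half
        else:
            high = mid - 1           # discard the upper half
    return None


# Surnames and week 1 attendance from Table 2, sorted by surname.
register = [
    ("Ahmed", "Yes"),
    ("Bret-Young", "No"),
    ("Brian", "Yes"),
    ("Davis", "Yes"),
]

print(binary_search(register, "Brian"))   # ('Brian', 'Yes')
print(binary_search(register, "Smith"))   # None
```

Each pass through the loop halves the number of records still under consideration, which is where the speed advantage over reading row by row comes from.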

Burrough and McDonnell (1998:42) have illustrated the efficiency of binary searching compared to sequential searching with some maths. They showed that for a database of ten thousand records, with each record taking 1 second to read, it would take on average about 1.5 hours to find a record using a sequential search, but only 14 seconds with a binary search routine!
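The arithmetic behind this comparison can be checked with a short Python sketch. It uses the usual textbook formulas, (n + 1) / 2 reads on average for a sequential search and ceil(log2(n + 1)) reads at most for a binary search, rather than reproducing Burrough and McDonnell's own working, and the function names are hypothetical. At one second per read, the quoted timings correspond to a file of about 10,000 records.

```python
import math


def average_sequential_reads(n):
    """Average number of records read to find one of n by scanning in order."""
    return (n + 1) / 2


def worst_case_binary_reads(n):
    """Most records a binary search has to read among n sorted records."""
    return math.ceil(math.log2(n + 1))


n = 10_000             # assumed file size
seconds_per_read = 1   # as in the example above

sequential_hours = average_sequential_reads(n) * seconds_per_read / 3600
binary_seconds = worst_case_binary_reads(n) * seconds_per_read

print(f"sequential: about {sequential_hours:.1f} hours")   # about 1.4 hours
print(f"binary: at most {binary_seconds} seconds")         # at most 14 seconds
```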

Queries that do not look for a primary key take much longer to perform, as the field data is not indexed in any form.

Relational Databases

Relational databases store data in sets of tables made up of rows and columns (i.e. several flat files). These tables store information on specific themes, and the tables are related to each other through a common column, known as a key (typically the primary key of one table, which the other tables refer to).

For an example, consider a simple map consisting of two polygons (Figure 8).

The example map is made up of two polygons, I and II. To verify this you can look at the "Map table" (in part (a)). To find out information regarding either of these polygons, the "polygons table" would have to be consulted. The polygon table tells us about all the lines that make up each polygon, so for more information regarding those, it is the "lines table" that would be referred to.

One of the benefits of using a relational database over a flat file is that it saves space. Considering the map in Figure 8, it is possible to see that the line "c" is part of both polygons. The polygons could subsequently be broken down into two three-line polylines or two four-node polylines (polylines "a" and "b" in table set (b)), with line "c" stored once as a two-node line. This process, known as normalisation (Burrough and McDonnell, 1998:47), helps to reduce the volume of raw data in spatial data sets.
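The idea can be sketched with Python's built-in sqlite3 module. The table and column names below, and the map name 'M', are hypothetical and are not taken from Figure 8. Only the polygons I and II, the lines a, b and c, and their node counts follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE map_polygons (map TEXT, polygon TEXT);
    CREATE TABLE polygon_lines (polygon TEXT, line TEXT);
    CREATE TABLE lines (line TEXT PRIMARY KEY, nodes INTEGER);

    INSERT INTO map_polygons VALUES ('M', 'I'), ('M', 'II');
    INSERT INTO polygon_lines VALUES
        ('I', 'a'), ('I', 'c'), ('II', 'b'), ('II', 'c');
    -- Line c is stored once, although both polygons use it.
    INSERT INTO lines VALUES ('a', 4), ('b', 4), ('c', 2);
""")

# Follow the relationships: polygon -> its lines -> details of each line.
rows = conn.execute("""
    SELECT pl.polygon, l.line, l.nodes
    FROM polygon_lines AS pl
    JOIN lines AS l ON l.line = pl.line
    ORDER BY pl.polygon, l.line
""").fetchall()

for row in rows:
    print(row)
```

The lines table holds three rows rather than four, because the shared line is referred to by its key instead of being duplicated for each polygon.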

To query relational databases, a specialised query language called Structured Query Language (SQL) is used. There is no need to go into the language itself here in full, but it is a fairly simple language to learn and use. For instance, the statement:

SELECT * FROM CUSTOMERS WHERE LOCATION = 'SURREY'

would produce a list of all the records in the "customers" table where the "location" field equals "Surrey". Most SQL-compliant databases provide several methods for querying the data, including an SQL prompt. Microsoft Access, for instance, provides an SQL prompt and a "query builder" interface, as well as various "Wizards" to help you construct queries.
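To see the statement in action, here is a small sketch using Python's built-in sqlite3 module. The columns and the three customer records are invented for the example. Note that SQLite compares text case-sensitively by default, so the stored values are written in capitals to match the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Archer", "SURREY"), ("Baker", "KENT"), ("Carter", "SURREY")],
)

# The statement from the text, passed to the database unchanged.
matches = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE LOCATION = 'SURREY'"
).fetchall()

for record in matches:
    print(record)
```

Only the two Surrey records are returned. The order of the rows is not guaranteed unless an ORDER BY clause is added.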

Database Types

There are three main types of database, and each has its own benefits. Choosing the correct type of database when designing a database-backed system is essential if it is to be successful.

Standalone

A standalone database has all its files stored in the local file system, and the application designed to use the data also resides locally. These applications are meant for single-user access, so only one person can access the data at a time (Doherty and Manning, 1998:536).

File-Share

File share databases are the most basic of networked databases. Each user has a copy of the database application running locally on his or her machine. The database files are all held on a single machine, typically a network file server. The type of network is unimportant, because the file is accessed just like any other file. Whilst in use by someone, the data file is locked to prevent others using it. This means each user must wait their turn to use the file (Doherty and Manning, 1998:536-537).

Client-Server

The Client-Server database or two-tier system is a database that allows multiple users or concurrent sessions. The processing can be distributed amongst both the client and the server, the degree of distribution being dependent on network conditions and traffic. With a simple client-server database, the server only processes one request at a time, and locks the tables during this. While dealing with a request, it can create a queue of other requests received during the processing, for dealing with later (Doherty and Manning, 1998:537).
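The "one request at a time, with a queue for the rest" behaviour can be sketched in Python. This is a toy model rather than the design of any real database server: the dictionary standing in for the tables, the client names and the single worker thread are all assumptions made for the illustration.

```python
import queue
import threading

requests = queue.Queue()        # requests waiting their turn
table_lock = threading.Lock()   # held while one request is processed
table = {"count": 0}            # stand-in for the database tables
log = []                        # order in which requests were served


def server():
    """Serve queued requests one at a time until a None sentinel arrives."""
    while True:
        client = requests.get()
        if client is None:
            break
        with table_lock:        # lock the tables for this request only
            table["count"] += 1
            log.append(client)


worker = threading.Thread(target=server)
worker.start()
for client in ["client A", "client B", "client C"]:
    requests.put(client)        # may arrive while the server is busy
requests.put(None)              # tell the server to stop
worker.join()

print(log)   # ['client A', 'client B', 'client C']
```

Because the queue is first-in, first-out, requests are served in the order they arrived, each one with exclusive use of the tables.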

At the higher end of client-server databases, there are "multitier" systems. Multitier systems sometimes employ "middleware" to carry out a proportion of the processing. The middleware is usually multithreaded, allowing the concurrent processing of multiple requests on multiple databases. This process takes much of the "thought" away from the server program that handles requests to the machine, and consequently speeds up the processing (Doherty and Manning, 1998:537).

GIS

Worboys sums up what a GIS is in the typical manner:

"A Geographic Information System (GIS) is a computer-based information system that enables capture, modelling, manipulation, retrieval, analysis and presentation of geographically referenced data."

(Worboys, 1997:1)

Its design is very similar to that of a database, which is after all its underlying component. However, unlike a database, a GIS is able to represent its data on the screen as maps or other spatially referenced images and objects.

GIS software usually specialises in handling one of two types of spatial data: raster and vector. Raster data is structured as a grid of cells, like a two-dimensional array (see Figure 9 (a)). Vector data, on the other hand, is stored as a series of co-ordinate pairs, joined by straight lines (Figure 9 (b)).
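As a rough illustration, the same square feature can be written down in both forms in Python. The grid size, the co-ordinates and the helper polygon_area are assumptions made for the example and are not taken from Figure 9.

```python
# Raster: a grid of cells held as a two-dimensional array (a list of rows).
# 1 marks a cell covered by the feature, 0 an empty cell.
raster = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Vector: the same square as co-ordinate pairs joined by straight lines.
# The first pair is repeated at the end to close the outline.
vector = [(1, 1), (3, 1), (3, 3), (1, 3), (1, 1)]

# With raster data, measuring area is just a matter of counting cells.
raster_area = sum(cell for row in raster for cell in row)


def polygon_area(points):
    """Area enclosed by a closed ring of points (shoelace formula)."""
    total = 0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


print(raster_area)            # 4
print(polygon_area(vector))   # 4.0
```

The raster form needs sixteen values to describe the square, while the vector form needs only five co-ordinate pairs, although extracting a measurement from the vector form takes a little more calculation.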

Computers handle raster data very well, as all modern programming languages can handle arrays (Worboys, 1997:16). This makes processing this data fairly simple, and quite complex operations can be performed on it, though co-ordinate transformations can be time-consuming (Burrough and McDonnell, 1998:70).

However, raster data can take up very large volumes of storage space. Reducing this volume by making the grid cells larger results in loss of data through reduced spatial resolution (Burrough and McDonnell, 1998:70).
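The trade-off can be demonstrated with a small Python sketch. The function coarsen and its merging rule (a merged cell is filled if any of its original cells was filled) are just one possible choice, used here for illustration.

```python
def coarsen(grid, factor):
    """Merge factor x factor blocks of cells into single cells.

    A merged cell is 1 if any cell in its block was 1.
    Assumes the grid dimensions are exact multiples of factor.
    """
    rows = len(grid) // factor
    cols = len(grid[0]) // factor
    return [
        [
            int(any(
                grid[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fine = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
coarse = coarsen(fine, 2)
print(coarse)   # [[1, 0], [0, 1]]
```

The coarse grid needs four values instead of sixteen, but the single filled cell in the top left and the fully filled block in the bottom right can no longer be told apart.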

Vector data on the other hand is quite compact to store. Co-ordinate transformations are efficient, and accuracy is always maintained, no matter at what scale it is viewed.

Nevertheless, it is not always an ideal format, as mathematical analysis of the data can be very complex, requiring considerable computing power and time. Graphical display can also be quite time consuming, as this process effectively "rasterises" the data (Burrough and McDonnell, 1998:70).

There are many benefits of using GIS in archaeology. For instance, we could begin by drawing a map illustrating the distribution of Neolithic finds in a defined geographical area. Having got our "map", we may add "layers" showing, for example, relief, water courses, soil types and so on. Aside from the desktop mapping function of GIS, however, it may also be used as an analytical tool. We may have a hunch that all our Neolithic finds would be close to water, let's say within 200 metres. By drawing a "buffer" around the water courses, we could test statistically how likely that is by seeing what proportion of finds occur within our buffer. You could go a step further and suggest that all the find spots are within 200 metres of water, and on the southern slope of any mountains or hills. Again, this can be tested using the buffer first of all, then working out the slope or gradient and direction of all find spots.
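The buffer test can be sketched in a few lines of Python. The stream and find spot co-ordinates below are invented (in metres), the function names are hypothetical, and a real GIS would offer this as a built-in operation. The sketch covers only the first step, the proportion of finds inside the buffer, and not any significance testing.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Position of the closest point along the segment, clamped to its ends.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def within_buffer(point, watercourse, radius):
    """True if point lies within radius of any segment of the watercourse."""
    return any(
        distance_to_segment(point, a, b) <= radius
        for a, b in zip(watercourse, watercourse[1:])
    )


# Hypothetical data, co-ordinates in metres.
stream = [(0, 0), (1000, 0), (1000, 1000)]
finds = [(100, 150), (500, 250), (900, 50), (1150, 600)]

inside = [f for f in finds if within_buffer(f, stream, 200)]
proportion = len(inside) / len(finds)
print(proportion)   # 0.75
```

Here three of the four find spots fall inside the 200 metre buffer. Whether that proportion is higher than chance would need to be checked against the share of the study area that the buffer covers.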

By studying the data in such a way, the archaeologist may come up with a model describing where Neolithic finds are typically found. Using this information, he could then go out and carry out fieldwork in areas that are statistically more likely to be of interest to him.

THE WWW

The World Wide Web grew out of a project initiated by the American "Defence Advanced Research Projects Agency" (DARPA) in 1969. The objective was to develop communication protocols which would allow networked computers to communicate transparently across multiple, linked packet networks (Cerf, 1998).

The network developed, called ARPANET, was so successful that it was used for daily data communications between the attached organisations. In 1975 it changed from being experimental to operational, and TCP/IP protocols (Transmission Control Protocol / Internet Protocol) were formalised to enable all machines connected to the network to talk to each other (Hunt, 1994:2).

The widespread development of LANs, PCs and workstations in the 1980s allowed the Internet to flourish, but also introduced a few problems. TCP/IP relies on every machine having a unique IP address, and this could have been a great hindrance to the popularity of the Internet. Just imagine what it would be like having to refer to every computer on the Internet by its IP address: 128.66.12.1 or 192.178.16.66, for example!

A "Name Server" established at the University of Wisconsin in 1983, and the introduction of the "Domain Name System" (DNS) in 1984 (Zakon, 1998) managed to defeat this problem. DNS allows the Internet addresses we are familiar with such as "www.bham.ac.uk" to be resolved into their IP addresses by a DNS server, allowing us to refer to all machines by a host and domain name.

In 1990, ARPANET ceased to exist, but because of the networks that had grown up around it, the users barely noticed. In 1993, Mosaic, the first widely used graphical web browser, was released. This allowed easy "web surfing" through the use of "point and click" navigation, and was the foundation of the web interface everyone is familiar with today.

AVAILABLE 'GIS ON-LINE' SOFTWARE

Once the web had established itself in the daily life of academia, it wasn't long before it started to provide access to GIS applications running on remote servers. The first true web-based GIS were typically based on GRASS (Geographic Resource Analysis Support System). The U.S. Army Construction Engineering Research Laboratory (USA-CERL) developed GRASS in 1982, and maintained and developed it until very recently. Today, Baylor University, USA, maintains and develops it through a dedicated research group (Byars, Neteler, Clamons and Cherry, 1998).

GRASS is a UNIX-based GIS. It is a raster-based GIS, with image processing and graphics production capabilities, though it does also have limited vector capabilities. Because web servers traditionally ran on a UNIX-based OS, it was quite simple to interface GRASS with the WWW using a simple web interface based on standard HTML form elements. A Perl script (discussed later) could glue the programs together and also route any raster image output from GRASS through a GIF or JPEG conversion utility and then back out to the client.

Today, you can still find web-based GISs using GRASS as the backend (for example REGIS GRASSLinks, at http://www.regis.berkeley.edu/grasslinks/). However, with the wide-ranging choice of server operating systems, and the abundance of Desktop Mapping/GIS applications, "Internet Map Server" software (to borrow a title from ESRI) has flourished.

For example, there is MapXtreme from MapInfo, MapGuide from AutoDesk, and ArcView Internet Map Server from ESRI. Two of these products, MapXtreme and ArcView IMS, produce raster images for download (see Figure 10). This means all the processing has to be carried out on the server. This is fine if there is not too much traffic, or if requests are often identical, allowing cached versions to be served, but for busy sites or users who may want to continually "tweak" their maps, MapGuide from AutoDesk is perhaps the most suitable. MapGuide is a vector-based GIS web browser plug-in. Before using the service, the client must first download the plug-in, which is about 2 megabytes in size. Having done that, they connect to the server using a standard HTTP URL, and browse data using a typical GIS interface. Most of the basic GIS functions are available through custom menus and toolbars, and layer control, colour scheme, and scale can all be altered at the client end. This can provide a remarkable saving in server workload by transferring it to the client.