Predictive Modelling and the Existing
Archaeological Inventory in British Columbia


Predictive Modelling

Predictive modelling for archaeological site locations has been a topic of concern to managers for over a decade (e.g., Darsie, Keyser and Hackenberger 1985). The subject is extremely complex and will only be dealt with briefly here. Originally, predictive models generally used univariate statistics to correlate site locations with multiple environmental variables. These models were used to help select areas for survey when total survey was impossible or impractical (Hollenbeck 1985). This form of analysis has been used, at least by a few archaeologists, since the early 1970s (Plog and Hill 1971, referenced by Judge et al 1975:94).

Geographical Information Systems (GIS)

The advent of geographic information systems (GIS) has revolutionized the ability to analyze spatial relationships and produce predictive models. The ability to transform a map layer into a new data type (for instance, converting contours into slope and aspect layers) and to combine different data sets into new map covers allows much more sophisticated mathematical models to be generated. A typical example might begin by dividing the study area into small cells (often about 25 m square) and applying a univariate statistical test to determine whether there are significant correlations between archaeological site locations and environmental variables. It would then move to the very sophisticated logistic regression analysis on variables selected to give the best fit with archaeological sites. Each cell is evaluated for its overall probability of containing a site, given its position on the regression line correlating the presence of a site with the presence of, or distance to, an environmental variable. A probability surface is produced. A "cutoff point", usually about 50%, is chosen as a decision boundary to determine whether a cell is predicted to contain a site. Finally, a map is produced predicting site presence or absence (e.g., Warren 1990).
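The last steps of this workflow (probability surface, cutoff, presence/absence map) can be sketched in a few lines. This is a minimal illustration only, assuming a single environmental variable (distance to water) and invented regression coefficients. The names INTERCEPT, SLOPE_PER_M and site_probability are hypothetical and are not taken from any of the cited studies.

```python
import math

# Hypothetical coefficients for illustration only; a real model would
# estimate these by logistic regression on surveyed cells.
INTERCEPT = 1.5
SLOPE_PER_M = -0.01   # assumed: probability falls with distance to water
CUTOFF = 0.5          # the "cutoff point" used as the decision boundary


def site_probability(distance_to_water_m):
    """Logistic function: modelled probability that a cell contains a site."""
    z = INTERCEPT + SLOPE_PER_M * distance_to_water_m
    return 1.0 / (1.0 + math.exp(-z))


# A tiny grid of cells, each holding its distance to water in metres.
distance_grid = [
    [25, 75, 160],
    [50, 200, 400],
    [300, 500, 800],
]

# Probability surface: one probability per cell.
probability_surface = [[site_probability(d) for d in row] for row in distance_grid]

# Presence/absence map: apply the cutoff to every cell.
predicted_presence = [[p >= CUTOFF for p in row] for row in probability_surface]

for row in predicted_presence:
    print(" ".join("S" if cell else "." for cell in row))
```

With these assumed coefficients the decision boundary falls at 150 m from water, so only the three cells closer than that are mapped as predicted site locations.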

It is perhaps fair to say that the products of GIS predictive modelling have not lived up to their promise. Some attempts at predictive models have achieved a predictive accuracy only slightly better than chance. For instance, Warren (1990) provides an example of correctly predicting 67% of the sites occurring on 61% of the land base. This result is not impressive when it is remembered that the sites occur over only a small part of the land and that any model can be 100% accurate by predicting site potential in 100% of the land area. In a better case, Carmichael (1990:221) creates a model that correctly identifies 72% of the sites using approximately 45% of the land base in a large area in Colorado. The first successful application of GIS in archaeological predictive modelling using logistic regression achieved a correct classification of 76% of the sites using 40% of the land base (Kvamme 1984, referenced in Carmichael 1990). It should be noted that many of these models are handicapped by poor 1:250,000 scale maps for their base environmental data, and local features critical to site location may not be mapped. Even so, as Altschul (1990:227) comments in a stimulating article:

A model is judged to be successful if it correctly predicts where sites will and will not be located 80 to 90% of the time. While accuracy is important, it is not necessarily a very useful measure. Predictive models in CRM have generally been based on the observation that in most regions sites tend to cluster around certain environmental features. This observation is not new; indeed, it has been part of most field archaeologists' baggage for over a century. Sophisticated models that capitalize on this fact may be accurate, but by and large do not tell us anything we did not already know.

While acknowledging Altschul's comments, the situation in B.C. is that we do not know where the sites are even likely to be in many parts of the province, simply because of inadequate survey coverage. It is worthwhile to compare the GIS accuracy with predictive models/potential maps produced by other means. As previously discussed, the 1:250,000 scale map of the Cariboo Forest Region predicted that almost 100% of the known sites would occur in about 25% of the land base (Bussey and Alexander 1992). This map is based on a model which correlates mapped but ethnically significant environmental zones with ethnographic patterns of activity. It would appear to be much more accurate than the GIS models. Another model (also untested by later ground-truth survey) was based on some simple correlations with distance to permanent water (derived from probabilistic survey), terrain classes, quality of fish habitat, and soils/forest cover (derived from Canadian Land Inventory 1:50,000 maps). This model predicted that 75% of the estimated 900-1200 sites in the Dean River valley would occur in only 6% of the land area (Eldridge and Eldridge 1979). It is worthwhile to note that nearly all the Dean River study area has been assigned as having "potential" in the Bussey and Alexander 1:250,000 map, illustrating the difference in accuracy introduced by scale. Although the site distribution patterns may be easier to predict in the Chilcotin region than in the Plains or Southwest, the difference in accuracy is marked. It would be instructive to produce a predictive model on the Dean River data, using techniques similar to those commonly used in the United States.
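One rough way to put the figures quoted above on a common footing is to divide the percentage of sites captured by the percentage of land flagged. A ratio of 1.0 is what chance alone would give. This ratio is an illustrative index chosen here, not a statistic used by the cited authors, and the two B.C. figures rest on known or estimated sites that have not been tested by later ground-truth survey.

```python
# (label, % of sites captured, % of land base flagged), as quoted in the text.
models = [
    ("Warren 1990", 67, 61),
    ("Carmichael 1990", 72, 45),
    ("Kvamme 1984", 76, 40),
    ("Bussey and Alexander 1992", 100, 25),   # text says "almost 100%"
    ("Eldridge and Eldridge 1979", 75, 6),
]


def concentration(sites_pct, land_pct):
    """Sites captured relative to what the same area would hold by chance."""
    return sites_pct / land_pct


for label, sites_pct, land_pct in models:
    ratio = concentration(sites_pct, land_pct)
    print(f"{label:28s} {ratio:5.1f} times chance")
```

On this index the Warren example sits at about 1.1, the other two GIS models between 1.6 and 1.9, and the two B.C. models at about 4 and 12.5.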

Part of the difference between the GIS and simple predictive models may be due to the use of small cell sizes in logistic regression. One of the strengths of logistic regression is that it does not assume a normal distribution for variables and can fit polynomial interaction terms into the model (Warren 1990:212), but models fully using this power have yet to be developed, especially at fine resolution. Archaeological sites tend to have strong negative binomial distributions; that is, they tend to be rare events that cluster rather than being randomly or evenly distributed (Mitchell and Eldridge 1981:18). For predictive purposes, this means that a single large (500 m) cell in a high potential area will have a good chance of containing several sites, while adjacent but peripheral cells have none. The smallest cell size possible is normally chosen for logistic regression in order to maximize the accuracy and resolution of the encoded information (Warren 1992:96). It is considered undesirable for any cell to have more than one parameter for any variable. However, if a small cell size (e.g., 25 x 25 m) is used, a large number of cells in the high potential zone will still be empty, because of the small size of the individual sites. This may result in an apparent reduction in site density. For instance, if a hypothetical 5 quadrats contain 10 small sites in an area encompassing 100 quadrats, then 5% of the 500 m cells contain sites. Using 25 m cells, perhaps 40 cells could contain the same 10 sites, but now, instead of 100 cells, there are 40,000 cells, and the proportion containing sites is only 0.1%. The sites appear to have become much more rare and more difficult to predict precisely. In order to include most sites, a very high number of "false positive" errors must be accepted, and the predictive model appears to be less powerful. More experimentation with real data would be necessary to test the relative advantages of each approach, as no study of these factors could be found in the recent literature.
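The arithmetic of the hypothetical example can be checked directly. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
STUDY_AREA_QUADRATS = 100     # 500 m x 500 m quadrats in the hypothetical area
QUADRATS_WITH_SITES = 5       # quadrats that contain the 10 small sites
SMALL_CELLS_WITH_SITES = 40   # 25 m cells touched by the same sites ("perhaps 40")

# Each 500 m quadrat holds 20 x 20 = 400 cells of 25 m.
cells_per_quadrat = (500 // 25) ** 2
total_small_cells = STUDY_AREA_QUADRATS * cells_per_quadrat

share_large = QUADRATS_WITH_SITES / STUDY_AREA_QUADRATS
share_small = SMALL_CELLS_WITH_SITES / total_small_cells

print(f"25 m cells in study area:     {total_small_cells:,}")
print(f"500 m cells containing sites: {share_large:.1%}")
print(f"25 m cells containing sites:  {share_small:.1%}")
```

The same ten sites occupy 5% of the cells at 500 m resolution but only 0.1% at 25 m, a fifty-fold drop in apparent frequency caused by the change in cell size alone.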

Utility of Existing Data for Predictive Modelling

Many of the inventory studies examined in this study have data useful for predictive modelling. This is particularly true of the probabilistic surveys, but also holds for many of the systematic intensive surveys. All the studies that made detailed, systematic observations are particularly useful for predictive modelling. Many of the large-area probabilistic surveys, such as Dean River, Clinton Ashcroft, or Eagle (Choelquoit) Lake have data which could be used with little or no additional work in sophisticated modelling approaches. The limitations to predictive modelling in British Columbia will come, not from a lack of good archaeological data, but from a lack of mapped environmental data at the appropriate scale.

Many of the British Columbia probabilistic surveys produced simple models of environmental correlates. They often systematically field recorded environmental data such as vegetation cover and topographic features (e.g., Eldridge and Eldridge 1979; Magne 1984). Some reports analyzed these data in relation to site locations, to varying standards of formal analysis (e.g., Beirne and Pokotylo 1979, Pokotylo 1979, REF 7). The Magne (1984) study used cluster and principal components analysis to group quadrats scored for a large number of environmental variables. The cluster analysis grouped all the quadrats containing archaeological sites on the basis of environmental characteristics alone. This cluster analysis effectively produced a predictive model for the study area. However, the environmental data coded in the field have not been mapped for the rest of the study area or adjacent areas, precluding "as is" use of the model.

One of the strengths of the logistic regression technique mentioned earlier is that a probabilistic sample of an area is not strictly necessary. What is needed are enough cells that are known to have sites and enough that are known to not have sites, covering a cross-section of the area to be mapped. Obviously, data that are skewed by biased survey methods will produce a skewed model, but good judgemental surveys can produce reasonably representative samples.
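As a rough illustration of fitting such a model from cells of known status, the sketch below estimates a one-variable logistic regression by plain gradient ascent. Everything in it is an assumption made for illustration: the data are synthetic, the single predictor (distance to water in kilometres) is invented, and the function fit_logistic is a hypothetical helper, not the procedure used in any cited study. A real analysis would use a statistical package and several variables.

```python
import math
import random


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p = 1 / (1 + exp(-(b0 + b1 * x))) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p            # gradient with respect to the intercept
            g1 += (y - p) * x      # gradient with respect to the slope
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


def predict(b0, b1, x):
    """Modelled probability of site presence at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


rng = random.Random(0)  # fixed seed so the synthetic data are repeatable

# Synthetic cells: distance to water in km (invented for this sketch).
site_cells = [rng.uniform(0.0, 1.0) for _ in range(40)]    # known to have sites
empty_cells = [rng.uniform(0.5, 3.0) for _ in range(60)]   # known to have none

xs = site_cells + empty_cells
ys = [1] * len(site_cells) + [0] * len(empty_cells)

b0, b1 = fit_logistic(xs, ys)
print(f"intercept = {b0:.2f}, slope per km = {b1:.2f}")
print(f"p(site) at 0.1 km = {predict(b0, b1, 0.1):.2f}")
print(f"p(site) at 2.5 km = {predict(b0, b1, 2.5):.2f}")
```

The fit needs only cells of both kinds spread across the range of the predictor. Nothing in the procedure requires that the cells were drawn by probabilistic sampling, although a biased selection of cells would bias the coefficients.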

Coastline intensive survey results can be used to produce predictive models which can be extrapolated to unsurveyed or reconnaissance-level surveyed coastlines. The Gulf of Georgia data is a prime candidate for this treatment, not only because the quality of survey is relatively good, but because up to 50 environmental map layers are already digitized and available in the Environmental Emergencies Services Branch's OSRIS system (oil spill response information system).

One problem with using existing data for predictive modelling is that sites are entered in CHIN as point data. If a relatively small cell size is used and the scale of mapping is large, it would be necessary to map sites as polygons. Indeed, it would be advantageous for current management techniques to map larger sites as polygons at 1:50,000 scale. Sites less than 100 m in maximum diameter can be plotted as point data because a manager is likely to flag any project as having potential conflict if it comes within 200 m or so of a known site. A CHIN search for large sites finds 3670 over 99 m in maximum dimension, and only 1292 over 200 m. It would represent only a modest effort to map these sites as polygons on a GIS system or on paper hardcopy maps.
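The rule of thumb in this paragraph can be written down as a small sketch. The 100 m threshold and the 200 m flagging distance come from the text; the function names and the use of simple planar coordinates in metres are assumptions made for illustration.

```python
import math

POLYGON_THRESHOLD_M = 100   # sites this large or larger are mapped as polygons
CONFLICT_BUFFER_M = 200     # "200 m or so" flagging distance from the text


def representation(max_dimension_m):
    """Choose how a site should be mapped, given its maximum dimension."""
    return "polygon" if max_dimension_m >= POLYGON_THRESHOLD_M else "point"


def flags_conflict(site_xy, project_xy):
    """True if a project lies within the flagging distance of a point site."""
    return math.dist(site_xy, project_xy) <= CONFLICT_BUFFER_M


print(representation(40))                    # small site
print(representation(250))                   # large site
print(flags_conflict((0, 0), (150, 100)))    # about 180 m away
print(flags_conflict((0, 0), (300, 0)))      # 300 m away
```

For a site under 100 m across, the 200 m buffer around its point already covers the whole site, which is why point data are adequate for small sites but not for large ones.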

Predicting Numbers of Sites in B.C.

The compilation of data during this study has produced density values which allow predictions of the number of sites in the Province (Tables 1 and 2). The coastline of B.C. is about 30,000 km long (Chan 1993, personal communication). Assuming that half of the coastline is basically unsuitable for archaeological sites, with an average of 0.4 sites per linear kilometre, this half is predicted to hold 6,000 sites. If the other half has higher potential, with an average of 0.8 sites per kilometre, it is predicted to hold 12,000 sites, for a total of 18,000 prehistoric sites along the B.C. coastal strip. Many more must be located inland of the coast, although most are probably at or near the modern shore. A search in CHIN for all prehistoric sites less than 20 m in elevation and more than 123 degrees west longitude (to exclude large numbers of recorded sites in the lower Fraser River) produced a total of 5,070 prehistoric sites, of which nearly 1,500 are in the southern Strait of Georgia. It would appear, therefore, that about 1/4 of the sites along the coastline of B.C. could already be recorded. Predictive modelling would be a valuable tool in aiding management of the unsurveyed, or non-systematically surveyed, areas.
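The coastal estimate follows from a short calculation, reproduced here with the figures given in the text. The variable names are illustrative.

```python
COASTLINE_KM = 30_000           # approximate length of the B.C. coastline
LOW_DENSITY = 0.4               # sites per km on the less suitable half
HIGH_DENSITY = 0.8              # sites per km on the higher potential half
RECORDED_COASTAL_SITES = 5_070  # CHIN search result quoted in the text

half_km = COASTLINE_KM / 2
low_sites = half_km * LOW_DENSITY
high_sites = half_km * HIGH_DENSITY
predicted_total = low_sites + high_sites
recorded_share = RECORDED_COASTAL_SITES / predicted_total

print(f"Low potential half:   {low_sites:,.0f} sites")
print(f"High potential half:  {high_sites:,.0f} sites")
print(f"Predicted total:      {predicted_total:,.0f} sites")
print(f"Share already recorded: {recorded_share:.0%}")
```

The recorded share works out to roughly 28%, which is the basis for the statement that about a quarter of the coastal sites may already be recorded.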

In the interior, site numbers are more difficult to predict. The probabilistic surveys' cumulative predictions of sites within their study areas total 6,336. There are 16,000 prehistoric sites recorded in B.C., so the interior has about 9,000 recorded archaeological sites, not many more than the number predicted just for the probabilistically surveyed areas. For just the Cariboo Forest District, there are 2,155 recorded sites, including historic sites (Bussey and Alexander 1992), but the probabilistic study areas alone are estimated to have 2,783 prehistoric sites. Comparing the total area of red shading on the British Columbia map (Figure 1) to the total area of the interior, it is likely that the study areas represent much less than 1/10 of the total area, so there must be at least ten times the 6,336 sites in the predictive study areas, and perhaps many more. The number of prehistoric sites in British Columbia as a whole is almost certainly more than 100,000. The state of Arkansas, a much smaller area than British Columbia, has almost 20,000 recorded archaeological sites (Farley et al 1990:144).
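The scaling argument for the interior can be sketched as follows. It assumes, as the text implicitly does, that site density in the probabilistic study areas is broadly representative of the interior as a whole. The scale factors of 15 and 20 are illustrative values chosen here to show what "much less than 1/10" would imply; they are not figures from the study.

```python
PREDICTED_IN_STUDY_AREAS = 6_336   # cumulative prediction for the study areas


def interior_estimate(scale_factor):
    """Scale the study-area prediction up to the whole interior.

    scale_factor is the ratio of interior area to study area, e.g. 10 if the
    study areas cover one tenth of the interior.
    """
    return PREDICTED_IN_STUDY_AREAS * scale_factor


for factor in (10, 15, 20):
    print(f"Study areas = 1/{factor} of interior: "
          f"{interior_estimate(factor):,} sites")
```

A factor of ten gives the stated minimum of 63,360 interior sites, and each further reduction in the assumed coverage raises the estimate in proportion.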

