What is knowledge discovery?
GIS and knowledge discovery (KD), the latter also known as data mining
(DM), are considered by many not only as technologies but also as
sciences or even "arts." KD helps in detecting patterns and extracting
significant, previously unknown information from databases. For many
years, statisticians manually mined databases looking for statistically
significant patterns. This operation can now be performed more or less
automatically.
KD overlaps with predictive analytics since it is also a business
intelligence tool for predicting future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. This
predictive information can be easily overlooked or underestimated even
by experts. Although the broad meaning of knowledge discovery refers
more to traditional statistical methods, its narrow definition
emphasizes such issues as automated methods, artificial intelligence,
or machine learning techniques.
As technologies, both GIS and KD emerged about 15 years ago (Ed note:
as mainstream applications) and their origins were stimulated by
progress in computer technology, such as employing computer graphics
and dealing with massive databases. KD usually deals with a large
number of attributes, whereas GIS deals with a large number of GIS
features (records). As a science, KD is a part of applied mathematics
or statistics. Practitioners of both GIS and KD tend to be
interdisciplinary; they have developed their own specific methods and
specialized tools, and have attempted to construct their own
methodologies. Finally, KD and GIS can be considered "arts" because
they require some level of technical proficiency and competence in the
application domain area.
Emerging GIS and knowledge discovery
As a computer technology, GIS is characterized by the heavy use of
algorithms representing computational geometry (such as the polygon
intersection algorithm) and topological operations. GIS also deals with
relatively large and complex objects such as polygons with high fractal
dimensions, polygons with attached topological information, networks
with their attributes (for example, addresses), and large index tables.
GIS utilizes spatial data structures and corresponding algorithms for
storing and indexing spatial data. GIS is a synergetic technology
because it represents much more than just the sum of its components.
This synergy can be even more obvious because GIS, being itself a very
powerful technology, benefits from integration with other technologies
such as KD, customer relationship management (CRM), or enterprise
resource planning (ERP).
Both GIS and KD technologies emerged partially as a result of the
abundance of data and the inability of traditional technology to
process information efficiently. For both, progress in computer
technology was critical, including advances in data structures,
database management, computer graphics and artificial intelligence.
Another key factor was the interdisciplinary nature of GIS and KD. In
the early stages, the main contributors to GIS were geographers,
computer scientists, foresters, land surveyors and military personnel.
In KD, the main contributors were statisticians, computer scientists,
marketers, quality controllers and medical specialists.
The early developmental stage of these two technologies (1970s) was
focused on data collection with retrospective and static data delivery.
Enabling technologies were mainframe computers or digitizing tables. In
GIS, an example of a typical question was "What is the forest stand
type in a given polygon?" whereas in knowledge discovery, a typical
question could be "What was the total revenue in the last three years?"
The next stage in developing these technologies (1980s) focused on data
access. GIS could answer questions like "Where is the most suitable
moose habitat?" by providing retrospective and dynamic data delivery at
a feature level. The enabling technologies were vector topology, raster
data structures and database management systems. The major applications
were found in geology, environmental sciences and government. In KD, a
question like "What were unit sales in the Maritimes last April?" could
be answered using retrospective and dynamic data delivery at a record
level and such enabling technologies as relational database management
systems, structured query language (SQL), or open database
connectivity (ODBC).
In the 1990s, the focus in GIS was data modeling and analysis. Such
questions as "What are the changes in the forest cover in a given
area?" could be answered using retrospective and dynamic data delivery
at multiple levels. The enabling technologies were vector/raster
integration, GPS, SQL, interoperability and portable computers. The
major users were corporations, municipalities and educational
institutions. KD was focused on data warehousing and decision support.
Questions like "What were unit sales in the Maritimes last April? Drill
down to Halifax, Nova Scotia," could be answered using retrospective
and dynamic data delivery at multiple levels. The enabling technologies
were online analytical processing (OLAP), data warehouses and portable
computers.
Today, GIS is focused on the deployment of geographical information by
answering such questions as "How do I get to the closest restaurant?"
Data delivery is proactive and prospective. Enabling technologies
include location-based services, Internet mapping, and geodatabases.
The major users come from communications, business and the general
public. KD now centers on data mining, with typical questions such as
"What is likely to happen to Halifax unit sales next month and why?"
Data delivery is also prospective and proactive. The enabling
technologies include distributed algorithms and databases,
multiprocessor computers and massive databases. Further progress in
knowledge discovery may result from developing query languages for
spatial knowledge discovery, mining under uncertainty, and using
parallel knowledge discovery (Koperski, 1997).
Both GIS and KD deal with massive databases. Can they be used for
handling the problem of getting too much information? As discussed by
Kantardzic (2003), 61% of managers believe that information overload is
present in their own workplace; 80% of them believe the situation is
getting worse; over 50% of managers ignore data in current
decision-making processes because of information overload; 84% of
managers do not use this information immediately but store it for
future use and 60% believe that the cost of gathering information
outweighs its value.
The list of applications in GIS and KD is very extensive, and it is
impossible to find "the most typical" one. Therefore, both technologies
can be considered domain-free; they can be applied in practically any
domain. GIS and KD are also scale-free technologies, since they can be
applied at many different scales. There are examples of using GIS for
mapping a human eye and for analyzing changes at a global or even
cosmic scale. Similarly, KD is used at a micro-scale level (for
diagnosing a single patient) and at a macro-scale level (for
international analyses).
Mining geographical information
The most generic components of GIS are:
1. Data input
2. Data manipulation
3. Analysis and modeling
4. Data output.
There are similarities between these components and the KD phases
identified within the cross-industry standard process for data mining
(CRISP-DM, referred to here as CRISP) methodology. CRISP is a general
KD protocol developed in the late 1990s and is similar to the product
life cycle methodology developed in software engineering and
implemented in managing GIS projects. The CRISP protocol consists of
six phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment.
According to the CRISP protocol, the business understanding phase is
composed of the following tasks: determining business objectives
(background, business objectives and business success criteria),
assessing the situation (making an inventory of resources and
requirements; analyzing assumptions and constraints, risks and
contingencies, costs and benefits), determining knowledge discovery
goals and success criteria, and producing a project plan that includes
an assessment of tools and techniques.
Despite these similarities, the nature of KD and GIS leads to some
substantial differences between these technologies. KD operates in
multidimensional abstract space, whereas GIS acts mainly in
geographical space. Hypotheses in KD are generated by machine learning,
while in GIS hypotheses are constructed by users. Results of analysis
in KD often go beyond the content of a database. In GIS, there are
difficulties in mapping multivariate dependencies.
Integrating GIS and KD
GIS and KD are both synergetic, powerful, dynamic and rapidly
developing technologies. There are numerous areas where GIS and KD
already overlap, and the process of integrating them has begun.
However, further integration can significantly benefit both
technologies.
Benefits to GIS of Integrating with KD
GIS can benefit from being integrated with KD by using more efficient
data manipulation tools, specialized exploratory data analysis (EDA)
tools, powerful new modeling tools and better visualization tools.
Data manipulation tools represent the primary area within KD. These
tools are also important, but not critical, in GIS, since the
manipulation of non-spatial attributes can always be performed outside
GIS. Experts agree that data cleansing is one of the most time- and
cost-consuming operations within GIS projects. It would be very
beneficial if GIS could incorporate more sophisticated data
manipulation tools for such common operations as detecting and
replacing missing data, improving attribute accuracy, handling
inconsistency in databases, intelligent data reclassification, merging
attributes and appending records, and filtering data.
EDA tools were introduced to GIS about 10-15 years ago directly from
KD. Since then, EDA has been used as the very first step in any spatial
analysis completed with GIS. KD provides more specialized EDA tools for
such operations as outlier analysis, testing normality, and analyzing
distributions with boxplots and Q-Q plots.
KD also offers numerous powerful modeling tools that are not yet
available in GIS, such as decision trees and decision rules,
association rules, artificial neural networks and genetic algorithms.
Some KD tools, including fuzzy logic and clustering, are already
partially implemented in some GIS packages.
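As an illustration of the EDA-style outlier analysis mentioned above,
the following is a minimal Python sketch of the common boxplot rule
(values more than 1.5 interquartile ranges outside the quartiles are
flagged). The attribute name and the sample values are hypothetical,
and the 1.5 multiplier is a convention rather than something prescribed
by any particular KD package.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the boxplot rule."""
    # Quartiles computed with linear interpolation between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical attribute values attached to eight GIS features.
population_density = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 14.7, 98.3]
lower, upper, outliers = boxplot_outliers(population_density)
print(f"fences: {lower:.3f} to {upper:.3f}; flagged values: {outliers}")
```

Flagged values could then be inspected on a map before deciding whether
they are data errors or genuine local anomalies.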
Visualization tools play a critical role in mapping spatial attributes
and enabling the art of cartography in GIS. They also play a very
important role in KD, where the focus is primarily on statistical
charts and graphs.
Benefits to KD in Integrating with GIS
KD can benefit from being integrated with GIS at various stages of its
own CRISP methodology, particularly in data preparation, analysis,
evaluation and deployment.
Data preparation represents a critical component in both KD and GIS.
Geographically referenced attributes are very common within databases
being analyzed using KD. However, when using KD technology alone, many
typical operations on spatial attributes cannot be performed at all.
GIS can provide tools for such operations as spatial referencing,
geocoding or building topological relationships among objects. GIS can
also be very useful in expanding the number of attributes available for
further analysis by deriving new ones. New attributes can be derived
from geographical (metric) information or from topological information.
Derived geographical (metric) attributes include lengths of lines,
areas of polygons, distance to the closest object, directions, and
density of features per unit area. Derived topological attributes
include the connectivity of nodes, adjacency of polygons, and
information resulting from such topological operations as inside,
within, intersects, contains, covers and others.
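To make the idea of derived metric attributes concrete, here is a small
Python sketch that computes three of them (line length, polygon area
and distance to the closest object) and attaches them to a record that
could be passed to a KD tool. It assumes planar (projected) coordinates;
the geometries, the attribute names and the helper functions are
hypothetical and stand in for what a GIS package would normally provide.

```python
import math


def line_length(points):
    """Total length of a polyline given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def polygon_area(ring):
    """Area of a simple polygon (shoelace formula); ring is not closed."""
    n = len(ring)
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def nearest_distance(point, targets):
    """Distance from a point to the closest of several target points."""
    return min(math.dist(point, t) for t in targets)


# Hypothetical features in projected coordinates (for example, metres).
parcel = [(0, 0), (4, 0), (4, 3), (0, 3)]
road = [(0, 0), (3, 4), (3, 10)]
stores = [(10, 0), (4, 6)]
parcel_center = (2.0, 1.5)

# New attributes that a KD tool could use as additional predictors.
record = {
    "area": polygon_area(parcel),
    "road_length": line_length(road),
    "dist_to_store": nearest_distance(parcel_center, stores),
}
print(record)
```

Topological attributes such as adjacency or containment would normally
be derived with a GIS or a geometry library rather than by hand.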
Modeling and analysis are the most powerful components in both KD and
GIS, and the technologies are complementary in their approach to
modeling. GIS provides more specialized spatial analysis tools, whereas
KD provides more statistical analysis tools. KD lacks numerous
geographical analytical tools from the domain of GIS, including spatial
statistics tools (e.g., spatial multiple linear regression), spatial
analysis tools (e.g., spatial autocorrelation), geostatistical tools
(e.g., kriging or trend surface analysis), network analysis tools
(e.g., the optimal path or minimal tour), surface analysis tools (e.g.,
visibility analysis), numerous location-allocation modeling tools
(e.g., allocating demand to a given center), and regionalization tools
(e.g., spatial clustering).
Evaluation is a required step in the KD protocol, whereas in GIS
evaluation is a recommended step rather than a strictly enforced
standard. However, GIS itself offers invaluable evaluation tools for
mapping residuals (the difference between actual and predicted values)
or analyzing the spatial autocorrelation of residuals.
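The spatial autocorrelation of residuals can be sketched with global
Moran's I, one widely used measure. The Python example below is only an
illustration under stated assumptions: the four regions, their actual
and predicted values, and the binary adjacency weights are hypothetical,
and the article does not say which autocorrelation statistic a given
GIS package uses.

```python
def morans_i(values, weights):
    """Global Moran's I for values and spatial weights {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    total = sum(d * d for d in dev)
    return (n / s0) * (cross / total)


# Hypothetical model output for four regions arranged in a row: 0-1-2-3.
actual = [10.0, 12.0, 20.0, 22.0]
predicted = [13.0, 14.0, 18.0, 19.0]
residuals = [a - p for a, p in zip(actual, predicted)]

# Binary, symmetric adjacency weights between neighbouring regions.
weights = {}
for i, j in [(0, 1), (1, 2), (2, 3)]:
    weights[(i, j)] = 1.0
    weights[(j, i)] = 1.0

i_resid = morans_i(residuals, weights)
print(f"Moran's I of residuals: {i_resid:.3f}")
```

A clearly positive value, as here, suggests that over- and
under-predictions cluster in space, which is a hint that the model is
missing a geographically structured predictor.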
Finally, in regard to the deployment phase, GIS provides mapping tools
that are non-existent in standard KD. These tools, used for mapping
results, can enhance the deployment phase in KD.
Enhancing geographical analysis with KD modeling tools
There are three basic groups of standard modeling tools provided by
knowledge discovery: predictive, rule-based and classification tools.
Predictive tools usually include neural networks, multiple linear
regression, logistic regression, and C5.0 rule-based methods (for
categorical target variables and categorical or numerical predictors).
The rule-based tools consist of the same C5.0 algorithm, classification
and regression trees, association rules, Apriori (for categorical
target variables and predictors), and generalized rule induction (GRI)
algorithms. Finally, the classification tools include such algorithms
as K-Means clustering, Kohonen networks and two-step clustering. The
purposes and results of these modeling tools, as well as their
usefulness for geographical analysis, are discussed below.
The purpose of neural network modeling is to predict a numeric or
categorical target variable. The output includes predicted values,
residuals (actual minus predicted values), and corresponding rules.
With GIS, the actual target variable, its predicted values, residuals
(Figure 1), and rules can be mapped and interpreted.
The purpose of rule induction modeling using the C5.0 algorithm is to
predict a categorical target variable. The output consists of the
importance of predictors, predicted values and residuals. Maps of the
actual and predicted target variables and of the residuals can be
created and analyzed within GIS.
Multiple linear regression is used for predicting a numerical target
variable using numerical predictors. The output from the regression
includes the selected set of predictors, the predicted target variable
and residuals. Maps of the actual target variable, the predicted target
variable and residuals cannot be produced and analyzed within standard
KD alone; this is where GIS technology can be beneficial. The
difference between this tool and logistic regression is that the latter
can predict a categorical target variable using categorical and
numerical predictors. The output and possible maps for logistic
regression are similar to those from multiple linear regression (Figure
2).
Figure 1 - Predicting GDP per capita with neural network: absolute
residuals
Figure 2 - Predicting GDP per capita with logistic regression (actual
vs. predicted values)
The purpose of generating rules within KD is to better understand the
analyzed data by finding patterns and rules governing them. The basic
algorithms are C5.0 (for categorical target variables and categorical
or numerical predictors), Apriori (for categorical target variables and
predictors) and GRI (for categorical target variables and categorical
or numerical predictors). The output consists of rules for groups of
records, including their frequency and accuracy. The geographical
distribution of rules can be mapped and analyzed with GIS.
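The frequency and accuracy of a rule correspond roughly to what the
association-rule literature calls support and confidence. The Python
sketch below evaluates a single, hand-written rule over hypothetical
records; it is not an implementation of C5.0, Apriori or GRI, which
search for such rules automatically. The field names and values are
assumptions made for the example.

```python
def rule_stats(records, antecedent, consequent, key="region"):
    """Return (support, confidence, covered keys) for IF antecedent THEN consequent."""
    # Records that satisfy the IF part of the rule.
    fires = [r for r in records if antecedent.items() <= r.items()]
    # Records that satisfy both the IF part and the THEN part.
    holds = [r for r in fires if consequent.items() <= r.items()]
    support = len(holds) / len(records)
    confidence = len(holds) / len(fires) if fires else 0.0
    return support, confidence, [r[key] for r in holds]


# Hypothetical records, one per mapped region.
records = [
    {"region": "A", "land_use": "urban", "income": "high"},
    {"region": "B", "land_use": "urban", "income": "high"},
    {"region": "C", "land_use": "urban", "income": "low"},
    {"region": "D", "land_use": "rural", "income": "low"},
    {"region": "E", "land_use": "rural", "income": "low"},
]

support, confidence, regions = rule_stats(
    records, {"land_use": "urban"}, {"income": "high"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}, regions={regions}")
```

The returned region identifiers are what would be joined to a GIS layer
in order to map where the rule holds.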
The purpose of clustering is to group records into clusters using some
of the available algorithms such as Kohonen networks, K-Means, or
two-step clustering. The output includes the cluster memberships,
cluster description, and for the K-Means algorithm, the distance to
cluster centroids. At least two types of maps can be created to show
the geographical distribution of clustering: maps of clusters (Figure 3
showing cluster memberships) and maps of the most typical features for
each cluster.
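The K-Means output described above (cluster memberships plus distance
to the cluster centroid) can be sketched in plain Python as follows.
This is a minimal version of the standard Lloyd iteration with a
deterministic farthest-first initialization, which is an assumption of
this example and not necessarily what any KD package does. The six
records and their two standardized attributes are hypothetical.

```python
import math


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then the farthest ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, distance of each point to its centroid)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    distances = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, centroids, distances


# Hypothetical standardized attributes for six countries.
names = ["A", "B", "C", "D", "E", "F"]
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
labels, centroids, distances = kmeans(points, k=2)
for name, lab, d in zip(names, labels, distances):
    print(name, lab, round(d, 3))
```

The membership column can be mapped directly as a categorical theme,
while the distance column indicates how typical each feature is for its
cluster.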
Figure 3 - Clusters of countries (K-means algorithm)
Factor analysis and principal component analysis are used to reduce the
number of variables by replacing individual variables with factors or
components. These methods produce a list of extracted factors or
components, values of correlations between variables and factors or
components, and factor/component scores. In GIS, the analysis of maps
showing the geographical distribution of factor/component scores can
provide new and very valuable information that is not available within
KD alone.
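As a rough illustration of how component scores arise, the Python
sketch below extracts only the first principal component by power
iteration on the covariance matrix and then computes a score per
record. This is a simplification chosen for brevity (statistical
packages typically compute all components with an eigen or singular
value decomposition), and the two variables and five records are
hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit loading vector, component score per row) for PC1."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration; assumes the start vector is not orthogonal to PC1.
    vec = [1.0] * m
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centered]
    return vec, scores


# Hypothetical records: two correlated variables for five regions.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
loadings, scores = first_principal_component(rows)
print([round(x, 3) for x in loadings])
print([round(s, 3) for s in scores])
```

Each score belongs to one record, so joining the scores to the
corresponding features gives the map of component scores discussed
above.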
Classification tree modeling is another standard modeling tool used in
KD for picking individual predictors one at a time and classifying them
in order to optimize (minimize or maximize) a predicted value of a
target variable. This tool utilizes one of many possible algorithms,
including the classification and regression tree, chi-square automatic
interaction detector (CHAID), exhaustive CHAID, or QUEST (quick,
unbiased, efficient statistical tree). Modeling with the classification
tree method provides the list of top predictors and groups of similar
records following the same classification rule. In GIS, the spatial
distribution of rules can be analyzed and mapped (Figures 4 and 5).
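The core step of a classification tree, choosing the predictor and
threshold that best separate the target classes, can be sketched as
below. The example uses the Gini impurity in the style of
classification and regression trees and builds only a single split;
CHAID and QUEST use different (statistical-test-based) criteria, so
this should be read as one illustrative variant. The records and field
names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(records, predictors, target):
    """Return (weighted impurity, predictor, threshold) of the best binary split."""
    best = None
    for name in predictors:
        values = sorted({r[name] for r in records})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r[target] for r in records if r[name] <= threshold]
            right = [r[target] for r in records if r[name] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[0]:
                best = (score, name, threshold)
    return best


# Hypothetical records, one per polygon.
records = [
    {"id": 1, "elevation": 100, "rainfall": 800, "cover": "forest"},
    {"id": 2, "elevation": 150, "rainfall": 400, "cover": "forest"},
    {"id": 3, "elevation": 200, "rainfall": 900, "cover": "forest"},
    {"id": 4, "elevation": 600, "rainfall": 500, "cover": "grass"},
    {"id": 5, "elevation": 700, "rainfall": 850, "cover": "grass"},
    {"id": 6, "elevation": 800, "rainfall": 450, "cover": "grass"},
]

best = best_split(records, ["elevation", "rainfall"], "cover")
score, name, threshold = best
# Which side of the rule each polygon falls on, ready to be joined to a map layer.
leaf = {r["id"]: ("left" if r[name] <= threshold else "right") for r in records}
print(best, leaf)
```

A full tree would apply the same search recursively inside each
resulting group until a stopping criterion is met.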
Figure 4 - Classification tree
Finally, OLAP cubes represent another standard analytical tool in KD.
OLAP cubes are used for querying, browsing and summarizing tabular
information in a very efficient, interactive and dynamic way. The basic
operations with OLAP cubes include slicing, dicing, rolling up,
drilling down, and pivoting. The issue of integrating OLAP cubes with
GIS was discussed in my article titled "Creating and Manipulating
Multidimensional Tables with Locational Data Using OLAP Cubes".
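Three of the OLAP operations named above (slicing, rolling up and
drilling down) can be imitated on a small fact table in Python. This is
only a conceptual sketch: real OLAP servers precompute and index such
aggregates, and the sales figures, dimension names and helper functions
here are hypothetical.

```python
from collections import defaultdict


def slice_facts(facts, **fixed):
    """Slice: keep only facts whose dimensions equal the given values."""
    return [f for f in facts if all(f[k] == v for k, v in fixed.items())]


def roll_up(facts, dims, measure="units"):
    """Roll up: total the measure over every combination of the given dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)


# Hypothetical unit sales with a two-level location hierarchy (region > city).
facts = [
    {"region": "Maritimes", "city": "Halifax", "month": "April", "units": 120},
    {"region": "Maritimes", "city": "Moncton", "month": "April", "units": 80},
    {"region": "Maritimes", "city": "Halifax", "month": "May", "units": 130},
    {"region": "Ontario", "city": "Toronto", "month": "April", "units": 400},
]

april = slice_facts(facts, month="April")
by_region = roll_up(april, ["region"])
# Drill down: the same slice, aggregated at the finer city level.
maritimes_by_city = roll_up(slice_facts(april, region="Maritimes"), ["city"])
print(by_region)
print(maritimes_by_city)
```

Because the location dimension carries place names, the aggregated rows
at either level can be joined to the matching GIS layer and mapped.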
Figure 5 - Map corresponding to the Figure 4 classification tree
Spatial knowledge discovery resources
Numerous efforts have been made to integrate GIS and KD. Significant
attempts at developing spatial KD have taken place at such American
universities as the University of Utah, Southern Illinois University
and Boston University. Other research centers where similar research
has been conducted include Simon Fraser University (Canada), the
University of Leeds (England), the University of Munich (Germany), the
University of Bari (Italy) and the Russian Academy of Sciences. Spatial
KD software packages have also been developed, including GeoMiner and
Spin!
GeoMiner is a prototype of a spatial KD system based on a spatial
database server. Spin! (short for Spatial Mining for Data of Public
Interest) represents a Web-based integration of KD and GIS for
applications such as public health, environmental protection,
seismology or marketing. This European product includes live
Oracle-based queries and data visualization.
Final remarks
Today, GIS and KD are still used as separate technologies. If someone
is using both, and both software packages run on the same operating
system, data can be passed easily (but still indirectly) between them.
The idea of interoperability, developed in GIS in recent years, should
be extended beyond GIS technology in order to establish links with
other business intelligence technologies such as KD, CRM or ERP. Right
now, the most typical sequence of operations encountered while using
GIS and KD tools is a mixture of both, as shown below.
1. Data preparation including data cleansing (KD)
2. Deriving new geographical attributes (GIS)
3. Spatial analysis (GIS)
4. Modeling (KD)
5. Validation (KD)
6. Mapping initial results and spatial validation (GIS)
7. Charting and interpreting results (KD)
8. Mapping final results (GIS)
Further integration of GIS and KD should focus on using spatial
object-oriented and spatiotemporal databases, creating multidimensional
spatial rules, integrating artificial intelligence and GIS, and spatial
clustering. As an emerging discipline, spatial KD should also include
visualization with multivariate thematic maps, mining remote sensing
data, and maintaining the consistency and quality in spatial databases
(topological and geometric errors).
Selected bibliography
1. CRISP-DM 1.0, 1999. SPSS.
2. Dramowicz K., 2002. Adding Geography to Data Mining. Data Mining
Summit, Reston, VA.
3. Dramowicz K., 2005. Creating and Manipulating Multidimensional
Tables with Locational Data Using OLAP Cubes. Directions Magazine,
January 15, 2005.
http://www.directionsmag.com/article.php?article_id=733
4. Dramowicz K., 2005. Geographic Dimension in Data Mining. ESRI
Business GeoInfo Summit, April 18-19, Chicago, Illinois.
5. Dunham M.H., 2003. Data Mining: Introductory and Advanced Topics.
Prentice Hall.
6. Eklund P.W. et al., 1998. Data Mining and Soil Salinity Analysis.
International Journal of Geographical Information Science, 12,
pp. 247-268.
7. Ester M. et al., 1998. Spatial Data Mining: Database Primitives,
Algorithms and Efficient DBMS Support. Data Mining and Knowledge
Discovery, 4, 2/3, pp. 193-216.
8. Gahegan M., 2000. On the Application of Inductive Machine Learning
Tools to Geographical Analysis. Geographical Analysis, 2, pp. 113-139.
9. Kantardzic M., 2003. Data Mining: Concepts, Models, Methods, and
Algorithms. Wiley.
10. Koperski K. et al., 1997. Spatial Data Mining: Progress and
Challenge.
11. Koperski K., J. Han, 1995. Discovery of Spatial Association Rules
in Geographic Information Databases. [In:] Egenhofer M., J. Herring
(eds.) Advances in Spatial Databases. Springer-Verlag, pp. 47-66.
12. Miller H.J. and J. Han (eds.), 2001. Geographic Data Mining and
Knowledge Discovery. Taylor and Francis.
13. Openshaw S., 1999. Geographic Data Mining: Key Design Issues. 4th
International Conference on GeoComputation.