Statistical Analysis Maps

R.W. Oldford, University of Waterloo
S.C. Peters, Apple Computer


Abstract
This paper presents software designed to aid the interactive management of a statistical analysis.  A graphical interface is proposed which allows the analyst to keep track of the analysis and manage it as it is being carried out.  The implementation is in an experimental statistical system, but the design principles apply more generally.  Interactive command history lists and object-oriented programming suggest how new statistical environments can evolve from command line interfaces to graphical ones.  The design is based on modelling a statistical analysis as a collection of acyclic directed networks.  Nodes are `statistical analysis objects' and arcs linking them show how one step in the analysis has led to another.  The network modelling the current analysis is continually displayed and the analysis can be carried out by interacting with elements of the display.  The analysis management tools and key attributes of this software model are discussed.

KeyWords: Statistical computing system, Integrated programming environments, History mechanisms, Graphical interface, Object-oriented programming


1.0  Introduction

The interactive nature of modern statistical systems allows statisticians to undertake ever more involved analyses.  The data can be more fully explored; more models can be considered; alternative approaches can be undertaken.  While this power encourages a more thorough analysis, it also produces an enormous amount of information.  Unorganized, this information can be confusing.  The problem is especially acute when the analysis must be understood, and perhaps continued, by someone other than its original author.

A data structure containing the history of the analysis is essential to its management.  But what form should this data structure take?  What aspects of the analysis should be recorded, and how?  Are simple history lists of commands enough?

In this paper, we present an analysis management system that addresses these questions.  It is built on a history data structure that graphically displays the analysis as it progresses.  The display is a collection of acyclic directed networks which we call an analysis map.  The analyst can interact with it through a command language or more directly through a pointing device like a mouse.  The design is directed towards encouraging the latter style of analysis (Oldford and Peters, 1988, describe an example session).

The proposed history structure evolves quite naturally from simpler history lists that capture the commands of the analysis system.  The interactive statistical system called S (Becker and Chambers, 1984), the LISP programming language, and object-oriented programming are used to illustrate this evolution.

Section 2 provides some necessary background information.  The evolution of history lists is briefly outlined and the information they contain described.  Traditionally, much of this information is based on the time order of the command events.  However, to use only the time order of events would be to miss important inferential information relating the elements of a statistical investigation.  In particular, distinct and temporally independent analyses often shed light on a statistical investigation only when considered together.  By recording only the time order of the analyses, the logical connections between separate analyses are obscured.

To track the inferential connections between these events more faithfully, each is represented as a data structure which can contain pointers to other data structures.  Object-oriented programming provides a simple means to introduce and to manage such logical connections and is therefore also briefly outlined in Section 2.  The connections that are of interest between statistical data structures are described in Section 3.

A software structure called an AnalysisMap is introduced and described in some detail in Section 4.  The AnalysisMap can be regarded as a highly interactive history mechanism which displays the statistical objects of an analysis and the logical connections between them as a directed graph.  It is intended to be an integral part of the analysis environment, which the analyst uses directly to carry out and manage the analysis.
In Section 5, those elements of the AnalysisMap that can be applied to applications other than statistical analysis are identified and a more abstract software structure is introduced that will interactively display any collection of connected objects.  New kinds of displays, which show networks defined by other important connections between statistical objects, are then easily created as specializations of the more general software structure (using the inheritance facility available in object-oriented programming).  Some of these specializations are also described in Section 5.
In the final section, we close by remarking on some of the immediate advantages of this graphical representation of a statistical data analysis and on its potential application in future analysis systems.
2.0 Background
In this section we provide some background on the use of history mechanisms in computing and introduce object-oriented programming.  Two principal conclusions are drawn.  First, interactive statistical systems like S (Becker and Chambers 1984) already make a great deal of use of simple linear history lists and rather simple data structures.  And second, by adopting an object-oriented approach to the design of an interactive statistical system, a much richer analysis history can be recorded and, moreover, the analysis more easily conducted and managed.  The reader familiar with these areas might skip this section.
2.1 History lists
We begin by discussing the simplest history structure: a procedure which
automatically records every command the analyst issues in the session.
 A well-known example is the S diary function (Becker and Chambers 1984).
 When invoked, diary causes all S commands which follow it in the session
to be recorded in a system file.  Figure 1 gives an example of a short
session's diary.

1. BodyWts <- c(6654.0, 2547.0, ..., 0.005)
2. BrainWts <- c(5712.0, 4603.0, ..., 0.14)
3. plot(BodyWts, BrainWts)
4. MyReg <- reg(BodyWts, BrainWts)
5. Coefs <- MyReg$coef
6. abline(Coefs)
7. plot(BodyWts, MyReg$resid)
8. LnBodyWts <- log(BodyWts)
9. LnBrainWts <- log(BrainWts)
10. plot(LnBodyWts, LnBrainWts)
Figure 1. An S session diary 
Here, two data vectors are created and assigned names in steps 1 and 2.
 Step 3 produces a scatterplot. Steps 4 to 6 fit a straight line to the
data, save the coefficients, add the fitted line to the previous plot,
and plot the residuals versus the independent variable.  Steps 8 to 10
transform the data and plot the results.
The S diary records the commands in the order they were issued and nothing
more.  It does not, for example, record the output of the commands.  While
the history produced by the diary command can be consulted, edited at
a later date, and rerun, the diary itself is not interactive.  That is,
it is unavailable to the analyst from within the analysis session it is
recording.  (Only temporary escapes from the analysis system, using !
in S, permit examination of the diary file during the analysis session.)
Becker and Chambers (1986) recently implemented a new version of the S
diary called an audit file.  In addition to recording the commands, the
creation date is recorded for every data item when it is either assigned
or read.  In Figure 1, BodyWts' creation date ``27-Jan-88 11:01:03'' in
step 1 would be stored in the audit file as part of the information at
step 1.  Similarly, in step 3 the creation date is taken from the data
structure BodyWts and recorded on the audit file.  These two dates should
be the same.  In this way, the integrity of the analysis can be monitored.
Moreover, as the audit file grows, it is continually processed by an audit
program to produce an audit structure that can then be queried interactively
by the user.  One such request asks for all commands that used BodyWts, yielding a script of steps 1, 3, 4, 7, and 8.  Queries that can be addressed
by the audit structure involve when and/or where data structures (BodyWts,
MyReg, ...) were created and referenced.  In a multiprocess (and preferably
multi-window) environment, the analysis session and the audit program
could be running simultaneously as two separate processes.  As the analysis
continued, the audit file would be updated and the audit program would
process the addition.  The analyst could then switch from the analysis
session to the audit process whenever she wanted to query the analysis
history.
This audit approach is more active than the diary approach.  With the
right computing environment, it could be more interactive still.  Parts
of the analysis could be taken from the audit, edited, and run in the
analysis session, effectively making the audit process an active tool
in the ongoing analysis.
More interactive history mechanisms have been proposed recently in the
statistical literature (e.g. Thisted 1986), but have long been
in use in the computer science community.  The earliest would appear
to be the Programmer's Assistant (Teitelman, 1972).  (A more familiar,
but less powerful, interactive history list is that available in the Unix
operating system (e.g. using the c-shell, csh, in BSD Unix).)  Written
in Interlisp (any other Lisp dialect would serve equally well), the programmer's
assistant provides interactive assistance to the programmer by allowing
immediate interactive access to the session history from within the session.
Like S, Interlisp is interactive - commands are interpreted and executed
as soon as they are complete.  Figure 2 shows the script of an Interlisp
session which performs the same analysis as S does in Figure 1.  Each
command is a list of tokens within parentheses - the first element is
the function, or command, name and the remaining elements are its arguments.
 Each list is called a form and, as Figure 2 illustrates, forms can be
nested with innermost forms evaluated first.  Assignment is achieved by
the SETQ function, so that statement 1 assigns the result of the C function
to BodyWts.   The @ function in step 5 plays the same role as $ does in
S - it selects a named component (coef) from a data structure (MyReg).

1.(SETQ BodyWts (C 6654.0 2547.0 ... 0.005))
2.(SETQ BrainWts (C 5712.0 4603.0 ... 0.14))
3.(SETQ MyPlot (PLOT BodyWts BrainWts))
4.(SETQ MyReg (REG BodyWts BrainWts))
5.(SETQ Coefs (@ MyReg coef))
6.(ADDLINE MyPlot Coefs)
7.(SETQ Plot2 (PLOT BodyWts (@ MyReg resid)))
8.(SETQ LnBodyWts (LOG BodyWts))
9.(SETQ LnBrainWts (LOG BrainWts))
10.(SETQ Plot3 (PLOT LnBodyWts LnBrainWts))
Figure 2. Script of LISP commands 
All LISP functions return a value.  Even the PLOT function of line 3 returns
a value that is then assigned to MyPlot.  The value returned might be
a number, an array, a tree structure, or even another function.  In LISP,
there is essentially no restriction.
The programmer's assistant records, on a history list, each command entered
and its value.  The programmer's assistant will also respond to its own
set of commands which allow user interaction with the history list.  A
programmer's assistant command is invoked like any other, by typing it
directly to the system executive.  Effectively, the programmer's assistant
acts as an intermediary, albeit usually invisible, between the user and
the LISP executive that evaluates LISP expressions.  The programmer's
assistant deals with its own commands directly and calls the LISP executive
only as necessary.
Some simple examples of programmer's assistant commands are the following.
 REDO 5 will cause statement 5 to be executed again (becoming statement
11 in the history).  UNDO 5 will undo the effect of statement 5 (detaching
the value assigned to Coefs).  Typing `USE LnBodyWts FOR BodyWts AND LnBrainWts
FOR BrainWts IN 4 TO 7' will repeat statements 4 to 7  (becoming new statements
11 to 15) after substituting LnBodyWts for BodyWts and LnBrainWts for
BrainWts.  And, FIX 5 will display a copy of statement 5 to be edited
and evaluated.
The set of programmer's assistant commands is large and extendable.  Moreover,
the programmer's assistant can itself be accessed from other programs.
 For a complete description of its power, see Teitelman (1972, 1977),
and Xerox (1985).
The programmer's assistant is the least passive of the history mechanisms
described here.  It achieves its high level of interaction by being an
integral part of the analysis environment.  As its name suggests, its
aim is to aid the programming task.  While this implies that it also aids
the statistical data analyst, it also necessarily restricts its scope
as an analyst's assistant.  A programmer's assistant must not make use
of information that may be peculiar to statistical data analysis.
We now turn to the statistical data structures that are produced in the
course of an analysis.  The next subsection shows how richer data structures
called objects can be easily introduced into a statistical analysis software
environment.  Relationships between these objects will be shown to provide
analysis information that can be used to assist the analyst in managing,
and hence conducting, an analysis.
2.2 Data structures and object-oriented programming
The analysis history is more than a history of the commands used and their
arguments.  It is also a history of the data structures that were created
and explored.
In the sample S session (Figure 1), a number of different data structures
were generated.  The initial vector structures BodyWts and BrainWts were
created and from these the more complex data structure MyReg was formed.
 MyReg is a hierarchical data structure that contains several named component
data structures, including coef and resid, which can be extracted using
the $ function (see Becker and Chambers, 1984).
Similarly, every form in the LISP session (Figure 2) returns some structure
as its value.  The C function returns a vector structure.  The REG function
returns a regression structure with components coef and resid (and possibly
others).  Statement 3 shows that PLOT returns a plot structure.  (Similar
structures exist in S but cannot be assigned to tokens like MyPlot in
Figure 2.)  The ADDLINE function in statement 6 takes the plot MyPlot
as one of its arguments.  This is very convenient in environments where
many different plots can be displayed and interacted with at once (e.g.
see Stuetzle 1987).  The analyst may want to refer to previous plots at
a later time in the analysis. 
In both S and LISP, each function is only applicable to certain kinds
of data structures.  The log function can be applied to BodyWts but not
to MyReg.  Other functions are more generic, applying to more than one
type of data structure.  For example, abline in S will accept two numbers,
or a vector of two numbers, or any hierarchical structure that contains
a coef component (e.g. MyReg).  Because it is useful to have different
kinds of data structures for different kinds of statistical results (vectors,
regression results, plots, etc.) and because many functions are designed
to operate on certain structures and not others, it would be convenient
if these data structures and the functions designed to operate on them
were more closely associated.
A style of programming called object-oriented programming is designed
to provide this convenience.  Functions can be directly attached to a
data structure type.  Functions and data are bound together in a single
structure called an object.
An object can be thought of as a hierarchical structure with named components
called its Instance Variables, or IVs.  Unlike an S hierarchical structure,
however, there are also methods associated with each object.  These methods
are the functions commonly applied to that object.  A PLOT object, for
example, would have an AddLine method which, when invoked, would take
the necessary slope and intercept information from the user (analyst or
program).  A method is invoked on a given object according to the following
syntax:

(<- Object Method Argument1 Argument2 ... ArgumentN)
Figure 3. Message passing syntax 
This syntax is sometimes called ``message passing''.  The idea is that
each method on Object can be invoked by passing Object the name of the
method (the message) together with arguments for that method.  The arrow
symbol, <-, is read as  ``send'' so that the whole statement can be read
as ``send Object the message Method with arguments Argument1 to ArgumentN''.
 The flavour here is that the data structure (Object) is the active agent.
 It receives the message Method and invokes the corresponding method function.
Since many objects will share basic structure, any common structure is
recorded in special objects called classes.  Individual objects are then
taken as instances of a class.  For example, BodyWts, BrainWts, LnBodyWts
and LnBrainWts would all be instances of a class called Vector.
Each class specifies the instance variables, their default values, and
the methods that every instance of that class must have.  Classes (e.g.
Vector) are the templates used to construct other objects (BodyWts, BrainWts,
etc.).  New instances of a class are created by sending the class object
the New message.  Once created, these instances will persist in the virtual
memory of the machine.  Many different instances of the same class can
be created.  These will have the same methods and instance variables,
but the values of their instance variables may be different.  (See Stefik
and Bobrow, 1985, for a more general treatment of object-oriented programming.)
For example, the general structure of a regression result might be specified
by a class called REG.  In REG it is specified that all instances will
have IVs coef and resid.  Further, if every instance of REG must be able
to respond to the message PrintTStats, then the function which achieves
this is defined as a method in the class REG.
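For concreteness, such a class might be sketched as follows in Common Lisp (CLOS); the slot and function names here are illustrative assumptions rather than the definitions used in our system.

;; A sketch of the REG class described above.  PRINT-T-STATS stands in
;; for the PrintTStats message; all names are illustrative.
(defclass reg ()
  ((x     :initarg :x     :accessor reg-x)     ; RequiredIV: regressor
   (y     :initarg :y     :accessor reg-y)     ; RequiredIV: response
   (coef  :initform nil   :accessor reg-coef)  ; attached after fitting
   (resid :initform nil   :accessor reg-resid)))

(defgeneric print-t-stats (object)
  (:documentation "Every instance of REG must respond to this message."))

(defmethod print-t-stats ((r reg))
  (format t "Coefficients: ~a~%" (reg-coef r)))

;; Creating an instance corresponds to sending the class the New message:
;; (make-instance 'reg :x BodyWts :y BrainWts)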
Using an object-oriented approach, our analysis session would proceed
as in Figure 4.

1.(<- (C 6654.0 2547.0 ... 0.005) SetName BodyWts)
2.(<- (C 5712.0 4603.0 ... 0.14) SetName BrainWts)
3.(<- (<- PLOT New BodyWts BrainWts) SetName MyPlot)
4.(<- (<- REG New BodyWts BrainWts) SetName MyReg)
5.(<- (@ MyReg coef) SetName Coefs)
6.(<- MyPlot AddLine Coefs)
7.(<- PLOT New BodyWts (@ MyReg resid))
8.(<- (<- BodyWts LOG) SetName LnBodyWts)
9.(<- (<- BrainWts LOG) SetName LnBrainWts)
10.(<- (<- PLOT New LnBodyWts LnBrainWts) SetName Plot3)
Figure 4. Message passing history 
On the surface, this session would seem to be more complicated than the
original S session.  However each line here creates and manipulates objects
which, as later sections  will show, permit a highly interactive graphical
presentation of the analysis session to be relatively easily built (Section
4).
In detail, the above session can be described as follows.  The C function
returns an object that is an instance of the class FloatVector (a vector
of floating point numbers).  This instance is sent the SetName message
(a message understood by all instances) with the argument BodyWts.  Thereafter
the instance can be referred to by its name, BodyWts.  Statement 3 creates
an instance of PLOT, with BodyWts and BrainWts as the values of the x
and y coordinates, and then names the object MyPlot.  Later, in statement
6, MyPlot is sent the message AddLine with the argument Coefs.  The method
AddLine, stored on the class PLOT, draws a line on MyPlot with slope and
intercept taken from Coefs (as in the S example, different kinds of arguments
could be passed to the AddLine method).  Note that (@ MyReg coef) could
have been used in place of Coefs in statement 6 - either way the same
unique object is accessed.  Unlike the hierarchical data structures in
S, two different objects can share a great deal of structure. 
The approach here emphasizes the software structures that are manipulated
and created at each step, not the command lines.  The structures are the
active agents.  Each one is unique, persistent in memory, and can be shared
by many structures.  Moreover, there are natural connections between them.
 In the next section, these connections are explored and formalized to
become the basis for a highly interactive graphical interface to an analysis.
3.0 Connected statistical objects
An object-oriented approach emphasizes the history of a statistical analysis
session as the history of statistical objects that are created and manipulated
throughout the analysis.  The time sequence of commands is no longer paramount.
 Rather, different relationships between individual objects can be considered.
 Our objective is to assist, or at least to cooperate with, the analyst
in managing and conducting the analysis.  To meet this objective the logical
flow of the analysis must be recorded.  The question is how can these
statistical objects be related, one to another, so as to best capture
this information?
One important relationship is a consequence of an object's definition.
 Every object is directly connected to any other object that is the value
of one of its instance variables.  This means that MyReg is connected
to the vector object that is the value of its coef IV.  Since this vector
object is unique, naming it Coefs in statement 5 does not alter its relationship
with MyReg.  Moreover, we may distinguish those IVs (like X and Y of a
REG object) whose values (BodyWts and BrainWts) are required to create
the object (MyReg), from those whose values are attached after the object
has been created (e.g. the value of coef in MyReg).  We call the first
kind of IV a RequiredIV.  Should we wish to reproduce an analysis, the
RequiredIVs would identify the necessary inputs.
Causal relationships also exist between objects which are not required
components of one or the other and hence are not captured by RequiredIVs.
 For example, consider statements 8 and 9 of Figure 4.  LnBodyWts is created
as a direct result of the LOG message being sent to BodyWts.  The relationship
is causal and should be captured in the history.
 These relations can be recorded in a generic way by attaching two new
instance variables to all statistical objects - one called CausalLinks
to record on the antecedent object the identity of the consequent object,
and one called BackCausalLinks to record the identity of the antecedent
object on the consequent object.  The value of an object's CausalLinks
IV is a list of those statistical objects which have been created as a
direct consequence of a message received by that object. The CausalLinks
IV of BodyWts, for example, is a list containing the single element LnBodyWts.
 If, at some later time in the analysis, BodyWts is sent the SQRT (square-root)
message yielding SqrtBodyWts, say, then BodyWts'  CausalLinks IV will
be a list of two items - LnBodyWts and SqrtBodyWts.  BackCausalLinks are
entirely analogous.  This makes it possible to begin at any statistical
object and, from it, trace the causal sequence of events forward, or backward.
The establishment and updating of these links is made automatic by a simple
modification to the send function, <-.  We chose to define a new function
<-~, read ``send and link'', to be used in place of <- everywhere in Figure
4.  With <-~, whenever the result of the method is another statistical
object, a link is established from the object receiving the message to
the object returned.  (Note.  Some messages, like SetName, do not `return'
any object and consequently make no links.)
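A minimal Common Lisp sketch of this send-and-link behaviour follows; the base class, accessor, and function names are assumptions made for illustration, not the definitions used in our system.

;; A sketch of the behaviour of <-~.  STATISTICAL-OBJECT and
;; SEND-AND-LINK are assumed names.
(defclass statistical-object ()
  ((causal-links      :initform nil :accessor causal-links)
   (back-causal-links :initform nil :accessor back-causal-links)))

(defun send-and-link (receiver message &rest arguments)
  "Apply MESSAGE (a function) to RECEIVER and ARGUMENTS; when the
result is another statistical object, record the causal link in both
directions."
  (let ((result (apply message receiver arguments)))
    (when (typep result 'statistical-object)
      (push result   (causal-links receiver))
      (push receiver (back-causal-links result)))
    result))

;; e.g. (send-and-link BodyWts #'log-transform), where LOG-TRANSFORM is
;; an assumed method returning a new vector object, would both return
;; LnBodyWts and record the causal link BodyWts -> LnBodyWts.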
Consider now the information remaining in a temporal record of events
after recording the component relationships and causal links as above.
Some of it is of no value at all - BodyWts and BrainWts are temporally
independent events in the sense that the logic of the analysis is independent
of whether BodyWts or BrainWts was created first.  Worse, some important
information is not at all apparent from a temporal record.  In our sample
analysis the analyst may have decided to take logarithms of the data,
in statements 8 and 9, on the basis of what was seen in the scatterplot
of statement 3.  Such information, while not deducible from the sequence
of events, could be easily recorded by linking the three objects.  However,
CausalLinks seem inappropriate - the logged data were not produced as
a direct consequence of any action taken on the scatterplot.  These weaker
relationships are captured on an IV called AnalysisLinks that is attached
to every statistical object.
Analogous to causal links, each statistical object has both an AnalysisLinks
IV and a BackAnalysisLinks IV.  These links represent the logical flow
of the analysis, as perceived by the analyst.  The idea is that an analysis
link should be established from one object to another, if the analyst
feels that the analysis was directed from consideration of the first object
to consideration of the second.  This is almost always the case when the
second object was created as a result of a message passed to the first.
  As a convenience, then, AnalysisLinks are also constructed automatically
whenever CausalLinks are constructed, again by modifying the <-~ function.
 The difference between the two types of links is that the analyst can
make and break AnalysisLinks at will (each statistical object responds
appropriately to the messages MakeAnalysisLink and BreakAnalysisLink).
 Thus, in the example session AnalysisLinks would automatically be established
from BodyWts and BrainWts to LnBodyWts and LnBrainWts, respectively (as
would CausalLinks).  However, to indicate the logical flow of the analysis
from the scatterplot, MyPlot, to LnBodyWts and LnBrainWts, it would be
necessary to send MyPlot the MakeAnalysisLink message with LnBodyWts and
LnBrainWts as arguments.
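Continuing the sketch above, the analyst-controlled messages might be implemented as follows, assuming the STATISTICAL-OBJECT class also carries ANALYSIS-LINKS and BACK-ANALYSIS-LINKS slots with the obvious accessors; again, the names are illustrative.

;; A sketch of MakeAnalysisLink and BreakAnalysisLink, assuming
;; ANALYSIS-LINKS and BACK-ANALYSIS-LINKS accessors on the
;; STATISTICAL-OBJECT class of the previous sketch.
(defmethod make-analysis-link ((from statistical-object)
                               (to   statistical-object))
  (pushnew to   (analysis-links from))
  (pushnew from (back-analysis-links to)))

(defmethod break-analysis-link ((from statistical-object)
                                (to   statistical-object))
  (setf (analysis-links from)    (remove to   (analysis-links from))
        (back-analysis-links to) (remove from (back-analysis-links to))))

;; e.g. (make-analysis-link MyPlot LnBodyWts) records that inspecting
;; the scatterplot led the analyst to consider the logged data.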
We propose, then, that three distinct connections between statistical
objects be recorded on every object.  First, there are the RequiredIVs,
to distinguish those components of an object whose values are required
at creation time.  Second, there are the causal links (forward and backward)
to indicate which object caused another to be created.  And third, analysis
links are meant to reflect the logic of the analysis - as determined by
the analyst (with some help from the system).  The first two, together
with an object's creation date (also an IV), provide valuable auditing
information.   They are particularly important should we decide to reproduce
an analysis on different data.  The third, however, is of principal value
to the analyst - these links are the sole responsibility of the analyst and should
be made and broken with care.  Consequently, the tools developed in the
next section are primarily designed to aid the management of analysis
links.
4.0 AnalysisMaps
The previous section showed that a rich analysis history can be modelled
as a set of multiply connected statistical objects.  This has been accomplished
by moving the relevant information from a separate history mechanism to
the data structures themselves.  Under this model, analysis management
tools can be thought of as intermediaries that facilitate the exchange
of information between the analyst and the objects created.  One such
intermediary is a software structure which we call an AnalysisMap.
An AnalysisMap allows the analyst to view, and interact with, any set
of statistical objects and the AnalysisLinks between them.  AnalysisLinks
are used here because they are directly controlled by the analyst and
hence play a primary role in managing the analysis.  Intermediaries focussing
on other connections are discussed in section 5.
Following the AnalysisLinks as arcs, the statistical objects are the nodes
of a possibly disconnected directed graph.  The whole analysis is a collection
of such digraphs so that managing the analysis amounts to manipulating
and rearranging these digraphs to best show the analyst's logic.  These
networks are displayed in an AnalysisMap as in Figure 5 below.

Figure 5.  An AnalysisMap of the sample session

Every node is easily identified with an object from the session of Figure
4.  The node labelled BodyWts FloatVector (62), for instance, corresponds
to the object named BodyWts in that session.  The default labelling displays
the class name of that object preceded by its unique name (e.g. BodyWts)
if it has one.  In general, the node labels are designed to contain as
much information about the identity of the underlying object as seems reasonable.
 In particular, the node label of each PLOT is a miniature reproduction
of the plot that was drawn, augmented by the object's name and class.
Each object is responsible for the production of its node label.  This
not only makes it easy to have a label tailored to each object, but it
also simplifies the design of AnalysisMap.  For example, rather than maintain
a record of the label generating functions for every object class in the
definition of AnalysisMap, each object can be queried directly for its
label.  The arcs connecting the nodes are similarly determined - the AnalysisMap
asks each object for its AnalysisLinks.  This style of delegating responsibility
is adopted wherever possible and results in a simple design for AnalysisMap
and similar software structures (see Section 5).
Note that the AnalysisMap does not show all the statistical objects that
were created in the analysis of Figure 4.  The residuals that were accessed
in line 7, for example, do not appear.  Since the residual vector is an
instance variable of MyReg it is directly available from the node MyReg.
 Unless explicitly added to the AnalysisMap (e.g. Coefs), instance variables
do not appear so that the amount of detail displayed is minimized.
The uniform way of accessing such detail is through the Zoom message which
is understood by all statistical objects.  For example, to display the
residual vector, and other IVs of MyReg, the Zoom message is sent to MyReg.
 This causes MyReg and its IV connections to be displayed as in  Figure
6.
Figure 6. The MicroscopicView of MyReg
Because it is the responsibility of the object to respond to the Zoom
message, its response can be tailored.  The default response for all classes
is to produce and display a software structure called a MicroscopicView
as in Figure 6.  Other objects will produce a more meaningful display
- a FloatVector will display the numbers it contains in an array, a PLOT
its original plot.  These responses are simply implemented by specializing
the Zoom method of the appropriate class of object (e.g. FloatVector and
PLOT).  If no specialization is done, the default Zoom method will be
inherited and will produce a MicroscopicView as before.  Consequently,
the AnalysisMap can be unaware that different responses exist - it simply
sends the appropriate object the Zoom message.
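The Zoom protocol just described might be sketched as follows in Common Lisp; MAKE-MICROSCOPIC-VIEW and DISPLAY-PLOT are assumed helper functions, and PLOT stands for the plot class of the earlier examples.

;; A sketch of the Zoom protocol: a default method inherited by all
;; statistical objects, specialized where a better display exists.
(defgeneric zoom (object)
  (:documentation "Show the detail behind OBJECT."))

(defmethod zoom ((object statistical-object))
  ;; default: display the object and its instance variables as a network
  (make-microscopic-view object))

(defmethod zoom ((object plot))
  ;; a PLOT responds by redrawing its original plot instead
  (display-plot object))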
4.1 Interactive graphical analysis management
As described so far, an AnalysisMap is still a relatively passive tool
for analysis management.  It could be implemented as a static display
on traditional hardware.  However, by using a modern workstation's multiple-windowed
bit-mapped display and a ``mouse'' pointing device, the AnalysisMap can
be made a highly interactive tool for analysis management.
First, take the AnalysisMap to be a ``mouse-sensitive'' window, meaning
simply that something can happen if the mouse button is pressed while
the mouse is within the window of the AnalysisMap.  For the sake of the
discussion, further assume that the mouse has three different active states
which correspond, say, to which of three different buttons on the mouse
is being pressed.  We'll distinguish these buttons as left, middle, and
right buttons.
In our implementation, a menu of some kind pops up at the mouse's location
when a button is down.  If the mouse is located over an object in the
map, then the menu will offer some actions that can be taken on that object.
 If the mouse is on the title bar of the AnalysisMap, then actions are
offered that are applicable to the displayed network as a whole.  The actions,
and hence contents of the menus, depend upon which of the three mouse
buttons is depressed.
4.1.1 Exploring the nodes of the graph
Pointing at an object in the AnalysisMap and clicking the left button
causes the following menu to pop up.
Figure 7. The object's left-button menu
The mouse is used to select a menu item by placing it over the item and
releasing the mouse button.  In the figure, `Zoom' is shown selected.
 Once an item is selected, the AnalysisMap sends a message corresponding
to that item to the object - selecting Zoom causes the AnalysisMap to
send the object the Zoom message.
No matter which object is selected, if the left-button is used the same
menu (Figure 7) is produced.  This has two implications.  First, the menu
contents can be stored as part of the AnalysisMap structure.  And second,
every statistical object must respond to these messages.  The contents
of this menu are therefore chosen to be rather generic actions.
The most generic actions are those which simply exchange information between
the analyst and the object.  For example, `Zoom'  sends the Zoom message
to get more detail, `Name this item' sends the SetName message to put
information on the object, and `Edit notes' lets the analyst access and
directly record notes on the selected statistical object (on a Notes IV
using the EditNotes message).  The last one is an important tool for recording
an analysis. 
4.1.2 Managing the network
A similar simple exchange of information is possible on the whole
AnalysisMap by pressing the left button with the mouse on the AnalysisMap's
title bar.  A menu is produced whose items, when selected, will either
produce information on how the AnalysisMap works, or, accept information
(e.g. name for the AnalysisMap or notes) to be recorded on the AnalysisMap.

Tools for managing the relationships between objects in the analysis are
found by pressing the middle mouse button while over the title bar.  This
produces a set of nested menus, organized by the type of actions they
allow.  The first level of these menus is shown in Figure 8.  Arrows indicate
that another menu, containing more specialized actions, will appear if
the mouse is moved to the right off the highlighted item.
Figure 8. AnalysisMap title bar middle-button menu.
These actions are best understood by regarding the AnalysisMap as a window,
or View, on a collection of linked statistical objects.  Then, it is expected
that this view could be widened, adding more statistical objects to the
AnalysisMap, or narrowed, removing objects from the map.  These actions
are the first two menu items.
Many links were made to arrive at the network in Figure 5.  Some were
also broken (e.g. the analysis links from BodyWts to LnBodyWts).  The
result was a reasonably clear display of the flow of the analyst's logic
(from top to bottom in Figure 5).  The fourth menu item in Figure 8 provides
access to those functions which allow the analyst to make and break analysis
links between the displayed objects.
Moving the mouse to the right along this item produces the menu system
of Figure 9.
Figure 9.  Analysis link operations
Selecting `Make a link', the analyst is prompted for two statistical objects
to be linked.  These can be indicated by pointing to them with the mouse.
This greatly facilitates the analysis management.  If the analyst were
required to type in the message passing expression to achieve the linking,
it is not likely that much organization would go on - it would slow down
the analysis too much.  The remaining items allow the AnalysisLinks to
be broken between two objects and an object called a memo to be inserted
between any two nodes.  Memos are objects that contain only notes.
4.1.3 Multiple analyses
An AnalysisMap is a view showing which objects are the current focus of
the analysis.  If so, then why not have many AnalysisMaps?  Each could
be identified with a part of the analysis that formed a separate focal
point within the overall analysis.  These would be separate subanalyses
within the larger analysis.  In turn, these subanalyses might themselves
contain yet finer subanalyses.  In our implementation, different AnalysisMaps
are different instances of the AnalysisMap class.  Each instance has an
IV called the DisplayList, whose value is a list of the statistical objects
it contains and displays.  AnalysisMaps appearing on this list indicate
subanalyses.
A statistical object might appear in the view of more than one AnalysisMap.
 For example, suppose many independent analyses are undertaken on the
same data.  When considering any one of these analyses, it would be useful
to have this data appear in the corresponding AnalysisMap.  This is simply
achieved by having a pointer to the data object (e.g. its name) appear
on the DisplayList of those AnalysisMap objects which contain the data
object.  The data are not copied into two different AnalysisMaps.  If
an object appears in more than one AnalysisMap, then it is equally accessible
from each AnalysisMap (Zooming etc.).  In a multi-window environment,
many different AnalysisMaps can be displayed simultaneously, allowing
the analyst to switch concentration from one to another as necessary.
Some analyses are best described as subanalyses - they address a subproblem
of the larger analysis.  In the larger problem context, the details are
only of interest aggregated as the result of the subanalysis.  To accommodate
such aggregation graphically, AnalysisMaps can also contain other AnalysisMaps.
Moving the mouse across `Narrow the View' produces the following menu
system.

Figure 10. Creating a subanalysis
Selecting `Create a View' produces a small icon representing an AnalysisMap
inside the current AnalysisMap.  The analyst is then prompted for the
statistical objects which are to be included in the new, nested, AnalysisMap.
 The part of the network corresponding to the selected objects is collapsed
and replaced by a single node represented by the AnalysisMap icon.  Figure
11 shows the result for our example analysis after two subanalyses were
created.
Figure 11.  Analyses within analyses
These nodes behave like any other in an AnalysisMap. The left button menu
is as before, allowing a subanalysis to be named, to have notes added
to it, and so on.  For example, the subanalyses in Figure 11 have been
given names, LOG DATA and PLAIN REG, which appear in each icon's title
bar.
From the top level AnalysisMap of Figure 11, the logic of the analysis
is straightforward and contains few details.  The analyst began with two
data vectors, plotted them, and then considered two different subanalyses
independently.  One was a straightforward regression of some kind, as
indicated by the name PLAIN REG, and the other was an analysis involving
the logged data.
The detail can be returned by exploding the nested AnalysisMap.  This
is done by selecting the menu item shown in the figure below.
Figure 12.  Operations on nested AnalysisMaps
As Figure 12 shows, the inverse operation should also be available.
Alternatively, the analyst can zoom in on the subanalysis by selecting
`Zoom' from the left-button menu (Figure 7).  Zooming in on the PLAIN
REG subanalysis opens up a window, as in Figure 13.
Figure 13.  The subanalysis PLAIN REG
This too is an AnalysisMap object.  It has the same behaviours as the
larger AnalysisMap that contains it.  The analyst can carry out the subanalysis
in it with none of the details appearing in the larger AnalysisMap.
It is as if the entire collection of connected statistical objects were
displayed as a Venn diagram with the set boundaries defined by the AnalysisMaps.
 Many AnalysisMaps can contain the same objects, and AnalysisMaps can
be nested within one another to an arbitrary depth.  The only difference
is that the contents of a nested AnalysisMap are not visible inside the
larger AnalysisMap.
This Venn diagram organization has implications for the behaviour of AnalysisMaps.
 Removing an object from one AnalysisMap (by narrowing its view) implies
that the object be added to the view of all AnalysisMaps directly enveloping
the first AnalysisMap.  Like the set boundaries in a Venn diagram, AnalysisMaps
are separate from the statistical objects they view.  As a consequence,
there is no reason for AnalysisMaps to have formal AnalysisLinks.  Instead,
arcs to and from nested AnalysisMaps are drawn if, and only if, a statistical
object in the larger AnalysisMap has an analysis link to at least one
object inside the nested AnalysisMap.  Finally, care must be taken in
the definition of the methods of the AnalysisMap class to prevent an instance
from containing itself through some chain of nested AnalysisMaps.
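One simple way to implement that check is sketched below, assuming an ANALYSIS-MAP class whose DISPLAY-LIST accessor returns the objects in its view; the class and function names are ours, chosen for illustration.

;; A sketch of the containment check that keeps an AnalysisMap from
;; containing itself through some chain of nested maps.
(defclass analysis-map ()
  ((display-list :initform nil :accessor display-list)))

(defun contains-map-p (outer target)
  "True when TARGET appears in OUTER's DisplayList, directly or through
any chain of nested AnalysisMaps."
  (some (lambda (item)
          (and (typep item 'analysis-map)
               (or (eq item target)
                   (contains-map-p item target))))
        (display-list outer)))

(defmethod add-to-view ((map analysis-map) (nested analysis-map))
  ;; refuse a nesting that would create a cycle of AnalysisMaps
  (when (or (eq map nested) (contains-map-p nested map))
    (error "Nesting this AnalysisMap would create a cycle."))
  (pushnew nested (display-list map)))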
4.2 Conducting the analysis
The AnalysisMap makes it easy to organize the analysis as it progresses.
 It is, however, somewhat inconvenient to switch between typing in the
commands and managing the analysis.  It would be preferable if the AnalysisMap
assisted in managing the analysis, not only after the analysis, but also
while the analysis was being conducted.  Minimally, the AnalysisMap should
be automatically updated whenever a command is executed.  But which AnalysisMap?

To uniquely identify the AnalysisMap, the LISP executive could be modified
to have a specified AnalysisMap updated with those objects created by
the commands.  This is a common approach taken with history lists (see
Section 2.1).  A simple command could redirect the output to different
AnalysisMaps as appropriate.
An alternative approach is to have a selected AnalysisMap invoke a LISP
executive as needed.  In our implementation, a menu item from the title
bar of any AnalysisMap allows this possibility.  Figure 14 shows the menu
item to be selected.
Figure 14.  Invoking LISP from an AnalysisMap
Selecting this item causes a small window to be attached to the top of
the AnalysisMap where LISP commands may be typed directly.  The object
returned by the command is added to the view of the AnalysisMap.  This
approach has the distinct advantage that the most interesting part of
the analysis is immediately before the analyst when the command is entered.
This encourages the analyst to take a more active role in managing the analysis.
Statistical objects can also be created and added to the AnalysisMap by
selecting `Add a new kind of Analysis node' from the same menu.  In the
AnalysisMap's LISP window, the analyst is then prompted for the name of
the class of object to be created.  Once created, the object prompts the
analyst for values to be assigned to its required variables (its RequiredIVs).
 Finally, the object is added to the view of the AnalysisMap.  
Thus a new statistical object is created with a minimum of typing on the
part of the analyst.  This is a simple consequence of using objects. 
It also results in a more interactive AnalysisMap.  By making the AnalysisMap
even more interactive, and relying heavily on the fact that objects are
being manipulated, the typing required to perform actions on existing
objects can also be substantially reduced.
Recall that every command that operates on a statistical data structure
(vectors, plots, regression structures, etc.) will operate on some structures
and not on others.  In Section 2.2, these commands were implemented either
as methods of the objects on which they operate (like LOG on FloatVectors),
or as separate objects themselves (e.g. REG).  For the latter, the above
procedure to add a new object is sufficient.  For the former, the methods
are easily accessible from that object.  In particular, an AnalysisMap
can make them available by simply querying the selected object.
In our implementation, selecting an object in the view of an AnalysisMap
with the middle button depressed will pop up a series of menus that outline
the methods available for that object in an ordered fashion.  The topmost
level has three categories, as shown in Figure 15.
Figure 15. Top level middle-button menu for any selected object.
The first of these leads to the methods that are designed specifically
for that class of object.  Within this category, the methods are further
organized to simplify location of the desired method.  For example, the
coefficient estimates and t-statistics from MyReg would be printed nicely
by selecting the menu item as shown in Figure 16.  
Figure 16.  Selecting a class-specific method of MyReg
This would cause the AnalysisMap to send MyReg a message called PrintEstimatesAndTStats.
 Similarly, the fitted line could be plotted, various residual plots produced,
and so on.  If an object is returned as the value from sending the message,
then it is automatically added to the view of the current AnalysisMap.
The second top level category leads to those methods which are less frequently used, where frequency of use is determined by us, the designers.  These methods also have a second level of organization.
The third category, `Extraction,' has nothing to do with methods.  Instead,
it provides a simple means to extract the values of instance variables.
 Figure 17 shows the menu system for MyReg.
Figure 17.  Extracting IVs of MyReg
As can be seen, the IV names are presented as menu items.  Selecting one
will return the value of that IV for MyReg and add it to the AnalysisMap.
 An analysis link will also be formed from MyReg to the extracted value.
In our example session, this was how the vector Coefs came to appear
in the AnalysisMap.
Thus there are four possible ways to continue an analysis.  First, new
analysis objects can be created and added to an AnalysisMap by selecting
`Add a new Analysis node' in the title bar of the current AnalysisMap.
 Second, once a few objects exist, the analysis can continue by directly
interacting with these objects in the display.  If the methods are well
designed, this can make the analysis progress at a rapid pace.  For example,
in Figure 13 residual plots could be produced by pointing directly at
the MyReg node and selecting the appropriate method.  Third, for commands
that involve multiple operations, the LISP executive can be invoked above
the AnalysisMap and the command typed in.  Fourth, whenever it is desired to execute many commands before adding the result to the AnalysisMap, the LISP executive can be used directly (in a multi-window environment, one window can be dedicated to the LISP executive).
The net result of these four possibilities, together with the analysis
management tools discussed earlier, is that the analysis itself becomes
something tangible.  It grows in many directions and at many levels of
detail.  Oldford and Peters (1988) describe a sample analysis in some
detail.  Because the analyst nearly always interacts directly with the
AnalysisMap to carry out the analysis, and because the analysis can be
easily and speedily carried out, the analyst is more likely to make the
effort necessary to properly manage the analysis.
4.3 Responsibilities of the AnalysisMap
The AnalysisMap sounds like a very complex object indeed.  However, it
is greatly simplified by the fact that each statistical data structure
is an object which can be made to share much of the responsibility.
If an object is selected with the left-button depressed, then the AnalysisMap
produces the menu of Figure 7.  When an item is selected, the AnalysisMap
sends the object the corresponding message (Zoom, EditNotes, SetName,
etc.) and its responsibility ends. 
If the middle-button is depressed, then the AnalysisMap asks the selected
object for the menu to produce.  Once an item is selected, a message is
sent to the object.  The AnalysisMap observes the value returned by the
object in response to the message and, if it is another object, the AnalysisMap
adds it to its view.  Links are automatically established by having the
AnalysisMap send all messages using the <-~ function.
Even to compute the displayed network, much is handed off to the individual
objects.  The AnalysisMap has a DisplayList of objects in its view.  Each
of these is asked for a label to describe itself (by sending the object
the DescriptiveLabel message).  To determine where the arcs should be
drawn each object is also asked for the value of its AnalysisLinks IV.
 An arc must be drawn between an object and every object in its AnalysisLinks
which is also on the DisplayList of the AnalysisMap.
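This delegation might be sketched as follows, continuing the Common Lisp sketches of earlier sections: DESCRIPTIVE-LABEL and ANALYSIS-LINKS follow the messages named in the text, while DRAW-NODE and DRAW-ARC stand in for the window system and are assumed.

;; A sketch of how an AnalysisMap could compute its display from its
;; DisplayList.  DRAW-NODE and DRAW-ARC are assumed display helpers.
(defmethod redraw ((map analysis-map))
  (let ((visible (display-list map)))
    ;; each object is responsible for producing its own node label
    (dolist (object visible)
      (draw-node map object (descriptive-label object)))
    ;; an arc is drawn only when both of its ends are in the view
    (dolist (object visible)
      (dolist (linked (analysis-links object))
        (when (member linked visible)
          (draw-arc map object linked))))))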
The AnalysisMap is responsible for arranging the nodes and arcs in a relatively
pleasing way, for monitoring where the mouse is and what position its
buttons are in, and, of course, for responding to all menus produced when
its title bar is selected.
5.0 General network views
AnalysisMaps are quite general.  Although they view statistical objects,
they are not restricted to objects of statistical interest.  An AnalysisMap
views and provides an interactive interface to objects that are connected
by AnalysisLinks and which will respond to the left-button messages Zoom,
SetName, and so on.  The statistical and numerical operations are made
available by querying the individual objects.
Consider the non-statistical content of an AnalysisMap.  The links between
objects are directional and can be determined from the objects themselves.
 As a general network view, the AnalysisMap responds to mouse selection
in its title bar by producing a left-button or a middle-button menu as
appropriate.  Among the middle-button menu items there is the ability
to narrow the view, widen the view, and so on.  Similarly, a menu of items
is produced if one of the objects being viewed is selected with the left
mouse button.  Every object is expected to be able to respond to the messages
of those menu items.  Similarly, each object is expected to produce a
descriptive node label and a menu of items when it is selected with the
middle mouse button.
These responsibilities can be abstracted to define a general network view
that can be implemented as a class.  This class, which we call View, is
a general purpose tool for inspecting and altering a network of linked
objects.  An AnalysisMap is just a particular kind of View.  The class
AnalysisMap is said to be a specialization of the class View.
While all behaviours, like widening the view, creating a sub-view, and
so on, are defined for View, the links to be followed between objects
are not.  The links are specified only to the extent that forward and
backward links will exist.  What kind of link is a forward (or backward)
link will depend upon the kind of network being viewed.  The forward links
of an AnalysisMap are to be found on the AnalysisLinks instance variable
of each object and the backward links on the object's BackAnalysisLinks
IV.
This relationship between View and AnalysisMap is implemented in object-oriented
programming by declaring View to be a parent, or super, class of AnalysisMap.
 All behaviours and instance variables of View are automatically behaviours
and instance variables of AnalysisMap.  The class AnalysisMap is said
to inherit these from its parent class, View.  (This idea of inheritance
is used extensively in the definition of previous statistical objects
as well so that methods like MakeAnalysisLink and Zoom can be centrally
located and shared by all statistical objects.)  As a child of View, AnalysisMap
also has methods and IVs that are special to itself  (e.g. an AnalysisMap
determines its forward links by accessing the AnalysisLinks of each object).
A class can have many different children (and many different parents),
each one specialized in different ways.  In particular, View is specialized
to allow other connections between statistical objects to be viewed (as
discussed in Section 3).  The children of View are shown in Figure 18.

Figure 18. Inheritance from View
Like AnalysisMap, each child is a View specialized to follow certain links.
 A CausalMap traces the CausalLinks as the forward links which connect
objects.  A DataFlowMap goes to each object's RequiredIVs to trace the
backward links between objects - forward links are then defined in terms
of the backward links.  A MicroscopicView was seen in Figure 6 of Section
4.  MicroscopicViews follow an object's IVs as forward links.  Unlike
an AnalysisMap, none of these three Views allow the analyst to make or
break links between objects in their view.  Clearly, critical information
could be lost if, for example, CausalLinks were broken.
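This specialization can be sketched as follows; here the standalone ANALYSIS-MAP of the earlier sketches is re-expressed as a child of VIEW, and the class and accessor names remain illustrative assumptions.

;; A sketch of View and two of its children.  Each child says which
;; instance variables of the viewed objects supply its forward links.
(defclass view ()
  ((display-list :initform nil :accessor display-list)))

(defclass analysis-map (view) ())   ; inherits DisplayList and behaviour
(defclass causal-map   (view) ())

(defgeneric forward-links (view object)
  (:documentation "The links VIEW follows out of OBJECT."))

(defmethod forward-links ((view analysis-map) object)
  (analysis-links object))

(defmethod forward-links ((view causal-map) object)
  (causal-links object))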
A MicroscopicView also differs in that it cannot contain other MicroscopicViews.
 Nor can it contain more than one object and its IV values.  A MicroscopicView
is meant to be the bottommost view of any given object.  Of course the
value of any IV can also be zoomed in on (using the left-button menu)
to get a MicroscopicView of it independent of the original object.
The ToolBox is a View quite different from the others.  It views the available
classes of statistical objects and, consequently, is a network organization
of the statistical data structures that are available to the analyst (e.g.
commands like REG or PLOT).  A ToolBox can contain other ToolBoxes and
links between the objects it views can be manipulated.  In this way, the
analyst can personalize the organization  (e.g. grouping tools (classes)
into categories of analysis).  This is a great advantage over a simple
alphabetic listing of possible commands (as in S).
The many different kinds of Views discussed here afford the analyst the
means to interactively investigate a variety of statistically meaningful
connections between different statistical analysis objects.  Should other
connections become interesting to record and investigate in the future,
a tool to monitor and perhaps manage them will be easily produced by specializing
View. 
6.0 Discussion
A statistical analysis is a rich structure having more connections between
intermediate results and commands than the time order of events alone
would indicate.  Many of these connections will depend upon the actual
problem being analysed and are therefore best left to the analyst to make
(i.e. AnalysisLinks).  This does not, however, preclude having software
assist the analyst in making and managing these links.  The design of
the statistical objects discussed above and the Views which display their
interconnections provide the analyst with a new graphical way to carry
out a statistical analysis.
Moreover, the network model presented here does not require that statistical
computations be organized at any higher level than present interactive
statistical systems.  It simply requires a shift to the richer structures
of object-oriented programming.  Data structures like vectors and arrays
are easily given object-oriented representations.  Statistical commands
either become objects themselves (e.g. S's reg and plot become the object
classes REG and PLOT), or they become methods of the data structures on
which they operate (e.g. LOG).  The former is likely the best choice for
commands which produce a complex data structure, while the latter would
probably be chosen for operators which either returned no data structure
or returned one that is the same as its operand.  (There may be considerable
value in implementing some commands both ways.  For example, the object
REG could have a method called PlotResidualsVsFit which, when invoked,
would instantiate a PLOT object with appropriate arguments.  The plotting
is implemented both as an individual class (PLOT) and as a method of another
class (REG).  The key result is that, when the REG object is selected
in an AnalysisMap with the middle-button, a menu will offer the possibility
of plotting the residuals in this way.  The analyst has a method available
where she is likely to use it.)
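Such a dual implementation could be sketched as follows, assuming the REG class of the earlier sketches also stores its fitted values and that the PLOT class accepts x and y initialization arguments; all of these names are assumptions made for illustration.

;; A sketch of implementing a plotting operation both as a class (PLOT)
;; and as a method of REG.  REG-FITTED, REG-RESID and the PLOT initargs
;; are assumed.
(defmethod plot-residuals-vs-fit ((r reg))
  ;; instantiate a PLOT object with appropriate arguments; the result
  ;; then appears in the AnalysisMap like any other statistical object
  (make-instance 'plot :x (reg-fitted r) :y (reg-resid r)))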
Given this kind of behaviour, one is encouraged to consider different
and possibly higher level organizations of statistical computations and
data.  In Oldford and Peters (1986, 1988), we suggest how the nodes in
an AnalysisMap might also be taken to represent steps in a statistical
analysis.  Instead of the REG object representing the results from fitting
a regression, it could represent the decision to model a vector of responses
as a function of other vectors.  It would respond to the message DoLeastSquaresFit
by producing a new object called a LeastSquaresFit which would contain
the results of the fit.  (Of course, this new object would be linked via
analysis and causal links to the REG object and would appear in the AnalysisMap.)
 REG could have other messages like DoRobustFit and PlotData, representing
different choices at that step in the analysis.  These would make life
easier on the experienced analyst and, if coupled with good on-line documentation,
would provide a less experienced analyst a modicum of guidance in unfamiliar
territory (see Oldford and Peters, 1988, for further discussion).
Finally, by using a tangible model of a statistical analysis, patterns
should become more apparent.  These may have strategic import, either
to record and study, perhaps to repeat, or to avoid altogether.  Before
they can be affected, they must be recognized.
7.0  References
Becker, R.A. and J.M. Chambers (1986). ``Auditing of Data Analyses'', SIAM Journal on Scientific and Statistical Computing, (to appear).
McDonald, J.A. and J. Pedersen (1986). ``Computing Environments for Data Analysis Part 3: Programming Environments'', SIAM Journal on Scientific and Statistical Computing, (to appear).
Oldford, R.W. and S.C. Peters (1986). ``Data Analysis Networks in DINDE'', Proc. of the ASA: Stat. Comp. Section, pp. 19-24.
Oldford, R.W. and S.C. Peters (1988). ``DINDE: Towards more sophisticated software environments for Statistics'', SIAM Journal on Scientific and Statistical Computing, 9, pp. 191-211.
Stefik, M. and D.G. Bobrow (1985). ``Object-Oriented Programming: Themes and Variations'', The AI Magazine, 5, pp. 40-62.
Stuetzle, W. (1987). ``Plot Windows'', Journal of the American Statistical Association, 82, pp. 466-475.
Teitelman, W. (1972). ``Automated programmering - The programmer's assistant'', AFIPS Conference Proceedings, 41, pp. 917-921.
Teitelman, W. (1977). ``A Display Oriented Programmer's Assistant'', Proceedings of the Fifth International Joint Conference on Artificial Intelligence, Cambridge, Massachusetts, pp. 917-921.
Xerox (1985). Interlisp-D Reference Manual, Volume II: Environment, Xerox Corporation.