Paolo Giorgini, Stefano Rizzi, Maddalena Garzetti — A Goal-Oriented Approach to Requirement Analysis in Data Warehouses

Abstract

Several surveys indicate that a significant percentage of data warehouses fail to meet business objectives or are outright failures. One of the reasons for this is that requirement analysis is typically overlooked in real projects. In this paper we propose GRAnD, a goal-oriented approach to requirement analysis for data warehouses based on the Tropos methodology. Two different perspectives are integrated for requirement analysis: organizational modeling, centered on stakeholders, and decisional modeling, focused on decision makers. Our approach can be employed within both a demand-driven and a mixed supply/demand-driven design framework.

Introduction

Data warehouse (DW) projects are inherently risky. In confirmation of this, several surveys indicate that a significant percentage of data warehouse projects fail to meet business objectives or are outright failures. The reasons for a failure are often related to how the organization reacts to the project or to the low quality of data, but in several cases they may be traced back to how the design process was conducted. In particular, a common design-related reason for failure is that requirement analysis is typically overlooked in DW projects, mainly since [1]:

Warehousing projects are long-term ones, and it is very difficult to anticipate future requirements, so only a few requirements can be stated from the beginning.
Information requirements for DW applications are difficult to specify since decision processes are flexibly structured, poorly shared across large organizations, jealously guarded by managers, and unstable in time to keep pace with evolving business processes.
Requirements for decision making often refer to information that does not exist in the required form, and must be derived from data sources by integrating, transforming, and cleaning them.

Most methodologies for DW design claim there must be a phase dedicated to analyzing the business requirements (e.g., [2, 3, 4, 5]), still there is no consensus on what relevance and temporal priority should be assigned to it. Indeed, the approaches to DW design are usually classified in two categories [1]:

Supply-driven (also called data-driven) approaches design the DW starting from a detailed analysis of the data sources [6, 7, 8]. User requirements impact on design by allowing the designer to select which chunks of data are relevant for the decision making process and by determining their structuring according to the multidimensional model.
Demand-driven (or requirement-driven) approaches start from determining the information requirements of business users [9, 10]. The problem of mapping these requirements onto the available data sources is faced only a posteriori, by designing proper ETL routines.

While supply-driven approaches somehow simplify the design of ETL, since each data in the DW corresponds to one or more attributes of the sources, they give user requirements a secondary role in determining the information contents for analysis, and give the designer little support in identifying facts, dimensions, and measures. Conversely, demand-driven approaches bring requirements to the foreground, thus ensuring that the DW will be tightly tailored to the users’ needs, but require a larger effort when designing ETL.

Supply-driven approaches are feasible when all of the following are true: (1) detailed knowledge of data sources is available a priori or easily achievable; (2) the source schemata exhibit a good degree of normalization; (3) the complexity of source schemata is not too high. In practice, when the chosen architecture for the DW relies on a reconciled level (or operational data store) these requirements are largely satisfied: in fact, normalization and detailed knowledge are guaranteed by the source integration process. The same holds, thanks to a careful source recognition activity, in the frequent case when the source is a single relational database, well-designed and not very large. In a supply-driven approach, conceptual design is heavily rooted on source schemata and can be largely automated (e.g. see [7]). Our on-the-field experience shows that requirement analysis can then be carried out informally, based on simple requirement glossaries (such as in [11]) rather than on formal diagrams. On the other hand, such an informal approach is unsuitable for a demand-driven framework, that asks for a more structured and comprehensive technique.

In this paper we propose GRAnD (Goal-oriented Requirement Analysis for Data warehouses), a goal-oriented technique for requirement analysis in DWs based on the Tropos methodology [12]. GRAnD can be employed:

within a demand-driven framework, that is the only alternative whenever a deep analysis of data sources is unfeasible, or data sources reside on legacy systems whose inspection and normalization is not recommendable. In this case, conceptual design will be directly based on requirements.
within a mixed supply/demand-driven framework. In this case, requirement analysis and source inspection are carried out in parallel; conceptual design is still carried out in a semi-automated way, like in the supply-driven framework, but leaning on user requirements to reduce its complexity. The mixed framework is recommendable when source schemata are well-known but their size and complexity are substantial. In fact, the cost for a more careful and formal analysis of requirement is balanced by the quickening of conceptual design.

GRAnD adopts two different perspectives for requirement analysis: organizational modeling, centered on stakeholders, and decisional modeling, focused on decision makers. Decisional modeling is directly related to the information needs of decision makers; with reference to the terminology introduced in [1], it achieves to be analysis. On the other hand, organizational modeling is aimed at as is analysis; it has a primary role in enabling identification of facts and in supporting the supply-driven component of the approach. The diagrams produced, that relate enterprise goals to facts, dimensions, and measures, are then used during conceptual design: within a demand-driven design framework, the requirements are translated into a conceptual schema to be mapped on data sources a posteriori; within a mixed framework, while the data sources are still explored to shape hierarchies, user requirements play a fundamental role in restricting the area of interest for analysis and in determining facts, dimensions, and measures.

The paper is structured as follows. In Section 2 we summarize the most relevant literature related to requirement analysis in DW design. Section 3 illustrates the technique we propose for requirement analysis by discussing organizational and decisional modeling. Section 4 shows how our technique results in conceptual design within both demand- driven and mixed design frameworks. Section 5 introduces the prototype supporting our approach. Finally, Section 6 draws the conclusions.

Related literature

In the field of DWdesign, it is necessary to distinguish between supply- and demand-driven approaches. The prototypical supply-driven approach dates back to 1992, when Inmon claimed that the development of DWs is data-driven, as opposed to the requirement-driven development of operational systems [13]. Other supply-driven approaches were proposed in [6], [7], and [8], where conceptual design of the DW is rooted in the schema of operational sources and is carried out starting, respectively, from the identification of measures, from the selection of facts, and from a classification of the operational entities. Also the comprehensive design method described in [4] leans on a conceptual model; a mixed approach to conceptual design is recommended, but no further details are given.

In demand-driven approaches, collecting user requirements is given more relevance. The approach described in [14] shares some similarities with ours, since it is based on the i-framework. However, requirements are directly used to build a conceptual model in a fully demand-driven perspective, without any feedback from the data sources. Besides, organizational modeling is not supported. In [1], a wish-list for DW design methodologies is proposed, and a multi-stage technique for requirement analysis is outlined. Here, two different phases are interlaced: as is analysis, aimed at analyzing and describing the actual information supply, and to be analysis, aimed at analyzing the information demand and matching it with the supply. In [9], a goal-oriented approach based on the goal-decision-information model is proposed. Though this approach shares some similarities with ours, it mainly focuses on requirement analysis and does not show how to move from requirements to design. A process-oriented approach is presented in [10], where three different perspectives at increasing levels of detail, each associated to a specific requirement template, are used. Though the authors recommend to iteratively and incrementally gather requirements with use cases, a few details are given and no examples are provided, so a comparison is very hard. A comprehensive framework for requirement management, based on the Unified Process, is proposed in [5]; though there is a strong emphasis on documenting the project through document templates, nothing is said about how to model requirements.

In [15], a goal-oriented method to support the identification and design of DWs is presented. This approach can be regarded, like ours, as mixed demand/supply driven. The main difference is that organizational modeling is not supported and that requirement analysis starts from the goals of decision makers. Goals are analyzed separately using abstractions sheets, and general considerations about how they relate to the organization activities are given in natural language. Conversely, in our approach an explicit goal model of the organization is given and the analysis of decision makers’ goals is directly related to such a model. Moreover, in our goal analysis, goals are decomposed in subgoals and specific relationships between goals are specified. Another important difference with our approach, is that we support early requirements analysis [16, 17] that allows for modeling and analyzing processes that involve multiple participants (both humans and software systems) and the intentions that these processes are supposed to fulfill. By so doing, one can relate the functional and non-functional requirements of the system-to-be to relevant stakeholders and their intentions.

An interesting case-based comparison of supply- and demand-driven approaches can be found in [18]. Remarkably, it is concluded that data-oriented and goal-oriented techniques are complementary, and may be used in parallel to achieve optimal design.

Finally, it is worth to mention that a few CASE tools for DW design have been implemented, either from software vendors or as research protoypes. In ADAPT [19] and in GOLD [20] the conceptual schema for the DW is directly drawn by the designer, thus a demand-driven approach is implied — though no active support for requirement analysis is given. Conversely, in WAND [21] the conceptual schema is semi-automatically derived from the source schemata, thus implementing de facto a supply-driven approach.

Requirement Analysis with GRAnD

Tropos [12, 22] is an agent-oriented software development methodology, based on the i-conceptual framework [17], where the concepts of agent, goal, and related mentalistic no- tions are used to support all software development phases, from early requirement analysis to implementation. Tropos differs from other goal-oriented methodologies since it moves the notions of agent and goal to the early stages of software development. During early requirement analysis, the requirements engineer identifies the domain stakeholders and models them as social actors, who depend on one another for goals to be fulfilled, tasks to be performed, and resources to be furnished. Through these dependencies, one can answer why questions, besides what and how, regarding system functionality. Answers to why questions ultimately link system functionality to stakeholder needs, preferences, and objectives.

The Tropos methodology has been successfully applied in different domains. In the following we summarize the part of the Tropos notation that can be used in the DW context:

Actors. An actor represents an enterprise stakeholder. More precisely, it can model a physical or software agent (e.g., Mr. Brown), a role, meant as an abstract characterization of the behavior one or more agents take in a specific context (e.g., sale analyst), or a position, i.e. a set of roles generally played by a single agent (e.g., marketing manager). Graphically, actors are represented by circles.
Dependencies. A dependency represents an “agreement” between two actors, one depending on the other to respect the agreement. The agreement can be a goal to be fulfilled, a task to be performed, or a resource to be delivered. In our context, the main interest is on goals, that are represented as ovals.
Actor diagram. It is a graph of actors related by dependencies, used to model how actors depend on each other.
Rationale diagram. It is used to represent the logical foundations that rule the relationships between actors. It appears as a balloon within which goals of a specific actor are analyzed and dependencies with other actors are established. Goals are decomposed into subgoals, with either AND (all subgoals must be achieved) or OR (any of the subgoals must be achieved) semantics, possibly specifying the positive/negative contributions of subgoals to goals. The intuitive meaning of a positive (negative) contribution is that the satisfaction of a goal encourages (discourages) the satisfaction of another goal. Notations + and ++ (– and – –) specify the different strength of the contribution

When analyzing user requirements for DWs, two perspectives should be taken into account. Firstly, it is important to model and analyze the organizational setting in which the DW will operate (organizational modeling); this includes designing the actor diagram as well as the rationale diagrams for each stakeholder. Secondly, in order to capture the functional and non-functional requirements of the DW, we need to design rationale diagrams for the decision makers, who are the main actors in the decisional process (decisional modeling).

In the following subsections these two perspectives are described in detail, together with the analysis phases they encompass, with reference to real case study, the BI-BANK project, developed at the University of Trento in collaboration with DeltaDator S.p.a.. BI- BANK is a project for developing a Banking Business Intelligence System able to support the decisional process with a set of basic banking analyses. For simplicity, in this paper we only focus on analysis of banking transaction.

As concerns notation, using Tropos in the DW context requires some new concepts to be introduced:

Facts. In organizational modeling, a fact models a set of events that happen when a goal is achieved. In decisional modeling, a fact is more properly meant as a possible focus of analysis related to an analysis goal. Graphically, facts are represented as rectangles connected to a goal.
Attributes. They are fields whose value is provided when a fact is recorded to fulfill a goal. They are denoted as small diamonds connected to goals.
Dimensions. A dimension is a fact property that describes a possible coordinate of analysis, i.e. a possible perspective for looking at the fact to fulfill an analysis goal. Dimension are represented as small circles connected to goals.
Measures. A measure is a numerical property of a fact that describes a quantitative aspect that is relevant for decision making. Graphically, measures are represented as small squares connected to goals.

The graphical notation is summarized in Table 1.

Table 1: Notation for actor and rationale diagrams

3.1 Organization Modeling

Organizational modeling consists of three different phases: (i) goal analysis, in which actor and rationale diagrams are produced; (ii) fact analysis, in which rationale diagrams are extended with facts; and (iii) attribute analysis, in which rationale diagrams are further extended with attributes. Each phase is a different iterative process taking in input the diagrams produced by the previous one.

3.1.1 Goal Analysis

The first step for goal analysis is to represent the relevant stakeholders for the organization and their dependencies by means of an actor diagram, in which actors can represent agents, roles, or positions within the organization. This starts with a very high-level analysis aimed at expressing the responsibilities and relationships involving the different components (e.g., departments) of an organization. Then, the analysis proceeds in more detail by decomposing the high-level actors in sub-actors (e.g., the divisions of a department) and ending up with the identification of the agent(s) who is responsible for each single activity. The analysis is conducted by interviewing the stakeholders and producing a documentation organized in a number of templates. Three different type of templates are provided:

main actor, where the main actors of the analyzed environment are identified together with their strategic objectives and goals. A table of the form (Actor,Objectives) is used to collect the information.
sub-actor, where each main actor is decomposed into a number of sub-actors (including roles, positions, and agents). For each main actor, a table of the form (Sub-actor,Type,Goals) is used to collect the information.
dependencies, where all the dependencies among actors are identified and analyzed. A table of the form (Depender, Dependee, Goal) is used to collect the information.

The second step consists in analyzing goals of each actor in more detail to produce a rationale diagram for each actor. Goals are AND-decomposed and contribution links between goals are discovered. See for instance [23, 24] for details on how goal analysis can be carried out. Goal analysis ends when all the relevant goals of each actor have been analyzed and all the dependencies between actors are established. As for the first step, also here the information are collected and organized using a predefined template. For each actor, we mainly use a table of the form (Goal, Sub-goal, InContrib, OutContrib), where InContrib is the list of goals that contribute to the satisfaction of the goal and OutContrib is the list of goals that receive a contribution from the satisfaction of the goal.

Note that analyzing the dependencies is important since it allows new goals to be discovered for each actor, and in general it explains the reasons behind some of the goals thus leading to a comprehensive view and model of the organization. The analysis, of course, is limited to some areas of the organization, thus excluding some aspects that are not considered to be relevant for the design of the DW. It is responsibility of the analyst to decide what has to be included and what not; however, we emphasize that it is necessary to include in the model the external actors (e.g., clients of the bank) who do not belong to the organization but still play a crucial role in its activities.

Figure 1: An actor diagram for the BI-BANK case study

3.1 Organization Modeling

Example 1Figure 1 shows a partial actor diagram for the BI-BANK case study. The Client depends on the Bank for achieving the goal use Bank service, and on the ATM for the goal use ATM. Moreover, the Bank depends on the ATM actor for the goal supply ATM. Figure 2 presents a part of the rationale diagram for the Bank actor focusing on the goal of managing transactions. The goal manage a/c transactions is decomposed into manage debit transactions and manage credit transactions, and in turn manage debit transactions is decomposed into manage permanent payments and manage occasional payments. New dependencies may be discovered at this point, for example the Client depends on the Bank to manage transaction. 2

3.1.2 Fact Analysis

The objective of fact analysis is to identify all the relevant facts for the organization. The analyst navigates the rationale diagram of each actor and extends it by associating goals with facts that model the set of events to be recorded when goals are achieved. Here the information are collected using two different type of templates. The first one is a simple table that describes each single fact (Fact, Description), and the second a table in which each goal is associated with a number of facts (Goal, Facts). Fact analysis is carried out according to a top-down approach, in fact the analyst starts by identifying relevant facts from the top goals of each actor and then moves down towards the leaf goals. Usually, facts associated to leaf goals are not considered.

The structure of the rationale diagrams induces some relevant relationships between facts. In particular, when a goal is decomposed in subgoals, all the facts related to the subgoals become subfacts of the fact associated to the goal. As we will see in Section 4.1.3, these induced relationships are useful for defining the final conceptual schema for the DW.

Figure 2: Rationale diagram for the Bank actor from the organizational perspective

3.1 Organization Modeling

3.1.3 Attribute Analysis

Attribute analysis is aimed at identifying all the attributes that are given a value when facts are recorded. Starting from the extended rationale diagrams produced in the previous phase, the analyst explores all the subgraphs to associate goals with the attributes they use. Note that, in this phase, the attributes are identified without specifying their possible role as dimensions or measures; from the organizational point of view, attributes are simply data associated to goals. To collect information for attribute analysis, the analyst uses a table of the form (Attribute, Goal, Fact).

Example 2 Figure 3 shows an extended rationale diagram for the Bank actor, still focusing on goal manage transactions. First, the fact transaction is associated to the main goal manage transactions, the fact debit transaction to the goal manage debit transactions, and so on. Then, by analyzing the subgraph of the goal manage ATM withdrawals that fact ATM withdrawal is associated with, we introduce attributes currency, date, card number, amount, etc.

Figure 3: Extended rationale diagram for the Bank from the organizational perspective

Decisional Modeling

After organizational modeling, the methodology proposes a second type of analysis focused on the goals of decision makers, i.e., the actors that play the most relevant role in the decisional process. Here the analysis is substantially different from the organizational analysis since the main objective is not to model the organization as it is, but rather to model how the DW can support the decisional process of the organization. Basically, we focus on the requirements of the DW from the perspective of the decision makers. The organizational model is extremely important in this phase since it actively supports the analyst during the identification of the facts to be associated with the decision makers’ goals.

Firstly, all the decision makers are identified; then, for each of them, four steps are carried out: (i) goal analysis, that produces rationale diagrams; (ii) fact analysis, that extends them with facts; (iii) dimension analysis, that further extends them with dimensions; and (iv) measure analysis, that further extends them with measures.

3.2.1 Goal Analysis

As for organizational modeling, goal analysis starts by analyzing the actor diagram for the decision makers. Decision makers are identified and initial dependencies between them are established. The goals associated to each decision maker are then decomposed and analyzed in detail, to produce a set of rationale diagrams. Goals may be completely different from those analyzed during organizational modeling, indeed they are part of the decision process and might be not included in the operative process of the organization. The same templates used for organizational modeling are used here to collect and organize information about goal analysis.

Example 3 Figure 4 shows a rationale diagram for decision maker Financial Promoter, focusing on the goal of analyzing transactions. The goal analyze transactions is OR- decomposed into analyze debit transactions and analyze ATM withdrawals, which in turn are further decomposed. So, for instance, the goal analyze debit transactions is OR-decomposed into analyze total amount and analyze number of transactions.

Figure 4: Rationale diagram for the Financial Promoter decision maker from the decisional perspective

3.2.2 Fact Analysis

Like for organizational modeling, rationale diagrams are extended by identifying facts and associating them to the goals of decision makers. Facts are possible objects of analysis, and correspond to business events that dynamically happen within the enterprise. Facts are normally imported from the extended rationale diagrams produced during organizational modeling. Indeed, very often the goals of decision makers are related to the information produced in the operational process, so the facts associated to the organization activities are fundamental for fulfilling the decision makers’ goals. In some cases, the analyst can also introduce some new facts by directly analyzing the decision maker rationale diagrams.

Dimension Analysis

In this phase, each fact is related to the dimensions that decision makers consider necessary in order to satisfy their decisional goals. Dimensions are identified by analyzing the leaf goals of the rationale diagram of the decision makers and the relevant facts associated to the upper level goals. Here, analysis requires a strong interaction with the decision makers in order to capture the possible perspective of analysis. To support analysis we propose a template based on two different tables, the first capturing the relationship between goals, facts, and dimensions (Goal, Fact, Dimensions), the second describing dimensions (Dimension, Description).

Measure Analysis

Finally, the analyst associates a set of measures to each fact previously identified. Even here, analysis is carried out by analyzing the leaf goals and the facts associated to the upper-level goals, and requires a strong interaction with the decision makers.

Example 4 In Figure 5 the analyst associates fact transaction, identified during organizational modeling (see Figure 3), to the goal analyze transactions. Then, dimensions are connected to the goals associated to the fact; for instance, dimensions account number and month are associated to goal analyze total amount. Finally, two measures are identified for goal analyze total amount: total amount and average amount.

Figure 5: Extended rationale diagram for the Financial Promoter decision maker from the decisional perspective

From Requirement Analysis to Conceptual Design

The organizational model produced by requirement analysis represents the main data on which the enterprise operation is based, thus it comprises the most relevant attributes that are part of the source database. On the other hand, the decisional model describes the decision makers’ needs, thus summarizing the role played, in glossary-based requirement analysis, by the glossaries of facts, dimensions, and measures and by the preliminary workload. In this section we explain how these diagram are used for conceptual design within, respectively, a mixed and a demand-driven framework.

4.1 Mixed Design Framework

The mixed framework joins the facilities of supply-driven approaches with the guarantees of demand-driven ones. In fact, the requirements derived during organizational and decisional modeling are matched with the schema of the source database to generate the conceptual schema for the DW. Three phases are involved: (i) requirement mapping, where facts, dimensions, and measures identified during decisional modeling are mapped onto entities in the source schema; (ii) hierarchy construction, where a basic conceptual schema is generated by navigating the source schema; and (iii) refinement, where the basic conceptual schema is edited to fully meet the user’ expectations.

4.1.1 Requirement Mapping

During this phase the facts, dimensions, and measures included in the extended rationale diagrams produced by decisional modeling are mapped, where possible, onto the source schema. More precisely:

1. Assuming that sources are modeled by a relational schema, the facts of decisional modeling are mapped onto relations.
2. As to dimensions and measures, mapping is achieved by using the attributes represented during organizational modeling as a bridge. In fact, each such attribute is mapped to a physical attribute in the source schema on the one hand, and to a dimension or a measure in the decisional model on the other.
3. As to the attributes in the organizational model that were not mapped to dimensions or measures in the decisional model, they are mapped nonetheless on the source schema; as we will see in Subsection 4.1.2, this may be useful to provide the designer with an additional choice of dimensions and measures for the fact.

Example 5 In our case study, as shown in Figure 6, fact ATM withdrawal is mapped to some WITHDRAWALS table in the source schema. Then, attribute card number associated to goal record withdrawal card number in the organizational model is mapped to dimension card number associated to the analysis goals analyze withdrawals amount and analyze number of withdrawals in the decisional model; the same attribute card number might for instance correspond, on the source schema, to an attribute cardNumber within the WITHDRAWALS table. Similarly, attribute amount of goal record withdrawal amount corresponds to measure total amount of the analysis goal analyze withdrawals amount and to an attribute amount on the WITHDRAWALS table. An example of an attribute in the organizational model that cannot be mapped to the decisional model but should nevertheless be mapped to the source schema is destination a/c, associated to goal record transfer dest. a/c.

Figure 6: Mapping from the decisional model (left) to the source shcema (right) passing through the organizational model (center)

Interestingly, if the names in the extended rationale diagrams are chosen by the analyst consistently with those in the source schema, this phase can be partially automated. In particular, if the source schema was actually obtained by normalizing and integrating different sources — which very often is the case, especially when complex cleaning and transformation procedures are necessary to improve data quality — its name space is largely under the designer’s control. Otherwise a Thesaurus must be built, as suggested in [15], in order to establish the mappings between the organizational model and the source schemata.

4.1.2 Hierarchy Construction

This phase implements the supply-driven part of GRAnD. For each fact identified during decisional modeling and successfully mapped onto a relation F in the source schema, the many-to-one associations expressed in the source schema by foreign keys are iteratively navigated, starting from F, to build the attribute hierarchies and create a basic conceptual schema, e.g. in the form of a fact schema. Fact schemata are the conceptual artifacts provided by the Dimensional Fact Model, proposed in [7] as a support for conceptual design of DWs. Note that any other conceptual model for multidimensional databases could be equivalently adopted.

This phase can be largely automated, as discussed in [7] where an algorithm is proposed that create attribute hierarchies from the source schema. The underlying idea is that, given a relation, a functional dependency — i.e., a many-to-one association — holds between its primary key and each of its attributes, including foreign keys. Since a foreign key maps on the primary key of another relation, that in turn functionally determines each of its attributes, a tree of attributes related by functional dependencies — in the terminology of multidimensional modeling, a hierarchy — can easily be built by recursively navigating the source schema. In the following, considering that functional dependencies are transitive, we will say that a many-to-one path exists from a relation R to any attribute a in the schema if the primary key of R (transitively) functionally determines a.

Remarkably, while in the supply-driven approach described in [7] navigation is “blind”, meaning that all the attributes connected to the fact F by a many-to-one path are reached and included in hierarchies, here navigation is actively biased by the user requirements. In fact:

1. Every dimension d associated to a goal related to F and successfully mapped from the decisional model to the source (see Section 4.1.1) is included in the conceptual schema, and the full hierarchy rooted in d is generated by navigation. If the path from F to d is not many-to-one but many-to-many, d is modeled as a multiple dimension [25]; this means that multiple values of d can be related to a single instance of the fact F.
2. Every measure m associated to a goal related to F and successfully mapped from the decisional model to the source is included in the conceptual schema, provided that a many-to-one path exists from F to m, and no hierarchy is generated for it.
3. For each attribute a in the organizational model that could be mapped onto the source schema but not onto the decisional model, the designer has different choices:

3.1 Decide that a is not interesting for analysis and ignore it.
3.2 Decide that a may be interesting for analysis of some fact F, in which case it should be included in the conceptual schema for F. The designer also has to choose the primary role of a:
3.2.1 The role may be that of a dimension, in which case the full hierarchy rooted in a is generated by navigation and a is labeled as “supplied” in the conceptual schema. If the path from F to a is not many-to-one but many-to-many, a is modeled as a multiple dimension.
3.2.2 The role may be that of a hierarchy attribute. In this case, if a has not already been reached by navigation of the source schema starting from some dimension (see point 1.), the designer selects one of the attributes in the path from F to a as a new dimension d, and labels d as “supplied”.
3.2.3 The role may be that of a measure, provided that a many-to-one path exists from F to a. Even in this case, a is labeled as “supplied” in the conceptual schema.

4. Among the dimensions and measures associated to a goal related to F on the decisional model, those for which no mapping on the source schema could be found are nevertheless included in the conceptual schema for F and labeled as “demanded”.
5. All the attributes in the source schema that were not mapped and were not reached by navigation of many-to-one associations starting from a dimension are not included in the conceptual schema for F.

Remarkably, the basic conceptual schemata generated here may be considerably simpler and smaller than those generated in [7]. Besides, while in [7] the designer is asked for identifying facts, dimensions, and measures directly on the source schema, here such identification is driven by the diagrams developed during requirement analysis. We also note that the names used for measures in decisional diagrams can give the designer precious suggestions regarding which aggregation operators to use: for instance, from Figure 5 we may presume that measure amount is to be aggregated through both the sum and the average operators.

Example 6 The preliminary fact schemata obtained for facts ATM WITHDRAWAL and DEBIT TRANSACTION in the bank example are reported in Figure 7. Consistently with the Dimensional Fact Model, the fact is represented as a box containing the measures; dimensions are circles connected to the fact; hierarchies are represented as trees rooted in dimensions. For fact ATM WITHDRAWAL, dimensions currency date, card number, and a/c are derived from the decisional model, as well as measures number of withdrawals and amount. The hierarchy rooted in a/c has been built by navigating the source schema. Dimension internal/external has been added since, though the decisional model does not represent it explicitly, the two separate subgoals analyze internal ATM withdrawals and analyze external ATM withdrawals suggest that the user will ask for distinguishing between these two types of withdrawal. Finally, measure withdrawal cost is labeled as “demanded” since it appears as a measure associated to goal analyze external withdrawals but not as an available attribute on the organizational model. As to fact DEBIT TRANSACTION, we only remark that dimension destination a/c is labeled as “supplied” since it is present in the organizational model but has not been indicated by decision makers as a dimension in the decisional model.

Figure 7: Preliminary fact schema for facts ATM WITHDRAWALS and DEBIT TRANSACTION in a mixed framework

4.1.3 Refinement

This phase is aimed at rearranging the conceptual schemata in order to better meet the user’s needs. The basic operations that can be carried out to this purpose are:

1. Operations on facts
1.1 Fact merge. Merging facts is beneficial when the user is often interested in analyzing them together since they can be seen as particular cases of a more general fact. Two or more facts G1 . . .Gn can be merged into a single fact if they have a common supergoal F in the decisional model. The resulting fact takes the name of F and has the union of the measures, dimensions, and hierarchies of the Gi’s. A dimension that exists only within a subset of the Gi’s is marked as optional in F, meaning that it characterizes only a subset of the events modeled by fact F.
1.2 Fact split. A fact F may be split into n facts G1 . . .Gn when most of its dimensions and measures are related to specific subsets of events. Each resulting fact Gi includes only the dimensions and measures that better characterize it; each dimension is accompanied by the corresponding hierarchy.
2. Operations on hierarchies
2.1 Attribute prune. An attribute a may be dropped from a hierarchy while preserving the subtree rooted in a. This is useful when neither a nor its descendant attributes are relevant for analysis.
2.2 Attribute graft. An attribute a may be dropped from a hierarchy while preserving the subtree rooted in a; the children attributes of a become children of the parent attribute of a. This is useful when a is not interesting for analysis but its descendants are.
2.3 Association adding. A many-to-one association between attributes a and b may be added to a hierarchy, which results in changing to a the parent of attribute b in the tree representing the hierarchy [7]. Adding an association may be necessary in order to model some functional dependency that was not correctly captured in the source schema.
2.4 Optionality. Optional attributes, i.e., attributes that have a value only for a subset of instances of the hierarchy, should be detected and properly marked on the conceptual schema.
3. Operations on measures
3.1 Measure merge. Two or more measures that only differ for their aggregation operator may be unified, provided that the conceptual model adopted supports the representation of different aggregation operators for the same measure.

Note that, thanks to the labeling of dimensions carried out during hierarchy construction, the decision makers and the designer are enabled to distinguish, on conceptual schemata, what is needed and available (unlabeled dimensions/measures), what is needed but unavailable (dimensions/measures labeled as demanded), what is available but not needed (dimensions/measures labeled as supplied). While the second category may drive the designers in enriching the source database or in considering new data sources, the second may stimulate the decision makers to undertake new directions of analysis.

Example 7 We assume that (1) users are not interested in the client granularity, so attribute client is grafted; (2) the client address is not interesting, so address is pruned; (3) measure withdrawal cost can be computed during ETL; (4) dimension destination a/c is considered by users to be relevant for analysis. The two facts ATM WITHDRAWALS and DEBIT TRANSACTION are merged into fact TRANSACTION, that is the fact associated, in the decisional model, to the common supergoal of subgoals analyze debit transactions and analyze ATM withdrawals. In merging the facts, measures number of withdrawals and number of debit transactions converge in measure number of transactions. Measures total amount and average amount are merged into measure amount, to be aggregated by both the sum and the average operators. Dimension internal/external is incorporated into dimension type (meaning that, among the values for type, there also will be internal ATM withdrawal and external ATM withdrawal). Dimension card number is marked as optional since a card number is defined only for some types of transactions; the same holds for destination a/c. The final fact schema obtained is shown in Figure 8.

Figure 8: fact schema for fact TRANSACTION after refinement in a mixed framework

4.2 Demand-Driven Design Framework

Within a demand-driven framework, in the absence of a priori knowledge of the source schema, the building of hierarchies cannot be automated; the main assurance of a satisfactory result is the skill and experience of the designer, and her ability to fruitfully interact with the domain experts to capture the existing dependencies between attributes.

The starting point is a set of preliminary conceptual schemata obtained by associating each fact from the decisional model with the corresponding dimensions and measures. In the bank case, from the rationale diagram in Figure 5 we immediately derive the preliminary fact schemata in Figure 9.

Figure 9: Preliminary fact schemata for facts ATM WITHDRAWAL and DEBIT TRANSACTION in a demand-driven framework

The next step is related to the building of hierarchies: differently from the case of the mixed framework, where hierarchies are determined by navigating the source schema, here they must be defined manually by strictly interacting with business users. This is undoubtedly the most difficult phase of the demand-driven approach; we suggest to proceed for each fact F as follows:

1. Detect functional dependencies between dimensions of F and represent them in the form of hierarchies.
2. If F is also represented on the organizational model, select all the attributes associated to the goals related to F on the organizational model. For each attribute a that is not already present in the conceptual schema and is considered to be interesting for analysis of F, understand its primary role:
2.1 The role may be that of a dimension, in which case a is included in the conceptual schema and labeled as “supplied”.
2.2 The role may be that of a hierarchy attribute. In this case, the designer selects one of the existing dimensions, d, and properly connects a to the hierarchy rooted in d.
2.3 The role may be that of a measure, in which case a is included in the conceptual schema as a measure and labeled as “supplied”. 3. Repeat step 2. for all the other attributes in the organizational model.
4. Enrich hierarchies with other attributes possibly indicated by the user, labeled as “demanded”.

The issues to be faced afterwords are the same described for the refinement phase in the mixed framework.

Example 8 In fact ATM WITHDRAWAL, among all the attributes associated to goals manage ATM external withdrawals and manage ATM internal withdrawals on the organizational model, date is inserted as a supplied dimension. In fact DEBIT TRANSACTION, attributes currency date and card number are inserted as supplied dimensions. Then, during refinement, the functional dependency between date and month is detected, and the two facts are merged as seen in the mixed framework to produce the fact schema in Figure 10.

Figure 10: Fact schema for fact TRANSACTION after refinement in a demand-driven framework

This approach is, in our view, more diffcult to pursue than the previous one. Nevertheless, it is the only alternative when a detailed analysis of data sources cannot be made (for instance when the DW is fed from an ERP system, whose logical schema is huge and hardly understandable), or when the sources come from legacy systems whose complexity discourages recognition and normalization.

5 DW-Tool

GRAnD is supported by a prototypical CASE tool named DW-Tool 1, which has been developed in Java and supports the analyst and the designer in the following activities:

Collection of requirements. During the interviews with the stakeholders, the analyst organizes the information acquired by filling in the templates introduced in Section 3.
Dictionary management. The content of the templates is used by DW-Tool to generate a general dictionary from which the analyst can select and see the description of each single entity in the models. Figure 11 shows an example where the description of a fact is provided.
Organizational modeling. The content of the templates is also used to generate a basic organizational model, including actor and rationale diagrams, that the analyst can further rearrange to better meet the user’s expectation (see Figure 12).
Decisional modeling. After the basic rationale diagrams have been generated from the templates, the analyst can extend them with facts, dimensions, and measures, typically by importing entities from the diagrams produced during organizational modeling (Figure 13).
Conceptual design. If the designer decides to follow a mixed approach, the source schema is acquired via JDBC from the operational DBMS. As to requirement mapping, DW-Tool proposes a basic choice of mappings, derived where possible by matching names in the diagrams. Hierarchy construction is fully automated, and the basic refinement operations are supported. On the other hand, if a demand-driven approach is undertaken, the attributes related to the fact in the organizational model are imported, so that the designer can decide which role to give them within the conceptual model of the DW.

Figure 11: Accessing the dectionary in DW-Tool

Figure 12: Organization modeling in DW-Tool

Figure 13: Importing facts for decisional modeling

6 Conclusion

In this paper we have proposed a goal-oriented methodology for requirement analysis in DWs, which can be used within both a demand-driven and a mixed supply/demand-driven design framework. We believe that the adoption of our methodology can help the designer to reduce the risk of project failure by ensuring that early requirements are properly taken into account and formalized — which ensures a “good” design — and, at the same time, that the resulting DW schemata are tightly rooted to the operational database — which makes the design of ETL simpler.

The methodology was applied to the BI-BANK case study, a project developed in collaboration with a company based in Trento. The experience with the company was extremely useful for refining and validating our approach. We received a positive feedback about the methodology and, in particular, about the importance of deriving the requirements directly from the analysis of the stakeholders and decision makers goals. The case study also supported us in investigating the scalability of our approach. With regard to this, we verified that associating an actor diagram with several rational diagrams, one for each actor, has a crucial role in dealing with complex application domains. In fact, detailed requirement analysis is carried out on rationale diagrams, whose complexity depends on how articulated the decisional tasks of a single actor are (which is not directly related to how large the application domain is). On the other hand, the actor diagram — on which the size of the application domain, in terms of number of stakeholders, directly impacts — is drawn at a high level of abstraction, thus its complexity never becomes overwhelming.

References

1. R. Winter, B. Strauch, A method for demand-driven information requirements analysis in data warehousing projects, in: Proceedings HICSS, Hawaii, 2003, pp. 1359–1365.
2. M. Golfarelli, S. Rizzi, A methodological framework for data warehouse design, in: Proceedings DOLAP, 1998, pp. 3–9.
3. R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit, John Wiley & Sons, 1998.
4. S. Lujan-Mora, J. Trujillo, A comprehensive method for data warehouse design, in: Proceedings DMDW, Berlin, Germany, 2003.
5. F. R. S. Paim, J. B. Castro, DWARF: An approach for requirements definition and management of data warehouse systems, in: Proceedings International Conference on Requirements Engineering, Monterey Bay, CA, 2003.
6. B. Husemann, J. Lechtenborger, G. Vossen, Conceptual data warehouse design, in: Proceedings DMDW, Stockholm, Sweden, 2000, pp. 3–9.
7. M. Golfarelli, D. Maio, S. Rizzi, The dimensional fact model: A conceptual model for data warehouses, International Journal of Cooperative Information Systems 7 (2–3) (1998) 215–247.
8. D. Moody, M. Kortink, From enterprise models to dimensional models: A methodology for data warehouse and data mart design, in: Proceedings DMDW, Stockholm, Sweden, 2000.
9. N. Prakash, A. Gosain, Requirements driven data warehouse development, in: CAiSE Short Paper Proceedings, 2003.
10. R. Bruckner, B. List, J. Schiefer, Developing requirements for data warehouse systems with use cases, in: Proceedings Americas Conference on Information Systems, 2001, pp. 329–335.
11. J. Lechtenborger, Data warehouse schema design, Tech. Rep. 79, DISDBIS Akademische Verlagsgesellschaft Aka GmbH, (2001).
12. P. Bresciani, P. Giorgini, F. Giunchiglia, J. Mylopoulos, A. Perini, Tropos: An agent-oriented software development methodology, Journal of Autonomous Agents and Multi-Agent Systems 8 (3) (2004) 203–236.
13. W. H. Inmon, Building the Data Warehouse, QED Press/John Wiley, 1992.
14. J.-N. Mazon, J. Trujillo, M. Serrano, M. Piattini, Designing data warehouses: From business requirement analysis to multidimensional modeling, in: Proceedings International Workshop on Requirements Engineering for Business Need and IT Alignment, Paris, France, 2005.
15. A. Bonifati, F. Cattaneo, S. Ceri, A. Fuggetta, S. Paraboschi, Designing data marts for data warehouses, ACM Transactions on Software Engineering Methodologies 10 (4) (2001) 452–483.
16. A. Dardenne, A. van Lamsweerde, S. Fickas, Goal-directed requirements acquisition, Science of Computer Programming 20 (1–2) (1993) 3–50.
17. E. Yu, Modelling strategic relationships for process reengineering, Ph.D. thesis, University of Toronto, Department of Computer Science (1995).
18. B. List, R. Bruckner, K. Machaczek, J. Schiefer, A comparison of data warehouse development methodologies: Case study of the process warehouse, in: Proceedings DEXA, 2002, pp. 203–215.
19. D. Bulos, Designing OLAP with ADAPT, Tech. rep., Atos Origin (1999).
20. S. Lujan-Mora, J. Trujillo, I. Y. Song, The Gold Model Case Tool: An environment for designing OLAP applications, in: Proceedings ICEIS, 2002, pp. 699–707.
21. M. Golfarelli, S. Rizzi, E. Saltarelli, WAND: A CASE tool for workload-based design of a data mart, in: Proceedings SEBD, Portoferraio, Italy, 2002, pp. 422–426.
22. J. Castro, M. Kolp, J. Mylopoulos, Towards requirements-driven information systems engineering: The Tropos project, Information Systems 27 (6) (2002) 365–389.
23. P. Giorgini, E. Nicchiarelli, J. Mylopoulos, R.Sebastiani, Formal reasoning techniques for goal models, Journal of Data Semantics 1 (2003) 1–20.
24. R. Sebastiani, P. Giorgini, J. Mylopoulos, Simple and minimum-cost satisfiability for goal models, in: Proc. CAiSE’04, 2004, pp. 20–33.
25. S. Rizzi, Conceptual modeling solutions for the data warehouse, in: R. Wrembel, C. Koncilia (Eds.), Data Warehouses and OLAP: Concepts, Architectures and Solutions, Idea Group, 2006, To appear.

GRAnD: A Goal-Oriented Approach to Requirement Analysis in Data Warehouses

Abstract

Introduction

Related literature

Requirement Analysis with GRAnD

3.1 Organization Modeling

3.1.1 Goal Analysis

3.1 Organization Modeling

3.1.2 Fact Analysis

3.1 Organization Modeling

3.1.3 Attribute Analysis

Decisional Modeling

3.2.1 Goal Analysis

3.2.2 Fact Analysis

Dimension Analysis

Measure Analysis

From Requirement Analysis to Conceptual Design

4.1 Mixed Design Framework

4.1.1 Requirement Mapping

4.1.2 Hierarchy Construction

4.1.3 Refinement

4.2 Demand-Driven Design Framework

5 DW-Tool

6 Conclusion

References