# Data Mining Solutions

Question 1: Assume a base cuboid of 10 dimensions contains only three base cells: (1) (a1, b2, c3, d4, ..., d9, d10), (2) (a1, c2, b3, d4, ..., d9, d10), and (3) (b1, c2, b3, d4, ..., d9, d10), where a_i != b_i, b_i != c_i, etc. The measure of the cube is count.

1, How many nonempty cuboids will a full data cube contain?

Answer: $2^{10} = 1024$.

2, How many nonempty aggregate (i.e., non-base) cells will a full cube contain?

Answer: There will be $3 \cdot 2^{10} - 6 \cdot 2^{7} - 3 = 2301$ nonempty aggregate cells in the full cube. Each base cell generalizes to $2^{10}$ cells, but a cell shared by several base cells must be counted only once. The $2^{7}$ cells whose first three dimensions are (*, *, *) are shared by all three base cells, so each is counted twice too often. Another $4 \cdot 2^{7}$ cells, those whose first three dimensions are (a1, *, *), (*, c2, *), (*, *, b3) or (*, c2, b3), are shared by exactly two base cells, so each is counted once too often. Removing the 3 base cells as well, the final calculation is $3 \cdot 2^{10} - 2 \cdot 2^{7} - 1 \cdot 4 \cdot 2^{7} - 3$, which yields the result.

3, How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is "count >= 2"?

Answer: There are in total $5 \cdot 2^{7} = 640$ nonempty aggregate cells in the iceberg cube. To calculate the result: fix the first three dimensions as (*, *, *), (a1, *, *), (*, c2, *), (*, *, b3) or (*, c2, b3), and vary the remaining seven.

4, How many closed cells are in the full cube?

Answer: There are 6 closed cells in the full cube: the 3 base cells, each with count 1; (a1, *, *, d4, …, d10) and (*, c2, b3, d4, …, d10), each with count 2; and (*, *, *, d4, …, d10) with count 3.

Question 2: (Half-open question: make sure your algorithm and assumptions are correct; there is no need to be very specific.) Suppose a base cuboid has the following tuples:

| A | B | C | D | Count | Sales |
|---|---|---|---|---|---|
| a1 | b1 | c1 | d1 | 1 | 6 |
| a1 | b2 | c2 | d1 | 1 | 4 |
| a1 | b3 | c1 | d2 | 1 | 2 |
| a2 | b4 | c1 | d2 | 1 | 10 |
| a2 | b3 | c2 | d3 | 1 | 12 |
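The counts in Question 1 and the base cuboid of Question 2 are small enough to check by brute force. The following is a minimal sketch, not part of the original solution: the helper names `full_cube`, `covers` and `closed_cells` are made up for this illustration, and the five Sales values are assumed to belong to the five tuples in the order listed.

```python
from itertools import product


def full_cube(base):
    """Return {cell: (count, sales)} for every nonempty cell, base cells included.

    `base` maps each base cell (a tuple of values) to its (count, sales) pair.
    """
    cube = {}
    for cell, (cnt, sales) in base.items():
        # Every dimension either keeps its value or is generalized to "*".
        for stars in product((False, True), repeat=len(cell)):
            agg = tuple("*" if s else v for v, s in zip(cell, stars))
            c, t = cube.get(agg, (0, 0))
            cube[agg] = (c + cnt, t + sales)
    return cube


def covers(cell, base_cell):
    """True if `cell` is equal to, or a generalization of, `base_cell`."""
    return all(a == "*" or a == b for a, b in zip(cell, base_cell))


def closed_cells(base):
    """Cells that have no specialization with the same count."""
    closed = []
    for cell in full_cube(base):
        covered = [b for b in base if covers(cell, b)]
        # Fill in every value on which all covered base cells agree;
        # a closed cell is one that this step leaves unchanged.
        closure = tuple(v[0] if len(set(v)) == 1 else "*" for v in zip(*covered))
        if closure == cell:
            closed.append(cell)
    return closed


# Question 1: three base cells over 10 dimensions; only count matters here.
tail = tuple(f"d{i}" for i in range(4, 11))
q1 = {
    ("a1", "b2", "c3") + tail: (1, 0),
    ("a1", "c2", "b3") + tail: (1, 0),
    ("b1", "c2", "b3") + tail: (1, 0),
}
q1_cube = full_cube(q1)
q1_aggregate = len(q1_cube) - len(q1)
q1_iceberg = sum(1 for cnt, _ in q1_cube.values() if cnt >= 2)
q1_closed = closed_cells(q1)

# Question 2: the five base tuples (row order of Sales is an assumption).
q2 = {
    ("a1", "b1", "c1", "d1"): (1, 6),
    ("a1", "b2", "c2", "d1"): (1, 4),
    ("a1", "b3", "c1", "d2"): (1, 2),
    ("a2", "b4", "c1", "d2"): (1, 10),
    ("a2", "b3", "c2", "d3"): (1, 12),
}
q2_cube = full_cube(q2)

print(q1_aggregate, q1_iceberg, len(q1_closed))  # 2301 640 6
print(q2_cube[("*", "*", "*", "*")])             # (5, 34)
```

Enumerating all $2^{n}$ generalizations of every base cell is only practical for toy inputs like these, but it gives reference values against which the outputs of the three algorithms below can be compared.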

1, Show the representative steps to demonstrate how a complete data cube (with Count and SUM(Sales) as measures) is computed by the multiway array aggregation algorithm.

Answer (from fang2): Suppose dimensions A, B, C, D are organized into 2, 4, 2, 3 partitions respectively, so in total there are 2 ∗ 4 ∗ 2 ∗ 3 = 48 chunks. The cardinalities of dimensions A, B, C, D are 2, 4, 2, 3 respectively, i.e. A and C are the smallest, followed by D, and B is the largest. From the base cuboid given, we can compute the 3D-, 2D-, 1D- and apex cuboids as in the diagram. The chunk scan order is always first along the smallest dimension, then along the second smallest, then along the third smallest, and so on. For example, when computing the 3D-cuboids from the base cuboid, we scan chunks first along the A dimension, then C, D and B, in ascending order of dimension size. In other words, we aggregate first towards CDB, so only 1 chunk of CDB needs to be held in memory at any one time; then we aggregate towards ADB, so only 1 row of ADB needs to be held in memory at any one time; and so on. For the computation of the 2D-, 1D- and apex cuboids a similar approach is adopted, where the chunks are scanned first along the smallest dimensions. During the computation of a cuboid, both measures (count and sales) are aggregated.
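The scan order and the simultaneous aggregation described above can be illustrated with a small sketch. It is not the full multiway algorithm: it covers only the step from the base cuboid to the four 3-D cuboids, it stores chunks in a dictionary rather than a real chunked array, and it assumes one cell per chunk (cardinalities equal the partition counts) and that the Sales values follow the row order of the table. The names `scan_chunks`, `aggregate_3d` and `chunks_in_memory` are hypothetical.

```python
from itertools import product

DIMS = "ABCD"
PARTS = {"A": 2, "B": 4, "C": 2, "D": 3}  # partitions per dimension
SCAN = "ACDB"                             # fastest-varying dimension first

# Base cuboid as {(a, b, c, d) chunk index: (count, sales)}, one cell per chunk.
BASE = {
    (0, 0, 0, 0): (1, 6),    # a1 b1 c1 d1
    (0, 1, 1, 0): (1, 4),    # a1 b2 c2 d1
    (0, 2, 0, 1): (1, 2),    # a1 b3 c1 d2
    (1, 3, 0, 1): (1, 10),   # a2 b4 c1 d2
    (1, 2, 1, 2): (1, 12),   # a2 b3 c2 d3
}


def scan_chunks():
    """Yield chunk indices (in A, B, C, D order) following the scan order."""
    slow_first = SCAN[::-1]  # product() varies its last argument fastest
    for idx in product(*(range(PARTS[d]) for d in slow_first)):
        pos = dict(zip(slow_first, idx))
        yield tuple(pos[d] for d in DIMS)


def aggregate_3d():
    """One pass over the chunks, updating all four 3-D cuboids together."""
    cuboids = {d: {} for d in DIMS}  # keyed by the dimension aggregated away
    for chunk in scan_chunks():
        cnt, sales = BASE.get(chunk, (0, 0))
        if cnt == 0:
            continue  # empty chunk, nothing to add
        for i, dropped in enumerate(DIMS):
            key = chunk[:i] + chunk[i + 1:]
            c, s = cuboids[dropped].get(key, (0, 0))
            cuboids[dropped][key] = (c + cnt, s + sales)
    return cuboids


def chunks_in_memory(dropped):
    """Chunks of the target cuboid held at once when `dropped` is aggregated away."""
    held = 1
    for d in SCAN[:SCAN.index(dropped)]:
        held *= PARTS[d]
    return held


chunks = list(scan_chunks())
cuboids = aggregate_3d()
print(len(chunks), chunks[:3])
print({d: chunks_in_memory(d) for d in SCAN})
```

Under this scan order `chunks_in_memory` gives 1 chunk for CDB (A aggregated away), 2 for ADB (one row along A), 4 for ACB and 12 for ACD, which is why the largest dimension B is scanned last.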

2, Do the same using the BUC algorithm.

Answer (from duan9): First we order the dimensions in descending order of cardinality: BDAC. Then we have the aggregation order in tree form:
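The order in which BUC's recursion reaches the cuboids for the dimension order BDAC can be enumerated with a few lines of Python. This is only an illustrative sketch; the function name `buc_order` is hypothetical and the indentation in the printout stands for depth in the processing tree.

```python
def buc_order(dims, prefix="", depth=0):
    """Yield (depth, cuboid) in the order BUC's recursion visits the cuboids."""
    yield depth, prefix or "apex"
    for i, d in enumerate(dims):
        # Expand only with dimensions that come later in the ordering.
        yield from buc_order(dims[i + 1:], prefix + d, depth + 1)


order = list(buc_order("BDAC"))
for depth, cuboid in order:
    print("  " * depth + cuboid)
```

The visit order is apex, B, BD, BDA, BDAC, BDC, BA, BAC, BC, D, DA, DAC, DC, A, AC, C, i.e. all 16 cuboids, each reached exactly once.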

At the beginning of the recursion, we aggregate all the dimensions to get the apex cuboid using the two measures: count and sum of sales. Then we start partitioning the table according to the sequence BDAC as follows:
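A rough sketch of this aggregate-then-partition recursion is given here, under stated assumptions: the Sales values are taken to follow the row order of the base table, the function name `buc` and its `min_count` parameter are made up for illustration, and `min_count=1` corresponds to the full cube asked for in the question.

```python
ROWS = [
    # (A, B, C, D, count, sales); Sales row order is an assumption
    ("a1", "b1", "c1", "d1", 1, 6),
    ("a1", "b2", "c2", "d1", 1, 4),
    ("a1", "b3", "c1", "d2", 1, 2),
    ("a2", "b4", "c1", "d2", 1, 10),
    ("a2", "b3", "c2", "d3", 1, 12),
]
COL = {"A": 0, "B": 1, "C": 2, "D": 3}


def buc(rows, dims, cell=None, out=None, min_count=1):
    """Append (cell, count, sum_sales) for the current partition, then recurse."""
    cell = {} if cell is None else cell
    out = [] if out is None else out
    # Aggregate the whole partition first (apex cell on the first call).
    key = tuple(cell.get(d, "*") for d in "ABCD")
    out.append((key, sum(r[4] for r in rows), sum(r[5] for r in rows)))
    for i, d in enumerate(dims):
        # Partition the current rows on dimension d.
        parts = {}
        for r in rows:
            parts.setdefault(r[COL[d]], []).append(r)
        for value in sorted(parts):
            part = parts[value]
            # Partitions below the threshold are pruned with all descendants.
            if sum(r[4] for r in part) >= min_count:
                buc(part, dims[i + 1:], {**cell, d: value}, out, min_count)
    return out


cells = buc(ROWS, "BDAC")
for key, count, sales in cells[:5]:
    print(key, count, sales)
```

The first five cells produced are the apex cell followed by the b1 branch down to the base cell (a1, b1, c1, d1). Calling `buc(ROWS, "BDAC", min_count=2)` shows the pruning BUC is known for, although the question itself only asks for the full cube.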

Through this recursive aggregation and partitioning process we get the following cuboids: apex, B, BD, BDA, BDAC. Then we traverse back (as part of the recursion) and get BDC; traversing back further we get BA, BAC and so on.

3, Do the same using the Star-Cubing algorithm.

Answer (from duan9): First we order the dimensions as we did in BUC: BDAC. Based on this order, we have the following computation ordering:

Then we construct a star-tree for the base table. Since we are actually computing the full cube, there are no star nodes in the star-tree. Similarly, there will be no compressed table (or you can make your own assumption and build your own compressed table).
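To make the star-tree construction concrete, the following is a minimal sketch that builds the tree as nested dictionaries in the dimension order BDAC. It covers only tree construction, not the Star-Cubing traversal itself; the names `build_star_tree` and `labels` and the `min_count` parameter are hypothetical, and the Sales values are assumed to follow the row order of the base table.

```python
ROWS = [
    # (A, B, C, D, count, sales); Sales row order is an assumption
    ("a1", "b1", "c1", "d1", 1, 6),
    ("a1", "b2", "c2", "d1", 1, 4),
    ("a1", "b3", "c1", "d2", 1, 2),
    ("a2", "b4", "c1", "d2", 1, 10),
    ("a2", "b3", "c2", "d3", 1, 12),
]
COL = {"A": 0, "B": 1, "C": 2, "D": 3}


def new_node():
    return {"count": 0, "sales": 0, "children": {}}


def build_star_tree(rows, order="BDAC", min_count=1):
    """Build a prefix tree over `order`; infrequent values collapse into "*"."""
    # One-dimensional counts decide which values become star nodes.
    totals = {}
    for r in rows:
        for d in order:
            k = (d, r[COL[d]])
            totals[k] = totals.get(k, 0) + r[4]
    root = new_node()
    for r in rows:
        node = root
        node["count"] += r[4]
        node["sales"] += r[5]
        for d in order:
            value = r[COL[d]]
            if totals[(d, value)] < min_count:
                value = "*"
            node = node["children"].setdefault(value, new_node())
            node["count"] += r[4]
            node["sales"] += r[5]
    return root


def labels(node):
    """All node labels that occur anywhere below `node`."""
    found = set()
    for label, child in node["children"].items():
        found.add(label)
        found |= labels(child)
    return found


tree = build_star_tree(ROWS)
print(tree["count"], tree["sales"], sorted(tree["children"]))
print("*" in labels(tree))
```

With `min_count=1` every value passes the threshold, so no `"*"` label appears, matching the remark that the full cube has no star nodes. Raising the threshold, for example `build_star_tree(ROWS, min_count=2)`, would merge b1, b2 and b4 into a single star node under the root.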

Then we start...
