Question:
1. [14] Data preprocessing.
(a) [8] Present the value range for each of the following
measures:
(1) Jaccard coefficient
Answer:
(2) covariance
Answer:
(3) F-measure
Answer:
(4) Kulczynski measure
Answer:
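As an aid for part (a), the four measures can be written out directly and probed on extreme inputs to see where their values can fall. This is a minimal sketch: the function names, the population form of covariance, and the empty-set convention for Jaccard are illustrative assumptions, not part of the question.

```python
from statistics import mean


def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|; returns 1.0 when both sets are empty (a convention chosen here)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def covariance(xs, ys) -> float:
    """Population covariance: average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def kulczynski(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)
```

Calling, for instance, `covariance([1, 2, 3], [3, 2, 1])` gives a negative value, which shows that covariance, unlike the other three measures, is not confined to non-negative numbers.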
(b) [6] Give three example distance measures for each of
the following two kinds:
(1) the distance between two objects
Answer:
(2) the distance between two clusters
Answer:
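To make the two kinds of distance in part (b) concrete, the sketch below shows three common object-level distances and three common cluster-level (linkage) distances. These are only some of the acceptable choices. The function names are illustrative, and points are assumed to be equal-length numeric tuples.

```python
import math
from itertools import product


# --- distances between two objects (points) ---
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Supremum distance: largest difference along any single dimension."""
    return max(abs(a - b) for a, b in zip(p, q))


# --- distances between two clusters (lists of points) ---
def _pairwise(c1, c2, dist):
    return [dist(p, q) for p, q in product(c1, c2)]


def single_link(c1, c2, dist=euclidean):
    """Minimum distance over all cross-cluster pairs."""
    return min(_pairwise(c1, c2, dist))


def complete_link(c1, c2, dist=euclidean):
    """Maximum distance over all cross-cluster pairs."""
    return max(_pairwise(c1, c2, dist))


def average_link(c1, c2, dist=euclidean):
    """Mean distance over all cross-cluster pairs."""
    d = _pairwise(c1, c2, dist)
    return sum(d) / len(d)
```

Each cluster-level measure takes an object-level distance as a parameter, so the two kinds compose rather than compete.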
2. [16] Data Warehousing, OLAP and Data Cube Computation
(a) [8] The standard deviation of n observations x1, x2, ..., xn
is defined as
[formula missing in the source]
where x̄ is the average (i.e., mean) value of x1, ..., xn.
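Assuming the common population form (dividing by n rather than n − 1), which this copy of the question does not confirm, the definition would read:

```latex
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```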
i. [3] What kind of measure does standard deviation belong to:
distributive, algebraic, or holistic? Justify your answer.
Answer:
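One way to reason about part i is to check whether the standard deviation of a merged cell can be obtained from a fixed number of summaries kept per partition. The sketch below keeps the count, the sum, and the sum of squares, and assumes the population form (divide by n). The `Stats` class and its method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    n: int = 0        # count
    s: float = 0.0    # sum of values
    ss: float = 0.0   # sum of squared values

    @staticmethod
    def of(values):
        values = list(values)
        return Stats(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Each component is distributive, so partitions combine by addition.
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)

    def std(self):
        mean = self.s / self.n
        # Guard against tiny negative values caused by floating-point rounding.
        return math.sqrt(max(self.ss / self.n - mean * mean, 0.0))
```

For example, `Stats.of([1, 2, 3]).merge(Stats.of([4, 5])).std()` matches the standard deviation computed over all five values at once, without revisiting the raw data.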
ii. [5] Outline an efficient algorithm that computes an iceberg cube
with standard deviation as the measure, where the iceberg condition is
n ≥ 100 and σ ≥ 2.
Answer:
(b) [8] It is desirable to construct an AlbumCube to facilitate
multidimensional search through digital photo collections, such as by date,
photographer, location, theme, content, color, etc.
i. [2] What should be the dimensions and measures for such a data
cube?
Answer:
ii. [3] What analytical functions can you provide?
Answer:
iii. [3] What are the major challenges in implementing AlbumCube, and
how would you propose to handle them?
Answer:
3. [19] Frequent pattern and association mining
(a) [6] Since items have different expected frequencies of sales, it
is desirable to use group-based minimum support thresholds
set up by users. For example, one may set up a small min
support for the group of cameras but a rather large one for
the group of bread. Outline an FP-growth-like algorithm that
derives the set of frequent items efficiently in a transaction
database.
Answer: Suppose each item is associated with a group ID.
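Building on that supposition, the first database scan could be sketched as follows. This is only an illustration of the item-filtering step, not the full FP-growth-like algorithm: the FP-tree construction and the handling of mixed-group patterns are not shown. The names `item_group` and `group_min_sup` are hypothetical.

```python
from collections import Counter


def frequent_items(transactions, item_group, group_min_sup):
    """One scan: count each item's support, then keep the items whose
    support reaches the threshold of the group they belong to."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count an item once per transaction
    return {
        item: count
        for item, count in counts.items()
        if count >= group_min_sup[item_group[item]]
    }
```

With a threshold of 2 for an electronics group and 3 for a grocery group, a camera bought twice is kept while milk bought once is dropped, even though both are rare in absolute terms.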
(b) [7] Suppose a BestBuy analyst is interested in only the
frequent patterns (i.e., itemsets) from the sales transactions that
satisfy certain constraints. For the following cases, state the
characteristics (i.e., categories) of every constraint in each
case and how to mine such patterns most efficiently.
i. The profit range for the items in each pattern must be within $50.
Answer:
ii. The sum of the price of all the items with profit over $5 in
each pattern is at least $100.
Answer:
iii. The average profit for those items priced over $50 in each pattern
must be less than $10.
Answer:
(c) [6] Frequent pattern mining often generates many somewhat
"similar" patterns that carry little new information. Give one such
example. Then outline one method that may generate fewer (i.e.,
compressed) but interesting patterns.
Answer:
4. [27] Classification and Prediction
(a) [6] Given a training set of 10 million tuples with 10
attributes, each taking 8 bytes of space. One attribute is a class
label with two distinct values, whereas each of the other attributes
has 50 distinct values. Assume your machine cannot hold the whole
dataset in main memory. Outline an efficient method that
constructs a Naive Bayes classifier, and answer the
following questions explicitly:
(i) how many scans of the database does your algorithm take?
Answer:
(ii) what is the maximum memory space your algorithm will use
in your induction?
Answer:
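A count-based construction is one plausible direction for part (a). The sketch below streams over the rows once and keeps only counters, so memory depends on the number of distinct (attribute, value, class) combinations rather than on the number of tuples. It assumes the class label is the last field of each row and uses Laplace smoothing. Both choices, and the function names, are illustrative assumptions.

```python
from collections import defaultdict
from math import log


def train_counts(rows):
    """Single pass over any iterable of rows; the label is assumed to be last."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (attribute index, value, label) -> count
    for row in rows:
        *features, label = row
        class_counts[label] += 1
        for j, value in enumerate(features):
            feature_counts[(j, value, label)] += 1
    return class_counts, feature_counts


def predict(features, class_counts, feature_counts, n_values=50):
    """Pick the label with the highest log-posterior (Laplace-smoothed).
    n_values is the assumed number of distinct values per attribute."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, c in class_counts.items():
        score = log(c / total)
        for j, value in enumerate(features):
            seen = feature_counts.get((j, value, label), 0)
            score += log((seen + 1) / (c + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For the dataset described in the question, this layout would hold at most 9 × 50 × 2 = 900 feature counters plus 2 class counters, because `rows` can be a file reader that never materializes the full table.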
(b) [6] For each of the following measures, give a situation in which it is
most appropriate for measuring the quality of classification:
(1) sensitivity
Answer:
(2) specificity
Answer:
(3) ROC curve
Answer:
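For reference, the three measures in (b) can be computed from confusion-matrix counts and ranked scores as sketched below. The function names are illustrative, labels are assumed to be 1 for positive and 0 for negative, and tied scores are not treated specially.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives that are recognized."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of actual negatives that are recognized."""
    return tn / (tn + fp)


def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and record
    (false positive rate, true positive rate) after each example."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

Sensitivity and specificity each summarize a single operating point, while the ROC points describe the trade-off across all thresholds.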
(c) [5] People say that if each classifier is better than
random guessing, an ensemble of multiple such classifiers will lead to
a nontrivial increase in classification accuracy. Do you agree
with this statement? Give your reasoning.
Answer:
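One way to examine the claim in (c) is to compute the accuracy of a majority vote under the idealized assumption that the classifiers make independent errors and share the same accuracy p. The function name is hypothetical, and the independence assumption is the key caveat: strongly correlated classifiers need not show this gain.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that more than half of k independent classifiers,
    each correct with probability p, are correct (k must be odd)."""
    if k % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )
```

Comparing `majority_vote_accuracy(0.6, 3)` with `majority_vote_accuracy(0.6, 11)` shows the vote improving as members are added when p exceeds 0.5, and repeating the comparison with p below 0.5 shows it getting worse.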
(d) [5] What are the similarities and differences between semi-supervised classification and active learning?
Answer:
(e) [5] Suppose one would like to build a model to classify U. of Michigan
webpages based on the model you have learned from the UIUC website.
Is it easy to do this by transfer learning? How would you suggest the
person proceed?
Answer:
5. [24] Clustering
(a) [6] Use one sentence to distinguish each of the following pairs of
methods:
(1) k-means vs. KNN
Answer:
(2) STING vs. CLIQUE
Answer:
(3) BIRCH vs. CHAMELEON
Answer:
(b) [6] Outline the best clustering method for the following
tasks (and briefly reason on why you make such a design):
(i) finding oil spills along a coastline
Answer:
(ii) clustering employees in a company based on their
salaries and years of working experience
Answer:
(c) [6] Why is subspace clustering a good choice for high-dimensional data? Outline one efficient and effective subspace
clustering method that can cluster a very high-dimensional (e.g.,
thousands of dimensions) data set.
Answer:
(d) [6] Cross-validation can be useful in both classification and
clustering.
What are the differences in these two cases?