Data Mining (Nursery data set)

 


 

1.      Selecting an appropriate data set

 

Description of the data set


The Nursery data set was taken from the UCI Machine Learning Repository. The Nursery database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used for several years during the 1980s, when there was excessive enrollment in these schools in Ljubljana, Slovenia, and rejected applications frequently needed an objective explanation. The final decision depended on three subproblems: the occupation of the parents and the child's nursery, the family structure and financial standing, and the social and health picture of the family. The Nursery data set is a multivariate data set with categorical values (except for one attribute), and it consists of 8 attributes and 12960 instances. The table below shows the attribute information (UCI Machine Learning Repository: Nursery Data Set, 2021).

 

Table 01 : Attribute information of the dataset

Attribute   Description                        Values
Parents     Parent's occupation                usual, pretentious, great_pret
Has_nurs    Child's nursery                    proper, less_proper, improper, critical, very_crit
Form        Form of the family                 complete, completed, incomplete, foster
Children    Number of children                 1, 2, 3, more
Housing     Housing condition                  convenient, less_conv, critical
Finance     Financial standing of the family   convenient, inconv
Social      Social condition                   non_prob, slightly_prob, problematic
Health      Health condition                   recommended, priority, not_recom

The class distribution table below shows the number of instances per class; the class describes the evaluation of an application for the nursery school.

 

Table 02 : Class distribution

Class        Number of instances (N)   N [%]
not_recom    4320                      33.333
recommend    2                         0.015
very_recom   328                       2.531
priority     4266                      32.917
spec_prior   4044                      31.204
 

Reasons for selecting the Nursery data set


·         The data set contains 12960 instances, which is a relatively large number, so better rules can be generated from it

·         Association rule mining was originally designed to work with categorical data, and since the Nursery data set's attributes are categorical, it is straightforward to apply to this data set

·         There are no missing values in this data set

 

2.    Preparing and preprocessing the data

If there are missing or garbage values in the selected data set, there may be issues when performing the data mining tasks. However, this Nursery data set has no missing or garbage values, so no such issues are expected when running the algorithms. Association rule mining can only be performed on categorical data, and there is one string attribute (the children attribute) in this Nursery data set, so it must be converted to nominal (categorical). The procedure for performing that task is shown below.

 

 


                 Figure 01 : Before converting the children attribute to nominal (current data type – string)

Ø  Filter used : weka.filters.unsupervised.attribute.StringToNominal

Ø  Command used : StringToNominal -R 4

Ø  Object Editor settings :

·         Change the “attributeRange” box value to “4” as shown in Figure 02 below and click OK, then click “Apply” in the Filter panel (the same conversion can also be scripted from Java; see the sketch after Figure 03)

                                                             Figure 02 : Object Editor StringToNominal


 



                     Figure 03 : After converting the children attribute to nominal (new data type – nominal)
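
The conversion can also be done outside the Explorer with the Weka Java API. The sketch below is a minimal example, assuming the raw data were loaded with the children attribute as a string (as in Figure 01) and saved locally as nursery.arff (the file name is an assumption); it applies the same StringToNominal filter to attribute 4.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToNominal;

    public class ConvertChildrenToNominal {
        public static void main(String[] args) throws Exception {
            // Load the data set (file name/path is an assumption - adjust to your copy)
            Instances data = DataSource.read("nursery.arff");

            // Same options as the Explorer command: StringToNominal -R 4
            StringToNominal filter = new StringToNominal();
            filter.setOptions(new String[]{"-R", "4"}); // attribute range 4 = children
            filter.setInputFormat(data);

            Instances converted = Filter.useFilter(data, filter);
            System.out.println(converted.attribute(3)); // children should now be nominal
        }
    }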

 

Algorithm used for discovering association rules

Ø  Algorithm used for association rule mining : Apriori algorithm

 

Apriori algorithm with metric type set to Confidence

§  Command used : Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1

§  Object Editor settings : Set the parameters as shown in Figure 04 below

 

             

 

             

                                                                                        

Figure 04 : Metric type as “Confidence” in the Object Editor
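
For completeness, the same Apriori run can be started from Java instead of the Explorer. The sketch below is only a sketch: it assumes the preprocessed data have been saved as nursery_nominal.arff (an assumed file name), passes exactly the option string listed above, and prints the run summary together with the 10 best rules.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MineNurseryRules {
        public static void main(String[] args) throws Exception {
            // Preprocessed data set (file name is an assumption - adjust to your copy)
            Instances data = DataSource.read("nursery_nominal.arff");

            // Same options as the Explorer command line above (metric type 0 = confidence)
            Apriori apriori = new Apriori();
            apriori.setOptions(Utils.splitOptions(
                    "-N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1"));

            apriori.buildAssociations(data);
            System.out.println(apriori); // prints the summary and the 10 best rules
        }
    }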

3.      Finding rules and determining the interesting rules

 

Summary (Apriori algorithm with metric type as confidence)


§  Minimum support: 0.1 (1296 instances)

§  Minimum metric <confidence>: 0.9

§  Number of cycles performed: 18

§  Generated sets of large itemsets:

§  Size of set of large itemsets L(1): 30

§  Size of set of large itemsets L(2): 137

§  Size of set of large itemsets L(3): 12

 

Number of best rules found : 10 (a maximum of 24 rules could be generated here, but the run was set to output only the 10 best rules)

          * In the rules below, “class” represents the evaluation of the application for the nursery school

1. class=not_recom 4320 ==> health=not_recom 4320    <conf:(1)> lift:(3) lev:(0.22) [2880] conv:(2880)

§  This means that if the evaluation is not recommended, then the health condition is not recommended.

§  The confidence of this rule is 1, which means the rule holds for 100% of the instances that match its left-hand side

§  The lift value of 3 means that the relationship between the L.H.S. and R.H.S. of the rule is significant (the L.H.S. makes the R.H.S. three times more likely than it would be by chance)

§  The leverage and conviction values are 0.22 and 2880 respectively (a worked calculation of these metrics is given after this list)

2. health=not_recom 4320 ==> class=not_recom 4320    <conf:(1)> lift:(3) lev:(0.22) [2880] conv:(2880)

§  If the health condition is not recommended, then the evaluation is not recommended

3. finance=convenient class=not_recom 2160 ==> health=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

§  If the finance is convenient and the evaluation is not recommended, then the health condition is not recommended

 

4. finance=convenient health=not_recom 2160 ==> class=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

§  If the finance is convenient and the health condition is not recommended, then the evaluation is not recommended

§  Here the confidence, lift, leverage and conviction values are 1, 3, 0.11 and 1440 respectively

5. finance=inconv class=not_recom 2160 ==> health=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

§  If the finance is inconvenient and the evaluation is not recommended, then the health condition is not recommended

6. finance=inconv health=not_recom 2160 ==> class=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

§  If the finance is inconvenient and the health condition is not recommended, then the evaluation is not recommended

7. parents=usual class=not_recom 1440 ==> health=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

§  If the parents' occupation is usual and the evaluation is not recommended, then the health condition is not recommended

8. parents=usual health=not_recom 1440 ==> class=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

§  If the parents' occupation is usual and the health condition is not recommended, then the evaluation is not recommended

9. parents=pretentious class=not_recom 1440 ==> health=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

§  If the parents' occupation is pretentious and the evaluation is not recommended, then the health condition is not recommended

10. parents=pretentious health=not_recom 1440 ==> class=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

§  If the parents' occupation is pretentious and the health condition is not recommended, then the evaluation is not recommended
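
As a worked check of the reported metrics, consider rule 1, where both the antecedent (class=not_recom) and the consequent (health=not_recom) cover 4320 of the 12960 instances:

§  support = 4320 / 12960 ≈ 0.333

§  confidence = 4320 / 4320 = 1 (every instance matching the antecedent also matches the consequent)

§  lift = confidence / P(health=not_recom) = 1 / (4320 / 12960) = 3

§  leverage = P(both) - P(antecedent) × P(consequent) = 0.333 - 0.333 × 0.333 ≈ 0.22, i.e. about 2880 instances more than expected under independence (the value shown in square brackets)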

       

Interesting rules which are useful to the client

1.      health=not_recom 4320 ==> class=not_recom 4320    <conf:(1)> lift:(3) lev:(0.22) [2880] conv:(2880)

Reason : The confidence of this rule is 100%, which means the rule holds for every instance that matches its left-hand side: whenever the health condition is not recommended, the application is evaluated as not recommended. The lift value of 3 means that a not-recommended health condition makes a not-recommended evaluation three times more likely than its baseline rate.

2.      finance=convenient health=not_recom 2160 ==> class=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

Reason : This rule also has a confidence of 100%: whenever the health condition is not recommended, the application is evaluated as not recommended even though the finance is convenient. The lift value of 3 means that this combination of health condition and financial standing makes a not-recommended evaluation three times more likely than its baseline rate.

 

3.      finance=inconv health=not_recom 2160 ==> class=not_recom 2160    <conf:(1)> lift:(3) lev:(0.11) [1440] conv:(1440)

Reason : This rule also has a confidence of 100%: whenever the health condition is not recommended and the finance is inconvenient, the application is evaluated as not recommended. The lift value of 3 means that this combination of health condition and financial standing makes a not-recommended evaluation three times more likely than its baseline rate.

 

4.      parents=usual health=not_recom 1440 ==> class=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

Reason : This rule also has a confidence of 100%: whenever the health condition is not recommended, the application is evaluated as not recommended even though the parents' occupation is usual. The lift value of 3 means that this combination of health condition and parents' occupation makes a not-recommended evaluation three times more likely than its baseline rate.

 

5.      parents=pretentious health=not_recom 1440 ==> class=not_recom 1440    <conf:(1)> lift:(3) lev:(0.07) [960] conv:(960)

Reason : This rule also has a confidence of 100%: whenever the health condition is not recommended and the parents' occupation is pretentious, the application is evaluated as not recommended. The lift value of 3 means that this combination of health condition and parents' occupation makes a not-recommended evaluation three times more likely than its baseline rate.

 

    Recommendations

 

          Using the important rules identified above, the following recommendations can be given to applicants hoping to enroll in the nursery school.

*      Above all, the rules indicate that a good health condition is required in order to be admitted to the nursery school

*      It is better if the parents' occupation is usual rather than pretentious or great_pret

*      It is better if the financial standing of the family is convenient

*      It can also be seen that the child's nursery, the form of the family, the number of children, the housing condition and the social condition do not appear in the generated rules, so the three points above are the main ones to consider when applying to the nursery school

          


References

 

§  Tutorialspoint.com. 2021. Weka - Association - Tutorialspoint. [online] Available at: <https://www.tutorialspoint.com/weka/weka_association.htm> [Accessed 4 May 2021]

§  Archive.ics.uci.edu. 2021. UCI Machine Learning Repository: Nursery Data Set. [online] Available at: <https://archive.ics.uci.edu/ml/datasets/nursery> [Accessed 4 May 2021]

§  Researchgate.net. 2021. Discovering Rules for Nursery Students using Apriori Algorithm. [online] Available at: <https://www.researchgate.net/publication/336551606_Discovering_Rules_for_Nursery_Students_using_Apriori_Algorithm> [Accessed 5 May 2021]

§  Weka.sourceforge.io. 2021. Apriori (weka-dev 3.9.5 API). [online] Available at: <https://weka.sourceforge.io/doc.dev/weka/associations/Apriori.html> [Accessed 8 May 2021]

 
