This page shows part of my research project for the Machine Learning for Data Analysis course by Wesleyan University on Coursera (Week 1 Assignment: Running a Classification Tree). The research topic and data set are described here.
With the U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) 2001–2002 wave 1 data (n = 43,093), decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable, Alcohol Use Disorder (AlcDisorder). Various possible separations (categorical) or cut points (quantitative) were tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
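To make the two criteria above concrete, here is a minimal sketch in plain Python of how an entropy-based "goodness of split" and a cost-complexity score can be computed. It is not the software used for the analysis; the function names, the node counts, and the `alpha` penalty value are all hypothetical and chosen only for illustration.

```python
import math


def entropy(positives, total):
    """Shannon entropy (in bits) of a binary node with `positives` cases out of `total`."""
    if total == 0:
        return 0.0
    h = 0.0
    for count in (positives, total - positives):
        p = count / total
        if p > 0:  # 0 * log(0) is treated as 0
            h -= p * math.log2(p)
    return h


def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children`.

    Every node is a (positives, total) tuple.
    """
    parent_pos, parent_total = parent
    weighted = sum(
        total / parent_total * entropy(pos, total) for pos, total in children
    )
    return entropy(parent_pos, parent_total) - weighted


def cost_complexity(misclassification_rate, n_leaves, alpha):
    """Cost-complexity score: error rate plus a penalty of `alpha` per leaf."""
    return misclassification_rate + alpha * n_leaves


# Hypothetical node: 80 positive cases among 1,000 respondents.
parent = (80, 1000)
split_a = [(70, 300), (10, 700)]  # concentrates the positive cases in one child
split_b = [(40, 500), (40, 500)]  # children look exactly like the parent

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"gain for split A: {gain_a:.4f}")
print(f"gain for split B: {gain_b:.4f}")

# Hypothetical full tree vs. pruned subtree, with alpha = 0.001 per leaf.
full_tree_score = cost_complexity(0.060, n_leaves=12, alpha=0.001)
pruned_score = cost_complexity(0.064, n_leaves=4, alpha=0.001)
print(f"full tree: {full_tree_score:.3f}, pruned subtree: {pruned_score:.3f}")
```

When growing the tree, the candidate split with the larger gain is preferred (split A here, since split B leaves the class mix unchanged). When pruning, a smaller subtree can win even with a slightly higher error rate, because the per-leaf penalty outweighs the small loss in accuracy.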
The following explanatory variables were included as possible contributors to a classification tree model evaluating Alcohol Abuse/Dependence (i.e., Alcohol Use Disorder): age, gender, race/ethnicity, household income, marital status, education level, occupation and other job-related conditions, relationship problems, financial hardship, the region and type of housing, age at first drinking alcohol, drinking pattern and frequency, smoking, drug use, gambling, mental health problems, family history of any addiction or mental health issues, etc. [*The complete list is shown below]
The S2AQ10 variable (HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS) was the first variable to separate the sample into two subgroups. Adults who drank enough to feel intoxicated at least once in the last 12 months (S2AQ10=1-10) were more likely to have alcohol use disorder than adults who answered Never (S2AQ10=11): 29.2% vs. 1.8%, respectively.
Among the adults who drank enough to feel intoxicated at least once in the last 12 months, a further subdivision was made with the S2AQ12B variable (HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS). Adults who drank after midnight 3 or more times in the last year (S2AQ12B=1-9) were more likely to have alcohol use disorder than adults who did so less frequently (S2AQ12B=10-11): 47.1% vs. 18.2%, respectively.
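The two splits described above can be written out as a small lookup. This is only a sketch of the top of the tree as reported here, not the full fitted model: the function name is hypothetical, the percentages are the observed rates quoted in the text, and returning `None` for any code outside the ranges mentioned (for example a missing or unknown response) is my own assumption.

```python
def observed_aud_rate(s2aq10, s2aq12b=None):
    """Return the observed alcohol use disorder rate (%) for the node a respondent
    falls into, using only the first two splits reported in the text.

    s2aq10  : code for how often the respondent drank enough to feel intoxicated
    s2aq12b : code for how often the respondent drank after midnight (optional)
    """
    if s2aq10 == 11:  # answered Never
        return 1.8
    if 1 <= s2aq10 <= 10:  # codes 1-10
        if s2aq12b is None:
            return 29.2  # rate for the whole S2AQ10=1-10 node
        if 1 <= s2aq12b <= 9:  # after midnight 3 or more times in the last year
            return 47.1
        if 10 <= s2aq12b <= 11:  # less frequently
            return 18.2
    # Assumption: any other code is not covered by the splits described here.
    return None


print(observed_aud_rate(11))     # respondent who answered Never
print(observed_aud_rate(3, 5))   # S2AQ10=3, S2AQ12B=5
print(observed_aud_rate(3, 10))  # S2AQ10=3, S2AQ12B=10
```

Reading the tree this way makes the interaction visible: the after-midnight split only applies to respondents who are already in the S2AQ10=1-10 branch.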
The total model classified 93.6% of the sample correctly, with sensitivity (True Positive Rate) = 32.8% and specificity (True Negative Rate) = 98.7%. That is, the model identifies adults without alcohol use disorder very accurately (about 99% correct) but identifies adults with Alcohol Abuse/Dependence much less accurately (only about 33% correct).
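The gap between the high overall accuracy and the low sensitivity follows from how the three measures are related. The sketch below, in plain Python, defines them from confusion-matrix counts and then back-calculates the share of positive cases implied by the three reported figures. It assumes all three figures were computed on the same sample; the function names and the example counts are hypothetical, and the implied share is only approximate because the reported percentages are rounded.

```python
def rates_from_counts(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


def implied_prevalence(accuracy, sensitivity, specificity):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    return (specificity - accuracy) / (specificity - sensitivity)


# Hypothetical counts, only to show the definitions at work.
acc, sens, spec = rates_from_counts(tp=30, fn=70, tn=880, fp=20)
print(f"accuracy={acc:.3f}, sensitivity={sens:.3f}, specificity={spec:.3f}")

# Figures reported in the text.
p = implied_prevalence(accuracy=0.936, sensitivity=0.328, specificity=0.987)
print(f"implied share of positive cases: {p:.3f}")

# Consistency check: plugging p back in reproduces the reported accuracy.
rebuilt_accuracy = p * 0.328 + (1 - p) * 0.987
print(f"rebuilt accuracy: {rebuilt_accuracy:.3f}")
```

With the reported figures, the implied share of adults with alcohol use disorder is roughly 7.7%. Because the negative class is so much larger, overall accuracy is dominated by specificity, which is why the model can be 93.6% correct overall while still missing about two thirds of the positive cases.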
* The full list of the 59 explanatory variables used in the decision tree model is as follows: SEX="GENDER"; AGE="AGE"; AgeAlcUse="AGE RANGE WHEN STARTED DRINKING"; CENDIV="CENSUS DIVISION"; BUILDTYP="TYPE OF BUILDING FOR HOUSEHOLD"; S1Q1D5="WHITE CHECKED IN MULTIRACE CODE"; S1Q1C="HISPANIC OR LATINO"; S1Q1D1="AMERICAN INDIAN OR ALASKA NATIVE"; S1Q1D2="ASIAN"; S1Q1D3="BLACK OR AFRICAN AMERICAN"; S1Q1D4="NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER"; S1Q2A="LIVED WITH AT LEAST 1 BIOLOGICAL PARENT BEFORE AGE 18"; S1Q2B="BIOLOGICAL FATHER EVER LIVE IN HOUSEHOLD BEFORE RESPONDENT WAS 18"; SPOUSE="SPOUSE OF RESPONDENT IN HOUSEHOLD"; S1Q4A="AGE AT FIRST MARRIAGE"; S1Q4B="HOW FIRST MARRIAGE ENDED"; S1Q6A="HIGHEST GRADE OR YEAR OF SCHOOL COMPLETED"; S1Q7A1="WORKING FULL TIME (35+ HOURS A WEEK)"; S1Q7A2="WORKING PART TIME (<35 HOURS A WEEK)"; S1Q7A3="EMPLOYED BUT NOT AT WORK BECAUSE OF TEMPORARY ILLNESS OR INJURY"; S1Q9A="BUSINESS OR INDUSTRY"; S1Q9B="OCCUPATION"; S1Q9C="TYPE OF EMPLOYER"; S1Q12B="TOTAL HOUSEHOLD INCOME IN LAST 12 MONTHS"; S1Q14A="PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS"; S1Q16="SELF-PERCEIVED CURRENT HEALTH"; S1Q232="ANY FAMILY MEMBERS OR CLOSE FRIENDS HAD SERIOUS ILLNESS OR INJURY IN LAST 12 MONTHS"; S1Q233="MOVED/ANYONE NEW CAME TO LIVE WITH YOU IN LAST 12 MONTHS"; S1Q234="FIRED OR LAID OFF FROM JOB IN LAST 12 MONTHS"; S1Q236="HAD TROUBLE WITH BOSS OR COWORKER IN LAST 12 MONTHS"; S1Q237="CHANGED JOBS, JOB RESPONSIBILITIES OR WORK HOURS IN LAST 12 MONTHS"; S1Q238="GOT SEPARATED OR DIVORCED OR BROKE OFF STEADY RELATIONSHIP IN LAST 12 MONTHS"; S1Q239="HAD PROBLEMS WITH NEIGHBOR, FRIEND OR RELATIVE IN LAST 12 MONTHS"; S1Q2310="EXPERIENCED MAJOR FINANCIAL CRISIS OR BANKRUPTCY IN LAST 12 MONTHS"; S1Q2312="YOU OR FAMILY MEMBER BEEN VICTIM OF CRIME IN LAST 12 MONTHS"; S2DQ1="BLOOD/NATURAL FATHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ2="BLOOD/NATURAL MOTHER EVER AN ALCOHOLIC OR PROBLEM DRINKER"; S2DQ3C2="ANY FULL BROTHERS EVER ALCOHOLICS OR PROBLEM DRINKERS"; S2DQ4C2="ANY FULL SISTERS EVER ALCOHOLICS OR PROBLEM DRINKERS"; SMOKER="TOBACCO USE STATUS"; S3AQ3C1="USUAL QUANTITY WHEN SMOKED CIGARETTES"; TAB12MDX="NICOTINE DEPENDENCE IN THE LAST 12 MONTHS"; TABP12MDX="NICOTINE DEPENDENCE PRIOR TO THE LAST 12 MONTHS"; S3BQ1A6="EVER USED COCAINE OR CRACK"; DGSTATUS="DRUG USE STATUS"; DGENAXDXSNI12="GENERALIZED ANXIETY IN LAST 12 MONTHS"; DGENAXDXSNIP12="GENERALIZED ANXIETY PRIOR TO THE LAST 12 MONTHS"; GAMB12DX="PATHOLOGICAL GAMBLING IN LAST 12 MONTHS"; GAMBP12DX="PATHOLOGICAL GAMBLING PRIOR TO THE LAST 12 MONTHS"; ANTISOCDX2="ANTISOCIAL PERSONALITY DISORDER (WITH CONDUCT DISORDER)"; AVOIDPDX2="AVOIDANT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)"; DEPPDDX2="DEPENDENT PERSONALITY DISORDER (LIFETIME DIAGNOSIS)"; S2AQ4B="HOW OFTEN DRANK COOLERS IN LAST 12 MONTHS"; S2AQ5B="HOW OFTEN DRANK BEER IN LAST 12 MONTHS"; S2AQ6B="HOW OFTEN DRANK WINE IN LAST 12 MONTHS"; S2AQ7B="HOW OFTEN DRANK LIQUOR IN LAST 12 MONTHS"; S2AQ10="HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS"; S2AQ12A="HOW OFTEN DRANK BEFORE 3 PM IN LAST 12 MONTHS"; S2AQ12B="HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS"; S2AQ12C="HOW OFTEN DRANK AT HOME ALONE IN LAST 12 MONTHS"; S2AQ12D="HOW OFTEN DRANK IN PUBLIC PLACES IN LAST 12 MONTHS"