Part A (30 points):
Note: You should use RapidMiner to complete Part A. Remember that the ID column is used only for identification and should not be included as a predictor variable in your machine learning model.
Assume you are a manager at a local outpatient clinic.
You would like to offer patients at risk for obesity and hypercholesterolemia early interventions to help improve their health.
Use the data in the Cholesterol.xlsx file and apply machine learning to identify four groups of similar patients, so you can offer tailored interventions to each group. In the gender column, 0 denotes female and 1 denotes male.
1a. What is the average cholesterol for the group with the highest overall cholesterol? (5 points)
1b. Does this group (identified above) have more males or females? (5 points)
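Part A is to be completed in RapidMiner, but the logic behind question 1 can be sketched in Python for illustration: run k-means with k = 4 on the predictor columns (ID excluded), then summarize each cluster. Everything below is an assumption made for the sketch: the records are invented toy values rather than rows from Cholesterol.xlsx, the column layout and function names are hypothetical, and the centroids start from evenly spaced rows instead of random starts so the result is repeatable. Real data would usually be normalized before clustering.

```python
from statistics import mean

# Hypothetical toy records: (id, weight_kg, cholesterol_mg_dl, gender); 0 = female, 1 = male.
# These values are invented; they are NOT taken from Cholesterol.xlsx.
patients = [
    (1, 60, 150, 0), (2, 62, 155, 0), (3, 61, 152, 1),
    (4, 80, 200, 1), (5, 82, 205, 0), (6, 81, 198, 1),
    (7, 100, 250, 1), (8, 102, 255, 1), (9, 101, 252, 0),
    (10, 120, 300, 1), (11, 122, 305, 1), (12, 121, 302, 0),
]

def features(row):
    # Drop the ID (index 0): it identifies a patient and is not a predictor.
    return row[1:]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=100):
    # Simplified, deterministic initialization: evenly spaced rows as starting centroids.
    step = len(rows) // k
    centroids = [features(rows[i * step]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each row joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: sq_dist(features(r), centroids[c]))
            clusters[nearest].append(r)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            tuple(mean(col) for col in zip(*(features(r) for r in cl))) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters

def summarize(cluster):
    # Returns (average cholesterol, number of males, number of females).
    avg_chol = mean(r[2] for r in cluster)
    males = sum(r[3] for r in cluster)
    return avg_chol, males, len(cluster) - males

clusters = kmeans(patients, k=4)
top = max((c for c in clusters if c), key=lambda c: summarize(c)[0])
avg_chol, males, females = summarize(top)
print(f"highest-cholesterol group: mean={avg_chol:.1f}, males={males}, females={females}")
```

The per-cluster summary mirrors what questions 1a and 1b ask for: the mean cholesterol of the highest-cholesterol group and its gender split.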
2. Many of your patients have still not signed up for the patient portal, which allows patients to manage several health care-related tasks online, such as requesting medication refills, scheduling appointments, and messaging their physician.
To increase patient portal adoption, you would like to send a targeted mail campaign to the group of current non-users who are more likely to sign up for the patient portal.
Pick an appropriate machine learning algorithm, use the Training.xlsx file to train your model, and predict portal adoption for the patients listed in Scoring.xlsx.
2a. What machine learning algorithm would be appropriate: linear regression or logistic regression? (5 points)
2b. What is the predicted patient portal adoption status for Patient ID 993? If targeted in the marketing campaign, is she likely to become an adopter or remain a non-adopter? (5 points)
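For illustration only, the train-then-score workflow in question 2 can be sketched outside RapidMiner. The sketch below fits one possible model for a yes/no outcome, a logistic regression trained by plain gradient descent, on labeled rows and then scores unlabeled rows. The records, the attribute names (age, visits per year), and the patient IDs are all hypothetical and do not come from Training.xlsx or Scoring.xlsx.

```python
import math

# Hypothetical labeled rows: (id, age, visits_per_year, adopted); adopted: 1 = yes, 0 = no.
training = [
    (1, 25, 6, 1), (2, 30, 5, 1), (3, 35, 4, 1), (4, 40, 5, 1),
    (5, 60, 1, 0), (6, 65, 2, 0), (7, 70, 1, 0), (8, 75, 0, 0),
]
# Hypothetical unlabeled rows to score: (id, age, visits_per_year).
scoring = [(101, 28, 6), (102, 72, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_scaler(rows):
    # Record the (min, max) of each feature column from the training data.
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_scaler(row, bounds):
    # Min-max scale each feature using the training bounds.
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]

def train(X, y, lr=0.5, epochs=2000):
    # Batch gradient descent on the logistic (log-loss) objective.
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# The ID (index 0) is excluded from the features; the label is the last column.
bounds = fit_scaler([r[1:3] for r in training])
X = [apply_scaler(r[1:3], bounds) for r in training]
y = [r[3] for r in training]
w, b = train(X, y)

probabilities = {}
predictions = {}
for pid, *feats in scoring:
    x = apply_scaler(feats, bounds)
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    probabilities[pid] = p
    predictions[pid] = 1 if p >= 0.5 else 0
    print(pid, round(p, 3), "adopter" if predictions[pid] else "non-adopter")
```

The model outputs a probability between 0 and 1 for each scored patient, which is then thresholded at 0.5 to give a predicted adoption status.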
3. You are interested in learning what factors play a role in patient adoption of patient portals, so you can plan your marketing strategy accordingly.
Create a decision tree using the data in Training.xlsx. What is the most important factor determining adoption? (5 points)
Very briefly, what is your observation regarding this variable, as determined by the decision tree? (5 points)
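One common reading of "most important factor" in a decision tree is the attribute chosen for the root split, that is, the attribute that best separates adopters from non-adopters under the tree's splitting criterion. The sketch below illustrates that idea using information gain as the criterion; the criterion a given tool uses may differ. The rows and the attribute names (age_group, has_smartphone) are invented for illustration and are not taken from Training.xlsx.

```python
import math
from collections import Counter

# Hypothetical toy rows; attribute names and values are invented for illustration.
rows = [
    {"age_group": "young", "has_smartphone": "yes", "adopted": "yes"},
    {"age_group": "young", "has_smartphone": "yes", "adopted": "yes"},
    {"age_group": "young", "has_smartphone": "yes", "adopted": "yes"},
    {"age_group": "young", "has_smartphone": "no", "adopted": "no"},
    {"age_group": "older", "has_smartphone": "yes", "adopted": "yes"},
    {"age_group": "older", "has_smartphone": "no", "adopted": "no"},
    {"age_group": "older", "has_smartphone": "no", "adopted": "no"},
    {"age_group": "older", "has_smartphone": "no", "adopted": "no"},
]

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="adopted"):
    # Entropy of the target before the split minus the weighted entropy after it.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {a: information_gain(rows, a) for a in ("age_group", "has_smartphone")}
best = max(gains, key=gains.get)
for attribute, gain in gains.items():
    print(f"{attribute}: gain = {gain:.3f}")
print("root split:", best)
```

In this toy data the root attribute splits the classes cleanly, so the observation about it would be read directly from the two branches under the root node.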
Part B (70 points):
The challenge is to analyze a health dataset of your choice and determine what can be learned from the data. Using any publicly available dataset retrievable online (you may use any resource, including those listed below), apply visual analytics techniques in Tableau to identify trends and patterns in your data.
You must summarize and discuss your findings in a paper with a recommended length of 1500-2000 words. The paper should follow the outline below (you may add sections as needed):
Introduction/Background: Identify the problem and its history, and discuss why it is important and worth investigating. Identify and briefly describe the stakeholders.
Method: Indicate the source of the data, state any assumptions you made, and describe what your data contains. You should also specify the time period, population, and any other applicable descriptors of your data. Describe what analysis was performed and why.
Results: Summarize your findings. You may include figures/graphs as well as numerical descriptive data (tables) to discuss your results. Describe your findings and their significance, including any interesting trends or patterns you could identify.
Conclusion: Synthesize your results and discuss the implications of your findings for stakeholders and future research. How can your findings help and influence decision making for identified stakeholders?