I am lucky to be deeply involved in the WiDS Datathon Committee this year. The committee worked hard to choose a data modeling challenge that would be accessible to both new and seasoned data scientists. The data science field is very vibrant. New packages focusing on all manner of data preparation and statistical analysis emerge every year. The datathon is a great opportunity for me to learn new things and keep abreast of developments in the field.
This year’s WiDS Datathon challenge focuses on predicting a type of diabetes, Diabetes Mellitus, based on data collected during the first 24 hours of admission to an intensive care unit. This is a condition in which the body has difficulty maintaining blood glucose levels, and it can have secondary impacts on the heart, kidneys, and other organs. In the absence of complete medical records, it is critical for ER and critical care doctors to be aware of chronic comorbidities such as diabetes so that a patient can be treated appropriately. This information isn’t always available: patients may be admitted to a hospital unresponsive and without a complete medical history, or may come from another health system whose records are not readily accessible.
At the same time, the consequences of misdiagnosis are very real. Treatment of patients with diabetes differs from treatment of those without in significant ways. For example, wide fluctuations in blood sugar are common in diabetic patients and are better tolerated by them. It is clear that information about chronic diseases like diabetes is crucial to managing critical illness.
We will try to model this very real medical problem using the best machine learning tools we have at our disposal. We also encourage participants to collaborate with friends and colleagues in the medical domain to share knowledge and expertise that could be relevant to the modeling challenge.
Below, I will help you with the basic steps of dataset preparation and model fitting.
1. Data Preparation:
Most data scientists might agree that dataset preparation is the most tedious and time-consuming step of model building. “dplyr” is a bread-and-butter package for data manipulation. I was pleasantly surprised to learn about newer packages, “janitor” and “fastDummies”, that also help with dataset preparation. This is particularly important for a dataset such as the GOSSIS data, which reflects real-world hospital data.
Comparing columns in large data frames can be tough. I found this function from the package “janitor” useful in comparing very large Training and Test data frames: compare_df_cols(Training, Test).
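For example, you can limit the report to columns that actually differ (a minimal sketch; Training and Test are the data frames from above):

library(janitor)
# Return only the columns whose names or classes differ between the two frames
mismatches <- compare_df_cols(Training, Test, return = "mismatch")
print(mismatches)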
Another trick that really helped me was writing out the structure of the dataframes. For example:
# Capture the structure of the data frame as text, one entry per column
Trainstr <- capture.output(str(Train, list.len = ncol(Train)))
writeLines(Trainstr, "Train_structure.txt")  # write it out for side-by-side comparison
Creating dummy variables with the package “fastDummies” was very useful. For example:
library(fastDummies)
# Create one dummy column per factor level, dropping the first level of each
Traindummies <- dummy_cols(Train_fctrs, remove_first_dummy = TRUE)
If you decide to fit XGBoost, you will need to create a DMatrix, for which there are specific things to keep in mind. There are many ways in which the matrix construction can fail, and it is very important to have the Training and Test matrices exactly lined up.
(Hint: You might need to rename a few columns to line them up; such is the nature of real-world data. It will be obvious which ones to rename.)
In addition, there are very simple fixes that help with the construction. For example, lining up the column names of the Training and Test matrices is a simple trick that prevents DMatrix construction failure:

Test <- Test[names(Train)]  # select and reorder Test columns to match Train exactly
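To make this concrete, here is a minimal sketch of DMatrix construction, assuming Train and Test are fully numeric data frames and that the outcome column is called diabetes_mellitus (a stand-in name for illustration):

library(xgboost)
# Separate the outcome from the features; xgb.DMatrix needs a numeric matrix
# (diabetes_mellitus is an illustrative column name)
label <- Train$diabetes_mellitus
feats <- as.matrix(Train[, setdiff(names(Train), "diabetes_mellitus")])
dtrain <- xgb.DMatrix(data = feats, label = label)
dtest  <- xgb.DMatrix(data = as.matrix(Test[, colnames(feats)]))  # same columns, same order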
All of these hints and techniques are relevant to data preparation for other models as well, such as TensorFlow networks and SVMs.
Here is a sample of data preparation, pulling the pieces above together (a minimal sketch; the exact columns and steps will depend on your copy of the data):
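library(dplyr)
library(janitor)
library(fastDummies)

# Sketch only: the steps below are illustrative, not the exact GOSSIS recipe.
# Standardize column names and convert character columns to factors
Train <- Train %>%
  clean_names() %>%
  mutate(across(where(is.character), as.factor))

# One-hot encode the factors, dropping each first level and the original columns
Train <- dummy_cols(Train, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Flag any remaining column mismatches against the test set
# (the same cleaning steps should be applied to Test first)
compare_df_cols(Train, Test, return = "mismatch")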
2. Fit Models:
If you are new to model development, the range of algorithmic approaches to choose from might be understandably overwhelming. There is an ever-expanding ocean of statistical approaches and modeling techniques. A great starting point is the classic book that has accompanied many people on their machine learning and data science journey:
The Elements of Statistical Learning:
https://web.stanford.edu/~hastie/ElemStatLearn/
A question one might like to address early on is what kinds of patterns are embedded in the data that could support classification. For example, can the data points be separated easily by a linear boundary? Are there penalization approaches that could improve the quality of the fit? (A short example follows the links below.)
Here are some resources that explore these concepts:
https://statweb.stanford.edu/~owen/courses/305a/Rudyregularization.pdf
http://heather.cs.ucdavis.edu/draftregclass.pdf
http://web.engr.oregonstate.edu/~xfern/classes/cs434-18/Regularization-5.pdf
https://www.cs.cmu.edu/~mgormley/courses/10701-f16/slides/lecture4.pdf
https://www.ics.uci.edu/~xhx/courses/CS273P/04-linear-regression-273p.pdf
https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2004.pdf
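As a concrete illustration of a penalized fit, here is a minimal sketch using the “glmnet” package; x (a numeric feature matrix) and y (a 0/1 outcome vector) are assumed to already exist:

library(glmnet)
# Cross-validated logistic regression with an L1 (lasso) penalty;
# alpha = 1 gives the lasso, alpha = 0 gives ridge
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Coefficients at the penalty strength that minimized cross-validated error
coef(cvfit, s = "lambda.min")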
Could there be nonlinear patterns in the data that, if represented appropriately, would assist with the separation of the classes? If you wish to explore nonlinear separation, there are several approaches; a short example follows the pointers below.
Here are some pointers:
http://cs229.stanford.edu/summer2020/cs229-notes3.pdf
http://matt.colorado.edu/teaching/categories/jsw7.pdf
https://www.ics.uci.edu/~welling/teaching/KernelsICS273B/svmintro.pdf (introductory)
https://arxiv.org/pdf/math/0701907.pdf (more mathematical)
https://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
https://www.cs.utah.edu/~zhe/teach/pdf/Tensorflow_tutorial.pdf
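For instance, a kernel SVM can learn such nonlinear boundaries. Here is a minimal sketch using the “e1071” package; the outcome name and data frames are stand-ins, and the outcome should be a factor for classification:

library(e1071)
# A radial basis kernel implicitly maps the features into a space where a
# linear separator may exist even when none does in the original space
# (diabetes_mellitus, Train, and Test are illustrative names)
svm_fit <- svm(diabetes_mellitus ~ ., data = Train,
               kernel = "radial", cost = 1, probability = TRUE)
pred <- predict(svm_fit, newdata = Test, probability = TRUE)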
As you build different models, you might consider trees and ensembles. These algorithms give you significant control of the fitting process in terms of parameters you can experiment with.
https://arxiv.org/pdf/1603.02754 (XGBoost)
https://xgboost.readthedocs.io/en/latest/
If this is your first time doing hyperparameter tuning, try an experimental approach to model fitting. Study your fits as you tune the parameters. For example, are stumps (shorter trees) better or worse than deeper trees for this dataset? How does eta affect performance? Soon, you might develop an intuition for your fits: you might see steep learners that quickly fizzle out, or slow and steady risers that make micro-improvements in performance over several thousand rounds.
Here is a sample of an XGBoost model parameter set (a minimal sketch; the values below are illustrative starting points, not tuned results):
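# Sketch only: these values are starting points to experiment with, not tuned results
params <- list(
  booster          = "gbtree",
  objective        = "binary:logistic",
  eval_metric      = "auc",
  eta              = 0.05,  # learning rate: smaller values give slow, steady risers
  max_depth        = 6,     # 1 gives stumps; larger values give deeper trees
  subsample        = 0.8,
  colsample_bytree = 0.8
)

model <- xgb.train(params = params, data = dtrain, nrounds = 2000,
                   watchlist = list(train = dtrain, eval = dval),  # dval: a held-out validation DMatrix
                   early_stopping_rounds = 50, print_every_n = 100)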
3. Consider Research
This year, the WiDS Datathon is placing significant emphasis on research. I think this is a terrific opportunity to think mathematically and statistically and to plunge into active areas of research. The areas of pure and applied math and statistics research are simply vast. Are there statistical, algebraic, geometric, number-theoretic, complex-analytic, or topological areas that interest you? Perhaps there are applied areas, such as applications to healthcare, that are of interest? Maybe you have always wanted to design a new visualization technique. Well, now is the time to bring out all of those ideas and put forth a research proposal!
Above all, enjoy the data challenge. If you view it as an exercise to continue learning and growing, you will find that it opens a door to an immense and very thrilling field.
Ready to try it yourself? Learn more and participate in the challenge.
Sharada Kalanidhi is Co-Chair of the WiDS Datathon Committee and a Data Scientist/Quantitative Strategist and Inventor with over 20 years of industry experience.