diabetestalk.net

Pima Indian Diabetes R

Pima Data Science

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and 7 are continuous. This means we should have at least 8 plots: the target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned. To plot two factor variables, we should preferably use a stacked bar chart or a mosaic plot. For one numeric and one factor variable, bar plots seem like a good option. And for two numeric variables, we have our faithful scatter plot to the rescue.

In this blog post I will stress code and inferences more than words; both are explained and documented in my code. I strongly suggest you view the code below.

DATA -> (a file named diabetes.csv)
R Code -> (A fair warning: to execute the EDA code in R you will first need to execute the and )
Python Code -> (It's a Jupyter Notebook)

The idea behind using this data set from the UCI repository is not just to run models, but to derive inferences that match the real world. Our predictions become more sensible and robust when we have understood the data set and derived correct inferences from it that agree with those predictions.

Our approach to this data set will be to: (1) perform exploratory data analysis while deriving inferences from it; (2) use techniques like PCA and check the correlation between variables; (3) run various models and make inferences from the predictions. We will do all of this in R and in Python.

Lichman, M. (2013). UCI Machine Learning Repository [Irvine, CA: University of California, School of Information and Computer Sc...

Let us first begin to underst...
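The plot choices described in this excerpt can be sketched in base R. This is only an illustrative sketch, not the author's code: the data frame is synthetic, the columns AgeGroup, Glucose and BMI are made up for the example, and choose_plot is a hypothetical helper that simply names the plot type for a pair of variables.

```r
# Synthetic stand-in for the diabetes data; every value here is made up.
set.seed(42)
n <- 200
pima <- data.frame(
  Outcome  = factor(rbinom(n, 1, 0.35), levels = c(0, 1)),
  AgeGroup = factor(sample(c("21-30", "31-45", "46+"), n, replace = TRUE)),
  Glucose  = rnorm(n, mean = 120, sd = 30),
  BMI      = rnorm(n, mean = 32, sd = 6)
)

# Hypothetical helper: pick a plot type from the classes of two variables.
choose_plot <- function(x, y) {
  kinds <- c(is.numeric(x), is.numeric(y))
  if (all(kinds)) {
    "scatter"   # numeric vs numeric
  } else if (!any(kinds)) {
    "mosaic"    # factor vs factor (a stacked bar chart also works)
  } else {
    "bar"       # one numeric, one factor
  }
}

print(choose_plot(pima$AgeGroup, pima$Outcome))
print(choose_plot(pima$Glucose, pima$Outcome))
print(choose_plot(pima$Glucose, pima$BMI))

# Draw the three plot types only when running interactively.
if (interactive()) {
  mosaicplot(table(pima$AgeGroup, pima$Outcome), main = "AgeGroup vs Outcome")
  barplot(tapply(pima$Glucose, pima$Outcome, mean),
          xlab = "Outcome", ylab = "Mean Glucose")
  plot(pima$Glucose, pima$BMI, xlab = "Glucose", ylab = "BMI",
       col = c("steelblue", "firebrick")[as.integer(pima$Outcome)])
}
```

The bar plot shows the mean of the numeric variable per factor level, which is one way to read "bar plots" for the numeric-versus-factor case; a box plot per level would show the spread as well.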

Pima Indians Diabetes Database

In this post we try to analyse a data set that was acquired by the National Institute of Diabetes and Digestive and Kidney Diseases in the year 1990. It consists of records of 768 women of the Pima Indian tribe of Arizona, all at least 21 years old, who might or might not have diabetes. These people live along the Gila and Salt rivers in Arizona. The data set consists of variables such as blood pressure, glucose levels, insulin levels, number of pregnancies, skin thickness, body mass index and outcome (positive/negative).

In the above plot, the factor level 1 denotes the onset of diabetes. We see that the median ages for these two categories differ: the median age of the women with diabetes in this data set is much higher. This could be attributed to the lack of physical movement as we get older. Diet can also play a huge role here.

There seems to be no apparent pattern with respect to the skin thickness and the blood pressure of individuals. Let's take a better look at this plot. There are individuals who have a skin thickness of 0! This is not possible. These are most likely data collection errors that were never rectified. What would happen if we removed these points? We see a very weak correlation, which conveys a weak relationship between skin thickness and blood pressure.

In the above plot, we have removed outliers from both columns of observations to better understand what we might find out. The relationship isn't linear here.

The above plot is trying to tell us that the relationship might be linear. The points have been coloured in such a way that we can demarcate the positive results from the negative ones. The points that are denoted by triangles are positive.
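The zero-handling step discussed in this excerpt can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's code: the ten rows are made-up values, the column names SkinThickness, BloodPressure, Age and Outcome are assumed, and zeros are treated as missing readings.

```r
# Made-up measurements; a 0 stands for a reading that was never recorded.
obs <- data.frame(
  SkinThickness = c(34, 28, 0, 22, 36, 0, 31, 44, 0, 27),
  BloodPressure = c(70, 68, 62, 64, 58, 76, 54, 72, 0, 90),
  Age           = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  Outcome       = factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))
)

# Recode the impossible zeros as NA so they stop distorting the statistics.
cols <- c("SkinThickness", "BloodPressure")
cleaned <- obs
cleaned[cols] <- lapply(cleaned[cols], function(v) replace(v, v == 0, NA))

# Correlation with the zeros kept, and with only complete pairs.
cor_raw   <- cor(obs$SkinThickness, obs$BloodPressure)
cor_clean <- cor(cleaned$SkinThickness, cleaned$BloodPressure,
                 use = "complete.obs")
print(c(raw = cor_raw, cleaned = cor_clean))

# Median age per outcome level (1 = onset of diabetes).
median_age <- tapply(obs$Age, obs$Outcome, median)
print(median_age)
```

Recoding to NA rather than deleting rows keeps the other columns of those records available for later analysis.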

Diabetes Data Analysis In R

Data collected from diabetes patients is now widely investigated in data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled and relatively straightforward to use for further machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients. The raw data can be sporadic and messy, and analyzing such data requires more preprocessing.

In this blog, we will explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics, which is an option to Oracle Database Enterprise Edition. Note that this data analysis is for machine learning study only. We are not medical researchers or physicians in the diabetes domain. Our knowledge of this disease so far comes from the material included with the data set.

The data is from the UCI archive. It was collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file that contains 3-4 months of glucose level measurements and insulin dosages, as well as other special events (exercise, meal consumption, etc.).

First, we need to construct a data frame from the 70 separate files. This can be readily accomplished in R as follows; however, if the data were provided as several database tables, the function rbind, overloaded by ORE to work on ore.frame objects, could be used to union these tables.

    dd.list <- list(0)
    for(i in 1:70) {
      fileName <- sprintf("data-%02d", i)
      dd <- read.csv(fileName, header=FALSE, sep='\t')
      datetime.vec <- paste(dd$V1, dd$V2)
      dd$datetime <- as.POSIXct(strptime(datet...
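Since the loop in this excerpt is cut off, here is a self-contained sketch of the same idea in base R (no ORE): read each per-patient file, parse the timestamp, and union the pieces with rbind. Several things are assumptions for illustration only: the three sample files written to a temporary directory, their contents, the "%m-%d-%Y %H:%M" date layout, and the names read_patient, data_dir, patient and dd.all.

```r
# Write three tiny tab-separated files so the sketch runs on its own;
# the excerpt describes 70 files named data-01 ... data-70.
data_dir <- tempfile("diabetes-")
dir.create(data_dir)
for (i in 1:3) {
  sample_rows <- data.frame(
    V1 = c("04-21-1991", "04-21-1991"),  # date (assumed layout)
    V2 = c("9:09", "17:08"),             # time of day
    V3 = c(58, 33),                      # made-up event code
    V4 = c(100 + i, 7)                   # made-up value
  )
  write.table(sample_rows, file.path(data_dir, sprintf("data-%02d", i)),
              sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
}

# Read one file, attach a parsed timestamp and the patient number.
read_patient <- function(i, data_dir) {
  dd <- read.csv(file.path(data_dir, sprintf("data-%02d", i)),
                 header = FALSE, sep = "\t", stringsAsFactors = FALSE)
  datetime.vec <- paste(dd$V1, dd$V2)
  # Assumption: month-day-year dates and a 24-hour clock.
  dd$datetime <- as.POSIXct(strptime(datetime.vec, "%m-%d-%Y %H:%M", tz = "UTC"))
  dd$patient <- i
  dd
}

# Union the per-patient data frames into one.
dd.list <- lapply(1:3, read_patient, data_dir = data_dir)
dd.all <- do.call(rbind, dd.list)
print(dd.all)
```

Building a list with lapply and calling rbind once avoids growing a data frame inside the loop, which gets slow as the number of files increases.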

Validatingmodelsinr/pima-indians-diabetes.names.txt At Master Winvector/validatingmodelsinr Github

1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were 76% on the remaining 192 instances.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

7. For Each Attribute: (all numeric-valued)
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   6. Body mass index (weight in kg/(height in m)^2)

9. Class Distribution: (class value 1 is interpreted as "tested positive for...
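The cutoff step described under Results can be illustrated with a short R sketch. It does not implement ADAP: the scores and labels below are made up, and treating a score at or above the cutoff as positive is an assumption, since the excerpt does not say which side of 0.448 the boundary falls on.

```r
# Made-up real-valued predictions in [0, 1] and true labels (1 = positive).
score <- c(0.91, 0.62, 0.45, 0.30, 0.12, 0.55, 0.47, 0.20, 0.80, 0.40)
truth <- c(1,    1,    1,    0,    0,    0,    1,    0,    1,    1)

cutoff <- 0.448  # the cutoff reported in the excerpt
# Assumption: scores at or above the cutoff count as positive.
pred <- as.integer(score >= cutoff)

# Confusion-matrix cells.
tp <- sum(pred == 1 & truth == 1)
fn <- sum(pred == 0 & truth == 1)
tn <- sum(pred == 0 & truth == 0)
fp <- sum(pred == 1 & truth == 0)

sensitivity <- tp / (tp + fn)  # share of true positives that were caught
specificity <- tn / (tn + fp)  # share of true negatives that were cleared
print(c(sensitivity = sensitivity, specificity = specificity))
```

Moving the cutoff trades one measure against the other, so a cutoff where sensitivity and specificity come out equal, as reported in the excerpt, is one common way to pick a single operating point.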