diabetestalk.net

# Pima Indian Diabetes R

## Accuracy Improvement For Diabetes Disease Classification: A Case On A Public Medical Dataset - Sciencedirect

Volume 9, Issue 3, September 2017, Pages 345-357. Mehrbakhsh Nilashi. Open Access funded by Fuzzy Information and Engineering Branch of The Operations Research Society in China.

As a chronic disease, diabetes mellitus has emerged as a worldwide epidemic. A diagnostic aid for diabetes that relies only on medical information obtainable without advanced medical equipment can help many people who want to detect the disease, or their risk of it, at an early stage. This could have a large positive impact on many people's lives. The aim of this study is to classify diabetes by developing an intelligent system using machine learning techniques. Our method is developed through clustering, noise removal and classification approaches: we use SOM, PCA and NN for clustering, noise removal and classification, respectively. Experimental results on the Pima Indian Diabetes dataset show that the proposed method remarkably improves prediction accuracy relative to methods developed in previous studies. The hybrid intelligent system can assist medical practitioners in healthcare practice as a decision support system.

## R: Diabetes In Pima Indian Women

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

These data frames contain the following columns:

- plasma glucose concentration in an oral glucose tolerance test.
- body mass index (weight in kg/(height in m)^2).
- Yes or No, for diabetic according to WHO criteria.

The training set Pima.tr contains a randomly selected set of 200 subjects, and Pima.te contains the remaining 332 subjects. Pima.tr2 contains Pima.tr plus 100 subjects with missing values in the explanatory variables.

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988), ed. R. A. Greenes, pp. 261-265. Los Alamitos, CA: IEEE Computer Society Press.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
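The random 200/332 division of the 532 complete records can be illustrated with a short Python sketch. This is only an illustration of a seeded random split: the function name `split_records`, the seed, and the integer stand-in records are assumptions, and it does not reproduce the actual membership of Pima.tr and Pima.te.

```python
import random


def split_records(records, n_train, seed=0):
    """Shuffle a copy of `records` and split it into (train, test)."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    shuffled = list(records)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in record IDs (hypothetical); real rows would be the 532 complete records.
records = list(range(532))
train, test = split_records(records, n_train=200)
print(len(train), len(test))  # 200 332
```

Every record ends up in exactly one of the two sets, which is the property that matters when the test set is used to estimate out-of-sample error.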


## Pima Indians Diabetes Database

In this post we try to analyse a dataset that was acquired by the National Institute of Diabetes and Digestive and Kidney Diseases. It consists of records of 768 women of the Pima Indian tribe of Arizona, all at least 21 years old, who may or may not have diabetes. The data set was acquired in 1990. These people live along the Gila and Salt rivers in Arizona. The data set consists of variables such as blood pressure, glucose level, insulin level, number of pregnancies, skin thickness, body mass index and outcome (positive/negative).

In the above plot, the factor level 1 denotes the onset of diabetes. We see that the median ages for these two categories differ: the median age of the women with diabetes in this data set is much higher. This could be attributed to the lack of physical movement as we get older. Diet can also play a huge role here.

There seems to be no apparent pattern with respect to the skin thickness and the blood pressure of individuals. Let's take a closer look at this plot. There are individuals who have a skin thickness of 0! This is not possible. Data collection errors must have occurred and were never rectified. What would happen if we removed these points? We see a very weak correlation, which conveys a weak relationship between skin thickness and blood pressure.

In the above plot, we have removed outliers from both columns of observations to better understand what we might find out. The relationship isn't linear here.

The above plot is trying to tell us that the relationship might be linear. The points have been coloured so that we can tell the positive results from the negative ones. The points that are denoted by triangles are positive.
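The step of removing the impossible zero readings and then checking the correlation can be sketched in plain Python. The helper names `drop_zero_pairs` and `pearson` and the sample values are made up for illustration; they are not taken from the post or from the data set.

```python
import math


def drop_zero_pairs(xs, ys):
    """Keep only the pairs in which neither measurement is recorded as 0."""
    kept = [(x, y) for x, y in zip(xs, ys) if x != 0 and y != 0]
    return [x for x, _ in kept], [y for _, y in kept]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical readings: skin thickness (mm) and blood pressure (mm Hg).
skin = [0, 23, 35, 0, 29, 32]
pressure = [72, 66, 0, 64, 70, 74]

clean_skin, clean_pressure = drop_zero_pairs(skin, pressure)
print(len(clean_skin), "pairs kept")
print("r =", pearson(clean_skin, clean_pressure))
```

A coefficient near 0 on the cleaned pairs would match the "very weak correlation" reading; the value printed here only reflects the made-up sample.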

## Using A Neural Network To Predict Diabetes In Pima Indians

Created a 95% accurate neural network to predict the onset of diabetes in Pima Indians. Pretty cool!

```python
# Theano backend: needed to navigate to c:/users/Alex Ko/.keras/keras.json
# and change tensorflow to theano.
# Create first network with Keras
import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import StandardScaler

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load pima indians dataset
dataset = numpy.loadtxt('pima-indians-diabetes.csv', delimiter=",")
# dataset = pd.read_csv('pima-indians-diabetes.csv')
data = pd.DataFrame(dataset)  # data is a pandas DataFrame; dataset is a NumPy array
print(data.head())

# split into input (X, the independent variables) and output (Y, the dependent variable)
X = dataset[:, 0:8]  # columns 0-7 are the inputs; the slice end (column 8) is not included
Y = dataset[:, 8]    # column 8 is the class label

# standardize the inputs
scaler = StandardScaler()
X = scaler.fit_transform(X)

# create model
model = Sequential()
# model.add(Dense(1000, input_dim=8, init='uniform', activation='relu'))  # 1000 neurons
# model.add(Dense(100, init='uniform', activation='tanh'))  # 100 neurons with tanh activation function
# the first active layer must declare the input size
model.add(Dense(500, input_dim=8, init='uniform', activation='relu'))  # 500 neurons
# 95.41% accuracy with 500 neurons
# 86.99% accuracy with 100 neurons
# 85.2% accuracy with 50 neurons
# 81.38% accuracy with 10 neurons
model.add(Dense(1, init='uniform', activation='sigmoid'))  # 1 output neuron

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X, Y, nb_epoch=150, batch_size=10, verbose=2)  # 150 epochs, batch size 10

# evaluate the model
# scores = model.evaluate(X …
```
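The scaling step is easy to overlook, so here is a minimal pure-Python sketch of what column-wise standardization does: each column is shifted to mean 0 and divided by its population standard deviation. The function name `standardize_columns` and the sample rows are assumptions for illustration; the script itself relies on scikit-learn's `StandardScaler`.

```python
import math


def standardize_columns(rows):
    """Return a copy of `rows` with every column scaled to mean 0, std 1."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    # Population standard deviation (divide by n, not n - 1).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n_rows)
        for j in range(n_cols)
    ]
    # A constant column has std 0; divide by 1 instead so it maps to all zeros.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_cols)]
        for row in rows
    ]


# Two made-up rows with two features; the second feature is constant.
scaled = standardize_columns([[1.0, 10.0], [3.0, 10.0]])
print(scaled)  # [[-1.0, 0.0], [1.0, 0.0]]
```

Putting features such as glucose and age on a common scale keeps any single input from dominating the weighted sums in the first layer.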

## Pima Data Science

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and 7 are continuous. This means we should have at least 8 plots: the target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned. To plot two factor variables, we should preferably use a stacked bar chart or mosaic plot. For one numeric and one factor variable, bar plots seem like a good option. And for two numeric variables we have our faithful scatter plot to the rescue.

In this blog post I will stress code and inferences more than words; the inferences are explained and documented in my code. I strongly suggest you view the code below.

- DATA -> (A file named diabetes.csv is the one)
- R Code -> (A fair warning: to execute the EDA code in R you will first need to execute the … and …)
- Python Code -> (It's a Jupyter Notebook)

The idea behind using this data set from the UCI repository is not just running models, but deriving inferences that match the real world. Predictions become more sensible and convincing when we have understood the data set and derived correct inferences from it that agree with those predictions. Our approach to this data set will be to perform the following:

- Exploratory data analysis while deriving inferences from it
- Using techniques like PCA and checking correlations in the data
- Running various models and making inferences from the predictions

We will do all of this in R, and in Python.

Lichman, M. (2013). UCI Machine Learning Repository [Irvine, CA: University of California, School of Information and Computer Sc…

Let us first begin to underst…
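The plot-selection rule described in the first paragraph can be written down as a small lookup. This is a sketch of that rule only; the function name `suggest_plot` and the labels `"factor"` and `"numeric"` are assumptions, not names from the author's code.

```python
def suggest_plot(x_kind, y_kind):
    """Suggest a plot type for a pair of variables.

    Each argument is either "factor" (categorical) or "numeric".
    """
    kinds = {x_kind, y_kind}
    if not kinds <= {"factor", "numeric"}:
        raise ValueError("kinds must be 'factor' or 'numeric'")
    if kinds == {"factor"}:
        return "stacked bar chart or mosaic plot"
    if kinds == {"numeric"}:
        return "scatter plot"
    return "bar plot"  # one factor and one numeric variable


print(suggest_plot("factor", "factor"))
print(suggest_plot("numeric", "factor"))
print(suggest_plot("numeric", "numeric"))
```

With a factor target such as Outcome, each numeric predictor falls into the mixed case and the single factor predictor into the factor/factor case.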

## Pima Indians Diabetes Database

Diabetes is a terrible disease, and predicting its onset in advance would be extremely valuable. This is the goal of the Pima Indians Diabetes Database. A population at high risk, the Pima Indians, was monitored and a variety of medical data recorded for 768 women, including blood glucose levels, insulin, the body mass index, and others. This is available on Kaggle as the Pima Indians Diabetes Database, originally from this publication. Let's have a look:

This is the body mass index (BMI) as a function of age. The BMI is the ratio of weight to height squared. Some people have a BMI of 0: the weight was not measured and the missing record was encoded as a 0. This is common in other variables too: in the Pima Indians Diabetes Database zeroes represent missing values, except for the variable Pregnancies. From now on I will not display such values.

In this plot a green point represents a person who was later found to have diabetes, and a yellow point is a person later found healthy. We see that BMI and age don't correlate much and that green dots occur in higher proportion at higher BMIs (for heavier people). Obesity is a risk factor for diabetes, though the connection doesn't seem by any means one-to-one, at least in this plot.

What about other variables, such as insulin and blood glucose, as measured two hours after the administration of a carbohydrate solution? This is a standard test run by the scientists who compiled the Pima Indians Diabetes Database, and blood glucose above 200 would mean that the subject is already diabetic, by definition, at the time of the test. So clearly we have no data points with a Glucose variable over 200. It also seems that people with higher Glucose and Insulin are more prone to becoming diabetic later on. I added trend lines for the two group…
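Both the BMI definition and the zero-as-missing convention fit in a few lines of Python. This is a minimal sketch: the helper names are hypothetical, the sample row is made up, and treating `Outcome` as a second column where 0 is a real value (the negative class) is an assumption added here, since the text names only Pregnancies.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in m, squared."""
    return weight_kg / height_m ** 2


def zeros_to_missing(row, keep_zero=("Pregnancies", "Outcome")):
    """Replace 0 with None in every column where 0 encodes a missing value."""
    return {
        name: (None if value == 0 and name not in keep_zero else value)
        for name, value in row.items()
    }


print(bmi(81.0, 1.8))  # 25.0

# Made-up row for illustration.
row = {"Pregnancies": 0, "Glucose": 120, "BMI": 0, "Outcome": 0}
print(zeros_to_missing(row))
```

Marking such entries as missing, rather than leaving them as 0, keeps them out of plots and summary statistics.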

## Diabetes Data Analysis In R

Data collected from diabetes patients has been widely investigated by many data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled and relatively straightforward to use for further machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients. The raw data can be sporadic and messy, and analyzing it requires more preprocessing. In this blog, we will explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics, an option to Oracle Database Enterprise Edition.

Note that this data analysis is for machine learning study only. We are not medical researchers or physicians in the diabetes domain. Our knowledge of this disease so far comes from the material included with the data set.

The data is from the UCI archive. It is collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file that contains 3-4 months of glucose level measurements and insulin dosages, as well as other special events (exercise, meal consumption, etc.).

First, we need to construct a data frame from the 70 separate files. This can be readily accomplished in R as follows; however, if the data were provided as several database tables, the function rbind, overloaded by ORE to work on ore.frame objects, could be used to union these tables.

```r
dd.list <- list(0)
for (i in 1:70) {
  fileName <- sprintf("data-%02d", i)
  dd <- read.csv(fileName, header = FALSE, sep = '\t')
  datetime.vec <- paste(dd$V1, dd$V2)
  dd$datetime <- as.POSIXct(strptime(datet…
```
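For readers who prefer Python, the same two steps (building the 70 file names and joining the date and time columns into one timestamp) could be sketched as follows. The `MM-DD-YYYY` and `HH:MM` formats, the helper names, and the sample line are assumptions for illustration; the post's own processing is the R loop shown in the text.

```python
import csv
import io
from datetime import datetime


def file_names(n_files=70):
    """File names data-01 ... data-70, matching sprintf("data-%02d", i)."""
    return [f"data-{i:02d}" for i in range(1, n_files + 1)]


def parse_records(text, stamp_format="%m-%d-%Y %H:%M"):
    """Parse tab-separated lines into (timestamp, remaining fields) tuples.

    The default timestamp format is an assumption about the files.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        stamp = datetime.strptime(f"{fields[0]} {fields[1]}", stamp_format)
        rows.append((stamp, fields[2:]))
    return rows


names = file_names()
print(names[0], names[-1])  # data-01 data-70

# One made-up line in the assumed layout: date, time, then two more fields.
sample = "04-21-1991\t09:09\t58\t100\n"
print(parse_records(sample))
```

Reading each file with `open(name).read()` and passing the text to `parse_records` would give one list of rows per patient, ready to be concatenated.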

## Analysis Of Diabetes Data Set Of Pima Indians Using Neural Network And Nn Ensemble

The data set can be downloaded from the UCI Machine Learning Repository. It contains records of female patients (Pima Indians) at least 21 years of age. It has 768 instances and the following 8 attributes plus a class variable (all numeric-valued):

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

This data set contains the diagnostic data to investigate whether the patient shows signs of diabetes according to World Health Organization criteria such as the 2-hour post-load plasma glucose. The graph below (obtained from Weka) shows the histograms of all the attributes.

The above histograms provide the following insights:

- Class 0, with 500 instances, represents patients who tested negative, and class 1, with 268 instances, represents patients who tested positive.
- The data set is small and imbalanced, with about 65 percent of patients testing negative. This could act as a limitation in the study.
- The attributes 2-Hour serum insulin, Diabetes pedigree function, Age and Number of times pregnant are highly skewed to the right, while Plasma glucose concentration, Diastolic blood pressure and Body mass index appear to be normally distributed.

Removal of the outliers: as seen in the histogram below, there are 49 outliers (red bar), which have been removed as part of data pre-processing. Reviewing the scatter plots below of all attributes did not show any relationships amongst the attributes; however, there…
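The class counts quoted above fix the accuracy of the simplest possible classifier, one that always predicts the majority class, and any model should be judged against that floor. A minimal sketch (the function name `majority_baseline` is an assumption):

```python
def majority_baseline(class_counts):
    """Return (majority label, accuracy of always predicting that label)."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


# Counts from the text: 500 negative (class 0), 268 positive (class 1).
label, accuracy = majority_baseline({0: 500, 1: 268})
print(label, round(accuracy, 3))  # 0 0.651
```

A classifier reporting, say, 70% accuracy on this data set is therefore only a few points better than predicting "negative" for everyone.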

## Validatingmodelsinr/pima-indians-diabetes.names.txt At Master Winvector/validatingmodelsinr Github

1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In *Proceedings of the Symposium on Computer Applications and Medical Care* (pp. 261-265). IEEE

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were both 76% on the remaining 192 instances.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

7\. For Each Attribute: (all numeric-valued)

2\. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

6\. Body mass index (weight in kg/(height in m)^2)

9\. Class Distribution: (class value 1 is interpreted as "tested positive for…
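Turning a real-valued prediction into a binary decision with the 0.448 cutoff, and then scoring it by sensitivity and specificity, can be sketched in Python as follows. The scores and labels are made up, and treating a score exactly at the cutoff as positive is an assumption; the description does not say which side the boundary falls on.

```python
CUTOFF = 0.448  # cutoff quoted in the results paragraph


def binarize(scores, cutoff=CUTOFF):
    """Map each score to 1 if it reaches the cutoff (assumed inclusive), else 0."""
    return [1 if score >= cutoff else 0 for score in scores]


def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up predictions and ground truth for six hypothetical patients.
scores = [0.10, 0.50, 0.448, 0.30, 0.90, 0.20]
truth = [0, 1, 0, 1, 1, 0]

predicted = binarize(scores)
sensitivity, specificity = sensitivity_specificity(truth, predicted)
print(predicted)
print(sensitivity, specificity)
```

Moving the cutoff trades one rate against the other, so a cutoff at which both come out equal, as in the reported 76% figures, is one natural operating point.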