diabetestalk.net

# Pima Indian Diabetes R

## Using A Neural Network To Predict Diabetes In Pima Indians

Created a 95% accurate neural network to predict the onset of diabetes in Pima Indians. Pretty cool! #theano. Needed to navigate to c:/users/Alex Ko/.keras/keras.json and change the backend from tensorflow to theano.

```python
# Create first network with Keras
import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import StandardScaler

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load the pima indians dataset
dataset = numpy.loadtxt('pima-indians-diabetes.csv', delimiter=",")
data = pd.DataFrame(dataset)  # data is a pandas DataFrame; dataset stays a numpy array
print(data.head())

# split into input (X, the predictor variables) and output (Y, the class label)
X = dataset[:, 0:8]  # columns 0-7 are the predictors (column 8 is not included)
Y = dataset[:, 8]    # column 8 is the class label

# standardize the inputs
scaler = StandardScaler()
X = scaler.fit_transform(X)

# create model (init= is the Keras 1 name; Keras 2 renamed it kernel_initializer=)
model = Sequential()
# model.add(Dense(1000, input_dim=8, init='uniform', activation='relu'))  # 1000 neurons
# model.add(Dense(100, init='uniform', activation='tanh'))  # 100 neurons, tanh activation
model.add(Dense(500, input_dim=8, init='uniform', activation='relu'))  # 500 neurons
# 95.41% accuracy with 500 neurons
# 86.99% accuracy with 100 neurons
# 85.2%  accuracy with 50 neurons
# 81.38% accuracy with 10 neurons
model.add(Dense(1, init='uniform', activation='sigmoid'))  # 1 output neuron

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model (nb_epoch= is the Keras 1 name; Keras 2 renamed it epochs=)
model.fit(X, Y, nb_epoch=150, batch_size=10, verbose=2)

# evaluate the model on the training data
scores = model.evaluate(X, Y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
```

Note that the accuracies quoted in the comments are measured on the same data the network was trained on.
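The `StandardScaler` step in the script rescales each input column to zero mean and unit variance before training. A minimal numpy sketch of that transform, using toy values rather than the Pima file:

```python
import numpy as np

# toy stand-in for two input columns on very different scales
X = np.array([[148.0, 50.0],
              [ 85.0, 31.0],
              [183.0, 32.0],
              [ 89.0, 21.0]])

# z-score each column: subtract the column mean, divide by the column std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # ~0 for each column
print(X_scaled.std(axis=0))   # ~1 for each column
```

This matters for the network above because gradient-based training behaves much better when inputs share a common scale.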

## validatingmodelsinr/pima-indians-diabetes.names.txt at master · Winvector/validatingmodelsinr · GitHub

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE.

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination, or if diabetes was found during routine medical care). The population lives near Phoenix, Arizona, USA.

Results: their ADAP algorithm makes a real-valued prediction between 0 and 1, which was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were 76% on the remaining 192 instances. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices; it is a unique algorithm, and the paper gives details.

Attribute excerpts from the .names file (all numeric-valued): plasma glucose concentration at 2 hours in an oral glucose tolerance test; body mass index (weight in kg/(height in m)^2); class distribution (class value 1 is interpreted as "tested positive for diabetes").
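The 0.448 cutoff turns ADAP's real-valued output into a yes/no decision, from which sensitivity and specificity are computed. A minimal sketch of that calculation, with hypothetical scores and labels rather than the paper's actual predictions:

```python
# Turn real-valued scores into binary decisions with a cutoff,
# then compute sensitivity and specificity (toy values, not the ADAP results).
cutoff = 0.448
scores = [0.10, 0.30, 0.50, 0.60, 0.90, 0.40]  # hypothetical model outputs
labels = [0,    0,    1,    1,    1,    1]     # true classes

preds = [1 if s >= cutoff else 0 for s in scores]

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)
```

Moving the cutoff trades one rate against the other, which is why 0.448 (rather than the naive 0.5) was chosen in the original study.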

## Accuracy Improvement For Diabetes Disease Classification: A Case On A Public Medical Dataset - ScienceDirect

Volume 9, Issue 3, September 2017, Pages 345-357. Accuracy Improvement for Diabetes Disease Classification: A Case on a Public Medical Dataset. Mehrbakhsh Nilashi et al. Open Access funded by the Fuzzy Information and Engineering Branch of The Operations Research Society in China. As a chronic disease, diabetes mellitus has emerged as a worldwide epidemic. Providing diagnostic aid for diabetes using data that contains only medical information obtainable without advanced medical equipment can help people who want to discover the disease, or the risk of disease, at an early stage; this could have a large positive impact on many people's lives. The aim of this study is to classify diabetes disease by developing an intelligent system using machine learning techniques. Our method is developed through clustering, noise removal and classification approaches; accordingly, we use SOM, PCA and NN for the clustering, noise removal and classification tasks, respectively. Experimental results on the Pima Indian Diabetes dataset show that the proposed method remarkably improves prediction accuracy relative to methods developed in previous studies. The hybrid intelligent system can assist medical practitioners as a decision support system in healthcare practice.
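The PCA stage in a pipeline like this keeps only the leading principal components and treats low-variance directions as noise. A minimal numpy sketch of that idea on toy data (the paper's SOM and NN stages are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: one strong direction of variation plus small isotropic noise
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2.0 * t]) + 0.05 * rng.normal(size=(200, 2))

# centre, then diagonalise via SVD (principal components = rows of Vt)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# fraction of total variance carried by each component
explained = s**2 / np.sum(s**2)
print(explained)  # the first component dominates

# "denoise" by projecting onto the first component only
X_denoised = (Xc @ Vt[0]).reshape(-1, 1) @ Vt[:1] + X.mean(axis=0)
```

Dropping the trailing components removes the small perpendicular jitter while keeping the dominant structure, which is the sense in which PCA acts as noise removal in the abstract above.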

## Stat 590 Initial Data Analysis Using R

This is a critical step that should always be performed. You should:

- understand the background of the dataset and what each variable in it represents;
- calculate descriptive statistics, such as means, standard deviations, maxima and minima, correlations, and whatever else is appropriate;
- draw graphical summaries, such as histograms, box plots, density plots, scatter plots, and many more;
- check whether the data are distributed according to prior expectations, and whether any assumptions of the models to be used in later analyses are violated.

Here is a data set from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases on 768 adult female Pima Indians living near Phoenix. We start by reading the data into R:

```
> pima <- read.table("pima.txt", header=T)  # read the data into R
```

Take a close look at the minimum and maximum values of each variable. What have you found? It is odd that blood pressure equals zero (also check the variables glucose, triceps, insulin and bmi). Let's look at the sorted values to find out how many 0's the variable blood contains:

```
> sort(pima$blood)  # sort the values of this variable from small to large
  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [19]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  24
 [37]  30  30  38  40  44  44  44  44  46  46  48  48  48  48  48  50  50  50
 ...
[739]  94  94  94  94  94  94  95  96  96  96  96  98  98  98 100 100 100 102
[757] 104 104 106 106 106 108 108 110 110 110 114 122
```

It seems likely that zero has been used as a missing-value code. In a real investigation, one would try to find out what really happened and, if the values are truly missing, whether there is a systematic missingness mechanism. R uses "NA" as its missing-value code, so let's set all zero values of these variables to NA:

```
> pima$blood[pima$blood == 0] <- NA  # set the zeros to NA
```
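The same zero-to-missing recoding can be sketched outside R as well; here is a minimal numpy version of the `pima$blood` step above, with a short hypothetical column in place of the real data:

```python
import numpy as np

# toy blood-pressure column where zero is being used as a missing-value code
blood = np.array([0.0, 64.0, 0.0, 72.0, 80.0, 0.0])

print(int((blood == 0).sum()))  # number of suspicious zeros

# recode zeros as NaN, numpy's analogue of R's NA for float data
blood[blood == 0] = np.nan

# summaries must now skip missing values explicitly
print(int(np.isnan(blood).sum()))  # missing-value count
print(np.nanmean(blood))           # mean of the observed values only
```

As in R, the point of the recode is that ordinary summaries stop silently averaging in impossible zeros; `nanmean` of the three observed values gives 72.0 here, whereas the raw mean would be dragged down to 36.0.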

## Leave The Pima Indians Alone!

(This article was first published on Xi'an's Og, and kindly contributed to R-bloggers.)

"our findings shall lead us to be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if one considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive" (p. 1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had threatened to publish for a while now, namely on why using the Pima Indian logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read their reasons! Beyond the debate on the worth of such a benchmark, the paper takes a wider perspective on how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time, which plays against most MCMC samplers. As a first entry, Nicolas and James point out that the MAP can be derived by a standard Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy, as most datasets seem to allow for Newton-Raphson convergence; the Hessian is available as well. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger.
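The MAP point mentioned above really is a routine Newton-Raphson computation when the prior is Gaussian; a minimal sketch for a one-covariate logistic regression with a N(0, s²) prior on the weight, using simulated toy data and hypothetical settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy logistic-regression data with one covariate and true weight 1.5
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-1.5 * x))).astype(float)

s2 = 100.0  # variance of the Gaussian prior on the weight
w = 0.0
for _ in range(25):
    p = 1 / (1 + np.exp(-w * x))              # current fitted probabilities
    grad = np.sum((y - p) * x) - w / s2        # d log-posterior / dw
    hess = -np.sum(p * (1 - p) * x**2) - 1/s2  # second derivative (always negative)
    w_new = w - grad / hess                    # Newton-Raphson step
    if abs(w_new - w) < 1e-10:
        w = w_new
        break
    w = w_new

print(round(w, 2))  # MAP estimate, close to the true weight
```

The log-posterior is concave here, so Newton-Raphson converges in a handful of steps, which is the blog's point: for problems this easy, elaborate samplers have a very cheap deterministic competitor.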

## Analysis Of Diabetes Data Set Of Pima Indians Using Neural Network And Nn Ensemble

The data set can be downloaded from the UCI Machine Learning Repository. It consists of female patients (Pima Indians) at least 21 years of age. It has 768 instances and the following attributes (all numeric-valued):

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

The data set contains diagnostic data for investigating whether a patient shows signs of diabetes according to World Health Organization criteria, such as the 2-hour post-load plasma glucose. The histograms of all the attributes (obtained from Weka) provide the following insights. Class 0, with 500 instances, represents patients who tested negative, and class 1, with 268 instances, represents patients who tested positive. The data set is small and appears biased, with almost 65 percent of patients testing negative; this could be a limitation of the study. The attributes 2-hour serum insulin, diabetes pedigree function, age and number of times pregnant are highly skewed to the right, while plasma glucose concentration, diastolic blood pressure and body mass index appear approximately normally distributed. Removal of outliers: there are 49 outliers (the red bar in the histogram), which were removed as part of data pre-processing. Reviewing scatter plots of all attribute pairs did not reveal clear relationships amongst the attributes.
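The 500/268 class split quoted above sets an obvious baseline: a predictor that always answers "negative" is already right about 65% of the time, so any model worth reporting must beat that. A quick check:

```python
# class counts from the Pima dataset description
negative, positive = 500, 268
total = negative + positive

# accuracy of the trivial majority-class ("always negative") predictor
baseline = negative / total
print(total)               # 768 instances
print(round(baseline, 3))  # ~0.651, i.e. ~65% accuracy with no model at all
```

This is why accuracy figures on this dataset should always be read against the 65% majority-class floor rather than against 50%.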

## Pima Data Science

We have a classification problem. Our data set has 8 independent variables in total, of which one is a factor and 7 are continuous. This means we should have at least 8 plots: the target variable Outcome should be plotted against each independent variable if we want to derive inferences and leave no stone unturned. To plot two factor variables, we should preferably use a stacked bar chart or mosaic plot; for one numeric and one factor variable, bar plots seem like a good option; and for two numeric variables we have our faithful scatter plot to the rescue. In this blog post I will not stress words so much as code and the inferences drawn from it, which are well explained and documented in the code itself. I strongly suggest you view the code below. DATA -> (a file named diabetes.csv) R Code -> (a fair warning: to execute the EDA code in R you will first need to execute the and ) Python Code -> (it's a Jupyter notebook) The idea behind using this data set from the UCI repository is not just to run models, but to derive inferences that match the real world. This makes our predictions all the more sensible and strong, especially when we have understood the data set and drawn correct inferences from it. Our approach to this data set will be to: perform exploratory data analysis, deriving inferences along the way; use techniques like PCA and check correlations between variables; and run various models, drawing inferences from their predictions. We will do all of this in R and in Python. Lichman, M. (2013). UCI Machine Learning Repository [Irvine, CA: University of California, School of Information and Computer Science].

## PimaIndiansDiabetes: Pima Indians Diabetes Database in mlbench: Machine Learning Benchmark Problems

The data were converted to R format by Friedrich Leisch.

Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [Irvine, CA: University of California, Department of Information and Computer Science].

Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.

Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance, in D. H. Wolpert (1995), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.

```
> data(PimaIndiansDiabetes)
> summary(PimaIndiansDiabetes)
    pregnant         glucose         pressure         triceps
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    insulin           mass          pedigree           age        diabetes
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   pos:268
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00

> data(PimaIndiansDiabetes2)
> summary(PimaIndiansDiabetes2)
    pregnant         glucose         pressure         triceps
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
 Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
 3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
 Max.   :17.000   Max.   :199.0   ...
```
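The quartiles that `summary()` prints are just the 25th, 50th and 75th percentiles of each column (R's default quantile type 7 matches numpy's default linear interpolation). A minimal Python equivalent on a toy column, not the actual Pima values:

```python
import numpy as np

# toy "pregnant"-like column
x = np.array([0, 1, 1, 2, 3, 3, 5, 6, 8, 13])

# the six numbers R's summary() prints per numeric column
summary = {
    "Min.":    float(np.min(x)),
    "1st Qu.": float(np.percentile(x, 25)),
    "Median":  float(np.median(x)),
    "Mean":    float(np.mean(x)),
    "3rd Qu.": float(np.percentile(x, 75)),
    "Max.":    float(np.max(x)),
}
print(summary)
```

Comparing the two `summary()` outputs above column by column is then just comparing these percentiles before and after the zero-as-missing cleanup in `PimaIndiansDiabetes2`.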

## Machine Learning Datasets In R (10 Datasets You Can Use Right Now)

You need standard datasets to practice machine learning. In this short post you will discover how to load standard classification and regression datasets in R. It shows three R libraries you can use to load standard datasets, and ten specific datasets you can use for machine learning in R. Being able to load standard datasets in R is invaluable: you can test, practice and experiment with machine learning techniques and improve your skill with the platform. Practice on small, well-understood datasets. There are hundreds of standard test datasets that you can use to practice and get better at machine learning. Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, well behaved and small; this last point is critical when practicing machine learning. Learn more about practicing machine learning with datasets from the UCI Machine Learning Repository in the linked post. You can load the standard datasets into R as CSV files, but there is a more convenient approach: the datasets have been packaged into third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN). Which libraries should you use, and which datasets are good to start with? Need more help with R for machine learning? Take my free 14-day email course and discover how to use R on your project (with sample code). Click to sign up and also get a free PDF ebook version of the course. In this section you will discover the libraries that give you access to standard machine learning datasets, along with specific classification and regression datasets you can use.

## R: Diabetes In Pima Indian Women

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin. These data frames contain the following columns: plasma glucose concentration in an oral glucose tolerance test; body mass index (weight in kg/(height in m)^2); and Yes or No, for diabetic according to WHO criteria. The training set Pima.tr contains a randomly selected set of 200 subjects, and Pima.te contains the remaining 332 subjects. Pima.tr2 contains Pima.tr plus 100 subjects with missing values in the explanatory variables.

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988), ed. R. A. Greenes, pp. 261-265. Los Alamitos, CA: IEEE Computer Society Press.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.