diabetestalk.net

Pima Indian Diabetes R

Using A Neural Network To Predict Diabetes In Pima Indians

Created a 95% accurate neural network to predict the onset of diabetes in Pima Indians. Pretty cool!

    # theano. Needed to navigate to c:/users/Alex Ko/.keras/keras.json and change tensorflow to theano
    # Create first network with Keras
    import keras
    from keras.models import Sequential
    from keras.layers import Dense
    import numpy
    import pandas as pd
    import sklearn
    from sklearn.preprocessing import StandardScaler

    # fix random seed for reproducibility
    seed = 7
    numpy.random.seed(seed)

    # load pima indians dataset
    dataset = numpy.loadtxt('pima-indians-diabetes.csv', delimiter=",")
    #dataset = pd.read_csv('pima-indians-diabetes.csv')
    data = pd.DataFrame(dataset)  # data is a pandas DataFrame; dataset is a NumPy array
    print(data.head())

    # split into input (X, i.e. independent variables) and output (Y, i.e. dependent variable)
    X = dataset[:,0:8]  # columns 0-7 are the inputs; the slice end (column 8) is not included
    Y = dataset[:,8]    # column 8 is the class label

    # standardize the inputs
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # create model
    model = Sequential()
    # model.add(Dense(1000, input_dim=8, init='uniform', activation='relu')) # 1000 neurons
    # model.add(Dense(100, init='uniform', activation='tanh')) # 100 neurons with tanh activation function
    # 500 neurons; input_dim=8 is stated here because the first active layer must declare the input size
    model.add(Dense(500, input_dim=8, init='uniform', activation='relu'))
    # 95.41% accuracy with 500 neurons
    # 86.99% accuracy with 100 neurons
    # 85.2% accuracy with 50 neurons
    # 81.38% accuracy with 10 neurons
    model.add(Dense(1, init='uniform', activation='sigmoid')) # 1 output neuron

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    model.fit(X, Y, nb_epoch=150, batch_size=10, verbose=2) # 150 epochs, batch size 10, verbose = 2

    # evaluate the model
    scores = model.evaluate(X
Continue reading >>
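The StandardScaler step in the script above rescales each input column to zero mean and unit variance. A minimal sketch of that transformation in plain Python follows; the function name and the sample glucose values are illustrative assumptions, not part of the original script, and a constant column (zero standard deviation) is not handled.

```python
import math

def standardize(column):
    """Rescale values to zero mean and unit variance (population standard deviation)."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]

# Illustrative glucose readings, assumed for this example only
glucose = [148.0, 85.0, 183.0, 89.0, 137.0]
scaled = standardize(glucose)
```

After the call, the values in scaled average to zero and have unit variance, which keeps columns with large raw ranges (such as insulin) from dominating the network's first layer.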

Validatingmodelsinr/pima-indians-diabetes.names.txt At Master Winvector/validatingmodelsinr Github

1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE.

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were both 76% on the remaining 192 instances.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

7. For Each Attribute: (all numeric-valued)
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
6. Body mass index (weight in kg/(height in m)^2)

9. Class Distribution: (class value 1 is interpreted as "tested positive for
Continue reading >>
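The cutoff step described in the results can be sketched as follows. The scores and labels are made-up illustrations, the function names are hypothetical, and treating a score exactly equal to 0.448 as positive is an assumption; the excerpt does not say how ties were handled.

```python
def classify(scores, cutoff=0.448):
    """Turn real-valued predictions in [0, 1] into binary decisions."""
    return [1 if s >= cutoff else 0 for s in scores]

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up predictions and true classes, for illustration only
scores = [0.91, 0.30, 0.52, 0.10, 0.47, 0.40]
labels = [1, 0, 1, 0, 0, 1]
preds = classify(scores)
sens, spec = sensitivity_specificity(labels, preds)
```

Sensitivity is the share of true positives that are detected, and specificity is the share of true negatives that are correctly cleared; moving the cutoff trades one against the other.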

Accuracy Improvement For Diabetes Disease Classification: A Case On A Public Medical Dataset - Sciencedirect

Volume 9, Issue 3, September 2017, Pages 345-357

Mehrbakhsh Nilashi

Open Access funded by Fuzzy Information and Engineering Branch of The Operations Research Society in China

As a chronic disease, diabetes mellitus has emerged as a worldwide epidemic. Providing diagnostic aid for diabetes by using a set of data that contains only medical information obtained without advanced medical equipment can help many people who want to discover the disease, or the risk of disease, at an early stage. This can make a large positive impact on many people's lives. The aim of this study is to classify diabetes disease by developing an intelligence system using machine learning techniques. Our method is developed through clustering, noise removal and classification approaches. Accordingly, we use SOM, PCA and NN for clustering, noise removal and classification tasks, respectively. Experimental results on the Pima Indian Diabetes dataset show that the proposed method remarkably improves the accuracy of prediction in relation to methods developed in previous studies. The hybrid intelligent system can assist medical practitioners in healthcare practice as a decision support system. Continue reading >>

Diabetes Data Analysis In R

Data collected from diabetes patients has been widely investigated by many data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled and relatively straightforward to use for further machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients. The raw data can be sporadic and messy. Analyzing such data requires more preprocessing.

In this blog, we will explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics, which is an option to Oracle Database Enterprise Edition. Note that this data analysis is for machine learning study only. We are not medical researchers or physicians in the diabetes domain. Our knowledge of this disease so far comes from the material included with the data set.

The data is from the UCI archive. It is collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file that contains 3-4 months of glucose level measurements and insulin dosages, as well as other special events (exercise, meal consumption, etc).

First, we need to construct a data frame from the 70 separate files. This can be readily accomplished in R as follows; however, if the data were provided as several database tables, the function rbind overloaded by ORE to work on ore.frame objects could be used to union these tables.

    dd.list <- list(0)
    for(i in 1:70) {
      fileName <- sprintf("data-%02d", i)
      dd <- read.csv(fileName, header=FALSE, sep='\t')
      datetime.vec <- paste(dd$V1, dd$V2)
      dd$datetime <- as.POSIXct(strptime(datet
Continue reading >>
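For readers more comfortable with Python, the two ideas in the R loop above (zero-padded file names and joining a date column with a time column into one timestamp) can be sketched as below. The timestamp format string and the sample date are assumptions made for illustration; the excerpt does not show how the files encode dates.

```python
from datetime import datetime

# One file per patient: data-01 through data-70, as in sprintf("data-%02d", i)
file_names = ["data-%02d" % i for i in range(1, 71)]

def parse_timestamp(date_text, time_text, fmt="%m-%d-%Y %H:%M"):
    """Join separate date and time fields and parse them into one datetime.

    The default format is an assumption; adjust it to match the actual files.
    """
    return datetime.strptime(f"{date_text} {time_text}", fmt)

# Hypothetical record, for illustration only
stamp = parse_timestamp("04-21-1991", "09:09")
```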

Pima Indians Diabetes Database

Diabetes is a terrible disease, and predicting its onset in advance would be extremely valuable. This is the goal of the Pima Indians Diabetes Database. A population at high risk, the Pima Indians, was monitored and a variety of medical data recorded for 768 women, including blood glucose levels, insulin, the body mass index, and others. This is available on Kaggle as the Pima Indians Diabetes Database, originally from this publication. Let's have a look:

This is the body mass index (BMI) as a function of age. The BMI is the ratio of weight to height squared. Some people have a BMI of 0: the weight was not measured and the missing record was encoded as a 0. This is common in other variables too: in the Pima Indians Diabetes Database zeroes represent missing values, except for the variable Pregnancies. From now on I will not display such values. In this plot a green point represents a person who was later found to have diabetes, and a yellow point is a person later found healthy. We see that BMI and age don't correlate much and that green dots occur in higher proportion at higher BMIs (for heavier people). Obesity is a risk factor for diabetes, though the connection doesn't seem by any means one-to-one, at least in this plot.

What about other variables, such as insulin and blood glucose, as measured two hours after the administration of a carbohydrate solution? This is a standard test run by the scientists who compiled the Pima Indians Diabetes Database, and blood glucose above 200 would mean that the subject is already diabetic, by definition, at the time of the test. So clearly we have no data points with a Glucose variable over 200. It seems also that people with higher Glucose and Insulin are more prone to becoming diabetic later on. I added trend lines for the two group Continue reading >>
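As a quick worked example of the BMI definition given above (weight in kilograms divided by the square of height in metres), here is a minimal sketch; the function name and the sample weight and height are illustrative assumptions.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Illustrative person: 70 kg, 1.75 m
value = bmi(70.0, 1.75)
```

Because a missing weight is stored as 0 in this data set, a recorded BMI of 0 signals a missing measurement and not a real value.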

Pima Indians Diabetes Database

In this post we try to analyse a dataset that was acquired by the National Institute of Diabetes and Digestive and Kidney Diseases. This data set consists of records of 768 women, aged at least 21 years, who might or might not have diabetes. This data set was acquired in the year 1990. The observations here belong to 768 women of the Pima Indian tribe of Arizona. These people live along the Gila and Salt rivers in Arizona. The data set consists of variables such as blood pressure, glucose levels, insulin levels, number of pregnancies, skin thickness, body mass index and outcome (positive/negative).

In the above plot, the factor level 1 denotes the onset of diabetes. We see that the median ages for these two categories differ. The median age of the women with diabetes in this data set is much higher. This could be attributed to the lack of physical movement as we get older. Diet can also play a huge role here.

There seems to be no apparent pattern here with respect to the skin thickness and the blood pressure of individuals. Let's take a better look at this plot. There are individuals who have a skin thickness of 0! This is not possible. Data collection errors must have occurred that have not been rectified. What would happen if we removed these points? We see a very weak correlation that conveys a weak relationship between skin thickness and blood pressure.

In the above plot, we have removed outliers from both columns of observations to better understand what we might find out. The relationship isn't linear here. The above plot is trying to tell us that the relationship might be linear. The above points have been coloured in such a way that we can demarcate the positive results from the negative ones. The points that are denoted by triangles are positive. Continue reading >>

Variants In ACAD10 Are Associated With Type 2 Diabetes, Insulin Resistance And Lipid Oxidation In Pima Indians

Phoenix Epidemiology and Clinical Research Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, 445 N. 5th Street, Suite 210, Phoenix, AZ 85004, USA

The publisher's final edited version of this article is available at Diabetologia.

A prior genome-wide association study in Pima Indians identified a variant within the ACAD10 gene that is associated with early-onset type 2 diabetes. Acyl-coenzyme A dehydrogenase 10 (ACAD10) catalyses mitochondrial fatty acid beta-oxidation, which plays a pivotal role in developing insulin resistance and type 2 diabetes. Therefore, ACAD10 was analysed as a positional and biological candidate for type 2 diabetes.

Twenty-three SNPs were genotyped in 1,500 Pima Indians to determine the linkage disequilibrium pattern across ACAD10. Association with type 2 diabetes was determined by genotyping four tag single nucleotide polymorphisms (SNPs) in a population-based sample of 3,501 full-heritage Pima Indians; two associated SNPs were further genotyped in a second population-based sample of 3,723 American Indians. Associations with quantitative traits were assessed in 415 non-diabetic full-heritage Pima individuals who had been metabolically phenotyped.

SNPs rs601663 and rs659964 were associated with type 2 diabetes in the full-heritage Pima Indian sample (p=0.04 and 0.0006, respectively), and rs659964 was further associated with type 2 diabetes in the second American Indian sample (p=0.04). Combination of these two samples provided the strongest evidence for association (p=0.009 and 0.00007, for rs601663 and rs659964, respectively) Continue reading >>

R-exercises Data Science For Doctors Part 1 : Data Display

    data <- read.table(url, fileEncoding="UTF-8", sep=",")
    names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Create a frequency table of the class variable.

    data$class.fac <- factor(data[['class']], levels=c(0,1), labels=c("Negative","Positive"))

Create a pie chart of the class.fac variable.
Create a strip chart for mass against class.fac.
Create a density plot for the preg variable.
Create a histogram for the preg variable.
Create a boxplot for age against class.fac.
Create a normal QQ plot and a line which passes through the first and third quartiles.
Create a scatter plot for the age variable against the mass variable.
Create scatter plots for every variable of the data set against every other variable of the data set in a single window. Hint: it is quite simple, don't overthink it.
Continue reading >>
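The frequency-table exercise and the factor relabelling shown above can be mirrored in Python roughly as follows. The class values are a made-up sample, and the mapping and variable names are hypothetical stand-ins for the R factor.

```python
from collections import Counter

# Same labelling as factor(..., levels=c(0,1), labels=c("Negative","Positive"))
LABELS = {0: "Negative", 1: "Positive"}

# Made-up class column, for illustration only
classes = [1, 0, 1, 0, 1, 0, 0]

class_fac = [LABELS[c] for c in classes]
freq = Counter(class_fac)
```

Here freq behaves like the output of R's table(): it maps each label to the number of rows that carry it.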

Stat 590 Initial Data Analysis Using R

This is a critical step that should always be performed.

You should understand the background of a dataset and what each variable in the dataset represents.
Calculate some descriptive statistics, such as means, standard deviation, maximum and minimum, correlation, and whatever else is appropriate.
Draw graphical summaries, such as histograms, box plots, density plots, scatter plots, and many more.
Check whether the data are distributed according to prior expectations and whether some assumptions in the models that will be used in further data analyses are violated.

Here is a data set from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases on 768 adult female Pima Indians living near Phoenix. We start by reading the data into R.

    > pima <- read.table("pima.txt", header=T) # read the data into R

Take a close look at the minimum and maximum values of each variable. What have you found? It is weird that blood pressure equals zero (also check variables glucose, triceps, insulin, bmi). Let's check the sorted values to find out how many 0's there are in the variable blood.

    > sort(pima$blood) # sort the values of this variable from small to large
      [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     [19]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  24
     [37]  30  30  38  40  44  44  44  44  46  46  48  48  48  48  48  50  50  50
    [739]  94  94  94  94  94  94  95  96  96  96  96  98  98  98 100 100 100 102
    [757] 104 104 106 106 106 108 108 110 110 110 114 122

It seems likely that the zero has been used as a missing value code. In a real investigation, one would likely be able to question what really happened and, if values are missing, whether there is a systematic missingness mechanism. R uses "NA" as the missing value code. Let's set all zero values of the variables to NA.

    > pima$blood[pima$blood == 0] <- NA # se
Continue reading >>
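The same zero-as-missing cleanup can be sketched in Python, using None in the role of R's NA. The blood pressure values and helper name below are illustrative assumptions, not rows from pima.txt.

```python
def zeros_to_missing(values):
    """Replace the 0 placeholder with None, the stand-in here for R's NA."""
    return [None if v == 0 else v for v in values]

# Made-up diastolic blood pressure readings, for illustration only
blood = [72, 0, 66, 0, 40, 74]

n_zero = blood.count(0)                       # how many zero codes there are
cleaned = zeros_to_missing(blood)
observed = [v for v in cleaned if v is not None]
mean_pressure = sum(observed) / len(observed)  # mean over observed values only
```

Leaving the zeros in place would drag the mean down (here from 63.0 to 42.0), which is why the recoding should happen before any summary statistics are computed.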

Leave The Pima Indians Alone!

(This article was first published on Xi'an's Og R, and kindly contributed to R-bloggers)

"our findings shall lead us to be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if one considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive" (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had threatened to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time, which plays against most MCMC samplers.

As a first entry, Nicolas and James point out that the MAP can be derived by a standard Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy, as it seems most datasets allow for Newton-Raphson convergence, as well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we Continue reading >>
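To make the Newton-Raphson remark concrete, here is a minimal sketch of MAP estimation for a logistic regression with one covariate, an intercept, and an independent Gaussian prior on both coefficients. The data, the prior standard deviation of 10, and the function names are assumptions chosen for illustration; this is not the code used by Chopin and Ridgway.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(xs, ys, b0, b1, tau):
    """Gradient of the log posterior; tau is the prior precision 1/sd^2."""
    g0, g1 = -tau * b0, -tau * b1
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        g0 += y - p
        g1 += (y - p) * x
    return g0, g1

def map_logistic(xs, ys, prior_sd=10.0, iterations=50):
    """Newton-Raphson for the posterior mode (MAP) of (intercept, slope)."""
    tau = 1.0 / prior_sd ** 2
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = gradient(xs, ys, b0, b1, tau)
        # Negative Hessian of the log posterior: a symmetric 2x2 matrix
        a00, a01, a11 = tau, 0.0, tau
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            a00 += w
            a01 += w * x
            a11 += w * x * x
        det = a00 * a11 - a01 * a01
        # Newton step: solve (negative Hessian) * step = gradient
        step0 = (a11 * g0 - a01 * g1) / det
        step1 = (a00 * g1 - a01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, b1

# Made-up, non-separable data: one covariate and a binary outcome
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = map_logistic(xs, ys)
```

The negative Hessian computed inside the loop is the by-product mentioned in the post: evaluated at the mode, its inverse gives the covariance of a Gaussian (Laplace) approximation to the posterior.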

Analysis Of Diabetes Data Set Of Pima Indians Using Neural Network And Nn Ensemble

The data set can be downloaded from the UCI Machine Learning Repository. This data set consists of female patients (Pima Indians) at least 21 years of age. It has 768 instances and the following 8 attributes plus a class variable (all numeric-valued):

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

This data set contains the diagnostic data to investigate whether the patient shows signs of diabetes according to World Health Organization criteria such as the 2-hour post-load plasma glucose. The graph below (obtained from Weka) shows the histograms of all the attributes.

The above histograms provide the following insights: Class 0, with 500 instances, represents patients who tested negative, and class 1, with 268 instances, represents patients who tested positive. The data set is small and imbalanced, with almost 65 percent of patients testing negative. This could act as a limitation in the study. The attributes 2-Hour serum insulin, Diabetes pedigree function, Age and Number of times pregnant are highly skewed to the right, while Plasma glucose concentration, Diastolic blood pressure and Body mass index appear to be normally distributed.

Removal of the outliers: As seen in the histogram below, there are 49 outliers (red bar), which have been removed as part of data pre-processing. Reviewing scatter plots below of all attributes did not show any relationships amongst the attributes, however, there Continue reading >>
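The class balance quoted above can be checked with a few lines of arithmetic. The majority-class baseline at the end is an added illustration, not something the article computes: it is the accuracy a model would reach by always predicting "negative".

```python
negatives, positives = 500, 268   # class counts quoted in the article

total = negatives + positives
negative_share = negatives / total
positive_share = positives / total

# Accuracy of always predicting the more frequent class (added for illustration)
baseline_accuracy = max(negatives, positives) / total
```

With 500 of 768 instances negative, the negative share is about 65.1 percent, so any reported accuracy should be read against that floor.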

Pima Data Science

We have a classification problem. Our data set has in total 8 independent variables, out of which one is a factor and 7 are continuous. This means we should have at least 8 plots. The target variable Outcome should be plotted against each independent variable if we want to derive any inferences and leave no stone unturned. So if we need to plot 2 factor variables, we should preferably use a stacked bar chart or mosaic plot. For one numeric and one factor variable, bar plots seem like a good option. And for two numeric variables we have our faithful scatter plot to the rescue.

In this blog post I will not be stressing words so much as the code and the inferences made, which are explained and documented in my code. I strongly suggest you view the code below, which has inferences and a well documented structure.

DATA-> (A file named as diabetes.csv is the one)
R Code -> (A fair warning to execute the EDA code in R you will first need to execute the and )
Python Code-> (Its a Jupyter Notebook)

The idea behind using this data set from the UCI repository is not just running models, but deriving inferences that match the real world. This makes the predictions we make all the more sensible and strong, especially when we have understood the data set and have derived correct inferences from it which match our predictions. Our approach to this data set will be to perform the following:

Exploratory data analysis while deriving inferences from it
Using techniques like PCA and checking correlation between variables
Running various models and making inferences from the predictions

We will do all of this in R, and in Python.

Lichman, M. (2013). UCI Machine Learning Repository [Irvine, CA: University of California, School of Information and Computer Sc

Let us first begin to underst Continue reading >>
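The rule of thumb above for picking a plot from the types of the two variables can be written down as a small lookup. The function name and the "factor"/"numeric" labels are hypothetical conventions for this sketch, not part of the author's code.

```python
def suggest_plot(kind_x, kind_y):
    """Suggest a plot type from two variable kinds: 'factor' or 'numeric'."""
    kinds = {kind_x, kind_y}
    if not kinds <= {"factor", "numeric"}:
        raise ValueError("each kind must be 'factor' or 'numeric'")
    if kinds == {"factor"}:
        return "stacked bar chart or mosaic plot"
    if kinds == {"numeric"}:
        return "scatter plot"
    return "bar plot"  # one numeric variable and one factor

# Outcome is a factor; Glucose and BMI are numeric
plot_for_outcome_vs_glucose = suggest_plot("factor", "numeric")
plot_for_glucose_vs_bmi = suggest_plot("numeric", "numeric")
```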

PimaIndiansDiabetes: Pima Indians Diabetes Database In mlbench: Machine Learning Benchmark Problems

and were converted to R format by Friedrich Leisch.

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.

Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.

Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), Soft Classification a.k.a. Risk Estimation via Penalized Log Likelihood and Smoothing Spline Analysis of Variance, in D. H. Wolpert (1995), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.

    data(PimaIndiansDiabetes)
    summary(PimaIndiansDiabetes)
    data(PimaIndiansDiabetes2)
    summary(PimaIndiansDiabetes2)

        pregnant         glucose         pressure         triceps
     Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
     1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
     Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
     Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
     3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
     Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00

        insulin           mass          pedigree           age         diabetes
     Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500
     1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   pos:268
     Median : 30.5   Median :32.00   Median :0.3725   Median :29.00
     Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24
     3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
     Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00

        pregnant         glucose         pressure         triceps
     Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
     1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
     Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
     Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
     3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
     Max.   :17.000   Max.   :199.0
Continue reading >>

Machine Learning Datasets In R (10 Datasets You Can Use Right Now)

You need standard datasets to practice machine learning. In this short post you will discover how you can load standard classification and regression datasets in R. This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R. It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning. Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small. This last point is critical when practicing machine learning because: Learn more about practicing machine learning using datasets from the UCI Machine Learning Repository in the post:

You can load the standard datasets into R as CSV files. There is a more convenient approach to loading the standard datasets: they have been packaged and are available in third party R libraries that you can download from the Comprehensive R Archive Network (CRAN). Which libraries should you use and what datasets are good to start with?

In this section you will discover the libraries that you can use to get access to standard machine learning datasets. You will also discover specific classification and regression that you c Continue reading >>

R: Diabetes In Pima Indian Women

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

These data frames contain the following columns:

plasma glucose concentration in an oral glucose tolerance test.
body mass index (weight in kg/(height in m)^2).
Yes or No, for diabetic according to WHO criteria.

The training set Pima.tr contains a randomly selected set of 200 subjects, and Pima.te contains the remaining 332 subjects. Pima.tr2 contains Pima.tr plus 100 subjects with missing values in the explanatory variables.

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988), ed. R. A. Greenes, pp. 261-265. Los Alamitos, CA: IEEE Computer Society Press.

Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Continue reading >>
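A random 200/332 split like the one described for Pima.tr and Pima.te can be sketched as follows. The subject identifiers, the seed, and the function name are assumptions for illustration; this does not reproduce the actual assignment used in the R data frames.

```python
import random

def split_train_test(records, n_train, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the 532 complete records
subject_ids = list(range(532))
train_ids, test_ids = split_train_test(subject_ids, 200)
```

Every subject lands in exactly one of the two parts, so a model fitted on train_ids can be scored on test_ids without reusing any training record.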
