Billy Joes Hardware Database

Introduction

Connolly and Begg (2015) defined a database as a collection of related data, and a database management system (DBMS) as the software that manages and controls access to the database. A database is also defined as a self-describing collection of integrated records.

This project is built around creating a database for Billy Joes Hardware Suppliers, which supplies a broad range of hardware and gardening equipment to trade and public customers. Connolly and Begg (2015) stated that implementing a database management system brings several advantages: it reduces redundancy by integrating files so that multiple copies of the same data are not stored; it improves data consistency by reducing the risk of inconsistencies occurring; it yields more information from the same amount of data through the integration of operational data; and it enables the sharing of data, improved data integrity, improved security, and the enforcement of standards.

SQL allows tables to be created and populated with the CREATE and SELECT statements. A result table may contain duplicate rows, SQL imposes an ordering on its columns, and it allows the user to order the rows of the result table.

An SQL statement consists of reserved words and user-defined words. Reserved words must be spelled exactly as required, while user-defined words are chosen by the user but must follow the SQL naming requirements. The letters in an SQL statement may be typed in upper or lower case.

In this project, a database named mina was created. DDL statements were used to create six tables: the Products, Order Details, Employees, Customers, Orders, and Suppliers tables. The authors used DML statements to populate the tables with values and to run queries.

The authors used Data Manipulation Language to execute the SELECT statement (to query data in the database), the INSERT statement (to insert data into a table), the UPDATE statement (to update data in a table), and the DELETE statement (to delete data from a table); the DROP statement, by contrast, removes a table entirely.
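A minimal sketch of these four statements against the Customers table (the column names are assumptions, since the original scripts are not reproduced here):

-- Hypothetical DML sketch; column names are assumed.
INSERT INTO Customers (CustomerID, FirstName, LastName, Country)
VALUES (1, 'Mary', 'Byrne', 'Ireland');

SELECT FirstName, LastName FROM Customers WHERE Country = 'Ireland';

UPDATE Customers SET Country = 'Ireland' WHERE CustomerID = 1;

DELETE FROM Customers WHERE CustomerID = 1;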

The tables are linked by primary keys and foreign keys. In the Order Details table, OrderID is an INT foreign key referencing Orders(OrderID), and ProductID is an INT foreign key referencing Products(ProductID). The Products table has SupplierID as a foreign key referencing Suppliers(SupplierID); the Orders table has CustomerID as an INT foreign key referencing Customers(CustomerID), and EmployeeID as an INT foreign key referencing Employees(EmployeeID).
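A minimal sketch of how two of these tables might be declared in SQL Server syntax (column names and data types beyond the keys are assumptions, since the original DDL is not reproduced here):

-- Hypothetical DDL sketch; non-key columns are assumed.
CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT FOREIGN KEY REFERENCES Customers(CustomerID),
    EmployeeID INT FOREIGN KEY REFERENCES Employees(EmployeeID),
    OrderDate  DATE
);

CREATE TABLE OrderDetails (
    OrderID   INT FOREIGN KEY REFERENCES Orders(OrderID),
    ProductID INT FOREIGN KEY REFERENCES Products(ProductID),
    Quantity  INT,
    PRIMARY KEY (OrderID, ProductID)  -- composite key: one line per product per order
);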

This database includes CREATE VIEW statements: a product list; a customer list restricted with a WHERE clause to Country = 'Ireland'; a current order list; and an employee list. WHERE clauses were used to retrieve all customers who live in Ireland and to filter by employee joining date, order date, average product price, product expiry date, and supplier name.
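A hedged sketch of the Ireland customer view (the view and column names are assumptions):

CREATE VIEW CustomerListIreland AS
SELECT CustomerID, FirstName, LastName, Country
FROM Customers
WHERE Country = 'Ireland';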

Queries also join tables: the Customers table is INNER JOINed with the Orders table on CustomerID, returning the First Name, Last Name, Order ID, and Paid fields. A LEFT JOIN between Customers and Orders includes not only the matched rows (first name, last name) but also the rows of the first (left) table that are unmatched with rows from the second (right) table; the columns from the second table are filled with NULLs. A RIGHT JOIN includes not only the rows from the Customers table with matching first name and last name, but also the rows of the second (right) Orders table (the Order ID and Paid fields), whether matched or unmatched with rows from the first (left) table.
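A hedged sketch of the LEFT JOIN described above (table and column names are assumptions based on the description):

-- returns every customer, with NULL OrderID/Paid for customers with no orders
SELECT C.FirstName, C.LastName, O.OrderID, O.Paid
FROM Customers C
LEFT JOIN Orders O ON O.CustomerID = C.CustomerID;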

Connolly and Begg (2015) suggested that queries involve the SELECT statement, whose purpose is to retrieve and display data from one or more database tables; it is the most frequently used SQL command.

Furthermore, the authors created stored procedures: one on the Customers table to retrieve customer details; a procedure to add a supplier; and an ALTER PROCEDURE on the Customers table's uspGetAddress procedure.
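A minimal sketch of the first of these (the procedure name and columns are hypothetical):

-- Hypothetical stored procedure returning one customer's details.
CREATE PROCEDURE uspGetCustomerDetails
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CustomerID, FirstName, LastName, Country
    FROM Customers
    WHERE CustomerID = @CustomerID;
END;

It would be invoked with, for example, EXEC uspGetCustomerDetails @CustomerID = 1;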

The database includes three aggregate functions: a) SELECT COUNT returns the number of values in a specified column (used on the Customers table); b) SELECT MAX returns the largest value in a specified column; c) SELECT SUM returns the sum of the values in a specified column.
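For illustration, the three aggregates might look like this (table and column names are assumptions):

SELECT COUNT(CustomerID) AS NumberOfCustomers FROM Customers;
SELECT MAX(UnitPrice)    AS HighestPrice      FROM Products;
SELECT SUM(TotalAmount)  AS TotalOrderValue   FROM Orders;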

In addition, the authors wrote further queries, such as SELECT * FROM Customers C, Orders O WHERE O.CustomerID = C.CustomerID; (an implicit join between the Customers and Orders tables).

The authors also created a function, CustomerPurchase, that INNER JOINs the Customers table with the Orders table on CustomerID and returns the first name, last name, and order date; it executed successfully.

The authors also created two user-defined functions (UDFs). The first, which adds three values, executed successfully; the second, computing the time between the order date and the delivery date, failed with the error: Msg 137, Level 15, State 2, Procedure udfDeliveryTime, Line 5: Must declare the scalar variable "@TimeDifference".
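The error indicates that the variable was used before being declared; a corrected sketch of such a function might look like this (the parameter names and the use of days as the unit are assumptions):

CREATE FUNCTION udfDeliveryTime (@OrderDate DATE, @DeliveryDate DATE)
RETURNS INT
AS
BEGIN
    DECLARE @TimeDifference INT;  -- the declaration the error message asks for
    SET @TimeDifference = DATEDIFF(day, @OrderDate, @DeliveryDate);
    RETURN @TimeDifference;
END;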

This project provides two triggers, for inserting and for updating values. The authors also created two users for the database, named Damien and Mina.
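A minimal sketch of what the insert trigger could look like (the trigger name and the OrderAudit logging table are hypothetical):

CREATE TRIGGER trgOrderInsert ON Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- record each newly inserted order in a hypothetical audit table
    INSERT INTO OrderAudit (OrderID, ChangedAt)
    SELECT OrderID, GETDATE() FROM inserted;
END;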

The authors summarised the design in a database diagram, which describes the relationships (the mapping) of the project:

  1. The Orders table links to the Employees table in a many-to-one relationship: one employee can handle many orders (a single large order handled by many employees would instead require a many-to-many design).
  2. The Orders table connects to the Customers table, again many-to-one: one customer can place many orders.
  3. The Order Details table links to the Orders table, also many-to-one: one order can contain many detail lines.
  4. The Suppliers table connects to the Products table in a one-to-many relationship: one supplier provides many products.
  5. The Order Details table links to the Products table in a many-to-one relationship: one product can appear in many order lines.

Database Relationship Diagram (diagram not reproduced here)

CONCLUSION

Overall, the authors' contribution in creating a new database for Billy Joes Hardware Suppliers is beneficial for the company: it eliminates the manual paper-based system, reduces input errors, and increases efficiency and transparency.


REFERENCES

Connolly, T. and Begg, C. (2015) Database Systems: A Practical Approach to Design, Implementation, and Management. 6th ed. Edinburgh Gate, England: Pearson Education Limited.

Kettle, D. (2015) Notes: SQL Statements, Creating Tables, Views/Queries, Stored Procedures, Functions (B8IT101). Dublin: Dublin Business School, 28 September 2015.

 

 

Acknowledgements

 

 

Firstly, I thank God for all the strength and wisdom given to me. I would also like to thank my lecturer, Mr. Darren Redmond, for the great benefit of his encyclopaedic knowledge, his unfailing helpfulness, and his patience.

 

Finally, thanks to all my colleagues for sharing both the happy and the hard times throughout the year; I would like to take this opportunity to wish you all the very best for the future.

AIRBNB - Worldwide Accommodations

Abstract

This project is based on an analysis of the Airbnb Worldwide Accommodations data set. Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world online, from a mobile phone or tablet. The dataset, named "Train_users_2", was obtained from the Kaggle.com website. The study encompasses several methodologies: cleaning the data, removing unknown values, and visualising the data using names(), head(), and plotting functions.

The author performed a two-way ANOVA; this model extends the ideas of the one-way ANOVA model and adds an interaction term to examine how various combinations of the factors affect the response.

The critical analysis of this project comprises two main parts: the significance of the p-values, and the correlation and interaction among age, gender, and country of destination.

 

Introduction

 

This project analyses the Airbnb dataset, which consists of 213,451 observations and 16 variables. Customers on Airbnb can book listings in 35,000 cities across 190 countries. Radianto (2015) stated that Airbnb's main objective is to protect the users and the company from various types of risk and to reduce the number of fraudulent actions. Airbnb focuses on creating a friendly, trusting environment, so that users know their safety is the primary concern.

The purpose of this study is to explain the relationship and interaction between the dependent variable, which here is country of destination, and the explanatory variables, age and gender.

The author selected the two-way ANOVA for the Airbnb dataset because it combines discrete factors (gender, country of destination) with a continuous variable (age).

 

Data Analysis Method

This section sets out the analytical framework that provides answers to the aims and objectives laid out in this project. The author read the dataset into the R software and cleaned out the 'NDF' values in the country_destination column with the following command:

data2 <- subset(data, country_destination != 'NDF')   # leaves 88,908 observations

The author then removed the empty entries and the significant number of missing values in the age column, establishing an assumed valid age range of 18 to 80:

data2$age[data2$age < 18] <- NA
data2$age[data2$age > 80] <- NA
data2 <- na.omit(data2)

After performing the above commands, the dataset was reduced to 67,059 observations.

Unknown values were then removed from the gender column:

data2 <- subset(data2, gender != '-unknown-')
data2 <- na.omit(data2)

This reduced the dataset to 55,545 observations.

Unknown values were also removed from the first_browser column:

data2 <- subset(data2, first_browser != '-unknown-')
data2 <- na.omit(data2)

This left 51,420 observations.

After cleaning, the author ran the summary() method on the dataset, named data2. This gives a set of useful overview statistics, namely the minimum, maximum, mean, and median. This study focuses on age, gender, and country of destination.

OUTPUT (see the full summary below)

AGE: the minimum age is 18, the maximum is 80, and the mean is 36.

GENDER: female 27,537; male 23,738; other 145.

COUNTRY DESTINATION: most users book the US (36,243), followed by FR (2,864), IT (1,544), GB (1,366), ES (1,327), and the remaining destinations (2,433).

summary(data2)

         id          date_account_created
 000wc9mlv3:    1    20/05/2014:  145
 001357912w:    1    24/06/2014:  142
 001xf4efvm:    1    13/05/2014:  140
 001y3jr7xc:    1    28/05/2014:  135
 002qnbzfs5:    1    11/06/2014:  130
 003vss7z6h:    1    16/06/2014:  130
 (Other)   :51414   (Other)   :50598

 timestamp_first_active  date_first_booking      gender
 Min.   :2.009e+13   05/06/2014:  132   -unknown-:    0
 1st Qu.:2.012e+13   14/05/2014:  132   FEMALE   :27537
 Median :2.013e+13   11/06/2014:  130   MALE     :23738
 Mean   :2.013e+13   22/05/2014:  130   OTHER    :  145
 3rd Qu.:2.014e+13   25/06/2014:  126
 Max.   :2.014e+13   29/04/2014:  124
                     (Other)   :50646

      age         signup_method     signup_flow
 Min.   :18.00   basic   :30592   Min.   : 0.000
 1st Qu.:28.00   facebook:20824   1st Qu.: 0.000
 Median :33.00   google  :    4   Median : 0.000
 Mean   :36.08                    Mean   : 1.254
 3rd Qu.:41.00                    3rd Qu.: 0.000
 Max.   :80.00                    Max.   :25.000

   language     affiliate_channel     affiliate_provider
 en     :50040   direct       :34104   direct    :34044
 zh     :  303   sem-brand    : 6783   google    :12508
 fr     :  253   sem-non-brand: 4124   other     : 2345
 es     :  173   seo          : 2481   craigslist:  865
 de     :  159   other        : 2178   facebook  :  565
 ko     :  118   api          : 1187   bing      :  468
 (Other):  374   (Other)      :  563   (Other)   :  625

 first_affiliate_tracked    signup_app
 untracked    :26433      Android:  120
 linked       :12892      iOS    : 1376
 omg          :10184      Moweb  :  979
 tracked-other: 1545      Web    :48945
 product      :  306
 marketing    :   53
 (Other)      :    7

 first_device_type        first_browser
 Mac Desktop    :27804   Chrome       :19898
 Windows Desktop:18235   Safari       :12495
 iPad           : 3089   Firefox      : 9779
 iPhone         : 1436   Mobile Safari: 4386
 Desktop (Other):  338   IE           : 4239
 Android Phone  :  251   Chrome Mobile:  232
 (Other)        :  267   (Other)      :  391

 country_destination
 US     :36243
 other  : 5643
 FR     : 2864
 IT     : 1544
 GB     : 1366
 ES     : 1327
 (Other): 2433

The author performed a two-way analysis of variance (ANOVA) to examine the interaction among combinations of age, gender, and country of destination.

The sums-of-squares equation for the two-way ANOVA model is:

SS-total = SS-A + SS-B + SS-AB + SS-within

Here SS-total is the total variability in the y-values; SS-A is the sum of squares due to factor A (representing the variability in the y-values explained by factor A), and similarly for SS-B and factor B. SS-AB is the sum of squares due to the interaction of factors A and B, and SS-within (also called SS-error) is the amount of variability left unexplained, and deemed error.

The interaction effect is the heart of the two-way ANOVA model: the two factors may act together in a different way than they would separately, and this must be taken into account.

The ANOVA was performed with country of destination as the dependent variable and age and gender as the independent variables (see the output below):

anova1 <- aov(as.numeric(country_destination) ~ age + gender, data = data2)
summary(anova1)

               Df Sum Sq Mean Sq F value   Pr(>F)
age             1    329   328.9  42.074 8.87e-11 ***
gender          2     64    32.0   4.093   0.0167 *
Residuals   51416 401919     7.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for age, 8.87e-11 (three stars), is highly significant compared to that of gender, 0.0167 (one star).

The author then performed anova2 across the variables. Most effects came out significant; signup_method and first_browser were not significant, while gender and first_affiliate_tracked (one star each) and affiliate_provider (two stars) showed weaker effects and could be considered for removal.

summary(anova2)

                            Df Sum Sq Mean Sq F value   Pr(>F)
age                          1    329   328.9  42.329 7.78e-11 ***
gender                       2     64    32.0   4.118  0.01629 *
signup_method                2     16     8.0   1.027  0.35814
signup_flow                  1    190   190.1  24.468 7.58e-07 ***
language                    22    523    23.8   3.058 1.78e-06 ***
affiliate_channel            7   1270   181.4  23.342  < 2e-16 ***
affiliate_provider          15    289    19.2   2.476  0.00121 **
first_affiliate_tracked      6    105    17.5   2.252  0.03563 *
signup_app                   3    312   104.1  13.399 9.76e-09 ***
first_device_type            8    222    27.7   3.570  0.00038 ***
first_browser               33    254     7.7   0.991  0.48254
Residuals                51319 398739     7.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions

Interaction occurs when two factors act together on the response in a way that differs from how each factor affects the response separately.

There are interaction effects among age, gender, and country of destination (please see the attached graph).

Interaction plot (figure not reproduced here)

In the interaction plot, the first argument defines the horizontal axis (age), the second defines the groups plotted (gender), and the third identifies the numerical values plotted (country of destination).
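A hedged sketch of how this plot could be produced with base R's interaction.plot() (the author's exact call is not shown, so the argument choices here are assumptions):

# age on the horizontal axis, one trace per gender,
# mean numeric country-destination code as the response
interaction.plot(x.factor     = data2$age,
                 trace.factor = data2$gender,
                 response     = as.numeric(data2$country_destination),
                 fun = mean,
                 xlab = "Age", trace.label = "Gender",
                 ylab = "Mean country_destination (numeric code)")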

Results Analysis

The p-values associated with the F-statistics show large differences, giving strong evidence that the mean responses for the three variables (age, gender, and country of destination) are not the same.

The hypothesis tests state that if the interaction is not statistically significant (p-value > 0.05), then we conclude that the main effects, if present, are independent of one another.

This indicates that the ANOVA model explains (by the author's estimate) above 70% of the variability of the response data around its mean, with mean squares of 328.9 for age and 32.0 for gender; the residual mean square of 7.8 suggests some bias in the model.

There are twelve destination countries, and about 70% of users book the US. The author assumes that the age data show some differentiation by country of destination: users who book trips to Spain and Portugal tend to be younger, while those who book trips to Great Britain tend to be older.

The dataset contains many categorical variables, and few outliers were found. The plot shows small spikes in bookings around the summer holidays.

PLOTS

Airbnb box plots and plot designs (figures not reproduced here)

The degree of skewness of the distribution shows that the dataset is not normally distributed; the author could therefore perform non-parametric tests.

Conclusion

Overall, in the data analytics of this project, the largest part of the work was taken up by the cleaning and manipulation of the data.

The cleaning meant that three quarters of the data (NDF, unknown, and missing values) was not usable for the modelling process. This took a considerable amount of time to process, and it is suggested that Airbnb should improve their data collection process, as mentioned previously.

Another issue with the project was the skewed distribution of the countries. The unbalanced nature of the data, with 70% falling within the category 'US', made it very difficult for the remaining categories to be correctly predicted.

I think the model could be improved if the data were normally distributed; the residual plot shows some bias in the model.

Further tests are advisable; the author could perform non-parametric tests.


References

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.

https://www.kaggle.com/kevinwu06/airbnb-recruiting-new-user-bookings/airbnb-exploratory-analysis/notebook

http://www.upa.pdx.edu/IOA/newsom/da1/ho_levels.pdf

https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_introduction_to_R.pdf

http://www.csse.monash.edu.au/~smarkham/resources/anova.htm

 

 

Prudential Life Insurance

Introduction

This report contains the findings of a critical analysis of Prudential Life Insurance, one of the largest life insurers in the USA, in light of the requirement to find a dataset and create a K-means clustering analysis and K-nearest neighbour predictions on roughly 200 observations. The data were collected from the website www.kaggle.com and comprise 249 observations and 7 variables.

K-means Clustering Background

Rob Kabacoff (2013) stated that K-means clustering is the most common partitioning method; the analysis in R uses an algorithm by Hartigan and Wong (1979) that partitions the observations into k groups such that the sum of squares of the observations from their assigned cluster centres is a minimum.

The within-cluster sum of squares is

ss(k) = Σ_i Σ_j (x_ij - x̄_kj)²

where the sum over i runs over the observations assigned to cluster k, x_ij is the value of the jth variable for the ith observation, and x̄_kj is the mean of the jth variable for the kth cluster.

K-means clustering can handle larger datasets than hierarchical cluster approaches. Observations are not permanently committed to a cluster; they are moved when doing so improves the overall solution. However, the use of means implies that all variables must be continuous, and the approach can be severely affected by outliers.

The format of the K-means function in R is kmeans(x, centers), where x is a numeric dataset (matrix or data frame) and centers is the number of clusters to extract. The function returns the cluster memberships, centroids, sums of squares (within, between, total), and cluster sizes.

  • kmeans_model <- kmeans(x=X, centers=m)

In order to perform the requested task, the author had to install the packages below.

install.packages('aod')
install.packages('ggplot2')
install.packages('Rcpp')
install.packages('rattle')
install.packages('NbClust')
library(NbClust)
library(aod)
library(ggplot2)
library(Rcpp)

Afterwards, the author performed the following tasks:

1- Set up the directory and read Insurance_Life into R:

setwd("C:/Users/guilhermina/Desktop")
Insurance_Life <- read.csv("Insurance_Life.csv", header=TRUE)

2- The head() function shows the first rows and all 7 variable column names:

head(Insurance_Life)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

3- Summarise the data to obtain statistical information, although this does not show meaningful information for character values:

summary(Insurance_Life)
      age            sex           bmi           children
 Min.   :18.0   female:130   Min.   :15.96   Min.   :0.000
 1st Qu.:26.0   male  :119   1st Qu.:26.60   1st Qu.:0.000
 Median :37.0                Median :30.11   Median :1.000
 Mean   :38.3                Mean   :30.56   Mean   :1.048
 3rd Qu.:52.0                3rd Qu.:34.80   3rd Qu.:2.000
 Max.   :64.0                Max.   :49.06   Max.   :5.000
 smoker        region       charges
 no :196   northeast:53   Min.   : 1137
 yes: 53   northwest:65   1st Qu.: 4449
           southeast:70   Median : 8517
           southwest:61   Mean   :12855
                          3rd Qu.:16885
                          Max.   :51195

4- Standardize the dataset, which contains numerical and categorical variables:

df <- Insurance_Life

5- Standardize the numerical variables (columns 1, 3, 4, 7):

df[1] <- scale(df[1])

df[3] <- scale(df[3])

df[4] <- scale(df[4])

df[7] <- scale(df[7])

6- Convert the categorical variables to numeric by column name:

df <- transform(df, sex = as.numeric(sex))

df <- transform(df, smoker = as.numeric(smoker))

df <- transform(df, region = as.numeric(region))

7- The new data frame has now been standardized to numerical fields:

head(df)
         age sex        bmi    children smoker region
1 -1.3395782   1 -0.4705909 -0.87312526      2      4
2 -1.4089965   2  0.5677994 -0.04014369      1      3
3 -0.7148134   2  0.4315881  1.62581944      1      3
4 -0.3677219   2 -1.3895751 -0.87312526      1      2
5 -0.4371402   2 -0.2972310 -0.87312526      1      2
6 -0.5065585   1 -0.8526902 -0.87312526      1      3
     charges
1  0.3376554
2 -0.9325188
3 -0.7042877
4  0.7649364
5 -0.7531032

The author then used the wssplot(df) function and NbClust() to check the clusters. Since the variables vary in range, they were standardized prior to clustering. Next, the number of clusters was determined using the wssplot() and NbClust() functions (please see the enclosed plot); this indicates a distinct drop in the within-groups sum of squares when moving from 1 to 3 clusters.
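Note that wssplot() is not part of base R; the author presumably used the helper function from Kabacoff's R in Action, along these lines:

# plot within-groups sum of squares against the number of clusters
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1 case
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot(df)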

Plots

Cluster plots and number-of-clusters plots (figures not reproduced here)

8- According to the majority rule, the best number of clusters is 2.

The set.seed() function guarantees that the results are reproducible. The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one; for example, adding nstart=25 will generate 25 initial configurations. This approach is often recommended.

set.seed(1234567)
fit.km <- kmeans(df, 3, nstart=25)
fit.km$size
[1] 103  41 105

9- fit.km$centers gives the centre point of each variable per cluster:

fit.km$centers
          age      sex        bmi     children   smoker   region    charges
1 -0.10554983 1.466019 -0.2686783  0.024553907 1.097087 1.563107 -0.3529797
2  0.38064135 1.585366  0.3773754 -0.080776937 1.829268 2.707317  1.9411393
3 -0.04509203 1.447619  0.1162045  0.007455257 1.085714 3.476190 -0.4117124

10- Show the mean of each variable by cluster:

aggregate(df[-1], by=list(cluster=fit.km$cluster), mean)
  cluster      sex        bmi     children   smoker   region    charges
1       1 1.466019 -0.2686783  0.024553907 1.097087 1.563107 -0.3529797
2       2 1.585366  0.3773754 -0.080776937 1.829268 2.707317  1.9411393
3       3 1.447619  0.1162045  0.007455257 1.085714 3.476190 -0.4117124

 K-Nearest Neighbor Background

Archana and Elangovan (2014) stated that K-Nearest Neighbour (K-NN) identifies the category of an unknown data point on the basis of its nearest neighbours, whose classes are already known. The rule is widely used in pattern recognition, text categorization, ranking models, object recognition, and event recognition applications. T. M. Cover and P. E. Hart proposed k-NN, in which the nearest neighbours are calculated on the basis of a value k that specifies how many nearest neighbours are to be considered when defining the class of a sample data point. Because it uses more than one nearest neighbour to determine the class to which a given data point belongs, it is called K-NN. The k-NN algorithm is the simplest of all machine learning algorithms.

Other authors declared that one of the simplest and best-known non-parametric methods is K-Nearest Neighbours regression (James, Witten, Hastie and Tibshirani, 2001).

In addition, DataCamp confirms that KNN, the K-nearest neighbours algorithm, is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labelled instances.

The general form of the model call is:

knn_model <- knn(train=X_train, test=X_test, cl=as.factor(labels), k=K)

The author installed the package below to allow the KNN model to be performed:

install.packages('class')
library(class)

# to normalize the dataset
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}
df_norm <- as.data.frame(lapply(df[1:6], normalize))

The summary information illustrates the effect of normalization on the dataset:

summary(df_norm)
      age              sex              bmi
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1739   1st Qu.:0.0000   1st Qu.:0.3215
 Median :0.4130   Median :0.0000   Median :0.4276
 Mean   :0.4412   Mean   :0.4779   Mean   :0.4411
 3rd Qu.:0.7391   3rd Qu.:1.0000   3rd Qu.:0.5692
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
    children          smoker           region
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.3333
 Median :0.2000   Median :0.0000   Median :0.6667
 Mean   :0.2096   Mean   :0.2129   Mean   :0.5194
 3rd Qu.:0.4000   3rd Qu.:0.0000   3rd Qu.:0.6667
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

The data were then split into training and test sets. The training set is meant to be two thirds of the original dataset: that is why a "1" is assigned with a probability of 0.67 and a "2" with a probability of 0.33 to the sample rows. A 1 or a 2 is assigned to each row independently, so every row in the dataset can receive either value:

set.seed(1234567)
ind <- sample(2, nrow(df), replace=TRUE, prob=c(0.67, 0.33))
df.training <- df[ind==1, 1:4]
df.test <- df[ind==2, 1:4]

# add labels: create the class label vectors
# (note: column 6 (region) is used for the training labels while
# column 7 (charges) is used for the test labels; this looks inconsistent)
df.trainLabels <- df[ind==1, 6]
df.testLabels <- df[ind==2, 7]

The actual KNN model:

# predict classes for the test set
df_pred <- knn(train = df.training, test = df.test, cl = df.trainLabels, k=3)

After making the prediction, the author used View() and str() to inspect its composition (numeric or non-numeric) and its structure.

View(df_pred)
str(df_pred)
 Factor w/ 4 levels "1","2","3","4": 4 4 3 3 3 3 3 3 3 4 ...

# the predictions over the 4 levels / evaluate the model
df_pred
 [1] 4 4 3 3 3 3 3 3 3 4 3 3 4 2 3 4 2 3 3 3 4 2 4 2 2 1 4 2 4
[30] 3 4 3 2 4 3 2 1 2 1 3 1 3 2 4 1 4 1 3 2 3 2 2 3 2 3 3 2 4
[59] 3 3 3 2 4 4 3 1 2 2 2 1 2 2 4 3 2
Levels: 1 2 3 4

df.testLabels

The model makes reasonably accurate predictions; df.testLabels shows 75 rows, including some negative (standardized) charge values.

# indication of model performance, for deeper analysis
install.packages("gmodels")
library(gmodels)

# shows which observations are misclassified and the relationship among the variables
CrossTable(x = df.testLabels, y = df_pred, prop.chisq=FALSE)

Total observations in table: 75

View(df.test)

 Conclusions

Overall, cluster analysis is a broad topic, and R has some of the most comprehensive facilities currently available for applying this methodology. K-means clustering requires that the number of clusters to extract be specified in advance. For this study, the number specified for clustering was 2, based on the wssplot(df) and NbClust() functions performed.

Finally, the kmeans() function and the centroids it provides are based on standardized data. The aggregate() function is used along with the cluster memberships to determine the variable means for each cluster in the original metric.

The computational complexity and memory requirements of K-Nearest Neighbour (K-NN) remain the main concerns. To overcome the memory limitation, the size of the dataset is reduced: repeated patterns, which add no extra information, are eliminated from the training samples, and data points that do not affect the result are also eliminated from the training dataset. The K-NN training dataset can further be structured using various techniques to work around the memory limitation.

In addition, Prudential Life Insurance helps people of all ages and backgrounds grow and protect their wealth through a variety of products and services, including life insurance, annuities, retirement-related services, mutual funds, and investment management.


References

 

Archana, S. and Elangovan, K. (2014) 'Survey of Classification Techniques in Data Mining', International Journal of Computer Science and Mobile Applications, Vol. 2, Issue 2. http://www.ijcsma.com

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2001) An Introduction to Statistical Learning. New York: Springer. http://wwwbcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/

http://www.statmethods.net/about/author.html

www.kaggle.com

https://www.datacamp.com/community/tutorials/machine-learning-in-r


Time Series


The author will use the R statistical software to carry out some simple analyses that are common in analysing time series data. For this study, the author uses time series datasets that have been kindly made available by Rob Hyndman in his Time Series Data Library at http://robjhyndman.com/TSDL/

The first step in the analysis is to read the time series data into R and plot it. The data can be read into R using the scan() function, which assumes that the data for successive time points are in a simple text file with one column.

Creating a Time series

The ts() function converts a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=), where start and end are the times of the first and last observations and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).

The dataset contains data on the age of death of successive kings of England, starting with William the Conqueror (original source: Hipel and Mcleod, 1994).

kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip=3)
Read 42 items
kings
 [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53
[20] 56 16 43 69 59 48 59 86 55 68 51 33 49 67 77 81 67 71 81
[39] 68 70 77 56

After reading the time series data into R, the next step is to store the data in a time series object, so that R's many functions for analysing time series data can be used. To do this, we use the ts() function. For example, to store the data in the variable 'kings' as a time series object:

kingstimeseries <- ts(kings)
kingstimeseries
Time Series:
Start = 1
End = 42
Frequency = 1
 [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53
[20] 56 16 43 69 59 48 59 86 55 68 51 33 49 67 77 81 67 71 81
[39] 68 70 77 56

An example is a data set of the number of births per month in New York city, from January 1946 to December 1959 (originally collected by Newton).

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
Read 168 items
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))
birthstimeseries

The file below contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, for January 1987-December 1993 (original data from Wheelwright and Hyndman, 1998). We can read the data into R by typing:

souvenir <- scan("http://robjhyndman.com/tsdldata/data/fancy.dat")

souvenirtimeseries <- ts(souvenir, frequency=12, start=c(1987,1))

souvenirtimeseries

Once you have read a time series into R, the next step is usually to make a plot of the time series data, which you can do with the plot.ts() function in R.

# shows the lifetime of each king
plot.ts(kingstimeseries)

# number of births per month in New York
plot.ts(birthstimeseries)

# souvenir shop in Queensland
plot.ts(souvenirtimeseries)

Time series plots (figures not reproduced here)

Seasonal Decomposition

A time series with additive trend, seasonal, and irregular components can be decomposed using the stl() function (or decompose(), used below). Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation, as sketched next.
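For instance, the souvenir sales series above shows multiplicative seasonality, so it could be log-transformed before an additive decomposition (a minimal sketch using the objects created earlier):

# log-transform to convert multiplicative effects to additive ones
logsouvenirtimeseries <- log(souvenirtimeseries)
plot.ts(logsouvenirtimeseries)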

birthstimeseriescomponents <- decompose(birthstimeseries)

# get the estimated values of the seasonal component
birthstimeseriescomponents$seasonal

plot(birthstimeseriescomponents)


References

http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html

http://www.statmethods.net/advstats/timeseries.html

Anscombe’s Quartet

INTRODUCTION

 

Francis Anscombe (1973) is the author of Anscombe's Quartet. The statistician demonstrated the importance of graphing data before analysing it, and the effect of outliers on statistical properties. Anscombe's Quartet contains four distinct datasets whose summary properties are identical: the mean of X1, X2, X3 and X4 is 12, and the mean of Y1, Y2, Y3 and Y4 is 12.5. They have identical variances, correlations, and regression lines to at least two decimal places. Anscombe's Quartet shows how simple statistical measures can fail to give an accurate picture of the four datasets. When plotted, the graphs are different for each dataset, each of which consists of eleven (x, y) points.

Mean

The mean of X in each case is 12, and the mean of Y in each case is 12.5; the sample variances of X and Y are computed over 11 observations.

colMeans(assignment[, 1:8])
  X1   Y1   X2   Y2   X3   Y3   X4   Y4
12.0 12.5 12.0 12.5 12.0 12.5 12.0 12.5

 

Correlation

cor(X1, Y1) = 0.8164205; cor(X2, Y2) = 0.8162365; cor(X3, Y3) = 0.8162867; cor(X4, Y4) = 0.8165214

The plots show outliers on the graphs; the correlations within each pair are significant, and there is a positive relationship: as the value of one variable increases, so does the value of the other. The correlation coefficient is approximately 0.816 in each case.

The graphs show that, even though we are taught that correlation measures a linear association between two variables, we can have high correlations even when the relationship is nonlinear.

I believe Anscombe's main point was to prove that statistics can be misleading (a fact greatly abused by today's media), but also that outliers can have a strong impact on statistical properties.

Anscombe's Quartet illustrates correlation and non-linearity. The most common measure of correlation is the Pearson product-moment correlation coefficient. Its calculation captures only the linear dependence between two variables; if a non-linear relationship exists between the variables, it will go undetected by the Pearson correlation.

With a perfect functional relationship between X and Y, the association should intuitively be 1, yet the Pearson coefficient sits uninformatively at 0.816. Consider the following more extreme case (of the type y = (x-a)^2 + b), where a 100% deterministic relationship translates into a zero linear (Pearson) correlation:
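A minimal R sketch of such a case (the values here are chosen purely for illustration):

# a perfect quadratic relationship of the type y = (x-a)^2 + b, with a=0, b=0;
# the x values are symmetric about zero, so the linear association vanishes
x <- -5:5
y <- x^2
cor(x, y)   # ~0: the Pearson correlation misses the deterministic link
plot(x, y)  # the plot makes the relationship obvious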

REGRESSION

The purpose of regression is to build a model that estimates or predicts one quantitative variable (Y) by using at least one other quantitative variable (X). Anscombe's Quartet uses simple linear regression: one X variable to estimate the Y variable. The variances are approximately 5.5 in each case:

VAR: X1, Y1= 5.501

VAR: X2, Y2= 5.5

VAR: X3, Y3= 5.497

VAR: X4, Y4= 5.49
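As a sketch, the four regressions could be fitted as follows (this assumes the data frame is named assignment, as in the colMeans() call above); Anscombe's point is that the four fitted lines come out nearly identical:

# fit the four simple linear regressions
lm(Y1 ~ X1, data = assignment)
lm(Y2 ~ X2, data = assignment)
lm(Y3 ~ X3, data = assignment)
lm(Y4 ~ X4, data = assignment)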

Data Analysis

Anscombe’s Quartet is often used to highlight the importance of graphical exploration of the data for analysis:

Anscombe's data analysis is an approach to analysing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses: the set of ideas about hypothesis testing, p-values, confidence intervals, etc. which formed the key tools in the arsenal of practising statisticians at the time.

The objectives of Anscombe's Quartet for data analysis are to suggest hypotheses about the causes of observed phenomena; to assess the assumptions on which statistical inference will be based; to support the selection of appropriate statistical tools and techniques; and to provide a basis for further data collection through surveys or experiments.

Anscombe's Quartet shows four datasets with identical statistical properties but different structures. Graph 1 (variables X1 and Y1) looks like many datasets that have a well-fitting, correlated linear model. The second graph (X2 and Y2) does not have a linear correlation and is not normally distributed. The third graph (X3 and Y3) has a linear correlation, but the linear regression is thrown off by an outlier; it would be easy to fit a correct linear model if only the outlier were spotted and removed first. The fourth graph, dataset IV (X4 and Y4), does not support any kind of linear model, but the single outlier is enough to produce a high correlation coefficient.

When the outlier moves the regression line away from the mainstream, a graphical representation shows the actual data that need to be seen, along with the distances from the regression line (the residuals) and other statistical tests that help to judge how well the model fits.

Altering the regression line in this way lowers the correlation coefficient from the intuitive 1 to 0.816.

SUMMARY

      X1             Y1             X2             Y2
 Min.   : 7.0   Min.   : 9.26   Min.   : 7.0   Min.   : 8.10
 1st Qu.: 9.5   1st Qu.:11.31   1st Qu.: 9.5   1st Qu.:11.70
 Median :12.0   Median :12.58   Median :12.0   Median :13.14
 Mean   :12.0   Mean   :12.50   Mean   :12.0   Mean   :12.50
 3rd Qu.:14.5   3rd Qu.:13.57   3rd Qu.:14.5   3rd Qu.:13.95
 Max.   :17.0   Max.   :15.84   Max.   :17.0   Max.   :14.26
      X3             Y3             X4           Y4
 Min.   : 7.0   Min.   :10.39   Min.   :11   Min.   :10.25
 1st Qu.: 9.5   1st Qu.:11.25   1st Qu.:11   1st Qu.:11.17
 Median :12.0   Median :12.11   Median :11   Median :12.04
 Mean   :12.0   Mean   :12.50   Mean   :12   Mean   :12.50
 3rd Qu.:14.5   3rd Qu.:12.98   3rd Qu.:11   3rd Qu.:13.19
 Max.   :17.0   Max.   :17.74   Max.   :22   Max.   :17.50

For the model fit, the p-value of 0.69 is greater than 0.05, so the result is not significant and we accept the null hypothesis.

 

VISUALIZATION

Visualization provides a unique view into data that can make its structure much easier to understand than numerical methods alone. Visualization is not as precise as statistics, but it provides the context necessary to make better decisions and to be more careful when fitting models.

The benefit of visualisation is the insight into the quality of the collected data that truly leads to success. Data visualization provides three key insights into data: whether the data are complete, whether they are valid, and whether they are well organized.

The plots and box plots (figures not reproduced here)

Excel Dataset

X1   Y1      X2   Y2      X3   Y3      X4   Y4
10   8.04    12   14.02   12   9.65    10   7.43
8    6.95    6    7.85    9    7       8    6.7
13   7.58    15   9.6     11   10      14   6.84
9    8.81    8    10.93   10   11.89   8    9
11   8.33    14   7.35    12   6.75    12   7.54
14   9.96    18   9.34    6    9.5     14   6.09
6    7.24    5    6.2     4    3.78    6    5.3
4    4.26    3    3.05    3    4.23    4    8.1
12   10.84   16   14.8    12   9.13    12   6.4
7    4.82    9    5.85    5    3.87    7    5.4
5    5.68    7    4.65    4    5.67    4    4.75

 

 

 

CONCLUSION

Anscombe's Quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyse it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. Looking at the datasets immediately reveals a lot about their structure and makes the analyst aware of their patterns.

Overall, Anscombe’s Quartet is a powerful example of the insights that can be gleaned from graphical visualization of quantitative data.

 

Reference

Anscombe, F. J. (1973) 'Graphs in Statistical Analysis', The American Statistician, vol. 27, no. 1, pp. 17–21.

http://www.sebastianmarshall.com/anscombes-quartet

https://en.wikipedia.org/wiki/Anscombe%27s_quartet

https://www.quora.com/What-is-the-significance-of-Anscombes-quartet

https://eagereyes.org/criticism/anscombes-quartet