Digging R

Introduction to R

R is a programming language and software framework for statistical analysis and graphics. It is available under the GNU General Public License, and the software and installation instructions can be obtained from the Comprehensive R Archive Network (CRAN).

The following R code illustrates a typical analytical workflow in which a dataset is imported, its contents are examined, and some model-building tasks are executed.

In [8]:
# import a CSV file of superstore sales records (one row per order line)
sales <- read.csv("sampledatasets/ Superstoresale.csv")
In [9]:
str(sales)
'data.frame':  8399 obs. of  21 variables:
 $ Row.ID              : int  1 49 50 80 85 86 97 98 103 107 ...
 $ Order.ID            : int  3 293 293 483 515 515 613 613 643 678 ...
 $ Order.Date          : Factor w/ 1418 levels "01/01/2009","01/01/2010",..: 598 39 39 448 1289 1289 768 768 1084 1174 ...
 $ Order.Priority      : Factor w/ 5 levels "Critical","High",..: 3 2 2 2 5 5 2 2 2 3 ...
 $ Order.Quantity      : int  6 49 27 30 19 21 12 22 21 44 ...
 $ Sales               : num  262 10123 245 4966 394 ...
 $ Discount            : num  0.04 0.07 0.01 0.08 0.08 0.05 0.03 0.09 0.07 0.07 ...
 $ Ship.Mode           : Factor w/ 3 levels "Delivery Truck",..: 3 1 3 3 3 3 3 3 2 3 ...
 $ Profit              : num  -213.2 457.8 46.7 1199 30.9 ...
 $ Unit.Price          : num  38.94 208.16 8.69 195.99 21.78 ...
 $ Shipping.Cost       : num  35 68.02 2.99 3.99 5.94 ...
 $ Customer.Name       : Factor w/ 795 levels "Aaron Bergman",..: 554 69 69 173 130 130 126 126 552 238 ...
 $ Province            : Factor w/ 13 levels "Alberta","British Columbia",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Region              : Factor w/ 8 levels "Atlantic","Northwest Territories",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Customer.Segment    : Factor w/ 4 levels "Consumer","Corporate",..: 4 1 1 2 1 1 2 2 2 3 ...
 $ Product.Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 2 2 2 3 2 1 2 2 2 2 ...
 $ Product.Sub.Category: Factor w/ 17 levels "Appliances","Binders and Binder Accessories",..: 15 1 2 17 1 9 2 15 15 11 ...
 $ Product.Name        : Factor w/ 1263 levels "#10-4 1/8\" x 9 1/2\" Premium Diagonal Seam Envelopes",..: 405 11 320 888 591 542 152 917 913 1215 ...
 $ Product.Container   : Factor w/ 7 levels "Jumbo Box","Jumbo Drum",..: 3 2 5 5 4 6 5 5 3 5 ...
 $ Product.Base.Margin : num  0.8 0.58 0.39 0.58 0.5 0.37 0.38 NA NA 0.38 ...
 $ Ship.Date           : Factor w/ 1450 levels "01/01/2010","01/01/2011",..: 944 86 133 552 1405 1405 786 834 1156 1199 ...
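
The str() output above shows that Order.Date and Ship.Date were read in as factors holding day/month/year strings, so date arithmetic is not available on them directly. A minimal sketch of one way to convert them, assuming the dd/mm/yyyy format seen in the factor levels; the `toy` data frame, its values, and the `Days.To.Ship` column are illustrative stand-ins, not part of the real dataset:

```r
# Toy stand-in for the two date columns (values are illustrative, not real rows)
toy <- data.frame(
  Order.Date = factor(c("15/09/2011", "28/03/2012", "12/12/2010")),
  Ship.Date  = factor(c("17/09/2011", "30/03/2012", "13/12/2010"))
)

# Convert factor -> character -> Date, stating the day/month/year format explicitly
toy$Order.Date <- as.Date(as.character(toy$Order.Date), format = "%d/%m/%Y")
toy$Ship.Date  <- as.Date(as.character(toy$Ship.Date),  format = "%d/%m/%Y")

# Subtracting two Date vectors gives a difference in days (hypothetical derived column)
toy$Days.To.Ship <- as.numeric(toy$Ship.Date - toy$Order.Date)

str(toy)
```

The same two as.Date() calls could be applied to sales$Order.Date and sales$Ship.Date, provided every value in the file really follows that format.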
In [10]:
# examine the imported dataset
# head(sales)
summary(sales)
# plot order quantity vs. sales
plot(sales$Order.Quantity, sales$Sales, main="Order Quantity vs. Sales")
     Row.ID        Order.ID          Order.Date         Order.Priority
 Min.   :   1   Min.   :    3   15/09/2011:  20   Critical     :1608  
 1st Qu.:2100   1st Qu.:15012   28/03/2012:  20   High         :1768  
 Median :4200   Median :29857   12/12/2010:  18   Low          :1720  
 Mean   :4200   Mean   :29965   04/08/2010:  17   Medium       :1631  
 3rd Qu.:6300   3rd Qu.:44596   19/11/2011:  17   Not Specified:1672  
 Max.   :8399   Max.   :59973   20/04/2010:  17                       
                                (Other)   :8290                       
 Order.Quantity      Sales             Discount                Ship.Mode   
 Min.   : 1.00   Min.   :    2.24   Min.   :0.00000   Delivery Truck:1146  
 1st Qu.:13.00   1st Qu.:  143.19   1st Qu.:0.02000   Express Air   : 983  
 Median :26.00   Median :  449.42   Median :0.05000   Regular Air   :6270  
 Mean   :25.57   Mean   : 1775.88   Mean   :0.04967                        
 3rd Qu.:38.00   3rd Qu.: 1709.32   3rd Qu.:0.08000                        
 Max.   :50.00   Max.   :89061.05   Max.   :0.25000                        
                                                                           
     Profit            Unit.Price      Shipping.Cost           Customer.Name 
 Min.   :-14140.70   Min.   :   0.99   Min.   :  0.49   Darren Budd   :  41  
 1st Qu.:   -83.31   1st Qu.:   6.48   1st Qu.:  3.30   Ed Braxton    :  38  
 Median :    -1.50   Median :  20.99   Median :  6.07   Brad Thomas   :  35  
 Mean   :   181.18   Mean   :  89.35   Mean   : 12.84   Carlos Soltero:  33  
 3rd Qu.:   162.75   3rd Qu.:  85.99   3rd Qu.: 13.99   Patrick Jones :  30  
 Max.   : 27220.69   Max.   :6783.02   Max.   :164.73   Tony Sayre    :  29  
                                                        (Other)       :8193  
             Province         Region           Customer.Segment
 Ontario         :1826   West    :1991   Consumer      :1649   
 British Columbia:1126   Ontario :1826   Corporate     :3076   
 Saskachewan     : 913   Prarie  :1706   Home Office   :2032   
 Alberta         : 865   Atlantic:1080   Small Business:1642   
 Manitoba        : 793   Quebec  : 781                         
 Quebec          : 781   Yukon   : 542                         
 (Other)         :2095   (Other) : 473                         
        Product.Category                     Product.Sub.Category
 Furniture      :1724    Paper                         :1225     
 Office Supplies:4610    Binders and Binder Accessories: 915     
 Technology     :2065    Telephones and Communication  : 883     
                         Office Furnishings            : 788     
                         Computer Peripherals          : 758     
                         Pens & Art Supplies           : 633     
                         (Other)                       :3197     
                                                            Product.Name 
 Global High-Back Leather Tilter, Burgundy                        :  24  
 Bevis 36 x 72 Conference Tables                                  :  22  
 BoxOffice By Design Rectangular and Half-Moon Meeting Room Tables:  22  
 Fiskars® Softgrip Scissors                                       :  22  
 Master Giant Foot® Doorstop, Safety Yellow                       :  22  
 Wilson Jones Hanging View Binder, White, 1"                      :  21  
 (Other)                                                          :8266  
  Product.Container Product.Base.Margin      Ship.Date   
 Jumbo Box : 532    Min.   :0.3500      21/05/2011:  19  
 Jumbo Drum: 624    1st Qu.:0.3800      09/10/2009:  16  
 Large Box : 406    Median :0.5200      11/04/2009:  16  
 Medium Box: 366    Mean   :0.5125      30/03/2012:  16  
 Small Box :4347    3rd Qu.:0.5900      04/10/2012:  15  
 Small Pack: 956    Max.   :0.8500      09/05/2012:  15  
 Wrap Bag  :1168    NA's   :63          (Other)   :8302
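
The summary above reports 63 NA's in Product.Base.Margin. A small sketch of how missing values might be counted per column and skipped in a calculation; the `toy` data frame reuses only the first ten values that str() printed for two columns, so its counts differ from the full dataset:

```r
# First ten values of two columns as printed by str(sales); a stand-in for the full data
toy <- data.frame(
  Order.Quantity      = c(6, 49, 27, 30, 19, 21, 12, 22, 21, 44),
  Product.Base.Margin = c(0.8, 0.58, 0.39, 0.58, 0.5, 0.37, 0.38, NA, NA, 0.38)
)

# Count missing values in every column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Most summary functions return NA unless missing values are dropped explicitly
margin_mean <- mean(toy$Product.Base.Margin, na.rm = TRUE)
print(margin_mean)
```

Running colSums(is.na(sales)) on the imported data would give the same kind of per-column count for all 21 variables.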
In [11]:
# perform a statistical analysis (fit a linear regression model)
results <- lm(sales$Sales ~ sales$Order.Quantity)
# summary of the results
summary(results)
# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks = 800)
Call:
lm(formula = sales$Sales ~ sales$Order.Quantity)

Residuals:
   Min     1Q Median     3Q    Max 
 -3054  -1618   -795     12  87972 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           379.431     77.438    4.90 9.77e-07 ***
sales$Order.Quantity   54.609      2.635   20.72  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3497 on 8397 degrees of freedom
Multiple R-squared:  0.04866,   Adjusted R-squared:  0.04854 
F-statistic: 429.5 on 1 and 8397 DF,  p-value: < 2.2e-16
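
The model above is written with `sales$` inside the formula. That fits correctly, but it makes predict() with new data awkward, because the term names are tied to the original data frame. A hedged sketch of the alternative form using the `data` argument, run on synthetic data; the `toy` data frame and the numbers used to generate it are assumptions chosen only to resemble the fitted coefficients, not the real sales records:

```r
# Synthetic data: a linear trend plus noise (illustrative only)
set.seed(1)
toy <- data.frame(Order.Quantity = 1:50)
toy$Sales <- 380 + 55 * toy$Order.Quantity + rnorm(50, sd = 100)

# Name the columns in the formula and pass the data frame separately
fit <- lm(Sales ~ Order.Quantity, data = toy)
print(coef(fit))

# Because the terms are plain column names, newdata is matched by name
new_orders <- data.frame(Order.Quantity = c(10, 40))
pred <- predict(fit, newdata = new_orders)
print(pred)
```

On the real data the equivalent call would be lm(Sales ~ Order.Quantity, data = sales), which should give the same coefficient estimates as the output shown above.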