Importing Datasets

A first step in data analysis is importing datasets. These can exist in several formats. Fortunately, R has several packages that allow us to easily import data from comma-separated value (CSV), SPSS and Excel files.

You will find the following datasets on ILIAS:

  1. zufriedenheit.csv
  2. zufriedenheit-semicolon.csv
  3. zufriedenheit.sav
  4. zufriedenheit.xls

This is a generated dataset. It contains a satisfaction rating (rating), recorded at two time points (messzeitpunkt) for four subjects (Vpn).

zufriedenheit.csv is a text file, in which the columns are separated by commas (hence the name). zufriedenheit-semicolon.csv is likewise a text file, just with a semi-colon as the delimiter. zufriedenheit.sav and zufriedenheit.xls are SPSS and Excel files.

In your RStudio project folder in the file system of your computer, create a new directory named 'data' and save the files you downloaded there.

In RStudio, there are two ways of importing files:

  1. Using functions: read_csv(), read_csv2() (for ';' delimiters), read_sav() and read_excel().

  2. Via the GUI: You can access this functionality either through 'File > Import Dataset' or in the Environment pane.

The second option is easier to use, and has the advantage that RStudio will generate all code for importing datasets for us, which can be copied for subsequent use.

Comma-separated value (CSV) files

The functions we need for importing CSV files are available in the readr package, which needs to be loaded with library(readr).

Alternatively, we can simply load all tidyverse packages:

            library(tidyverse)          

We will first import the data files using the GUI. In the Environment pane, click on Import Dataset > From Text (readr) (or 'File > Import Dataset > From Text (readr)'). You will see a dialog containing a Code Preview with the following code:

            library(readr)
            dataset <- read_csv(NULL)
            View(dataset)

On the left (bottom) you will find all options for importing the data. These options are all arguments of the read_csv() function (or the more general read_delim() function):

            args(read_csv)
            #> function (file, col_names = TRUE, col_types = NULL, locale = default_locale(),
            #>     na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "",
            #>     trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000,
            #>         n_max), progress = show_progress())
            #> NULL
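As a minimal sketch of how a few of these arguments behave in code, the following writes a small, made-up CSV file to a temporary location and reads it back. The file contents, the replacement column names and the missing-value code -99 are assumptions chosen for illustration; they are not part of the zufriedenheit data.

```r
library(readr)

# Write a small, invented CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score",
             "a,10",
             "b,-99",
             "c,30"), tmp)

demo <- read_csv(
  tmp,
  col_names = c("subject", "rating"),  # supply our own column names ...
  skip = 1,                            # ... and skip the original header row
  na = c("", "NA", "-99"),             # additionally treat -99 as missing
  col_types = "cd"                     # c = character, d = double
)
demo
```

Setting col_types explicitly also avoids the column guessing that read_csv() otherwise performs.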

In the File Browser you can select the zufriedenheit.csv file. After doing so you will see a Data Preview.

The variables Vpn and messzeitpunkt have been imported as character vectors. These will need to be converted to factors. Under Import Options a name for the data frame is generated automatically. This is simply the file name, minus the .csv suffix.

Try to find out what happens when you play around with the options, e.g. "First Row as Names".

Now click on Import. A new data frame (tibble) with the name zufriedenheit will appear in the Environment pane, and the generated R code is printed to the console.

            library(readr)
            zufriedenheit <- read_csv("data/zufriedenheit.csv")
            #> Parsed with column specification:
            #> cols(
            #>   Vpn = col_character(),
            #>   messzeitpunkt = col_character(),
            #>   rating = col_double()
            #> )

We still have to convert the grouping variables to factors:

            zufriedenheit$Vpn <- as.factor(zufriedenheit$Vpn)
            zufriedenheit$messzeitpunkt <- as.factor(zufriedenheit$messzeitpunkt)

            zufriedenheit
            #> # A tibble: 8 x 3
            #>   Vpn   messzeitpunkt rating
            #>   <fct> <fct>          <dbl>
            #> 1 VP_1  t1              64.3
            #> 2 VP_2  t1              65.1
            #> 3 VP_3  t1              54.2
            #> 4 VP_4  t1              54.5
            #> 5 VP_1  t2              67.2
            #> 6 VP_2  t2              75.6
            #> # ... with 2 more rows
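The conversion to factors can also be requested at import time via the col_types argument. The sketch below assumes a file with the same three columns as zufriedenheit.csv; to stay self-contained it writes a few rows to a temporary file rather than reading data/zufriedenheit.csv. With its default arguments, col_factor() takes the factor levels from the data in order of appearance.

```r
library(readr)

# Stand-in for data/zufriedenheit.csv with the same column layout
tmp <- tempfile(fileext = ".csv")
writeLines(c("Vpn,messzeitpunkt,rating",
             "VP_1,t1,64.3",
             "VP_2,t1,65.1",
             "VP_1,t2,67.2",
             "VP_2,t2,75.6"), tmp)

zufriedenheit_demo <- read_csv(
  tmp,
  col_types = cols(
    Vpn = col_factor(),            # import directly as a factor
    messzeitpunkt = col_factor(),  # import directly as a factor
    rating = col_double()
  )
)
str(zufriedenheit_demo)
```

Whether this is preferable to converting afterwards with as.factor() is a matter of taste; as.factor() sorts the levels alphabetically, whereas this approach keeps the order in which they occur in the file.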

Now do the same with the semi-colon delimited file zufriedenheit-semicolon.csv. This is something that you will frequently be confronted with if your computer uses a German locale.

You should see the following code in the Code Preview:

            library(readr)
            zufriedenheit_semicolon <- read_delim("data/zufriedenheit-semicolon.csv",
                ";", escape_double = FALSE, trim_ws = TRUE)

Try importing both files using the R commands. For the semi-colon file you can use the function read_csv2().
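One detail worth keeping in mind: read_csv2() expects not only ';' as the delimiter but also ',' as the decimal mark. The sketch below uses an invented two-row file written in that convention; whether zufriedenheit-semicolon.csv uses decimal commas is an assumption you should check in the Data Preview. If a file uses ';' as the delimiter but '.' as the decimal mark, read_delim() with delim = ";" and the default locale is the safer choice.

```r
library(readr)

# Invented file using ';' as delimiter and ',' as decimal mark
tmp <- tempfile(fileext = ".csv")
writeLines(c("Vpn;messzeitpunkt;rating",
             "VP_1;t1;64,3",
             "VP_2;t1;65,1"), tmp)

# read_csv2() assumes ';' as delimiter and ',' as decimal mark
via_csv2 <- read_csv2(tmp, col_types = "ccd")

# The same import spelled out with read_delim() and an explicit locale
via_delim <- read_delim(tmp, delim = ";",
                        locale = locale(decimal_mark = ","),
                        col_types = "ccd")

via_csv2$rating
via_delim$rating
```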

If you want to save a data frame as a CSV file, you can use the function write_csv():

            write_csv(x = zufriedenheit_long, path = "data/zufriedenheit.csv")
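The following round trip is a minimal sketch of what ends up in the file. The data frame demo is invented for illustration. The target file is passed by position, because the name of that argument changed from path to file in readr 1.4.0.

```r
library(readr)

# Invented data frame with one factor and one numeric column
demo <- data.frame(
  Vpn = factor(c("VP_1", "VP_2")),
  rating = c(64.3, 65.1)
)

tmp <- tempfile(fileext = ".csv")
write_csv(demo, tmp)   # second argument (the target file) passed by position

readLines(tmp)         # plain text: a header line plus one line per row

back <- read_csv(tmp, col_types = "cd")
class(back$Vpn)        # read back as character; the factor type is not stored in the file
```

Because a CSV file stores only text, the conversion to factors has to be repeated after every import.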

SPSS datasets

We can import the same dataset, but this time from the SPSS file format (zufriedenheit.sav). There is a package called haven; this provides the function read_sav(). We can load the package to make the function available:

            library(haven)
            zufriedenheit_spss <- read_sav("data/zufriedenheit.sav")

But, as before, we can also use the GUI. In the Environment pane, click on Import Dataset > From SPSS, and choose the file. You will see a Code Preview.

In contrast to importing CSV files, we have no options when importing from SPSS except for the name of the data frame, which we will change to zufriedenheit_spss.

Now you can click on Import. In the Environment pane, the newly created variable will appear. Variables imported from SPSS can have additional attributes, the most important of which is the labels attribute:

            zufriedenheit_spss$Vpn
            #>
            #> [1] 1 2 3 4 1 2 3 4
            #> attr(,"format.spss")
            #> [1] "F8.0"
            #> attr(,"labels")
            #> VP_1 VP_2 VP_3 VP_4
            #>    1    2    3    4

This contains the SPSS value labels. Using this, we can look up what type of coding scheme was used for categorical variables. For example, if the values 0 and 1 were used for the sex variable, we can determine which sex has the value 0.
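Such a lookup can be sketched as follows. The variable sex and its labels are hypothetical; they are built by hand with labelled() from the haven package so that the example runs without an SPSS file.

```r
library(haven)

# Hypothetical variable coded 0/1 with SPSS-style value labels
sex <- labelled(c(0, 1, 1, 0), labels = c(female = 0, male = 1))

# Named vector: the names are the labels, the values are the codes
value_labels <- attr(sex, "labels")
value_labels

# Which label belongs to the value 0?
names(value_labels)[value_labels == 0]
```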

If we want to convert variables to factors, we can use the function as_factor() from the haven package. This allows us to use either the SPSS values or the value labels as levels of the factor in R. This is controlled by the argument levels, which can take the values "default", "labels", "values" or "both" (you can inspect the help page using ?as_factor). "default" seems to be the most sensible choice: labels are used if available, and otherwise the values themselves are used. The other options are "both" (values and value labels are combined), "labels" (labels only) and "values" (values only).

Arguments of the as_factor() function

            levels   How to create the levels of the generated factor:
                       "default": uses labels where available, otherwise the values. Labels are sorted by value.
                       "both": like "default", but pastes together the level and value
                       "labels": use only the labels; unlabelled values become NA
                       "values": use only the values

            ordered  If TRUE create an ordered (ordinal) factor, if FALSE (the default) create a regular (nominal) factor.

The argument ordered allows us to create an ordered factor, if the ordering of the factor levels is important.

            zufriedenheit_spss$Vpn <- as_factor(zufriedenheit_spss$Vpn, levels = "default")
            zufriedenheit_spss$messzeitpunkt <- as_factor(zufriedenheit_spss$messzeitpunkt, levels = "default")
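To compare the effect of the levels argument without needing the SPSS file, the sketch below rebuilds a labelled vector by hand that mirrors the values and labels printed for the Vpn column. The use of labelled() to construct it is an assumption made for illustration; data imported with read_sav() already arrive in this form.

```r
library(haven)

# Hand-built labelled vector mirroring the Vpn column
vpn <- labelled(c(1, 2, 3, 4, 1, 2, 3, 4),
                labels = c(VP_1 = 1, VP_2 = 2, VP_3 = 3, VP_4 = 4))

levels(as_factor(vpn, levels = "default"))  # value labels used as levels
levels(as_factor(vpn, levels = "values"))   # numeric codes used as levels
levels(as_factor(vpn, levels = "both"))     # codes and labels combined

# An ordered factor, for cases where the order of the levels matters
vpn_ordered <- as_factor(vpn, levels = "default", ordered = TRUE)
is.ordered(vpn_ordered)
```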

We can also save a data frame in the SPSS file format using the write_sav() function:

            write_sav(data = zufriedenheit_long, path = "data/zufriedenheit.sav")
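A round trip through a temporary .sav file can be sketched as follows. The data frame demo is invented for illustration. The example assumes that haven stores factor levels as value labels when writing, which is why the factor is recovered with as_factor() after reading the file back.

```r
library(haven)

# Invented data frame with one factor and one numeric column
demo <- data.frame(
  Vpn = factor(c("VP_1", "VP_2")),
  rating = c(64.3, 65.1)
)

tmp <- tempfile(fileext = ".sav")
write_sav(demo, tmp)          # arguments by position: data, then target file

back <- read_sav(tmp)
attr(back$Vpn, "labels")      # inspect how the factor levels were stored
as_factor(back$Vpn)           # convert the imported column back to a factor
```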

Download and import the file called Beispieldatensatz.sav from ILIAS. Which value labels do the categorical variables have?

Excel files

We can also import Excel spreadsheets. Click on Import Dataset > From Excel, and then choose the file you want to import. We will call the data frame zufriedenheit_xls. Underneath the Name text box there is a drop-down menu entitled Sheet. This allows you to specify which worksheet you want to import. You should get the following R code in the Code Preview:

            library(readxl)
            zufriedenheit_xls <- read_excel("data/zufriedenheit.xlsx", sheet = "zufriedenheit")

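The function we are using is called read_excel() and is available in the readxl package. readxl is installed together with the tidyverse, but it is not loaded by library(tidyverse) and must be loaded explicitly.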

Categorical variables should over again be converted to factors:

            zufriedenheit_xls$Vpn <- as.factor(zufriedenheit_xls$Vpn)
            zufriedenheit_xls$messzeitpunkt <- as.factor(zufriedenheit_xls$messzeitpunkt)
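The same steps can be tried without the course file, using an example workbook that ships with readxl. This is only a stand-in for the zufriedenheit spreadsheet; the sheet name "chickwts" and its column feed belong to that bundled workbook, not to the course data.

```r
library(readxl)

# Example workbook bundled with readxl (stand-in for the course file)
path <- readxl_example("datasets.xlsx")

excel_sheets(path)   # names of the worksheets in the workbook

# Import one worksheet by name and convert its grouping variable to a factor
chick <- read_excel(path, sheet = "chickwts")
chick$feed <- as.factor(chick$feed)
str(chick)
```

Calling excel_sheets() first is a convenient way to find the value to pass to the sheet argument.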

RData files

The last option is an RData file. This is a binary file format and has the advantage that we can combine multiple R objects, including all their attributes, in a single file. This is very useful; when exporting to a text file, such as CSV, all metadata will be lost. A further advantage is that the file may be compressed in order to save space. However, this file format is specific to R, and thus may not be the best option when sharing your data with other people.

You can save objects in your workspace as .RData (or .Rda) files with the function save():

            save(zufriedenheit, file = "data/zufriedenheit.Rda")

You can also save several objects in one go:

            save(zufriedenheit, zufriedenheit_spss, zufriedenheit_xls, file = "data/zufriedenheit_alle.Rda")

The file can be loaded using the function load():

            load(file = "data/zufriedenheit_alle.Rda")
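The following sketch illustrates that factors and other attributes survive a save()/load() round trip. The objects demo and other, and the attribute "note", are invented for illustration, and a temporary file is used instead of the data directory.

```r
# Invented objects: a data frame with a factor and a custom attribute
demo <- data.frame(Vpn = factor(c("VP_1", "VP_2")), rating = c(64.3, 65.1))
attr(demo, "note") <- "generated example"
other <- 1:3

tmp <- tempfile(fileext = ".Rda")
save(demo, other, file = tmp)

rm(demo, other)             # remove the objects from the workspace
loaded <- load(file = tmp)  # restores the objects under their original names
loaded                      # names of the restored objects

is.factor(demo$Vpn)         # the factor is still a factor
attr(demo, "note")          # the custom attribute is still attached
```

Note that load() restores objects under the names they had when saved, so it can silently overwrite objects of the same name in your workspace.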

Exercises

Importing datasets 1

  1. Download and import the file Therapy.sav from ILIAS, using both the GUI and the function read_sav().

  2. Check the coding of the grouping variables. Which level should be the reference category?

  3. Convert the grouping variables to factors.

Importing datasets 2

  1. Generate a (simulated) dataset and export it either as a CSV, xls or sav file. This dataset should contain at least one numeric and one grouping variable. Exchange files with a partner.

  2. You will receive a file from your partner. Try to import this, and perform all necessary conversions.