Learn R for Data Wrangling
Chapter 1 Prerequisites
1.1 What to expect
This site covers basic and intermediate topics in using R for data wrangling and visualization. It serves as an introduction to using R for data processing and cleaning, and later on for some analysis and publishing, using a typical Labour Force Survey as an example. Ultimately, this site can be a handy resource and reference for quickly revisiting some basic R concepts.
The basics of R and RStudio will be reviewed, with some tips and tricks included, before moving on to using the Tidyverse library for manipulating, exploring and visualizing data.
The main tools that will be used are RStudio and the packages in the Tidyverse library, including dplyr, tidyr, ggplot2 and readr.
The general mode of operation will be to introduce R topics in sequence and to provide hand-picked, appropriate online resources for those topics. However, some chapters, particularly the introductory ones, will have relatively extensive hand-written material. Other chapters might only provide summaries and supplementary comments on the linked online resources.
1.2 Resources Needed:
- Web Browser
This tutorial assumes that R and RStudio have already been installed, but a link to installation instructions is provided.
1.3 What is R?
Such a simple question, but one worth answering. In short, the creators of the language originally described it as ‘a programming language for data analysis and graphics’ (Ihaka and Gentleman 1996). However, it has grown into much more than that. R can now be called a powerful integrated environment for statistical and data analysis, data processing, graphics and reporting. R is open-source, meaning that the R codebase is open to users, who can contribute to the development and improvement of the language, mainly through the creation of packages for different purposes and domains. This has led to an expansion of the use of R across academia and industry, with applications in official statistics, data science, machine learning, web scraping, interactive visualizations and more! R is a powerful tool that continues to improve and expand.
1.4 Is it worth learning R for data wrangling/data science as opposed to other languages like Python?
It is very much worth learning R. On the statistics side of things, R is used so extensively across academia because its tools and packages are very strongly developed and validated against standards for statistical practice and analysis.
“R is probably the most thoroughly validated statistics software on Earth.” – Uwe Ligges, CRAN maintainer (useR!2017).
Therefore, for anyone who works in official statistics or academia, R offers some very compelling advantages.
Firstly, R is specialized. R was created for statistical analysis and data manipulation. It was built by statisticians for statisticians and is often considered the best language for statistical analysis and modeling. For example, R provides many easy ways to deal with missing values, and it has many well-built packages for survey data analysis. R is also widely regarded as a gold standard for data visualization through graphs, plots, etc.
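To make the point about missing values concrete, here is a minimal sketch in base R. The `income` vector and its values are made up for illustration; only the functions (`is.na()`, `sum()`, `mean()`) are standard R.

```r
# Hypothetical monthly income values; NA marks a non-response
income <- c(1200, NA, 950, 1800, NA)

# Flag which entries are missing (returns a logical vector)
missing_flags <- is.na(income)

# Count the missing entries
n_missing <- sum(missing_flags)

# By default, missing values propagate: this mean is NA
mean_with_na <- mean(income)

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(income, na.rm = TRUE)
```

Most summary functions in base R (`sum()`, `median()`, `sd()` and so on) accept the same `na.rm` argument, so the pattern carries over directly.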
Secondly, R has perhaps the best data-wrangling software on earth: the Tidyverse. The Tidyverse is a collection of packages that vastly improve the data-processing and data-wrangling capabilities of R. For example, dplyr and tidyr allow for easy reshaping, filtering, cleaning and querying of large datasets. The Tidyverse also offers tools for quickly importing data from different formats: readr for delimited text files, readxl for Excel workbooks, and haven for SPSS, Stata and SAS files (readr and haven can also write files back out). ggplot2 is perhaps the most popular graphing and visualization package there is; it allows plots and graphs to be built layer by layer in a very logical way.
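As a rough illustration of the reshaping, filtering and querying described above, the sketch below uses dplyr and tidyr on a tiny made-up table. The `survey` data, its column names and its values are all hypothetical, and the example assumes both packages are already installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical survey extract: weekly hours worked in two years
survey <- tibble(
  id         = 1:4,
  sex        = c("F", "M", "F", "M"),
  hours_2022 = c(40, 35, NA, 20),
  hours_2023 = c(38, 40, 30, 25)
)

# Reshape from wide to long (one row per person and year),
# then drop the rows where hours are missing
long <- survey %>%
  pivot_longer(
    cols         = starts_with("hours_"),
    names_to     = "year",
    names_prefix = "hours_",
    values_to    = "hours"
  ) %>%
  filter(!is.na(hours))

# Query: average hours by sex and year
summary_tbl <- long %>%
  group_by(sex, year) %>%
  summarise(mean_hours = mean(hours), .groups = "drop")
```

Each step takes a data frame and returns a data frame, which is what lets the steps chain together with the pipe (`%>%`).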
Thirdly, R has many other extended features that will help you in your data analysis workflow. Do you need to write a report with graphs, illustrations and tables? There is R Markdown for that. Do you need an interactive website for running statistical models and visualizations? Shiny was created for that purpose.
All of these advantages are unified by what many believe to be one of the best programming IDEs* out there: RStudio. RStudio is an extremely powerful and flexible development environment, and it integrates all of what makes R so great. At the end of the day, it is all these advantages together that make R such compelling software to learn.
That being said, Python is generally considered to be better at general programming, machine learning and web scraping, and the data-science field as a whole leans more towards Python. However, R is still a major player in data science and it is absolutely worth learning data science with R. In addition, you will still be able to use Python modules and commands in R using the R package reticulate, which allows you to import and run Python code right in R. You can have the best of both worlds!
In conclusion, if you are a statistician or in academia, R is the way to go; and when you learn programming in R, you can apply these concepts later on in any other language, such as Python.
Now, let’s take a look at starting to use R and RStudio.
*An IDE (Integrated Development Environment) is a program that provides many facilities for programmers to write, test and develop programs. It usually provides tools to manage the workflow and extensions to make coding easier. Many R users consider RStudio one of the best IDEs available.