Downloading Data from GitHub Repositories and Basic Data Tidying
Introduction:
This tutorial is aimed at people who are afraid of R. Being completely honest, I’m a little bit afraid of R, but I have discovered that
(a) it’s not my fault I’m afraid - coding and Bayesian statistics are not taught well (if at all) at pre-honours level in Ecology and Environmental Science
and (b) this stuff is forking difficult! (see what I did there, a little coding pun 😉).
Today we’re going to explore downloading GitHub repositories and the basic tidying of a data set (as well as a little data visualisation). But don’t worry, this tutorial is aimed at people who have never used R or GitHub before.
Sound overwhelming? Don’t worry - this stuff is overwhelming…but it’s not rocket science, we’re going to get there! 🏆💪
Tutorial Aims:
This tutorial will help you along every step of the way, particularly with the steps you didn’t realize were even going to be issues! Within the first five minutes of my first coding tutorial I was already fighting back tears because of total confusion about how to “unzip” folders, where to “unzip” them to and what on earth all the folders are for in the hard drive of the laptop I once called a friend. But fear not, today we shall demystify all of these scary-sounding tasks, get you all set up in R and show you how to import data and give it a quick tidy, so that you’re ready to go with any data manipulation or analysis of your liking in the future! And we might even make a pretty plot or two 😜
Learning Objectives:
- Downloading the files and data necessary for completing the tutorial
- Importing the data files and tidying up
- Calculating average species abundance
- Making a simple plot of your results
A word of warning before we begin 😬
If you get stuck at any point or get a weird and wonderful error message 😟, don’t panic! Take a deep breath, pull up Ecosia (yaay the trees!) and see if anyone else out there in the coding abyss has run into the same problem (believe me, they will have).
You can also check out the Troubleshooting Coding Club tutorial.
And off we go! 🚀
Today we will be analyzing data of Irish farm animal abundance provided by the FAO database for the last 50 years or so. We’ll trim down the records of multiple different agricultural species to just a few (🐄 🐔 🐎 🦆 🐖) to make our data handling a bit easier.
1. Downloading Repositories from GitHub
So, in order to begin any tidying up and organization, we need the necessary files and data, all accessible via this repository. To transfer the files onto your PC, click on the large, rectangular button Code and then Download ZIP.
Great, you’ve downloaded the files of the repository. Now, open your File Explorer and find your recently downloaded zipped file in your Downloads folder. Check whether your zipped file has automatically downloaded into (one of) your OneDrive folders - you don’t want this because it might slow things down later when all of your files are trying to sync with the OneDrive cloud as well as your remote repository on GitHub.
Next, right click the downloaded zipped file and click Unzip or Extract All, depending on your operating system.
❗ Windows Users ❗ No option for Extract All may appear when you right click the zipped file in your Downloads. But don’t fret 👉
Just make sure to first left click the zipped file so it is selected to be extracted. You should see a pink ribbon appear at the top of the File Explorer window.
Once that appears, right click the zipped file and extract (unzip) it to a location on your hard drive (not a OneDrive folder). Click on Browse, scroll down the left panel in the File Explorer and select This PC/Windows(C:)/Users/Your PC Username/Whichever Folder You Want to Save the Files Into. As mentioned earlier, make sure you don’t unzip the folder into any OneDrive folder.
Okay, we’re on our way.
2. Importing Data and Giving it a Tidy
Next, we need to open an R script to begin our coding. You can do this by either pressing Ctrl(Cmd) + Shift + N or by clicking File in the top left corner of the screen and then New File/R Script.
At the top of your new script, write a couple of lines noting the purpose/title of the script, the author and the date.
# Basic Data Wrangling and Visualization
# Kate Moloney
# 29/11/20
Set your working directory (the folder R reads files from and saves files to for the project you’re working on) to the location you unzipped your downloaded repository to. For more about working directories, or if you’re confused about file paths and such, check out this Coding Club tutorial.
❗ Windows Users ❗ Don’t forget to change your slashes in your file path from \ to / if you’re copying it from your file explorer - annoying but necessary 🙄
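As a rough sketch of what setting and checking a working directory can look like, the snippet below uses a temporary folder as a stand-in for your unzipped repository folder. The folder and the example path in the comment are assumptions for illustration only, so swap in your own location.

```r
# A stand-in folder for illustration only; replace it with your own path,
# e.g. "C:/Users/YourName/Documents/your-unzipped-folder" (note the forward slashes)
repo_folder <- tempdir()

old_wd <- setwd(repo_folder)   # setwd() hands back the previous working directory
getwd()                        # confirms where R is now looking for files
list.files()                   # lists the files R can see in that folder

setwd(old_wd)                  # going back to where we started
```

If list.files() doesn't show the files you expect, the path is usually the culprit.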
Now we can import our data file from the repository we downloaded. First let’s load the packages we’ll need for today. If you don’t have any of these packages already installed, use install.packages() to install them before loading them.
# Packages ----
library(tidyr)
library(dplyr)
library(ggplot2)
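If you're not sure which of these packages you already have, here is one possible way to check before installing anything. This is only a sketch: the object names needed and missing_pkgs are made up for this example, and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages this tutorial uses
needed <- c("tidyr", "dplyr", "ggplot2")

# requireNamespace() checks whether a package is installed without loading it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install (needs internet)
}
```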
Next, let’s import our data file called FAOSTAT_Irish_Farm_Animals.csv. This is a comma separated values (CSV) file, a plain-text format that Excel can also open, which we are reading into R - almost like scanning using a photocopier.
# Importing and checking data ----
Farm_Animals <- read.csv("FAOSTAT_Irish_Farm_Animals.csv") # importing data file containing count data
head(Farm_Animals) # checking first few rows of data
str(Farm_Animals) # checking if all our variables are the right type of variable
Amazing, so we have our data file loaded and checked using the head() and str() (structure of data) functions. When you run str(), a summary of the data frame should appear in your console (the bottom left box in RStudio). Each column header (i.e. variable) is listed along with its classification: chr = character variable, int = integer, num = numeric variable, etc.
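To get a feel for those classifications, here is a tiny made-up data frame (not the FAO data; the values are invented) with one column of each type:

```r
# A toy data frame for illustration only
toy <- data.frame(
  Item  = c("Cattle", "Pigs"),   # text, so character (chr)
  Year  = c(1961L, 1961L),       # the L makes these integers (int)
  Value = c(4.7, 1.1),           # decimals, so numeric (num)
  stringsAsFactors = FALSE       # keeps text as chr in older versions of R too
)

str(toy)             # one line per column with its type
sapply(toy, class)   # the same information as a named vector
```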
Time to organize! Let’s check how many different animals are in the data set. The unique() function strips out the repeats in the column named Item so that each animal appears only once, and the length() function counts how many there are. Running unique() on its own lists the names of the animals so we can clearly see them all and choose which species we want to work with.
length(unique(Farm_Animals$Item)) # checking how many different animals are recorded
unique(Farm_Animals$Item) # listing the ten different categories of animal
Running unique() prints all the different species in the data. This is a useful function as it lets you make informed decisions about how to filter your data without scrolling through rows and rows of an Excel worksheet.
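As a small, made-up illustration of counting distinct values in a column (the vector toy_items is invented for this example):

```r
# A toy column with repeated animals
toy_items <- c("Cattle", "Pigs", "Cattle", "Ducks", "Pigs")

unique(toy_items)           # each distinct value once
length(unique(toy_items))   # how many distinct values there are
table(toy_items)            # how many rows each value has
```

table() is handy as a follow-up because it also shows whether each animal has the same number of records.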
Now, let’s get rid of some of the variables and columns we don’t want. We’re going to use functions from the dplyr package to fix up the data to our liking. If you haven’t installed the dplyr package, use install.packages("dplyr") to install the package and library(dplyr) to load it. If you’re confused by any of its functions or want to learn more about the package, this website is a good resource.
Notice the select() function can also be used to rename variables by writing New name = Old name inside the c() that chains the column names together.
Let’s also make sure the count data for abundance is numeric using the as.numeric() function.
# using the dplyr select() function to keep only the columns recording species, year, and abundance
Farm_Animals <- dplyr::select(Farm_Animals, c(Animal = Item, Year, Abundance = Value))
# making the population abundance values numeric (assigning to the column so the data frame is kept)
Farm_Animals$Abundance <- as.numeric(Farm_Animals$Abundance)
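Here is a minimal sketch of the same two steps on a made-up data frame, so you can see what they do without touching the real data. The toy values and the extra Flag column are invented for illustration. The thing to watch is that as.numeric() is assigned back to the column, not to the whole data frame, otherwise the data frame would be replaced by a single vector.

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(
  Item  = c("Cattle", "Pigs"),
  Year  = c(1961, 1961),
  Value = c("4700000", "1100000"),   # numbers stored as text on purpose
  Flag  = c("A", "A"),               # a column we don't need
  stringsAsFactors = FALSE
)

toy <- dplyr::select(toy, Animal = Item, Year, Abundance = Value)  # keep and rename
toy$Abundance <- as.numeric(toy$Abundance)   # convert just that column

str(toy)   # still a data frame, with Abundance now num
```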
For ease, let’s just pick 5 different animals to work with today: cattle, chickens, horses, ducks and pigs. We’re using the filter() function and the grepl() function to filter our data down to our desired animals. We’re also using pipes in this code chunk for efficiency (I like to think of them as little ant tunnels 🐜) but you don’t need to worry about that for now. If you want to be an eager beaver you can check out more about pipes in this Coding Club tutorial.
The grepl() function searches for character matches in a vector and in the code below, we use the function to only select our desired animals from the Animal character variable.
# filtering animals down to desired species using filter function and pipes
Farm_Animals <- Farm_Animals %>%
  dplyr::filter(grepl("Cattle", Animal) |      # selecting our chosen species
                  grepl("Chickens", Animal) |
                  grepl("Horses", Animal) |
                  grepl("Ducks", Animal) |
                  grepl("Pigs", Animal))
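To see what grepl() actually returns, here is a small sketch on a made-up vector of animal names (toy_animals and the abundance numbers are invented for this example). It also shows an optional shortcut: putting | inside the pattern itself means "or", so several animals can be matched in one call.

```r
library(dplyr)

toy_animals <- c("Cattle", "Chickens", "Goats", "Ducks", "Sheep")

grepl("Cattle", toy_animals)         # TRUE only where the text contains "Cattle"
grepl("Cattle|Ducks", toy_animals)   # TRUE where it contains "Cattle" or "Ducks"

# The same idea inside filter(), keeping only the matching rows
toy <- data.frame(Animal = toy_animals, Abundance = c(10, 20, 30, 40, 50))
toy_filtered <- toy %>%
  dplyr::filter(grepl("Cattle|Ducks", Animal))
toy_filtered
```

Because grepl() matches part of the text, it will also pick up longer labels that contain the word you searched for.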
Nice job! Now we have our tidied data ready to go with just the three variables we require: animal, year and abundance. You’re doing great and we’re nearly there! Even if you feel a bit like this 😖 don’t give up, it will all come together soon 😎
3. Calculating Average Species Abundance
So now, let’s see what the average abundance of each animal was between 1961 and 2018. We do this using the very handy mean() function which does exactly what it says on the tin.
Using square brackets [ ] allows you to specify which columns/rows you want R to select. For example, for the object cattle we are telling R to average the numbers in the abundance column (Farm_Animals$Abundance) in rows 1 to 58. This relies on the rows being grouped by animal, with 58 yearly records (1961 to 2018) for each.
cattle <- mean(Farm_Animals$Abundance[1:58]) # calculating mean abundance of each species
chickens <- mean(Farm_Animals$Abundance[59:116])
ducks <- mean(Farm_Animals$Abundance[117:174])
horses <- mean(Farm_Animals$Abundance[175:232])
pigs <- mean(Farm_Animals$Abundance[233:290])
Now that we have an average abundance value for each of our species, it’s time to chain them together using the trusty c() function!
# making vector of mean abundance values for each species and chaining them together
Average_Abundance <- c(cattle, chickens, ducks, horses, pigs)
# making character vector - using "" to avoid calling OBJECT
Main_Species <- c("cattle", "chickens", "ducks", "horses", "pigs")
🎺 Notice how we made two vectors, one chaining the objects cattle, chickens, etc, and the other chaining the characters cattle, chickens, etc? This is so we can make a data frame of the two vectors together (one being numeric, the other categorical). Think of characters like characters in movies or books - characters speak and use “quotation marks”. Well, so do characters in R! 📣 Now we can put the two vectors together and add a column of logged abundance.
# making data frame of main species and their abundance
Main_Animals <- data.frame(Main_Species, Average_Abundance)
# creating new column of logged abundance using mutate function
Main_Animals <- mutate(Main_Animals, Abundance_log = log10(Average_Abundance))
Using the data.frame() function we have created a data frame with our 5 animals of interest along with their average abundance. The mutate() function creates new columns, which in this case we use for our logged data (you can log any numeric variable in R by using the function log for natural logs or log10 for base 10 logs). For this data, logging is a good idea because the values are of such varying magnitudes that viewing the raw values doesn’t tell us much about how the abundance of each animal compares to another.
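Picking rows by number works, but it depends on knowing exactly where each animal starts and stops. As an alternative sketch (not the approach used in the rest of this tutorial), dplyr can work out the groups for you with group_by() and summarise(). The toy data below is made up so the means are easy to check by eye.

```r
library(dplyr)

# A toy data set for illustration only: two animals, three years each
toy <- data.frame(
  Animal    = rep(c("Cattle", "Ducks"), each = 3),
  Year      = rep(1961:1963, times = 2),
  Abundance = c(100, 200, 300, 10, 20, 30)
)

toy_means <- toy %>%
  group_by(Animal) %>%                                   # one group per animal
  summarise(Average_Abundance = mean(Abundance)) %>%     # mean within each group
  mutate(Abundance_log = log10(Average_Abundance))       # logged column, as before

toy_means
```

The same pattern keeps working if rows are added, removed or reordered, which row numbers do not.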
Try viewing our new data frame Main_Animals using the View() function (note the capital V) or by clicking on the data frame as it appears in the Environment 🌍
Okay, we’re almost there! Final push for the last section 🏃
4. Visualizing our Results with a Simple Plot
Although the data visualization options in R are endless, we’re going to keep it simple today, but if you want to dive into more complicated data vis, check out this tutorial from Coding Club.
R has many base plotting functions that produce basic, but very clear graphs. Let’s try plotting the average abundance of each animal we calculated before in a bar chart.
png("Images/Average_Abundance.png", width = 800, height = 600) # saving plot to desired images folder (the Images folder must already exist)
(Animals_barchart <- barplot(Main_Animals$Abundance_log,
                             names.arg = Main_Animals$Main_Species, # assigning character vector as names for each bar
                             col = c("lightblue", "lightcyan", "lavender", "mistyrose", "cornsilk"),
                             # adding colours to each animal
                             main = "Average Farm Animal Abundance", # adding title to main part of graph
                             xlab = "Farm Animals", # adding x axis label
                             ylab = "Logged Average Abundance (individuals)", # adding y axis label
                             ylim = c(0, 8), # setting y axis limits
                             cex.names = 1.5, cex.axis = 1.5, cex.lab = 1.5, cex.main = 1.5)) # altering size of font
dev.off() # closing barplot device
It worked! 👼
The barplot() function draws one bar for each value in a numeric vector, in this case our (logged) average species abundance, labelled with our character vector of farm animal species.
🎺 Notice how we put brackets around the object Animals_barchart we created to plot our barplot? This is so the object is created and run at the same time when the code is run for the barplot() function. If we hadn’t put the brackets around the entire object we would then have to call the object once we first created it.
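The bracket trick works for any object, not just plots. A tiny sketch with made-up objects x and y:

```r
x <- 5 * 2      # creates x but prints nothing
x               # a second line is needed to see the value

(y <- 5 * 2)    # creates y and prints it in one go
```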
The names.arg argument allows the vector of Main_Species names we created to be plotted below each bar in the chart. The cex arguments allow you to adjust the size of the text in different parts of the plot, increasing the font size when greater than 1 and decreasing it when less than 1.
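One thing that can trip you up when saving a plot this way is that png() can only write into a folder that already exists. Here is a hedged sketch of creating the folder first and checking the file was written. It uses a temporary location and made-up bar heights so it won't touch your project; in the tutorial you would use your own Images folder.

```r
# A stand-in location for illustration only
out_dir <- file.path(tempdir(), "Images")
dir.create(out_dir, showWarnings = FALSE)   # quietly does nothing if it already exists

out_file <- file.path(out_dir, "toy_barplot.png")
png(out_file, width = 800, height = 600)    # open the image file
barplot(c(3, 5, 2),                         # made-up bar heights
        names.arg = c("a", "b", "c"),       # labels under each bar
        cex.names = 1.5)                    # bigger label text
dev.off()                                   # close the file so the plot is written

file.exists(out_file)                       # check the file is really there
```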
It may not be a fancy graph, but it is an informative one! We can see cattle and pigs have been the highest in numbers for the last half century - which isn’t really surprising considering Ireland has one of the highest meat-consumption rates in western Europe 😑
Let’s try plotting one more graph to check out how the populations of farm animals have varied over time. We’re going to use the package ggplot2 to plot this next graph so if you haven’t installed it, do so now! Can you remember how to install and load packages? If you need to, peek at section 2. We’re not going to go into too much detail about the ins and outs of ggplot right now (I don’t want to overload you 🧟), but if you’re keen check out this data visualization tutorial.
# Plotting Abundance over Time for each animal ----
# the outer brackets wrap the WHOLE plot so the formatting is stored in Animal_line too
(Animal_line <- ggplot(Farm_Animals, aes(x = Year, y = log10(Abundance), colour = Animal)) +
    geom_line() + # specifying line chart
    labs(title = "Farm Animal Abundance over Time") + # adding title to graph
    ylab("Log Animal Abundance\n") +
    xlab("\nYear") +
    theme(plot.title = element_text(size = 12), # setting title font size
          axis.text.x = element_text(size = 8, angle = 45, vjust = 1, hjust = 1), # angling year labels
          axis.text.y = element_text(size = 8), # formatting axis text size
          axis.title = element_text(size = 10, face = "plain"),
          panel.grid = element_blank(), # Removing background grid
          plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adding a 1cm margin around plot
          legend.text = element_text(size = 7, face = "italic"), # formatting font for the legend text
          legend.title = element_blank(), # Removing legend title
          legend.position = c(1, 0.99))) # specifying legend location (newer ggplot2 versions may warn and suggest legend.position.inside)
ggsave("Images/Abundance_vs_time.png", plot = Animal_line, width = 10, height = 10) # saving plot
The ggplot() function allows us to plot year against (log) abundance while differentiating animal type with color. By using geom_line() we are specifying we want to plot a line chart as opposed to a point chart (scatter plot) or any other chart type. ggsave() is a useful function which allows you to save your ggplot with code to your desired location with a specified size.
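If you'd like to see the bare bones of a ggplot line chart without all the formatting, here is a minimal sketch on made-up data (the numbers and the temporary file location are assumptions for illustration only):

```r
library(ggplot2)

# A toy data set for illustration only
toy <- data.frame(
  Year      = rep(1961:1964, times = 2),
  Animal    = rep(c("Cattle", "Ducks"), each = 4),
  Abundance = c(1000, 1200, 1500, 1400, 10, 40, 20, 30)
)

# One line per animal, coloured by animal
toy_plot <- ggplot(toy, aes(x = Year, y = log10(Abundance), colour = Animal)) +
  geom_line()

# Saving to a temporary file; width and height are in inches by default
out_file <- file.path(tempdir(), "toy_line_plot.png")
ggsave(out_file, plot = toy_plot, width = 5, height = 4)
```

Everything else in the tutorial's plot (labels, theme, legend) is added on top of this core with more + lines.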
🎁Bonus Points🎁 Notice anything about the object Animal_line we created to plot our line plot? Can you remember why it has a bracket around it?
Nice! Interesting graph showing the log abundance (remember our log10() function from earlier 🤔) of our chosen farm animals between 1961 and 2018.
Looks like something serious happened to Irish ducks in the mid-80s 😲.
Finishing Line 🏆 🎉
Amazing job! You’ve made it. We
- downloaded from GitHub ✔️
- unzipped ✔️
- tidied up ✔️ and
- produced some plots ✔️
This definitely warrants a cup of tea reward ☕ 🥇
Thanks for sticking with it and remember, even if you feel all of that went in one ear and out the other, this stuff takes some time to make sense and become natural. So, be patient with your computer and yourself and if at first you don’t succeed, try, try again! And ask for help!✋
You can find more Coding Club tutorials here if you want to explore any of the concepts covered or dive in deeper to more coding.
I’ll leave you with this very quaint scene of Irish farm animals 😊
Happy coding and mind yourself! 💜
For any questions or queries related to this tutorial please contact Kate Moloney at s1831776@ed.ac.uk