
Using R for people who don't use R

I wanted to do a short post on how to do something with the Search Console API with R. Backing up a bit, I thought I’d include a short summary of how to get started with R, and as you do, I’m now writing a separate post on how to do that. No, I won’t back up more and explain computers or how to connect your printer.

Setup

For today’s excursion we’ll use the programming language “R”. Why do they call it “R”? What happened to all the languages between “C” and “R”? Moving on. But why R? You could do the same things with Python, or most programming languages, but Python is for hipsters, and R is for real data scientists.

Today, we’ll be a data scientist, or a pirate (arrrrr!), or even a pirate data scientist.

Pirate data scientists are the coolest. Good job on finding this post and joining the club. You can add R programming to your resume/CV already - it’s such a weird language that nobody will ask you to do anything in a job interview. And in real life, you’ll just search around for code samples to copy & paste anyway. Trust me. I got all of this to work without really understanding what I’m doing either.

Anyway, R is available for most computers. There’s a simple IDE called RStudio which I use for R. The open source desktop edition is free. You don’t even need an editor. Install it and let’s move on. You might be prompted to install “R” as well - follow the directions for that. “R” is the programming language, “RStudio” is an easy way of working in “R”.

Using RStudio

Obviously, you should read the 291 page documentation. Someone should (I didn’t, maybe it’s not even 291 pages long). This is just my shortcut. It might not even be a good shortcut, as clearly I am no authority on R. I do sometimes EAT like a pirate though.

RStudio has a few quirks that you could get used to. The main window will look like this initially:

RStudio main view

(It turns out, a lot of these graphics get resized when I publish, so they look a bit terrible. Oops. But at least it’s fast.)

Setting a working directory

The only reasonable thing here will be the file explorer at the bottom right. Your first job will be to pick or make a working directory. Navigate to the right folder on your computer there, use the “New Folder” button to create a directory if needed.

Now for the magic: Click “More” and “Set as Working Directory”.

If you don’t do this, all files you read & write will end up somewhere else. It’s super-annoying, computers can be such jerks with details like this. Always remember to set the working directory first, even if it’s currently showing that directory.
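If you'd rather type than click, the same thing can be done from the console. This is just a rough sketch: the folder name "r-playground" is made up, and it lives in a temporary folder here so that the example is safe to run as-is. Swap in a real path on your own computer.

```r
# Show where R currently reads and writes files.
getwd()

# Hypothetical folder, placed in R's temporary directory for this example.
project_dir <- file.path(tempdir(), "r-playground")

# Create the folder if it doesn't exist yet, then switch to it.
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)
setwd(project_dir)

# Confirm that the working directory changed.
getwd()
```

setwd() is the command behind the “Set as Working Directory” menu item, so you'll see it pop up in the console when you click that too.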

Create a new R file

Now for the programming part. In your main menu (on the Mac, in your title bar), select File / New File / R Script.

This will open another quadrant in your RStudio window for your R script. This will be what things in RStudio will mostly look like.

Your first R program

Who am I to say what your first R program should do? I turned to Google, and apparently other sites like to create a list of random numbers and plot their distribution. With that in mind, here’s something you could try. Copy the code below into the top left quadrant.

n <- floor(rnorm(1000, 100, 10))
t <- table(n)
barplot(t)

Now click “Save” (the diskette-icon - who knows what disks are nowadays, wth), give it a file name, like “test”. And now you should have something like this:

For those used to programming languages, you assign values by using “<-”. R is a bit weird in that you can kinda assign values to functions too, but whatever floats your boat, R. Also, you can apply functions to individual numbers, vectors, or arrays all at once.

You don’t really have to understand the code here, but very roughly:

rnorm() creates a vector of 1000 random numbers drawn from a normal distribution with a mean of 100 and a standard deviation of 10 (so almost all of them land between 70 and 130, math is weird too). floor() rounds them down to whole numbers. These are now assigned to the variable “n”.

table() counts the individual occurrences of each number and places them into the variable “t”.

barplot() then just shows that as a graph.
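If you don't trust the math (fair), you can check the numbers yourself. A small sketch: set.seed() and the seed value 42 are my additions, they just make the "random" numbers come out the same on every run.

```r
set.seed(42)  # arbitrary seed so the random numbers repeat on every run

n <- floor(rnorm(1000, mean = 100, sd = 10))

mean(n)    # should land close to 100 (slightly below, since floor() rounds down)
sd(n)      # should land close to 10
range(n)   # smallest and largest value drawn

# Share of values between 70 and 130 (three standard deviations either side).
mean(n >= 70 & n <= 130)

t <- table(n)
head(t)    # counts for the first few distinct values
```

Note how mean(), sd() and the comparison all work on the whole vector at once. That's the "apply functions to everything at once" part.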

Running your R script

Clearly, you just hit the “run” or “play” button, and it’ll go, right? No. The “Run” button only runs the current line or whatever you’ve selected. Remember, R is for scientists, so to run the whole script you must click the “Source” button instead. Nobody knows how that happened, it just is.
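The “Source” button is a shortcut for R's source() function, which you can also call yourself. A minimal sketch: it first writes a tiny script to a temporary file (a stand-in for the “test” file you saved), so the example works without depending on your folders. I left out barplot() here to keep it short.

```r
# Write a tiny script to a temporary file (hypothetical stand-in for "test.R").
script_path <- file.path(tempdir(), "test.R")
writeLines(c(
  "n <- floor(rnorm(1000, 100, 10))",
  "t <- table(n)"
), script_path)

# Run the whole file, top to bottom, like the "Source" button does.
source(script_path)

# The variables created by the script are now available.
length(n)
```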

Outcome

If all goes well, your RStudio UI should now look like this:

You can notice a few things here:

  • The console quadrant (bottom left) mentions the “source()” command. You can enter any R command here, and it’ll be processed. This is useful for when you have no idea what you’re doing, and need to try things out.
  • The file quadrant now shows a graph. What the heck, huh? So cool. But also, why.
  • The top right quadrant shows your variables. This is kinda useful for figuring things out.
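A few things worth typing into the console when you're poking around. This is only a sketch; it recreates “n” and “t” first so that it runs on its own.

```r
n <- floor(rnorm(1000, 100, 10))
t <- table(n)

class(n)       # what kind of thing is n?
length(n)      # how many numbers are in it?
summary(n)     # minimum, quartiles, mean, maximum
head(n)        # the first few values
names(t)[1:5]  # the first few distinct values that were counted

# Built-in help for a function; uncomment to try it in RStudio.
# ?rnorm
```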

If you run the script a few times (remember, the “Source” button - we’re data scientists here), it’ll create new sets of random numbers and generate new graphs. Try it out. Clicking stuff is cool, but also, to show how to deal with these graphs we’re going to need them.

When you have multiple graphs, you can switch between graphs (“plots” in data-scientist-eze) using the arrows:

In the same place, you can export these graphs to save them as files, or copy them into your clipboard if you’re writing a report.
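You can also save a graph straight from your script instead of clicking. A hedged sketch using the pdf() device that comes with R; the file name "random-numbers.pdf" is made up, and it goes into a temporary folder here so nothing lands in your real folders.

```r
n <- floor(rnorm(1000, 100, 10))

# Hypothetical file name, placed in R's temporary directory for this example.
plot_file <- file.path(tempdir(), "random-numbers.pdf")

pdf(plot_file, width = 8, height = 6)  # open a PDF file as the drawing target (size in inches)
barplot(table(n))                      # draw into that file instead of the plots pane
dev.off()                              # close the file so it gets written to disk

file.exists(plot_file)
```

png() works the same way if you want an image file, with width and height given in pixels by default. Forgetting dev.off() is the classic mistake: the file stays empty or broken until you close it.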

(I really like this set of random numbers. Arrrr.)

Using packages

The default R installation doesn’t have all the cool stuff. If you use Stack Overflow regularly to copy and paste code, I mean to learn, you’ll see mentions of other “packages” or “libraries”. Installing these is often pain-free. You need an internet connection though (this is kinda assumed anyway nowadays, it’s not like we’re a pirate on a boat in the ocean, oh wait).

For R, there are always two steps involved: install the package, and then use the library. Why they don’t call it the same thing, I don’t know. Gate-keeping by data scientists, obviously.

Let’s try one out.

Step 1: install the package.

In the console quadrant (bottom left), copy the following and hit enter:

install.packages("ggplot2")

This will now install the ggplot2 library. This library helps to make nice graphics. If you’re curious, there’s a big collection of R graphics that you can use to copy & paste in your code, many of them use ggplot2.

Your console should show something like this now (the exact content will differ):

Step 2: load the library in your code.

Let’s start a new script (menu: File / New File / R Script), and use the following code:

library(ggplot2)

ggplot(mpg, aes(displ, hwy, colour = class)) + 
     geom_point()

The first line (library(…)) loads the ggplot2 library. You only have to install it once, in future scripts you can just load it like that. RStudio tries to figure out which libraries it needs too, and helps you to remember to install them. If you don’t install them, the script won’t work.
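If you share scripts with others (or with future you on a new computer), you can make the script check whether a package is there before loading it. A small sketch: install_if_missing() is a helper I made up for this example, it's not part of R.

```r
# Hypothetical helper: install a package only when it's missing.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing.
install_if_missing("stats")

# For the package used in this post you'd write:
# install_if_missing("ggplot2")
# library(ggplot2)
```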

The next two lines use ggplot() to create a graph (plot). ggplot() takes the dataset (“mpg”), the items in there you want to graph with aes(), and then adds the type of graphic (“geom_point()”) that you want to do. ggplot() does these weird things with just adding things together with “+” to combine them.
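To show the “+” thing a bit more, here's a sketch that stores the plot in a variable and then adds labels to it. The label texts are mine, and the whole thing is wrapped in a check so it quietly does nothing if ggplot2 isn't installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Store the plot in a variable instead of drawing it right away.
  p <- ggplot(mpg, aes(displ, hwy, colour = class)) +
    geom_point()

  # Add more pieces with "+": a title and axis labels.
  p <- p + labs(
    title = "Engine size vs. highway mileage",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

  # Draw the plot. In the console, typing p on its own does the same.
  print(p)
}
```

The explicit print() matters inside blocks like this “if”, and when a script is run through source(): in those cases R doesn't show results automatically.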

You might wonder where the data used in the graphic comes from - how did we suddenly get “MPG” data and car types? R and its packages include a number of small data sources that you can use for trying things out. It makes it a bit easier to mess with simple graphics before you use your real data. In this case, it’s some older car information: the “mpg” dataset, which comes with ggplot2. There’s also a separate “mtcars” dataset built into R itself (confusingly, it has a column called “mpg”). If you spot random car-related statistics and graphics in R, now you know why.
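You can look at these sample datasets like any other variable. A quick sketch using “mtcars”, which comes with R itself, plus a peek at “mpg” that only runs if ggplot2 is installed.

```r
# mtcars ships with R itself (in the "datasets" package).
head(mtcars)   # first six rows
str(mtcars)    # column names and types
nrow(mtcars)   # number of cars in the table

# mpg ships with ggplot2, so it's only there once that package is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(head(ggplot2::mpg))
}

# Calling data() with no arguments lists the datasets that are available.
```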

Side-note about dplyr

Another weird setup is “dplyr”, which you might encounter with R. I’m not covering it in this post, this is just FYI. Code that uses it tends to route the output from one part into another part of code using “%>%” (the “pipe”), with the goal of making it easy to write (and hard for mortals to understand, I guess). In R, it’ll look something like this:

firstthing() %>% secondthing()

“dplyr” is a separate library, so you install it in the same way as previously mentioned, etc. A lot of R-related code snippets offer both dplyr and “normal” R code variations. You can probably get around with not using it at all, but once you’ve seen how it works, it’s not that weir … ok, it is still weird.
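Here's a tiny side-by-side, as a sketch only. It uses the built-in “mtcars” data, and the piped half only runs if dplyr is installed (the “%>%” operator comes along with dplyr, from a package called magrittr).

```r
# "Normal" R: read it inside-out.
normal_result <- round(mean(mtcars$mpg), 1)
normal_result

# Piped: read it left to right.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  piped_result <- mtcars$mpg %>% mean() %>% round(1)
  print(piped_result)
}
```

Both variations do the same thing: take the column, average it, round it. The pipe just puts the steps in the order you'd say them out loud.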

That’s mostly it

At this point, you should be ready to do things in RStudio. Remember, R is weird, and the names of things are a bit confusing at first, so use your favorite search engine whenever you get stuck. Regardless, I hope this helps to get you started.

Sidenotes

Random comments …

  • You can also use normal variable assignments “varname = function()” in R. This looks too normal though, so it’s discouraged. Use “varname <- function()” instead.
  • I’ll add more things here when I remember.
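For the curious, a short sketch of the “=” vs “<-” thing. At the top level they do the same; inside a function call they don't.

```r
a = 5    # works
b <- 5   # also works, and is what most R code uses
identical(a, b)

# One place where they differ: inside a function call.
mean(x = c(1, 2, 3))    # "=" names the argument; it doesn't create a variable
mean(y <- c(1, 2, 3))   # "<-" creates a variable y, then passes it along

exists("y")  # y is now a variable in your workspace
```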

Comments / questions

There's currently no commenting functionality here. If you'd like to comment, please use Mastodon and mention me ( @hi@johnmu.com ) there. Thanks!