Lecture 13: SQL Queries

class: center, middle, inverse, title-slide

.title[
# Lecture 13: SQL Queries
]
.author[
### Robin Liu
]
.institute[
### UCSB
]
.date[
### 2022-07-11
]

---

class: inverse, middle, center
# Querying data
---
# Querying data
We will need the following. This assumes Chinook_Sqlite.sqlite is in your working directory.

```r
library(RSQLite)
library(sqldf)
library(DBI)
chinook_db <- dbConnect(SQLite(), "Chinook_Sqlite.sqlite")
```

Recall that `chinook_db` represents a connection to the database file.
---
# Chinook structure
.pull-left[
![](Lec12_files/chinook.png)
]
.pull-right[
Online music store that:

- stores track information, albums, artists, etc
- stores customer purchases of tracks
]
---
# Querying data
Basic SQL query syntax:
`select <fields> from <table> [where <conditions>]`

`select *` returns all of the fields and leaving out `where`
returns all of the rows.

What are the values in all the fields for every track?
The query below returns everything:

```r
dbGetQuery(chinook_db, "select * from track") |> str()
```

```
## 'data.frame':	3503 obs. of  9 variables:
##  $ TrackId     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "For Those About To Rock (We Salute You)" "Balls to the Wall" "Fast As a Shark" "Restless and Wild" ...
##  $ AlbumId     : int  1 2 3 3 3 1 1 1 1 1 ...
##  $ MediaTypeId : int  1 2 2 2 2 1 1 1 1 1 ...
##  $ GenreId     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Composer    : chr  "Angus Young, Malcolm Young, Brian Johnson" NA "F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman" "F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman" ...
##  $ Milliseconds: int  343719 342562 230619 252051 375418 205662 233926 210834 203102 263497 ...
##  $ Bytes       : int  11170334 5510424 3990994 4331779 6290521 6713451 7636561 6852860 6599424 8611245 ...
##  $ UnitPrice   : num  0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 ...
```

---
# Querying data
That was too much data. Here are the first 5 records:

```r
dbGetQuery(chinook_db, "select * from track limit 5")
```

```
##   TrackId                                    Name AlbumId MediaTypeId GenreId
## 1       1 For Those About To Rock (We Salute You)       1           1       1
## 2       2                       Balls to the Wall       2           2       1
## 3       3                         Fast As a Shark       3           2       1
## 4       4                       Restless and Wild       3           2       1
## 5       5                    Princess of the Dawn       3           2       1
##                                                                 Composer
## 1                              Angus Young, Malcolm Young, Brian Johnson
## 2                                                                   <NA>
## 3                    F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman
## 4 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman
## 5                                             Deaffy & R.A. Smith-Diesel
##   Milliseconds    Bytes UnitPrice
## 1       343719 11170334      0.99
## 2       342562  5510424      0.99
## 3       230619  3990994      0.99
## 4       252051  4331779      0.99
## 5       375418  6290521      0.99
```

---
# Querying data
Still too much data. Let's cut down on the number of attributes.
Suppose we only want the TrackId, Name, AlbumId, and Milliseconds.

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from track limit 5")
```

```
##   TrackId                                    Name AlbumId Milliseconds
## 1       1 For Those About To Rock (We Salute You)       1       343719
## 2       2                       Balls to the Wall       2       342562
## 3       3                         Fast As a Shark       3       230619
## 4       4                       Restless and Wild       3       252051
## 5       5                    Princess of the Dawn       3       375418
```

Let's find which are the longest and shortest tracks using the `where` clause.

---
# Querying data
Find all tracks shorter than 30,000ms (= 30s)

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds
                          from Track where Milliseconds < 30000")
```

```
##   TrackId                     Name AlbumId Milliseconds
## 1     168               Now Sports      18         4884
## 2     170              A Statistic      18         6373
## 3     172         The Real Problem      18        11650
## 4     178                    Oprah      18         6635
## 5    2241                    Bossa     184        29048
## 6    2461 É Uma Partida De Futebol     200         1071
## 7    3304             Commercial 1     258         7941
## 8    3310             Commercial 2     258        21211
```

We can retrieve tracks shorter than 30,000ms from the album with AlbumId = 18
using multiple `where` clauses
---
# Querying data
Retrieve all tracks shorter than 30,000ms from the album with AlbumId = 18

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds
                          from Track where Milliseconds < 30000 and AlbumId = 18")
```

```
##   TrackId             Name AlbumId Milliseconds
## 1     168       Now Sports      18         4884
## 2     170      A Statistic      18         6373
## 3     172 The Real Problem      18        11650
## 4     178            Oprah      18         6635
```

`select <fields> from <tables> where <clause 1> and <clause 2> and ...`

What if we used `or`?
---
# Querying data
Retrieve all tracks **either** shorter than 30,000ms **or** with AlbumId = 18

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds
                          from Track where Milliseconds < 30000 or AlbumId = 18")
```
<div id="htmlwidget-3d20f4584c0bb6ff0b90" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-3d20f4584c0bb6ff0b90">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21"],[166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,2241,2461,3304,3310],["Smoked Pork","Body Count's In The House","Now Sports","Body Count","A Statistic","Bowels Of The Devil","The Real Problem","KKK Bitch","D Note","Voodoo","The Winner Loses","There Goes The Neighborhood","Oprah","Evil Dick","Body Count Anthem","Momma's Gotta Die Tonight","Freedom Of Speech","Bossa","É Uma Partida De Futebol","Commercial 1","Commercial 2"],[18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,184,200,258,258],[47333,204251,4884,317936,6373,223216,11650,173008,95738,300721,392254,350171,6635,239020,166426,371539,281234,29048,1071,7941,21211]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>TrackId<\/th>\n      <th>Name<\/th>\n      <th>AlbumId<\/th>\n      <th>Milliseconds<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":5,"columnDefs":[{"className":"dt-right","targets":[1,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[5,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>
---
# Querying data
We saw that the album with AlbumId = 18 has some short songs.
Retrieve all songs under 30000 milliseconds from albums excluding AlbumId = 18.

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds
                          from Track where Milliseconds < 30000 and not AlbumId = 18")
```

```
##   TrackId                     Name AlbumId Milliseconds
## 1    2241                    Bossa     184        29048
## 2    2461 É Uma Partida De Futebol     200         1071
## 3    3304             Commercial 1     258         7941
## 4    3310             Commercial 2     258        21211
```

---
# Querying data
SQL is designed to mirror human language. Hence
using `and`, `or`, `not` instead of the symbols `&`, `|`, `!`.

However there is a "not equal" operator.
The standard SQL "not" operator is `<>`. Although `!=` works in SQLite, stick with the standard.

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds
                          from Track where Milliseconds < 30000 and AlbumId <> 18")
```

---
# Your turn
Retrieve the TrackId, Name, AlbumId, and Bytes for all tracks
between 300,000,000 and 400,000,000 bytes (approx between 286MB and 381MB)

```
##   TrackId                         Name AlbumId     Bytes
## 1    3199 Casino Night - Season Finale     250 327642458
## 2    3206               Branch Closing     251 358761786
## 3    3207                   The Merger     251 345960631
## 4    3212   Producer's Cut: The Return     251 337219980
## 5    3217              The Negotiation     251 371663719
## 6    3220         Women's Appreciation     251 338778844
## 7    3221                  Beach Games     251 333671149
## 8    3428               Branch Closing     251 360331351
## 9    3429                   The Return     251 343877320
```
<div class="countdown" id="timer_62cc4779" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">03</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---
# Querying data
Find all tracks named "Branch Closing"

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId from track
                          where Name = 'Branch Closing'")
```

```
##   TrackId           Name AlbumId
## 1    3206 Branch Closing     251
## 2    3428 Branch Closing     251
```

Note the single quotes `'` within the double quotes `"`. This is how
strings-within-strings are done in R. You can also switch the roles:

```r
dbGetQuery(chinook_db, 'select TrackId, Name, AlbumId from track
                          where Name = "Branch Closing"')
```

```
##   TrackId           Name AlbumId
## 1    3206 Branch Closing     251
## 2    3428 Branch Closing     251
```

But if you don't nest them properly, it is trouble.

---
# SQL is case-insensitive, except...
SQL commands are case-insensitive:

```r
dbGetQuery(chinook_db, "SeLecT tRaCkiD, nAmE, aLbUmId FrOm TrAcK
                          wHeRe NaMe = 'Branch Closing'")
```

```
##   TrackId           Name AlbumId
## 1    3206 Branch Closing     251
## 2    3428 Branch Closing     251
```
![](Lec13_files/plaid.png)

---
# SQL is case-insensitive, except...
But watch out when doing string-matching

```r
dbGetQuery(chinook_db, "SeLecT tRaCkiD, nAmE, aLbUmId FrOm TrAcK
                          wHeRe NaMe = 'BrAnCh ClOsInG'")
```

```
## [1] TrackId Name    AlbumId
## <0 rows> (or 0-length row.names)
```

You will often see people write sql commands in all caps and match the case
of the column names.

```r
dbGetQuery(chinook_db, "SELECT TrackId, Name, AlbumId FROM Track
                          WHERE Name = 'Branch Closing'")
```

```
##   TrackId           Name AlbumId
## 1    3206 Branch Closing     251
## 2    3428 Branch Closing     251
```

But I find it unnecessary. Lowercase letters are easier to read.

---
# Computations
We can `select` things other than fields.
Below we get two records since our `where` clause yields two records.

```r
dbGetQuery(chinook_db, "select 'pstat', 10 from track where Name = 'Branch Closing'")
```

```
##   'pstat' 10
## 1   pstat 10
## 2   pstat 10
```

SQL can do a surprising number of mathematical operations:

```r
dbGetQuery(chinook_db, "select power(2, 10) from track where Name = 'Branch Closing'")
```

```
##   power(2, 10)
## 1         1024
## 2         1024
```

---
# Computations
Here are the tracks with bytes
between 300,000,000 and 400,000,000 bytes (approx between 286MB and 381MB)

Since it is easier to consider MB, lets convert the units to MB.
There are `$2^{20}$` bytes in a megabyte.

---
# Computations

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) from track
                          where Bytes between 300000000 and 400000000")
```

```
##   TrackId                         Name AlbumId Bytes / power(2, 20)
## 1    3199 Casino Night - Season Finale     250             312.4642
## 2    3206               Branch Closing     251             342.1419
## 3    3207                   The Merger     251             329.9338
## 4    3212   Producer's Cut: The Return     251             321.5980
## 5    3217              The Negotiation     251             354.4461
## 6    3220         Women's Appreciation     251             323.0847
## 7    3221                  Beach Games     251             318.2136
## 8    3428               Branch Closing     251             343.6388
## 9    3429                   The Return     251             327.9470
```

Note the name of the new field. Can we clean it up?
---
# Computations
We can **alias** the field names using `AS`:

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
                          from track where Bytes between 300000000 and 400000000")
```

```
##   TrackId                         Name AlbumId       MB
## 1    3199 Casino Night - Season Finale     250 312.4642
## 2    3206               Branch Closing     251 342.1419
## 3    3207                   The Merger     251 329.9338
## 4    3212   Producer's Cut: The Return     251 321.5980
## 5    3217              The Negotiation     251 354.4461
## 6    3220         Women's Appreciation     251 323.0847
## 7    3221                  Beach Games     251 318.2136
## 8    3428               Branch Closing     251 343.6388
## 9    3429                   The Return     251 327.9470
```

---
# Computations
We can even use the alias in our `where` clause:

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
                          from track where MB between 300 and 400")
```

---
# Computations
Retrieve all tracks from AlbumId = 18 that are longer than 3 minutes.
Rename the Name field as TrackName.

```
##   TrackId                   TrackName AlbumId Minutes
## 1     169                  Body Count      18       5
## 2     175                      Voodoo      18       5
## 3     176            The Winner Loses      18       6
## 4     177 There Goes The Neighborhood      18       5
## 5     181   Momma's Gotta Die Tonight      18       6
## 6     182           Freedom Of Speech      18       4
```

It's okay if your minutes are rounded like mine.

---
# Order by
The `order by` command returns records sorted by a column.

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
                          from track where MB between 300 and 400
                          order by MB")
```

```
##   TrackId                         Name AlbumId       MB
## 1    3199 Casino Night - Season Finale     250 312.4642
## 2    3221                  Beach Games     251 318.2136
## 3    3212   Producer's Cut: The Return     251 321.5980
## 4    3220         Women's Appreciation     251 323.0847
## 5    3429                   The Return     251 327.9470
## 6    3207                   The Merger     251 329.9338
## 7    3206               Branch Closing     251 342.1419
## 8    3428               Branch Closing     251 343.6388
## 9    3217              The Negotiation     251 354.4461
```

The default is ascending.
---
# Order by
We can break ties by specifying a second column.

```r
dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
                          from track where MB between 300 and 400
                          order by AlbumId, TrackID desc")
```

```
##   TrackId                         Name AlbumId       MB
## 1    3199 Casino Night - Season Finale     250 312.4642
## 2    3429                   The Return     251 327.9470
## 3    3428               Branch Closing     251 343.6388
## 4    3221                  Beach Games     251 318.2136
## 5    3220         Women's Appreciation     251 323.0847
## 6    3217              The Negotiation     251 354.4461
## 7    3212   Producer's Cut: The Return     251 321.5980
## 8    3207                   The Merger     251 329.9338
## 9    3206               Branch Closing     251 342.1419
```

`desc` means decreasing. First sort by increasing AlbumId, then
in the case of a tie, sort by decreasing TrackId.

---
# Aggregation
SQL contains aggregation functions that summarize numerical data:
`sum`, `min`, `max`, `count`, ...

Which track in AlbumId = 250 has the largest size in terms of bytes?

```r
dbGetQuery(chinook_db, "select max(bytes) from track where AlbumId = 250")
```

```
##   max(bytes)
## 1  327642458
```

What is the total length of the album with AlbumId = 250?

```r
dbGetQuery(chinook_db, "select sum(milliseconds) / 60000 as minutes from track
                          where AlbumId = 250")
```

```
##   minutes
## 1     477
```

---
# Aggregation
How many customers are represented in Chinook?

```r
dbGetQuery(chinook_db, "select count(*) from customer")
```

```
##   count(*)
## 1       59
```

How many customers in Chinook are from the USA?

```r
dbGetQuery(chinook_db, "select count(*) from customer where country = 'USA'")
```

```
##   count(*)
## 1       13
```

---
# A note about style
It is best to indent SQL queries according to their logical function.

```default
select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
  from track where MB between 300 and 400
  order by AlbumId, TrackID desc
```

Also acceptable:

```default
select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB
  from track
    where MB between 300 and 400
  order by AlbumId, TrackID desc
```

The specific indentation doesn't matter.
Just focus on readability and remember to not exceed 80 chars.

---
# Summary
SQL queries are `SELECT...FROM` statements.
- case insensitive
- logical operations with `where`
- numeric operations
- aliases
- `group by`

## Next time
- More on aggregation.
- Joins