class: center, middle, inverse, title-slide .title[ # Lecture 14: Aggregation and Joins ] .author[ ### Robin Liu ] .institute[ ### UCSB ] .date[ ### 2022-07-14 ] --- class: inverse, middle, center # Aggregation --- # Advanced querying We will need the following. This assumes Chinook_Sqlite.sqlite is in your working directory. ```r library(RSQLite) library(sqldf) library(DBI) chinook_db <- dbConnect(SQLite(), "Chinook_Sqlite.sqlite") ``` Recall that `chinook_db` represents a connection to the database file. --- # Advanced querying - Last time we looked at simple queries on a single table. - Today: more complicated queries over several tables. --- # Aggregation Recall aggregation is done via certain functions (`max`, `min`, `sum`, `avg`, ...) What is the average length of a track in the album with AlbumId = 250? -- ```r dbGetQuery(chinook_db, "select avg(milliseconds) / 60000 as minutes from track where AlbumId = 250") ``` ``` ## minutes ## 1 21.6941 ``` But often we want to aggregate over *groups* of data, e.g. AlbumId. -- Find the average length of tracks for albums with AlbumId between 250 and 255, inclusive. Below is wrong! Why? ```r dbGetQuery(chinook_db, "select avg(milliseconds) / 60000 as minutes from track where AlbumId >= 250 and AlbumId <= 255") ``` ``` ## minutes ## 1 25.19051 ``` --- # Aggregation We need specify groups with `group by`. ```r dbGetQuery(chinook_db, "select albumid, avg(milliseconds) / 60000 as minutes from track group by AlbumId") |> as_tibble() ``` ``` ## # A tibble: 347 x 2 ## AlbumId minutes ## <int> <dbl> ## 1 1 4.00 ## 2 2 5.71 ## 3 3 4.77 ## 4 4 5.11 ## 5 5 4.90 ## 6 6 4.42 ## 7 7 4.51 ## 8 8 3.46 ## 9 9 5.57 ## 10 10 4.68 ## # ... with 337 more rows ``` --- # Aggregation Now our `where` clauses will operate on the groups ```r dbGetQuery(chinook_db, "select albumid, avg(milliseconds) / 60000 as minutes from track group by AlbumId where AlbumId >= 250 and AlbumId <= 255") ``` ``` ## Error: near "where": syntax error ``` -- Uh oh... --- # Aggregation When dealing with groups, we must use `having` instead of `where`. ```r dbGetQuery(chinook_db, "select albumid, avg(milliseconds) / 60000 as minutes from track group by AlbumId having AlbumId >= 250 and AlbumId <= 255") ``` ``` ## AlbumId minutes ## 1 250 21.694095 ## 2 251 25.544730 ## 3 252 5.179567 ## 4 253 48.759572 ## 5 254 41.409450 ## 6 255 3.720932 ``` Just another annoying thing to remember. --- # Aggregation Find the total number of tracks for albums with albumid between 10 and 15 inclusive. ```r dbGetQuery(chinook_db, "select albumid, count(*) from track group by AlbumId having AlbumId >= 10 and AlbumId <= 15") ``` ``` ## AlbumId count(*) ## 1 10 14 ## 2 11 12 ## 3 12 12 ## 4 13 8 ## 5 14 13 ## 6 15 5 ``` --- # Aggregation Add an alias for clarity: ```r dbGetQuery(chinook_db, "select albumid, count(*) as TrackCount from track group by AlbumId having AlbumId >= 10 and AlbumId <= 15") ``` ``` ## AlbumId TrackCount ## 1 10 14 ## 2 11 12 ## 3 12 12 ## 4 13 8 ## 5 14 13 ## 6 15 5 ``` --- # Aggregation Let's take a look at the Genre table: .pull-left[ ```r dbGetQuery(chinook_db, "select * from genre") ``` ``` ## GenreId Name ## 1 1 Rock ## 2 2 Jazz ## 3 3 Metal ## 4 4 Alternative & Punk ## 5 5 Rock And Roll ## 6 6 Blues ## 7 7 Latin ## 8 8 Reggae ## 9 9 Pop ## 10 10 Soundtrack ## 11 11 Bossa Nova ## 12 12 Easy Listening ## 13 13 Heavy Metal ## 14 14 R&B/Soul ## 15 15 Electronica/Dance ## 16 16 World ## 17 17 Hip Hop/Rap ## 18 18 Science Fiction ## 19 19 TV Shows ## 20 20 Sci Fi & Fantasy ## 21 21 Drama ## 22 22 Comedy ## 23 23 Alternative ## 24 24 Classical ## 25 25 Opera ``` ] .pull-right[ ![](Lec13_files/track-genre.png) ] --- # Aggregation Select the GenreId and size of the largest track in MB for each Genre with GenreId between 1 and 10 inclusive: ``` ## GenreId LargestMB ## 1 1 50.058893 ## 2 2 28.054029 ## 3 3 24.763794 ## 4 4 17.299500 ## 5 5 2.495748 ## 6 6 18.273713 ## 7 7 17.254580 ## 8 8 11.526608 ## 9 9 10.075287 ## 10 10 12.138298 ```
05
:
00
--- class: inverse, middle, center # Joins --- # Querying multiple tables - We have been writing progressively more complicated queries. - But the queries have still been on a single table, Track. In a relational DB, the data are split among multiple tables and related by their keys. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId from Track where AlbumId = 250 limit 5") ``` ``` ## TrackId Name AlbumId ## 1 3178 The Dundies 250 ## 2 3179 Sexual Harassment 250 ## 3 3180 Office Olympics 250 ## 4 3181 The Fire 250 ## 5 3182 Halloween 250 ``` What if we wanted to know the name of the album? --- # Querying multiple tables The album name is in the Album table. Consider the two queries: .pull-left[ ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId from Track where AlbumId = 250 limit 5") ``` ``` ## TrackId Name AlbumId ## 1 3178 The Dundies 250 ## 2 3179 Sexual Harassment 250 ## 3 3180 Office Olympics 250 ## 4 3181 The Fire 250 ## 5 3182 Halloween 250 ``` ] .pull-right[ ```r dbGetQuery(chinook_db, "select * from Album where AlbumId = 250") ``` ``` ## AlbumId Title ArtistId ## 1 250 The Office, Season 2 156 ``` ] Remember that: - AlbumId in Album is the primary key - AlbumId in Track is a foreign key to the AlbumId field in Album -- To combine the output into a single relation, we **join on the keys**. --- # Joins ```r dbGetQuery(chinook_db, "select TrackId, Name, Track.AlbumId, Title from Track inner join Album on Track.AlbumId = Album.AlbumId where Track.AlbumId = 250 limit 5") ``` ``` ## TrackId Name AlbumId Title ## 1 3178 The Dundies 250 The Office, Season 2 ## 2 3179 Sexual Harassment 250 The Office, Season 2 ## 3 3180 Office Olympics 250 The Office, Season 2 ## 4 3181 The Fire 250 The Office, Season 2 ## 5 3182 Halloween 250 The Office, Season 2 ``` `<Table 1> inner join <Table 2> on <key1> = <key2>` -- Why specify `Track.AlbumId`? Since both Track and Album tables have an AlbumId field, we need to clarify to SQL which table we mean. --- # Joins In a single query, return the track id, track name, album id, mediatype id, and mediatype name of all tracks with AlbumId = 10. ``` ## TrackId Name AlbumId MediaTypeId Name ## 1 85 Cochise 10 1 MPEG audio file ## 2 86 Show Me How to Live 10 1 MPEG audio file ## 3 87 Gasoline 10 1 MPEG audio file ## 4 88 What You Are 10 1 MPEG audio file ## 5 89 Like a Stone 10 1 MPEG audio file ## 6 90 Set It Off 10 1 MPEG audio file ## 7 91 Shadow on the Sun 10 1 MPEG audio file ## 8 92 I am the Highway 10 1 MPEG audio file ## 9 93 Exploder 10 1 MPEG audio file ## 10 94 Hypnotize 10 1 MPEG audio file ## 11 95 Bring'em Back Alive 10 1 MPEG audio file ## 12 96 Light My Way 10 1 MPEG audio file ## 13 97 Getaway Car 10 1 MPEG audio file ## 14 98 The Last Remaining Light 10 1 MPEG audio file ```
04
:
00
--- # A Three-way Join Note that Album.ArtistId is a foreign key to Artist.ArtistId. Let's add the Artist.Name to our result. -- We will two **two** joins on **three tables** 😳 ![](Lec14_files/track-album-artist.png) --- # A Three-way Join ```r dbGetQuery(chinook_db, "select TrackId, Track.Name, Track.AlbumId, Title, Artist.Name from Track inner join Album on Track.AlbumId = Album.AlbumId inner join Artist on Album.ArtistId = Artist.ArtistId where Track.AlbumId = 250 limit 5") ``` ``` ## TrackId Name AlbumId Title Name ## 1 3178 The Dundies 250 The Office, Season 2 The Office ## 2 3179 Sexual Harassment 250 The Office, Season 2 The Office ## 3 3180 Office Olympics 250 The Office, Season 2 The Office ## 4 3181 The Fire 250 The Office, Season 2 The Office ## 5 3182 Halloween 250 The Office, Season 2 The Office ``` --- # Table Alias We can make the above somewhat less painful by giving the table names an alias. ```r dbGetQuery(chinook_db, "select TrackId, t.Name, t.AlbumId, Title, ar.Name from Track t inner join Album al on t.AlbumId = al.AlbumId inner join Artist ar on al.ArtistId = ar.ArtistId where t.AlbumId = 250 limit 5") ``` ``` ## TrackId Name AlbumId Title Name ## 1 3178 The Dundies 250 The Office, Season 2 The Office ## 2 3179 Sexual Harassment 250 The Office, Season 2 The Office ## 3 3180 Office Olympics 250 The Office, Season 2 The Office ## 4 3181 The Fire 250 The Office, Season 2 The Office ## 5 3182 Halloween 250 The Office, Season 2 The Office ``` `TableName alias` - Track: t - Album: al - Artist: ar --- # Table Alias Finally, the two Name fields are confusing. Let's alias those field names. ```r dbGetQuery(chinook_db, "select TrackId, t.Name as TrackName, t.AlbumId, Title as AlbumTitle, ar.Name as AristName from Track t inner join Album al on t.AlbumId = al.AlbumId inner join Artist ar on al.ArtistId = ar.ArtistId where t.AlbumId = 250 limit 5") ``` ``` ## TrackId TrackName AlbumId AlbumTitle AristName ## 1 3178 The Dundies 250 The Office, Season 2 The Office ## 2 3179 Sexual Harassment 250 The Office, Season 2 The Office ## 3 3180 Office Olympics 250 The Office, Season 2 The Office ## 4 3181 The Fire 250 The Office, Season 2 The Office ## 5 3182 Halloween 250 The Office, Season 2 The Office ``` -- Now **that** is a SQL query! --- # A Three-way Join Produce the following relation with a three-way join. ``` ## TrackId TrackName AlbumId AlbumTitle GenreName ## 1 3178 The Dundies 250 The Office, Season 2 TV Shows ## 2 3179 Sexual Harassment 250 The Office, Season 2 TV Shows ## 3 3180 Office Olympics 250 The Office, Season 2 TV Shows ## 4 3181 The Fire 250 The Office, Season 2 TV Shows ## 5 3182 Halloween 250 The Office, Season 2 TV Shows ```
05
:
00
--- # Aggregation with Join Retrieve the average track size in MB for all tracks in each genre, along with the name of the genre. ``` ## AvgSizeMB GenreId GenreName ## 1 8.590101 1 Rock ## 2 9.048592 2 Jazz ## 3 8.806775 3 Metal ## 4 7.334712 4 Alternative & Punk ## 5 2.024901 5 Rock And Roll ## 6 8.225990 6 Blues ## 7 7.353390 7 Latin ## 8 7.855886 8 Reggae ## 9 4.525822 9 Pop ## 10 7.715961 10 Soundtrack ## 11 6.903703 11 Bossa Nova ## 12 5.875128 12 Easy Listening ## 13 9.035828 13 Heavy Metal ## 14 6.271292 14 R&B/Soul ## 15 10.196616 15 Electronica/Dance ## 16 6.798998 16 World ## 17 6.231992 17 Hip Hop/Rap ## 18 483.588204 18 Science Fiction ## 19 324.498823 19 TV Shows ## 20 508.242060 20 Sci Fi & Fantasy ## 21 483.462302 21 Drama ## 22 302.223650 22 Comedy ## 23 5.610918 23 Alternative ## 24 4.979044 24 Classical ## 25 2.728909 25 Opera ```
05
:
00
--- # More on joins `inner join` can be abbreviated with `join`. We don't need the word `inner`. ```r dbGetQuery(chinook_db, "select TrackId, Name, t.AlbumId, Title from Track t join Album al on t.AlbumId = al.AlbumId where t.AlbumId = 250 limit 3") ``` ``` ## TrackId Name AlbumId Title ## 1 3178 The Dundies 250 The Office, Season 2 ## 2 3179 Sexual Harassment 250 The Office, Season 2 ## 3 3180 Office Olympics 250 The Office, Season 2 ``` --- # More on joins What happens if we join without specifying the key to join on? We get the *Cartesian product* of all records from both tables. This is usually undesirable. ```r dbGetQuery(chinook_db, "select al.title, al.artistid, ar.artistid, ar.name from Album al join Artist ar limit 3") ``` ``` ## Title ArtistId ArtistId Name ## 1 For Those About To Rock We Salute You 1 1 AC/DC ## 2 For Those About To Rock We Salute You 1 2 Accept ## 3 For Those About To Rock We Salute You 1 3 Aerosmith ``` ```r dbGetQuery(chinook_db, "select al.title, al.artistid, ar.artistid, ar.name from Album al join Artist ar") |> nrow() ``` ``` ## [1] 95425 ``` --- # More on joins Some people prefer to do an *implicit join*, selecting the Cartesian product and then using a `where` clause to identify matching keys. They can do this without the `join` keyword by specifying a comma-separated list of tables. The query below has no `join` keyword! ```r dbGetQuery(chinook_db, "select al.title, al.artistid, ar.artistid, ar.name from Album al, Artist ar where al.artistid = ar.artistid limit 3") ``` ``` ## Title ArtistId ArtistId Name ## 1 For Those About To Rock We Salute You 1 1 AC/DC ## 2 Balls to the Wall 2 2 Accept ## 3 Restless and Wild 2 2 Accept ``` -- However, these people are wrong. Always specify the join explicitly. --- # More on joins Often times, the foreign key matches the primary key of the target table. ![](Lec14_files/track-album-artist.png) In this case, SQL can infer the key based on the field name via a `natural join`. The query below specifies no keys, but is joined on the matching field name, ArtistId. ```r dbGetQuery(chinook_db, "select * from album natural join artist limit 3") ``` ``` ## AlbumId Title ArtistId Name ## 1 1 For Those About To Rock We Salute You 1 AC/DC ## 2 2 Balls to the Wall 2 Accept ## 3 3 Restless and Wild 2 Accept ``` --- # More on joins ![](Lec14_files/track-album-artist.png) But why doesn't this work? ```r dbGetQuery(chinook_db, "select TrackId, Name, GenreId from track natural join genre") ``` ``` ## [1] TrackId Name GenreId ## <0 rows> (or 0-length row.names) ``` -- ### Opinion: **Avoid** natural joins. Don't be lazy; specify your keys explicitly. --- # Left join Suppose I am a new customer, but I have yet to order anything. ```r dbExecute(chinook_db, "insert into customer (CustomerId, FirstName, LastName, Email) values (888, 'Robin', 'Liu', 'r_liu@pstat.ucsb.edu')") ``` ``` ## [1] 1 ``` ```r dbGetQuery(chinook_db, "select CustomerId, FirstName, SupportRepId from customer where CustomerId = 888") ``` ``` ## CustomerId FirstName SupportRepId ## 1 888 Robin NA ``` Can we join customers to their invoices? --- # Left join ```r dbGetQuery(chinook_db, "select c.CustomerId, FirstName, i.InvoiceId, i.InvoiceDate, i.Total from Customer c inner join invoice i on c.customerid = i.customerid where c.CustomerId = 888") ``` ``` ## [1] CustomerId FirstName InvoiceId InvoiceDate Total ## <0 rows> (or 0-length row.names) ``` Empty as expected, since there are no invoices assigned to me. -- However, we may want to see this customer anyway. This is done with a **left join**. ```r dbGetQuery(chinook_db, "select c.CustomerId, FirstName, i.InvoiceId, i.InvoiceDate, i.Total from Customer c left join invoice i on c.customerid = i.customerid where c.CustomerId = 888") ``` ``` ## CustomerId FirstName InvoiceId InvoiceDate Total ## 1 888 Robin NA NA NA ``` --- # Left join The *left* table, Customer, contained a key, 888, that is not in the *right* table, Invoice. A left join returns all keys of the left table, and values in the right table with missing keys are filled with NULL (which turn into NA in R). .center[![:scale 65%](Lec14_files/joins.png)] --- # Joins summary - Specify keys explicitly - Avoid natural join - left/right/outer joins are very useful and you should know them. - But we won't discuss much beyond inner join in this class. - Remember by default a `join` is an `inner join`.