class: center, middle, inverse, title-slide .title[ # Lecture 13: SQL Queries ] .author[ ### Robin Liu ] .institute[ ### UCSB ] .date[ ### 2022-07-11 ] --- class: inverse, middle, center # Querying data --- # Querying data We will need the following. This assumes Chinook_Sqlite.sqlite is in your working directory. ```r library(RSQLite) library(sqldf) library(DBI) chinook_db <- dbConnect(SQLite(), "Chinook_Sqlite.sqlite") ``` Recall that `chinook_db` represents a connection to the database file. --- # Chinook structure .pull-left[ ![](Lec12_files/chinook.png) ] .pull-right[ Online music store that: - stores track information, albums, artists, etc - stores customer purchases of tracks ] --- # Querying data Basic SQL query syntax: `select <fields> from <table> [where <conditions>]` `select *` returns all of the fields and leaving out `where` returns all of the rows. What are the values in all the fields for every track? The query below returns everything: ```r dbGetQuery(chinook_db, "select * from track") |> str() ``` ``` ## 'data.frame': 3503 obs. of 9 variables: ## $ TrackId : int 1 2 3 4 5 6 7 8 9 10 ... ## $ Name : chr "For Those About To Rock (We Salute You)" "Balls to the Wall" "Fast As a Shark" "Restless and Wild" ... ## $ AlbumId : int 1 2 3 3 3 1 1 1 1 1 ... ## $ MediaTypeId : int 1 2 2 2 2 1 1 1 1 1 ... ## $ GenreId : int 1 1 1 1 1 1 1 1 1 1 ... ## $ Composer : chr "Angus Young, Malcolm Young, Brian Johnson" NA "F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman" "F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman" ... ## $ Milliseconds: int 343719 342562 230619 252051 375418 205662 233926 210834 203102 263497 ... ## $ Bytes : int 11170334 5510424 3990994 4331779 6290521 6713451 7636561 6852860 6599424 8611245 ... ## $ UnitPrice : num 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 ... ``` --- # Querying data That was too much data. Here are the first 5 records: ```r dbGetQuery(chinook_db, "select * from track limit 5") ``` ``` ## TrackId Name AlbumId MediaTypeId GenreId ## 1 1 For Those About To Rock (We Salute You) 1 1 1 ## 2 2 Balls to the Wall 2 2 1 ## 3 3 Fast As a Shark 3 2 1 ## 4 4 Restless and Wild 3 2 1 ## 5 5 Princess of the Dawn 3 2 1 ## Composer ## 1 Angus Young, Malcolm Young, Brian Johnson ## 2 <NA> ## 3 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman ## 4 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman ## 5 Deaffy & R.A. Smith-Diesel ## Milliseconds Bytes UnitPrice ## 1 343719 11170334 0.99 ## 2 342562 5510424 0.99 ## 3 230619 3990994 0.99 ## 4 252051 4331779 0.99 ## 5 375418 6290521 0.99 ``` --- # Querying data Still too much data. Let's cut down on the number of attributes. Suppose we only want the TrackId, Name, AlbumId, and Milliseconds. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from track limit 5") ``` ``` ## TrackId Name AlbumId Milliseconds ## 1 1 For Those About To Rock (We Salute You) 1 343719 ## 2 2 Balls to the Wall 2 342562 ## 3 3 Fast As a Shark 3 230619 ## 4 4 Restless and Wild 3 252051 ## 5 5 Princess of the Dawn 3 375418 ``` Let's find which are the longest and shortest tracks using the `where` clause. --- # Querying data Find all tracks shorter than 30,000ms (= 30s) ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from Track where Milliseconds < 30000") ``` ``` ## TrackId Name AlbumId Milliseconds ## 1 168 Now Sports 18 4884 ## 2 170 A Statistic 18 6373 ## 3 172 The Real Problem 18 11650 ## 4 178 Oprah 18 6635 ## 5 2241 Bossa 184 29048 ## 6 2461 É Uma Partida De Futebol 200 1071 ## 7 3304 Commercial 1 258 7941 ## 8 3310 Commercial 2 258 21211 ``` -- We can retrieve tracks shorter than 30,000ms from the album with AlbumId = 18 using multiple `where` clauses --- # Querying data Retrieve all tracks shorter than 30,000ms from the album with AlbumId = 18 ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from Track where Milliseconds < 30000 and AlbumId = 18") ``` ``` ## TrackId Name AlbumId Milliseconds ## 1 168 Now Sports 18 4884 ## 2 170 A Statistic 18 6373 ## 3 172 The Real Problem 18 11650 ## 4 178 Oprah 18 6635 ``` `select <fields> from <tables> where <clause 1> and <clause 2> and ...` What if we used `or`? --- # Querying data Retrieve all tracks **either** shorter than 30,000ms **or** with AlbumId = 18 ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from Track where Milliseconds < 30000 or AlbumId = 18") ```
--- # Querying data We saw that the album with AlbumId = 18 has some short songs. Retrieve all songs under 30000 milliseconds from albums excluding AlbumId = 18. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from Track where Milliseconds < 30000 and not AlbumId = 18") ``` ``` ## TrackId Name AlbumId Milliseconds ## 1 2241 Bossa 184 29048 ## 2 2461 É Uma Partida De Futebol 200 1071 ## 3 3304 Commercial 1 258 7941 ## 4 3310 Commercial 2 258 21211 ``` --- # Querying data SQL is designed to mirror human language. Hence using `and`, `or`, `not` instead of the symbols `&`, `|`, `!`. However there is a "not equal" operator. The standard SQL "not" operator is `<>`. Although `!=` works in SQLite, stick with the standard. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Milliseconds from Track where Milliseconds < 30000 and AlbumId <> 18") ``` ``` ## TrackId Name AlbumId Milliseconds ## 1 2241 Bossa 184 29048 ## 2 2461 É Uma Partida De Futebol 200 1071 ## 3 3304 Commercial 1 258 7941 ## 4 3310 Commercial 2 258 21211 ``` --- # Your turn Retrieve the TrackId, Name, AlbumId, and Bytes for all tracks between 300,000,000 and 400,000,000 bytes (approx between 286MB and 381MB) ``` ## TrackId Name AlbumId Bytes ## 1 3199 Casino Night - Season Finale 250 327642458 ## 2 3206 Branch Closing 251 358761786 ## 3 3207 The Merger 251 345960631 ## 4 3212 Producer's Cut: The Return 251 337219980 ## 5 3217 The Negotiation 251 371663719 ## 6 3220 Women's Appreciation 251 338778844 ## 7 3221 Beach Games 251 333671149 ## 8 3428 Branch Closing 251 360331351 ## 9 3429 The Return 251 343877320 ```
03
:
00
--- # Querying data Find all tracks named "Branch Closing" ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId from track where Name = 'Branch Closing'") ``` ``` ## TrackId Name AlbumId ## 1 3206 Branch Closing 251 ## 2 3428 Branch Closing 251 ``` Note the single quotes `'` within the double quotes `"`. This is how strings-within-strings are done in R. You can also switch the roles: ```r dbGetQuery(chinook_db, 'select TrackId, Name, AlbumId from track where Name = "Branch Closing"') ``` ``` ## TrackId Name AlbumId ## 1 3206 Branch Closing 251 ## 2 3428 Branch Closing 251 ``` But if you don't nest them properly, it is trouble. --- # SQL is case-insensitive, except... SQL commands are case-insensitive: ```r dbGetQuery(chinook_db, "SeLecT tRaCkiD, nAmE, aLbUmId FrOm TrAcK wHeRe NaMe = 'Branch Closing'") ``` ``` ## TrackId Name AlbumId ## 1 3206 Branch Closing 251 ## 2 3428 Branch Closing 251 ``` ![](Lec13_files/plaid.png) --- # SQL is case-insensitive, except... But watch out when doing string-matching ```r dbGetQuery(chinook_db, "SeLecT tRaCkiD, nAmE, aLbUmId FrOm TrAcK wHeRe NaMe = 'BrAnCh ClOsInG'") ``` ``` ## [1] TrackId Name AlbumId ## <0 rows> (or 0-length row.names) ``` -- You will often see people write sql commands in all caps and match the case of the column names. ```r dbGetQuery(chinook_db, "SELECT TrackId, Name, AlbumId FROM Track WHERE Name = 'Branch Closing'") ``` ``` ## TrackId Name AlbumId ## 1 3206 Branch Closing 251 ## 2 3428 Branch Closing 251 ``` But I find it unnecessary. Lowercase letters are easier to read. --- # Computations We can `select` things other than fields. Below we get two records since our `where` clause yields two records. ```r dbGetQuery(chinook_db, "select 'pstat', 10 from track where Name = 'Branch Closing'") ``` ``` ## 'pstat' 10 ## 1 pstat 10 ## 2 pstat 10 ``` SQL can do a surprising number of mathematical operations: ```r dbGetQuery(chinook_db, "select power(2, 10) from track where Name = 'Branch Closing'") ``` ``` ## power(2, 10) ## 1 1024 ## 2 1024 ``` --- # Computations Here are the tracks with bytes between 300,000,000 and 400,000,000 bytes (approx between 286MB and 381MB) ``` ## TrackId Name AlbumId Bytes ## 1 3199 Casino Night - Season Finale 250 327642458 ## 2 3206 Branch Closing 251 358761786 ## 3 3207 The Merger 251 345960631 ## 4 3212 Producer's Cut: The Return 251 337219980 ## 5 3217 The Negotiation 251 371663719 ## 6 3220 Women's Appreciation 251 338778844 ## 7 3221 Beach Games 251 333671149 ## 8 3428 Branch Closing 251 360331351 ## 9 3429 The Return 251 343877320 ``` Since it is easier to consider MB, lets convert the units to MB. There are `\(2^{20}\)` bytes in a megabyte. --- # Computations ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) from track where Bytes between 300000000 and 400000000") ``` ``` ## TrackId Name AlbumId Bytes / power(2, 20) ## 1 3199 Casino Night - Season Finale 250 312.4642 ## 2 3206 Branch Closing 251 342.1419 ## 3 3207 The Merger 251 329.9338 ## 4 3212 Producer's Cut: The Return 251 321.5980 ## 5 3217 The Negotiation 251 354.4461 ## 6 3220 Women's Appreciation 251 323.0847 ## 7 3221 Beach Games 251 318.2136 ## 8 3428 Branch Closing 251 343.6388 ## 9 3429 The Return 251 327.9470 ``` Note the name of the new field. Can we clean it up? --- # Computations We can **alias** the field names using `AS`: ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where Bytes between 300000000 and 400000000") ``` ``` ## TrackId Name AlbumId MB ## 1 3199 Casino Night - Season Finale 250 312.4642 ## 2 3206 Branch Closing 251 342.1419 ## 3 3207 The Merger 251 329.9338 ## 4 3212 Producer's Cut: The Return 251 321.5980 ## 5 3217 The Negotiation 251 354.4461 ## 6 3220 Women's Appreciation 251 323.0847 ## 7 3221 Beach Games 251 318.2136 ## 8 3428 Branch Closing 251 343.6388 ## 9 3429 The Return 251 327.9470 ``` --- # Computations We can even use the alias in our `where` clause: ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where MB between 300 and 400") ``` ``` ## TrackId Name AlbumId MB ## 1 3199 Casino Night - Season Finale 250 312.4642 ## 2 3206 Branch Closing 251 342.1419 ## 3 3207 The Merger 251 329.9338 ## 4 3212 Producer's Cut: The Return 251 321.5980 ## 5 3217 The Negotiation 251 354.4461 ## 6 3220 Women's Appreciation 251 323.0847 ## 7 3221 Beach Games 251 318.2136 ## 8 3428 Branch Closing 251 343.6388 ## 9 3429 The Return 251 327.9470 ``` --- # Computations Retrieve all tracks from AlbumId = 18 that are longer than 3 minutes. Rename the Name field as TrackName. ``` ## TrackId TrackName AlbumId Minutes ## 1 169 Body Count 18 5 ## 2 175 Voodoo 18 5 ## 3 176 The Winner Loses 18 6 ## 4 177 There Goes The Neighborhood 18 5 ## 5 181 Momma's Gotta Die Tonight 18 6 ## 6 182 Freedom Of Speech 18 4 ``` It's okay if your minutes are rounded like mine.
04
:
00
--- # Order by The `order by` command returns records sorted by a column. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where MB between 300 and 400 order by MB") ``` ``` ## TrackId Name AlbumId MB ## 1 3199 Casino Night - Season Finale 250 312.4642 ## 2 3221 Beach Games 251 318.2136 ## 3 3212 Producer's Cut: The Return 251 321.5980 ## 4 3220 Women's Appreciation 251 323.0847 ## 5 3429 The Return 251 327.9470 ## 6 3207 The Merger 251 329.9338 ## 7 3206 Branch Closing 251 342.1419 ## 8 3428 Branch Closing 251 343.6388 ## 9 3217 The Negotiation 251 354.4461 ``` The default is ascending. --- # Order by We can break ties by specifying a second column. ```r dbGetQuery(chinook_db, "select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where MB between 300 and 400 order by AlbumId, TrackID desc") ``` ``` ## TrackId Name AlbumId MB ## 1 3199 Casino Night - Season Finale 250 312.4642 ## 2 3429 The Return 251 327.9470 ## 3 3428 Branch Closing 251 343.6388 ## 4 3221 Beach Games 251 318.2136 ## 5 3220 Women's Appreciation 251 323.0847 ## 6 3217 The Negotiation 251 354.4461 ## 7 3212 Producer's Cut: The Return 251 321.5980 ## 8 3207 The Merger 251 329.9338 ## 9 3206 Branch Closing 251 342.1419 ``` `desc` means decreasing. First sort by increasing AlbumId, then in the case of a tie, sort by decreasing TrackId. --- # Aggregation SQL contains aggregation functions that summarize numerical data: `sum`, `min`, `max`, `count`, ... Which track in AlbumId = 250 has the largest size in terms of bytes? ```r dbGetQuery(chinook_db, "select max(bytes) from track where AlbumId = 250") ``` ``` ## max(bytes) ## 1 327642458 ``` What is the total length of the album with AlbumId = 250? ```r dbGetQuery(chinook_db, "select sum(milliseconds) / 60000 as minutes from track where AlbumId = 250") ``` ``` ## minutes ## 1 477 ``` --- # Aggregation How many customers are represented in Chinook? ```r dbGetQuery(chinook_db, "select count(*) from customer") ``` ``` ## count(*) ## 1 59 ``` -- How many customers in Chinook are from the USA? ```r dbGetQuery(chinook_db, "select count(*) from customer where country = 'USA'") ``` ``` ## count(*) ## 1 13 ``` --- # A note about style It is best to indent SQL queries according to their logical function. ```default select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where MB between 300 and 400 order by AlbumId, TrackID desc ``` Also acceptable: ```default select TrackId, Name, AlbumId, Bytes / power(2, 20) AS MB from track where MB between 300 and 400 order by AlbumId, TrackID desc ``` The specific indentation doesn't matter. Just focus on readability and remember to not exceed 80 chars. --- # Summary SQL queries are `SELECT...FROM` statements. - case insensitive - logical operations with `where` - numeric operations - aliases - `group by` -- ## Next time - More on aggregation. - Joins