Introduction to SQL

Presentation for the LBS Technology Club

Qi-Shan Lim / @qmanic

This presentation is live at http://qlim.github.io/sql-intro (src)

Before we get started

What are we covering today?

  • Relational databases - the basics
  • Simple SELECT queries with WHERE, ORDER BY
  • Aggregation - SUM, AVERAGE, COUNT
  • Simple JOINs

What are we not covering?

  • This is an introduction only!
  • This is not about database design
  • This is a general SQL introduction and not about specific database systems
  • Query builders - they are your friend, don't write SQL unless you have to! But it does help to know what's going on under the hood

What is SQL?

Structured Query Language
is a tool for manipuating data in
relational databases


SELECT firstName, lastName, countryCode, city
FROM Customers
WHERE countryCode="US";
                    

Relational databases are everywhere

Some examples of database systems that use SQL

  • Microsoft Access
  • Microsoft SQL Server
  • Oracle RDBMS
  • IBM DB2
  • SAP Sypbase
  • MySQL
  • PostgreSQL
  • Amazon SimpleDB

Although SQL is standardised, differences do exist between systems

We will be using SQLite in this tutorial but will be using a basic subset of SQL that is broadly universal

Why learn SQL?

In business you might want to:

  • Extract a subset of data from a larger dataset - filter it to only include data you're interested in and maybe sort it while you're at it
  • Transform data by performing calculations on it
  • Summarise complex data (e.g. computing sums or averages)
  • Insert new data or delete existing data from a dataset (out of scope for today)

And of course, databases are at the heart of many mobile and web applications

Why learn SQL

Dilbert.com

What is a relational database?

A relational database is a collection of tables,
which have columns and rows

Table: Customers

#|customerId|firstName|lastName |countryCode|city       |tier|joinDate  
-+----------+---------+---------+-----------+-----------+----+----------
1|1         |Andy     |Dwyer    |US         |Pawnee     |3   |2013-01-08
2|2         |Bruce    |Bullerby |GB         |London     |1   |2013-01-21
3|3         |Carrie   |Bradshaw |US         |New York   |2   |2013-03-04
4|4         |D'Angelo |Barksdale|US         |Baltimore  |1   |2013-04-19
...

                    
Table: Sales

# |transactionId|customerId|productId|timestamp       |quantity|salesValue
--+-------------+----------+---------+----------------+--------+----------
1 |1            |7         |5        |2013-06-01 09:44|6       |336       
2 |2            |4         |5        |2013-04-20 04:37|10      |625       
3 |3            |8         |4        |2013-04-12 11:04|16      |1051.2    
...
                    

More about tables

  • A table is a collection of entries - each row is an entry, e.g. a customer
  • The columns of a table define the what information is stored for each entry - e.g. their name, address etc.
  • Each column has a type - it could be a number, text ('string'), date etc. Different database systems support different types
  • A common column found in tables (though not always) is a unique identifier for each row - e.g. 'customerId'. These are often used to link data in different tables (e.g. customerId appears in our transactions table)

Introducing our example database

Our example database has 3 tables:

  • Customers
  • Products
  • Sales

Example: Your first SELECT query

A SELECT query is used to get information out of a database. It's probably the most common type of query you will write


SELECT customerId, firstName, lastName, countryCode 
FROM Customers;
                    

You should get output like this:


#|customerId|firstName|lastName |countryCode
-+----------+---------+---------+-----------
1|1         |Andy     |Dwyer    |US         
2|2         |Bruce    |Bullerby |GB         
3|3         |Carrie   |Bradshaw |US         
4|4         |D'Angelo |Barksdale|US       
...

                    

Deconstructing your first SELECT query

  • The start of every SQL query tells you what sort of query you are running, in this case, it's SELECT
  • Next comes the list of columns that you want to see in the output, separated by commas
  • The FROM clause specifies which table the columns reside in
  • Finally, every SQL query ends with a semicolon
  • Whitespace is ignored. It's good practice if you can to split your query over multiple lines to improve readability

Exercise: Simple SELECT query

Write a query that selects the transactionId and timestamp columns from the Sales table

Filtering using a WHERE clause

Often you will only want to select rows that match a certain criteria. You can add a WHERE clause to your SELECT query that will specify which rows to include

This example selects only customers whose countryCode is 'US':


SELECT *
FROM Customers
WHERE countryCode = 'US';
                    

#|customerId|firstName|lastName |countryCode|city       |tier|joinDate  
-+----------+---------+---------+-----------+-----------+----+----------
1|1         |Andy     |Dwyer    |US         |Pawnee     |3   |2013-01-08
2|3         |Carrie   |Bradshaw |US         |New York   |2   |2013-03-04
3|4         |D'Angelo |Barksdale|US         |Baltimore  |1   |2013-04-19
...

                    

Deconstructing your new query

  • You can use '*' as a shorthand to mean 'select all of the columns in the table' instead of listing them explicitly
  • The WHERE clause comes after the FROM and is followed by a conditional expression. Only rows that meet the criteria will be included in the final output

Conditional expressions

Conditional expressions should be fairly familiar to you if you've used them elsewhere - e.g. with IF() functions in Excel, although there are some differences


WHERE column = 'text'       -- test for equality
WHERE column = 2            -- also works for numbers
WHERE column > 5            -- greater than is '>', less than is '<'
WHERE column <> 8           -- test for inequality
WHERE c1 = 'US' AND c2 < 10 -- you can combine conditions with AND 
WHERE (c1 = 'US' AND        -- use parentheses to clarify order of operations
  c2 < 10) OR c1 = 'GB'     
                    

Exercise: Filtering with WHERE

Write a query to select all columns for transactions that occurred prior to 1-Sep-2013.

Hint: write the date as '2013-09-01'

A quick note about dates in databases

Date handling tends to vary quite a bit between different database systems

Most systems will have specialised functions that make working with dates easier - but these are out of scope for today

Sorting the results of a query with ORDER BY

You may want to order the results of a query in a certain way. You can specify this using an ORDER BY clause in your SELECT


SELECT *
FROM Customers
WHERE countryCode = 'US'
ORDER BY lastName ASC;
                    

#|customerId|firstName|lastName |countryCode|city       |tier|joinDate  
-+----------+---------+---------+-----------+-----------+----+----------
1|4         |D'Angelo |Barksdale|US         |Baltimore  |1   |2013-04-19
2|7         |Gary     |Blauman  |US         |New York   |2   |2013-06-03
3|3         |Carrie   |Bradshaw |US         |New York   |2   |2013-03-04
4|1         |Andy     |Dwyer    |US         |Pawnee     |3   |2013-01-08
5|5         |Eric     |Murphy   |US         |Los Angeles|3   |2013-04-23
6|8         |Henry    |Pollard  |US         |Los Angeles|2   |2013-06-11
...

                    

Understanding ORDER BY

  • ORDER BY comes after any WHERE clause (if there is one) and is followed by the column to sort on, and then either ASC (ascending) or DESC (descending)
  • If you omit the direction, ASC is assumed
  • You can add additional sorting criteria (to break ties), by separating them with a comma:

SELECT *
FROM Customers
WHERE countryCode = 'US'
ORDER BY lastName ASC, firstName ASC;
                    

Exercise: sorting with ORDER BY

Write a query to select all sales occurring in the month of April 2013, sorted so that the transaction with the largest value comes first. If two transactions have the same value, a transaction with a lower customerId should come first

Hint: you will need to combine 2 conditions with AND to select only sales from April 2013

Expressions in SELECT statements

So far we have just been giving the database a list of columns that we want to see in the output of our query, as-is

However, we can also ask the database to perform calculations on the data as well

 
SELECT customerId, firstName || ' ' || lastName AS fullName, 
  tier + 10 AS tierPlus10
FROM Customers ORDER BY tierPlus10 DESC; 
                    

#|customerId|fullName        |tierPlus10
-+----------+----------------+----------
1|1         |Andy Dwyer      |13        
2|5         |Eric Murphy     |13        
3|6         |Francis Urquhart|13        
4|3         |Carrie Bradshaw |12        
...

                    

Explaining expressions

  • Instead of simply specifying the column name, we use an expression to tell the database how to calculate that column in our output
  • We can specify a label for that output column by using 'AS' after the expression with a name
  • Expressions can get complex! As well as basic math, most databases come with lots of in-built functions that can be used
  • Unfortuately, different database systems offer different functions, so you will need to find out what's available for your given system

More expression examples (Access)

Calculating the months since the customer first joined (this works in Access but not SQLite)


SELECT lastName, DateDiff('m', joinDate, Now()) AS monthsActive 
FROM Customers
ORDER BY monthsActive DESC;
                    

lastName     monthsActive 
==========   ============
Dwyer        15
Bullerby     11
Bradshaw     3
Barksdale    2
...

                    

More expression examples (Access)

You can even include 'IF' functions (again, works in Access but not SQLite)


SELECT IIf(tier = 1, '*' + firstName, firstName) as highlightedName, tier
FROM Customers;
                    

highlightedName   tier 
===============   ====
Andy              3
*Bruce            1
Carrie            2
*D'Angelo         1
...

                    

Exercise: expressions

Write a query to retrieve all sales since June 1, 2013, and include the average unit price for each transaction

Aggregation with GROUP BY

A common task is to summarise information from multiple rows in a table into a single row

For example, we may want to know, what is the total value of all sales for each customer?

We can add a GROUP BY clause to our SELECT queries to achieve this

Aggregation with GROUP BY


SELECT customerId, Sum(salesValue) as salesValueSum 
FROM Sales
GROUP BY customerId;
                    

#|customerId|salesValueSum   
-+----------+----------------
1|1         |4865.0          
2|2         |740.6           
3|3         |3421.2   
...

                    

Explaining GROUP BY

  • GROUP BY works a bit like Pivot Tables in Excel
  • In the example, we combined rows with the same customerId together into a single row in our output. We did this by specifying the column name 'customerId' after the GROUP BY clause
  • After 'SELECT' we list the columns we want in the output (as usual), but if we list a column that does not appear after GROUP BY, we also need to tell the database which function to use to combine values from multiple rows. In this case we use the Sum function to tell it to add the values of salesValue together
  • You can specify multiple columns as part of a GROUP BY clause, to get all unique combinations of values in those columns

Exercise: GROUP BY

Write a query to calculate the total sales and average unit sell price for every customer & product combination

Hint: you will need to use an expression that contains 2 Sum functions to calculate the average unit sell price

Other Aggregate Functions

  • Sum()
  • Avg()
  • Count() - returns the number of rows
  • First() and Last() (not in SQLite)
  • Max() and Min()

Exercise: GROUP BY

Write a query that returns the following for each customer: the number of transactions they have made (call it numOfTransactions) and the value of their largest transaction (call it maxSalesValue)

JOINing tables

  • So far, we've only be writing queries that operate on one table at a time
  • However, it's very common to need to combine information from multiple tables in a database
  • For instance, in our sample database we have a Customers table, that holds information on each customer such as their name and location. We also have a Sales table, that describes the sales made to each customer.
  • But what if we need a query that selects all of the sales, but only for customers in a certain country. We don't store the location of each customer in our Sales table, we just store it once in the Customers table. We need to join the Customers table to Sales

JOINing tables


SELECT s.transactionId, c.customerId, c.lastName, c.countryCode, 
  s.timestamp, s.quantity
FROM Sales s
JOIN Customers c ON s.customerId = c.customerId
WHERE c.countryCode = 'US';
...

                    

# |transactionId|customerId|lastName |countryCode|timestamp       |quantity
--+-------------+----------+---------+-----------+----------------+--------
1 |1            |7         |Blauman  |US         |2013-06-01 09:44|6       
2 |2            |4         |Barksdale|US         |2013-04-20 04:37|10      
3 |3            |8         |Pollard  |US         |2013-04-12 11:04|16      
4 |4            |3         |Bradshaw |US         |2013-12-20 19:15|17      
...

                    

Explaining JOIN

  • Like a normal SELECT statement, you specify the columns you want first, however now we also need to tell the database which table each column comes from. We do this by specifying columns in the format: [tableAlias].[columnName]
  • The main table that our query is based on comes after FROM keyword, however we also need to give the table a 'alias' ('s' in the example). We use this alias to refer to the table elsewhere in the query

Explaining JOIN (cont)

  • The JOIN comes next - first we say which table we want to join onto our main table, then after ON we tell the database how to match rows from the new table to the main table
  • We can include a WHERE clause as per usual - only now we need remember to use the table alias in our condition

Exercise: JOIN

Write a query to select transactionId, productId, the product name and quantity for each transaction that is for a product in the Widget category

JOINing multiple tables

We can also join 3 or more tables together by adding more JOIN clauses


SELECT s.transactionId, c.customerId, c.lastName, 
  p.productId, p.name AS productName, s.quantity
FROM Sales s
JOIN Customers c ON s.customerId = c.customerId
JOIN Products p ON s.productId = p.productId;
                    

# |transactionId|customerId|lastName |productId|productName|quantity
--+-------------+----------+---------+---------+-----------+--------
1 |1            |7         |Blauman  |5        |Bling      |6       
2 |2            |4         |Barksdale|5        |Bling      |10      
3 |3            |8         |Pollard  |4        |Moolah     |16      
4 |4            |3         |Bradshaw |6        |Pip        |17      
...
                    

Exercise: multiple joins

Write a query that outputs for each transaction the following: the first name of the customer, the tier of the customer, the product name and category and the quantity

Exercise: aggregating in joins

Modify your previous query so that instead of a line for every transaction, it calculates the total sales in each category for every customer

We've only scratched the surface of JOIN

We've only one covered one specific type of JOIN - called a LEFT INNER JOIN

There are also RIGHT INNER JOINs, LEFT and RIGHT OUTER JOINs, FULL OUTER JOINs, CROSS JOINs and more. Each uses different rules for combining the tables

You can also join a table to itself, or join a table to another multiple times (on different columns)

Some parting tips

  • Databases also have a concept of NULL values - used when there is no data. NULL doesn't always behave how you might expect it to, especially if you try to do math on it
  • Your database table names or columns may contain spaces of other characters in them - you may need to enclose these names in [brackets] or 'quotation marks' depending on the database system
  • If you're working with very large databases it can help to test your queries on a smaller subset of data - to help you find and fix mistakes quicker

Things we haven't been able to cover

  • You can query queries! Most db systems will let you store queries (sometimes called 'views') and let you query those queries just like you would a table
  • SQL is used for a lot more than reading data - it can be used to insert or delete data, create or delete or even alter tables and more
  • DB systems each support a lot of custom functions that can be very useful for doing analysis. Some even let you code your own functions (like using VB in Excel)

Things we haven't been able to cover

  • Designing robust and high-performing databases is a whole field in itself. Understanding some of the basics (e.g. indexes and how to take advantage of them) can help you to write more efficient queries
  • Most databases come with tools to help you analyse the performance of your queries if they're taking a long time to execute
  • Saying it again - query builders/GUIs are your friend. Don't write SQL if you don't have to!

We're out of time!

Here are some other resources you may want to look into:

Thanks for your attention!

 

We hope it's been useful!