Abstract
This chapter covers the basics of SQL and how to implement SQL statements within the R environment. Concepts and examples span SQL clauses, aggregate functions, joins, subqueries, virtual tables, window functions, and common table expressions (CTEs).
You have full access to this open access chapter, Download chapter PDF
Structured Query Language (SQL) is the most common language used to extract and wrangle data contained in a relational database. SQL is an essential skill for anyone working in analytics.
Basics
There are three main clauses in a SQL query: (a) SELECT, (b) FROM, and (c) WHERE. The SELECT and FROM clauses are required, though the optional WHERE clause is frequently needed.
-
SELECT: Specifies the columns to include in the output
-
FROM: Specifies the table(s) in which the relevant data are contained
-
WHERE: Specifies the rows to search
Despite the clauses being ordered as shown above (SELECT then FROM then WHERE), the FROM clause is the first to execute since we first need to identify the relevant table(s) before filtering rows and selecting columns. The SELECT clause is the last to execute.
Additional clauses are available for grouping and sorting data.
-
GROUP BY: Specifies the columns by which data should be grouped when using aggregate functions
-
HAVING: Specifies conditions for filtering rows based on aggregate functions
-
ORDER BY: Specifies how data should be sorted in the output
When implementing aggregate functions in a SELECT clause, such as counting, summing, or averaging a numeric field, all other non-aggregated fields must be included in the GROUP BY clause.
When working with large data sets, it is best to filter records on the database side to avoid reading superfluous records into an analytics tool such as R only to then filter data to the relevant subset. For example, if we are performing an analysis on employees in the Research & Development department, we should ideally filter to this subset on the database side rather than loading data on the entire workforce and then paring down to the relevant records within R. Fewer records can help enhance the performance of R scripts—especially when R is running on a local machine, such as a laptop, rather than on a more powerful server.
Though it is important to execute SQL queries directly on the database to minimize the amount of data read into R, we will use the sqldf library within R to demonstrate the mechanics of a SQL query for easily replicable examples. The sqldf library allows us to write SQL to query data frames via an embedded database engine (SQLite by default). In a practical setting, we would pass a string containing the SQL query, execute it directly against the database, and then store the query’s results to an object within R. While the syntax of SQL may vary by database, the core structure of queries is universal.
First, let us load the data sets:
## [1] 1470 36
Next, we will apply the sqldf() function to our data frame to extract specific rows and columns. In addition to the SELECT, FROM, and WHERE clauses, we will use the LIMIT clause to limit the number of rows that are displayed given the data frame’s size (n = 1, 470). In a practical setting, the LIMIT clause is only used for efficient data profiling and troubleshooting, as we would not want to arbitrarily truncate a data set used for analysis.
A best practice in writing SQL is to capitalize the names of clauses and functions and to use separate lines and indentation to make the SQL statements more readable:
## employee_id ## 1 1002 ## 2 1003 ## 3 1004 ## 4 1005 ## 5 1006 ## 6 1007 ## 7 1008 ## 8 1009 ## 9 1010 ## 10 1011
This query returned a list of employee ids for employees in the Research & Development department.
To optimize query performance, it is important to order conditions in the WHERE clause beginning with the condition that will exclude the largest number of records. Conditions are executed sequentially, and each subsequent condition must evaluate all records that remain following any preceding filtering. Limiting the number of records that must be searched when evaluating each condition will reduce the time it takes the query to return results. For example, if two conditions are needed and one excludes 5000 records while the other excludes 10, the condition that excludes 5000 records should be listed first in the WHERE clause.
Aggregate Functions
Next, let us take a look at average organization tenure by job for those in the Research & Development department:
## job_title AVG(org_tenure) ## 1 Healthcare Representative 8.618321 ## 2 Laboratory Technician 5.019305 ## 3 Manager 13.673077 ## 4 Manufacturing Director 7.600000 ## 5 Research Director 10.937500 ## 6 Research Scientist 5.113014 ## 7 Vice President 9.500000
There are 7 distinct job titles among employees in the Research & Development department, and the average organization tenure for these ranges from 5 to 13.7 years.
Since there could be a small number of employees in certain jobs, in which case average organization tenure may not be as meaningful, we can use the COUNT(*) function to count the number of rows for each group. In this case, COUNT(*) will return the number of employees in each job in the Research & Development department. We can also assign column aliases via AS in the SELECT clause to assign different names to the output fields:
## job_title employee_cnt avg_org_tenure ## 1 Healthcare Representative 131 8.618321 ## 2 Laboratory Technician 259 5.019305 ## 3 Manager 52 13.673077 ## 4 Manufacturing Director 145 7.600000 ## 5 Research Director 80 10.937500 ## 6 Research Scientist 292 5.113014 ## 7 Vice President 2 9.500000
The output shows that there are only 2 Vice Presidents in the Research & Development department, while other job titles are much more prevalent.
Since relatively few employees are Vice Presidents, let us use the HAVING clause to only show average organization tenure for Research & Development department jobs with more than 10 employees. We can also use the ROUND() function to truncate average organization tenure to one significant digit:
## job_title employee_cnt avg_org_tenure ## 1 Healthcare Representative 131 8.6 ## 2 Laboratory Technician 259 5.0 ## 3 Manager 52 13.7 ## 4 Manufacturing Director 145 7.6 ## 5 Research Director 80 10.9 ## 6 Research Scientist 292 5.1
Joins
In a practical setting, the required data are rarely contained within a single table. Therefore, we must query multiple tables and join them together.
Figure 1 illustrates how worker, position, and recruiting schemas may be related. For example, a candidate submits a job application to a posted requisition, which is connected to an open position Finance approved as part of the company’s workforce plan; when the selected candidate is hired, they become a worker with one or many events (hire, promotion, transfer, termination) and activities (learning, performance appraisals, surveys) during their tenure with the company.
Tables are related using a set of keys. Each table needs a Primary Key (PK), which is a unique identifier for each row in the table. A PK may be a single column or multiple columns; a multi-column PK is known as a composite key. It is generally best to leverage non-recyclable system-generated ids for PKs. A Foreign Key (FK) is a column whose values correspond to the values of a PK in another table. Referential integrity is the logical dependency of a FK on a PK, and this is an important concept in the context of relational data structures. Referential integrity constraints protect against orphaned FK values in child tables by deleting PK values from an associated parent table.
Figure 2 shows an Entity Relationship Diagram (ERD) that depicts PK/FK relationships among the Position, Worker, and Requisition tables. Notice that the PK for each table shown in Fig. 1 is listed as a FK in related tables.
With knowledge of the keys required to connect records across tables, there are several methods of joining the tables. Figure 3 illustrates SQL join types using Venn diagrams. Both the join type and keys for related tables need to be specified in the SQL statement. The structure of SQL queries for each method of joining Table A and Table B is represented in the following code blocks:
LEFT INCLUSIVE
LEFT EXCLUSIVE
FULL OUTER INCLUSIVE
FULL OUTER EXCLUSIVE
RIGHT INCLUSIVE
RIGHT EXCLUSIVE
INNER JOIN
To illustrate how SQL joins work, we will leverage three of the data sets used to produce the consolidated employees data set that will be leveraged throughout this book: job, tenure, and demographics. In a people analytics context, employee id is often the PK since this identifier should not be shared by two or more employees—past, present, or future. Email or network id may also be a suitable PK. We will use the employee_id column in each of the three data frames to facilitate joins.
Let us query these data frames to return the average organization tenure and average commute distance for employees in the Research & Development department, grouped by jobs with more than 10 employees. To accomplish this, we will leverage an INNER JOIN, which will return records only for employee ids which are present in all three data frames. For example, if a record exists in demographics and tenure for a particular employee id, but there is no corresponding record in job, that employee id would not be included in the output.
## job_title employee_cnt avg_org_tenure avg_commute_dist ## 1 Healthcare Representative 131 8.6 9.8 ## 2 Laboratory Technician 259 5.0 9.4 ## 3 Manager 52 13.7 7.2 ## 4 Manufacturing Director 145 7.6 9.5 ## 5 Research Director 80 10.9 8.4 ## 6 Research Scientist 292 5.1 9.0
Note that the INNER JOIN in this SQL query was structured such that both tenure and job were joined to demographics via the employee_id column. We could have instead joined job to tenure since we joined tenure to demographics; this would have achieved the same result since all employee ids exist in each of the three data frames.
If it were possible for all employee ids to exist in demographics but not in either tenure or job, we could leverage a LEFT JOIN to ensure all records from demographics are included in the output irrespective of whether they have matches in tenure or job:
## job_title employee_cnt avg_org_tenure avg_commute_dist ## 1 Healthcare Representative 131 8.6 9.8 ## 2 Laboratory Technician 259 5.0 9.4 ## 3 Manager 52 13.7 7.2 ## 4 Manufacturing Director 145 7.6 9.5 ## 5 Research Director 80 10.9 8.4 ## 6 Research Scientist 292 5.1 9.0
In this case, if demographics is the base data set which contains all employee ids (i.e., the “LEFT” data set), it is important for tenure and job to be joined to it. Joining job to tenure may result in information loss if an employee id exists in demographics and job but not in the intermediate tenure data set.
When integrating data within R, the tidyverse provides a more efficient and parsimonious method of joining many data sets using various join types. Within this framework, components are chained together via the |> operator. Though slightly less efficient, the legacy %>% operator could be used as an alternative. The example below joins nine data sets into a single employees data frame using a left join on the employee_id column:
Subqueries
Subqueries are queries nested within other queries. Subqueries are often referred to as inner queries, while the main queries are referred to as outer queries.
For example, if we are interested in performing an analysis on employees with more than a year of organization tenure, we can use a subquery to pass a list of employee ids that meet this criterion into the outer query for filtering. In this case, we would not need to include tenure in the join conditions of our main query:
## job_title employee_cnt avg_commute_dist ## 1 Healthcare Representative 118 9.7 ## 2 Laboratory Technician 198 9.6 ## 3 Manager 49 6.9 ## 4 Manufacturing Director 133 9.7 ## 5 Research Director 74 8.3 ## 6 Research Scientist 238 9.2
Virtual Tables
An alternative to a subquery is creating a virtual table in the FROM clause. When using an INNER JOIN to connect demographics to the virtual table ids, which provides a list of employee ids for those with more than a year of organization tenure, any records in demographics or job that do not relate to employees with at least a year of organization tenure will be dropped. This is true even though a LEFT JOIN is used to join job to demographics since records in demographics will be filtered based on employee_id matches in the virtual table. With this approach, our WHERE clause is limited to the department = 'Research & Development' condition:
## job_title employee_cnt avg_commute_dist ## 1 Healthcare Representative 118 9.7 ## 2 Laboratory Technician 198 9.6 ## 3 Manager 49 6.9 ## 4 Manufacturing Director 133 9.7 ## 5 Research Director 74 8.3 ## 6 Research Scientist 238 9.2
As you can see, the output of the query using a virtual table matches the results from the preceding approach that utilized a subquery.
Window Functions
Window functions are used for performing calculations over a set of rows without collapsing the records. Unlike the aggregate functions we have covered, window functions do not collapse rows into a single value; the calculated value is returned for each of the rows over which the calculation is performed.
For example, we can assign an organization tenure rank by Research & Development job using the RANK() and OVER() functions in the SELECT clause. The PARTITION BY argument functions like a GROUP BY clause but without collapsing rows, while the ORDER BY argument sorts the records in ascending (ASC) or descending (DESC) order for proper ranking:
## employee_id job_title commute_dist commute_dist_rank ## 1 2325 Healthcare Representative 29 1 ## 2 1414 Healthcare Representative 28 2 ## 3 1010 Healthcare Representative 27 3 ## 4 1573 Healthcare Representative 27 3 ## 5 2200 Healthcare Representative 26 5 ## 6 1730 Healthcare Representative 25 6 ## 7 1833 Healthcare Representative 25 6 ## 8 1993 Healthcare Representative 25 6 ## 9 2415 Healthcare Representative 25 6 ## 10 1164 Healthcare Representative 24 10
Notice that in the case of commute distance ties, the RANK() function assigns the same rank and then adds the number of ties to that rank to determine the rank for the next highest value of commute distance.
We can also treat this query as a virtual table, and then filter on the derived commute_dist_rank field to return the highest commute distance for each job. We can add a DISTINCT() function in the SELECT clause to collapse jobs for which there are more than one employee with the max commute distance and display the number of ties for each using the COUNT(*) function:
## job_title employee_count commute_dist ## 1 Healthcare Representative 1 29 ## 2 Laboratory Technician 6 29 ## 3 Manager 2 29 ## 4 Manufacturing Director 4 29 ## 5 Research Director 3 28 ## 6 Research Scientist 4 29 ## 7 Vice President 1 8
Common Table Expressions (CTEs)
An alternative to the virtual table approach is to use a common table expression (CTE), which is the result set of a query that exists temporarily and only for use in a larger query. Like the virtual table example, CTEs do not persist data in objects or tables; the data exist only for the duration of the query.
## job_title employee_count commute_dist ## 1 Healthcare Representative 1 29 ## 2 Laboratory Technician 6 29 ## 3 Manager 2 29 ## 4 Manufacturing Director 4 29 ## 5 Research Director 3 28 ## 6 Research Scientist 4 29 ## 7 Vice President 1 8
Review Questions
-
1.
What two clauses must always be present in a SQL query?
-
2.
What SQL clause is executed first at run time?
-
3.
To optimize the performance of a SQL query, how should conditions in the WHERE clause be ordered?
-
4.
How do aggregate functions differ from window functions in SQL?
-
5.
What is a subquery?
-
6.
What is the difference between an INNER JOIN and LEFT JOIN?
-
7.
What is the purpose of a common table expression (CTE)?
-
8.
What does the PARTITION BY function do?
-
9.
Why is it important for queries to limit records on the database side before reading into R?
-
10.
In which clause are filter conditions applied to aggregate functions (e.g., COUNT(*) > 5, AVG(salary) < 100000)?
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Starbuck, C. (2023). Introduction to SQL. In: The Fundamentals of People Analytics. Springer, Cham. https://doi.org/10.1007/978-3-031-28674-2_3
Download citation
DOI: https://doi.org/10.1007/978-3-031-28674-2_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28673-5
Online ISBN: 978-3-031-28674-2
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)