SQL for Data Science: Introduction & Tutorial

When it comes to working with large datasets, one of the most powerful tools at a data scientist's disposal is SQL (Structured Query Language). SQL allows for efficient management, manipulation and retrieval of data from relational databases. This makes it an essential skill for anyone in the field of data science.

In this article, we'll introduce SQL and how it can be used for data science. We'll cover the basics of SQL syntax and querying databases and also introduce some advanced techniques used for more complex analysis.

Getting started with SQL

Introduction to SQL

Structured Query Language, or SQL, is a programming language used for managing relational databases. It allows users to store, manipulate, and retrieve data.

With the increasing demand for data-driven decision-making and big data analytics, knowledge of SQL has become an essential skill for data scientists. In this tutorial, we will cover the basics of SQL and introduce you to how it can be applied to data science.


Role of SQL in data science

Data science is a multidisciplinary field that combines different methods, algorithms, and tools to extract useful knowledge and insights from data. SQL plays a crucial role in this process by providing a standardized way to store and manage data.

SQL can handle large datasets efficiently and allows for complex queries to be executed quickly. This makes it an ideal tool for data scientists working with large amounts of data. When training machine learning models, data scientists will be required to extract large amounts of relevant data from a data source. Using SQL in such scenarios can help streamline the data extraction process and save time.

Additionally, many companies use relational databases to store and organize their data. As a result, being proficient in SQL is often a requirement for data science job positions.

Basic Structure of SQL

SQL commands are commonly grouped into three main components:

Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Control Language (DCL)

Data Definition Language (DDL)

DDL is used for creating, modifying, and deleting database objects such as tables, views, and indexes. It allows users to define the structure of the database and specify the relationships between different entities.

Examples of DDL:

CREATE TABLE
ALTER TABLE
DROP TABLE
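
As a rough illustration of these three statements, the sketch below runs them against an in-memory SQLite database through Python's built-in sqlite3 module. The customers table and its columns are made up for this example, and PRAGMA table_info is SQLite-specific, so other databases will expose column metadata differently.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")

# CREATE TABLE defines a new table and its columns.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, last_name TEXT)"
)

# ALTER TABLE changes the structure of an existing table.
conn.execute("ALTER TABLE customers ADD COLUMN country TEXT")

# PRAGMA table_info (SQLite-specific) lists the columns of a table.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
print(columns)

# DROP TABLE removes the table together with its data.
conn.execute("DROP TABLE customers")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining_tables)

conn.close()
```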

Data Manipulation Language (DML)

DML is used to manipulate data stored in the database. This includes inserting, updating, deleting, and retrieving data from tables.

Examples of DML:

INSERT INTO: Used to add new data into a table.
UPDATE: Used to modify existing data in a table.
DELETE FROM: Used to remove data from a table.
SELECT: Used to retrieve specific data from tables based on given conditions.
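
The four DML statements can be tried end to end with a small sketch like the one below. It assumes an in-memory SQLite database and a hypothetical customers table; the names and values are placeholders, not data from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, last_name TEXT, country TEXT)"
)

# INSERT INTO: add new rows (the ? placeholders are filled from the tuples).
conn.executemany(
    "INSERT INTO customers (last_name, country) VALUES (?, ?)",
    [("Nguyen", "USA"), ("Okafor", "Nigeria"), ("Silva", "Brazil")],
)

# UPDATE: modify existing rows that match the condition.
conn.execute("UPDATE customers SET country = 'Canada' WHERE last_name = 'Nguyen'")

# DELETE FROM: remove rows that match the condition.
conn.execute("DELETE FROM customers WHERE last_name = 'Silva'")

# SELECT: retrieve what is left in the table.
rows = conn.execute(
    "SELECT last_name, country FROM customers ORDER BY customer_id"
).fetchall()
print(rows)

conn.close()
```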

Data Control Language (DCL)

DCL is used to control access to the database and its objects. It includes granting or revoking privileges to users and managing security permissions.

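Examples of DCL:

GRANT: Used to give users privileges on database objects.
REVOKE: Used to withdraw privileges that were previously granted.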

Setting up a SQL environment

Before starting with SQL for use in data science, you'll have to pick a relational database tool and IDE (integrated development environment). Popular options include:

MySQL
Jupyter Notebooks
PostgreSQL
SQLite
MS SQL

For beginners who are new to SQL and plan to apply it to data science, either Jupyter Notebooks or JupyterLab is a good place to start. Using the sqlite3 library within a Jupyter Notebook makes for a smooth transition from Python to SQL.

Alternatively, you can use the MySQL Connector/Python library. Using SQL with Python in this manner makes for better-integrated data analysis.
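
A minimal setup with sqlite3 might look like the sketch below, which works the same way in a notebook cell or a plain script. The sales table and its single row are invented for illustration, and the in-memory database disappears when the connection closes; passing a file path to connect() would persist it instead.

```python
import sqlite3

# ":memory:" creates a throwaway database; use a file path to keep the data.
conn = sqlite3.connect(":memory:")

# sqlite3.Row lets columns be read by name as well as by position.
conn.row_factory = sqlite3.Row

conn.execute("CREATE TABLE sales (country TEXT, total_sales REAL)")
conn.execute("INSERT INTO sales VALUES ('USA', 120.5)")
conn.commit()

row = conn.execute("SELECT country, total_sales FROM sales").fetchone()
first_sale = (row["country"], row["total_sales"])
print(first_sale)

conn.close()
```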

Basic SQL Syntax

To understand the grammar of SQL, you'll need to explore the syntax of the language.

SQL follows a specific syntax for writing queries, which consists of keywords, clauses, expressions, and operators. A basic SQL query has the following structure:

```sql
SELECT column1, column2
FROM table_name
WHERE condition;
```

This query selects specific columns from a table and applies a condition to filter the results. The keywords SELECT, FROM, and WHERE are part of the SQL syntax. SQL keywords are not case-sensitive, but writing them in uppercase is a widely used convention that makes queries easier to read.

Core SQL queries for Data Analysis

To perform basic SQL queries, you'll need to know some core functions and clauses used in SQL. Here are some common ones.

Data retrieval with SELECT

The SELECT statement is the foundation of data retrieval in SQL, allowing users to specify the exact data needed from a database. You can retrieve specific columns or use wildcards to select all data from a table.

Specifying data sources using FROM

FROM specifies the table or tables to query data from. Tables can also be joined together with the FROM clause.

Filtering results using WHERE

The WHERE clause allows users to specify conditions for filtering results, so that only the rows meeting those conditions are returned.

This is useful when dealing with large datasets and wanting to extract only relevant information.

Sorting results with ORDER BY

The ORDER BY clause is used to sort the results of a query in ascending or descending order. It takes one or more column names and sorts the results based on those columns.

Here are the two keywords used with ORDER BY to set the sort direction:

Ascending order using ASC
Descending order using DESC

This clause is typically appended at the end of a query to define how the returned data should be ordered for better exploratory data analysis.

Example:

```sql
SELECT *
FROM customer
WHERE country = 'USA'
ORDER BY last_name ASC;
```

This query selects all columns and rows from a customer table where the country is USA, then sorts the results by last name in ascending order.
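
To see the same pattern run from Python, here is a small sketch using sqlite3 and a made-up customer table with three rows. It passes the country as a query parameter (the ? placeholder) rather than writing it into the SQL string, which is a common habit when the filter value comes from user input or another variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        ("Ana", "Silva", "Brazil"),
        ("Tom", "Young", "USA"),
        ("Mia", "Baker", "USA"),
    ],
)

query = """
SELECT first_name, last_name
FROM customer
WHERE country = ?
ORDER BY last_name ASC
"""

# The tuple supplies the value for the ? placeholder in the WHERE clause.
usa_customers = conn.execute(query, ("USA",)).fetchall()
print(usa_customers)

conn.close()
```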

Summarizing Data

To present data in a more readable and concise way, SQL offers several ways to summarize data. These include:

Aggregate functions

To make better sense of data, you'll have to create aggregates. An aggregate function in SQL is used to perform calculations on the data retrieved from a query.

Some common aggregate functions include:

COUNT(): Counts the number of rows or non-null values in a column.
SUM(): Calculates the sum of all values in a column.
AVG(): Calculates the average value of a column.
MIN(): Returns the minimum value in a column.
MAX(): Returns the maximum value in a column.
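
The sketch below applies several of these functions to a hypothetical sales table in SQLite. One row deliberately has a NULL amount to show the difference between COUNT(*), which counts rows, and COUNT(column), which counts only non-null values; the other aggregates also skip NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, total_sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("USA", 100.0), ("USA", 50.0), ("Brazil", 30.0), ("Brazil", None)],
)

summary = conn.execute(
    """
    SELECT COUNT(*),            -- all rows, including the NULL one
           COUNT(total_sales),  -- only rows where total_sales is not NULL
           SUM(total_sales),
           AVG(total_sales),
           MIN(total_sales)
    FROM sales
    """
).fetchone()
print(summary)

conn.close()
```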

GROUP BY for aggregation

The GROUP BY clause is used to group data by one or more columns, and then an aggregate function such as SUM, MAX, MIN, AVG, or COUNT can be applied.

Example of GROUP BY being used with the SUM aggregation:

```sql
SELECT country, SUM(total_sales)
FROM sales
GROUP BY country;
```

This query groups the sales data by country and calculates the total sales for each country.

HAVING clause for filtered aggregates

HAVING complements the aggregation process by filtering aggregated data that matches specific criteria. It provides a more refined output, retaining only the most relevant insights for analysis.

While WHERE filters individual rows before aggregation, HAVING filters the groups produced by GROUP BY after aggregation. It allows us to apply conditions on the summary to ensure the final display of results is precisely tuned to the question at hand.

Here is an example of how the HAVING clause can be used:

```sql
SELECT Country, COUNT(*) AS TotalCustomers
FROM Customers
GROUP BY Country
HAVING COUNT(*) > 10;
```

This statement will return a list of countries with more than ten customers. The use of the HAVING clause ensures only relevant data is returned from the aggregation.
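
The difference between WHERE and HAVING is easiest to see when both appear in one query. The following sketch assumes a hypothetical orders table in SQLite: WHERE first drops small orders row by row, then GROUP BY sums what is left per country, and HAVING keeps only the countries whose total exceeds a threshold. The thresholds are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("USA", 100), ("USA", 5), ("USA", 40),
        ("Brazil", 20), ("Brazil", 8),
        ("Canada", 200),
    ],
)

totals = conn.execute(
    """
    SELECT country, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10        -- row-level filter, applied before grouping
    GROUP BY country
    HAVING SUM(amount) > 50   -- group-level filter, applied after aggregation
    ORDER BY country
    """
).fetchall()
print(totals)

conn.close()
```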

Advanced SQL techniques

JOIN operations for multiple data relationships

JOIN is a fundamental operation in SQL, bringing together data across tables. Data often comes from multiple sources and may be stored independently, making JOINs vital for connecting the dots.

INNER JOIN: Returns records that have matching values in both tables.
LEFT (OUTER) JOIN: Returns all records from the left table and the matching records from the right table, with NULL values where there is no match.
RIGHT (OUTER) JOIN: Returns all records from the right table and the matching records from the left table, with NULL values where there is no match.
FULL (OUTER) JOIN: Returns all records from both tables, matched where possible, with NULL values for non-matching rows on either side.

Here is an example of the INNER JOIN syntax:

```sql
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;
```

This statement will return the order ID, customer name, and order date for all orders that have a matching customer ID in both the Orders and Customers tables.

The customer ID primary key in the Customers table is matched with the customer ID foreign key in the Orders table. This process is similar to the VLOOKUP function in Microsoft Excel.

For use cases in data science, this can be particularly useful, as data may need to be combined from multiple tables to provide a comprehensive and accurate view of the data.
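
To make the difference between join types concrete, the sketch below compares an INNER JOIN with a LEFT JOIN on two tiny, made-up tables in SQLite. One customer has no orders, so that customer is missing from the inner join but appears in the left join with a NULL (None in Python) order ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT)")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Caro")]
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(10, 1, "2024-01-05"), (11, 1, "2024-02-01"), (12, 2, "2024-02-03")],
)

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute(
    """
    SELECT c.CustomerName, o.OrderID
    FROM Customers AS c
    INNER JOIN Orders AS o ON o.CustomerID = c.CustomerID
    ORDER BY o.OrderID
    """
).fetchall()

# LEFT JOIN: every customer, with NULL order columns when nothing matches.
left_rows = conn.execute(
    """
    SELECT c.CustomerName, o.OrderID
    FROM Customers AS c
    LEFT JOIN Orders AS o ON o.CustomerID = c.CustomerID
    ORDER BY c.CustomerID, o.OrderID
    """
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```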

Subqueries for nested data extraction

Data scientists can also use subqueries to build more complex SQL queries.

Subqueries are queries nested within the main query and help extract data from a specific subset of tables, thus refining results to a higher degree.

Subqueries can be used in various ways, such as filtering out records based on an aggregate function or comparing values across different tables. Their versatility simplifies complex data manipulation and allows for more precise analysis.

Here's a simple example to show the syntax:

```sql
SELECT *
FROM Customers
WHERE Country IN (SELECT Country FROM Suppliers);
```

This statement will return all customers who share a country with any of the suppliers.
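
Subqueries can also return a single aggregated value that the outer query filters against. As a hedged sketch of that pattern, the example below uses SQLite and a hypothetical employees table to find everyone who earns more than the average salary; the names and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 40000), ("Ben", 60000), ("Caro", 80000)],
)

# The inner query computes one value (the average salary);
# the outer query keeps only the rows above that value.
above_average = conn.execute(
    """
    SELECT employee_name
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    """
).fetchall()
print(above_average)

conn.close()
```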

Common Table Expressions (CTEs)

A Common Table Expression (CTE) is a named, temporary result set that exists only for the duration of a single SQL statement. CTEs are not stored as database objects but offer a convenient way to hold and retrieve data for use within a single query. CTEs can be used to:

Simplify complex queries by breaking them down into smaller, more manageable parts.
Create recursive queries for hierarchical or tree-like data.
Store intermediate results and reuse them later in the same query.

They are essentially queries within queries, similar to subqueries, but with the added benefit of improved readability and maintainability.

Here's an example of a CTE query:

```sql
WITH temp_table AS (
    SELECT employee_name, salary
    FROM employees
    WHERE department = 'Sales'
)
SELECT *
FROM temp_table
WHERE salary > 50000;
```

In this example, the CTE (temp_table) is used to store a subset of data from the employees table and then retrieve only the relevant information in the subsequent query. This not only simplifies the query but also makes it easier to understand and modify in case of any changes.

CTEs can also be recursive, which means they can refer to themselves within the same query. This is especially useful for dealing with hierarchical data such as organizational charts or family trees.

Recursive CTEs also simplify the process of generating sequences and patterns, which are often encountered in data science tasks. More generally, a CTE behaves like a temporary view within a query, making it easier to read and understand complex data relationships.
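
A recursive CTE for hierarchical data could look roughly like the sketch below, written for SQLite with a made-up four-person organizational chart. The anchor query selects the person with no manager, and the recursive part repeatedly joins employees to the rows already found, adding one level of depth each time. The table, column, and CTE names are all assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, employee_name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Dana", None), (2, "Eli", 1), (3, "Fay", 1), (4, "Gus", 2)],
)

org_chart = conn.execute(
    """
    WITH RECURSIVE org (employee_id, employee_name, depth) AS (
        -- Anchor: the top of the hierarchy has no manager.
        SELECT employee_id, employee_name, 0
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive step: direct reports of rows already in org.
        SELECT e.employee_id, e.employee_name, org.depth + 1
        FROM employees AS e
        JOIN org ON e.manager_id = org.employee_id
    )
    SELECT employee_name, depth
    FROM org
    ORDER BY depth, employee_id
    """
).fetchall()
print(org_chart)

conn.close()
```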

Window functions for advanced analytics

For more advanced use cases, window functions will be needed to gain more control over the data you need to extract.

A window function in SQL is a calculation performed across a set of rows that are related to the current row. This is similar to how aggregate functions work, except that a window function returns a value for every row instead of collapsing each group into a single row, and the data can be partitioned at a more granular level.

With these functions, users can perform calculations over a specified range of rows, which is particularly useful in data partitioning. The user defines groups of rows, known as "windows", over which the calculations are performed.

With this level of granularity, window functions help to refine aggregation results and provide more insightful reports. Some popular examples of window functions include:

ROW_NUMBER
RANK
DENSE_RANK
LAG/LEAD

Here is an example that uses ROW_NUMBER:

```sql
SELECT EmployeeID, OrderID, OrderDate,
ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY OrderDate) AS RowNumber
FROM Orders;
```

This statement uses the ROW_NUMBER window function to number each employee's orders sequentially by order date, restarting at 1 for every employee.
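
The other window functions listed earlier follow the same OVER (PARTITION BY ... ORDER BY ...) pattern. The sketch below tries RANK and LAG on a small, invented Orders table with an assumed Amount column. It relies on SQLite 3.25 or newer, which is when SQLite added window functions; most current Python installations bundle a newer version, but that is an assumption worth checking with sqlite3.sqlite_version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (EmployeeID INTEGER, OrderID INTEGER, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, 101, 50), (1, 102, 80), (2, 201, 80), (2, 202, 30), (2, 203, 80)],
)

ranked = conn.execute(
    """
    SELECT EmployeeID, OrderID, Amount,
           -- Rank each employee's orders by amount; ties share a rank.
           RANK() OVER (PARTITION BY EmployeeID ORDER BY Amount DESC) AS AmountRank,
           -- Amount of the same employee's previous order (NULL for the first).
           LAG(Amount) OVER (PARTITION BY EmployeeID ORDER BY OrderID) AS PreviousAmount
    FROM Orders
    ORDER BY EmployeeID, OrderID
    """
).fetchall()
for row in ranked:
    print(row)

conn.close()
```

Unlike GROUP BY, every input row is still present in the result; the window functions only add columns computed from neighboring rows.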

Final thoughts

SQL is a powerful language, and its capabilities go far beyond just querying data from databases. With its versatile clauses and operations, it enables users to manipulate large datasets in various ways, providing the necessary data preparation needed for downstream data science applications like machine learning.

As clean data is always needed for accurate results in data science and AI, SQL will continue to play a crucial role in data science and analytics.

SQL for Data Science: Introduction & Tutorial | Splunk (2024)