Written by Jeremy Le on April 27 2023. Heavily inspired by Luka Kerr’s Notes.
Data modelling is a design process which converts requirements into a data model.
The aims of data modelling are to:
The world is viewed as a collection of inter-related entities.
ER has three major modelling constructs:
ER diagrams are a graphical tool for data modelling.
An ER diagram consists of:
Key = set of attributes that uniquely identifies each entity instance
ER design elements:
An entity can be viewed as either a set of entities with the same set of attributes or an abstract description of a class of entities.
A relationship is an association among several entities. A relationship set is a collection of relationships of the same type.
Relationships contain
1:1) [entity]<--(rel)-->[entity]
n:1) [entity]<--(rel)---[entity]
n:m) [entity]---(rel)---[entity]
ER also implements super-class / sub-class hierarchies.
A relational model contains
Tuples correspond to entities whilst relations correspond to entity sets and relationships.
There are different types of constraints
A relational database management system (RDBMS)
# Create a new, empty database
$ createdb dbname
# Drop and remove all data from a database
$ dropdb dbname
# Dump the contents of a database
$ pg_dump dbname > dumpfile
# Restore the contents of a database dump
$ psql dbname -f dumpfile
SQL includes a Data Definition Language (DDL) that can formalise relational schemas.
CREATE TABLE TableName (
    attrName1 domain1 constraints1,
    attrName2 domain2 constraints2,
    ...
    PRIMARY KEY (attri, attrj, ...),
    FOREIGN KEY (attrx, attry, ...)
        REFERENCES OtherTable (attrm, attrn, ...)
);
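For example, a minimal sketch of a schema using these constraints (table and attribute names are illustrative):

-- Departments must exist before Employees can reference them
CREATE TABLE Departments (
    id    integer,
    name  varchar(50) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE Employees (
    id      integer,
    name    varchar(50) NOT NULL,
    deptId  integer,
    PRIMARY KEY (id),
    FOREIGN KEY (deptId) REFERENCES Departments (id)
);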
-- Comments are after two dashes
-- Identifiers are alphanumeric and can be inside double quotes
-- Identifiers are case insensitive
"An Identifier", AnIdentifier
-- There are many keywords
CREATE, SELECT, TABLE, WHERE
-- Strings are inside single quotes
'a string'
-- Numbers are similar to C
1, -5, 3.14159
-- There are multiple types
integer, float, char(n), varchar(n), date
-- There are multiple operators
=, <>, <, <=, >, >=, AND, OR, NOT
-- Create a table with attributes and constraints
CREATE TABLE table (Attributes + Constraints)
-- Modify a table
ALTER TABLE table TableSchemaChanges
-- Drop a table
DROP TABLE table(s) [ CASCADE ]
-- Truncate a table
TRUNCATE TABLE table(s) [ CASCADE ]
-- Insert into a table
INSERT INTO table (Attrs) VALUES Tuple(s)
-- Delete from a table
DELETE FROM table WHERE condition
-- Update a table
UPDATE table SET AttrValueChanges WHERE condition
where
- `Attrs` = (attr1, attr2, ..., attrn)
- `Tuple` = (val1, val2, ..., valn)
- `AttrValueChanges` is a comma-separated list of `attrName = expression`
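For example, against the illustrative Employees table above:

-- Insert a tuple (the referenced department must exist)
INSERT INTO Departments (id, name) VALUES (10, 'Sales');
INSERT INTO Employees (id, name, deptId) VALUES (1, 'Alice', 10);
-- Update matching tuples
UPDATE Employees SET name = 'Alice Smith' WHERE id = 1;
-- Delete matching tuples
DELETE FROM Employees WHERE id = 1;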
-- Define own constrained type
CREATE DOMAIN Name AS Type CHECK (Constraint)
-- Define tuple type
CREATE TYPE Name AS (AttrName AttrType, ...)
-- Define enumerated type
CREATE TYPE Name AS ENUM ('Label', ...)
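A small sketch of each (names are illustrative):

-- Constrained domain: integers greater than zero
CREATE DOMAIN PositiveInt AS integer CHECK (VALUE > 0);
-- Tuple type
CREATE TYPE Point AS (x float, y float);
-- Enumerated type
CREATE TYPE Mood AS ENUM ('sad', 'ok', 'happy');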
- `a < b` compares strings `a` and `b` using dictionary order
- `a LIKE pattern` matches string `a` against `pattern`
  - `%` matches any sequence of characters
  - `_` matches a single character

PostgreSQL provides regexp-based pattern matching.
-- ~ and !
Attr ~ 'RegExp' -- Matches 'RegExp'
Attr !~ 'RegExp' -- Doesn't match 'RegExp'
-- ~* and !~*
Attr ~* 'RegExp' -- Matches 'RegExp' case-insensitive
Attr !~* 'RegExp' -- Doesn't match 'RegExp' case-insensitive
-- Concatenate a and b
a || b
-- Lowercase a
lower(a)
-- Extract substring from a
substring(a, start, count)
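For example, assuming the illustrative Employees table from earlier:

-- LIKE: names starting with 'J'
SELECT * FROM Employees WHERE name LIKE 'J%';
-- Regexp: names ending in 'son', case-insensitive
SELECT * FROM Employees WHERE name ~* 'son$';
-- String functions: lowercased name and its first letter
SELECT lower(name), substring(name, 1, 1) FROM Employees;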
An SQL query consists of a sequence of clauses:
SELECT projectionList
FROM relations/joins
WHERE condition
GROUP BY groupingAttributes
HAVING groupCondition
The `FROM`, `WHERE`, `GROUP BY` and `HAVING` clauses are optional.
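For example, a query using all of these clauses (assuming the illustrative schema from earlier):

-- departments with more than 5 employees
SELECT d.name, count(e.id) AS nEmployees
FROM Departments d
JOIN Employees e ON (e.deptId = d.id)
GROUP BY d.name
HAVING count(e.id) > 5;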
A view associates a name with a query. Each time the view is invoked (in a FROM
clause) the query is evaluated, yielding a set of tuples. The set of tuples is used as the value of the view.
A view can be treated as a ‘virtual table’ and is useful for packaging a complex query for use in other queries.
CREATE VIEW viewName [(attributes)] AS Query
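For example, packaging the grouping query from above as a view (names are illustrative):

CREATE VIEW BigDepartments AS
SELECT d.name, count(e.id) AS nEmployees
FROM Departments d
JOIN Employees e ON (e.deptId = d.id)
GROUP BY d.name
HAVING count(e.id) > 5;

-- used like an ordinary table
SELECT * FROM BigDepartments;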
The PostgreSQL syntax for defining stored functions:
CREATE OR REPLACE FUNCTION
funcName(arg1, arg2, ...) RETURNS retType
AS $$
String containing function definition
$$ LANGUAGE funcDefLanguage;
where
- `arg1, arg2, ...` consists of `name type` pairs
- `$$ ... $$` are just a type of string quote
- `LANGUAGE` is a function definition language (e.g. Python, SQL, PLpgSQL)

A PostgreSQL function can return a value which is
- `void`
- atomic (`integer`, `text`, ...)
- a set of atomic values (`setof integer`, ...)
- a set of tuples (`setof Employee`, where `Employee` is a tuple)

A function returning a set of values is similar to a view.
PostgreSQL allows functions to be defined in SQL:
CREATE OR REPLACE FUNCTION
funcName(arg1type, arg2type, ...)
RETURNS retType
AS $$
SQL statements
$$ LANGUAGE sql;
Within the function, arguments are accessed as $1, $2, ...
corresponding to their position in the function definition.
A parameterless function behaves similar to a view.
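A sketch of an SQL function, again assuming the illustrative Employees table:

CREATE OR REPLACE FUNCTION
    deptEmployees(integer) RETURNS setof Employees
AS $$
    SELECT * FROM Employees WHERE deptId = $1;
$$ LANGUAGE sql;

-- invoked like any other function
SELECT * FROM deptEmployees(10);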
PLpgSQL stands for Procedural Language extensions to PostgreSQL. It is a PostgreSQL-specific language integrating features of procedural programming and SQL programming.
It provides a means for extending DBMS functionality, specifically it
PLpgSQL functions are created (and inserted into db) via:
CREATE OR REPLACE FUNCTION
funcName(param1, param2, ...)
RETURNS retType
AS $$
DECLARE
variable declarations
BEGIN
code for function
END;
$$ LANGUAGE plpgsql;
The entire function body is stored as a single SQL string.
-- Assignment
var := exp
SELECT exp INTO var
-- Selection
IF C1 THEN S1
ELSIF C2 THEN S2 ...
ELSE S END IF
-- Iteration
LOOP S END LOOP
WHILE C LOOP S END LOOP
FOR rec_var IN Query LOOP ...
FOR int_var IN lo..hi LOOP ...
Query results can be captured via `SELECT ... INTO`:
SELECT Exp1, Exp2, ..., ExpN
INTO Var1, Var2, ..., VarN
FROM TableList
WHERE Condition ...
where the query is executed as usual, the projection list is evaluated as usual, and each `ExpI` value is assigned to the corresponding `VarI`.
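A sketch of a complete PLpgSQL function using SELECT ... INTO (names are illustrative):

CREATE OR REPLACE FUNCTION
    deptSize(dept integer) RETURNS text
AS $$
DECLARE
    n integer;
BEGIN
    SELECT count(*) INTO n FROM Employees WHERE deptId = dept;
    IF n > 5 THEN
        RETURN 'big';
    ELSE
        RETURN 'small';
    END IF;
END;
$$ LANGUAGE plpgsql;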
Aggregates reduce a collection of values into a single result.
The action of an aggregate function can be viewed as
State = initial state
for each item V {
    // update State to include V
    State = updateState(State, V)
}
return makeFinal(State)
The SQL standard does not specify user-defined aggregates, but PostgreSQL does provide a mechanism for defining them.
To define a new aggregate, we need to supply
- `BaseType`, the type of input values
- `StateType`, the type of intermediate states
- `sfunc(state, value) -> newState`, the state transition function
- an initial state value (defaults to `NULL`)
- `ffunc(state) -> result`, the final function (optional)
Aggregates are created using the CREATE AGGREGATE
statement:
CREATE AGGREGATE AggName(BaseType) (
    sfunc = UpdateStateFunction,
    stype = StateType,
    initcond = InitialValue,
    finalfunc = MakeFinalFunction,
    sortop = OrderingOperator
);
where
- `initcond` (with type `StateType`) is optional; defaults to `NULL`
- `finalfunc` is optional; defaults to the identity function
- `sortop` is optional; needed for min/max-type aggregates

An implementation of `count()` as a user-defined aggregate:
-- the transition function must exist before the aggregate is defined
CREATE FUNCTION oneMore(sum int, x anyelement) RETURNS int
AS $$
BEGIN
    RETURN sum + 1;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE myCount(anyelement) (
    stype = int,
    initcond = 0,
    sfunc = oneMore
);
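The new aggregate is then used like any built-in aggregate, e.g. (assuming the illustrative Employees table):

SELECT deptId, myCount(id) FROM Employees GROUP BY deptId;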
Column and table constraints ensure validity of one table. Global constraints may involve conditions over many tables.
The SQL mechanism for implementing global constraints is the assertion:
create assertion ClassSizeConstraint check (
    not exists (
        select c.id
        from Courses c
        join Enrolments e on (c.id = e.course)
        group by c.id
        having count(e.student) > 9999
    )
);
Checking assertions like this is too expensive, so DBMSs instead provide triggers to do targeted checking.
Triggers are procedures stored in the database that are activated in response to database events.
Triggers provide event-condition-action (ECA) programming where
To define a trigger, we use the syntax
CREATE TRIGGER TriggerName
{AFTER|BEFORE} Event1 [ OR Event2 ... ]
ON TableName
[ FOR EACH ROW ]
[ WHEN ( Condition ) ]
Block of Procedural/SQL Code;
Possible `Event`s are `INSERT`, `DELETE`, `UPDATE`.
Triggers can be activated `BEFORE` or `AFTER` the event.
If activated `BEFORE` the event:
- `NEW` contains the “proposed” value of the changed tuple
- modifying `NEW` causes a different value to be placed in the database

If activated `AFTER` the event:
- `NEW` contains the current value of the changed tuple
- `OLD` contains the previous value of the changed tuple
- constraint checking on `NEW` has already been done
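A sketch of a BEFORE trigger in PostgreSQL, where the action is a separate trigger function (the names and the salary attribute are illustrative):

CREATE FUNCTION checkSalary() RETURNS trigger
AS $$
BEGIN
    IF NEW.salary < 0 THEN
        RAISE EXCEPTION 'salary must be non-negative';
    END IF;
    RETURN NEW;  -- the value actually placed in the database
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER SalaryCheck
BEFORE INSERT OR UPDATE ON Employees
FOR EACH ROW EXECUTE FUNCTION checkSalary();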
A common database access pattern used in programming languages is seen below.
db = connect_to_dbms(DBname, User/Password)
query = build_SQL('SQLStatementTemplate', values)
results = execute_query(db, query)
while more_tuples_in(results):
    tuple = fetch_row_from(results)
    # do something with values in tuple
Psycopg2 is a Python module that provides access to PostgreSQL from Python programs.
A standard Psycopg2 program is seen below.
import psycopg2
conn = psycopg2.connect(DB_connection_string)
cur = conn.cursor()
cur.execute('SQLStatementTemplate', values)
conn.close()
Psycopg2 has various useful methods:
- `conn.commit()` commits any changes made by the current transaction
- `cur.mogrify('SQLStatementTemplate', values)` returns the SQL statement as a string with values substituted
- `cur.fetchall()` fetches all result tuples
- `cur.fetchone()` fetches the next result tuple
- `cur.fetchmany(nTuples)` fetches the next `nTuples` tuples

A good relational database design is
Relational design theory addresses the last issue, relying on the notion of functional dependency.
In database design, redundancy is generally a “bad thing” as it makes it difficult to maintain consistency after updates. Our goal is to reduce redundancy in stored data by ensuring that the schema is structured appropriately.
A relation instance $r(R)$ satisfies a dependency $X \to Y$ if for any $t, u \in r, t[X] = u[X] \implies t[Y] = u[Y]$.
In other words, if two tuples in $R$ agree in their values for the set of attributes $X$, then they must also agree in their values for the set of attributes $Y$. In this case we say that “$Y$ is functionally dependent on $X$”.
Attribute sets $X$ and $Y$ may overlap, and it is trivially true that $X \to X$.
Armstrong’s rules are general rules of inference on functional dependencies.
Other useful rules exist too.
Given a set $F$ of functional dependencies, how many new functional dependencies can we derive? For a finite set of attributes, there must be a finite set of derivable functional dependencies.
The largest collection of dependencies that can be derived from $F$ is called the closure of $F$ and is denoted $F^+$.
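For example, given $F = \{A \to B, B \to C\}$ on $R(A, B, C)$: transitivity gives $A \to C$, augmentation gives $AC \to BC$, and reflexivity gives every trivial dependency such as $AB \to B$; all of these belong to $F^+$.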
Normalisation is the branch of relational theory providing design insights. It makes use of schema normal forms, which characterise the level of redundancy in a relational schema, and normalisation algorithms, which provide mechanisms for transforming schemas to remove redundancy.
Normalisation theory defines six normal forms (NFs).
1NF allows most redundancy and 5NF allows the least redundancy.
Normalisation aims to put a schema into xNF by ensuring that all relations in the schema are in xNF.
Boyce-Codd Normal Form (BCNF):
Third Normal Form (3NF):
The standard transformation technique to remove redundancy is to decompose relation $R$ into relations $S$ and $T$.
We accomplish decomposition by
Properties include $R = S \cup T$, $S \cap T \neq \emptyset$ and $r(R) = s(S) \bowtie t(T)$.
A relational schema $R$ is in BCNF with respect to a set $F$ of functional dependencies iff for all functional dependencies $X \to Y$ in $F^+$: either $X \to Y$ is trivial ($Y \subseteq X$), or $X$ is a superkey of $R$.
A DB schema is in BCNF if all of its relation schemas are in BCNF.
If we transform a schema into BCNF, we are guaranteed no update anomalies due to functional dependencies, and a lossless-join decomposition.
However, we are not guaranteed that the decomposition preserves all functional dependencies.
If we need to preserve dependencies, use 3NF
The following algorithm converts an arbitrary schema to BCNF.
# Inputs: schema R, set F of functional dependencies
# Output: set Res of BCNF schemas
Res = {R}
while any schema S in Res is not in BCNF:
    choose any fd X -> Y on S that violates BCNF
    Res = (Res - S) union (S - Y) union XY
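A small worked example: take $R(A, B, C)$ with $F = \{A \to B\}$, so the only candidate key is $AC$. The dependency $A \to B$ violates BCNF since $A$ is not a superkey, so the algorithm replaces $R$ with $S(A, C)$ (i.e. $S - Y$) and $T(A, B)$ (i.e. $XY$), both of which are in BCNF.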
A relation schema $R$ is in 3NF with respect to a set $F$ of functional dependencies iff for all functional dependencies $X \to Y$ in $F^+$: either $X \to Y$ is trivial, or $X$ is a superkey of $R$, or every attribute in $Y - X$ is part of some candidate key of $R$.
A DB schema is in 3NF if all relation schemas are in 3NF.
If we transform a schema into 3NF, we are guaranteed a lossless-join decomposition that preserves all functional dependencies.
However, we are not guaranteed that all redundancy is removed.
Whether to use BCNF or 3NF depends on overall design considerations.
The following algorithm converts an arbitrary schema to 3NF.
# Inputs: schema R, set F of fds
# Output: set Res of 3NF schemas
let F_c be a minimal cover for F
Res = {}
for each fd X -> Y in F_c:
    if no schema S in Res contains XY:
        Res = Res union XY
if no schema S in Res contains a candidate key for R:
    K = any candidate key for R
    Res = Res union K
Relational algebra (RA) can be viewed as a
Relational algebra consists of
Core relational algebra operations:
Operation | Standard Notation | Our Notation |
---|---|---|
Selection | $\sigma_{expr}(Rel)$ | $Sel[expr](Rel)$ |
Projection | $\pi_{A,B,C}(Rel)$ | $Proj[A,B,C](Rel)$ |
Join | $Rel_1 \bowtie_{expr} Rel_2$ | $Rel_1 \ Join[expr] \ Rel_2$ |
Rename | $\rho_{schema}(Rel)$ | $Rename[schema](Rel)$ |
We define the semantics of RA operations using
In the following pseudocode, `+` is used instead of $\cup$; both mean the same thing since relations are sets.
Selection $Sel[C]$ returns a subset of the tuples in a relation $r(R)$ that satisfy a specified condition $C$.
In pseudocode
result = {}
for each tuple t in relation r:
    if C(t):
        result = result + {t}
Projection $Proj[X]$ returns a set of tuples containing a subset of the attributes in the original relation, where $X$ specifies a subset of the attributes of $R$.
In pseudocode
result = {}
for each tuple t in relation r:
    result = result + {t[X]}
Natural join $r \ Join \ s$ is a specialised product. It contains only pairs of tuples that match on their common attributes, with one copy of each common attribute eliminated.
In pseudocode
result = {}
for each tuple t1 in relation r:
    for each tuple t2 in relation s:
        if matches(t1, t2):
            result = result + {combine(t1, t2)}
Theta join $r \ Join[C] \ s$ is a specialised product containing only pairs that match on a supplied condition $C$.
Rename provides “schema mapping”. If expression $E$ returns a relation $R(A_1, A_2, \dots, A_n)$ then $Rename[S(B_1, B_2, \dots, B_n)](E)$ gives a relation called $S$ containing the same set of tuples as $E$ but with the name of each attribute changed from $A_i$ to $B_i$.
Union $r_1 \cup r_2$ combines two compatible relations into a single relation via set union of sets of tuples.
In pseudocode
result = r1
for each tuple t in relation r2:
    result = result + {t}
Intersection $r_1 \cap r_2$ combines two compatible relations into a single relation via set intersection of sets of tuples.
In pseudocode
result = {}
for each tuple t in relation r1:
    if t in r2:
        result = result + {t}
Difference $r_1 - r_2$ finds the set of tuples that exist in one relation but do not occur in a second compatible relation.
In pseudocode
result = {}
for each tuple t in relation r1:
    if t not in r2:
        result = result + {t}
Product (Cartesian product) $r \times s$ combines information from two relations pairwise on tuples.
If $t_1 = (A_1 \dots A_n)$ and $t_2 = (B_1 \dots B_n)$ then $(t_1:t_2) = (A_1 \dots A_n, B_1 \dots B_n)$
In pseudocode
result = {}
for each tuple t1 in relation r:
    for each tuple t2 in relation s:
        result = result + {(t1:t2)}
Division $r \ Div \ s$ considers each subset of tuples in a relation $R$ that match on $t[R - S]$ for a relation $S$. For this subset of tuples, it takes the $t[S]$ values from each, and if this covers all tuples in $S$, then it includes $t[R - S]$ in the result.
We have $r \ Div \ s = \{ t : t \in r[R - S] \land \text{satisfy}(t) \}$ where $\text{satisfy}(t) = \forall t_s \in s \ (\exists t_r \in r \ (t_r[S] = t_s \land t_r[R-S] = t))$.
Operationally:
As a programmer you give a lot of control to the DBMS but you can
The cost of evaluating a query is determined by
The analysis of costs involves estimating the size of intermediate results and then, based on this, the cost of secondary storage accesses. Accessing data from disk is the dominant cost in query evaluation.
A DBMS query optimiser works as follows:
# Input: relational algebra expression
# Output: execution plan (sequence of RA ops)
bestCost = INF
bestPlan = None
while more possible plans:
    plan = produce a new evaluation plan
    cost = estimated_cost(plan)
    if cost < bestCost:
        bestCost = cost
        bestPlan = plan
return bestPlan
Typically there are very many possible plans; smarter optimisers generate only a promising subset of them.
PostgreSQL provides the `EXPLAIN` statement to give a representation of the query execution plan, with information that may help to tune query performance.
To explain a query
EXPLAIN ANALYZE Query
Without `ANALYZE`, `EXPLAIN` shows a plan with estimated costs; with `ANALYZE`, `EXPLAIN` also executes the query and prints the real costs.
Indexes provide efficient content-based access to tuples. Indexes can be built on any combination of attributes.
To define an index
CREATE INDEX name ON table (attr1, attr2, ...)
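For example, to support frequent lookups of employees by name (assuming the illustrative Employees table):

CREATE INDEX empNameIndex ON Employees (name);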
A transaction is an atomic “unit of work” in an application that may require multiple database changes. Transactions happen in a multi-user, unreliable environment.
To maintain integrity of data, transactions must be ACID: Atomic, Consistent, Isolated and Durable.
To describe transaction effects, we consider:
- `READ` - transfer data from permanent storage to memory
- `WRITE` - transfer data from memory to permanent storage
- `ABORT` - terminate the transaction, unsuccessfully
- `COMMIT` - terminate the transaction, successfully

These are normally abbreviated to `R(X)`, `W(X)`, `A`, `C`.
A transaction must always terminate, either successfully (`COMMIT`), with all changes preserved, or unsuccessfully (`ABORT`), with the database unchanged.
- `SELECT` produces `READ` operations on the database
- `INSERT` produces `WRITE` operations
- `UPDATE` and `DELETE` produce both `READ` and `WRITE` operations

Transactions typically have intermediate states that are invalid. However, states before and after transactions must be valid.
A schedule defines a specific execution of one or more transactions, typically concurrent, with interleaved operations.
Arbitrary interleaving of operations can cause anomalies, such that two consistency-preserving transactions produce a final state which is not consistent.
Given two transactions $T_1$ and $T_2$
T1: R(X) W(X) R(Y) W(Y)
T2: R(X) W(X) R(Y) W(Y)
there exist two serial executions

T1: R(X) W(X) R(Y) W(Y)
T2:                     R(X) W(X) R(Y) W(Y)

and

T1:                     R(X) W(X) R(Y) W(Y)
T2: R(X) W(X) R(Y) W(Y)
Serial execution guarantees a consistent final state if the initial state of the database is consistent and $T_1$ and $T_2$ are consistency-preserving.
Concurrent schedules interleave transaction operations. Some concurrent schedules ensure consistency, and some cause anomalies.
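For example, this interleaving of two transactions exhibits a “lost update” anomaly: T2 reads X before T1 writes it, so T2's final write overwrites T1's update:

T1: R(X)      W(X)
T2:      R(X)      W(X)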
A serialisable schedule is a concurrent schedule for $T_1 \dots T_n$ whose final state $S$ is also the final state of one of the possible serial schedules for $T_1 \dots T_n$.
There are two common formulations of serialisability
Two transactions have a potential conflict if they perform operations on the same data item and at least one of the operations is a write operation. In such cases, the order of operations affects the result. If there is no conflict, we can swap order without affecting the result.
If we can transform a schedule by swapping the order of non-conflicting operations such that the result is a serial schedule then we can say that the schedule is conflict serialisable.
To show that a concurrent schedule is not conflict serialisable, we show that its ordering of conflicting operations could not occur in any serial schedule. This can be checked by building a precedence graph, where nodes represent transactions and there is an edge from $T_1$ to $T_2$ if an operation of $T_1$ conflicts with, and precedes, an operation of $T_2$; the schedule is conflict serialisable iff the graph contains no cycles.
View serialisability is an alternative formulation of serialisability that is less strict than conflict serialisability.
Two schedules $S$ and $S^\prime$ on $T_1 \dots T_n$ are view equivalent iff
To check serialisability of $S$, find a serial schedule that is view equivalent to $S$ from among the $n!$ possible serial schedules.
Lock based concurrency control involves synchronising access to shared data items via the following rules
These rules alone do not guarantee serialisability, but two-phase locking does: a transaction must acquire all of the locks it needs before performing any unlocks. Locking also introduces the potential for deadlock and starvation.
Overall, locking reduces concurrency which leads to lower throughput. Different granularity levels of locking (i.e. field, row, table, whole database) can also impact performance.
Transactions in SQL are specified by
- `BEGIN`: start a transaction
- `COMMIT`: successfully complete a transaction
- `ROLLBACK`: undo changes made by the transaction + abort

In PostgreSQL, some actions cause implicit rollback:
- `raise exception` during execution of a function
- an error raised in a `before` trigger

Concurrent access can be controlled via SQL:
`LOCK TABLE` explicitly acquires a lock on an entire table. Some SQL commands implicitly acquire appropriate locks (e.g. `ALTER TABLE`, `UPDATE`, `DELETE`). All locks are released at the end of the transaction.
Explicit locks are available:
lock table TableName [ in LockMode mode ]
Some possible `LockMode`s:
- `SHARE`: simply reading the table
- `SHARE ROW EXCLUSIVE`: intend to update the table

There is no `UNLOCK`; all locks are released at the end of the transaction.
Row-level locking: lock all selected rows for writing
select * from Table where Condition for update
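A sketch of row-level locking inside a transaction (the Accounts table is illustrative):

BEGIN;
-- lock the selected row for writing
SELECT * FROM Accounts WHERE id = 1 FOR UPDATE;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;  -- releases the lock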
Version based concurrency control provides multiple (consistent) versions of the database and gives each transaction access to a version that maintains the serialisability of the transaction.
In version based concurrency control, writing never blocks reading and reading never blocks writing.