
CS585 Final

Database Systems

  • Data != information
  • Database (DB) and Database Management System (DBMS)
  • Advantages of DBMS
  • Types of DBs based on user count, location, content, data currency, and structure

Evolution of Database Systems

  • Early DBs: file systems
  • Problems with file systems
    • Structural dependence
    • Redundancy of data
    • Data anomalies

Database Systems

  • DB vs file system
  • DBMS functions
    • Data dictionary management
    • Data storage management
    • Data transformation and presentation
    • Security management
    • Multiuser access control
    • Backup and recovery management
    • Data integrity management
    • Interfaces and communication

Key Concepts

  • Data vs. Information
    • Data: raw facts, building blocks of information
    • Information: produced by processing data, reveals meaning, enables decision making
  • Database: shared, integrated computer structure that stores end-user data and metadata
  • DBMS: collection of programs that manages the database structure and controls access to data
  • Advantages of DBMS: better data integration, increased productivity, improved data sharing, security, access, decision making, and quality
  • Types of Databases
    • Based on user count: single-user, multiuser (workgroup, enterprise)
    • Based on location: centralized, distributed, cloud
    • Based on content: general-purpose, discipline-specific
    • Based on data currency: operational, analytical (data warehouse, OLAP)
    • Based on structure: unstructured, structured, semi-structured (XML)
  • File system problems
    • Structural dependence, data dependence
    • Data redundancy, data anomalies (update, insertion, deletion)
  • DBMS functions: data dictionary, storage, transformation, security, multiuser access, backup and recovery, integrity management, interfaces and communication
  • Disadvantages of DB systems: increased costs, management complexity, maintaining currency, vendor dependence, frequent upgrade/replacement cycles

Data Modeling

  • Data modeling: iterative process of creating a specific data model for a problem domain
  • Data models: simple representations of complex real-world data structures
  • Importance of data models: communication tool, overall view of the database

Data Model Building Blocks

  • Entity: a unique and distinct object used to collect and store data
  • Attribute: characteristic of an entity
  • Relationship: describes an association among entities (1:M, M:N, 1:1)
  • Constraint: set of rules to ensure data integrity

Business Rules

  • Brief, precise, and unambiguous description of a policy, procedure, or principle
  • Enable defining basic building blocks and describing characteristics of the data
  • Sources: company policy, department managers, written documentation, interviews
  • Help standardize the company's view of data and serve as a communication tool
  • Translate into data model components (nouns → entities, verbs → relationships)

Naming Conventions

  • Entity names: descriptive of objects in the business environment, familiar to users
  • Attribute names: descriptive of the data represented
  • Proper naming facilitates communication and promotes self-documentation

Hierarchical and Network Models

  • Hierarchical: manages large amounts of data, represents 1:M relationships
  • Network: represents complex data relationships, depicts 1:M and M:N relationships

Relational Model

  • Based on a relation (table) composed of tuples (rows) and attributes (columns)
  • Advantages: structural independence, conceptual simplicity, ad hoc query capability
  • Relational Database Management System (RDBMS): performs basic functions, hides complexities

Entity-Relationship (ER) Model

  • Graphical representation of entities and their relationships in a database structure
  • Entity Relationship Diagram (ERD): uses graphic representations to model components
  • Advantages: visual modeling yields conceptual simplicity, effective communication tool
  • Disadvantages: limited constraint and relationship representation, no data manipulation language

Data Abstraction Levels

  • External Model: end users' view of the data environment, represented by ER diagrams
  • Conceptual Model: global view of the entire database, software and hardware independent
  • Internal Model: database as seen by the DBMS, software-dependent and hardware-independent
  • Physical Model: describes how data are saved on storage media, requires definition of physical storage and access methods

Entity Relationship Model (ERM)

  • Basis of an entity relationship diagram (ERD)
  • ERD depicts conceptual database as viewed by end users, including entities, attributes, and relationships

Attributes

  • Characteristics of entities
    • Required attribute: must have a value
    • Optional attribute: can be left empty
  • Domain: set of possible values for an attribute
  • Identifiers: one or more attributes that uniquely identify each entity instance
    • Composite identifier: primary key composed of more than one attribute
  • Attribute types
    • Composite attribute: can be subdivided into additional attributes
    • Simple attribute: cannot be subdivided
    • Single-valued attribute: has only a single value
    • Multivalued attribute: has many values, may require creating new attributes or entities
  • Derived attribute: value calculated from other attributes

Relationships

  • Association between entities that operates in both directions
  • Participants: entities that participate in a relationship
  • Connectivity: describes the relationship classification
  • Cardinality: expresses the minimum and maximum number of entity occurrences associated with one occurrence of a related entity
  • Existence dependence
    • Existence dependent: entity exists only when associated with another related entity occurrence
    • Existence independent: entity exists apart from related entities (strong or regular entity)
  • Relationship strength
    • Weak (non-identifying) relationship: primary key of related entity does not contain primary key component of parent entity
    • Strong (identifying) relationship: primary key of related entity contains primary key component of parent entity
  • Weak entity: existence-dependent and has a primary key partially or totally derived from parent entity
  • Relationship participation
    • Optional participation: entity occurrence does not require corresponding entity occurrence in a relationship
    • Mandatory participation: entity occurrence requires corresponding entity occurrence in a relationship
  • Relationship degree: number of entities or participants associated with a relationship
    • Unary relationship: association maintained within a single entity
    • Recursive relationship: relationship exists between occurrences of the same entity set
    • Binary relationship: two entities are associated
    • Ternary relationship: three entities are associated

Associative Entities

  • Also known as composite or bridge entities
  • Represent M:N relationships between two or more entities
  • In a 1:M relationship with parent entities
  • Composed of primary key attributes of each parent entity
  • May contain additional attributes that play no role in the connective process

Developing an ER Diagram

  1. Create a detailed narrative of the organization's description of operations
  2. Identify business rules based on the descriptions
  3. Identify main entities and relationships from the business rules
  4. Develop the initial ERD
  5. Identify attributes and primary keys that adequately describe entities
  6. Revise and review ERD

Extended Entity Relationship Model (EERM)

  • Result of adding more semantic constructs to the original entity relationship (ER) model
  • EER diagram (EERD): uses the EER model

Entity Supertypes and Subtypes

  • Entity supertype: generic entity type related to one or more entity subtypes, contains common characteristics
  • Entity subtype: contains unique characteristics of each entity subtype
  • Criteria for usage:
    • Different, identifiable kinds of the entity in the user's environment
    • Different kinds of instances should have one or more unique attributes

Specialization Hierarchy

  • Depicts arrangement of higher-level entity supertypes and lower-level entity subtypes
  • Relationships described in terms of "is-a" relationships
  • Subtype exists within the context of a supertype
  • Every subtype has one directly related supertype, while a supertype can have many subtypes
  • Provides means to support attribute inheritance, define subtype discriminator, and define disjoint/overlapping and complete/partial constraints

Inheritance

  • Enables an entity subtype to inherit attributes and relationships of the supertype
  • All entity subtypes inherit primary key attribute from their supertype
  • At implementation level, supertype and subtype(s) maintain a 1:1 relationship
  • Entity subtypes inherit all relationships in which supertype entity participates
  • Lower-level subtypes inherit all attributes and relationships from upper-level supertypes

Subtype Discriminator

  • Attribute in the supertype entity that determines to which entity subtype the supertype occurrence is related
  • Default comparison condition is the equality comparison

Disjoint and Overlapping Constraints

  • Disjoint subtypes: contain a unique subset of the supertype entity set (nonoverlapping subtypes)
    • Implementation based on the value of the subtype discriminator attribute in the supertype
  • Overlapping subtypes: contain nonunique subsets of the supertype entity set
    • Implementation requires the use of one discriminator attribute for each subtype

Completeness Constraint

  • Specifies whether each supertype occurrence must also be a member of at least one subtype
  • Types:
    • Partial completeness: not every supertype occurrence is a member of a subtype
    • Total completeness: every supertype occurrence must be a member of some subtype

Relational Modeling

Relational Tables

  • Logical view: 'relation'
  • Key concepts: keys, determinants, dependents, dependencies

Keys

  • Primary key (PK): attribute(s) that uniquely identify a row
  • Composite key: key composed of multiple attributes
  • Entity integrity: each row has a unique identity, no null values in PK
  • Types of keys: superkey, candidate key, primary key, foreign key, secondary key
    • Superkey: attribute(s) that uniquely identify a row
    • Candidate key: irreducible superkey
    • Foreign key: attribute(s) in one table that match the PK in another table
    • Secondary key: attribute(s) used for data retrieval purposes

Functional Dependency

  • Value of one or more attributes determines the value of other attributes
  • Determinant: attribute whose value determines another
  • Dependent: attribute whose value is determined by another attribute
  • Full functional dependence: entire collection of attributes in the determinant is necessary for the relationship

Nulls and Referential Integrity

  • Null: absence of a data value
  • Referential integrity: every reference to an entity instance by another entity instance is valid
  • Handling nulls: flags, NOT NULL constraint, UNIQUE constraint

Relational Algebra

  • Theoretical way of manipulating table contents using relational operators
  • Relvar: variable that holds a relation
  • Closure: use of relational algebra operators on existing relations produces new relations
  • Relational set operators:
    • SELECT: outputs a subset of rows
    • PROJECT: outputs a subset of columns
    • UNION: combines all rows from two tables, excluding duplicates
    • INTERSECT: yields only rows that appear in both tables
    • DIFFERENCE: yields all rows in one table that are not found in the other
    • PRODUCT: yields all possible pairs of rows from two tables
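
A minimal Python sketch of these operators, treating a relation as a list of dictionaries (rows); the function and attribute names (select, project, cus_id, and so on) are illustrative, not from the course material.

```python
# Sketch: relations as lists of dicts; attribute names are dict keys.

def select(relation, predicate):
    """SELECT (restrict): subset of rows satisfying a predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attrs):
    """PROJECT: subset of columns, duplicates removed."""
    seen, out = set(), []
    for row in relation:
        t = tuple((a, row[a]) for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(t))
    return out

def union(r1, r2):
    """UNION: all rows from both relations, duplicates excluded."""
    return [dict(t) for t in {tuple(sorted(row.items())) for row in r1 + r2}]

def intersect(r1, r2):
    """INTERSECT: rows that appear in both relations."""
    s2 = {tuple(sorted(row.items())) for row in r2}
    return [row for row in r1 if tuple(sorted(row.items())) in s2]

def difference(r1, r2):
    """DIFFERENCE: rows in r1 that are not in r2."""
    s2 = {tuple(sorted(row.items())) for row in r2}
    return [row for row in r1 if tuple(sorted(row.items())) not in s2]

def product(r1, r2):
    """PRODUCT: all possible pairings of rows (Cartesian product)."""
    return [{**a, **b} for a in r1 for b in r2]

customers = [{"cus_id": 1, "name": "Ann"}, {"cus_id": 2, "name": "Bob"}]
orders = [{"ord_id": 10, "cus_fk": 1}]
print(select(customers, lambda r: r["cus_id"] == 1))
print(project(customers, ["name"]))
print(product(customers, orders))
```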

Joins

  • Combine information from two or more tables
  • Types of joins:
    • Natural join: links tables by selecting rows with common values in common attributes
    • Equijoin: links tables based on an equality condition comparing specified columns
    • Theta join: joins tables using a comparison operator other than equality (a generalization of the equijoin)
    • Inner join: returns only matched records from joined tables
    • Outer join: retains matched pairs and leaves unmatched values as null
      • Left outer join: yields all rows in the first table, including unmatched ones
      • Right outer join: yields all rows in the second table, including unmatched ones

Data Dictionary and System Catalog

  • Data dictionary: description of all user-created tables in the database
  • System catalog: system data dictionary describing all objects within the database
  • Homonyms (same name for different attributes) and synonyms (different names for the same attribute) should be avoided

Database Normalization

  • Normalization: evaluating and correcting table structures to minimize data redundancies
  • Reduces data anomalies and assigns attributes to tables based on functional determination
  • Normal forms: 1NF, 2NF, 3NF, BCNF, 4NF
  • Higher normal forms contain less data redundancy than lower normal forms
  • Denormalization: produces a lower normal form, increases performance, and introduces greater data redundancy

The Need for Normalization

  • Used while designing a new database structure or improving an existing one
  • Analyzes relationships among attributes within each entity
  • Improves the existing data structure and creates an appropriate database design

The Normalization Process

  • Objective: ensure each table conforms to well-formed relations
    • Each table represents a single subject
    • No data item is unnecessarily stored in more than one table
    • All nonprime attributes are dependent on the primary key
    • Each table is void of insertion, update, and deletion anomalies
  • Ensures all tables are in at least 3NF
  • Works one relation at a time by identifying dependencies and progressively breaking the relation into new relations

Functional Dependence Concepts

  • Functional dependence: attribute B is functionally dependent on attribute A if each value of A determines one and only one value of B
  • Fully functional dependence (composite key): attribute B is fully functionally dependent on a composite key A if it is functionally dependent on A but not on any subset of A

Types of Functional Dependencies

  • Partial dependency: functional dependence in which the determinant is only part of the primary key
  • Transitive dependency: an attribute functionally depends on another nonkey attribute

Conversion to Normal Forms

  • First Normal Form (1NF):
    • Eliminate repeating groups, identify the primary key, and identify all dependencies
    • All relational tables satisfy 1NF requirements
  • Second Normal Form (2NF):
    • Make new tables to eliminate partial dependencies and reassign corresponding dependent attributes
    • Table is in 2NF when it is in 1NF and includes no partial dependencies
  • Third Normal Form (3NF):
    • Make new tables to eliminate transitive dependencies and reassign corresponding dependent attributes
    • Table is in 3NF when it is in 2NF and contains no transitive dependencies
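
A worked sketch of the 1NF-to-3NF decomposition, using a hypothetical GRADE relation and SQLite DDL issued from Python; the attribute names and dependencies are made up for illustration.

```python
import sqlite3

# Hypothetical 1NF relation (illustrative, not from the course text):
#   GRADE(stu_id, course_id, stu_name, course_title, dept_id, dept_name, grade)
#   PK = (stu_id, course_id)
#   Partial deps:    stu_id -> stu_name;  course_id -> course_title, dept_id
#   Transitive dep:  course_id -> dept_id -> dept_name
# 2NF removes the partial dependencies; 3NF also removes the transitive one:

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DEPT    (dept_id   INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
CREATE TABLE STUDENT (stu_id    INTEGER PRIMARY KEY, stu_name  TEXT NOT NULL);
CREATE TABLE COURSE  (course_id INTEGER PRIMARY KEY, course_title TEXT NOT NULL,
                      dept_id   INTEGER NOT NULL REFERENCES DEPT(dept_id));
CREATE TABLE GRADE   (stu_id    INTEGER REFERENCES STUDENT(stu_id),
                      course_id INTEGER REFERENCES COURSE(course_id),
                      grade     TEXT,
                      PRIMARY KEY (stu_id, course_id));
""")
```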

Requirements for a Good Normalized Set of Tables

  • Evaluate PK assignments and naming conventions
  • Refine attribute atomicity (atomic attribute: cannot be further subdivided)
  • Identify new attributes and relationships
  • Refine primary keys as required for data granularity (level of detail represented by the values stored in a table's row)
  • Maintain historical accuracy and evaluate using derived attributes

Denormalization

  • Design must balance two goals: well-normalized relations versus processing requirements and speed
  • Queries that join a large number of tables run more slowly
  • Defects in unnormalized tables:
    • Data updates are less efficient because tables are larger
    • Indexing is more cumbersome
    • No simple strategies for creating virtual tables known as views

Structured Query Language (SQL)

  • SQL is a nonprocedural language with a basic vocabulary of fewer than 100 command words
  • Differences among SQL dialects are minor
  • SQL is an implementation of E. F. Codd's relational set operators (SELECT, PROJECT, JOIN, UNION, INTERSECT, DIFFERENCE, PRODUCT, DIVIDE)

SQL Data Types

  • Numeric: NUMBER(L,D) or NUMERIC(L,D)
  • Character: CHAR(L), VARCHAR(L) or VARCHAR2(L)
  • Date: DATE

Data Definition Language (DDL) Commands

  • CREATE SCHEMA, CREATE TABLE, NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, DEFAULT, CHECK, CREATE INDEX, CREATE VIEW, ALTER TABLE, CREATE TABLE AS, DROP TABLE, DROP INDEX, DROP VIEW

Data Manipulation Language (DML) Commands

  • INSERT, SELECT, COMMIT, UPDATE, ROLLBACK, DELETE
  • Comparison operators, logical operators, and special operators (BETWEEN, IS NULL, LIKE, IN, EXISTS)
  • SELECT with ORDER BY, DISTINCT, aggregate functions (MIN, MAX, SUM, AVG), and GROUP BY

Creating Table Structures

  • Use one line per column definition and spaces to line up attribute characteristics and constraints
  • Table and attribute names are capitalized
  • Syntax: CREATE TABLE tablename(...);

Primary Key and Foreign Key

  • Primary key attributes contain both NOT NULL and UNIQUE specifications
  • RDBMS enforces referential integrity for foreign keys
  • Use ON DELETE and ON UPDATE clauses to specify actions on foreign key changes

SQL Constraints

  • NOT NULL, UNIQUE, DEFAULT, CHECK
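
An illustrative DDL sketch in SQLite issued from Python; the VENDOR and PRODUCT tables are hypothetical, and syntax details (data types, foreign-key enforcement) vary across DBMS vendors.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")          # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE VENDOR (
    v_code   INTEGER PRIMARY KEY,                -- implies NOT NULL + UNIQUE
    v_name   VARCHAR(35) NOT NULL,
    v_state  CHAR(2)     DEFAULT 'CA',
    v_phone  VARCHAR(12) UNIQUE
);
CREATE TABLE PRODUCT (
    p_code   INTEGER PRIMARY KEY,
    p_descr  VARCHAR(50) NOT NULL,
    p_price  NUMERIC(8,2) CHECK (p_price >= 0),  -- CHECK constraint
    v_code   INTEGER,
    FOREIGN KEY (v_code) REFERENCES VENDOR(v_code)
        ON DELETE SET NULL ON UPDATE CASCADE     -- referential-integrity actions
);
CREATE INDEX p_price_idx ON PRODUCT(p_price);
""")
```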

Data Manipulation Commands

  • INSERT: add rows to a table
  • SELECT: list table contents
  • UPDATE: modify data
  • DELETE: delete rows from a table
  • COMMIT: save changes
  • ROLLBACK: restore the database

Additional SELECT Query Keywords

  • ORDER BY: specify listing order
  • DISTINCT: produce list of unique values
  • Aggregate functions: MIN, MAX, SUM, AVG
  • GROUP BY: create frequency distributions
  • HAVING: extension of GROUP BY, applied to GROUP BY output
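
A small end-to-end sketch of the DML commands and SELECT keywords above, using SQLite from Python with a made-up PRODUCT table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PRODUCT (p_code INTEGER PRIMARY KEY, p_descr TEXT, p_price REAL, v_code INTEGER)")
con.executemany("INSERT INTO PRODUCT VALUES (?, ?, ?, ?)",
                [(1, "hammer", 9.95, 21), (2, "saw", 14.40, 21), (3, "drill", 34.99, 22)])
con.execute("UPDATE PRODUCT SET p_price = 10.95 WHERE p_code = 1")
con.execute("DELETE FROM PRODUCT WHERE p_code = 3")
con.commit()                                      # COMMIT; con.rollback() would undo instead

for row in con.execute("""
    SELECT   v_code, COUNT(*) AS n_products, AVG(p_price) AS avg_price
    FROM     PRODUCT
    WHERE    p_price BETWEEN 5 AND 50
    GROUP BY v_code
    HAVING   COUNT(*) >= 1
    ORDER BY avg_price DESC"""):
    print(row)
```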

Advanced Data Definition Commands

  • ALTER TABLE: make changes to table structure (ADD, MODIFY, DROP)
  • DROP TABLE: delete a table from the database

Joining Database Tables

  • Retrieve data from multiple tables using equality comparisons between foreign and primary keys
  • Use table aliases to identify source tables
  • Recursive joins: join a table to itself using aliases

Advanced SQL

SQL Join Operators

  • Join types: natural join, join using, join on, outer join (left, right, full)
  • Relational join operation merges rows from two tables based on common values, equality/inequality, or outer join conditions
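
A sketch of the alias-based join and recursive (self) join described above, plus a left outer join, run in SQLite from Python on hypothetical EMP/DEPT data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DEPT (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE EMP  (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                   dept_id INTEGER, mgr_id INTEGER);
INSERT INTO DEPT VALUES (10, 'Sales'), (20, 'IT');
INSERT INTO EMP  VALUES (1, 'Ann', 10, NULL), (2, 'Bob', 10, 1), (3, 'Cho', 20, 1);
""")

# FK-to-PK equality join; aliases e and d identify the source tables
print(con.execute("""
    SELECT e.emp_name, d.dept_name
    FROM   EMP e JOIN DEPT d ON e.dept_id = d.dept_id""").fetchall())

# Recursive (self) join: the same table under two aliases
print(con.execute("""
    SELECT e.emp_name AS employee, m.emp_name AS manager
    FROM   EMP e JOIN EMP m ON e.mgr_id = m.emp_id""").fetchall())

# Left outer join keeps employees with no manager (manager shown as NULL/None)
print(con.execute("""
    SELECT e.emp_name, m.emp_name
    FROM   EMP e LEFT JOIN EMP m ON e.mgr_id = m.emp_id""").fetchall())
```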

Subqueries and Correlated Queries

  • Subquery: query inside another query, can return single value, list of values, virtual table, or no value
  • WHERE subqueries: uses inner SELECT subquery on the right side of a WHERE comparison expression
  • IN and HAVING subqueries: compare a single attribute to a list of values or restrict the output of a GROUP BY query
  • FROM subqueries: uses SELECT subquery in the FROM clause to create a virtual table
  • Attribute list subqueries: uses subquery expressions in the SELECT attribute list
  • Correlated subquery: executes once for each row in the outer query; the inner query references a column of the outer query
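
A sketch of WHERE, IN, and correlated subqueries in SQLite from Python; the PRODUCT data are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PRODUCT (p_code INTEGER PRIMARY KEY, p_price REAL, v_code INTEGER);
INSERT INTO PRODUCT VALUES (1, 9.95, 21), (2, 14.40, 21), (3, 34.99, 22), (4, 5.00, 22);
""")

# WHERE subquery: single value on the right side of the comparison
print(con.execute("""
    SELECT p_code, p_price FROM PRODUCT
    WHERE  p_price > (SELECT AVG(p_price) FROM PRODUCT)""").fetchall())

# IN subquery: compare an attribute against a list of values
print(con.execute("""
    SELECT p_code FROM PRODUCT
    WHERE  v_code IN (SELECT v_code FROM PRODUCT WHERE p_price > 30)""").fetchall())

# Correlated subquery: the inner query references a column of the outer query (p1.v_code)
print(con.execute("""
    SELECT p_code, p_price FROM PRODUCT p1
    WHERE  p_price = (SELECT MAX(p_price) FROM PRODUCT p2
                      WHERE p2.v_code = p1.v_code)""").fetchall())
```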

Multirow Subquery Operators

  • ALL operator: the comparison must hold for every value returned by the subquery (e.g., > ALL means greater than the largest value in the list)
  • ANY operator: the comparison must hold for at least one value returned by the subquery (e.g., > ANY means greater than the smallest value in the list)

Relational Set Operators

  • UNION: combines rows from two or more queries without duplicates
  • UNION ALL: combines rows from two or more queries with duplicates
  • INTERSECT: returns rows that appear in both query result sets
  • EXCEPT (MINUS): returns rows that appear in the first query result set but not in the second

Virtual Tables (Views)

  • View: a virtual table based on a SELECT query
  • CREATE VIEW statement: DDL command that stores the subquery specification in the data dictionary

SQL Functions

  • Functions use numeric, date, or string values and can appear anywhere a value or attribute is used
  • Categories: numeric, character, date, conversion, null-related, aggregate, and miscellaneous functions

Oracle Sequences

  • Independent database objects used to generate numeric values for table columns
  • Can be created, deleted, and edited any time

Procedural SQL (PL/SQL)

  • Performs conditional or looping operations by isolating code and making all application programs call the shared code
  • Anonymous PL/SQL blocks, triggers, stored procedures, and functions

Triggers

  • Procedural SQL code automatically invoked by the RDBMS when a given data manipulation event occurs
  • Parts: triggering timing (before/after), triggering event (insert/update/delete), triggering level (statement/row), and triggering action (PL/SQL code)
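
A minimal trigger sketch in SQLite syntax run from Python; Oracle PL/SQL triggers differ in syntax, but the timing/event/level/action parts map onto the same clauses. The PRODUCT reorder rule is made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PRODUCT (p_code INTEGER PRIMARY KEY, p_qoh INTEGER, p_reorder INTEGER DEFAULT 0);
CREATE TRIGGER trg_reorder
AFTER UPDATE OF p_qoh ON PRODUCT      -- timing + event
FOR EACH ROW                          -- level
WHEN NEW.p_qoh < 10                   -- condition
BEGIN
    UPDATE PRODUCT SET p_reorder = 1 WHERE p_code = NEW.p_code;   -- action
END;
INSERT INTO PRODUCT VALUES (1, 50, 0);
UPDATE PRODUCT SET p_qoh = 5 WHERE p_code = 1;
""")
print(con.execute("SELECT * FROM PRODUCT").fetchall())   # p_reorder is now 1
```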

Stored Procedures

  • Named collection of procedural and SQL statements that reduce network traffic, increase performance, and reduce code duplication

PL/SQL Stored Functions

  • Named group of procedural and SQL statements that returns a value, can be invoked from within stored procedures or triggers

Transaction Management and Concurrency Control

Transactions

  • Logical unit of work that must be entirely completed or aborted
  • Consists of SELECT, UPDATE, and/or INSERT statements
  • Must begin with the database in a known consistent state
  • Transaction properties: Atomicity, Consistency, Isolation, Durability (ACID)

Transaction Management with SQL

  • SQL statements for transaction support: COMMIT, ROLLBACK
  • Transaction sequence continues until COMMIT, ROLLBACK, end of program, or abnormal termination
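
A minimal sketch of COMMIT/ROLLBACK from Python's sqlite3 module, with a hypothetical two-account transfer as the logical unit of work.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ACCOUNT (acct_id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO ACCOUNT VALUES (1, 100.0), (2, 50.0)")
con.commit()

try:
    # One logical unit of work: transfer 30 from account 1 to account 2
    con.execute("UPDATE ACCOUNT SET balance = balance - 30 WHERE acct_id = 1")
    con.execute("UPDATE ACCOUNT SET balance = balance + 30 WHERE acct_id = 2")
    con.commit()        # both updates succeed -> make them permanent
except sqlite3.Error:
    con.rollback()      # any failure -> undo the partial work (atomicity)

print(con.execute("SELECT * FROM ACCOUNT").fetchall())
```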

Transaction Log

  • Keeps track of all transactions that update the database
  • Used by DBMS for recovery and rollback

Concurrency Control

  • Coordination of simultaneous transaction execution in a multiuser database system
  • Ensures serializability of transactions

Problems in Concurrency Control

  • Lost update: Same data element updated in two concurrent transactions, one update is lost
  • Uncommitted data: Transaction is rolled back after a second transaction has accessed its uncommitted data
  • Inconsistent retrievals: Transaction accesses data before and after other transactions finish working with the data

The Scheduler

  • Establishes the order of operations within concurrent transactions
  • Creates serialization schedule to ensure serializability and isolation

Concurrency Control with Locking Methods

  • Locking methods facilitate isolation of data items used in concurrently executing transactions
  • Pessimistic locking assumes conflict between transactions is likely
  • Lock granularity: database, table, page, row, or field level

Lock Types

  • Binary lock: locked (1) or unlocked (0)
  • Shared lock: read access granted to concurrent transactions
  • Exclusive lock: write access reserved for the transaction that locked the object

Two-Phase Locking (2PL)

  • Guarantees serializability but does not prevent deadlocks
  • Growing phase: transaction acquires all required locks without unlocking any data
  • Shrinking phase: transaction releases all locks and cannot obtain new locks
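
A toy Python sketch (not a real lock manager) of the two-phase discipline: every lock is acquired during the growing phase, and no new lock may be requested once the shrinking phase begins.

```python
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False          # once True, no new locks may be acquired

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: no new locks in the shrinking phase")
        self.held.add(item)             # growing phase

    def unlock_all(self):
        self.shrinking = True           # shrinking phase begins
        released, self.held = self.held, set()
        return released

t = TwoPhaseTransaction("T1")
t.lock("X"); t.lock("Y")                # growing phase
t.unlock_all()                          # shrinking phase
# t.lock("Z") would now raise; the protocol guarantees serializable schedules
```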

Deadlocks

  • Occur when two transactions wait indefinitely for each other to unlock data
  • Control techniques: prevention, detection, avoidance

Time Stamping

  • Assigns global, unique time stamp to each transaction
  • Produces explicit order in which transactions are submitted to DBMS
  • Disadvantages: increased memory needs and processing overhead

Wait/Die and Wound/Wait Concurrency Control Schemes

  • Two different schemes for requesting access based on transaction age

Phases of Optimistic Approach

  • Read: transaction reads database, executes computations, and makes updates to a private copy
  • Validation: transaction is validated to ensure changes will not affect database integrity and consistency
  • Write: changes are permanently applied to the database

Distributed Database Management Systems (DDBMS)

  • Centralized databases have become less practical with globalization and the need for rapid, ad hoc data access
  • Distributed databases are now the norm, especially for web-based applications

Distributed Processing vs. Distributed Databases

  • Distributed processing: database's logical processing shared among physically independent sites via a network
  • Distributed database: logically related database stored over physically independent sites via a computer network

DDBMS Components and Functions

  • Components: computer workstations, network hardware and software, communications media, transaction processors (TP), and data processors (DP)
  • Functions: receives application requests, validates and analyzes them, maps them, decomposes into I/O operations, searches and validates data, ensures consistency and security, and presents data in the required format

Data and Processing Distribution

  • Single-site processing, single-site data (SPSD): processing and data storage on a single host computer
  • Multiple-site processing, single-site data (MPSD): multiple processes on different computers sharing a single data repository
  • Multiple-site processing, multiple-site data (MPMD): fully distributed DBMS supporting multiple data and transaction processors at multiple sites

Distributed Concurrency Control

  • Important in distributed databases due to multi-site, multiple-process operations that can create inconsistencies and deadlocks
  • Two-phase commit protocol (2PC): guarantees that if a portion of a transaction cannot be committed, all changes at other sites will be undone

Distributed Database Transparency

  • Distribution transparency: allows management of physically dispersed databases as if centralized
  • Transaction transparency: ensures database transactions maintain distributed database integrity and consistency
  • Performance transparency: allows a DDBMS to perform as if it were a centralized database
  • Failure transparency: ensures the system will operate in case of network failure

Distributed Database Design

  • Data fragmentation: breaks a single object into many segments (horizontal, vertical, or mixed)
  • Data replication: stores data copies at multiple sites served by a computer network
  • Data allocation: determines where to locate fragments and replicas (centralized, partitioned, or replicated)

The CAP Theorem

  • Consistency, Availability, Partition tolerance
  • In a distributed system, only two of the three can be fully achieved
  • BASE (Basically Available, Soft state, Eventually consistent) model: data changes propagate gradually through the system until all replicas become consistent

Date's 12 Commandments for DDBMS

  • Local site autonomy, no central site dependence, failure independence, location transparency, fragmentation transparency, replication transparency, distributed query processing, distributed transaction processing, hardware independence, operating system independence, network independence, and database independence

Database Connectivity and Web Technologies

Database Connectivity

  • Database middleware: provides an interface between the application program and the database
  • Data repository/source: data management application used to store data generated by an application program
  • Various connectivity options: Native SQL, ODBC, DAO+JET, RDO, OLE-DB, ADO.NET, JDBC

Microsoft's Universal Data Access (UDA)

  • Collection of technologies used to access any type of data source and manage data through a common interface
  • ODBC, OLE-DB, and ADO.NET form the backbone of the MS UDA architecture

Native SQL Connectivity

  • Connection interface provided by database vendors, unique to each vendor
  • Optimized for particular vendor's DBMS, but maintenance is a burden for programmers

ODBC, DAO+Jet, RDO

  • Open Database Connectivity (ODBC): Microsoft's implementation of SQL Access Group Call Level Interface (CLI) standard
  • Data Access Objects (DAO): object-oriented API used to access MS Access, FoxPro, and dBase databases from Visual Basic
  • Remote Data Objects (RDO): higher-level object-oriented application interface to access remote database servers

Object Linking and Embedding for Database (OLE-DB)

  • Database middleware that adds object-oriented functionality for access to data
  • Series of COM objects provides low-level database connectivity for applications
  • ActiveX Data Objects (ADO): provides a high-level application-oriented interface to interact with OLE-DB, DAO, and RDO

ADO.NET

  • Data access component of Microsoft's .NET application development framework
  • Manipulates any type of data using any combination of network, operating system, and programming language
  • Key features: DataSet (disconnected memory-resident representation of database) and XML support

Java Database Connectivity (JDBC)

  • Application programming interface that allows a Java program to interact with a wide range of data sources
  • Advantages: leverages existing technology and personnel training, allows direct access to database server or via middleware, and provides a way to connect to databases through an ODBC driver

Database Internet Connectivity

  • Allows innovative services for rapid response, increased customer satisfaction, anywhere/anytime data access, and effective information dissemination

Web-to-Database Middleware

  • Server-side extension (program) that interacts directly with the web server
  • Provides services to the web server transparently to the client browser

Client-Side Extensions

  • Add functionality to web browsers
  • Types: plug-ins, Java, JavaScript, ActiveX, and VBScript

Web Application Servers

  • Middleware application that expands the functionality of web servers by linking them to a wide range of services
  • Uses: connect to and query databases from web pages, create dynamic web search pages, enforce referential integrity
  • Features: security, user authentication, access to multiple services, integrated development environment, computational languages, HTML page generation, performance, fault tolerance, and database access with transaction management

Modern Architecture Styles

  • SOAP: XML-based, for enterprise applications
  • RESTful: resource-based, for web servers
  • GraphQL: query language, reduces network load
  • gRPC: high performance, for microservices
  • WebSocket: bi-directional, for low-latency data exchange
  • Webhook: asynchronous, for event-driven applications

Business Intelligence (BI)

  • Comprehensive, cohesive, integrated set of tools and processes
  • Captures, collects, integrates, stores, and analyzes data
  • Transforms data into information, information into knowledge, and knowledge into wisdom

BI Benefits

  • Improved decision making
  • Integrating architecture
  • Common user interface for data reporting and analysis
  • Common data repository fosters a single version of company data
  • Improved organizational performance

Decision Support Data

  • Differ from operational data in time span, granularity, and dimensionality
  • Drill down: decomposing data to a lower level
  • Roll up: aggregating data into a higher level

Decision Support Database Requirements

  • Complex, non-normalized data representations
  • Aggregated and summarized data
  • Queries extracting multidimensional time slices
  • Batch and scheduled data extraction
  • Support for different data sources and data validation rules
  • Advanced integration, aggregation, and classification
  • Support for very large databases (VLDBs) and multiple-processor technologies

Data Warehouses

  • Characteristics: integrated, subject-oriented, time-variant, and nonvolatile
  • ETL (Extract, Transform, Load) process: filters, transforms, integrates, classifies, aggregates, and summarizes operational data

Data Marts

  • Small, single-subject data warehouse subset
  • Provide decision support to a small group of people
  • Benefits: lower cost, shorter implementation time, and suitability for technologically advanced end users

Star Schema

  • Data-modeling technique mapping multidimensional decision support data into a relational database
  • Components: facts (numeric values), dimensions (qualifying characteristics), attributes, and attribute hierarchy
  • Facts and dimensions represented by physical tables in the data warehouse database
  • Many-to-one (M:1) relationship between fact table and each dimension table
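
A hypothetical star schema in SQLite from Python: a SALES fact table in an M:1 relationship with TIME and PRODUCT dimension tables, with facts aggregated by dimension attributes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DIM_TIME    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE DIM_PRODUCT (prod_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE FACT_SALES  (time_id INTEGER REFERENCES DIM_TIME(time_id),
                          prod_id INTEGER REFERENCES DIM_PRODUCT(prod_id),
                          units   INTEGER, revenue REAL);
INSERT INTO DIM_TIME    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO DIM_PRODUCT VALUES (1, 'widgets'), (2, 'gadgets');
INSERT INTO FACT_SALES  VALUES (1, 1, 100, 999.0), (1, 2, 40, 800.0), (2, 1, 120, 1200.0);
""")

# Slice the facts along the time and product dimensions
print(con.execute("""
    SELECT t.year, t.quarter, p.category, SUM(f.revenue) AS total_revenue
    FROM   FACT_SALES f
           JOIN DIM_TIME t    ON f.time_id = t.time_id
           JOIN DIM_PRODUCT p ON f.prod_id = p.prod_id
    GROUP BY t.year, t.quarter, p.category""").fetchall())
```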

Techniques to Optimize Data Warehouse Design

  • Normalizing dimensional tables (snowflake schema)
  • Maintaining multiple fact tables to represent different aggregation levels
  • Denormalizing fact tables

Data Analytics

  • Encompasses mathematical, statistical, and modeling techniques to extract knowledge from data
  • Explanatory analytics: discovers and explains data characteristics and relationships based on existing data
  • Predictive analytics: predicts future outcomes with a high degree of accuracy

Online Analytical Processing (OLAP)

  • Advanced data analysis environment supporting decision making, business modeling, and operations research
  • Characteristics: multidimensional data analysis techniques, advanced database support, easy-to-use end-user interfaces
  • ROLAP (Relational OLAP): uses relational databases and tools to store and analyze multidimensional data
  • MOLAP (Multidimensional OLAP): uses proprietary techniques to store data in matrix-like n-dimensional arrays

SQL Extensions for OLAP

  • ROLLUP: computes aggregates at increasing levels across the listed columns, producing subtotals for each grouping plus a grand total
  • CUBE: computes aggregates for all possible combinations of the listed columns, plus a grand total

Spatial Databases

  • Entity view: space populated by discrete objects (roads, buildings, rivers)
  • Field view: space as an area covered with continuous surfaces
  • Components: types, operators, indices

Spatial Data

  • Examples: crime data, disease spread, census data, real estate prices, traffic data, agricultural land use
  • Created by government agencies, NASA, USGS, NOAA, and other organizations
  • Described via points, polylines, polygons, and pixels/raster

Spatial Database Management System (SDBMS)

  • Architecture built on top of a general-purpose DBMS
  • Adds spatial data types, operators, functions, and indexing to handle spatial data
  • Differs from Geographic Information System (GIS) in application focus

Spatial Relationships and Operations

  • Topological relationships: containment, overlap, proximity
  • Types: topology-based, metric-based, direction-based, network-based
  • Operations: spatial predicates, functions, and measurements

Spatial Data in Relational Databases

  • Implemented using SQL's user-defined types (UDTs) and functions
  • Supported by major DBMSs: Oracle, PostgreSQL, MySQL, SQL Server, DB2, Informix, and SQLite

Indexing Spatial Data

  • B-Trees with Z-order curves
  • R-Trees and variants (R+, R*)
  • Quadtrees and Octrees
  • K-D Trees and K-D-B Trees

Query Processing

  • Filter and refine strategy (see the sketch below)
  • Filter step: select candidate objects whose minimum bounding rectangles (MBRs) overlap the query region
  • Refine step: test the exact geometry of the candidate objects against the query region
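
A toy Python sketch of filter-and-refine on made-up data: circles are stored with their MBRs, candidates are filtered by MBR overlap, and the refine step applies an exact circle-rectangle test.

```python
def mbr_overlaps(a, b):
    # MBRs as (xmin, ymin, xmax, ymax)
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circle_intersects_rect(cx, cy, r, rect):
    # exact test: distance from circle center to the closest point of the rectangle
    xmin, ymin, xmax, ymax = rect
    nx = min(max(cx, xmin), xmax)
    ny = min(max(cy, ymin), ymax)
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= r ** 2

objects = [  # (id, cx, cy, radius)
    (1, 2.0, 2.0, 1.0), (2, 8.0, 8.0, 1.5), (3, 4.9, 0.0, 0.5),
]
query = (0.0, 0.0, 5.0, 5.0)                     # query rectangle

candidates = [o for o in objects                 # filter step: cheap MBR overlap test
              if mbr_overlaps((o[1]-o[3], o[2]-o[3], o[1]+o[3], o[2]+o[3]), query)]
hits = [o[0] for o in candidates                 # refine step: exact geometry
        if circle_intersects_rect(o[1], o[2], o[3], query)]
print(hits)   # objects 1 and 3
```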

Visualizing Spatial Data

  • Dot maps, proportional symbol maps, diagram maps, choropleth maps
  • Mapping platforms: ESRI ArcGIS, QGIS, MapBox, Carto, GIS Cloud

Miscellaneous

  • Google KML format for encoding spatial data
  • OpenLayers as an open GIS platform

Database Performance Tuning and Query Optimization

Database Performance-Tuning Concepts

  • Goal: execute queries as fast as possible
  • Database performance tuning: set of activities and procedures to reduce response time
  • Fine-tuning requires all factors to operate at optimum level with minimal bottlenecks

Performance Tuning: Client and Server

  • Client-side: SQL performance tuning, generating efficient SQL queries
  • Server-side: DBMS performance tuning, configuring DBMS environment for optimal response

DBMS Architecture

  • Data stored in data files, grouped in file groups or table spaces
  • Data cache or buffer cache: shared memory area for recently accessed data blocks
  • SQL cache or procedure cache: stores recently executed SQL statements or PL/SQL procedures
  • Listener, user, scheduler, lock manager, and optimizer processes

Database Query Optimization Modes

  • Automatic query optimization: DBMS finds the most cost-effective access path
  • Manual query optimization: user or programmer selects and schedules optimization
  • Static query optimization: best strategy selected at query compilation time
  • Dynamic query optimization: access strategy determined at run time
  • Statistically-based query optimization: DBMS uses statistics to determine the best access strategy
  • Rule-based query optimization: based on user-defined rules to determine the best access strategy

Query Processing

  • Parsing: DBMS parses SQL query and chooses the most efficient access/execution plan
  • Execution: DBMS executes SQL query using the chosen execution plan
  • Fetching: DBMS fetches data and sends the result set back to the client

Indexes and Query Optimization

  • Help speed up data access and facilitate searching, sorting, aggregate functions, and join operations
  • Data structures: hash indexes, B-tree indexes, bitmap indexes
  • Index selectivity: measure of the likelihood that an index will be used in query processing
  • Function-based index: based on a specific SQL function or expression

Optimizer Choices

  • Rule-based optimizer: uses preset rules and points to determine the best approach to execute a query
  • Cost-based optimizer: uses algorithms based on statistics to determine the best approach to execute a query
  • Optimizer hints: special instructions for the optimizer, embedded in the SQL command text

SQL Performance Tuning

  • Evaluated from the client perspective
  • Most SQL performance optimization techniques are DBMS-specific and rarely portable
  • Majority of performance problems are related to poorly written SQL code

Conditional Expressions

  • Guidelines for writing efficient conditional expressions in SQL code:
    • Use simple columns or literals as operands
    • Numeric field comparisons are faster than character, date, and NULL comparisons
    • Equality comparisons are faster than inequality comparisons
    • Transform conditional expressions to use literals
    • Write equality conditions first when using multiple conditional expressions
    • When using multiple AND conditions, write the condition most likely to be false first
    • When using multiple OR conditions, put the condition most likely to be true first
    • Avoid the use of NOT logical operator

Query Formulation

  • Identify required columns and computations
  • Identify source tables
  • Determine how to join tables
  • Determine selection criteria
  • Determine the order to display the output

DBMS Performance Tuning

  • Managing DBMS processes in primary memory and structures in physical storage
  • Parameters for data cache, SQL cache, sort cache, and optimizer mode
  • In-memory database: store large portions of the database in primary storage
  • Recommendations for physical storage:
    • Use RAID for performance improvement and fault tolerance
    • Minimize disk contention
    • Put high-usage tables in their own table spaces
    • Assign separate data files in separate storage volumes for indexes, system, and high-usage tables

NoSQL Databases

  • The term NoSQL was used as early as 1998
  • NoSQL databases arose due to the need for flexible, efficient, available, and scalable solutions for handling big data
  • Advantages: high throughput, easy scalability, avoidance of complexity, ability to run on commodity hardware, etc.

Factors Leading to the Need for NoSQL

  • Rapid generation of new types of data
  • Difficulty in fitting data into the relational model
  • Scalability and performance issues with traditional databases

Characteristics of NoSQL Databases

  • Schema-less: no fixed tables or relationships
  • Flexible: easy to add new types of data
  • (Data) Scalable: ability to scale out (horizontal scaling)
  • Fast: easy to process large volumes of data

Types of NoSQL Databases

  • Key-value stores
  • Document stores
  • Column-family stores
  • Graph databases

BASE vs. ACID

  • NoSQL databases are characterized by BASE (Basically Available, Soft state, Eventually consistent)
  • Traditional databases are characterized by ACID (Atomicity, Consistency, Isolation, Durability)

Key-Value Databases

  • Simple data model: keys paired with values
  • Examples: Memcached, Redis, Amazon's Dynamo

Document Databases

  • Store data as documents (JSON, XML, etc.)
  • Flexible schema, no need for joins
  • Examples: MongoDB, CouchDB, Couchbase

Column-Family Databases

  • Store data in columns rather than rows
  • Optimized for aggregate queries
  • Examples: Google's BigTable, Apache Cassandra, Apache HBase

Graph Databases

  • Store data as nodes and edges (relationships)
  • Optimized for traversing relationships
  • Examples: Neo4j, FlockDB, HyperGraphDB

Triple Store Databases

  • Store data as triples (subject, predicate, object)
  • Used for semantic querying and inference
  • Examples: AllegroGraph, MarkLogic, SparkleDB

Querying NoSQL Databases

  • Non-SQL query languages, often specific to the database type
  • MapReduce for parallel processing of large datasets

Polyglot Persistence

  • Using different NoSQL databases for different parts of an application
  • Choosing the right database for each use case

The CAP Theorem and NoSQL

  • Consistency, Availability, Partition tolerance
  • NoSQL databases often prioritize availability and partition tolerance over strong consistency
  • Increasing adoption of NoSQL databases for big data and web-scale applications
  • Emergence of multi-model databases and hybrid SQL/NoSQL solutions

Data Mining

  • Data mining is the science of extracting useful information from large datasets
  • Involves discovering relationships between parts of a dataset
  • Data mining is a cyclical process: data → discovery → action ("deployment") → new data

Machine Learning vs. Data Mining

  • Machine learning: training an algorithm on an existing dataset to discover relationships and create a model for analyzing new data

Data Mining Algorithms: Categories

  1. Association: relating data
  2. Clustering: grouping data based on similarity
  3. Classification: labeling data
  4. Regression: modeling numeric relationships in data (also used to find outliers)

Algorithms

Decision Trees (e.g., C4.5, C5.0)

  • Used for classification and regression
  • Recursively partitions data space and fits a simple prediction model within each partition

Support Vector Machine (SVM)

  • Used for binary classification
  • Finds a hyperplane that maximizes the gap between two classes of data

k-Nearest Neighbors (kNN)

  • Used for classification
  • Classifies a new data point based on the majority class of its k nearest neighbors
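
A minimal kNN sketch in Python on toy 2-D data (k = 3), classifying by majority vote among the nearest neighbors.

```python
from collections import Counter
import math

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((0.9, 1.1), 'A'),
         ((4.0, 4.0), 'B'), ((4.2, 3.9), 'B'), ((3.8, 4.1), 'B')]

def knn_classify(x, k=3):
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]   # k nearest by Euclidean distance
    votes = Counter(label for _, label in nearest)                  # majority vote
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))   # 'A'
print(knn_classify((3.9, 4.0)))   # 'B'
```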

Naive Bayes

  • Probability-based classifier
  • Assumes each feature is statistically independent of the others

k-Means Clustering

  • Unsupervised learning algorithm
  • Creates k clusters from input data based on a measure of closeness
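
A minimal k-means sketch in Python on toy 1-D data (k = 2), alternating between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster.

```python
import random

def kmeans(points, k=2, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centroids[j]))   # nearest centroid
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]          # mean of each cluster
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans(data))
```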

Hierarchical Clustering

  • Organizes data into hierarchical clusters (clusters of clusters)
  • Can be seen as merging smaller clusters into larger ones or dividing larger clusters into smaller ones

Expectation-Maximization (EM)

  • Used for clustering data with missing values or latent variables
  • Iteratively estimates model parameters and latent variable distributions

Apriori

  • Used for association rule learning
  • Finds frequently occurring itemsets in transactional data

Linear Regression

  • Models the relationship between a dependent variable and one or more independent variables
  • Fits a line to the observed data

Non-Linear Regression

  • Fits a higher-order, non-linear surface to the observed data

Non-Parametric Modeling

  • Makes no assumptions about the form of the relationship between variables

Logistic Regression

  • Used for binary classification
  • Computes regression coefficients corresponding to a decision boundary

Ensemble Learning

  • Combines results from multiple learners to improve predictive power
  • Examples: RandomForest, Bagging, AdaBoost

Classification vs. Clustering

  • Classification algorithms assign new data to pre-existing, labeled categories
  • Clustering algorithms group new data into unlabeled clusters based on similarity

Machine Learning

  • Machine learning is a subset of AI that is revolutionizing the world
  • Three types of AI: inference-based (symbolic), goals/rewards-based (reinforcement), and connection-based (neural)
  • Machine learning comes in several flavors: supervised, unsupervised, semisupervised, and reinforcement learning

Neural Networks

  • Inspired by the human brain, which contains about 100 billion neurons
  • A neuron is like a function with inputs (dendrites) and an output (axon)
  • Artificial neurons are interconnected to form neural networks
  • Output is generated through an activation function (e.g., sigmoid, ReLU)
  • Learning occurs through adjusting weights and biases via backpropagation
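
A toy single-neuron forward pass in Python with illustrative weights: a weighted sum of the inputs plus a bias, passed through a sigmoid activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias   # weighted sum ("dendrites")
    return sigmoid(z)                                        # activation -> output ("axon")

print(neuron([0.5, 0.2], weights=[0.8, -0.4], bias=0.1))
# Training would adjust `weights` and `bias` by backpropagating the output error.
```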

Convolutional Neural Networks (CNNs)

  • Biologically inspired by the visual cortex
  • Designed to process grid-like data (e.g., images)
  • Consist of convolutional layers, pooling layers, and fully connected layers
  • Convolutional layers detect features using filters (kernels)
  • Pooling layers reduce spatial dimensions and provide translation invariance
  • Fully connected layers perform classification or regression

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)

  • Designed to process sequential data (e.g., time series, text)
  • RNNs have connections between nodes forming a directed graph along a temporal sequence
  • LSTMs are a special kind of RNN, capable of learning long-term dependencies

Generative Adversarial Networks (GANs)

  • Consist of a generator network and a discriminator network
  • Generator creates synthetic data, while the discriminator tries to distinguish real data from generated data
  • Both networks are trained simultaneously in a minimax game

Applications and Demonstrations

  • Image and video recognition, natural language processing, speech recognition, and more
  • Examples: Google's QuickDraw, DeepFake, style transfer, face detection, self-driving cars

Key Players and Developments

  • Major tech companies (Google, Facebook, Microsoft, Amazon) and researchers (Andrew Ng, Geoffrey Hinton, Yann LeCun)
  • Hardware advancements (GPUs) and software frameworks (TensorFlow, PyTorch)

Current Work and Future Directions

  • Transformer architecture for language processing
  • Graph neural networks, capsule networks, and attention mechanisms
  • Explainable AI (XAI) for interpretability
  • Neuromorphic computing for energy efficiency

Challenges and Concerns

  • Bias, explainability, adversarial attacks, and deepfakes
  • Ethical considerations and potential misuse

Map Reduce

Functional Programming Concepts

  • lambda operator allows defining anonymous, just-in-time functions.
  • map() applies a function to a sequence (e.g., list).
  • filter() filters elements of a list based on a Boolean-valued function.
  • reduce() applies a function repeatedly to elements of a list to generate a single value.
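
A short Python example of these four constructs:

```python
from functools import reduce   # in Python 3, reduce() lives in functools

nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, nums))          # map: apply a function to each element
evens   = list(filter(lambda x: x % 2 == 0, nums))  # filter: keep elements passing the test
total   = reduce(lambda a, b: a + b, nums)          # reduce: fold the list into a single value
print(squares, evens, total)                        # [1, 4, 9, 16, 25] [2, 4] 15
```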

MapReduce Programming Paradigm

  • Designed for processing Big Data in a parallel and distributed fashion.
  • Consists of map and reduce phases, with an optional shuffle phase.
  • Users supply a mapper task and a reducer task; the rest is handled automatically.
  • Widely used for processing data in NoSQL databases.
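
A toy word-count sketch of the paradigm in plain Python; a real framework such as Hadoop would handle partitioning, shuffling, and fault tolerance, while the user supplies only the mapper and reducer.

```python
from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]          # map: emit (key, value) pairs

def reducer(word, counts):
    return word, sum(counts)                             # reduce: combine values per key

lines = ["big data is big", "data is everywhere"]
shuffled = defaultdict(list)
for line in lines:
    for word, count in mapper(line):                     # map phase
        shuffled[word].append(count)                     # shuffle: group values by key
print([reducer(w, c) for w, c in shuffled.items()])      # reduce phase
```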

Hadoop Ecosystem

Core Components

  • HDFS — Hadoop Distributed File System for storing data across clusters.
  • YARN — Yet Another Resource Negotiator, a resource management and job scheduling technology.

Data Processing Engines

  • Hive — SQL-like scripting language (HiveQL) for querying data stored in HDFS.
  • Pig — Data flow language (Pig Latin) for parallel data processing on Hadoop.
  • Spark — In-memory data processing engine for iterative and real-time processing.
  • Storm — Distributed real-time computation system for processing unbounded streams of data.
  • Flink — Parallel data processing platform, generalizing the MapReduce model.
  • Samza — Processing engine for handling batch data and streaming data in a unified manner.

Graph Processing

  • Giraph — Open-source implementation of Google's Pregel, specifically designed for iterative graph computations.
  • HAMA — General-purpose Bulk Synchronous Parallel (BSP) computing engine on top of Hadoop.

Cloud Computing and Virtual Machine Setups

  • Cloud platforms like AWS, Google Cloud, and Microsoft Azure offer computing resources and services for building Big Data applications.
  • Virtual machines (VMs) can be used to experiment with Hadoop implementations like Hortonworks Sandbox, MapR Sandbox, and others.

Bulk Synchronous Parallel (BSP) Model

  • An alternative to MapReduce for parallel and distributed computing.
  • Computation is divided into supersteps, with phases for local computation, communication, and synchronization barrier.
  • Suitable for iterative computations, as input data can be saved in memory and transferred between supersteps.

Data Visualization

  • Visualization of data for understanding, communication, decision-making, and gaining insights.
  • Types of visualizations: pie charts, bar charts, histograms, scatter plots, maps, network diagrams, word clouds, and more.
  • Purpose: to make data more intuitive, revealing patterns, trends, and relationships.

Simple Visualizations

  • Pie charts for expressing relative fractions of a quantity.
  • Bar charts and histograms for visualizing distributions and comparing values.
  • Word clouds for indicating relative strengths of keywords or topics.
  • Bubble plots for displaying multivariate data.
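
A minimal matplotlib sketch (made-up data) showing two of the simple visualization types listed above, a bar chart and a histogram.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [23, 17, 35]
samples = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 7]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, values)            # bar chart: compare values across categories
ax1.set_title("Bar chart")
ax2.hist(samples, bins=5)              # histogram: distribution of a numeric variable
ax2.set_title("Histogram")
plt.tight_layout()
plt.show()
```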

Spatial Data Visualization

  • Maps for plotting spatial data like locations, incidence, and geographic distributions.
  • Choropleth maps for displaying aggregated data over regions (unclassed or classed).
  • Superimposing data on maps for visualizing spatio-temporal data.

Multivariate and Complex Data Visualization

  • Charles Minard's chart on Napoleon's Russian campaign, considered one of the greatest statistical graphics ever drawn, showing multiple variables.
  • Network visualizations for displaying relationships between entities, with node and edge attributes.

Interactivity and Animation

  • Interactive visualizations for exploration, drill-down, and selective viewing.
  • Animations for presenting changes over time and providing fresh perspectives.

Real-time Visualization

  • Real-time visualizations for immediate and relevant data representations.
  • Examples: traffic conditions, stock prices, earthquake monitoring, and population growth.

Data Visualization Tools and Platforms

  • Spreadsheet software like Excel.
  • Mathematical and statistical packages: MATLAB, Mathematica, R, Python (matplotlib), and JavaScript (D3.js).
  • Dedicated data visualization software and platforms: Tableau, Qlik, SiSense, and more.
  • Online tools and services for creating visualizations.

Principles and Guidelines

  • Data visualization as an art and science, involving principles from various disciplines like visual perception, color theory, composition, and design.
  • Guidelines for choosing appropriate visualizations based on data types and analysis goals.
  • Emphasizing substance over methodology and graphic design (Edward Tufte's principles).

Examples and Resources

  • A collection of examples showcasing different visualization types and techniques.
  • Resources for learning and practicing data visualization, including courses, whitepapers, and online galleries.

Privacy, Security, and Data Governance

Privacy

General Data Protection Regulation (GDPR)

  • The European Union's GDPR requires businesses to protect the personal data and privacy of EU citizens for transactions occurring within the EU.
  • It provides a single set of privacy guidelines for EU member countries and citizens, effective from May 25, 2018.

Privacy Challenges

  1. Websites and e-commerce companies collect vast amounts of customer data for sales and analytics, making data vulnerable to theft and leakage.
  2. Search data and online activities are tracked, enabling data inference and connecting of dots about individuals.
  3. Surveillance and monitoring of individuals, particularly in countries like China and Japan.
  4. Location tracking through mobile devices, security cameras, and other means.

Fair Information Practices (FIP)

The FIP efforts in organizations followed five tenets:

  1. Openness
  2. Disclosure
  3. Secondary usage limits
  4. Correctability
  5. Security

Redefining Privacy

The notion of "privacy" needs to be redefined in the digital age, where personal data is widely collected and shared.

Security

Data Breaches

  • Data breaches involving theft or exposure of valuable data, identity information, and customer details are becoming commonplace (e.g., Yahoo, Home Depot, Facebook, Uber, Equifax).

Risks and Costs

  • Privacy and security breaches can result in significant costs, including lost revenue, customer attrition, fines, and lawsuits. Investing in security technologies and policies is crucial to mitigate these risks.

Internet of Things (IoT) Security

  • Novel security threats include attacks on IoT devices and data, including connected cars and other smart devices.

Ethics

Ethical Use of Data

Ethical use of data involves multiple aspects, such as:

  1. Potential for biased or unethical behavior in AI systems (e.g., autonomous drones, robot soldiers).
  2. Fairness issues, such as bias in face recognition systems.
  3. Deepfakes and information warfare (fake videos, audio, images, and text).
  4. Interpretability and transparency of machine learning models.

Data Governance

Data governance is the exercise of authority and control over the management of data assets. It involves planning, monitoring, and enforcing policies, procedures, rules, and guidelines for data collection, storage, and usage.

Master Data Management (MDM)

MDM is the management of "master data," providing a single point of reference for critical, shared data within an organization. It ensures data consistency, accuracy, and completeness.

Data Provenance

Data provenance documents the inputs, entities, systems, and processes that influence data, providing a historical record of its origins and lineage. Provenance helps establish trust in data and is crucial for scientific research.

Data Curation

Data curation involves actively managing data throughout its lifecycle to maintain quality, add value, and enable reuse over time. It includes processes and technologies for organizing, preserving, and facilitating access to data.

Compliance

Compliance refers to adhering to laws, regulations, and policies related to data handling and privacy. It involves understanding the differences between laws, regulations, and policies, and ensuring that organizations comply with relevant requirements.

Trust

Trust in organizations handling personal data is proportional to transparency, value delivery, and acceptance of consequences. Individuals need assurance that their data is being used responsibly by businesses, governments, and non-profit organizations.

Generative AI

  • Generative AI runs neural networks backwards, generating new data instead of classifying existing data.

Generative Adversarial Networks (GANs)

  • Two neural networks (generator and discriminator) are paired in opposing order, with the generator creating new data and the discriminator evaluating its authenticity.

Style Transfer

  • One of the early uses of GenAI was to add artistic styles to images.

Autoencoders and Variational Autoencoders (VAEs)

  • Autoencoders consist of an encoder that maps input to a latent space and a decoder that reconstructs the input from the latent representation.
  • VAEs produce a distribution in the latent space instead of a single point, allowing for the generation of new data by sampling from the distribution.

Transformers

  • Transformers, initially introduced for machine translation, can be adapted to various domains, including vision (ViT), time-series prediction, and music generation.
  • The self-attention mechanism lets Transformers capture long-range context, though its computation scales quadratically with sequence length, a well-known bottleneck

Retrieval-Augmented Generation (RAG)

  • RAG models augment language models with external memory (knowledge bases, databases, or text) to enhance their ability to understand and generate relevant content.

Fine-tuning and Representation Fine-Tuning (ReFT)

  • Fine-tuning is a way to adapt a pre-trained language model to a specialized domain by further training on domain-specific data.
  • ReFT is a new family of fine-tuning techniques that modify the relevant hidden representations of the language model for the target task, providing a more efficient alternative to traditional fine-tuning methods.

Directions and Challenges

  • New directions in GenAI include multimodal models, "embodied" language models, sparse and modular architectures, and specialized hardware.
  • Challenges and issues include context mismatch, input chunking, hallucinations, and the need for computational resources to scale these models.

The Centrality of Data

  • Data is the driving force behind all these advancements and applications, emphasizing the importance of data organization, governance, curation, and responsible usage.
  • Continuous learning and exploration of new data-driven technologies and techniques are essential for staying up-to-date and contributing to the field.
