DM – SIGCSE 2022 Version

Data Management (DM)

Each area of computer science can be described as “The study of algorithms and data structures to …” In this case the blank is filled in with “deal with data sets too large to fit in primary memory.”

Since the mid-1970’s this has meant an almost exclusive study of relational database systems. Depending on institutional context, students have studied, in varying proportions:

Data Modeling: e.g. E-R Data model, relational model, normalization theory
Query construction: e.g. Relational algebra, SQL
Query processing: e.g. indices (B+tree, hash), algorithms (external sorting, select-project-join), query optimization (transformations, index selection)
DBMS internals: e.g. concurrency/locking, transaction management, buffer management

Today’s graduates are expected to possess DBMS user (as opposed to developer) skills. These primarily include data modeling and query construction; be able to take an unorganized collection of data, organize it using a DBMS, and access/update the collection via queries.

Additionally, students need to study:

The role data plays in an organization. This includes
- The Data Life Cycle: Creation-Processing-Review/Reporting-Retention/Retrieval-Destruction.
- The social/legal aspects of data collections: e.g. scale, data privacy, database privacy (compliance) by design, anonymity, ownership, reliability, intended and unintended applications.
Emerging and advanced technologies that are augmenting/replacing traditional relational systems, particularly those used to support (big) data analytics: NoSQL (e.g. JSON, XML, Key-Value store databases), cloud computing, MapReduce, dataframes.

We note that DM is related to fundamental information security concepts that are described in the Information Assurance and Security (IAS) topic area, IAS/Fundamental Concepts. Furthermore, see the ACM Data Science Curriculum for information about Data Mining.

Knowledge Units:

DM/The Role of Data	DM/DBMS Internals
DM/Core Database Systems Concepts	DM/NoSQL Systems
DM/Data Modeling	DM/Distributed Databases/Cloud Computing
DM/Relational Databases	DM/Semi-structured and Unstructured Databases
DM/Query Construction	DM/Emerging Topics
DM/Query Processing

DM/The Role of Data

The Data Life Cycle: Creation-Processing-Review/Reporting-Retention/Retrieval-Destruction.
The social/legal aspects of data collections:
- scale
- data privacy
- database privacy (compliance) by design
- anonymity
- ownership
- reliability
- intended and unintended applications.

DM/Core Database System Concepts

Purpose and advantages of database systems
Components of database systems
Design of core DBMS functions (e.g., query mechanisms, transaction management, buffer management, access methods)
Database architecture, data independence, and data abstraction
Use of a declarative query language
Systems supporting structured and/or stream content
Approaches for managing large volumes of data (e.g., noSQL database systems, use of MapReduce).

DM/Data Modeling

Data modeling
Conceptual models (e.g., entity-relationship, UML diagrams)
Spreadsheet models
Relational data models
Object-oriented models (cross-reference PL/Object-Oriented Programming)
- GraphQL
Semi-structured data model (expressed using DTD, XML, or JSON Schema, for example)
- What is new in SQL:2016
Specialized Data Modeling topics
- Time series data (aggregation, and join)
- Graph data (link traversal)
- Materialized Views and Special data structures like (Hyperloglog, bitmap, …)
  - Typically querying “Raw time series data” directly is very slow for things like “avg daily price”, “daily unique count”, “daily membership”
- Geo-Spatial data

DM/Relational Databases

Mapping conceptual schema to a relational schema
Entity and referential integrity
Relational database design
Physical database design: file and storage structures
Functional dependency
- Candidate keys, superkeys, and closure of a set of attributes
Normalization Theory
- Decomposition of a schema; lossless-join and dependency-preservation properties of a decomposition
- Normal forms (BCNF)
- Multi-valued dependency (4NF)
- Join dependency (PJNF, 5NF)
- Representation theory

DM/Query Construction

Relational Algebra
Relational Calculus
SQL
- Data definition including integrity and other constraints specification
- Query formulation
- Update sublanguage
QBE and 4th-generation environments
Different ways to invoke non-procedural queries in conventional languages
Introduction to other major query languages (e.g., XPATH, SPARQL)
Stored procedures

DM/Query Processing

Index structures
- B+ trees
- Hash indices: static and dynamic
- Index creation in SQL
Algorithms for query operators
- External Sorting
- Selection
- Projection; with and without duplicate elimination
- Natural Joins: Nested loop, Sort-merge, Hash join
- Analysis of algorithm efficiency
Query transformations
Query optimization
- Access paths
- Query plan construction
- Selectivity estimation
- Index-only plans
Database tuning: Index selection
- Impact of indices on query performance

DM/DBMS Internals

Buffer Management
Transaction Processing
- Isolation Levels
- ACID
- Serializability
Concurrency Control:
- Locking
- Deadlocks: Detection and Prevention strategies
- Optimistic CC
- Timestamp CC
Recovery Manager
- Relation with Buffer Manager
- Write-Ahead logging
- ARIES recovery system (Analysis, REDO, UNDO)
Interaction of transaction management with storage, especially buffering

DM/NoSQL Systems

Why NoSQL? (e.g. Impedance mismatch between Application and RDBMS)
Key-Value and Document data model
Storage system (e.g. Key-Value system )
Distribution Models (Sharding and Replication)
Consistency Models (Update and Read, Quorum consistency, CAP theorem)
Processing model (e.g. Map-Reduce, multi-stage and incremental map-reduce)
Case Studies: Cloud storage system (e.g. S3); Graph databases ; “When not to use NoSQL”

DM/Distributed Databases/Cloud Computing

Distributed DBMS
- Distributed data storage
- Distributed query processing
- Distributed transaction model
- Homogeneous and heterogeneous solutions
- Client-server distributed databases (cross-reference SF/Computational Paradigms)
Parallel DBMS
- Parallel DBMS architectures: shared memory, shared disk, shared nothing;
- Speedup and scale-up, e.g., use of the MapReduce processing model (cross-reference CN/Processing, PD/Parallel Decomposition)
- Data replication and weak consistency models

DM/Semi-structured and Unstructured Databases

Vectorized unstructured data (text, video, audio, …) and vector storage
- TF-IDF Vectorizer with ngram
- Word2Vec Word2vec – Wikipedia
- Array database or array data type handling
semi-structured (e.g. JSON, XML)
- Storage
  - Encoding and compression of nested data types
- Indexing
  - Btree, skip index, Bloom filter
  - Inverted index and bitmap compression
  - Space filling curve indexing for semi-structured geo-data
- Query processing for OLTP and OLAP use cases
  - Insert, Select, update/delete trade offs
  - Case studies on Postgres/JSON, MongoDB and Snowflake/JSON

DM/Emerging Topics

Data As a Service
DataOps
Analytics at the Edge.
Graph DBs.
Augmented Data Management: ML for self-tuning purposes

Data Management (DM)

DM/The Role of Data

DM/Core Database Systems Concepts

DM/Data Modeling

DM/Relational Databases

DM/The Role of Data

DM/Core Database System Concepts

DM/Data Modeling