Data Management (DM)
Each area of computer science can be described as “The study of algorithms and data structures to …” In this case the blank is filled in with “deal with data sets too large to fit in primary memory.”
Since the mid-1970’s this has meant an almost exclusive study of relational database systems. Depending on institutional context, students have studied, in varying proportions:
- Data Modeling: e.g. E-R Data model, relational model, normalization theory
- Query construction: e.g. Relational algebra, SQL
- Query processing: e.g. indices (B+tree, hash), algorithms (external sorting, select-project-join), query optimization (transformations, index selection)
- DBMS internals: e.g. concurrency/locking, transaction management, buffer management
Today’s graduates are expected to possess DBMS user (as opposed to developer) skills. These primarily include data modeling and query construction; be able to take an unorganized collection of data, organize it using a DBMS, and access/update the collection via queries.
Additionally, students need to study:
- The role data plays in an organization. This includes
- The Data Life Cycle: Creation-Processing-Review/Reporting-Retention/Retrieval-Destruction.
- The social/legal aspects of data collections: e.g. scale, data privacy, database privacy (compliance) by design, anonymity, ownership, reliability, intended and unintended applications.
- Emerging and advanced technologies that are augmenting/replacing traditional relational systems, particularly those used to support (big) data analytics: NoSQL (e.g. JSON, XML, Key-Value store databases), cloud computing, MapReduce, dataframes.
We note that DM is related to fundamental information security concepts that are described in the Information Assurance and Security (IAS) topic area, IAS/Fundamental Concepts. Furthermore, see the ACM Data Science Curriculum for information about Data Mining.
Knowledge Units:
DM/The Role of Data |
DM/DBMS Internals |
DM/Core Database Systems Concepts |
DM/NoSQL Systems |
DM/Data Modeling |
DM/Distributed Databases/Cloud Computing |
DM/Relational Databases |
DM/Semi-structured and Unstructured Databases |
| DM/Query Construction | DM/Emerging Topics
|
| DM/Query Processing |
DM/The Role of Data
- The Data Life Cycle: Creation-Processing-Review/Reporting-Retention/Retrieval-Destruction.
- The social/legal aspects of data collections:
- scale
- data privacy
- database privacy (compliance) by design
- anonymity
- ownership
- reliability
- intended and unintended applications.
DM/Core Database System Concepts
- Purpose and advantages of database systems
- Components of database systems
- Design of core DBMS functions (e.g., query mechanisms, transaction management, buffer management, access methods)
- Database architecture, data independence, and data abstraction
- Use of a declarative query language
- Systems supporting structured and/or stream content
- Approaches for managing large volumes of data (e.g., noSQL database systems, use of MapReduce).
DM/Data Modeling
- Data modeling
- Conceptual models (e.g., entity-relationship, UML diagrams)
- Spreadsheet models
- Relational data models
- Object-oriented models (cross-reference PL/Object-Oriented Programming)
- GraphQL
- Semi-structured data model (expressed using DTD, XML, or JSON Schema, for example)
- What is new in SQL:2016
- Specialized Data Modeling topics
- Time series data (aggregation, and join)
- Graph data (link traversal)
- Materialized Views and Special data structures like (Hyperloglog, bitmap, …)
- Typically querying “Raw time series data” directly is very slow for things like “avg daily price”, “daily unique count”, “daily membership”
- Geo-Spatial data
DM/Relational Databases
- Mapping conceptual schema to a relational schema
- Entity and referential integrity
- Relational database design
- Physical database design: file and storage structures
- Functional dependency
- Candidate keys, superkeys, and closure of a set of attributes
- Normalization Theory
- Decomposition of a schema; lossless-join and dependency-preservation properties of a decomposition
- Normal forms (BCNF)
- Multi-valued dependency (4NF)
- Join dependency (PJNF, 5NF)
- Representation theory
DM/Query Construction
- Relational Algebra
- Relational Calculus
- SQL
- Data definition including integrity and other constraints specification
- Query formulation
- Update sublanguage
- QBE and 4th-generation environments
- Different ways to invoke non-procedural queries in conventional languages
- Introduction to other major query languages (e.g., XPATH, SPARQL)
- Stored procedures
DM/Query Processing
- Index structures
- B+ trees
- Hash indices: static and dynamic
- Index creation in SQL
- Algorithms for query operators
- External Sorting
- Selection
- Projection; with and without duplicate elimination
- Natural Joins: Nested loop, Sort-merge, Hash join
- Analysis of algorithm efficiency
- Query transformations
- Query optimization
- Access paths
- Query plan construction
- Selectivity estimation
- Index-only plans
- Database tuning: Index selection
- Impact of indices on query performance
DM/DBMS Internals
- Buffer Management
- Transaction Processing
- Isolation Levels
- ACID
- Serializability
- Concurrency Control:
- Locking
- Deadlocks: Detection and Prevention strategies
- Optimistic CC
- Timestamp CC
- Recovery Manager
- Relation with Buffer Manager
- Write-Ahead logging
- ARIES recovery system (Analysis, REDO, UNDO)
- Interaction of transaction management with storage, especially buffering
DM/NoSQL Systems
- Why NoSQL? (e.g. Impedance mismatch between Application and RDBMS)
- Key-Value and Document data model
- Storage system (e.g. Key-Value system )
- Distribution Models (Sharding and Replication)
- Consistency Models (Update and Read, Quorum consistency, CAP theorem)
- Processing model (e.g. Map-Reduce, multi-stage and incremental map-reduce)
- Case Studies: Cloud storage system (e.g. S3); Graph databases ; “When not to use NoSQL”
DM/Distributed Databases/Cloud Computing
- Distributed DBMS
- Distributed data storage
- Distributed query processing
- Distributed transaction model
- Homogeneous and heterogeneous solutions
- Client-server distributed databases (cross-reference SF/Computational Paradigms)
- Parallel DBMS
- Parallel DBMS architectures: shared memory, shared disk, shared nothing;
- Speedup and scale-up, e.g., use of the MapReduce processing model (cross-reference CN/Processing, PD/Parallel Decomposition)
- Data replication and weak consistency models
DM/Semi-structured and Unstructured Databases
- Vectorized unstructured data (text, video, audio, …) and vector storage
- TF-IDF Vectorizer with ngram
- Word2Vec Word2vec – Wikipedia
- Array database or array data type handling
- semi-structured (e.g. JSON, XML)
- Storage
- Encoding and compression of nested data types
- Indexing
- Btree, skip index, Bloom filter
- Inverted index and bitmap compression
- Space filling curve indexing for semi-structured geo-data
- Query processing for OLTP and OLAP use cases
- Insert, Select, update/delete trade offs
- Case studies on Postgres/JSON, MongoDB and Snowflake/JSON
- Storage
DM/Emerging Topics
- Data As a Service
- DataOps
- Analytics at the Edge.
- Graph DBs.
- Augmented Data Management: ML for self-tuning purposes