The rise of MPP platforms — Comparing SMP to MPP Architecture
This post explains the major differences between SMP and MPP platforms, along with their appropriate use cases, pros, and cons.
Symmetric Multi-Processor (SMP) Architecture
“Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes” reference
Let’s start with some history on analytical databases
Relational databases (e.g., SQL Server, Oracle, DB2) were used both as Online Transactional Processing (OLTP) databases (to support applications) and as Online Analytical Processing (OLAP) databases (for analytical use cases).
By the way, the main difference between the two data models is that in OLAP we build a denormalized model (facts and dimensions) in a star or snowflake schema, while an OLTP model is normalized (3NF at least) using reference, transaction, and bridge tables.
In those days, organizations that wanted to improve the performance of their analytical database had only one option: scale up by adding more CPU, RAM, and disk I/O capacity.
Characteristics of SMP architecture
- In SMP every processor shares a single copy of the operating system.
- SMP architecture is a tightly coupled multiprocessor system.
- SMP scales up: the system grows by buying a bigger server.
- In SMP, resources such as the bus, memory, and the I/O system are shared by all processors.
Benefits of SMP architecture
- Network speed: Since all the components sit in the same server, there is no network latency between them.
- Enforcement of constraints: Primary key, foreign key, and unique constraints are easily maintained when all the data sits on the same server.
- Seamless integration of components: Because CPU, memory, and I/O are tightly coupled in a single server, there are fewer components that can fail.
- Data consistency: We can implement the ACID properties of relational databases in full. Data written to the database is checked in real time for validity and integrity. That is also why we can implement database triggers.
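To make the constraint and consistency points concrete, here is a minimal sketch using Python's built-in `sqlite3` module. An in-memory SQLite database stands in for a single-server relational database; the `customers` and `orders` tables and the `try_insert_order` helper are hypothetical names made up for this illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for a single-server relational database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.commit()


def try_insert_order(order_id, customer_id, amount):
    """Insert one order; return True on success, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
                (order_id, customer_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert_order(1, 1, 25.0)       # valid row
dup = try_insert_order(1, 1, 30.0)      # duplicate primary key
orphan = try_insert_order(2, 99, 10.0)  # customer 99 does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, dup, orphan, order_count)
```

Because every row lives in the same database, the engine can check each write against all existing data before committing it, and the two invalid inserts are rejected immediately.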
What are the issues with this SMP architecture?
- Performance: Even with the best data model, most SMP relational databases struggle to keep up with data growth. There is only so much CPU you can add to a single server, and this limit constrains performance.
- Scalability: We can only scale up to a limited extent. As a rule of thumb, any data size above 15 TB is at risk of major performance issues. It is also hard to find a single server that can host a petabyte of storage.
- Cost: The more you spend on faster processors and bigger memory, the more expensive the system becomes.
- Single point of failure: If one processor fails, the entire server can become unusable. This creates a maintenance nightmare.
- Elasticity: If I want to add more storage, chances are the entire server needs to be brought down, and a lot of work is needed to reconfigure the hardware and software.
When to use an SMP Database
- Great as a Database for monolithic applications
- Maintenance of data integrity and constraints is a necessity
- Data size is less than 4 TB
Massively Parallel Processing (MPP) Database Architecture
An MPP database is a database that is optimized for parallel processing, so that many operations are performed by many processing units at the same time.
MPP (massively parallel processing) is the coordinated processing of a program by multiple processors, each working on a different part of the program. Each processor has its own operating system and memory.
Imagine I wanted to count the number of pages in a book. I can do this quickly if I split the work by chapter and assign each chapter to a node. Each node counts the pages in its chapter and sends the result to a parent node, which adds the counts together. All of this happens in parallel. You can also see the power this gives us: we can now scale our processing power by adding more nodes to the cluster.
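The book analogy can be sketched in a few lines of Python. This is only an illustration: threads in one process stand in for nodes, and the `book` dictionary with its chapter sizes is made-up data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical book: chapter name -> list of pages (each page is just a placeholder string).
book = {
    "chapter_1": ["page"] * 12,
    "chapter_2": ["page"] * 30,
    "chapter_3": ["page"] * 8,
    "chapter_4": ["page"] * 21,
}


def count_pages(chapter_pages):
    """Work done by one 'node': count only the pages it was given."""
    return len(chapter_pages)


def count_book_pages(book, max_nodes=4):
    """Scatter one chapter to each worker, then gather and sum the partial counts."""
    with ThreadPoolExecutor(max_workers=max_nodes) as pool:
        partial_counts = list(pool.map(count_pages, book.values()))
    return sum(partial_counts)  # the 'parent node' accumulates the results


total_pages = count_book_pages(book)
print(total_pages)  # 71
```

In a real MPP database the workers are separate servers with their own disks, but the scatter and gather shape of the computation is the same.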
Characteristics of MPP Architecture
- MPP uses a shared-nothing architecture.
- In MPP each processor works on a different part of the task.
- Each processor has its own set of disks.
- Each node is responsible for processing only the rows on its own disks.
- Scaling is easy: just add nodes.
- Data is horizontally partitioned and can be heavily compressed.
- MPP processors communicate with each other using some form of messaging interface.
- In MPP each processor uses its own operating system (OS) and memory.
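A rough sketch of what "horizontally partitioned" and "each node processes only its own rows" mean in practice. Everything here is an assumption for illustration: the `sales` rows, the three-node cluster, and the CRC32-based `node_for` function are not taken from any particular MPP product.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size

# Hypothetical fact rows: (customer, amount)
sales = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1), ("dave", 4)]


def node_for(key, num_nodes=NUM_NODES):
    """Pick a node from the distribution key (CRC32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes=NUM_NODES):
    """Horizontally partition rows: every row is stored on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for customer, amount in rows:
        nodes[node_for(customer, num_nodes)].append((customer, amount))
    return nodes


def local_sum(node_rows):
    """Aggregation done by a single node over only the rows it stores."""
    totals = defaultdict(int)
    for customer, amount in node_rows:
        totals[customer] += amount
    return dict(totals)


def total_by_customer(rows):
    """Coordinator: merge the per-node results into one answer."""
    merged = {}
    for node_rows in distribute(rows):
        # Safe to merge with update(): all rows for a customer live on one node.
        merged.update(local_sum(node_rows))
    return merged


totals = total_by_customer(sales)
print(totals)
```

Distributing on the same column that the query groups by means no rows have to move between nodes during the aggregation, which is why the choice of distribution key matters so much in MPP systems.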
Advantages of MPP Architecture
- Performance: For well-distributed workloads, the speed of computation grows close to linearly with the number of nodes. The more nodes you have, the faster aggregations and computations over the entire dataset run.
- Scalability: We can scale out far beyond the limits of a single server. By adding more nodes to our architecture we can scale out our MPP database to store and process larger volumes of data.
- Cost: We don’t need to buy the most expensive hardware to accomplish the task. Since we are adding more nodes, it is easier to handle more data with less expensive hardware.
- No single point of failure: If one node fails, the other nodes can keep functioning and support database activity while it is being repaired.
- Elasticity: If I want to add more nodes to the database, it is easy to do so without making the entire cluster unavailable.
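The performance point above is best read as an upper bound. One common way to reason about it is Amdahl's law, which I am bringing in here as an illustration: if some fraction of a query cannot be parallelized (for example, a final merge on a coordinator node), the speedup flattens as nodes are added. The 5% serial fraction below is an assumed number, not a measurement.

```python
def amdahl_speedup(nodes, serial_fraction):
    """Speedup predicted by Amdahl's law for a given number of nodes.

    serial_fraction is the share of the work that cannot run in parallel.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


ASSUMED_SERIAL_FRACTION = 0.05  # illustrative assumption only

for n in (1, 2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(n, ASSUMED_SERIAL_FRACTION), 2))
```

With no serial work at all the speedup is exactly the number of nodes. With even a small serial share, each extra node helps a little less than the previous one.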
Disadvantages of MPP
- Network speed: Since all the nodes are connected by a network fabric, some latency is introduced, though it is usually small.
- Enforcement of constraints: Primary key, foreign key, and unique constraints are often not enforced, because no single node holds all of the data. Validating integrity across nodes is expensive, so it usually has to be handled elsewhere, for example in the load pipeline.
- Seamless integration of components: We sometimes experience more network or system failures because of the multi-node configuration of the MPP architecture.
- Data consistency: In MPP we trade immediate consistency for partition tolerance. As a result, we cannot always guarantee that every node has processed the same data at the same moment, for example when there are network issues.
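The constraint point above can be shown with a toy model. Each node is just a Python set of keys and can only check uniqueness against what it stores itself. The `load` function and both placement strategies are hypothetical and are not modeled on any specific product.

```python
def round_robin(position, key, num_nodes):
    """Place rows in arrival order, ignoring the key."""
    return position % num_nodes


def by_key(position, key, num_nodes):
    """Place rows by key, so equal keys always land on the same node."""
    return key % num_nodes


def load(keys, num_nodes, choose_node):
    """Insert keys; each node checks uniqueness only against its own data."""
    nodes = [set() for _ in range(num_nodes)]
    accepted = []
    for position, key in enumerate(keys):
        target = nodes[choose_node(position, key, num_nodes)]
        if key in target:
            accepted.append(False)  # duplicate caught by a local check
        else:
            target.add(key)
            accepted.append(True)
    return accepted


incoming = [101, 102, 103, 101]  # the last key is a duplicate

rr_result = load(incoming, 2, round_robin)
key_result = load(incoming, 2, by_key)
print(rr_result)   # [True, True, True, True]  -> duplicate slips through
print(key_result)  # [True, True, True, False] -> duplicate rejected
```

With round-robin placement the two copies of key 101 end up on different nodes, so neither node sees a conflict. A local check is only enough when the data is distributed on the constrained column; otherwise every insert would need a cross-node lookup.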
The best use case for an MPP database is a modern, large-scale data warehouse for analytics.
Examples of MPP databases: Snowflake, Azure Synapse, Netezza, Teradata, Redshift
I will delve into the various differences in architecture between some of the most popular MPP databases in a later post.