Basic Big Data Terminology
Data is called Big Data when there is no way to process it using a single computer (server). Big Data assumes processing by a network of computers (cluster).
Availability is defined as the proportion of time the system is in a functioning condition. High availability means that up and running time is as close to 100% as possible.
Scalability is the capability to handle growing amounts of data and growing number of database clients either by adding more hardware resources or by optimization and more efficient usage of the existing resources.
Vertical scaling means increasing the power of a single computer by adding more CPU cores, RAM, I/O controllers, disks, etc. It’s the easiest approach, but the most expensive and reaching its limits very quickly, because of physical limitations of how many hardware components you can put into single computer box.
Horizontal scaling means increasing the power by adding more computers, called nodes, connected with each other via computer network. Comparing to the vertical scaling, it provides higher scalability at lower costs by extensive adding of inexpensive hardware nodes of commodity level. The drawback is that it requires much more complex software, especially if full-fledged ACID transactions are required.
Clustering is often used as a synonym of horizontal scaling. However, in the context of relational databases, clustering usually means that database consists of several server nodes each having its own CPUs and RAM, but all of them share the same redundant array of disks via high-performance I/O controller (shared-disks architecture).
Partitioning means splitting a database into parts that are stored and processed at different server nodes. There is a horizontal partitioning that means separation of database tables by rows, and vertical partitioning that means separation of database tables by columns.
Sharding is a type of horizontal partitioning meaning that rows/objects of a database are automatically distributed across server nodes depending on a value of a particular column/field (or its hash). For example, user profiles in a database can be stored on different server nodes depending on their country.
Replication means maintaining redundant copies of the same data by copying database or its partitions from master node to secondary node(s). It makes it possible to scale a database by load-balancing clients between multiple nodes. However, the drawback is data consistency challenges - when data on a master node is changed and secondary nodes still have obsolete data until next replication cycle. It becomes even more complicated if full-fledged ACID transactions are required.
Federation means that database physically consists of multiple autonomous databases, possible of different nature and from different vendors. But logically the database is represented as a single unified database accessible by clients via a special mediator layer. The biggest challenge for federated database is transaction processing, which make mediator layer extremely complicated, and full-fledged ACID transactions are hard to achieve.