When considering the technologies required to tackle Big Data, it is only natural to look at the database management system first. Many of the most widely used databases are already optimized to store and handle large data volumes. For some years now, systems based on the relational model have been used successfully both in industry and in research environments. However, the threshold implied by the term “Big Data” entails, above all, a paradigm shift in the information management model.
While databases based on the relational model guarantee certain properties that at first sight might seem more important or even necessary (the famous ACID properties: atomicity, consistency, isolation, and durability), certain volumes can no longer be handled without relaxing some of them. It is precisely out of this relaxation, and out of the need to provide other properties, that a new data management paradigm has arisen: NoSQL.
The properties to be provided by these new systems, in particular in an Internet environment, are:
- High availability
- Fault tolerance
- Large storage capacity
- High input/output capacity
To provide all this, ACID is usually not guaranteed, and the system has a distributed architecture. In addition, the data access interface is less expressive than SQL (hence the name NoSQL), given that the complexity of the data schema is much lower.
Even though the most modern systems allow these guarantees to be configured, most of them must be given up when optimum performance is required. For example, strong consistency is usually not guaranteed, because in a distributed architecture changes must be propagated among the various machines. Enforcing the ACID properties would impose a performance burden that would make the system suboptimal for the most common Big Data scenarios.
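To make the point about propagating changes more concrete, here is a minimal Python sketch of a store that acknowledges a write as soon as one replica has it and only later copies it to the others. It is a toy model under our own assumptions, not how any particular database works: the class names, the `propagate` method, and the node names are all hypothetical.

```python
from collections import deque


class Replica:
    """A single node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on one replica and reach the others later."""

    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = deque()  # updates not yet applied on every replica

    def write(self, replica_name, key, value):
        # Acknowledge as soon as one replica has the value (fast, but not
        # strongly consistent), and queue the update for the other replicas.
        self.replicas[replica_name].data[key] = value
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, value))

    def read(self, replica_name, key):
        # Reads are served locally, so they may return stale data.
        return self.replicas[replica_name].data.get(key)

    def propagate(self):
        # Stand-in for background replication between machines.
        while self.pending:
            name, key, value = self.pending.popleft()
            self.replicas[name].data[key] = value


store = EventuallyConsistentStore(["node_a", "node_b"])
store.write("node_a", "user:1", "Alice")
stale = store.read("node_b", "user:1")  # the update has not arrived yet
store.propagate()
fresh = store.read("node_b", "user:1")  # the replicas have converged
print(stale, fresh)
```

Between the write and the call to `propagate`, a reader on the second node sees the old state. Closing that window would mean waiting for every replica before acknowledging the write, which is exactly the performance cost described above.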
Nowadays a variety of distributed database systems exist, each emphasizing different aspects. In fact, there is a theorem about distributed systems, the CAP theorem, which groups them clearly. According to this theorem, a distributed system cannot fully provide the following three attributes at the same time: Strong Consistency, High Availability, and Partition Tolerance. Since a distributed database can satisfactorily provide two of these attributes, these systems can be grouped by the pair they provide. The following grouping shows some examples:
As can be seen, the group of systems that meet the availability and consistency requirements includes those derived from the relational model. On the right-hand side are those inspired by Amazon Dynamo, and in the lower part the descendants of Google Bigtable. This classification can be very useful when tackling Big Data problems. In our next posts we will discuss some of these systems in depth and compare their features in practice.
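As a rough illustration of the trade-off behind the CAP grouping, the Python sketch below models two replicas of a single value and a network partition between them. In a hypothetical "CP" mode the system refuses writes during the partition and so gives up availability. In "AP" mode it accepts them and lets the replicas diverge. The `PartitionedPair` class and the `run` helper are our own illustrative names, and the model deliberately leaves out the consistency-plus-availability group, which in effect assumes no partition occurs.

```python
class PartitionedPair:
    """Two replicas of one value, with a switchable network partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"left": None, "right": None}
        self.partitioned = False

    def write(self, side, value):
        other = "right" if side == "left" else "left"
        if not self.partitioned:
            # Healthy network: both replicas are updated together.
            self.values[side] = value
            self.values[other] = value
            return True
        if self.mode == "CP":
            # Favour consistency: refuse the write rather than diverge.
            return False
        # Favour availability: accept locally, so the replicas now disagree.
        self.values[side] = value
        return True

    def consistent(self):
        return self.values["left"] == self.values["right"]


def run(mode):
    """Write once normally, then once more during a partition."""
    pair = PartitionedPair(mode)
    pair.write("left", "v1")
    pair.partitioned = True
    accepted = pair.write("left", "v2")
    return accepted, pair.consistent()


cp_result = run("CP")  # (write accepted?, replicas still agree?)
ap_result = run("AP")
print(cp_result, ap_result)
```

In this sketch the CP run rejects the second write but keeps both replicas in agreement, while the AP run accepts it and leaves the replicas disagreeing until the partition heals.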