On the Event-Horizon of a Data-Centric Future

 

Database science began in 1970 at IBM’s San Jose Research Laboratory with the publication of groundbreaking research by Dr. E.F. Codd describing the first relational database model.  Nearly a decade later, Relational Software Incorporated, the precursor to Larry Ellison’s empire known today as Oracle, released the first commercially available relational database management system (RDMS, more commonly abbreviated RDBMS) to use Structured Query Language (SQL) for data organization and retrieval.  Few material developments followed the original RDMS framework until a decade into the new millennium, when reality became ever more data-centric, enabled in part by the cloud, and more specifically by server-side web applications.  According to IBM, as of 2013 users worldwide generated 2.5 quintillion bytes of data every day (enough to fill roughly 78 million 32 GB iPads), and 90% of the world’s data had been created in the prior two years alone.  Telecommunications moved to internet protocol; social media emerged as a new medium for communication; content spread across a myriad of channels over the air and under the ground.  The result was a new paradigm, dubbed “NoSQL” in 2009 (read as “non-SQL” or “not only SQL,” and abbreviated here as NDMS, for non-structured database management systems): an unstructured alternative that bypasses the rigidity imposed by RDMS and SQL and facilitates rapid data collection on the fly, from Twitter hashtags to Facebook’s email search system.  A comprehensive examination of RDMS/SQL versus NDMS/NoSQL, beginning with a comparison of relative strengths and weaknesses in the context of business cases and applications, ultimately reveals a data-centric future on the event horizon.

A relational database management system built on Structured Query Language requires logically structured data, cataloged by unique identifiers that relate one uniform set of data to another, so that two different tables can be “zipped up” when needed.  Microsoft Excel offers a visual analogy for this model: worksheets play the role of RDMS tables, and workbooks the role of RDMS databases.  Excel data retrieval is not far removed from RDMS interactions.  Excel’s VLOOKUP function hinges on a unique identifier to retrieve column data from another worksheet in the same workbook that has relatable rows; similarly, an RDMS provides several kinds of JOIN, in lieu of VLOOKUP, to traverse tables.  Relatable unique identifiers define the very essence of an RDMS: an endless relationship of tables, with the underlying structures queried through SQL to organize and retrieve data.
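To make the VLOOKUP comparison concrete, here is a minimal sketch using Python’s built-in sqlite3 module.  The table names (customers, orders), the columns, and the sample rows are invented for illustration; the only point is that a shared unique identifier lets a JOIN pull a column from one table into the rows of another, much as VLOOKUP does between worksheets.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total       REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.5);
""")

# The JOIN plays the role of VLOOKUP: for each order row, look up the
# customer's name through the shared unique identifier customer_id.
joined_rows = conn.execute("""
    SELECT o.order_id, c.name, o.total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in joined_rows:
    print(row)
```

Each printed row pairs an order with the customer name found through customer_id, the same way a VLOOKUP formula would fill a name column on an orders worksheet.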

NoSQL database management systems intentionally avoid fixed logical structures and relational dependencies so that they can accommodate rapid data collection on the fly.  Picture a leaky house on a rainy day: buckets are placed indoors beneath the leaks to collect the dripping water.  That’s NoSQL; each bucket collects the same kind of “drips” of information from a certain area but shares no relation with the other buckets.  The NDMS approach is designed to bypass the strict relationships that would otherwise restrict dynamic management of data.  Unlike traditional relational databases (e.g. Oracle, MySQL, SQLite), NoSQL affords the freedom to group collections of data together, and each solution defines its own querying methodology, typically built around key-value pairs.  In essence, a NoSQL model resembles a hash table of key-value pairs, like a dictionary that stores the synonyms pertaining to a given word.  NoSQL databases are “relation-less” or “schema-less”; they are “not based on a single model…each database, depending on their target functionality, adopts a different one” (Digital Ocean).  NoSQL proliferated because RDMS scale poorly horizontally across a distributed system, that is, a collection of independent computers that appears to its users as a single coherent system.
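The bucket picture can be sketched in a few lines of Python.  This is a toy in-memory model, not the API of any real NoSQL product; the bucket names, keys, and record fields are all made up for illustration.

```python
# Each "bucket" is an independent collection of key-value pairs.
# Bucket names, keys, and record fields are hypothetical examples.
buckets = {}

def put(bucket, key, value):
    """Store a value under a key; no schema is declared or enforced."""
    buckets.setdefault(bucket, {})[key] = value

def get(bucket, key, default=None):
    """Fetch a value by key; a lookup never crosses into another bucket."""
    return buckets.get(bucket, {}).get(key, default)

# Records in the same bucket do not need to share the same fields.
put("hashtags", "#nosql", {"count": 3, "lang": "en"})
put("hashtags", "#sql", {"count": 7})
put("profiles", "alex", {"bio": "DBA", "followers": 120})

print(get("hashtags", "#nosql"))
print(get("profiles", "#nosql"))  # same key, different bucket: nothing found
```

Nothing ties the hashtags bucket to the profiles bucket, which is exactly the trade: writes are cheap and flexible, but any relationship between buckets has to be handled by the application.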

With the existing database management paradigms, RDMS and NDMS, defined, they can be scoped within the context of Eric Brewer’s CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.  In practice, the comparison may be reduced to the following database attributes: structure, querying, scalability, reliability, support, and application, as summarized below by Digital Ocean, a cloud infrastructure provider that provisions virtual servers for software developers:

  1. Structure: SQL/Relational databases require a structure with defined attributes to hold the data, unlike NoSQL databases which usually allow free-flow operations.
  2. Querying: Regardless of their licenses, relational databases all implement the SQL standard to a certain degree and thus, they can be queried using the Structured Query Language (SQL). NoSQL databases, on the other hand, each implement a unique way to work with the data they manage.
  3. Scaling: Both solutions are easy to scale vertically (i.e. by increasing system resources). However, being more modern (and simpler) applications, NoSQL solutions usually offer many simple means to scale horizontally.
  4. Reliability: When it comes to data reliability and safe guarantee of performed transactions, SQL databases are still the better bet.
  5. Support: Relational database management systems have decades long history. They are extremely popular, and it is very easy to find both free and paid support. If an issue arises, it is therefore much easier to solve than recently-popular NoSQL databases — especially if said solution is complex in nature.
  6. Data-Warehousing: By nature, relational databases are the go-to solution for complex querying and data keeping needs. They are much more efficient and excel in this domain, more so than NoSQL.
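The “safe guarantee of performed transactions” in the reliability point can be illustrated with a small sketch, again using Python’s built-in sqlite3 module.  The accounts table, the names, and the transfer helper are hypothetical; the sketch only shows that when one statement in a transaction fails, the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move funds atomically; True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first so that a failed debit leaves something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)       # within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, failed, balances(conn))
```

The second transfer credits bob and then fails on the debit, so the credit is undone as well; the balances reflect only the first transfer.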

Clearly, SQL and NoSQL have relative strengths and weaknesses: one may suit a business or use case where the other does not, so in essence they can be complementary frameworks.  It is fair to maintain, however, that SQL can perform NoSQL operations, albeit at compromised performance, whereas NoSQL cannot perform relational operations because it intentionally avoids an RDMS schema.  So what is the key difference (pardon the pun) between SQL relational database management systems and NoSQL database management systems?  An RDMS stores data in tabular form with defined relations between tables, while a NoSQL DBMS stores data as files or collections; tables are an option for a NoSQL DBMS, but there is no relation between the tables as there is in an RDMS.  Observing systems in practice helps determine which applications suit each.
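As a rough illustration of the claim that SQL can perform NoSQL-style operations, the sketch below emulates a key-value store with a single two-column SQLite table.  The kv table and the kv_put and kv_get helpers are hypothetical names, and the sketch says nothing about performance, which is where the trade-off lies.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# A single two-column table is enough to emulate a key-value store.
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

def kv_put(key, value):
    """Insert or overwrite a JSON-serialized value under the given key."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
        (key, json.dumps(value)),
    )

def kv_get(key):
    """Return the stored value, or None if the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

kv_put("#nosql", {"count": 3})
kv_put("#nosql", {"count": 4})      # overwrites the earlier value
kv_put("alex", ["dba", "blogger"])  # values need not share a shape
print(kv_get("#nosql"), kv_get("missing"))
```

Going the other way, from a pure key-value store to a JOIN across collections, would have to be written by hand in application code.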

Amazon DynamoDB, advertised as a “fast and flexible NoSQL database service for any scale; pay only for the throughput and storage you need,” uses eventual consistency to come close to providing all three CAP theorem properties.  According to Werner Vogels, CTO of Amazon, “Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system…designed to give its users the ability to trade-off cost, consistency and performance while maintaining high-availability.”  Vogels further explains that Dynamo is not directly exposed externally as a web service, but it does power parts of Amazon Web Services such as S3, a storage service that provides developers with secure, scalable cloud storage.
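Eventual consistency is easier to see in a toy model.  The sketch below is not how Dynamo or DynamoDB is implemented; the Replica and EventuallyConsistentStore classes, the single logical clock, and the explicit sync step are all simplifying assumptions made for illustration.  It shows only the core idea: a write is accepted right away by one replica, other replicas may briefly return stale data, and all replicas agree once replication catches up.

```python
class Replica:
    """One copy of the data; keeps the highest-versioned value per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, version, value):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    """Toy store: writes land on one replica and propagate later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # replication messages not yet delivered
        self.clock = 0     # simplistic logical clock used as the version

    def put(self, key, value):
        self.clock += 1
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0].write(key, self.clock, value)
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, self.clock, value))

    def get(self, key, replica_index):
        return self.replicas[replica_index].read(key)

    def sync(self):
        """Deliver all queued replication messages."""
        for replica, key, version, value in self.pending:
            replica.write(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("cart", ["book"])
stale_read = store.get("cart", replica_index=2)  # replication not done yet
store.sync()
fresh_read = store.get("cart", replica_index=2)  # replicas now agree
print(stale_read, fresh_read)
```

The store stays available for reads and writes throughout, and the price is the window in which a reader can see the old state.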

In addition to non-relational stores, Twitter uses MySQL, an SQL-based RDMS.  Since the company’s incorporation, MySQL has been one of Twitter’s key data storage technologies, storing data in hundreds of schemas; its largest cluster spans thousands of nodes serving millions of queries per second.  Twitter uses MySQL for “replication for fault-tolerance and read-scalability, [storing] a wide variety of data from commerce and ads to authentication, trends, internal services and more.”

Cloudera offers an enterprise distribution of Hadoop, a “non-conformist” type of database management.  The underlying technology was invented by Google to index the rich textual and structural information it was collecting and then present meaningful, actionable results to users.  This Google innovation was incorporated into Nutch, an open-source project from which Hadoop was later spun off, with significant contributions from Yahoo.  According to Mike Olson, Hadoop subscribes to relational database practices, as it was invented to “create cachet around a bunch of different projects, each of which has different properties and behaves in different ways.”

At the end of the day, the question becomes: what problem are you trying to solve in a given case?  Navy SEAL BUD/S training teaches candidates that the plan of attack is determined by the target, not the weapon.  RDMS may be better suited than non-RDMS for certain applications, and vice versa.  For data collection, organization, and retrieval, DBAs currently have three “weapons” of choice at their disposal: SQL, NoSQL, and the hybrid, Hadoop.  But what about the future of data collection?  The number of options available pales in comparison to the closed- and open-source frameworks available for web application programming.  Is this the age of post-modern database management, or is database science just getting started?  Perhaps the technology underlying Bitcoin, known as “blockchain,” will provide a viable database alternative for modern, high-transaction-volume applications.  Given the relative strengths and weaknesses of RDMS and NDMS, the ability of RDMS to do anything NDMS can do, and how heavily the hybrid framework Hadoop relies on SQL, it is fair to conclude that relational databases are far from obsolete and probably more relevant than ever as we approach the event horizon of a data-centric future.

 
