NoSQL and Technical Debt

So I just had a chance encounter with a former professor of mine, Dr. Hicks from Trinity University.

While I was at Trinity University, mumble mumble long time ago, Dr. Hicks assured me that I needed to take his database class, the course at Trinity CS department that was the most focused on SQL. That class had a sterling reputation, it was regarded as difficult, but practical and comprehensive. Dr. Hicks is a good-natured and capable teacher, and approached the topic with both sophistication, enthusiasm and affection. But databases and SQL is a complicated topic and no amount of nice-professoring is going to make it easier. It was a hard class and at the time, avoiding a hard class seemed like the right decision for me.

I did not take this class, which I profoundly regret. Not a moralizing regret, of course, more of a “that decision cost me” regret. Since leaving Trinity, I have had to teach myself database design, database querying, normalization, un-normalization, query optimization, data importing and in-SQL ETL processes.

Over the last 10 years, I estimate that I have spent at least 1000 hours teaching myself these methods outside of a single class. In terms of missed opportunity dollars, as well as simple compensation, I am sure it has cost me upwards of $100k (this is what happens when you as an entrepreneur… when you take too long to do some work, you suffer as both the client and the programmer). I really wish someone had taken me through the basics, so that I would have only had to teach myself the advanced database topics that I only apply more rarely. As it is, I have lots of legacy code that I am moderately embarrassed by. Not because it is bad code, but because it is code that I wrote to solve problems that are well-solved in a generic manner inside almost all modern database engines.

Dr. Hicks also mentioned to me that he was deliberating how much he should consider including NoSQL technologies in his class. He indicated that students regarded the NoSQL topics as more modern and valuable, and regarded the SQL topics with some distaste.

This prompted the following in-person rant from me on Technical Debt which I thought might be interesting to my readers (why are you here again?) and perhaps some of Dr. Hicks potential students.

His students made the understandable but dangerous error of seeing a new type of technology as exclusively a progression from an old one. NoSQL is not a replacement for SQL, it is solving a different problem in a different way. Both helicopters and airplanes fly, but they do so in different ways that are optimized to solve different problems. They have different benefits and handicaps.

The right way to think about NoSQL is as an excellent solution to the scaling problem, at the expense of supporting complex queries. NoSQL is very simple to query effectively, precisely because complex queries are frequently impossible at the database level. SQL is much harder to learn to query, because one must understand almost everything about the underlying data structures as well as the way the SQL language works, before query writing can even start.

Most of the time, the right way to think about any technology is:

  • What does this make easy?
  • What does this make hard?
  • What does mistakes does this encourage?
  • What mistakes does it prevent?

Almost all modern programming languages are grappling with the “powerful tool I can use to shoot myself in the foot” problem. Computer Science shows us that any Turing-complete programming language can fully emulate any other programming language. So you can use fortran to make web pages if you want, but php makes it easy to make web pages. You can use php to do data science, but R makes that easy. And of course, languages like python seek to be “pretty good” at everything.

NoSQL tends to encourage a very specific and dangerous type of technical debt, because it enables a programmer to skip upfront data architecture, in favor of “store it in a pile and sort it out later”. This is roughly equivilent to storing all of your clothes, clean and otherwise, in large piles on the floor. Adulthood usually means using closets and dressers, since ordering your clothes storage has roughly the same benefit of arranging your data storage.

SQL forces several good habits that avoid the “pile on the floor effect”. To use SQL you have to ask yourself, as an upfront task:

  • What does my data look like now?
  • What relationships are represented in my data?
  • How and Why will I need to query this data in the future?
  • How much data of various types do I expect to get?
  • What will my data look like in the future?

With NoSQL, you get to defer these decisions. With NoSQL you get to just through the data on the pile, in a very generic manner, and later you figure out how you want to use the data. Because the underlying emphasis of NoSQL on scaling, you can be sure that you can defer these decisions without losing data.  If all you need to do is CRUD, at scale, and data analysis is secondary, NoSQL can be ideal. Most of the time, however, data analysis is critical to the operation of an application. When you have to have both scaling and data analysis… well, that is a true data science topic… there is no bottom in that pond.

SQL is not the only querying language that enforces this discipline. Neo4J has created a query language called Cypher, that has many of the same underlying benefits of the SQL language, but is designed for querying graph structures, rather than. Unlike traditional NoSQL databases, Neo4J enforces upfront thinking about data structures, much like a SQL database, it just uses a different underlying data structure: A graph instead of a table. In fact, with time, having experience with both SQL and Graph databases, I have started to understand when the data I am working with “wants” to be in a graph database, vs a traditional tabular SQL database. (Hint: If everything is many-to-many and shortest path or similar things matter… then you probably want a graph database)

Indeed, it is not a requirement that you forgo the efforts to create careful data structures in NoSQL languages. NoSQL experts very quickly realize that using schema’s for data is a good idea, even if doing so is not enforced by the engine.

The key underlying concept of the Technical Debt metaphor is that a programmer must consciously make decisions about how much Debt to incur, in order to avoid the crisis of software that requires so much eventual maintenance, that no further progress can be made on it. Essentially, there is something like “software design bankruptcy” that we should stay far away from.

Like financial debt, bankruptcy is not actually the worst state to reach with technical debt.  The worst state, in both finances and technology, is poverty created and sustained by interest payments. What people sometimes call “debt slavery”. Another state to avoid is taking on no debt at all. Debt is a ready source of capital, and can be used to dramatically accelerate both technical and financial progress.

Also like real life, most individuals manage debt poorly, and the few individuals who learn to use debt wisely have a significant advantage.

But the first step to managing debt wisely is to be recognize when you are taking debt on, and to ensure that it is done with intention and forethought. Make no mistake, forging ahead without designing your data structures is a kind of hedonism, nor dissimilar from those who choose to purchase drinks they cannot afford on their credit card.

If you are looking forward to career with data, not learning SQL is a technical debt equivalent of taking a payday loan. By learning SQL carefully, you will learn to forecast and plan your data strategy, which in many cases is at the heart of your application. Even if you abandon SQL for the sake of another database with some some other benefit later on, the habits you learn from careful data structure planning will always be valuable. Even if you never actually use a SQL database in your career.

HTH,

-FT