Selecting a data storage: how House, M.D. would do it

This article originated from a presentation I prepared for the OLX Product and Tech conference, which took place in Berlin in September 2018. The featured image is its title slide.

The reason I chose this topic for my presentation is that we had a bit of a struggle doing exactly this for a new service: we had a lot of freedom in database selection. And as a rule, having choices is good because, well, you have options. But imagine that you want to select between three sorts of ice cream. You have chocolate, vanilla and strawberry – it is probably easy, right?

But what if I tell you that you have not three, but thirty sorts of ice cream to choose from?

Our current database choices are firmly in the second category. How the hell do you make a selection when their name is legion?

This is where Uncle Bob comes to help.

Uncle Bob, or Robert Martin – the first result you get if you google “Uncle Bob” – is a software engineer, a writer and one of the authors of the Agile Manifesto. He wrote Clean Code, The Clean Coder, Clean Architecture and a few other books.

His thoughts on the matter: the database is an implementation detail.

What this means is that users don’t care how the data is stored or fetched. They don’t care how you query the data. They only care how the data is presented to them. Therefore, while you are doing the POC, or the MVP, or even in the later stages of application development, the decision about the data storage can be delayed.

How does one do that? Easy: by hiding the implementation behind an abstraction.

Every developer has heard this rule: program to the interface.

The interface is the specification of WHAT you are going to do; the implementation is the HOW. So, when you define an interface, you declare WHAT it does and also the inputs and the outputs. The rest is up to the implementation.

When you program like this, it means that changing the data storage solution that’s hidden behind the interface is just a question of replacing one implementation with another. Most often it will be just a few classes. The rest of the code can stay untouched.

In this paradigm, the first implementation you pick should just be something you can build as quickly and easily as possible. You can keep your data in flat files, in in-memory storage – anything.
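To make this concrete, here is a minimal sketch of the idea in Java. The DocumentStore interface, the in-memory class and all the names in it are hypothetical, invented for illustration rather than taken from our actual service:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// WHAT: the contract the rest of the application programs against.
interface DocumentStore {
    void save(String id, Map<String, String> properties);
    Optional<Map<String, String>> findById(String id);
}

// HOW (first, throwaway implementation): plain in-memory storage.
// Good enough for a POC; it can later be swapped for DynamoDB, Cassandra,
// Aurora or anything else without touching the calling code.
class InMemoryDocumentStore implements DocumentStore {
    private final Map<String, Map<String, String>> data = new ConcurrentHashMap<>();

    @Override
    public void save(String id, Map<String, String> properties) {
        data.put(id, Map.copyOf(properties));
    }

    @Override
    public Optional<Map<String, String>> findById(String id) {
        return Optional.ofNullable(data.get(id));
    }
}
```

Swapping the storage later means writing one more class that implements the same interface; the calling code keeps using the same two methods.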

House, M.D.

So, how does Dr. House, or House, M.D., fit into the picture? To explain that, we need to look at the way he works.

Dr. House is a diagnostician from a TV series. He is supposed to be brilliant because he can find out what is happening to patients with complicated diseases. And if you watch at least a few episodes of House, M.D., you will see his main method.

Throw stuff at the wall and see if it sticks.

Basically, what he does is he has a hypothesis about the patient, and he tries to prove it by giving the patient medicine that should work if this hypothesis is correct.  If the patient gets better, then all is good. If the patient gets worse, he tries the next hypothesis.

And he iterates.

So, with the help of Robert Martin, we saw that the database implementations can be replaced. This means that the process of selecting the database can also be iterative.

You try something, you see if it fits. If it does, you leave it as is. If it doesn’t, then you discard it and try something else.

To go through the process, you might need a set of criteria.

Here are some example criteria – the ones we actually used:

  • Capacity – how much data the solution can work with;
  • Query language – SQL is something a lot of people are familiar with;
  • AWS compatibility – because we are mostly backed by the AWS stack;
  • Development effort – how difficult it is to integrate this database into a Spring Boot application, which we were sure we would go with;
  • Infra effort – how difficult it is to set up and run this data storage;
  • Limitations – what the solution can’t do (that we need it to do).

Now we come to the practical example: the metadata service. The service we were implementing was supposed to extract, store and provide metadata about the files kept in the system. Most of those files are images.

The system currently holds 1 billion files, and more are uploaded daily, so it grows quite quickly. Each file can have 10-20 properties, which adds one more order of magnitude to the metadata volume.

First proposed solution: AWS DynamoDB

  • Key-value data storage.
  • Fully AWS-managed, automatically scaled and backed up.
  • Very fast reads, optionally backed by an in-memory cache (DAX).

DynamoDB allows a maximum of 5 global secondary indexes and 5 local secondary indexes per table. The primary key can be either a partition key alone or a partition key + sort key. You can query on the partition key, or on the partition key + sort key, but not on the sort key alone (well, you can, but only via a table scan).

Why we discarded this option: we need to be able to query the data on different combinations of attributes. DynamoDB lets you query on non-key attributes with the help of secondary indexes, but secondary indexes are essentially tables themselves and have the same limitation: no more than two key attributes per index (partition key + sort key).

This leads to ugly workarounds when you need to query on more than one or two attributes, such as creating extra columns that are a concatenation of the values of the columns you actually want to query on. This is not pretty, not flexible and not easy to support, because if the data is already there when you discover the need for such a query, you have to retrofit the existing items with scripts.
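Here is a rough sketch of what that workaround looks like with the AWS SDK for Java v2; the table name, attribute names and index name are all made up for illustration. To query by, say, category + country + date, you end up maintaining an artificial concatenated attribute and putting a global secondary index on it:

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class MetadataDynamoSketch {

    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();

        // Write: besides the real attributes, we maintain an artificial
        // "categoryCountry" attribute, because an index can have at most
        // two key attributes (partition + sort key).
        dynamo.putItem(PutItemRequest.builder()
                .tableName("file_metadata")                        // hypothetical table
                .item(Map.of(
                        "fileId", AttributeValue.builder().s("img-42").build(),
                        "category", AttributeValue.builder().s("cars").build(),
                        "country", AttributeValue.builder().s("DE").build(),
                        "categoryCountry", AttributeValue.builder().s("cars#DE").build(),
                        "createdAt", AttributeValue.builder().s("2018-09-01").build()))
                .build());

        // Read: query the GSI whose partition key is the concatenated attribute
        // and whose sort key is createdAt. Querying by the three "natural"
        // attributes directly is not possible.
        QueryResponse response = dynamo.query(QueryRequest.builder()
                .tableName("file_metadata")
                .indexName("categoryCountry-createdAt-index")      // hypothetical GSI
                .keyConditionExpression("categoryCountry = :cc AND createdAt > :from")
                .expressionAttributeValues(Map.of(
                        ":cc", AttributeValue.builder().s("cars#DE").build(),
                        ":from", AttributeValue.builder().s("2018-01-01").build()))
                .build());
        System.out.println(response.count());
    }
}
```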

Since our service was only in its initial stages and we were still not sure how the data would be used, we decided we didn’t want this complexity.

Next candidate: Cassandra

  • Self-managed – no ready-made AWS solution.
  • Optimized for big volumes and fast writes.
  • SQL-like queries (CQL).
  • Used by big players: Netflix, Hulu, Instagram, eBay…

Cassandra is also a NoSQL, key-value style storage, so it shares a lot of characteristics with DynamoDB. And the main idea is the same: you need to design the schema very carefully.

Articles about Cassandra often say that “a Cassandra table is a query”. Most of us come from the SQL world: a table is a table, it has rows and columns; a query is a statement you use to read some data from a table based on a few conditions. So how can a table be a query?

The answer is simple: it can’t. It is just an expression. What it means is that data in Cassandra is best arranged as one table per query, with the data repeated across many tables – a process known as denormalization.

Cassandra tables have primary keys, which can be composite. The first part of a PK is always the partition key (it can consist of more than one attribute). The rest are clustering columns; they determine the order of the data within a partition.

The data should be queried by the key attributes in the same order as they are specified in the key. So, if you have keys 1-4 in the PK, you can query by key 1, keys 1 + 2, keys 1 + 2 + 3, or keys 1 + 2 + 3 + 4, but not by keys 1 + 3, 2 + 4 or even 2 + 3 (because key 1 is the partition key, and querying without the partition key requires ALLOW FILTERING).
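A small sketch with the DataStax Java driver may make the rule more tangible; the keyspace, table and column names are invented for illustration:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CassandraKeySketch {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("metadata").build()) {

            // Partition key: (category); clustering columns: country, created_at, file_id.
            // The clustering columns define the sort order inside a partition.
            session.execute("CREATE TABLE IF NOT EXISTS files_by_category ("
                    + " category text, country text, created_at timestamp, file_id text,"
                    + " properties map<text, text>,"
                    + " PRIMARY KEY ((category), country, created_at, file_id))");

            // OK: key columns are restricted left to right, starting with the partition key.
            session.execute("SELECT * FROM files_by_category WHERE category = 'cars'");
            session.execute("SELECT * FROM files_by_category"
                    + " WHERE category = 'cars' AND country = 'DE'");

            // NOT OK: skips the partition key, or a clustering column in between.
            // Cassandra rejects these unless you add ALLOW FILTERING.
            // session.execute("SELECT * FROM files_by_category WHERE country = 'DE'");
            // session.execute("SELECT * FROM files_by_category"
            //         + " WHERE category = 'cars' AND created_at > '2018-01-01'");
        }
    }
}
```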

Cassandra allows secondary indexes, but they are best avoided. The reason is that the data is physically distributed by the partition key, so a secondary index is local to the node that holds the data. This means a query on a secondary index alone has to scan the index on every node and then merge the results together, which is why secondary indexes aren’t efficient.

Another characteristic of Cassandra is that it’s pretty opinionated about how it wants to be used. For example, where another database would just let you run a less-than-efficient query and leave you to deal with the performance problems, Cassandra will simply raise an error.

So, the main reasons we didn’t go with Cassandra were the complicated infrastructure (it is a completely self-managed solution) and the complicated schema design.

Last candidate: AWS Aurora

  • AWS-managed, automatically scalable, automated backups, point-in-time restore.
  • We already use it for other services.
  • It is MySQL and PostgreSQL compatible.
  • Simple SQL for querying.

One big advantage:

As with any relational database, we can start with a normalized schema and think about which queries we need later.
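For instance, with Spring Boot and Spring Data JPA the first cut can be as plain as a normalized entity plus a repository. This is a hypothetical sketch, with made-up entity and field names:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import java.time.Instant;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
public class FileMetadata {
    @Id
    private String fileId;
    private String category;
    private String country;
    private Instant createdAt;
    // getters and setters omitted for brevity
}

// Derived queries on any column combination can be added later;
// there is no need to know the access patterns up front.
interface FileMetadataRepository extends JpaRepository<FileMetadata, String> {
    List<FileMetadata> findByCategoryAndCountry(String category, String country);
    List<FileMetadata> findByCreatedAtAfter(Instant createdAt);
}
```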

The main reason it was not our first choice: since we already use it for another service, we know that at our data size some queries don’t perform that well. And the metadata storage would potentially hold even more data.

On the other hand:

  • 1 – we are only working on getting one type of metadata into the API right now, so the volume won’t be that huge.
  • 2 – initially, we will only have that data for one category of files.
  • 3 – and that data will not have to be extracted for all the images in the category, but only for some of them.

This means that we can apply the YAGNI principle right now: you ain’t gonna need it. Choose the simplest solution that satisfies current use cases.

And, as we said before, hide it behind an interface so that it can be changed later if needed.

The takeaway of this whole article is that it is possible to delay the final database selection, and thus make it less critical and maybe a bit less painful.


Spring State Machine: what is it and do you need it?

A state machine is a model of computation based on finite states, as Wikipedia very obligingly says. Usually there are workflows to go with the states, meaning you can’t just go from any state to any other state: there are rules, and the transitions between states are limited by them.

The Spring framework has a whole library for this called Spring State Machine. It is an implementation of the concept, intended to simplify the development of state machine logic for developers already using the Spring framework.

Let’s see how it works.
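Just to give a taste before diving in, here is a minimal sketch using the StateMachineBuilder API, with made-up states and events:

```java
import java.util.EnumSet;
import org.springframework.statemachine.StateMachine;
import org.springframework.statemachine.config.StateMachineBuilder;

public class OrderStateMachineSketch {

    enum States { NEW, PAID, SHIPPED }
    enum Events { PAY, SHIP }

    public static void main(String[] args) throws Exception {
        StateMachineBuilder.Builder<States, Events> builder = StateMachineBuilder.builder();

        builder.configureStates()
                .withStates()
                .initial(States.NEW)
                .states(EnumSet.allOf(States.class));

        // Only these transitions are allowed; any other event is simply not accepted.
        builder.configureTransitions()
                .withExternal().source(States.NEW).target(States.PAID).event(Events.PAY)
                .and()
                .withExternal().source(States.PAID).target(States.SHIPPED).event(Events.SHIP);

        StateMachine<States, Events> machine = builder.build();
        machine.start();
        machine.sendEvent(Events.PAY);   // NEW -> PAID
        machine.sendEvent(Events.SHIP);  // PAID -> SHIPPED
        System.out.println(machine.getState().getId()); // SHIPPED
    }
}
```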


Book review – Clean Architecture by Robert Martin

Clean Architecture is the third book in Robert C. Martin’s Clean Code series, the first two being Clean Code and The Clean Coder. I really like the whole series. To me, Robert Martin writes simply, clearly, with enough examples and without unnecessarily complicated details. His books can be read straight through as well as used for reference, but I would say his are books better read cover to cover, sequentially, rather than consulted in parts. Indeed, they are not at all huge, and they are logically constructed to tell a complete story.


Local type inference in Java 10, or If it quacks like a duck

Quite recently, Oracle adopted a new strategy of releasing a new language version every six months. The strategy assumes that only every third version will get long-term support (LTS). Quick notes about it (with a tiny example of the feature itself after the list):

  • The current version with LTS is Java 8;
  • Java 9 was only supported until March 2018, so its support is already OVER;
  • Java 10 is supported through September 2018;
  • The next version to have LTS is supposed to be Java 11, which is due in September 2018 and will be supported at least through September 2023.
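And the feature the post is actually about – local type inference – in one tiny, made-up example. The compiler infers the variable’s type from the right-hand side, so you no longer have to spell it out on the left:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class VarSketch {
    public static void main(String[] args) {
        // Before Java 10, the type has to be spelled out on both sides:
        Map<String, ArrayList<String>> before = new HashMap<>();

        // With Java 10, 'var' lets the compiler infer it; 'after' still has the
        // static type HashMap<String, ArrayList<String>> – this is not dynamic typing.
        var after = new HashMap<String, ArrayList<String>>();
        after.put("duck", new ArrayList<>());
        System.out.println(after);

        // 'var' only works for local variables with an initializer,
        // not for fields, method parameters or return types.
    }
}
```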



Lombok, AutoValue and Immutables, or How to write less and better code returns

In the previous post about Lombok, I described a library that helps deal with boilerplate code in Java (and yes, I know these problems are already solved in Kotlin, but this is real life and we can’t all just sit down and rewrite every existing project once a newer or simpler language appears). But like many things in life, Project Lombok has alternatives. Let’s give them a chance as well.

Code samples for this article can be found here and here.



Wait a minute mister Postman, or Postman environments

Postman is a tool that facilitates web application testing by sending requests and letting you inspect the responses and whatever comes with them (response codes, headers, cookies etc). Yes, I know that really (like, really) cool devs use curl, or better yet, httpie (which is genuinely cool – seriously, try it). But sometimes you just want a nice UI that you can actually show to a QA or a product manager, instead of pretending to be that egg-headed guy doing dark magic, with the console spitting text back at you in response to commands that make no sense. Or you are simply tired of trying to remember all those commands and parameters yourself (though of course, a real dev would never admit that, right?…).


This is how the cool kids supposedly do it.



How to write less and better code, or Project Lombok

I have long intended to write about Project Lombok – so long, in fact, that I am probably doing it when every self-respecting Java developer has already heard about it. Nevertheless, it is worth mentioning, if only to remind myself that one should not hesitate to try performance-enhancing tools and see if they fit – and Lombok certainly enhances the performance of a Java coder, by allowing you to write less code and improve its quality at the same time, which is no small matter.

What is it that Java opponents usually say about its weaknesses?

Java is too verbose.
(c) Every Java opponent

Unfortunately, there’s a lot of truth in this statement.
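To give just a taste of how Lombok attacks that verbosity, here is a minimal, made-up example with @Data, which generates the getters, setters, equals, hashCode and toString at compile time:

```java
import lombok.Data;

// Lombok generates getters, setters, equals(), hashCode() and toString()
// at compile time, so the source stays this short.
@Data
public class Customer {
    private String id;
    private String name;
    private String email;
}
```

The plain-Java equivalent of this little class is easily several dozen lines of hand-written or IDE-generated boilerplate.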
