A. has strong opinions on the things we’re doing wrong at OLX (on the SRE side). Like reinventing our own wheels, and then adding the manual hacks that those wheels require. Also, things that complicate life as much as they simplify it. Like having our own DSL, which of course IDEs can’t parse, and which therefore makes the code unreadable: it is really difficult to see what comes from where, and the IDE goes crazy and underlines everything in red.
Are custom tools square wheels, or are they an attempt to fix the more “standard” square wheels? (Image taken from Giphy: https://giphy.com/gifs/square-wheels-UP5CZUXC5dH1K)
However, these custom tools also exist for a reason. Standard tools like Helm make you type complicated command lines to do things, there’s a lot of repetition, and the command syntax usually isn’t like any programming language. Therefore, a lot depends on SREs, who are humans and therefore make mistakes, like typos, which are really hard to catch while debugging. The custom tools attempt to express infra in terms of code, with objects and types that let the compiler catch errors faster and more easily.
This is a story about a rather unusual experiment, which our company ran with me as a (willing) guinea pig, to try and retrain a software developer as an SRE. SREs (or DevOps; there’s a controversy about whether it’s the same job or not) are a hot item right now, maybe even more so than data scientists (I don’t have stats at hand to confirm it, so that’s a rather one-sided view). Anyway, our company was desperately searching for SREs, and then a bright idea came into someone’s head.
We have all those devs, and they are all technical people too, right? And they work with infrastructure too, only a bit on the other side, but at least they have some idea about it, right? And maybe retraining a senior developer would actually be easier and less costly than training a junior SRE?
However the idea came to be born, born it was, and moreover, it was implemented. (Spoiler alert: due to unforeseen circumstances, the experiment ran shorter than expected.)
But the initial idea was that the best training is hands-on, that to make sense it should run over a period of 2–3 months, and that the chosen software developer would be given as an apprentice (call it a trainee if you wish, but I actually like the word apprentice here) to an actual SRE working on one of the teams.
That chosen software developer was me, a backend programmer with about 15 years of experience, currently working with Java and Kotlin.
And this was how the story began.
Some disclaimers first.
To protect people’s privacy, I won’t use their real names. Let them be called A., B., C. etc. in order of appearance. I understand it might make the story somewhat artificial, but these are my real thoughts, and they are about real people. So… bear with me. Anyway, I am the author. I have the power.
Another thing I must note is that the company made no assumption that I would switch to being an SRE after the training. I was free to go back to development if I felt it suited me better. But if I said I had found my vocation, the company was ready to embrace me in a new position.
21. January 2019
So, first day as an apprentice. I come to work and my “master” SRE, A., is already there. Turns out he comes to work before 8am, which I can’t possibly match since my commute alone takes almost an hour, and I also have breakfast at home. Oh well. I have stopped trying to impress people with my working hours long ago because I think it is not a good metric of your performance, but still in cases like this, I feel inadequate. Whatever.
A. starts to help me onboard by sending links to the Wiki pages I need to read, and at the same time we go through the tools I need, the tools I already have installed, the repos I need to check out etc. What’s good is that I have had some exposure to the tools: I have used Terraform (very little, true, but at least I know what a TF file looks like); I already have my GPG key set up, and A. only needs to add it to some projects and repos for me to be able to use them; I have an idea of what git and git-crypt are used for, and though I am not fluent in the console, I can do basic stuff.
Of course A. is working in the console with the speed of light, and of course he has his console split into 4 parts, each performing its own task.
And of course the font is so tiny it is unreadable for me, but he quickly corrects it without me needing to say anything. I suppose he noticed that my console and app fonts are scaled almost to the maximum. One more reason for me to feel inadequate: I can’t consume so much information at once. It is easier for me to use tabs in the console instead of splitting the windows, and it is also much easier for me to read stuff that doesn’t make me strain my eyes. But the fact that A. works like this isn’t SRE-specific: a lot of developers I know do the same. When I recall that most of them wear either glasses or contact lenses and I still don’t use any, I feel a bit better, but still not a lot.
A. also seems to be using Visual Studio Code as his main IDE for editing stuff. He’s happy that I use it too. What I don’t mention is that I am actually a newbie in that as well. My IDE of choice is IntelliJ IDEA. It is perfect for Java and Kotlin, which are my main specialty. But IDEA is actually pretty slow when loading large projects, and it is also not very helpful with the Terraform syntax, so lately I have tried VSCode as a replacement for the configuration-only projects. But whereas I know my way around IDEA pretty well (I remember a lot of standard shortcuts and have also set up some of my own), I am basically still blundering my way around VSCode.
Do you ever mistype stuff when someone is looking over your shoulder? A. is great and very patient, but I feel like I misspell the simplest commands (I challenge you to mistype git, but that is something I actually managed!).
I attend the standup and sit through weekly planning, where I predictably don’t understand much. Not that the concepts are unfamiliar, but I realise I have very little idea what this team is actually doing: what their current goals are, their challenges etc. The team also seems to be following the sprint workflow, which we don’t. I realise that I will probably have a much worse meeting-to-coding ratio than I’ve had previously.
By the end of the day, A. basically has hardly left my side. We have resolved a production JIRA to increase some read/write limits on the AWS Dynamo DB, which required a couple of line changes in the code containing the Terraform configuration and a few console commands to stage and apply the change of config. It wasn’t exactly difficult, but I hope I will just be able to remember it all tomorrow. My head aches because all the new information I’ve consumed is threatening to squeeze out of my ears. Mercifully, it is time for my master SRE to go home. I stay to type up my notes about the day and add some links that he shared with me to Kotlink (which is a tool Illia Sorokoumov invented and I swear by because I can’t possibly remember all the browser links I need. And no, Chrome Bookmarks aren’t the same). And then I go out into the freezing Berlin night.
This is not going to be easy.
But this is going to be manageable. Especially if my master SRE continues to keep his cool.
I wonder what he’s thinking though. He might not share this opinion.
22. January 2019
Second day, more of the same. I find that I remember most of the stuff we did with Terraform but almost nothing about OpenShift, which we also did a little of. Makes sense, because I have used Terraform at least a bit, but OpenShift is completely new. I really need to do a dump of console commands and keep them as a reference.
I start to think that this will help a lot with my goal to get to be an architect at some point. Already, I see many more opportunities to think about the architecture and not the actual application logic.
Why? As a developer, I think I just always knew in theory that I needed to think about the architecture, but somehow I never actually tried to find holes in it. I could think of a basic design that satisfies the requirements but not about its cost or performance with live traffic — bottlenecks, interactions with other systems. And also, I was too little exposed to the actual behind-the-scenes setup and it was just easier to leave it to someone else because I either didn’t have access, or didn’t know how.
23. January 2019
SRE B. after a meeting: “If you have any trouble understanding any stuff in SRE meetings, I can always stay with you and explain whatever you need.”
Me: “How much time do you have?…”
25. January 2019
A. asked me whether I was excited to think of being on call at some point. No, not really… I think this ad-hoc thing is what puts me off most in the SRE job.
This is something we don’t think much about, but a lot of SRE work is putting out fires. The best SRE is the one that doesn’t allow many fires to happen, of course. But it’s not realistic to think that none will happen. Which means that sometimes you will still be fixing something in a hurry, with a disgruntled engineering manager breathing down your neck because something just broke in production. And even if it’s not production and/or there’s no fire-breathing manager, your flow will still be broken.
These notes are to be continued in the next article. Hope you found them interesting! If yes, let me know in the comments and I will continue. If not, then also let me know in the comments… and maybe I won’t!
Everything “as code” is all the rage now. What can we represent as code except for the programs? First of all, infrastructure as code is gaining popularity: one look at its Google Trends graph shows that it is steadily climbing year by year. Terraform, OpenShift, CloudFormation, Helm, Puppet and many other tools are the representatives of this trend.
However, this article deals with something else entirely: diagrams as code. Why do it? Well, code has a few advantages over, well, diagrams:
It is readable. Well, at least good code is. A lot of people absorb written information better than anything else, despite that saying about one picture being worth a thousand words.
It is compact. A text file is usually many times smaller than any picture, and is therefore much easier to store in the repository.
Version control. You can keep pictures under version control; however, they are binary files, and the changes are therefore obfuscated. If you change the picture in a repo, people will not know what the change was about until they check out the repo and have a look at the picture. The diff itself won’t be much help at all.
It is easy. It is much easier to type “Service A uses Service B” than draw those boxes on a diagram, label them, connect them with arrows etc. Especially for people who might be, let’s say, artistically challenged.
It turns out, however, that there’s a tool that gives you the best of both worlds. And this tool is PlantUML.
PlantUML basically lets you write text which is automatically transformed into diagrams. It has its own, pretty simple DSL and supports a lot of the UML diagram types.
Also, it supports some non-UML diagrams which are pretty cool, for example the Wireframe diagrams for UI design, which seems a really interesting concept.
How to use PlantUML? Actually, in a hundred ways. It can be installed locally as a separate tool or as a plugin to basically anything (Wikis, forums, text editors, IDEs and what not, check the link and chances are, you will find at least several alternatives that you’re already using). As my tool of choice is IntelliJ IDEA, this is the plugin I use.
Let’s try a sequence diagram, because it’s the one that usually gives me a lot of headaches. (All those swimlanes and blocks that need to be aligned, don’t get me started.) We’re designing an automated restaurant order system (no waiter, just a tablet to order with, know what I mean?) and need a bird’s-eye view of the basic flow. We have a client who orders from the menu, an inventory against which the order is checked, and a feedback system to be able to correct the order. And we’ll put some queues in to make the process asynchronous (just because we are cool).
How will it look? Approximately like this.
We can clearly see that we have one actor (Client), four participants (MenuService, InventoryService and two queues, one for requests and one for responses) and a database to keep track of all this. The IDE plugin instantly transforms the code into this picture:
What can I do with it? I can export it into a picture and show it to anyone. Also, I can use the online demo server: just copy and paste the whole code into the textbox there and click Submit. The demo server will return a URL to the generated diagram:
This URL can be used to get the picture into your project readme file, Confluence wiki or just any web page. The interesting thing about it is that the picture itself isn’t stored on the demo server, because all the information is already encoded into the URL. So, just the URL is stored.
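To make the “everything is in the URL” idea more tangible, here is a minimal Java sketch of the general technique: compress the diagram text and encode the bytes with a URL-safe alphabet. It is an illustration only and deliberately not PlantUML’s actual encoding (PlantUML uses its own alphabet, described on its website); the class name DiagramUrlSketch and the diagrams.example host are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: NOT PlantUML's real format, only the same idea.
class DiagramUrlSketch {

    // Compress the diagram text (zlib) and encode it with URL-safe Base64.
    static String encode(String diagramText) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(diagramText.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out.toByteArray());
    }

    // Reverse the process: the server needs nothing but the token itself.
    static String decode(String token) {
        Inflater inflater = new Inflater();
        inflater.setInput(Base64.getUrlDecoder().decode(token));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalArgumentException("Truncated token");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not a valid token", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String diagram = "@startuml\nClient -> MenuService : order\n@enduml";
        String token = encode(diagram);
        // Hypothetical host, used only to show where the token would go.
        System.out.println("https://diagrams.example/render/" + token);
        System.out.println(decode(token).equals(diagram));
    }
}
```

Since the token round-trips back to the original text, a server built this way can render the picture on every request and never has to store it.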
I think this tool is great to play with and explore. And these “diagrams” are great to store under source control, because all the changes are immediately readable by just scrolling through a diff. And it goes so much faster than drawing and repositioning all those blocks and swimlanes.
If you like the idea, by all means go and try the tool on your own! What I’ve shown here is just a very basic example, but I think one can do a lot with it. The website also has a FAQ to help people with some issues that may arise (I experienced none with the IDE plugin, but this tool has many integrations which I haven’t tried).
Not all of us are artists, but the great thing is, not all of us have to be.
Remote and distributed teams: fringe trend or the future of IT?
I have some experience with remote work, which I’ve shared in an article called Out of sight, out of mind, or How to be productive when working remotely. The topic still interests me, however, in more ways than just understanding how to make it work. The IT industry, while not adopting the remote work approach on a global scale, does have big and successful companies that swear by it and want nothing else.
This article grew out of a presentation I prepared for the OLX Product and Tech conference, which took place in Berlin in September 2018. I took the title slide for my featured image.
The reason I chose this topic for my presentation is that we had a bit of a struggle selecting a database for a new service. We had a lot of freedom in database selection. And as a rule, having choices is good because, well, you have options. But imagine that you want to select between three sorts of ice cream. You have chocolate, vanilla and strawberry – it is probably easy, right?
But what if I tell you that you have not three, but thirty sorts of ice cream to choose from?
Our current database choices are in the second category. How the hell do you make a selection if legion is their name?
This is where Uncle Bob comes to help.
Uncle Bob, or Robert Martin, is a writer, a software engineer and one of the Agile Manifesto authors. He wrote Clean Code, The Clean Coder, Clean Architecture and a few other books.
His thoughts on the matter: database is an implementation detail.
What it means is that users don’t care about how the data is stored or fetched. They don’t care how you query the data. They only care how the data is presented to them. Therefore, while you are doing the POC, or the MVP, or even in the later stages of the application development, the decision about the data storage can be delayed.
How does one do that? Easy. By hiding the implementation behind the abstraction.
Every developer has heard this rule: program to the interface.
The interface is the specification of WHAT you are going to do; the implementation is the HOW. So, when you define an interface, you declare WHAT it does and also the inputs and the outputs. The rest is up to the implementation.
When you program like this, it means that changing the data storage solution that’s hidden behind the interface is just a question of replacing one implementation with another. Most often it will be just a few classes. The rest of the code can stay untouched.
In this paradigm, the first implementation you pick should just be something you can implement as quickly and easily as possible. You can keep your data in flat files, in memory storage, anything.
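As a minimal sketch of what this can look like in Java: the interface states the WHAT, an in-memory map is the quickest possible HOW, and the calling code only ever sees the interface. All names here (MetadataStore, InMemoryMetadataStore, MetadataService) are made up for illustration and are not taken from the real service.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; a sketch, not the real service code.
class MetadataStoreSketch {

    // The WHAT: inputs and outputs, nothing about storage.
    interface MetadataStore {
        void save(String fileId, Map<String, String> properties);
        Optional<Map<String, String>> find(String fileId);
    }

    // The first HOW: the quickest thing that works, a map in memory.
    static class InMemoryMetadataStore implements MetadataStore {
        private final Map<String, Map<String, String>> data = new HashMap<>();

        @Override
        public void save(String fileId, Map<String, String> properties) {
            data.put(fileId, Map.copyOf(properties));
        }

        @Override
        public Optional<Map<String, String>> find(String fileId) {
            return Optional.ofNullable(data.get(fileId));
        }
    }

    // The rest of the code depends on the interface only.
    static class MetadataService {
        private final MetadataStore store;

        MetadataService(MetadataStore store) {
            this.store = store;
        }

        String describe(String fileId) {
            return store.find(fileId)
                    .map(p -> fileId + " has " + p.size() + " properties")
                    .orElse(fileId + " is unknown");
        }
    }

    public static void main(String[] args) {
        MetadataStore store = new InMemoryMetadataStore();
        store.save("img-1", Map.of("width", "800", "height", "600"));
        System.out.println(new MetadataService(store).describe("img-1"));
    }
}
```

Swapping the storage later means writing another class that implements MetadataStore and passing it to MetadataService; the service itself stays untouched.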
So, how does Dr. House, or House M.D., fit into the picture? To explain that, we need to know the way he works.
Dr. House is a diagnostician from a TV series. He is supposed to be brilliant because he can find out what is happening to patients with complicated diseases. And if you watch at least a few episodes of House M.D., you will see his main method.
Throw stuff at the wall and see if it sticks.
Basically, what he does is he has a hypothesis about the patient, and he tries to prove it by giving the patient medicine that should work if this hypothesis is correct. If the patient gets better, then all is good. If the patient gets worse, he tries the next hypothesis.
And he iterates.
So, with the help of Robert Martin, we saw that the database implementations can be replaced. This means that the process of selecting the database can also be iterative.
You try something, you see if it fits. If it does, you leave it as is. If it doesn’t, then you discard it and try something else.
To go through the process, you might need a set of criteria.
These are some example criteria you might have.
Capability – how much data this solution can work with;
Query language – SQL is something a lot of people are familiar with;
AWS compatibility – because we are mostly backed by the AWS stack;
Development effort – how difficult it is to integrate this database into a Spring Boot application which we were sure to go with;
Infra effort – how difficult it is to set up this data storage;
Limitations – what the solution can’t do (and we need).
Now we come to the practical example: the metadata service. The service we were implementing was supposed to be able to extract, store and provide the metadata about the files, kept in the system. Mostly those files are images.
The system currently holds 1 billion files and more are uploaded daily, so it grows quite quickly. Each file can have 10-20 properties, which adds one more order of magnitude to the metadata: 10-20 billion values.
First proposed solution: AWS Dynamo DB
Key-value data storage.
Fully AWS-managed, automatically scaled and backed up.
Very fast reads, with an optional in-memory cache (DAX).
Dynamo DB allows a maximum of 5 global secondary indexes and 5 local secondary indexes per table (the limits at the time of writing). The primary key can be a partition key alone or a partition key + sort key. You can query on the partition key or on partition key + sort key, but not on the sort key only (that can only be done with a table scan).
Why we discarded this option: we need to be able to query the data on different combinations of attributes. Dynamo DB allows the data to be queried on non-key attributes with the help of secondary indexes, but the secondary indexes are basically also tables, and they have the same limitations, that is, no more than 2 key attributes per index (partition + sort key).
This leads to ugly workarounds when you need to query for more than 1 or 2 attributes: like creating extra columns that are a result of concatenation of the values of the columns you need to query on. This is not pretty, not flexible and not easy to support, because if the data was already there and you discovered the need for such a query, then you need to retrofit the data with the scripts.
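A minimal sketch of that concatenation workaround, with a plain Java map standing in for the index. The attribute names (category, format, width) and the “#” separator are my assumptions for the example; no real Dynamo DB API is used here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a HashMap plays the role of a secondary index.
class ConcatenatedKeySketch {

    // Builds the value of the synthetic "extra column", e.g. "cars#jpeg#800".
    static String compositeKey(String... values) {
        return String.join("#", values);
    }

    // Synthetic key -> ids of the files that carry it.
    private final Map<String, List<String>> index = new HashMap<>();

    void put(String fileId, String category, String format, String width) {
        index.computeIfAbsent(compositeKey(category, format, width), k -> new ArrayList<>())
                .add(fileId);
    }

    // Works only for exactly this combination, in exactly this order.
    List<String> find(String category, String format, String width) {
        return index.getOrDefault(compositeKey(category, format, width), List.of());
    }

    public static void main(String[] args) {
        ConcatenatedKeySketch sketch = new ConcatenatedKeySketch();
        sketch.put("img-1", "cars", "jpeg", "800");
        sketch.put("img-2", "cars", "png", "800");
        System.out.println(sketch.find("cars", "jpeg", "800"));
    }
}
```

The inflexibility is visible right away: a query on category and width alone cannot use this key, so it would need yet another concatenated column, plus a script to backfill it for the data that is already stored.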
Since our service was only in its initial stages and we were still not sure how the data will be used, we decided that we don’t want this complexity.
Next candidate: Cassandra
Self-managed – no ready-made AWS solution.
Optimized for big volumes and fast writes.
SQL-like queries (CQL).
Used by big players: Netflix, Hulu, Instagram, Ebay…
Cassandra is also a NoSQL storage: a partitioned wide-column store that in many ways behaves like a key-value one.
Thus, it has a lot of the same characteristics as DynamoDB. And the main idea is, you need to design the schema very carefully.
The articles about Cassandra often say “Cassandra table is a query”. Most of us come from a SQL world. Table is a table, it has rows and columns. Query is a statement you use to read some data from a table based on a few conditions. So, how can a table be a query?
The answer is simple: it can’t. It is just an expression. What this expression means is that data in Cassandra is best arranged as one query per table, and data is repeated amongst many tables, a process known as denormalization.
Cassandra tables have primary keys which can be composite. The first part of a PK is always the partition key (which can contain more than 1 attribute). The rest are clustering attributes; they determine the order of data within a partition.
The data should be queried by the key attributes in the same order as they are specified in the key. So, if you have keys 1-4 in the PK, then you can query by key 1, key 1 + key 2, key 1 + 2 + 3, key 1 + 2 + 3 + 4, but not key 1 + 3 or 2 + 4 or even 2 + 3 (because the first is a partitioning one and to query without a partitioning key you need to allow filtering).
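The rule from the paragraph above can be written down as a small check: a combination of keys is queryable without filtering only if it is a leading prefix of the primary key. This is a simplified Java sketch of the rule as I describe it here, not of Cassandra’s full query validation, and the names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "query by key prefix" rule described in the text.
class KeyPrefixSketch {

    // True if the queried keys are exactly the first N keys of the primary key.
    static boolean queryableWithoutFiltering(List<String> primaryKey, Set<String> queried) {
        if (queried.isEmpty() || queried.size() > primaryKey.size()) {
            return false;
        }
        Set<String> prefix = new HashSet<>(primaryKey.subList(0, queried.size()));
        return prefix.equals(queried);
    }

    public static void main(String[] args) {
        List<String> pk = List.of("key1", "key2", "key3", "key4");
        System.out.println(queryableWithoutFiltering(pk, Set.of("key1", "key2"))); // a prefix
        System.out.println(queryableWithoutFiltering(pk, Set.of("key1", "key3"))); // skips key2
        System.out.println(queryableWithoutFiltering(pk, Set.of("key2", "key3"))); // no partition key
    }
}
```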
Cassandra allows secondary indexes, but they are best not used. The reason is that data is physically arranged by the partition key, so all the secondary indexes are local to the partition. This means all of them have to be scanned to get the data, and the results then merged together. This is why secondary indexes aren’t efficient.
Another characteristic of Cassandra is that it’s pretty opinionated in how it wants to be used. So for example, where another database would just let you run a less-than-efficient query and deal with the performance problems, Cassandra will just raise an error.
So, the main reasons we didn’t go with Cassandra were the complicated infrastructure (a completely self-managed solution) and the complicated schema design.
As with all relational databases, we can provide a normalized schema and think about which queries we need later.
The main reason it was not our first choice is that we are already using it for another service, and we know that at our data size some queries don’t perform that well. And potentially, the metadata storage would have much more data.
On the other hand:
1 – we are only working to get one type of metadata into the API right now, so it won’t be that huge.
2 – initially, we will only have that data for one category of the files.
3 – and that data will not have to be extracted for all the images of the category, but just for some of them.
This means that we can apply the YAGNI principle right now: you ain’t gonna need it. Choose the simplest solution that satisfies current use cases.
And as we said before – hide it behind an interface to maybe change later.
The takeaway of this whole article is that it is possible to delay the final database selection and so make it less critical and maybe a bit less painful.
A state machine is a model of computation based on a finite number of states, as Wikipedia very obligingly says. Usually there are workflows to go with the states, meaning that you can’t just go from any state to any other state: the transitions between states are limited by rules.
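A minimal Java sketch of such a rule-limited state machine follows. The states (DRAFT, REVIEW, PUBLISHED, ARCHIVED) and the allowed transitions are invented for the example; the point is only that every transition is checked against a table of rules.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical workflow; the states and rules are made up for illustration.
class StateMachineSketch {

    enum State { DRAFT, REVIEW, PUBLISHED, ARCHIVED }

    // The rules: for every state, the states it may move to.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.DRAFT, EnumSet.of(State.REVIEW));
        ALLOWED.put(State.REVIEW, EnumSet.of(State.DRAFT, State.PUBLISHED));
        ALLOWED.put(State.PUBLISHED, EnumSet.of(State.ARCHIVED));
        ALLOWED.put(State.ARCHIVED, EnumSet.noneOf(State.class));
    }

    private State current = State.DRAFT;

    State current() {
        return current;
    }

    // Moves to the next state, or refuses if the rules don't allow it.
    void transitionTo(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not allowed");
        }
        current = next;
    }

    public static void main(String[] args) {
        StateMachineSketch machine = new StateMachineSketch();
        machine.transitionTo(State.REVIEW);
        machine.transitionTo(State.PUBLISHED);
        System.out.println(machine.current());
    }
}
```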
Clean Architecture is the third book in Robert C. Martin’s Clean Code collection, the first two being Clean Code and The Clean Coder. I really like the whole series. To me, Robert Martin writes simply, clearly, with enough examples and without unnecessarily complicated details. His books can be read through as well as used for reference, but I would say that they are better read cover to cover, sequentially, than consulted in parts. Indeed, they are not at all huge and are logically constructed in such a way as to tell a complete story.