Data Engineering by a Pragmatist — Tales of Choosing a Database

10 min read · Jan 3, 2022

2022. 2nd day of that. 1st Sunday of that. Interestingly, 52nd week of 2021. 8th day since I “prologed” with “Data Engineering Story by a Pragmatist” (Medium, Dec 2021). To my astonishment, LinkedIn showed it to close to 8000 people, and about 50 chose to follow me on Medium. Forever grateful to everyone! 7 days of thinking what to write about next, interrupted by the anti-diet post-Christmas week, culminating in an almost-criminal eating event known as “Russians celebrating New Year”. As I slowly recover from the food coma (2 workouts on the 1st day of the year, hopefully a dozen more (total) by the end of the year :), I sit down to write the highly anticipated (by 50 people?) “how one chooses their database” blog…

Look at how much food… err.. sorry… data is all around us. 30 minutes into writing this, I hear “Dad, let’s go see the Rangers!” (NYC ice hockey team, beloved by their fans, including my family). My smirk about “but you guys wanted me to write this blog…” is shut down with a shrug and “you can do it on the train like I do my school homework”. That will have to be discussed later. But right now, I have more data to contend with — ticket prices, the subway schedule (does it even run given the Omicron news?), phone battery levels, remembering the password for the Ticketmaster app, and more, and more…. 15 minutes later, I’m looking out of the Q train window and mentally locking on to other “data”… Does the train speed get logged? How often? What about the temperature of its brakes? 3rd rail voltage? I’ve got to ask one day where they keep that… My dad worked for the subway transit authority, he might have some contacts. Wow, more data — velocity, temperature, voltage, contacts…

We are at the arena — “the world’s most famous one”, they say — Madison Square Garden (in fact, it is the oldest in the US, but is oldest = famous? Let’s ignore this useless data for now). Our tickets and vaccination cards are getting scanned — I can feel how the data is virtually flowing from our phones into the hands of the guys with the scanners… My son’s data is a little messed up — the city misspelled his last name — so we need to show the paper card. A data quality issue… The hockey game rages on, the home crowd is on its feet half the time — we have crushed our nemesis, the Tampa Bay Lightning, twice in a row now, and the Lightning are Stanley Cup champions, no less! So much more new data to contend with, the famous “celebrity” kind — sports statistics, the heart of the modern “Moneyball” (you saw the movie, right?) industry.

Where would I keep all of it? Where does my company keep something like that? Luckily, I’m well familiar with these questions. Part of my job is to look for such technology, test it, and then advise others on how to make better choices when it comes to picking a database.

These are not easy questions to answer. The challenge is a multidimensional one. There are dozens of commercial vendors to choose from, plus multiples of that in the open-source space, and the man-in-the-middle kind of vendors who’d offer you a commercial distro of open-source software. Then you’ve got your turnkey database appliance vendors who promise to roll in a box that just needs some power and network, and you’d be all set. The acronym galore can make your head spin — OLTP, OLAP, HTAP, HOLAP, MPP, SMP, with choices of operating systems, cloud vs on-premise (and let’s not forget the venerable mainframe), columnar vs row-based vs hybrid table organization, or no tables at all but some “collections” instead. I’d better stop before I go deeper down the rabbit hole (btw, I watched the new Matrix movie yesterday, talk about an abundance of “data” in that thing!)

Let’s lay down some principles for this conversation. I will resist all attempts, foreign and domestic, to mention any commercial vendors, where possible, maybe. You, of course, are free to comment on your favorite database vendor later, I may have an opinion too then :) I reserve the right to mention open-source software. I will focus on the ideas that drove and continue to drive our company’s choices, and since I’ve been here for many years there is a very high chance these choices and my advice correlate, a lot. A couple more things to mention — being a large bank, we are influenced by several critical factors: a) any open-source software we use must be backed by a commercial distributor, b) we often invest in, develop, and actually use software from startup database companies one may never have heard of until they go public, c) there is a very high barrier to entry for any new database software due to strict requirements for data protection, security, etc., d) we are predominantly a Linux shop, but interestingly enough not 100% when it comes to the databases.

What does it all mean for the readers of this blog? It means our choices are pragmatic, both conservative and innovative at the same time, commercially viable, vetted on a variety of use cases, from small scale proofs of concept to high-end “Big Data” super projects. And we do innovate — we have identified and onboarded 4–5 new database vendors in the last 15 or so years. That’s a crazy figure for a large investment bank. And I am not even including things like Kafka, Spark, various BI platforms.

Here are a few more principles I use myself in addition to the company ones.

When looking for a new database, first see what you already have in house. That includes cloud data services too, not just on-premise offerings. See “Possession is nine-tenths of the law” (Wikipedia). Reasoning: the cost of onboarding a new database is very high, and it is a hell of a lot easier to use something you already have, a bird in the hand… and so on. Your senior technology leadership will love you for that because nothing strikes fear into the DBA’s heart like the sentence “we are going to purchase a new database platform” :) A little personal anecdote — many years ago I was running a complex analytical platform on top of a traditional relational row-based database. The system could not be scaled anymore, and I wanted to bring in a modern, columnar MPP platform to run a proof of concept. I still keep an email from the head of Data Engineering stating “we already have several platforms to choose from and the Firm would not be entertaining adding another”. Well, he was wrong, I was right, and that “new” platform now houses over 100 Petabytes of data within the bank. And he is not with the Firm anymore :)
Anything new you bring has to be an “order of magnitude” better than what you already have. I know, this sounds subjective. So, let’s bring it down to the proper level of pragmatism — say you have an in-house database capable of 1000 transactions per second. Can you ride the “hardware wave” and scale it to 10000? Probably, and thus you don’t need that shiny new database, and the head DBA is right. But if you can’t, or the new software can do it on the same hardware or just scales better in general, that’s your ticket! So, the rule reads as “10 times better on the same hardware or 50x better on the new hardware”. Well, that’s a dream of course.
You will have multiple database platforms. And you will have the same data copied into multiple databases. Despite all efforts by the senior management to keep everything in just one database. That’s pragmatism. Databases are like cars. Some are like dump trucks — they can carry huge loads but aren’t very fast; some are incredibly fast but lack horizontal scale and can only hold 1 day’s worth of data; some are very rigid in structure but allow instant access through that structure; and some are ultimately flexible and can hold any data but require heavy indexing to make it work. Just like you wouldn’t take a dump truck to drive your kids to school, you also don’t take your family car to carry 40 sheets of plywood from Home Depot. (“But I do that!” you say. And I do too! But it does not make it right and eventually that SUV will become a piece of s…. if you keep doing that Home Depot run every week.) Don’t turn rules 1 and 2 into a religion. And thus, we have rule #4.
“3 strikes and you’re out!” rule. What it means in practice — don’t count how many things your favorite database can do for you, count how many features it lacks that you cannot live without. That sounded a little like JFK so let’s bring it down with an example — say, you want to store PDFs and images, but that beautiful database can only store columns of text and numbers. That’s a “strike”. Or maybe you want it to be able to concurrently update multiple rows per transaction, but it can only do 1. That’s a strike too. Ignore all the marketing mumbo-jumbo and focus just on those critical must-haves. I’ll list a few more examples — lack of a modern resource manager to support high-concurrency workloads; lack of massively parallel processing and horizontal scale-out; lack of SQL support; lack of modern authentication support; lack of modern encryption support; lack of flexible schema-on-read; lack of time series support; lack of window functions; lack of secondary indexes, etc.
Unless the feature you want already exists, or is at least in beta and going to be released within the next 6 months at the most, it does not exist. Future roadmaps do not count. Database development is a very time-consuming process, with features laid out years in advance and roadmaps difficult to change. It is also incredibly difficult to change the core design of an existing database; in the best-case scenario it will be a bolt-on feature lacking the maturity of the core engine. Example — one of the major database vendors added support for columnar table storage under pressure from their marketing department. It came with so many limitations that it only worked OK several major versions later. Seeing through the marketing claims of the sales department was a challenge, and only a deeper understanding of the product let us identify the immaturity of the claim.
If possible, look deeper into the leadership of the commercial vendor, or the open-source community behind the platform. Are they visionaries? Are they capable of marketing the product to grow its market share? Can they bring in enough new capital to fund future development? Working for a bank has its advantages — we have a very strong technology research department focusing on the startup space. But if you don’t have access to such a resource, do your own due diligence. For an open-source product, see if there is a strong company backing it, employing major committers. If there isn’t, more likely than not this product won’t make it. For a commercial vendor, I check if the database product is their main offering or part of some larger platform. Essentially, is the leadership of that company committed to their product, or was the database acquired in some merger, and they don’t know what to do with it now? I stay away from the “merger acquisitions”.
Market share. A highly subjective topic. From a pragmatic point of view, I look at the relative growth, not the absolute — is the platform gaining or losing ground in the rankings? This is like doing any other due diligence. Check out https://db-engines.com
Ecosystem! With a very bold exclamation point! One of my biggest “beefs” with DBAs and database vendors is the lack of attention to the drivers and the connectivity from compute platforms and BI tools. I make it a point that every database platform is reviewed for compatibility with its clients. Consider it a “strike” if the database does not offer a Python driver, has no Windows driver support, or cannot work with whatever other tool you want to connect to it. I feel really good when a vendor offers a full spread of drivers, and if they can’t build one on their own, they may resell one from other vendors like CData.
Commercially feasible, practical pricing and no arm twisting by the vendor. Let me explain what I mean. Despite being a big bank with lots of money, we want to get the biggest bang for the buck — we look for pricing models where we can grow, and these have to be modern and not atavistic. When I hear someone charging by the number of cores, my mood turns sour. Pragmatic outcome — vendors who ask for less money and give us a better deal, e.g. unlimited use for a few years, tend to have a much better future with us. Conversely, someone who sold us a few servers at a high price eventually fizzled away. This is why open-source platforms with commercial support (like Kafka, Postgres, MongoDB) are so successful.
I don’t believe in database appliances. I did not believe in them when we kept all the data on premises, and with the advent of cloud databases I think even less of that business model.
And this brings me to the critical point — enthusiasm of the external and internal communities (not to be confused with the market share). This is related to our ability to hire talent and influence senior leadership within the company. A database platform with a great following is more important (to me) than one with a larger market share or even with “better” features.
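
For the code-minded, here is a minimal sketch of how the “order of magnitude” rule and the “3 strikes” rule could be combined into a first-pass filter. Everything in it is hypothetical and made up for illustration: the `Candidate` class, the function names, the feature labels, and the throughput numbers. The only things taken from the text are the thresholds (10x on the same hardware, 50x on new hardware, 3 strikes). It is a way to think about the rules, not the process we actually run.

```python
from dataclasses import dataclass, field

# Thresholds from the rules of thumb above; treat them as tunable assumptions.
SAME_HARDWARE_FACTOR = 10
NEW_HARDWARE_FACTOR = 50
MAX_STRIKES = 3


@dataclass
class Candidate:
    """A hypothetical database under evaluation."""
    name: str
    tps: float                   # throughput measured in a proof of concept
    needs_new_hardware: bool     # True if the measurement required new hardware
    features: set = field(default_factory=set)


def is_order_of_magnitude_better(current_tps, candidate):
    """Apply the '10x on the same hardware or 50x on new hardware' rule."""
    factor = NEW_HARDWARE_FACTOR if candidate.needs_new_hardware else SAME_HARDWARE_FACTOR
    return candidate.tps >= factor * current_tps


def strikes(candidate, must_haves):
    """Return the must-have features the candidate lacks, one strike each."""
    return sorted(must_haves - candidate.features)


def evaluate(current_tps, candidate, must_haves):
    """Count strikes first, then check the performance bar."""
    missing = strikes(candidate, must_haves)
    if len(missing) >= MAX_STRIKES:
        return f"{candidate.name}: out ({len(missing)} strikes: {', '.join(missing)})"
    if not is_order_of_magnitude_better(current_tps, candidate):
        return f"{candidate.name}: not enough of an improvement, keep what you have"
    return f"{candidate.name}: worth a proof of concept ({len(missing)} strikes)"


# Illustrative inputs only; none of these numbers come from a real system.
must_haves = {"sql", "window functions", "python driver", "encryption"}
shiny = Candidate("shiny_db", tps=12_000, needs_new_hardware=False,
                  features={"sql", "window functions", "python driver", "encryption"})
rigid = Candidate("rigid_db", tps=80_000, needs_new_hardware=True,
                  features={"sql"})

print(evaluate(1_000, shiny, must_haves))
print(evaluate(1_000, rigid, must_haves))
```

Note the order of the checks: strikes are counted before performance is even looked at, because a fast database that lacks three things you cannot live without is still out. The must-have set is the part that deserves the real thought; the arithmetic is the easy bit.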

Well, I think that’s enough of “principles”. If I was reading this blog myself, I’d have thought — “you have no system”. That’d be a harsh, if pragmatic, assessment. It would not be true though. I did not go into how exactly we test and validate designs — starting with an initial sales call all the way to high-performance proofs of concept, security reviews, etc. That sounds like an excellent topic in itself. Perhaps I will do that next. Or I may actually jump into why doing joins on strings is not any worse than on integers. I am looking forward to any comments on Medium or LinkedIn. And on that note, I thank you all for spending a few moments of your life on reading this blog, and have a HAPPY NEW YEAR!!!

Edit: Link to the third blog in the series: “Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1)” (Medium, Jan 2022)

And the 2nd: “Data Engineering by a Pragmatist — Tales of Choosing a Database” (Medium, Jan 2022)

P.S. You can also find me on LinkedIn: https://www.linkedin.com/in/silverboris

Written by Boris Serebrinskiy