Beyond basic APIs

David Doherty
5 min read · Dec 15, 2020

In previous posts we talked more about getting data than analyzing data. I firmly believe the bedrock of any successful technology product is being able to get at the data you want in an easy, scalable and secure manner. That is what lets you build automated solutions quickly.

However, there are times when you need to explore the data more fluidly:

  • A business user trying to understand the scope of your system
  • A product owner or business analyst fleshing out details of a business problem
  • A developer debugging an issue
  • and so on

To meet all of these use cases, you have to think beyond key/value lookups, and beyond being able to rely on knowing in advance which fields to index.

Joins / ad hoc data analysis

There are two prevalent approaches to ad hoc data analysis: GraphQL and the analytical data store. (I’ll conflate data warehouse, data mart, data lake, etc. into that second definition.)

GraphQL is a way to combine datasets in a consumer-centric fashion. The approach is synchronous: when a client request is made, the data is resolved at runtime. As a concrete example, we’ll use a variation of our earlier use case:

  • Show me the P&L on a portfolio
  • Show me the transactions that comprise the portfolio, and their entry prices
  • Show me the current market data for those instruments

Stitch them together so I can view the whole dataset easily. With GraphQL you can build a microservice that exposes the data but allows the client to select which fields they’re interested in. This microservice’s special sauce is that it knows how to combine data from the different underlying services, and can optimize for the client experience.
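
To make that concrete, here is a minimal sketch of that stitching microservice using @apollo/server in TypeScript. The three underlying services are in-memory stand-ins, and every name and data shape here is illustrative rather than a real implementation:

```typescript
import { ApolloServer } from "@apollo/server";
import { startStandaloneServer } from "@apollo/server/standalone";

// Stand-ins for the underlying core data services; in reality these would be
// thin clients for the P&L, transaction and market data services.
const portfolioService = {
  getPortfolio: async (id: string) => ({ id, pnl: 125_000.5 }),
};
const transactionService = {
  forPortfolio: async (portfolioId: string) => [
    { id: "t1", portfolioId, instrumentId: "US0378331005", quantity: 100, entryPrice: 182.3 },
  ],
};
const marketDataService = {
  quote: async (instrumentId: string) => ({ instrumentId, price: 195.1, currency: "USD" }),
};

const typeDefs = `#graphql
  type MarketData { instrumentId: ID! price: Float! currency: String! }
  type Transaction { id: ID! instrumentId: ID! quantity: Float! entryPrice: Float! marketData: MarketData! }
  type Portfolio { id: ID! pnl: Float! transactions: [Transaction!]! }
  type Query { portfolio(id: ID!): Portfolio }
`;

// The "special sauce": each resolver knows which service owns which slice of the
// data, and the client's field selection decides which resolvers actually run.
const resolvers = {
  Query: {
    portfolio: (_: unknown, args: { id: string }) => portfolioService.getPortfolio(args.id),
  },
  Portfolio: {
    transactions: (p: { id: string }) => transactionService.forPortfolio(p.id),
  },
  Transaction: {
    marketData: (t: { instrumentId: string }) => marketDataService.quote(t.instrumentId),
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
const { url } = await startStandaloneServer(server, { listen: { port: 4000 } });
console.log(`GraphQL entry point at ${url}`);

// A client then selects only the fields it cares about, e.g.:
// { portfolio(id: "p1") { pnl transactions { entryPrice marketData { price } } } }
```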

There are a couple of trade-offs you are consciously accepting with this approach:

  • Composing valid GraphQL requests is not as prevalent a skill as SQL, so it tends to be used more for system-to-system workflows
  • Adding new fields to your dataset requires either careful design (if you want it to be dynamic) or a code change.

GraphQL use cases tend to be transactional/synchronous in nature.

This approach is well suited when you expect external users of the API to appear down the line. With a token-based permission approach, the token can be passed to the entry-point API, which may be expressed as GraphQL. High-level authentication and authorization can occur here. Assuming we’re authorized, the credentials can be passed down to the underlying data services, where each can apply its own security model. For example, perhaps the caller can see market data, transaction data for specified accounts only, and anonymized data about the account. This is a great approach where security is first and foremost a part of your architecture.

Note that while I said GraphQL use cases tend to be transactional/synchronous in nature, there is now support for streaming connections in GraphQL. It uses websockets under the covers, so if you use this for streaming you’ll need to think through your dropped-connection strategy.
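
Going back to the token pass-down point, here is a minimal sketch (again @apollo/server; the internal transaction-service URL is made up) where the entry point authenticates the caller’s token and each resolver forwards it, so the underlying service can enforce its own entitlements:

```typescript
import { ApolloServer } from "@apollo/server";
import { startStandaloneServer } from "@apollo/server/standalone";

interface Ctx { token: string }

const typeDefs = `#graphql
  type Query { transactions(account: ID!): [String!]! }
`;

const resolvers = {
  Query: {
    // The caller's token is forwarded untouched; the transaction service decides
    // which accounts (and which rows) this token is actually entitled to see.
    transactions: (_: unknown, args: { account: string }, ctx: Ctx) =>
      fetch(`https://transactions.internal/accounts/${args.account}`, {
        headers: { authorization: `Bearer ${ctx.token}` },
      }).then((r) => r.json()),
  },
};

const server = new ApolloServer<Ctx>({ typeDefs, resolvers });

const { url } = await startStandaloneServer(server, {
  // Coarse-grained authentication at the entry point; unauthenticated calls stop here.
  context: async ({ req }) => {
    const token = (req.headers.authorization ?? "").replace(/^Bearer\s+/i, "");
    if (!token) throw new Error("Unauthenticated");
    return { token };
  },
  listen: { port: 4001 },
});
console.log(`Secured GraphQL entry point at ${url}`);
```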

Anyway, this GraphQL approach is very flexible, and useful if you’re building a platform for external users and software providers. GraphQL forces you to detach your physical data model from your query model, which is good: it means you can evolve your physical data stores independently of feature releases. It also means you can layer custom security models at the public API level (e.g. this customer can see the data, this customer can see partial data, this customer sees random/sample data).

There is a big assumption baked into this approach: the underlying ‘core data’ services have to be scalable and resilient, meaning they never go down and they increase capacity on demand.

A key consideration is ‘enrichment’, ‘transformation’, and the like. In an earlier article we discussed having SMEs design each API together. If they have done this correctly then there should be a natural flow between APIs. If one system is using BBG ID for identifiers and another is using ISIN, then you’re going to be forced to write join logic in your GraphQL layer. “Join logic is business logic”! Repeat after me: “join logic is business logic”. What happens if a BBG ID doesn’t have an ISIN? Do you drop the record, or pass a null, and how will clients interpret that?

What happens if one API is using local date and another is using date based on UTC?

What happens if one system has currency pair as USD/CAD, and another as 2 fields?

All of these challenges would have to be ‘solved’ in your GraphQL layer, polluting that layer and slowing down development of your key services. If your GraphQL layer is accumulating lots of enrichment logic, go back and fix it at the source, otherwise you will end up in a world of pain!
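
A tiny illustration, with made-up types and a hard-coded mapping, of the kind of enrichment decision that otherwise ends up hiding in the GraphQL layer:

```typescript
// Hypothetical enrichment step the GraphQL layer is forced to own when one
// upstream API is keyed by BBG ID and another by ISIN.
interface Transaction {
  bbgId: string;
  quantity: number;
}
interface EnrichedTransaction extends Transaction {
  isin: string | null;
}

// Assumed static mapping for illustration; in reality this would be yet another
// service call (and yet more logic living in the "thin" GraphQL layer).
const bbgToIsin: Record<string, string> = {
  "AAPL US Equity": "US0378331005",
};

function enrich(txns: Transaction[]): EnrichedTransaction[] {
  return txns.map((t) => ({
    ...t,
    // A business decision hiding in plumbing: an unmapped BBG ID becomes a null
    // ISIN rather than a dropped record, and every client must now interpret it.
    isin: bbgToIsin[t.bbgId] ?? null,
  }));
}
```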

Switching up — data [warehouse/mart/lake/etc]

A very different approach to data joins is the analytical data store. There are a number of options now: Snowflake, Apache Spark, Oracle Exadata, etc. (By the way: the Snowflake valuation currently is… ‘wow’.) There are a variety of things to think about, and this could be a three-part series by itself:

Security

For the sake of simplicity I will conflate them into one family of technical solution, and within that family you have two philosophies to pick from: low security or high security.

Many large firms go down the low-security route. They dump data from their core data services (either streaming or via periodic extracts) into a single data warehouse, and users are permissioned by dataset or data type. Given these solutions are generally columnar databases, they don’t suit row-level permission strategies.

Alternatively, you can go down the high-security route, where a user is given an empty space and chooses what to perform analytics on, with the data extract performed on demand. With this approach you can extract data under the user’s specific permissions, so that row-level security is maintained.
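
A sketch of that flow, with purely illustrative names (the core-data URL, the warehouse loader and the per-user schema convention are all assumptions):

```typescript
import { createHash } from "node:crypto";

const shortHash = (s: string) => createHash("sha256").update(s).digest("hex").slice(0, 8);

// Stand-in for a COPY/INSERT into the analytical store (Snowflake, Spark, etc.).
async function loadIntoWarehouse(table: string, rows: unknown[]): Promise<void> {
  console.log(`loading ${rows.length} rows into ${table}`);
}

// The extract runs on demand, under the caller's own credentials, so the core
// data service's row-level security determines exactly what lands in the user's
// private analytics space.
async function extractToPrivateSpace(userToken: string, dataset: string): Promise<void> {
  const res = await fetch(`https://core-data.internal/${dataset}`, {
    headers: { authorization: `Bearer ${userToken}` },
  });
  const rows = (await res.json()) as unknown[];
  await loadIntoWarehouse(`user_${shortHash(userToken)}.${dataset}`, rows);
}
```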

Performance Profile

When picking your data warehouse technology you need to consider your performance requirements. Technologies like Spark and Snowflake are inherently column-oriented stores. If you want fast and reliable record-level lookups you will need your clients to go back to your core data service (so you’ll need to make that jump easy for them).

If you think you will need great row-level performance too, you might want to think about MemSQL, Oracle Exadata or one of that family of technologies.
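
One way to make the jump back to the core service easy is to hand clients a single access helper that routes point lookups to the core data service and analytical scans to the warehouse; the sketch below uses made-up internal endpoints:

```typescript
interface Txn {
  id: string;
  instrumentId: string;
  quantity: number;
}

// Illustrative client helper: row-level lookups go back to the core data service,
// while scans/aggregations are pushed down to the columnar analytical store.
interface DataAccess {
  getTransaction(id: string): Promise<Txn>;        // fast single-record path
  scanTransactions(where: string): Promise<Txn[]>; // analytical, columnar path
}

const dataAccess: DataAccess = {
  getTransaction: (id) =>
    fetch(`https://core-data.internal/transactions/${id}`).then((r) => r.json()),
  scanTransactions: (where) =>
    // Stand-in for a warehouse query (Snowflake, Spark SQL, etc.).
    fetch(`https://warehouse.internal/query?where=${encodeURIComponent(where)}`).then((r) => r.json()),
};
```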

Lastly

Data stewardship/governance/etc. I will keep this short, but when you start having large sets of data, e.g. 100s of APIs, 100s of schemas, etc., you need to think about how you will educate new joiners about your key datasets. For example:

  • A wiki?
  • Alation for schemas?
  • SwaggerHub for APIs?
  • Talend?

Though documentation should be largely self-explanatory (your API docs should be very readable), you will need to curate the data from a high level. Position data: where should I go? And are there different services for different use cases?

Thanks for reading!

David Doherty

I write about Fintech, its past & future, leveraging 20+ years of experience in leadership roles at large Fintechs