What drives architecture decisions?

There is no single best architecture that fits all problems. Architecture decisions must be based on business and technical constraints. We often call these constraints "non-functional requirements".

What are some example drivers?

  • time – when does the project have to be delivered? Is there a deadline such as an upcoming change in law, a scheduled event, or a business opportunity that exists only for a while?
  • scope – how much is there to be done? Is it a small project or a long-term investment?
  • skills in the team – which programming languages, frameworks and tools is the team most productive with?
  • scalability – how many users will be using the software? How fast will the number of users grow? Will there be peaks in traffic?
  • performance – do operations have to happen in real time, or can they be done in the background with an acceptable delay?
  • security – is there a risk of a data breach, or does the system not deal with private data at all?
  • maintainability – will the system be developed long-term? Will it evolve and change often? Should regression tests be automated, or is it acceptable to perform manual tests once and freeze the codebase afterwards?
  • availability – what happens in case of system failure? Does a plane crash, so the system has to be up 100% of the time, or is it a background data integration where being down for a couple of hours is acceptable?
  • business process automation – is it required to automate all the steps and cases in the process, or is it acceptable to leave some manual work, for example to the customer care department?
  • platform – on which devices will the software be used? Smartphones, laboratory devices, cars, servers, watches, payment terminals, or several of them?
  • usability – is it B2C/C2C software intended to be used by end consumers, or will only trained staff be using it?
  • already possessed resources – do employees already work on MacBooks and are all DevOps tools built for AWS? Then .NET 4.7 and Azure are not the best choice.
  • budget – last but not least. Are we investing in a product that already brings revenue? How much will we save or earn thanks to this project? Is it a new business idea to validate at the lowest possible cost?

It is important to list the constraints at the beginning of the project and make them transparent to both the stakeholders and the development team. A common awareness of the constraints allows for more confident, accurate and quicker decisions.

It is also important to note that drivers may change when the business environment changes. A good architecture will adapt without too much effort being spent preparing for that at an early stage. Example: you may build the search functionality on a simple SQL "LIKE" query, but abstract the search functionality so that it is decoupled and can be replaced by a more scalable technology in the future without changing other parts of the system.

When the system has a good logical structure, it will be much easier to adapt it to a larger number of users in the future, even if the architecture at the beginning is a simple monolithic app. But if the logic is mixed up from the beginning and the code is a mess, no matter what fancy architecture and tools are used, the project will fail.

Summary

  • Know and communicate your drivers.
  • Estimate how likely it is for each driver to change.
  • Have a high-level plan for adapting to likely changes.

These rules lead to better decisions when building software.

Actor model programming with the Orleans framework

I’ve spent some time recently playing around with the Orleans framework. It’s an alternative to Akka.NET, offering a similar actor-based architecture.

What is the actor model?

Actors are called grains in Orleans. An actor, or grain, is a class representing some business entity. It can be an instance of a Player, Trader, Customer or Account, depending on the domain. In practice, grains are implemented as normal C# classes. The one constraint is that all methods have to be asynchronous to keep the communication between grains non-blocking.
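
For illustration, here is a minimal sketch of a grain; the IPlayerGrain and PlayerGrain names are made up for this example, not taken from any particular project:

```csharp
using System.Threading.Tasks;
using Orleans;

// Grain interface: callers obtain a proxy to it via the grain factory.
public interface IPlayerGrain : IGrainWithGuidKey
{
    Task AddScore(int points);
    Task<int> GetScore();
}

// Grain implementation: an ordinary C# class, but every method is async.
public class PlayerGrain : Grain, IPlayerGrain
{
    private int _score; // in-memory state, kept between calls

    public Task AddScore(int points)
    {
        _score += points;
        return Task.CompletedTask;
    }

    public Task<int> GetScore() => Task.FromResult(_score);
}
```

A caller obtains a proxy with grainFactory.GetGrain<IPlayerGrain>(playerId) and simply awaits the methods; Orleans takes care of locating or activating the grain.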

Local state

Grains have local state. It means they live in memory between requests to the system. This gives big performance benefits compared to recreating an entity instance from database data for each request. State can be persisted to a database to avoid losing data on system restarts. As a programmer, you can invoke saving to storage any time the state has changed.
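
A hedged sketch of how explicit persistence can look, building on the grain interface above; the state class and the "Default" storage provider name are assumptions, and the provider has to be configured in the silo:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

public class PlayerState
{
    public int Score { get; set; }
}

// "Default" must match a storage provider configured in the silo.
[StorageProvider(ProviderName = "Default")]
public class PersistentPlayerGrain : Grain<PlayerState>, IPlayerGrain
{
    public async Task AddScore(int points)
    {
        State.Score += points;
        await WriteStateAsync(); // persist explicitly whenever the state has changed
    }

    public Task<int> GetScore() => Task.FromResult(State.Score);
}
```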

Horizontal scalability

Orleans can run in a cluster using many servers. The framework places grains across all nodes of the cluster so that each grain is located on only a single node. There can be exceptions to that rule: when a node crashes, the framework may not be sure when exactly the grain on that node finished its processing, so for a moment two activations of the same grain may exist. In computing, this class of problem is generally called split-brain. But this is an edge case which falls under error-handling strategies; the overall assumption is that each grain is activated only once.

Grains exchange messages with each other. These messages use fast .NET binary serialization. Messages can go over the network if two grains are on separate nodes, so it is important not to make grains too chatty if you care about performance, and you probably do if you are interested in frameworks like Orleans 🙂

The possibility to run Orleans in a cluster gives near-linear scalability.

What problems is the actor model good for?

The actor model is suitable when you have a lot of objects communicating with each other. Example use cases:

  • Real-time trading systems
  • Multiplayer games
  • IoT applications connected to many devices

Grain activations should be distributed randomly and in a decentralized way. The actor model is not suitable for batch processing or for a centralized design where some entities have to process most of the requests (so-called hot spots).

Event sourcing

Actors are a good fit for the event sourcing pattern. Orleans supports that pattern through JournaledGrains. But here comes a disappointment: the available storage mechanisms for event log persistence are poor. The only built-in storage provider saves the event log for a given grain as a collection serialized into a single state object, so the whole event log needs to be read before recreating grain state. Another built-in provider saves only a state snapshot without saving the event log at all. The good thing is that there is a flexible extensibility point allowing you to write your own provider by implementing just two methods for reading and writing events. There is also a community contribution which integrates Orleans with Event Store, but that database is not my favorite. Probably I’m complaining too much and should instead contribute by implementing event log storage based on Cassandra or Cosmos DB; it does not look like a hard task, but the next topic is much harder – distributed transactions.
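
For context, this is roughly what a JournaledGrain looks like; the event and state types here are invented for the example:

```csharp
using System.Threading.Tasks;
using Orleans.EventSourcing;

public class ScoreChanged
{
    public int Delta { get; set; }
}

public class PlayerJournalState
{
    public int Score { get; private set; }

    // The grain state is rebuilt by applying events from the log.
    public void Apply(ScoreChanged e) => Score += e.Delta;
}

public class JournaledPlayerGrain : JournaledGrain<PlayerJournalState, ScoreChanged>
{
    public async Task AddScore(int points)
    {
        RaiseEvent(new ScoreChanged { Delta = points });
        await ConfirmEvents(); // flush the event to the configured log storage
    }

    public Task<int> GetScore() => Task.FromResult(State.Score);
}
```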

Distributed transactions

The creators of the Orleans framework did a great job formally describing the framework’s semantics. You can have a look at how they implemented distributed transactions: https://www.microsoft.com/en-us/research/publication/transactions-distributed-actors-cloud-2/

The algorithm is very interesting, but from a practical point of view, what I miss is support for transactional communication between JournaledGrains. Again, support for the event sourcing pattern does not seem to have been a top priority in Orleans so far.

If you would like to dive deeper into other theoretical aspects of actor-based architecture, you may be interested in other Microsoft Research materials:
https://www.microsoft.com/en-us/research/project/orleans-virtual-actors/

Message delivery

Orleans can give you one of the following guarantees:

  • a message will be delivered at most once
  • a message will be delivered at least once

There is no guarantee that a message will be delivered exactly once. We are in a distributed system, and this problem is not easy to solve without sacrificing performance. This is something to be aware of; it’s up to you how to introduce fault tolerance.
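
One common way to introduce fault tolerance under at-least-once delivery is to make grain methods idempotent, for example by deduplicating on a caller-supplied message id. A rough sketch, not a built-in Orleans feature; in a real system the set of processed ids would need to be persisted and bounded:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

public interface IAccountGrain : IGrainWithGuidKey
{
    Task Deposit(Guid messageId, decimal amount);
}

public class AccountGrain : Grain, IAccountGrain
{
    private decimal _balance;
    private readonly HashSet<Guid> _processed = new HashSet<Guid>();

    public Task Deposit(Guid messageId, decimal amount)
    {
        // A retried (duplicate) message is recognized and ignored.
        if (_processed.Add(messageId))
        {
            _balance += amount;
        }
        return Task.CompletedTask;
    }
}
```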

Orleans and microservices

You can think of Orleans as a microservices framework. The services are really micro: each grain is a service. You probably cannot go more micro with microservices than in an actor-based architecture. If you are building a microservices-based system, have a look at the Orleans docs and ask yourself an honest question: have you thought about all the problems that Orleans addresses and solves when building your microservices solution? We often take shortcuts through mud and bush because we do not even know that there is a better way. Please have a look at this presentation for some examples:

Summary

I’m very grateful to all the contributors who brought Orleans into existence, because it provides decent ground for building a well-defined actor-based architecture. Even if this model is not suitable for your needs, Orleans is very educational. A deep dive into its architecture and implementation can broaden your architectural horizons a lot.

On the other hand, in my opinion you have to be prepared to make quite a lot of custom extensions and contributions at the framework level to build a production-class system. There is an interesting initiative called the Microdot framework which adds to Orleans many must-have features for building a real system. But even with Microdot, this ecosystem looks more like academic research than a shiny framework ready to ship to production out of the box. For everyone looking for something more mature with bigger support, I would recommend looking at Azure Service Fabric.

But setting aside production and enterprise readiness, the programming model in Orleans is sweet. The APIs are well designed and the framework offers many extension points to play with. Worth trying before signing up for a cloud solution.

Azure Monitor (aka Application Insights)

So far I have been using New Relic for monitoring .NET applications in production, and it is a great product. It has everything you could expect from an APM tool: browser and backend code instrumentation, alerts, error stack traces, request performance statistics, CPU, memory and disk usage; it even shows SQL queries sent to the database and Redis query statistics.

Recently I’ve put some effort into evaluating Azure Monitor as an alternative. I had used it before for basic monitoring of Azure resources, but I had never explored its full capabilities. And those capabilities are enormous!

With New Relic I was using ELK (Elasticsearch, Logstash, Kibana) as a complementary tool to gather custom application-specific logs and metrics. With Azure Monitor I don’t see such a need anymore. When hosting applications in Azure, Azure Monitor already covers the functionality of both New Relic and ELK in one box.

What I like most about Azure Monitor:

  • Integrates seamlessly with Azure cloud to provide host-level metrics
  • Provides insights into containers running in Azure Kubernetes Service
  • Runs on the powerful Azure Data Explorer engine, which allows you to analyze data in various formats in a consistent way
  • Makes it easy to define custom metrics
  • Supports advanced alerts and automation based on log queries and metrics
  • Easily integrates with .NET Core applications (see the sketch after this list)
  • Rich visualization tools, including the ability to export data to Power BI
  • … and yes, it provides exception traces, code profiling and web request performance statistics
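
To illustrate the .NET Core integration mentioned in the list, wiring up Application Insights is roughly this simple. This is a sketch assuming the Microsoft.ApplicationInsights.AspNetCore package and a connection string in configuration; the CheckoutService class is made up for the example:

```csharp
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Registers request, dependency and exception telemetry collection.
        services.AddApplicationInsightsTelemetry();
        services.AddControllers();
    }
}

// Custom events and metrics can then be sent through the injected TelemetryClient.
public class CheckoutService
{
    private readonly TelemetryClient _telemetry;

    public CheckoutService(TelemetryClient telemetry) => _telemetry = telemetry;

    public void CompleteOrder(double orderValue)
    {
        _telemetry.TrackEvent("OrderCompleted");
        _telemetry.GetMetric("OrderValue").TrackValue(orderValue);
    }
}
```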

Apache Ignite as an alternative to Redis cache

Introduction to Redis

I am quite a big fan of Redis as a distributed in-memory cache. It also works well as session storage.

There is a network penalty for communicating with the Redis service, so, as with talking to a database, you cannot be too chatty. It’s much better to ask for multiple keys in a single request at the beginning of your logic to quickly get all the necessary data at hand. Reading values from Redis should still be much quicker than from a database: first of all, it’s a simple key-value store, so it’s like always reading by primary key; secondly, we benefit from having everything in RAM. It is also possible to run Redis in persistent mode, but that’s a different use case, where you may not use an SQL database at all.
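
For example, with StackExchange.Redis several keys can be fetched in one round trip; the key layout below is just an illustration:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class ProfileCache
{
    private readonly IDatabase _redis;

    public ProfileCache(IConnectionMultiplexer connection) => _redis = connection.GetDatabase();

    public async Task<(RedisValue profile, RedisValue settings, RedisValue permissions)> LoadAsync(string userId)
    {
        // One MGET instead of three separate GETs.
        RedisValue[] values = await _redis.StringGetAsync(new RedisKey[]
        {
            $"user:{userId}:profile",
            $"user:{userId}:settings",
            $"user:{userId}:permissions"
        });
        return (values[0], values[1], values[2]);
    }
}
```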

Cache-aside pattern

RAM is usually limited and cannot store all the data we have, especially since in Redis you will usually introduce quite a lot of redundancy to keep as much as possible under a single key. Limited memory space is easily solvable by applying the cache-aside pattern.
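
A minimal cache-aside sketch on top of StackExchange.Redis; the Product type, the load-from-database delegate and the 30-minute TTL are assumptions made for the example:

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using StackExchange.Redis;

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class CachedProductReader
{
    private readonly IDatabase _redis;
    private readonly Func<int, Task<Product>> _loadFromDatabase; // e.g. an EF query
    private static readonly TimeSpan Ttl = TimeSpan.FromMinutes(30);

    public CachedProductReader(IDatabase redis, Func<int, Task<Product>> loadFromDatabase)
    {
        _redis = redis;
        _loadFromDatabase = loadFromDatabase;
    }

    public async Task<Product> GetAsync(int id)
    {
        string key = $"product:{id}";

        RedisValue cached = await _redis.StringGetAsync(key);
        if (cached.HasValue)
            return JsonSerializer.Deserialize<Product>(cached.ToString()); // cache hit

        Product product = await _loadFromDatabase(id);                      // cache miss: go to the database
        await _redis.StringSetAsync(key, JsonSerializer.Serialize(product), Ttl);
        return product;
    }
}
```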

Updating data in Redis

A more difficult problem to solve is refreshing data in Redis when something changes. One solution is to let entries expire after a specific time, but your users would not be happy. We all live in a real-time, instantly updated world. Delay by design? It does not look good. So what is left is to remove old data from Redis as soon as it has changed. First of all, you need to identify all the places in your system where a given piece of information is modified. In a big legacy system that may be a challenge. If you are luckier, your system may have a proper event sourcing implementation allowing for easy change detection just by listening to events. OK, so we know that a given entity has changed: which keys should be removed from Redis now? It is handy if your code is able to generate all the Redis keys under which data from a given entity is stored and delete them in a single Redis call. For batch updates you may consider using the SCAN operation for key pattern matching.
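
A sketch of that invalidation step: keep one place in the code that knows every key derived from an entity and remove them all in a single call (the key layout is again illustrative):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public static class ProductCacheInvalidation
{
    // Central place that knows every key derived from a product.
    public static RedisKey[] KeysFor(int productId) => new RedisKey[]
    {
        $"product:{productId}",
        $"product:{productId}:prices",
        $"product:{productId}:stock"
    };

    // Call this from the place (or event handler) that detects the change.
    public static Task InvalidateAsync(IDatabase redis, int productId) =>
        redis.KeyDeleteAsync(KeysFor(productId));
}
```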

Updating data in Apache Ignite

Apache Ignite is easier to introduce as a cache layer in a system with an SQL database because it supports SQL and the read-through/write-through pattern. There is an out-of-the-box integration with Entity Framework: https://apacheignite-net.readme.io/docs/entity-framework-second-level-cache Unfortunately, no version for EF Core is available.

Conclusion

If you use EF >= 6.1 and < 7 and would like to introduce a distributed cache, or you are already fighting not-updated-cache bugs every week, consider using Apache Ignite.

How to make a password reset link more secure?

Sensitive data should not be stored in URLs. A lot has been written about that. URLs are logged on many different servers through which the HTTP request travels (web servers, SMTP servers, proxies, browser history, etc.), and sensitive data ends up stored there.

But there are situations where avoiding an access token in the URL is difficult, for example in a password reset link sent by email.

In that case we can add more security by implementing the following pattern (a sketch in ASP.NET Core follows the list):
1. The action which handles the password reset reads the token from a GET parameter.
2. The token is validated and stored in the user session or a cookie.
3. The user is automatically redirected to a password reset action which no longer has the access token in a GET parameter. It could even be the same action. After the redirect, the access token is no longer present in the URL.
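
A rough ASP.NET Core sketch of this flow; the IResetTokenService, the route templates and the cookie name are made up for the example:

```csharp
using System;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

public interface IResetTokenService
{
    bool IsValid(string token);
}

public class PasswordResetController : Controller
{
    private readonly IResetTokenService _tokens; // hypothetical token validation service

    public PasswordResetController(IResetTokenService tokens) => _tokens = tokens;

    // Step 1: the link from the e-mail lands here with ?token=... in the URL.
    [HttpGet("/password-reset")]
    public IActionResult Start([FromQuery] string token)
    {
        if (!_tokens.IsValid(token))
            return BadRequest();

        // Step 2: move the token out of the URL into a short-lived, HTTP-only cookie.
        Response.Cookies.Append("reset-token", token, new CookieOptions
        {
            HttpOnly = true,
            Secure = true,
            MaxAge = TimeSpan.FromMinutes(10)
        });

        // Step 3: redirect to an action whose URL contains no token.
        return RedirectToAction(nameof(Form));
    }

    [HttpGet("/password-reset/form")]
    public IActionResult Form() => View();
}
```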

Note that if we have any external link on our password reset page (e.g. to social media), all GET parameters would also be accessible to those third-party servers via the HTTP Referer request header after the user follows such a link.

Also remember to add an expiration time to password reset links and make them single-use.

Good and bad technical debt

Albert Einstein said that everything should be made as simple as possible, but not simpler. Simplicity is the holy grail of IT, and also of business. It takes smart thinking and often experience to make complex things simple. But what about the opposite situation, when things are simpler than they should be? In that case we are creating technical debt.

In finance, debt is not always bad. When debt contributes to generating higher income and the cost of interest is lower than that income, the debt is healthy. For example, when a company takes a loan to modernize its equipment, become more competitive and generate bigger revenue, we consider that loan a good investment. Of course, there is always risk involved and we usually cannot be sure about the future revenue, but that’s another story.

What about IT projects? Can technical debt also be good?

Examples of good technical debt

Debt is a shortcut to generate revenue quicker, and IT projects are no different. If a quick and dirty implementation is enough to acquire a customer who will bring income, the debt is acceptable. It is better to have a quick and dirty implementation which generates revenue than a missed deadline and a lost deal. However, the income should be high enough to pay back the debt later. Introducing big technical debt for little income is probably not a good move.

An example of good debt may be building a prototype. A prototype is something we should be able to afford to throw away when it turns out that the idea is not worth continuing, or that there is a better approach to continue it. Prototypes are great for demos and idea testing. However, when the prototype is approved, we should keep in mind that it will usually require significant refactoring or even rebuilding from scratch before it becomes the final product. In other words, the debt will need to be paid back. Good software development practices may help to reuse a significant part of the prototype in the final product.

Another example of good debt may be hardcoding some logic. Hardcoding is always a concern for software developers because it means the created solution is not flexible and may need more work in the future to introduce changes. But over-engineering is as bad as over-investing. When we don’t need flexibility, let’s not introduce it just in case. Postpone any such work as long as possible. It may require some rework in the future, but we may also have a bigger budget in the future. The paradox is that what counts as debt for the technical team may be savings for the business team. To avoid technical debt in the codebase, the business may need to take on real debt to cover development costs. Usually it is better to have debt in the codebase rather than in the bank account. Good software architecture makes it much easier to pay back technical debt than to pay real money back to a lender.

Debt management

Technical debt, like any other debt, has to be manageable. Stakeholders and the team need to be aware of where the debt exists, how big it is and what the interest rate is. Interest on technical debt is paid through lower productivity. Teams spend time investigating how the system really works, fixing bugs, and doing manual configuration and manual testing. The more time is needed for those activities, the higher the interest rate. It is really bad when the cost of this time is not covered by the revenue that the product generates. But even when that cost is covered, companies should pay back technical debt systematically to keep the level of productivity that allows them to stay competitive.

How to manage technical debt?

1. Measuring how much time is spent on system maintenance.

Management teams are often not aware of how much time is spent in this area and therefore don’t have the numbers to understand how much it costs to pay the interest on technical debt.

2. Using tools to measure code quality and automated tests coverage.

Such tools are usually easy to integrate with the development pipeline and help to identify technical debt. They provide deep insights into the codebase, but they do not say whether the identified debt is good or bad.

3. Mapping reported issues to specific areas of the codebase.

It helps to identify which parts of the system generate the highest interest on technical debt. Some parts of the system may have poor code quality but cause no maintenance issues; it is like having a 0% interest loan. Other parts may have great code quality and 100% test coverage but generate a lot of issues because of wrong logic, missing requirements or lots of manual actions involved. This shows that interest can be paid not only because of a poor codebase but also because of poor requirements specification and business analysis. Usually these go together: when requirements are messy, the source code becomes messy too.

Summary

What are your opinions? Do you agree that technical debt can be good? Can you give more examples of good technical debt, and of ways to manage it?