CI/CD at Trellis HFL
At Trellis Housing Finance Limited, we are building our systems using the latest technology principles. In this post, we discuss our CI/CD process.
DevOps and CI/CD are two related terms that are loosely defined with a variety of opinions on what they represent. I like the Wikipedia definitions:
- “Continuous integration (CI) is the practice of merging all developers’ working copies to a shared mainline several times a day.” https://en.wikipedia.org/wiki/Continuous_integration
- “Continuous deployment (CD) is a software engineering approach in which software functionalities are delivered frequently through automated deployments.” https://en.wikipedia.org/wiki/Continuous_deployment
- “Continuous delivery (CD) is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time and, when releasing the software, without doing so manually.” Also, “[Continuous Delivery] contrasts with continuous deployment, a similar approach in which software is also produced in short cycles but through automated deployments rather than manual ones.” https://en.wikipedia.org/wiki/Continuous_delivery
- “DevOps is a set of practices that combines software development (Dev) and IT operations (Ops).” Also, “Other than it being a cross-functional combination of the terms and concepts for ‘development’ and ‘operations,’ academics and practitioners have not developed a universal definition for the term ‘DevOps’. Most often, DevOps is characterized by key principles: shared ownership, workflow automation, and rapid feedback.” https://en.wikipedia.org/wiki/DevOps
Additionally, while not always discussed together, I believe project management (including requirements definition and user acceptance testing) is a critical enabler for CI/CD. If tasks are not sufficiently broken down into small, atomic pieces, documentation is unorganized, and/or testing is done infrequently, CI/CD is much more difficult in practice.
Project Management and Task Tracking at Trellis
At Trellis, we use a continuous, Kanban-style approach to project management in JIRA. As a startup, our requirements and priorities change frequently, and this approach lets us stay organized while maintaining our agility. We can easily adjust backlog priorities without disrupting artificial “deadlines” or sprints. In general, team members focus on working through their list of tasks in priority order while management focuses on the larger chunks being accomplished collectively. With JIRA, we can easily create custom reports and dashboards showing tasks organized by status and priority order for individuals, teams, and/or larger-scale initiatives. We often group 5-20 tasks into epics and occasionally group epics into larger releases (usually less than one month of work at a time), but so far we have not seen a need for more traditional agile sprints.
For documenting requirements, we have found JIRA to be the best place for detailed information (e.g. implementation details) and Confluence the best place for higher-level information (e.g. business requirements, architecture), with the status of related tasks embedded into the page. Keeping the details on each task helps keep features loosely coupled, so we can re-prioritize and complete them in any order or postpone tasks indefinitely. It also means that the developer (and every other stakeholder) has everything they need directly on the assigned task without having to hunt for information, plus there is a built-in interface for comments and discussion to keep things organized.
By leaving low-level implementation details out of Confluence, the higher-level requirements documentation is much closer to what we require post-implementation for longer-term documentation. We are currently refining our documentation process, but it is clear we can convert requirements into user documentation more efficiently. Given how rapidly our systems are evolving, this is an important area to get right, or else the documentation will rapidly grow stale.
We strive to keep task scope to less than a week of work and ideally only about a day of work. Completing several tasks each week helps keep up the feeling of progress for each person. It also ensures we are focused on iteration and continuous improvement while providing value as soon as possible. We get a “minimally viable” feature or capability released, then add to it based on evolving priorities and feedback on the first version. This is a key enabler for continuous integration, because there is less to integrate at a time. Each change is only a few days of work, and we have dozens of integrations per week, which keeps our process well tested and predictable.
Continuous Integration at Trellis
We use GitHub for version control of all code, templates, and infrastructure configurations. Central repositories with the ability to branch, merge, and view historical versions of each file are very helpful in maintaining a consistent “source of truth” for our solutions and in recovering from mistakes that inevitably occur (e.g. accidentally deleting a file or something within a file).
Similar to most open source projects, developers create and test changes locally using a fork (i.e. a personal copy) of the main repository, then submit changes for approval and merging using GitHub pull requests. Each pull request must reference a JIRA task ID, which simplifies code review, testing, and auditing, and lets automation close the JIRA task when the PR is merged. By using forks and pull requests, only senior members of the team need write access to the repo, and we can enforce mandatory code review and automated testing policies. Given the nature of our organization, security, auditability, and quality are top priorities.
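A check like the JIRA reference requirement can be automated in a few lines. The following is only a minimal sketch, not our actual tooling: it assumes task keys follow the common "PROJECT-123" pattern, and the "HFL" prefix and function names are hypothetical.

```python
import re

# Assumed key format: an uppercase project prefix, a hyphen, and a number.
# The "HFL" prefix used in examples is hypothetical.
JIRA_KEY_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def find_jira_keys(text: str) -> list[str]:
    """Return every JIRA-style task key found in the text, in order."""
    return JIRA_KEY_PATTERN.findall(text)


def pull_request_is_linked(title: str, body: str = "") -> bool:
    """A pull request passes if its title or body names at least one task."""
    return bool(find_jira_keys(title) or find_jira_keys(body))
```

A pipeline step could call pull_request_is_linked with the PR title and description and fail the build when it returns False. Note that a pattern this loose also matches strings that merely look like keys, so a stricter check would validate the key against JIRA itself.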
For every pull request, automated tests run to build and validate every aspect of the system. It is impossible to test everything perfectly, but by requiring tests for every change and adding tests every time we find an issue, we can continuously improve coverage over time. Certain design patterns like dependency injection with mocks are encouraged from the design phase to produce more testable (and higher quality) systems. High quality can only be achieved through constant attention starting in the design phase.
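To illustrate dependency injection with mocks, here is a small sketch using Python's standard unittest.mock module. The service, its credit bureau dependency, and the score threshold are all hypothetical and chosen only to show the pattern.

```python
from unittest.mock import Mock


class LoanEligibilityService:
    """Hypothetical service. The credit bureau client is injected through
    the constructor instead of being created inside the class."""

    def __init__(self, credit_bureau, minimum_score: int = 650):
        self._credit_bureau = credit_bureau
        self._minimum_score = minimum_score

    def is_eligible(self, applicant_id: str) -> bool:
        score = self._credit_bureau.fetch_score(applicant_id)
        return score >= self._minimum_score


def test_applicant_below_threshold_is_rejected():
    # The mock stands in for the real bureau client, so no network is needed.
    bureau = Mock()
    bureau.fetch_score.return_value = 600
    service = LoanEligibilityService(bureau)
    assert service.is_eligible("applicant-1") is False
    bureau.fetch_score.assert_called_once_with("applicant-1")


def test_applicant_at_threshold_is_accepted():
    bureau = Mock()
    bureau.fetch_score.return_value = 650
    assert LoanEligibilityService(bureau).is_eligible("applicant-2") is True
```

Because the dependency is passed in, the same class runs unchanged against the real client in production and against a mock in tests, which is what makes the design easier to test.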
If these automated tests succeed, the change must then be reviewed by a senior member of the team to ensure best practices are followed. Code review is a crucial step, because bugs do not always cause a program to crash or even an automated test to fail. Sometimes, it is just a misunderstanding of the business requirements (where both the code and the tests are updated incorrectly), or an accidental change in functionality without proper test coverage to catch it, or even just a poor implementation that will be difficult to maintain long term.
Instead of one large version control repository, we have many to help reduce risk by separating higher risk code from lower risk code. It increases the need for integration testing, but that can be automated just like unit tests. As we grow, we will likely want to split some of our repositories to further isolate risk, distribute code review responsibility, and speed up release cycles for improved organizational scalability.
Continuous Deployment at Trellis
After a change is merged into the main branch, our continuous deployment pipeline automatically builds and deploys the latest version to a production-like test environment. In the test environment, we can perform final validation and user acceptance testing (UAT) with production-like integrations and security configurations.
In theory, we could deploy to production automatically, but it would require much higher levels of automated testing and validation prior to merging a change. As a relatively new company with very high quality requirements, we have found it more effective to have user validation occur between test and production deployments.
We have dozens of changes merged into version control and deployed to our test environment every day. Periodically, we perform final validation and roll these changes over into production by creating a release branch from our main branch and submitting a pull request from it to our production branch (the release branch ensures new changes can keep going into main without interfering with the release process). Inexplicably, GitHub's pull request merge options do not include a true fast-forward merge (the default behavior of git merge whenever one is possible), so we have to use the git CLI for the actual merge, but stakeholders can still review, comment on, and approve the production deployment using the same interface that we use for changes going into the main branch.
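The release flow above can be sketched as two lists of git commands: one to cut the release branch, and one to fast-forward production after the pull request is approved. This is an illustrative sketch only. The branch names "main" and "production" come from the description above, while the remote name "origin", the release branch naming, and the helper functions are assumptions.

```python
import subprocess


def cut_release_commands(release_branch: str, main: str = "main",
                         remote: str = "origin") -> list[list[str]]:
    """Commands that create a release branch from the tip of main."""
    return [
        ["git", "fetch", remote],
        # --no-track keeps the new branch from tracking origin/main.
        ["git", "checkout", "--no-track", "-b", release_branch,
         f"{remote}/{main}"],
        ["git", "push", remote, release_branch],
    ]


def fast_forward_commands(release_branch: str, production: str = "production",
                          remote: str = "origin") -> list[list[str]]:
    """Commands that move production to the release tip without a merge
    commit. --ff-only makes git refuse if the histories have diverged."""
    return [
        ["git", "fetch", remote],
        ["git", "checkout", production],
        # Bring the local production branch up to date first.
        ["git", "merge", "--ff-only", f"{remote}/{production}"],
        ["git", "merge", "--ff-only", f"{remote}/{release_branch}"],
        ["git", "push", remote, production],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Running run_all(fast_forward_commands("release-1")) in a clone with push access would perform the merge. Keeping the command lists separate from their execution makes them easy to review or print as a dry run before anything touches the production branch.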
Every merge into the production branch results in automated deployment to our production environment using the same pipeline configuration as in our test environment. At first, we required an extra step from our cloud admins to deploy to production after the code was merged into the production branch, but we found that to be inefficient and somewhat redundant. Instead, cloud admin approval is required at the time the release PR is merged (so they are ready just in case anything breaks), and the deployment is fully automated from there.
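One way to picture "the same pipeline configuration" for both environments is a single list of steps that is parameterized only by the target environment. The sketch below is hypothetical: the step names and the mapping structure are assumptions for illustration, and only the branch names and environments come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Branch-to-environment mapping as described above.
BRANCH_ENVIRONMENTS = {"main": "test", "production": "production"}

# Hypothetical step names, shared by every environment.
PIPELINE_STEPS = ("build", "run_automated_tests", "deploy")


@dataclass(frozen=True)
class DeploymentPlan:
    environment: str
    steps: tuple[str, ...]


def plan_deployment(branch: str) -> Optional[DeploymentPlan]:
    """Return the deployment plan for a merge into the given branch,
    or None when merges into that branch do not trigger a deployment."""
    environment = BRANCH_ENVIRONMENTS.get(branch)
    if environment is None:
        return None
    return DeploymentPlan(environment=environment, steps=PIPELINE_STEPS)
```

Because both plans share PIPELINE_STEPS, every deployment to the test environment also exercises the exact steps that will later run against production.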
In a more traditional process, build and deployment are complex and manual, which makes them error prone, and admins must be very careful to ensure every step is followed without errors. This tends to be a self-defeating cycle: the process is complex and manual, so it is error prone, so it is done as infrequently as possible, so it never improves. For us, deployment automation runs dozens of times per day in our test environment, which ensures predictable, low-risk deployments to production. Blurring the lines between development and operations (aka DevOps) ensures our CI/CD is smooth with no sharp inter-departmental lines, and we have continuous feedback to further optimize.
As we scale, we will inevitably have many improvements to make. Software needs to be designed and architected correctly to benefit from more threads, cores, and servers efficiently and without errors (e.g. race conditions, deadlocks, poor fault tolerance). Similarly, organizations need to be designed and architected to fully benefit from additional people. By spending a bit more time on process design (and redesign), inefficiencies can be removed for increased scalability and stability.
If you find any of this interesting, we are hiring a variety of senior engineering roles in Karachi, Pakistan. Come join us in revolutionizing the housing industry and improving access to affordable housing across the country.