How the Bank of England redesigned its data hub around open source

The UK's central bank has redesigned its data architecture to meet new regulatory and scale requirements, and has already used that data to help avert a potential post-Brexit disaster

The Bank of England has drastically changed how it collects and analyses fast-growing data sets over the past few years, adopting enterprise versions of open source technologies to help it modernise its architecture.

Speaking during the Dataworks Summit in Barcelona last week, Adrian Waddy, technical lead for the bank's big data platform, and Nick Vaughan, domain SME for data analytics and modelling, explained how they designed this new data platform and some key lessons learned.

When Mark Carney took over as the Bank of England Governor in 2013 he commissioned an independent review into the organisation, which, when published, highlighted the need for better use of data and called for a new central data platform to underpin that.

Amongst its many roles, from regulation to research, the Bank of England is responsible for £485 billion of assets and, through its real-time gross settlement function, processes £650 billion worth of transactions a day. It also performs regular 'stress tests', running complex analytics to assess whether banks can withstand a variety of financial shocks, so reliable uptime and watertight data integrity are key requirements for any new, central data platform.

The bank's remit had already expanded in 2012 with the introduction of the European Market Infrastructure Regulation (EMIR), forcing the IT department to collect ever more data and automate reporting where possible. This led the bank to reassess its systems from the ground up and start thinking about a new architecture.

Before this, the bank was running 128 different data analytics systems, along with the "significant cost and complexity of operating that," Waddy said.

"The analysts in the bank relied very heavily on their internal network to know what data existed and where," he added. "Even when they managed to get access to it, if the data that they wanted to combine didn't fit on their laptop they had no real place to do so and increasingly there is demand for using some of the analytical techniques that are becoming more common in the industry."

As Vaughan added: "Given London's position as a financial centre, and our role as a central bank to regulate entities within the UK, that meant that we had to collect in the region of 50 million transactions every day, with a peak of 85 million a day. For us that was a step change, we needed a different data architecture."

Data Hub One

Roughly speaking, the new architecture looks like this: data comes in from the trade repositories, where it is unzipped into CSV files and stored in a 'raw zone', and a set of schemas and structures is applied. It is then loaded into tables, where the data is appended and structured for querying with Apache Hive.

"Then we load that into our consume zone, which you can think of as data from the warehouse being pushed into a mart, structured, aggregated and reduced in size to respond to specific use cases," Vaughan explained.
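The zoning pattern Vaughan describes — land the raw files untouched, apply a schema on load, append to queryable tables, then aggregate into a smaller, use-case-specific mart — can be sketched in outline. Everything below (file names, the three-column schema, the counterparty aggregation) is hypothetical and purely illustrative; the bank's actual pipeline runs on Apache Hive at a far larger scale, not in plain Python.

```python
import csv
import io
import zipfile
from collections import defaultdict
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical schema for a trade report; the real EMIR field list
# is defined by the regulation, not by this sketch.
SCHEMA = ["trade_id", "counterparty", "notional_gbp"]

def ingest(zip_bytes: bytes, raw_zone: Path) -> list:
    """Unzip a trade-repository delivery into the raw zone, unchanged."""
    raw_zone.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            target = raw_zone / name
            target.write_bytes(zf.read(name))
            extracted.append(target)
    return extracted

def load_structured(raw_files, table: list) -> None:
    """Apply the schema and append typed rows to a 'table'
    (standing in for an appendable Hive table)."""
    for path in raw_files:
        with path.open(newline="") as f:
            for row in csv.DictReader(f, fieldnames=SCHEMA):
                row["notional_gbp"] = float(row["notional_gbp"])
                table.append(row)

def build_consume_zone(table) -> dict:
    """Aggregate and reduce, like warehouse data pushed into a mart."""
    totals = defaultdict(float)
    for row in table:
        totals[row["counterparty"]] += row["notional_gbp"]
    return dict(totals)

# Demo with a fabricated two-row delivery.
payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as zf:
    zf.writestr("trades.csv", "T1,BankA,1000000\nT2,BankA,250000\n")

with TemporaryDirectory() as tmp:
    files = ingest(payload.getvalue(), Path(tmp) / "raw_zone")
    table = []
    load_structured(files, table)
    mart = build_consume_zone(table)
    print(mart)  # {'BankA': 1250000.0}
```

The point of the three zones is that each stage only ever adds structure: the raw zone keeps an untouched copy for audit, the structured tables support ad hoc queries, and the consume zone serves a specific use case cheaply.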

This first iteration of a new data architecture certainly didn't run perfectly though.

"We had problems because we didn't have the skills in house at that time to build our own large clusters and didn't have pots of money to invest in a massive data lake, or push huge kit into our data centre," Vaughan said. "We needed to make sure we get value for money but also with a pretty low appetite for risk. So we realised pretty early on that we needed to partner with a vendor."

The unnamed winning vendor would be tasked with building and configuring the storage infrastructure, and installing software on top. Then, a couple of weeks before the planned go-live date, the vendor pulled the plug on the product.

"On the one hand this gave the opportunity for lots of naysayers and people to come around us to maybe accuse us of having made a disastrous mistake," Vaughan said, "and the communications around it were difficult at the time.

"Once the dust settled, we realised that actually we had gone from nothing to serving up EMIR data to our analysts. We also built up our skills enormously in a short amount of time and went from nothing to being confident enough to now build our second data hub. It will be five times bigger and run in multiple data centres. So this will limp on for now, but we're getting a huge amount of value out of it."

One example of that immediate value was how it allowed researchers at the bank to investigate the impact Brexit could have on the derivatives market here in the UK.

"We have a responsibility to flag what we see as a risk to the financial system," Vaughan explained. "There was a risk that there would be derivatives contracts open at [the point of leaving the EU] in the region of £41 trillion that would be in a relative state of limbo.

"Thankfully, because we went public with it, and because we had this data set, we were able to flag this as a very significant risk, and those that were, and should be, concerned by this were able to deal with it and alleviate that, so we are in a better place now."

Data Hub Two

Now the bank is preparing to launch the second iteration of this data hub next year, working closely with enterprise Hadoop specialist Cloudera (now that it has merged with Hortonworks). Going into more detail on Data Hub Two, Waddy stressed the importance of automation, dynamic provisioning and virtualisation.

"One of the key differences in terms of the live environment is we are going from effectively a single cluster to three separate production clusters," he explained. "This largely reflects the cloud offering on Azure and means we will be able to tune these clusters to workload."

This helps balance the need for strong data governance against analytical users who want to move fast and iterate.

"The approach is to have the data lake, the ingest cluster, which is going to be the engine doing the work and then a query cluster tuned for low-latency, high concurrency, containing highly modern data, and an analytics cluster that will have the same data and might have some raw data if that is what the analysts want," Waddy said.

"There we will have a lower level of data governance, allowing for that kind of analysis and that means their flow and output could increase and it's only when they want to bring that back do we effectively put technology SLAs around a particular reporting pack, with a high level of data governance coming back in again."
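The split Waddy outlines — one cluster per workload class, each tuned differently — can be made concrete with a small routing sketch. The cluster names, endpoints and workload labels here are entirely invented for illustration; the source says only that there will be an ingest cluster, a low-latency/high-concurrency query cluster, and a looser-governance analytics cluster.

```python
# Hypothetical mapping of the three production clusters Waddy
# describes; names are illustrative, not the bank's real values.
CLUSTERS = {
    "ingest": "ingest-cluster",        # the engine doing the ETL work
    "query": "query-cluster",          # low latency, high concurrency
    "analytics": "analytics-cluster",  # looser governance, raw data allowed
}

def route(workload: str) -> str:
    """Pick the cluster tuned for a given workload class."""
    if workload in ("etl", "load"):
        return CLUSTERS["ingest"]
    if workload in ("report", "dashboard"):
        return CLUSTERS["query"]
    # Exploratory research falls through to the analytics cluster.
    return CLUSTERS["analytics"]

print(route("dashboard"))
```

The design choice being illustrated is that governance is applied per cluster rather than per user: exploratory work lands on the analytics cluster by default, and only promoted outputs pass back through the tightly governed reporting path.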

For governance, the bank will use Apache Atlas to create a data audit platform. This will, Vaughan added, help the organisation gain visibility into what its people are doing, the data they are using, and how they derive insights.

"We still are fairly siloed sometimes in the way people work in divisions like for regulation or markets," Waddy added, "so again you need to be willing to play nice and contribute to the central platform, where everybody needs to do the stuff that no one likes to do around capturing metadata and making sure that we know enough about the data sets as we onboard them onto our platform."
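The metadata capture Waddy describes happens programmatically in Atlas, which exposes a REST API for registering entities. Below is a minimal sketch of the kind of payload an onboarding step might construct for a Hive table; the database, table, owner and cluster names are all hypothetical, and actually sending the request to an Atlas server is omitted.

```python
import json

def atlas_entity(db: str, table: str, owner: str, cluster: str) -> dict:
    """Build the JSON body Atlas's v2 REST API expects when
    registering a Hive table (POST /api/atlas/v2/entity)."""
    return {
        "entity": {
            "typeName": "hive_table",
            "attributes": {
                "name": table,
                # Atlas's qualified-name convention: db.table@cluster
                "qualifiedName": f"{db}.{table}@{cluster}",
                "owner": owner,
                "description": "Trade reports, daily append",
            },
        }
    }

body = atlas_entity("emir", "trades", "data_analytics", "datahub2")
print(json.dumps(body, indent=2))
```

Capturing this at onboarding time, as Waddy suggests, is what makes the later audit questions — who owns this data set, where did it come from, who queries it — answerable from one place.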

The new data hub architecture runs on hyperconverged VMware VxRack hardware with EMC Isilon storage, offering 320TB of usable storage and around 10TB of RAM.

With more than 400 researchers at the Bank of England, the ability for IT to step back and give them a more self-serve environment to work in is important.

"As a technology function we will be retreating to the perimeter of that environment, we will support it and keep it running but what happens within there will be driven by the business," Vaughan said. "So they'll be building their own pipelines, they'll be publishing, it's for production of their models. Perhaps in the future we will get to the point where the next stress test will be running efficiently through our pipelines and back and circulating through this architecture. We have lots of other tools around the outside for automation.

"It was pretty daunting to think that in less than two years we can become experts in building, configuring and managing a large cluster, let alone build a solution on top that was secure and could deliver what we needed.

"Even though we partnered with a vendor who pulled the plug on the project, if you step back and look at what we managed to achieve in that space of time, I don't believe we could have got to where we are now, where we are comfortable to take that challenge on with the technology and have the skills in house to go again, but five times bigger."

Cost savings

The bank has recently been criticised by a select committee for expensive and ineffective IT spending, so is this part of a broader effort to improve?

Vaughan told Computerworld UK: "The strategic review in 2014 called for a more efficient use of technology, reducing those silos, so every time we build a new system we don't need a [user acceptance testing] instance or a live instance with their own infrastructure and management and licensing overhead.

"Actually something that uses open source technology that we can put all of our data on and use more efficiently is definitely part of that drive to make more efficient use of technology and bring down costs, but more importantly the capability it gives us to tap into new data sets."