This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

NFS is a central piece of infrastructure that is essential to services like Toolforge. Recently, the Cloud Services team at Wikimedia has been reviewing how we do NFS.

The current situation

NFS is a central piece of technology for some of the services that the Wikimedia Cloud Services team offers to the community. We have several shares that power different use cases: Toolforge user home directories live on NFS, and Cloud VPS users can also access dumps using this protocol. The current setup involves several physical hardware servers, with about 20TB of storage, offering shares over 10G links to the cloud. To make the system more fault-tolerant, we replicate each share using DRBD.

Running NFS on dedicated hardware servers has traditionally offered us advantages, mostly in terms of performance and capacity.

As time has passed, we have accumulated more and more reasons to review how we do NFS. For one, the current setup violates some of our internal rules regarding realm separation. Additionally, we had been longing for more flexibility in managing our servers: we wanted to use virtual machines managed by OpenStack Nova. The DRBD-based high-availability system mostly required a hand-crafted procedure for failover/failback. There are also scalability concerns: NFS is easy to scale up, but not to scale horizontally, and we have to preserve the tenancy setup while growing, something NFS handles through LDAP/Unix users and which can get complicated at scale. In general, the servers have become ‘too big to fail’, clearly technical debt, and it has taken us years to decide to take on the task of rethinking the architecture. It’s worth mentioning that in an ideal world we wouldn’t depend on NFS, but the truth is that it will remain a central piece of infrastructure for years to come in services like Toolforge.

Over a series of brainstorming meetings, the WMCS team evaluated the situation and sorted out the many moving parts. The team managed to boil the potential future of the service down to two competing options:

  • Adopt and introduce a new OpenStack component into our cloud: Manila — this was the right choice if we were interested in a general NFS-as-a-service offering for our Cloud VPS users.
  • Put the data on Cinder volumes and serve NFS from a couple of virtual machines created by hand — this was the right choice if we wanted something that required low effort to engineer and adopt.

We then decided to research both options in parallel. For a number of reasons, the evaluation was timeboxed to three weeks. Both ideas had a couple of points in common: the NFS data would be stored on our Ceph farm via Cinder volumes, and we would rely on Ceph’s reliability instead of DRBD. Another open topic was how to back up the data out of Ceph, to keep our important bits in more than one basket. We will get to the backup topic later.

The Manila experiment

The Wikimedia Foundation was an early adopter of some OpenStack components (Nova, Glance, Designate, Horizon), but Manila had never been evaluated for usage until now. Our approach for this experiment was to closely follow the upstream guidelines. We read the documentation and tried to understand the different setups you can build with Manila. As we often feel with other OpenStack components, the documentation doesn’t perfectly describe how to introduce a given component into your particular local setup. Here we use an admin-controlled flat-topology Neutron network, which is shared by all tenants (or projects) in our OpenStack deployment. Also, Manila can use many different driver backends, for things like NetApp appliances or CephFS (which we don’t use, yet). After some research, the generic driver was the one that seemed to best fit our use case. The generic driver leverages Nova virtual machine instances plus Cinder volumes to create and manage the shares. In general, Manila supports two operational modes, depending on whether it should create/destroy the share servers (i.e., the virtual machine instances) or not. This option is called driver_handles_share_servers (or DHSS) and takes a boolean value.

We were interested in trying with DHSS=true, to really benefit from the potential of the setup.
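To make this more concrete, below is a minimal sketch, assuming python-manilaclient and a Keystone password session, of what requesting an NFS share could look like for a tenant once Manila runs with the generic driver and DHSS=true. All endpoints, project names, share names and sizes are hypothetical, and the exact client setup may differ in a real deployment.

```python
# Minimal sketch (not our production tooling): ask Manila for an NFS share
# using python-manilaclient. With DHSS=true and the generic driver, Manila
# creates the Nova share server and backing Cinder volume behind the scenes.
from keystoneauth1 import loading, session
from manilaclient import client

# Hypothetical Keystone credentials; a real setup would read these from the
# environment or a clouds.yaml entry instead of hardcoding them.
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://keystone.example.org:5000/v3",
    username="toolsbeta-user",
    password="secret",
    project_name="toolsbeta",
    user_domain_id="default",
    project_domain_id="default",
)
sess = session.Session(auth=auth)
manila = client.Client("2", session=sess)

# Request a share on the tenant's share network (name/ID is hypothetical).
share = manila.shares.create(
    share_proto="NFS",
    size=100,  # GiB, illustrative
    name="toolsbeta-home",
    share_network="toolsbeta-share-net",
)

# Allow NFS access from a client subnet once the share becomes available.
manila.shares.allow(share.id, "ip", "172.16.0.0/21", "rw")
```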

Manila diagram NFS idea 6, original image in Wikitech

So, after sorting out all these variables, we moved on with our initial testing. We built a PoC setup as depicted in the diagram above, with the manila-share component running in a virtual machine inside the cloud. The PoC led to us reporting several bugs upstream, some of which we also tried to address ourselves.

It’s worth mentioning that the upstream community was extra-welcoming to us, and we’re thankful for that. However, at the end of our three-week period, our Manila setup still wasn’t working as expected. Your experience may differ with other drivers, perhaps the ZFSonLinux or CephFS ones. Since we were struggling to get the setup into a usable state, we decided to abandon this approach in favor of the other option we had been considering from the beginning.

Simple virtual machine serving NFS

The alternative was to create a Nova virtual machine instance by hand and configure it using Puppet. We have been investing in an automation framework lately, though, so the server won’t actually be created by hand. Either way, the data would be decoupled from the instance into Cinder volumes, which brings us back to the question we left for later: how should we back up those terabytes of important information? Just to be clear, the backup problem is independent of the two options above; with Manila we would still have had to solve the same challenge. We would like our data to be backed up somewhere other than Ceph. And that’s exactly where we are right now: we’ve been exploring different backup strategies and will ultimately use the Cinder backup API.
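As a rough illustration of that last point, here is a minimal sketch, assuming python-cinderclient and a dedicated service account, of how data volumes could be backed up through the Cinder backup API. The volume naming convention and credentials are hypothetical; actual backup jobs would live in our automation framework.

```python
# Minimal sketch (hypothetical naming and credentials): back up the Cinder
# volumes that hold NFS data through the Cinder backup API.
from keystoneauth1 import loading, session
from cinderclient import client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://keystone.example.org:5000/v3",
    username="backup-bot",
    password="secret",
    project_name="cloudinfra-nfs",
    user_domain_id="default",
    project_domain_id="default",
)
sess = session.Session(auth=auth)
cinder = client.Client("3", session=sess)

# Back up every volume whose name marks it as NFS data (the naming
# convention is an assumption for this sketch).
for volume in cinder.volumes.list():
    if not volume.name.startswith("nfs-"):
        continue
    cinder.backups.create(
        volume.id,
        name=f"{volume.name}-backup",
        incremental=True,  # requires an existing full backup of the volume
        force=True,        # allow backing up volumes that are attached/in use
    )
```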

Conclusion

The iteration will end with the dedicated NFS hardware servers being stopped, and the shares being served from within the cloud. The migration will take some time, because we will check and double-check that everything works as expected (including from a performance point of view) before making definitive changes. We already have some plans to make sure our users experience as little service impact as possible. The most troublesome shares will be those related to Toolforge. At some point we will need to disallow writes to the NFS share, rsync the data out of the hardware servers into the Cinder volumes, point the NFS clients to the new virtual machines, and then enable writes again. The main Toolforge share holds about 8TB of data, so this will take a while.
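As a sketch of that cut-over sequence for a single share, the steps below assume a hypothetical export path on the old hardware server and a hypothetical target virtual machine; the real procedure will be driven by our automation framework, and doing a first rsync pass while the share is still writable is just one way to keep the read-only window short.

```python
# Sketch of the cut-over for one share, run on the old NFS hardware server.
# Paths, hostnames and export handling are hypothetical.
import subprocess

OLD_PATH = "/srv/tools/shared/"                        # hypothetical export path
NEW_TARGET = "nfs-cinder-01.example.org:/srv/tools/"   # hypothetical new VM

def run(cmd):
    """Run a command and stop the cut-over loudly if any step fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Pre-sync while the share is still writable, so most of the ~8TB is
#    already copied before writes are frozen.
run(["rsync", "-aHAX", "--numeric-ids", "--delete", OLD_PATH, NEW_TARGET])

# 2. Disallow writes: after switching the entry in /etc/exports to 'ro',
#    re-export the shares.
run(["exportfs", "-ra"])

# 3. Final delta pass now that no new writes can land on the old server.
run(["rsync", "-aHAX", "--numeric-ids", "--delete", OLD_PATH, NEW_TARGET])

# 4. Remaining steps (pointing NFS clients at the new virtual machine and
#    re-enabling writes there) are handled by Puppet, not by this sketch.
```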

We will have more updates in the future. Who knows, perhaps our next-next iteration, in a couple of years, will see us adopting OpenStack Manila for good.

Featured image credit: File:(from break water) Manila Skyline – panoramio.jpg, ewol, CC BY-SA 3.0
