
Mastery comes from practical experience and time. At SoftIron, the scars of those experiences are reflected in our task-specific hardware design, purpose-built operating system, and the most resilient Ceph clusters in the world.

From the many clients we’ve spoken to, we have seen some patterns in design and deployment that are easy to avoid. If you’re going down the DIY path, here are some of the key mistakes to consider.

Mistake #1 – Choosing a bad journal drive

When building a Ceph cluster, especially one with HDDs, you’ll want to add a journal drive: an SSD that houses some key elements of the Ceph architecture (for example, a write-ahead log and metadata database). It’s a very cost-effective way to improve performance for most use cases.

Too often, this is an afterthought. Folks will often use the M.2 SATA ports of a commodity server (intended for boot drives) for journal drives. The trouble is that most M.2 drives are recommended only for boot and have endurance in the 1 DWPD (drive writes per day) range. Because Ceph writes so heavily to its write-ahead logs and metadata, those drives can burn out quickly.

If you’re going to insist on using the M.2 port, some vendors allow tuning the over-provisioning allotted to the drive, which can increase endurance.
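To get a feel for how quickly a boot-class drive can wear out, here is a rough back-of-the-envelope sketch. It assumes the endurance rating (drive writes per day, DWPD) applies over the warranty period and that the journal write load is steady. The drive size, warranty length, and daily write volume below are hypothetical numbers chosen for illustration, not measurements.

```python
def rated_lifetime_writes_tb(capacity_tb: float, dwpd: float,
                             warranty_years: float) -> float:
    """Total terabytes the drive is rated to absorb over its warranty period."""
    return capacity_tb * dwpd * 365 * warranty_years


def years_until_worn_out(capacity_tb: float, dwpd: float,
                         warranty_years: float, daily_writes_tb: float) -> float:
    """Years until the rated write budget is used up at a steady daily write rate."""
    budget_tb = rated_lifetime_writes_tb(capacity_tb, dwpd, warranty_years)
    return budget_tb / (daily_writes_tb * 365)


if __name__ == "__main__":
    # Hypothetical boot-class M.2 drive: 0.5 TB, 1 DWPD, 5-year warranty.
    # Hypothetical journal load: 2 TB of write-ahead log and metadata writes per day.
    years = years_until_worn_out(0.5, 1.0, 5.0, 2.0)
    print(f"Rated write budget exhausted after about {years:.2f} years")
```

Plugging in the endurance rating from a real datasheet and the write rate you actually observe gives a quick sanity check on whether a given drive is a reasonable journal candidate.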

Mistake #2 – Using a server that requires a RAID controller

In some cases there’s just no way around this, especially with very dense HDD servers that use Intel Xeon architectures. But RAID functionality isn’t useful within the context of a Ceph cluster, which handles data protection itself. Ideally you can find a RAID controller that operates with a passthrough mode (JBOD, aka just a bunch of disks). Worst case, if you have to use the RAID functionality, configure each drive as its own single-disk RAID-0 volume.

If you can’t get away from using a RAID controller, you should also either:

  1. disable write caching (preferred), or
  2. have a battery-backed cache.

Otherwise, you’re guaranteed to have problems. We’ve seen it.

As a side note, these servers also tend to come with long cable paths that represent additional points of failure. In some cases the failures are handled silently, but in the case of something jiggling loose, it creates maintenance headaches.

Mistake #3 – Putting MON daemons on the same hosts as OSDs

For 99% of the life of your cluster, the monitor service does very little. But it works the hardest when your cluster is under strain, like when hardware fails. After all, the mantra of Ceph is “we will not lose data”. During failures, the monitors track which daemons are up or down and publish updated cluster maps. Those map changes in turn cause data to get moved from one device to another, at which point your OSDs are working harder as well. (Scrubbing and checksumming the stored data is the job of the OSDs, not the monitors, which is one more reason the OSD hosts are busy at exactly that moment.) Monitors also hold elections to maintain a quorum, which is why there’s typically an odd number of them. The issue, as the Ceph documentation puts it, is that “Ceph Monitors flush their data from memory to disk very often, which can interfere with Ceph OSD daemon workloads”.

So when you build a minimum-sized cluster (typically 3 hosts), and one fails, all hell breaks loose. But we’ll get to that later.

A better approach is to run the monitors and the storage daemons on separate hosts. It’s not like you get a massive benefit from co-locating them. If you’re concerned about dedicating hardware to this, one potential alternative is to run the monitors in a virtualization cluster you already have. That still beats co-locating them with OSDs.
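The reason for an odd number of monitors comes down to majority voting: a quorum needs more than half of the monitors to agree. The short sketch below works through that arithmetic. The function names are made up for illustration and are not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for count in range(1, 8):
        print(f"{count} monitors: quorum needs {quorum_size(count)}, "
              f"survives {tolerated_failures(count)} failure(s)")
```

Going from 3 to 4 monitors still only tolerates one failure, so the extra monitor adds load without adding resilience. The next real improvement comes at 5.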

Mistake #4 – Setting min_size and replica size incorrectly

Setting your min_size to 1 and replica size to 2 is very tempting. It looks similar to the familiar RAID 1, so it seems you could get away with having the system operate in a degraded state while getting better raw-to-usable storage efficiency than triple replication.

But remember – Ceph doesn’t want you to lose data. It only serves I/O for a placement group while enough copies are available to trust, and with a replica size of 2 there is only one other copy to fall back on. If you keep min_size at 2 and any disk goes offline, even temporarily, the cluster will stop access to any placement group using that OSD. Ceph won’t let you read, and it won’t let you write either, so for the affected data the system locks up. If you drop min_size to 1 to avoid that, writes are accepted with a single copy, and it becomes very easy to have major issues because two devices disagree on what the data should look like. It breaks the consistency goals that might have led you to choose Ceph in the first place.
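As a simplified model of the trade-off, a placement group keeps serving I/O only while the number of available replicas is at least min_size. The sketch below applies that rule to a few settings. It ignores peering, recovery, and everything else a real cluster does, and the function name is hypothetical.

```python
def pg_state(size: int, min_size: int, failed_replicas: int) -> tuple[bool, int]:
    """Return (accepts_io, replicas_left) for one placement group.

    Simplified rule: I/O is allowed only while at least min_size replicas remain.
    """
    replicas_left = max(size - failed_replicas, 0)
    accepts_io = replicas_left >= min_size
    return accepts_io, replicas_left


if __name__ == "__main__":
    # One OSD holding a replica goes offline in each scenario.
    for size, min_size in [(2, 2), (2, 1), (3, 2)]:
        accepts_io, left = pg_state(size, min_size, failed_replicas=1)
        status = "serving I/O" if accepts_io else "blocked"
        print(f"size={size} min_size={min_size}: {status} with {left} copy/copies left")
```

With size 2, a single failure leaves you either blocked (min_size 2) or running on one unprotected copy (min_size 1). With size 3 and min_size 2, the same failure leaves the data both available and still redundant.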

Another alternative, if you need to increase your usable storage capacity, would be to use erasure coding. You’ll likely sacrifice some performance, but you can get higher storage efficiency without risking big chunks of your data becoming inaccessible.
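To compare storage efficiency, the usable fraction of raw capacity is 1/size for replication and k/(k+m) for an erasure-coded pool with k data chunks and m coding chunks. The sketch below computes both. The 4+2 profile is just an example picked for illustration, not a recommendation.

```python
def replication_efficiency(size: int) -> float:
    """Usable fraction of raw capacity with `size` full copies of every object."""
    return 1 / size


def erasure_coding_efficiency(k: int, m: int) -> float:
    """Usable fraction of raw capacity with k data chunks and m coding chunks."""
    return k / (k + m)


if __name__ == "__main__":
    print(f"2x replication: {replication_efficiency(2):.0%} usable")
    print(f"3x replication: {replication_efficiency(3):.0%} usable")
    # Hypothetical 4+2 profile: survives the loss of any 2 chunks.
    print(f"EC 4+2:         {erasure_coding_efficiency(4, 2):.0%} usable")
```

In this example the 4+2 profile yields more usable space than even 2x replication, while tolerating the loss of two chunks rather than one copy.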

Mistake #5 – Thinking denser is always better

Denser servers, denser drives, denser CPUs – that way you cut down on the cost of metalwork, PCBs, CPUs, and networking, right?

Turns out when you’re using a replication or erasure-coding based data protection mechanism across a cluster, it’s good to have your bandwidth spread out over a larger surface area. As for CPUs, denser means more wasted clock cycles and a larger power budget, which has downstream consequences on your rack and power capacity. Maybe not a big deal in racks with 25 kW of power to work with, but definitely a concern when you’ve got less than 8 kW, which is about the global average.

But the biggest concern is the blast radius. Let’s say you build a minimum viable cluster with 3 of the densest 60-drive servers you can get from commodity suppliers. When this was written, that’s 18 TB per drive, but that’ll probably double in a year or two. If it takes 1 day to recover the data lost from a single drive, it will take roughly 2 months (60 drives at a day each) to recover the data from a lost server. Moving the drives doesn’t work because the OSDs have to get recreated in the new chassis. In that time, if you lose another drive or server, you’re potentially hosed.
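The blast-radius arithmetic is simple enough to sketch. It assumes, as the estimate above does, that recovery time scales linearly with the number of drives lost. The drive count, drive size, and one-day-per-drive figure come from the example; the function names are illustrative.

```python
def server_raw_capacity_tb(drives: int, drive_capacity_tb: float) -> float:
    """Raw capacity that disappears when one whole server is lost."""
    return drives * drive_capacity_tb


def server_recovery_days(drives: int, days_per_drive: float) -> float:
    """Recovery time for a lost server, assuming it scales linearly per drive."""
    return drives * days_per_drive


if __name__ == "__main__":
    drives, drive_tb, days_per_drive = 60, 18, 1
    lost_tb = server_raw_capacity_tb(drives, drive_tb)
    days = server_recovery_days(drives, days_per_drive)
    print(f"Losing one server takes out {lost_tb} TB of raw capacity")
    print(f"Estimated recovery: {days} days (about {days / 30:.0f} months)")
```

Doubling the drive size doubles the capacity at risk per server, and under the same linear assumption the recovery window grows along with it.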

Hardware matters – The SoftIron Way

The better approach, though, would be to build your cluster with SoftIron HyperDrive appliances. We can save you time, money, and headaches by having worked out the kinks ahead of time. We right-size the journal drives, we use CPU architectures that allow us to eliminate the RAID controller or HBA, and we design clusters according to best practices for resilience, performance, and recoverability. For those who have already invested in HyperDrive, we can consult with you on how to make the best of it, and we can provide you with unmatched support.

And if you’re concerned about plopping down a huge pile of money, we can help there too with our no-obligation Test Drive program. Getting that conversation started is as easy as clicking the little chat icon on the right.