Cfengine configuration directory

Introduction to scalability

Relevant classes: any

Scalability is something to think about whether you have twenty hosts or twenty thousand hosts. There is no recipe for scaling your system and no single right answer, but there are often wrong answers!

It is actually wrong to talk about whether a system "scales" or "doesn't scale". Rather we have to think about how the operation of a network changes in relation to the number of clients and services within it.

What is scaling?

Why would anything prevent us from just having as many hosts as we want? There are several problems that occur with scale. Let us suppose that we have N machines at our site.
  1. The sharing of the resource cake between many rather than few is harder for large N.
  2. The human difficulty of managing so many machines.
    1. Difficulty of understanding.
      1. Heterogeneity.
      2. Multiple policies (subgroups and special needs)
    2. Difficulty of implementing.
      1. Heterogeneity.
      2. Multiple policies (subgroups and special needs)
      3. Work to be done increases as a function of N.

Workflow and scaling

Let's look at some examples of how workflow affects scaling. The following four diagrams represent network topologies which show the flows of information (resources) between hosts. You can think of these pictures at several levels.
  • Physical level: how do hosts communicate?
  • Logical level: which hosts collaborate with others?
  • Policy level: which hosts govern others?
In the first of the pictures (a), control is centralized, which leads to a star configuration. The node at the centre of the star is a bottleneck in this scenario, because it has to share itself between N clients. In the second case (b), not all of the N hosts have to go through the central node in order to be connected, so we lighten the load on the bottleneck. In (c) we have a hierarchical organization that has low level bottlenecks and high level bottlenecks, i.e. it has several scales. This often reflects the structure of human organizations (as humans, we are in love with the hierarchy). In the final case (d), we have a completely decentralized (peer to peer) organization which has no bottlenecks.
[Figure: four network topology diagrams, panels (a), (b), (c) and (d)]
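
To make the bottleneck argument concrete, here is a minimal Python sketch that counts how many direct neighbours the busiest node has in three of these shapes. The function names are invented for this illustration, and the ring is only one possible stand-in for a peer to peer layout; none of this comes from cfengine itself.

```python
from collections import defaultdict


def bottleneck_load(edges):
    """Return the largest number of direct neighbours that any one node has."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree.values())


def star(n_clients):
    # (a) one hub, node 0, connected directly to every client.
    return [(0, i) for i in range(1, n_clients + 1)]


def hierarchy(n_clients, fanout):
    # (c) a tree in which the parent of node i is node (i - 1) // fanout.
    return [((i - 1) // fanout, i) for i in range(1, n_clients + 1)]


def ring(n_peers):
    # Stand-in for (d): every peer talks only to its two neighbours.
    return [(i, (i + 1) % n_peers) for i in range(n_peers)]


if __name__ == "__main__":
    print("star      :", bottleneck_load(star(100)))
    print("hierarchy :", bottleneck_load(hierarchy(100, 10)))
    print("ring      :", bottleneck_load(ring(100)))
```

In the star, the hub's load grows in step with N. In the hierarchy it is bounded by the fan-out plus one link to the parent, and in the ring it stays constant however many peers are added.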

Scaling in cfengine

Because cfengine is an automation system, it scales flatly in everything except file copying and the complexity of policy. That means that there is no penalty for the number of hosts you have to manage: the workload is independent of N.

Since all work is done on the end hosts, the work is completely decentralized. The only places where you have to think about the problems of scaling are therefore:

  • Managing file copying bottlenecks
  • Managing the complexity of configuration
The strength of cfengine is that it does not impose a control model on you. The only constraints that cfengine insists on are related to the pull-based file transfer.

How to plan scaling

The stages of scaling decisions are:
  1. Choosing your model. This includes planning for future expansion. Whatever model you choose, you will have to be sure that it will be able to grow to meet your future needs. The choice of model affects
    • The efficiency of operation (load balancing)
    • The likelihood of errors and problems (resilience)
  2. Deploying the resources. Whether you choose a centralized model or a distributed model, you have to deploy adequate resources to cope. Centralized models place more pressure on the bottlenecks of your system: the central controller and the network that brings the traffic to and from it. There is no doubt that distributed models offer more efficient scalability, but some administrators feel uncomfortable managing distributed systems. With cfengine you can go for complete centralization or even peer to peer models.
  3. Optimizing the resources that you have deployed. Once you have decided on what you can afford, you need to maximize your returns on the investment. This is a tricky issue and needs to be covered in several articles.

Tips and features for large N

Tools for easing resource bottlenecks:
  1. SplayTime - if you are copying data from a single central location (star model), set this to 1/10 the number of hosts you have (in minutes) to spread the load. For example, with 600 hosts this gives 60 minutes.
  2. Failover server in copy - if a host should time out, offer an alternative.
  3. Partition your peers. If you have a large number of hosts, partition them into subgroups using
    SelectPartitionGroup(filename,comment,Policy,group size) 
    SelectPartitionLeader(filename,comment,Policy,group size) 
    
    to select a representative server amongst a flat group of peers.
  4. Strategies and classes can be used to randomize the time at which operations take place to avoid overlapping. That is, if things overlap in space (machines), make sure that they don't overlap in time.
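
The ideas behind the first and third tips can be sketched in a few lines of Python. This is only an illustration of the principle, not cfengine's implementation: the function names, the use of a hash of the hostname to pick a delay, and the rule that the alphabetically first host leads each group are all assumptions made for the example.

```python
import hashlib


def splay_minutes(n_hosts):
    # Rule of thumb from the list above: one tenth of the host count, in minutes.
    return n_hosts / 10


def host_delay_seconds(hostname, splay):
    # Assumption: a stable per-host offset taken from a hash of the hostname,
    # so each host always waits the same time and hosts spread across the window.
    window = max(1, int(splay * 60))
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window


def partition(hosts, group_size):
    # Split a flat, sorted list of peers into consecutive groups.
    ordered = sorted(hosts)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def leaders(hosts, group_size):
    # Hypothetical policy: the first host in each group represents that group.
    return [group[0] for group in partition(hosts, group_size)]


if __name__ == "__main__":
    hosts = [f"host{i:02d}" for i in range(10)]
    splay = splay_minutes(600)
    for name in hosts[:3]:
        print(name, "waits", host_delay_seconds(name, splay), "seconds")
    print("leaders:", leaders(hosts, 4))
```

With groups of this kind, only the group leaders need to contact the central source, and the other hosts can copy from their own leader, so the central bottleneck sees roughly N divided by the group size clients instead of N.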