Mesosphere DC/OS Masterclass : Tips and tricks to make life easier

Introduction

DC/OS is an open-source, distributed operating system for the data center, built on the Apache Mesos distributed systems kernel. As a distributed system, it is a cluster of master nodes and private/public agent nodes, where each node also has a host operating system that manages the underlying machine.

It enables the management of multiple machines as if they were a single computer. It automates resource management, schedules process placement, facilitates inter-process communication, and simplifies the installation and management of distributed services. Its included web interface and available command-line interface (CLI) facilitate remote management and monitoring of the cluster and its services.

  • Distributed System : DC/OS is a distributed system made up of a group of private and public agent nodes that are coordinated by master nodes.

  • Cluster Manager : DC/OS is responsible for running tasks on agent nodes and providing them with the resources they require. DC/OS uses Apache Mesos to provide cluster management functionality.

  • Container Platform : All DC/OS tasks are containerized. DC/OS uses two different container runtimes, Docker and Mesos, so containers can be started from Docker images or they can be native executables (binaries or scripts) that are containerized at runtime by Mesos.

  • Operating System : As the name suggests, DC/OS is an operating system that abstracts the cluster's hardware and software resources and provides common services to applications.

Unlike Linux, DC/OS is not a host operating system. DC/OS spans multiple machines, but relies on each machine to have its own host operating system and host kernel.

The high level architecture of DC/OS can be seen below :

For the detailed architecture and components of DC/OS, please click here.

Adoption and usage of Mesosphere DC/OS:

Mesosphere customers include :

  • 30% of the Fortune 50 U.S. Companies

  • 5 of the top 10 North American Banks

  • 7 of the top 12 Worldwide Telcos

  • 5 of the top 10 Highest Valued Startups

Some companies using DC/OS are :

  • Cisco

  • Yelp

  • Tommy Hilfiger

  • Uber

  • Netflix

  • Verizon

  • Cerner

  • NIO

Installing and using DC/OS

A guide to installing DC/OS can be found here. After installing DC/OS on any platform, install dcos cli by following documentation found here.

Using the DC/OS CLI, we can manage cluster nodes, manage Marathon tasks and services, and install or remove packages from the Universe. It also lends itself well to automation, since CLI commands can emit their output as JSON.

NOTE: The commands below were run and tested with the following tools:

  • DC/OS 1.11 Open Source

  • DC/OS cli 0.6.0

  • jq:1.5-1-a5b5cbe

DC/OS commands and scripts

Setup DC/OS cli with DC/OS cluster

dcos cluster setup <CLUSTER URL>

Example :

dcos cluster setup http://dcos-cluster.com

The above command prints a link for OAuth authentication and prompts for an auth token. You can authenticate with a Google, GitHub or Microsoft account, then paste the generated token at the CLI prompt (provided OAuth is enabled on the cluster).

DC/OS authentication token

dcos config show core.dcos_acs_token

DC/OS cluster url

dcos config show core.dcos_url

DC/OS cluster name

dcos config show cluster.name

Access Mesos UI

<DC/OS_CLUSTER_URL>/mesos

Example:

http://dcos-cluster.com/mesos

Access Marathon UI

<DC/OS_CLUSTER_URL>/service/marathon

Example:

http://dcos-cluster.com/service/marathon

Access any DC/OS service, like Marathon, Kafka, Elastic, Spark etc.[DC/OS Services]

<DC/OS_CLUSTER_URL>/service/<SERVICE_NAME>

Example:

http://dcos-cluster.com/service/marathon
http://dcos-cluster.com/service/kafka

Access DC/OS slaves info in json using Mesos API [Mesos Endpoints]

curl -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/slaves" | jq .
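
The same curl pattern (token and cluster URL read from the CLI config) recurs in many of the commands in this post, so it can be convenient to wrap it once. The following is a minimal sketch, not part of the DC/OS CLI: the function name dcos_api is hypothetical, and it reuses the Authorization header exactly as written in this article.

```shell
# Hypothetical helper (not part of the DC/OS CLI): GET a path on the cluster,
# using the URL and auth token stored in the DC/OS CLI config.
dcos_api() {
  api_path=$1
  api_base=$(dcos config show core.dcos_url) || return 1
  api_token=$(dcos config show core.dcos_acs_token) || return 1
  # Strip a trailing slash from the base URL so the path joins cleanly.
  curl -s -H "Authorization: Bearer ${api_token}" "${api_base%/}${api_path}"
}
```

Usage, assuming the CLI is already attached to a cluster: dcos_api /mesos/slaves | jq .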

Access DC/OS slaves info in json using DC/OS cli

dcos node --json

Note : The DC/OS CLI command ‘dcos node --json’ is equivalent to querying the Mesos slaves endpoint (/mesos/slaves)

Access DC/OS private slaves info using DC/OS cli

dcos node --json |  jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip == null) | "Private Agent : " + .hostname ' -r

Access DC/OS public slaves info using DC/OS cli

dcos node --json |  jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip != null) | "Public Agent : " + .hostname ' -r

Access DC/OS private and public slaves info using DC/OS cli

dcos node --json | jq '.[] | select(.type | contains("agent")) | if (.attributes.public_ip != null) then "Public Agent  : " else "Private Agent : " end + " - " + .hostname' -r | sort

Get public IP of all public agents

Note : ‘dcos node ssh’ requires your private key to be available to SSH. Make sure you add your private key as an SSH identity using :

ssh-add </path/to/private/key/file/.pem>

Get public IP of master leader

dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "curl -s ifconfig.co" 2>/dev/null

Get all master nodes and their private ip

dcos node --json | jq '.[] | select(.type | contains("master")) | .ip + " = " + .type' -r

Get list of all users who have access to DC/OS cluster

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users" | jq '.array[].uid' -r

Add users to cluster using Mesosphere script (Run this on master)

The users to add are listed in list.txt, one user per line.

for i in $(cat list.txt); do echo "$i"; sudo -i dcos-shell /opt/mesosphere/bin/dcos_add_user.py "$i"; done

Add users to cluster using DC/OS API

Delete users from DC/OS cluster organization
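
For the two API-based user operations above, a rough sketch could look like the following. It targets the same /acs/api/v1/users endpoint used earlier to list users. The function names are hypothetical, and it is an assumption that adding a user is a PUT with an empty JSON body and removing one is a DELETE on the same resource; check this against the API documentation for your DC/OS version before relying on it.

```shell
# Hypothetical helpers, not part of the DC/OS CLI.

# Build the user resource URL from a cluster URL and a user id (an email
# address in open-source DC/OS).
dcos_user_url() {
  printf '%s/acs/api/v1/users/%s\n' "${1%/}" "$2"
}

# Add a user. Assumption: the API accepts a PUT with an empty JSON body.
dcos_add_user() {
  curl -s -X PUT \
    -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
    -H "Content-Type: application/json" \
    -d '{}' \
    "$(dcos_user_url "$(dcos config show core.dcos_url)" "$1")"
}

# Delete a user by id. Assumption: DELETE on the same resource removes it.
dcos_delete_user() {
  curl -s -X DELETE \
    -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
    "$(dcos_user_url "$(dcos config show core.dcos_url)" "$1")"
}
```

Usage: dcos_add_user user@example.com, and later dcos_delete_user user@example.com. Listing the users again with the command shown earlier is a simple way to confirm the change.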

Offers/resources from individual DC/OS agent

In recent versions of many DC/OS services, a scheduler endpoint at

http://yourcluster.com/service/<service-name>/v1/debug/offers

will display an HTML table containing a summary of recently-evaluated offers. This table’s contents are currently very similar to what can be found in logs, but in a slightly more accessible format. Alternately, we can look at the scheduler’s logs in stdout. An offer is a set of resources all from one individual DC/OS agent.

<DC/OS_CLUSTER_URL>/service/<service_name>/v1/debug/offers

Example:

http://dcos-cluster.com/service/kafka/v1/debug/offers
http://dcos-cluster.com/service/elastic/v1/debug/offers

Save JSON configs of all running Marathon apps
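
One possible way to do this with the CLI alone is sketched below: list the app ids, then write the output of ‘dcos marathon app show’ for each app to its own file. The function names and the default output directory (marathon-apps) are illustrative choices, not something the CLI prescribes.

```shell
# Turn a Marathon app id such as /group/app into a flat file name.
app_id_to_filename() {
  printf '%s\n' "$1" | sed -e 's|^/||' -e 's|/|_|g' -e 's|$|.json|'
}

# Save the JSON definition of every Marathon app into a directory.
# The directory name is a hypothetical default; pass another as $1.
save_marathon_apps() {
  apps_outdir=${1:-marathon-apps}
  mkdir -p "$apps_outdir" || return 1
  dcos marathon app list --json | jq -r '.[].id' | while IFS= read -r app_id; do
    # </dev/null keeps the CLI from consuming the list of ids on stdin.
    dcos marathon app show "$app_id" \
      > "$apps_outdir/$(app_id_to_filename "$app_id")" </dev/null
  done
}
```

Running save_marathon_apps backup would leave one JSON file per app in ./backup, which can be kept under version control or fed back to ‘dcos marathon app add’.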

Get report of Marathon apps with details like container type, Docker image, tag or service version used by Marathon app.

Get DC/OS nodes with more information like node type, node ip, attributes, number of running tasks, free memory, free cpu etc.

Framework Cleaner

Uninstall a framework and clean up any resources it still has reserved after it is deleted/uninstalled. (The cleanup step applies to DC/OS 1.9 or older; on DC/OS 1.10 and later, the CLI uninstall alone is sufficient.)

SERVICE_NAME=<service_name>
dcos package uninstall $SERVICE_NAME
dcos node ssh --option StrictHostKeyChecking=no --master-proxy --leader \
  "docker run mesosphere/janitor /janitor.py -r ${SERVICE_NAME}-role -p ${SERVICE_NAME}-principal -z dcos-service-${SERVICE_NAME}"

Get DC/OS apps and their placement constraints

dcos marathon app list --json | jq '.[] | if (.constraints != null) then .id, .constraints else empty end'


Run shell command on all slaves
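
A minimal sketch of one way to do this: feed Mesos agent ids to a loop that calls ‘dcos node ssh’ with --mesos-id for each of them. The function name run_on_agents is hypothetical, and the SSH options are simply the ones used for the master leader elsewhere in this post.

```shell
# Hypothetical helper: run a shell command on every agent whose Mesos id
# is read from stdin (one id per line).
run_on_agents() {
  agent_cmd=$1
  while IFS= read -r agent_id; do
    # Skip blank lines in the input.
    [ -n "$agent_id" ] || continue
    echo "== $agent_id =="
    # </dev/null stops ssh from swallowing the remaining ids on stdin.
    dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet \
      --master-proxy --mesos-id="$agent_id" "$agent_cmd" </dev/null
  done
}
```

Usage, assuming the agent entries of ‘dcos node --json’ carry their Mesos id in the id field: dcos node --json | jq -r '.[] | select(.type | contains("agent")) | .id' | run_on_agents "ulimit -a"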

Run shell command on master leader

CMD=<shell command, Ex: ulimit -a >
dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "$CMD"

Run shell command on all master nodes

Add node attributes to dcos nodes and run apps on nodes with required attributes using placement constraints

Install DC/OS Datadog metrics plugin on all DC/OS nodes

Get app / node metrics fetched by dcos-metrics component using metrics API

  • Get DC/OS node id [dcos node]

  • Get Node metrics (CPU, memory, local filesystems, networks, etc) :  <DC/OS_CLUSTER_URL>/system/v1/agent/<AGENT_ID>/metrics/v0/node

  • Get id of all containers running on that agent : <DC/OS_CLUSTER_URL>/system/v1/agent/<AGENT_ID>/metrics/v0/containers

  • Get Resource allocation and usage for the given container ID. : <DC/OS_CLUSTER_URL>/system/v1/agent/<AGENT_ID>/metrics/v0/containers/<CONTAINER_ID>

  • Get Application-level metrics from the container (shipped in StatsD format using the listener available at STATSD_UDP_HOST and STATSD_UDP_PORT) : <DC/OS_CLUSTER_URL>/system/v1/agent/<AGENT_ID>/metrics/v0/containers/<CONTAINER_ID>/app     
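
The endpoints listed above can be queried with the same token-based curl pattern used elsewhere in this post. The following is a small sketch; metrics_url and get_metrics are hypothetical helper names, and only the node and per-container endpoints are covered.

```shell
# Hypothetical helper: build a dcos-metrics URL for an agent.
# With two arguments it returns the node metrics URL; a third argument
# (a container id) selects that container's metrics instead.
metrics_url() {
  metrics_base=${1%/}
  metrics_agent=$2
  metrics_container=${3:-}
  if [ -n "$metrics_container" ]; then
    printf '%s/system/v1/agent/%s/metrics/v0/containers/%s\n' \
      "$metrics_base" "$metrics_agent" "$metrics_container"
  else
    printf '%s/system/v1/agent/%s/metrics/v0/node\n' \
      "$metrics_base" "$metrics_agent"
  fi
}

# Fetch the metrics as JSON, reusing the token stored by the DC/OS CLI.
# Arguments: <agent-id> [<container-id>]
get_metrics() {
  curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
    "$(metrics_url "$(dcos config show core.dcos_url)" "$@")"
}
```

For example, get_metrics <AGENT_ID> | jq . prints the node metrics, and get_metrics <AGENT_ID> <CONTAINER_ID> | jq . prints the metrics for one container.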

Get app / node metrics fetched by dcos-metrics component using dcos cli

  • Summary of container metrics for a specific task

    • dcos task metrics summary <task-id>

  • All metrics in detail for a specific task

    • dcos task metrics details <task-id>

  • Summary of node metrics for a specific node

    • dcos node metrics summary <mesos-node-id>

  • All node metrics in detail for a specific node

    • dcos node metrics details <mesos-node-id>

NOTE - All of the above commands accept a ‘--json’ flag so that they can be used programmatically.

Launch / run command inside container for a task

The DC/OS CLI ‘dcos task exec’ command only supports Mesos containers; this script supports both Mesos and Docker containers.

Get DC/OS tasks by node
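
As a rough sketch, tasks can be grouped by node by reading the agent id that each task record carries. The helper name tasks_by_node is hypothetical, and it assumes each entry of ‘dcos task --json’ has slave_id and name fields, as Mesos task records do.

```shell
# Hypothetical helper: read `dcos task --json` output on stdin and print
# one "<agent id><TAB><task name>" line per task, sorted so that tasks on
# the same agent end up next to each other.
tasks_by_node() {
  jq -r '.[] | "\(.slave_id)\t\(.name)"' | sort
}
```

Usage: dcos task --json | tasks_by_node. The agent ids in the first column can be matched against the id field of ‘dcos node --json’ to obtain hostnames.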

Get cluster metadata - cluster Public IP and cluster ID

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/metadata"

Sample Output:

{
  "PUBLIC_IPV4": "123.456.789.012",
  "CLUSTER_ID": "abcde-abcde-abcde-abcde-abcde-abcde"
}

Get DC/OS metadata - DC/OS version

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/dcos-metadata/dcos-version.json"

Sample Output:

{
  "version": "1.11.0",
  "dcos-image-commit": "b6d6ad4722600877fde2860122f870031d109da3",
  "bootstrap-id": "a0654657903fb68dff60f6e522a7f241c1bfbf0f"
}

Get Mesos version

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/version"

Sample Output:

{
  "build_date": "2018-02-27 21:31:27",
  "build_time": 1519767087.0,
  "build_user": "",
  "git_sha": "0ba40f86759307cefab1c8702724debe87007bb0",
  "version": "1.5.0"
}

Access DC/OS cluster exhibitor UI (Exhibitor supervises ZooKeeper and provides a management web interface)

<CLUSTER_URL>/exhibitor

Access DC/OS cluster data from cluster zookeeper using Zookeeper Python client - Run inside any node / container

Access dcos cluster data from cluster zookeeper using exhibitor rest API

# Get  znode data using endpoint :
# /exhibitor/exhibitor/v1/explorer/node-data?key=/path/to/node
# Example : Get znode data for path = /cluster-id
curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/exhibitor/exhibitor/v1/explorer/node-data?key=/cluster-id"

Sample Output:

{
  "bytes": "3333-XXXXXX",
  "str": "abcde-abcde-abcde-abcde-abcde-",
  "stat": "XXXXXX"
}

Get cluster name using Mesos API

curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/state-summary" | jq .cluster -r

Mark Mesos node as decommissioned

Sometimes an instance running as a DC/OS node is terminated and cannot come back online; a terminated AWS EC2 instance, for example, cannot be started again. When Mesos detects that a node has stopped, it puts the node in the UNREACHABLE state, because Mesos does not know whether the node is temporarily stopped and will come back online or is permanently stopped. If we know that a node will not come back, we can explicitly tell Mesos to put it in the GONE state.

dcos node decommission <mesos-agent-id>

Conclusion

We learned about Mesosphere DC/OS, its functionality and its roles. We also learned how to set up and use the DC/OS CLI, how to use HTTP authentication to access the DC/OS APIs, and how to use the DC/OS CLI to automate tasks.

We went through different API endpoints, such as Mesos, Marathon, DC/OS metrics, Exhibitor and DC/OS cluster organisation. Finally, we looked at different tricks and scripts to automate DC/OS, such as DC/OS node details, task exec, the Docker report and DC/OS API HTTP authentication.


Parvez is a DevOps Engineer at Velotio. He is passionate about infrastructure automation and DevOps. He has strong expertise in container orchestrators (DCOS, Kubernetes), Docker, Jenkins, and AWS. A Golang beginner, he likes to play Cricket, Carrom and listen to music as hobbies.