Useful USG Pro 4 commands and hints

I’ve recently changed from a FreeBSD VM as my router to a Ubiquiti USG Pro 4. It’s a solid device with many great features, and I’m really impressed at how it “just works” in many cases. So far my only disappointment is the lack of documentation about the CLI, especially for debugging, auditing what is occurring on the system, and troubleshooting. This post aggregates some of my knowledge on the topic.

Current config

Show the current config with:

mca-ctrl -t dump-cfg

You can show system status with the “show” command. Pressing ? will cause the current completion options to be displayed. For example:

# show <?>
arp              date             dhcpv6-pd        hardware

DNS

The following commands show the DNS statistics and the DNS configuration, and allow changing the cache-size. The cache-size is measured in the number of records cached, rather than KB/MB. To make this permanent, you need to apply the change to config.json in your controller’s sites folder (a sketch follows the commands below).

show dns forwarding statistics
show system name-server
set service dns forwarding cache-size 10000
clear dns forwarding cache
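
As a sketch, the persistent form mirrors the structure of the config.json fragments shown later in this post (the exact nesting here is my assumption, so verify it against a config dump from your own device):

"service": {
        "dns": {
                "forwarding": {
                        "cache-size": "10000"
                }
        }
},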

Logging

You can see an aggregate of system logs with

show log

Note that when you set firewall rules to “log on block”, the entries go to dmesg, not syslog, so you need to check dmesg for them.
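
For example, a quick way to pull those entries out of the kernel log (the rule-set name below is a placeholder, so substitute one of your own configured rule sets):

# Firewall "log on block" hits land in the kernel ring buffer, not syslog.
dmesg | grep 'WAN_IN'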

It’s a great idea to configure the controller to forward your logs to a syslog server, as this allows you to aggregate and see all the events occurring in a single time series (great when I was diagnosing an issue recently).

Interfaces

To show the system interfaces:

show interfaces

To restart your pppoe dhcp6c:

release dhcpv6-pd interface pppoe0
renew dhcpv6-pd interface pppoe0

There is a current issue where the firmware will start dhcp6c on both eth2 and pppoe0, but the session on eth2 blocks the pppoe0 client. As a result, you need to release on eth2, then renew on pppoe0.
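
For example:

release dhcpv6-pd interface eth2
renew dhcpv6-pd interface pppoe0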

If you are using a dynamic prefix rather than static, you may need to reset your dhcp6c duid.

delete dhcpv6-pd duid

To restart an interface with the vyatta tools:

disconnect interface pppoe
connect interface pppoe

OpenVPN

I have set up customised OpenVPN tunnels. To show these:

show interfaces openvpn detail

These are configured in config.json with:

# Section: config.json - interfaces - openvpn
    "vtun0": {
            "encryption": "aes256",
            # This assigns the interface to the relevant firewall zone.
            "firewall": {
                    "in": {
                            "ipv6-name": "LANv6_IN",
                            "name": "LAN_IN"
                    },
                    "local": {
                            "ipv6-name": "LANv6_LOCAL",
                            "name": "LAN_LOCAL"
                    },
                    "out": {
                            "ipv6-name": "LANv6_OUT",
                            "name": "LAN_OUT"
                    }
            },
            "mode": "server",
            # By default, ubnt adds a number of parameters to the CLI, which
            # you can see with ps | grep openvpn
            "openvpn-option": [
                    # If you are making site to site tunnels, you need the ccd
                    # directory, with hostname for the file name and
                    # definitions such as:
                    # iroute 172.20.0.0 255.255.0.0
                    "--client-config-dir /config/auth/openvpn/ccd",
                    "--keepalive 10 60",
                    "--user nobody",
                    "--group nogroup",
                    "--proto udp",
                    "--port 1195"
            ],
            "server": {
                    "push-route": [
                            "172.24.0.0/17"
                    ],
                    "subnet": "172.24.251.0/24"
            },
            "tls": {
                    "ca-cert-file": "/config/auth/openvpn/vps/vps-ca.crt",
                    "cert-file": "/config/auth/openvpn/vps/vps-server.crt",
                    "dh-file": "/config/auth/openvpn/dh2048.pem",
                    "key-file": "/config/auth/openvpn/vps/vps-server.key"
            }
    },
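
For site to site tunnels, the ccd directory referenced in the comments above holds one file per client, named after the client certificate’s common name. As a sketch (the file name below is a placeholder, and the iroute network is the example from the comments above), such a file might contain:

# /config/auth/openvpn/ccd/branch-office (placeholder name matching the client cert CN)
# iroute tells OpenVPN which networks are reachable behind this client
iroute 172.20.0.0 255.255.0.0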

Netflow

Netflow allows a set of connection tracking data to be sent to a remote host for aggregation and analysis. Sadly this process is mostly undocumented, bar some useful forum commenters. Here is the process that I came up with. This is how you configure it live:

set system flow-accounting interface eth3.11
set system flow-accounting netflow server 172.24.10.22 port 6500
set system flow-accounting netflow version 5
set system flow-accounting netflow sampling-rate 1
set system flow-accounting netflow timeout max-active-life 1
commit

To make this persistent:

"system": {
            "flow-accounting": {
                    "interface": [
                            "eth3.11",
                            "eth3.12"
                    ],
                    "netflow": {
                            "sampling-rate": "1",
                            "version": "5",
                            "server": {
                                    "172.24.10.22": {
                                            "port": "6500"
                                    }
                            },
                            "timeout": {
                                    "max-active-life": "1"
                            }
                    }
            }
    },

To show the current state of your flows:

show flow-accounting
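
If nothing appears at your collector, a simple sanity check is to watch for the export packets on the collector itself (assuming tcpdump is available there and eth0 is the receiving interface, both of which are assumptions for your environment):

# On the netflow collector: confirm UDP export packets arrive on the configured port
tcpdump -n -i eth0 udp port 6500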

The idea of CI and Engineering

In software development I see an interesting trend and push towards continuous integration, continuous testing, and testing in production. These techniques are designed to allow faster feedback on errors, to use real data for application testing, and to deliver features and changes faster.

But is that really how people use software on devices? For an operation like Google or Amazon, this always-online technique may work, but what happens when we apply a continuous integration and “we’ll patch it later” mindset to devices like phones or the internet of things?

What happens in other disciplines?

In real engineering disciplines like aviation or construction, techniques like this don’t really work. We don’t continually build bridges, then fix them when they break or collapse. There are people who provide formal analysis of materials and their characteristics. Engineers consider careful designs, constraints, loads and situations that may occur. The structure is planned, reviewed and verified mathematically. Procedures and oversight are applied to ensure the structure is built correctly. Lessons learnt from past failures and incidents are applied to every layer of the design and construction process. Communication between engineers and many other people is critical to the process. Concerns are always addressed and managed.

The first thing to note is that if we just built lots of scale-model bridges and continually broke them until we found their limits, we would waste an enormous amount of resources. Bridges are instead carefully planned and proven.

So what’s the point with software?

Today we still have a mindset that continually breaking and building is a reasonable path to follow. It’s not! It means that the only way to achieve quality is to have a large test suite (which requires people and time to write), which has to be further derived from failures (and those failures can negatively affect real people), and then we have to apply large amounts of electrical energy to continually run the tests. Even then, the test suites can’t guarantee complete coverage of all situations and occurrences!

This puts CI techniques out of reach of many application developers due to time and energy (translated to dollars) limits. Services like Travis on GitHub certainly help to lower the energy requirement, but they don’t remove the time and test-writing requirements.

No matter how many tests we have for a program, if that program is written in C or something similar, we continue to see faults and security/stability issues in that software.

What if we CI on … a phone?

Today we even have hardware devices that are approached as though “test in production” is a reasonable thing. It’s not! People don’t patch, telcos don’t push updates out to users, and those who are aware have to deploy custom ROMs. This creates an odd dichotomy of “haves” and “have nots”: those with the technical know-how get a better experience, while the “have nots” have to suffer potentially insecure devices. This is especially terrifying given how deeply personal phones are.

This is a reality of our world. People do not patch. They do not patch phones, laptops, network devices and more. Even enterprises will avoid patching if possible. Rather than trying to shift the entire culture of humans to “update always”, we need to write software that can cope in harsh conditions for the long term. We only need to look to software in aviation to see we can absolutely achieve this!

What should we do?

I believe that for software developers to properly become software engineers we should look to engineers in civil and aviation industries. We need to apply:

  • Regulation and ethics (Safety of people is always first)
  • Formal verification
  • Consider all software will run long term (5+ years)
  • Improve team work and collaboration on designs and development

The reality of our world is that people are deploying devices (routers, networks, phones, lights, laptops and more …) that may never be updated or patched in their service life. Even I’m guilty (I have a modem that’s been unpatched for about 6 years, but it’s pretty locked down …). As a result we need to rely on proof that the device cannot fail at build time, rather than patching later, which may never occur! Putting formal verification first, and always considering user safety and rights first, shifts a large burden to us in terms of time. But many tools (Coq, fstar, rust …) make formal verification more accessible to our industry. Verifying our software is a far stronger assertion of quality than “throw tests at it and hope it works”.

You’re crazy William, and also wrong

Am I? Looking at “critical” systems like the iPhone’s encryption hardware, they are running the formally verified seL4. We also heard at Kiwicon in 2018 that Microsoft and Xbox are using formal verification to design the low levels of their systems to prevent exploits from occurring in the first place.

Over time our industry will evolve, and it will become easier and more cost effective to formally verify than to operate and deploy CI. This doesn’t mean we don’t need tests - it means that the first line of quality should be in verification of correctness using formal techniques rather than using tests and CI to prove correct behaviour. Tests are certainly still required to assert further behavioural elements of software.

Today, if you want to do this, you should be looking at Coq and program extraction, fstar and kremlin (Project Everest, a formally verified HTTPS stack), and Rust (which has a subset of the safe language formally proven). I’m sure there are more, but these are the ones I know off the top of my head.

Conclusion

Over time our industry must evolve to put the safety of humans first. To achieve this we must look to other safety-driven cultures such as aviation and civil engineering. Only by learning from their strict disciplines and behaviours can we start to provide software that matches the behavioural and quality expectations humans have for software.

Nextcloud and badrequest filesize incorrect

My friend came to my house and was trying to share some large files with my nextcloud instance. Part way through the upload an error occurred.

"Exception":"Sabre\\DAV\\Exception\\BadRequest","Message":"expected filesize 1768906752 got 1768554496"

It turns out this error can be caused by many sources. It could be timeouts, bad requests, network packet loss, incorrect nextcloud configuration or more.

We tried uploading larger files (by a factor of 10) and they worked. This eliminated timeouts as a cause, and probably network loss. Being on ethernet directly to the server also helps to eliminate packet loss as a cause, compared to, say, the internet.

We also knew that the server must not have been misconfigured because a larger file did upload, so no file or resource limits were being hit.

This also indicated that the client was likely doing the right thing because larger and smaller files would upload correctly. The symptom now only affected a single file.

At this point I realised: what if the client and server were both victims of a lower level issue? I asked my friend to ls the file and read me its size in bytes. It was 1768906752, as expected by nextcloud.

Then I asked him to cat that file into a new file, and to tell me the length of the new file. cat encountered an error, but ls on the new file indeed showed a size of 1768554496. That means filesystem corruption! What could have led to this?
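
The check itself is easy to reproduce from a shell (the file names here are placeholders):

# Compare the size the filesystem reports with the size a full read actually produces
ls -l original.bin
cat original.bin > copy.bin
ls -l copy.bin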

HFS+

Apple’s legacy filesystem (and the reason I stopped using Macs) is well known for silently eating files and corrupting content. Here we had yet another case of that damage occurring and triggering errors elsewhere.

Bisecting these issues and eliminating possibilities through a scientific method is always the best way to find the cause, and it may come from surprising places!

Identity ideas …

I’ve been meaning to write this post for a long time. Taking half a year away from the 389-ds team, and exploring a lot of ideas from other projects has led me to come up with some really interesting ideas about what we do well, and what we don’t. I feel like this blog could be divisive, as I really think that for our services to stay relevant we need to make changes that really change our own identity - so that we can better represent yours.

So strap in, this is going to be long …

What’s currently on the market

Right now the market for identity has two extremes. At one end we have the legacy “create your own” systems, built on technologies like LDAP and Kerberos. I’m thinking about things like 389 Directory Server, OpenLDAP, Active Directory, FreeIPA and more. These all happen to be constrained heavily by complexity, fragility, and administrative workload. You need to spend months to learn these, and even then you will make mistakes and there will be problems.

At the other end we have hosted “Identity as a Service” options like Azure AD and Auth0. These have, very intelligently, unbound themselves from legacy, and tend to offer HTTP APIs, 2FA and other features that “just work”. But they are all in the cloud, and outside your control.

But there is nothing in the middle. There is no option that “just works”, supports modern standards, and is unhindered by legacy that you can self deploy with minimal administrative fuss - or years of experience.

What do I like from 389?

  • Replication

The replication system is extremely robust, and has passed many complex tests for cases of eventual consistency correctness. It’s very rare to hear of any kind of data corruption or loss within our replication system, and that’s testament to the great work of people who spent years looking at the topic.

  • Performance

We aren’t as fast as OpenLDAP in a 1 vs 1 server comparison, but our replication scalability is much higher: in any size of MMR or read-only replica topology, we have greater horizontal scaling, nearly linear with server additions. If you want to run a cloud scale replicated database, we scale to it (and people already do this!).

  • Stability

Our server stability is well known to administrators, and honestly is a huge selling point. We see servers that only go down when administrators are performing upgrades. Our work with sanitising tools and the careful eyes of the team have ensured our code base is reliable and solid. Having extensive tests and amazing dedicated quality engineers also goes a long way.

  • Feature rich

There are a lot of features I really like that are really useful as an admin deploying this service. Things like memberof (which is actually a group resolution cache when you think about it …), automember, online backup, unique attribute enforcement, dereferencing, and more.

  • The team

We have a wonderful team of really smart people, all of whom are caring and want to advance the state of identity management. Not only do they want to keep up with technical changes and excellence, they are listening to and want to improve our social awareness of identity management.

Pain Points

  • C

Because DS is written in C, it’s risky and difficult to make changes. People constantly make mistakes that introduce unsafety (even myself), and worse. No amount of tooling or intelligence can take away the fact that C is just hard to use, that people need to be perfect (people are not perfect!), and that today we have better tools. We cannot spend our time chasing our tails on pointless issues that C creates when we should be doing better things.

  • Everything about dynamic admin, config, and plugins is hard and can’t scale

Because we need to maintain consistency through operations from start to end, but also allow changing config, plugins, and more during the server’s operation, the current locking design just doesn’t scale. It’s also not 100% safe, as the values are changed by atomics rather than managed by transactions. We could use copy-on-write for this, but why? Config should be managed by tools like Ansible, but today our dynamic config and plugins are both a performance overhead and an admin overhead, because we exclude best practice tools and have to spend a large amount of time maintaining consistent data when we shouldn’t need to. Fewer features means less support overhead on us, and is simpler to test and assert quality and correct behaviour.

  • Plugins to address shortfalls, but a bit odd.

We have all these features to address issues, but they all do it … kind of the odd way. Managed Entries creates user private groups on object creation. But the problem is “unix requires a private group” and “ldap schema doesn’t allow a user to be a group and a user at the same time”. So the answer is actually to create a new objectClass that lets a user ALSO be its own UPG, not “create an object that links to the user”. (Or have a client generate the group from user attributes, but we shouldn’t shift responsibility to the client.)

Distributed Numeric Assignment is based on the AD rid model, but it’s all about “how can we assign a value to a user that’s unique?”. We already have a way to do this in the UUID, so why not derive the UID/GID from the UUID? This means there is no complex inter-server communication or pooling, just simple isolated functionality.

We have lots of features that are just a bit complex and could have been made simpler, but now we have to support them and can’t change them to make them better. If we rolled a new “fixed” version, we would then have to support both, because projects like FreeIPA aren’t going to just change over.

  • client tools are controlled by others and complex (sssd, openldap)

Every tool for dealing with ldap is really confusing and arcane. They all have wild (unhelpful) defaults, and generally this scares people off. It took me months of work to get a working ldap server in the past. Why? It’s 2018, things need to “just work”. Our tools should “just work”. Why should I need to hand edit pam? Why do I need to set weird options in sssd.conf? All of this makes the whole experience poor.

We are making client tools that can help (to an extent), but they are really limited to system administration, and they aren’t “generic” tools for every possible configuration that exists. So at some point people will still hit a limit where they have to touch ldap commands. A common request is a simple to use web portal for password resets, which today only really exists in FreeIPA, and that limits its application already.

  • hard to change legacy

It’s really hard to make code changes because our surface area is so broad, and the many use cases mean that we risk breakage every time we do. I have even broken customer deployments like this. It’s almost impossible to get away from, and it holds us back: we are scared to make changes because we have to support the 1 million existing workflows. To add another is more support risk.

Many deployments use legacy schema elements that hold us back, ranging from the inet types, to schema that enforces a first/last name, to schema that won’t express users + groups in a simple way. It’s hard to ask people to just up and migrate their data, and even if we wanted to, ldap allows too much freedom, so we are more likely to break data than migrate it correctly if we tried.

This holds us back from technical changes, and from changes to social representation. People are more likely to engage with a large migrational change than with an incremental change that disturbs their current workflow (i.e. moving from on-prem to cloud, rather than investing in smaller iterative changes to make their local solutions better).

  • ACIs are really complex

389’s access controls are good because they are in the tree and replicated, but bad because the syntax is awful, complex, and full of traps. Even I need to look up how to write them when I have to. This is not good for a project with such deep security concerns, where your ACIs can look correct but actually expose all your data to risk.

  • LDAP as a protocol is like a 90’s drug experience

LDAP may be the lingua franca of authentication, but it’s complex, hard to use and hard to write implementations for. That’s why in open source we have a monoculture of using the openldap client libraries: no one can work out how to write a standalone library. Layer on top the complexity of the object and naming model, and we have a situation where no one wants to interact with LDAP, and rather keeps it at arm’s length.

It’s going to be extremely hard to move forward here, because the community is so fragmented and small, and the working groups so dispersed, that the idea of LDAPv4 is a dream no one should pursue, even though it’s desperately needed.

  • TLS

TLS is great. NSS databases and tools are not.

  • GSSAPI + SSO

GSSAPI and Kerberos are a piece of legacy that we just can’t escape from. They are almost a curse, and one we need to break away from, as they are completely unusable (even if what they promise is amazing). We need to do better.

That and SSO allows loads of attacks to proceed, where we actually want isolated token auth with limited access scopes …

What could we offer

  • Web application as a first class consumer.

People want web portals for their clients, and they want to be able to use web applications as the consumer of authentication. The HTTP protocols must be the first class integration point for anything in identity management today. This means using things like OAUTH/OIDC.

  • Systems security as a first class consumer.

Administrators still need to SSH to machines, and people still need their systems to have identities running on them. Having pam/nsswitch modules is a very major requirement, and those modules have to be fast, simple, and work correctly. Users should “imply” a private group, and the UID/GID should be derived dynamically from the UUID (or admins can override it).

  • 2FA/u2f/TOTP.

Multi-factor auth is here (not coming, here), and we are behind the game. We already have Apple and MS pushing for webauthn in their devices. We need to be there for these standards to work, and to support the next authentication tool after that.

  • Good RADIUS integration.

RADIUS is not going away, and is important to education providers and business networks, so RADIUS must “just work”. Importantly, this means mschapv2, which is the universal default for all clients, which in turn means nthash.

However, we can make the nthash unlinked from your normal password, so you can have a wifi password and a separate login password. We could even generate an nthash containing the TOTP token for higher security environments.

  • better data structure (flat, defined by object types).

The tree structure of LDAP is confusing, but a flatter structure is easier to manage and understand. We can borrow ideas from kubernetes, like tags/labels, which can provide controls and filtering capabilities for searches and for access profiles to apply to.

  • structured logging, with in built performance profiling.

Being able to diagnose why an operation is slow is critical and having structured logs with profiling information is key to allowing admins and developers to resolve performance issues at scale. It’s also critical to have auditing of every single change made in the system, including internal changes that occur during operations.

  • access profiles with auditing capability.

Access profiles that express what you can access, and how. Easier to audit, generate, and should be tightly linked to group membership for real RBAC style capabilities.

  • transactions by allowing batch operations.

LDAP wants to provide a transaction system over a set of operations, but that may cause performance issues on write paths. Instead, why not allow submission of batches of changes that must all occur “at the same time” or not at all? This is faster network-wise and protocol-wise, and simpler for a server to implement.

What’s next then …

Instead of fixing what we have, why not take the best of what we have and offer something new in parallel? Start a new front end that speaks in an accessible way, that has modern structures, and that has learnt from the lessons of the past. We can build it to stand alone, or to proxy from the robust core of 389 Directory Server, allowing migration paths, but eschewing the pain of trying to bring people to the modern world. We can offer something unique: an open source identity system that’s easy to use, fast, and secure, that you can run on your terms, or in the cloud.

This parallel project seems like a good idea … I wonder what to name it …

Work around docker exec bug

There is currently a docker exec bug in CentOS/RHEL 7 that causes errors such as:

# docker exec -i -t instance /bin/sh
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

As a work around you can use nsenter instead:

# Capture the container's init PID, then enter its namespaces with a shell
PID=$(docker inspect --format '{{.State.Pid}}' <name of container>)
nsenter --target $PID --mount --uts --ipc --net --pid /bin/sh

For more information, you can see the bug report here.