# Veda
The new setup of my homelab will be based on Kubernetes, which will prevent all of my services from going down when I need to do physical maintenance on a host.
## Services
### Core
- Ceph for all storage: CephFS, object storage, and block storage
- Nextcloud: file storage interface for the entire family
- Jellyfin: Web-based media streaming
- Authentik: Central identification and authentication server
- Nginx reverse proxy
- ACME client: SSL certificate handling
- ArgoCD: Revision control for all Kubernetes configuration
- Home Assistant + Zigbee2MQTT
- Prometheus
- Grafana
- Grafana Loki + FluentD
- Cilium
- Harbor: Container image storage
### Nice-to-have
- Jellyseerr: Nice interface to request movies and series
- Sonarr: Automated downloading and handling of series
- Radarr: Automated downloading and handling of movies
- FlareSolverr: Fetching data hidden behind CAPTCHAs
- Torrent client (qBittorrent): To download all the Linux ISOs
- ExternalDNS
- Paperless-ngx
### Look-into-later
- Mastodon: Federated social platform
- Forgejo: Git platform. Maybe this should not be hosted on the cluster, as the cluster will depend on it.
- CloudNativePG: K8s operator for PostgreSQL
## Installing
### Configuration
```bash
export CLUSTER_NAME="veda"
export API_ENDPOINT="https://192.168.0.1:6443"
```
```bash
talosctl gen secrets --output-file secrets.yaml
```
```bash
talosctl gen config \
  --with-secrets secrets.yaml \
  --output-types talosconfig \
  --output talosconfig \
  $CLUSTER_NAME \
  $API_ENDPOINT
```
```bash
talosctl config merge ./talosconfig
```
Then correct the endpoint in the Talos client configuration:
```yaml
# ~/.talos/config
context: veda
contexts:
  veda:
    endpoints:
      - 192.168.0.1
    # (...)
```
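Optionally, verify the client configuration; `talosctl config info` prints the active context, endpoints, and certificate expiry:
```bash
talosctl config info
```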
For controlplane nodes:
```bash
talosctl gen config \
  --output rendered/master1.yaml \
  --output-types controlplane \
  --with-secrets secrets.yaml \
  --config-patch @nodes/master1.yaml \
  --config-patch @patches/network.yaml \
  --config-patch @patches/scheduling.yaml \
  --config-patch @patches/discovery.yaml \
  --config-patch @patches/diskselector.yaml \
  --config-patch @patches/vip.yaml \
  --config-patch @patches/metrics.yaml \
  --config-patch @patches/hostpath.yaml \
  $CLUSTER_NAME \
  $API_ENDPOINT
```
For worker nodes:
```bash
talosctl gen config \
  --output rendered/worker1.yaml \
  --output-types worker \
  --with-secrets secrets.yaml \
  --config-patch @nodes/worker1.yaml \
  --config-patch @patches/network.yaml \
  --config-patch @patches/scheduling.yaml \
  --config-patch @patches/discovery.yaml \
  --config-patch @patches/diskselector.yaml \
  --config-patch @patches/metrics.yaml \
  --config-patch @patches/hostpath.yaml \
  $CLUSTER_NAME \
  $API_ENDPOINT
```
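The files under `nodes/` and `patches/` are ordinary Talos machine-config patches; their actual contents live in this repo. Purely as a hypothetical sketch, a per-node patch might pin the hostname and a static address:
```yaml
# Hypothetical sketch of a nodes/worker1.yaml patch; not the real file.
machine:
  network:
    hostname: worker1
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 192.168.0.20/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.0.1
```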
### Bootstrapping
Apply the configuration to each node:
```bash
talosctl apply-config --insecure --file rendered/master1.yaml --nodes 192.168.0.10
```
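With more nodes, a small loop avoids repeating the command; the names and IPs below are illustrative:
```bash
# Illustrative only: adjust names and IPs to match your rendered configs.
declare -A nodes=(
  [master1]=192.168.0.10
  [worker1]=192.168.0.20
)
for name in "${!nodes[@]}"; do
  talosctl apply-config --insecure \
    --file "rendered/${name}.yaml" \
    --nodes "${nodes[$name]}"
done
```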
Optionally, check the status.
Point the Talos API endpoint directly at the node, since etcd, and thereby kube-vip, is not up yet:
```bash
talosctl -n 192.168.0.10 -e 192.168.0.10 dashboard
```
To start the cluster, we need to bootstrap the etcd cluster.
This only has to be done on a single node.
```bash
talosctl -n 192.168.0.10 -e 192.168.0.10 bootstrap
```
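Afterwards, you can confirm that etcd came up on the bootstrapped node, for example by checking its service status:
```bash
talosctl -n 192.168.0.10 -e 192.168.0.10 service etcd
```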
Finally, retrieve the kubeconfig; it will be merged into `~/.kube/config` if that file exists.
```bash
talosctl -n 192.168.0.10 kubeconfig
```
Check the nodes and note their NotReady status, since the Cilium CNI is not running yet:
```bash
kubectl get nodes
```
Install the Gateway API:
```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
```
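The Gateway API CRDs should now exist; a quick way to check:
```bash
kubectl get crd | grep gateway.networking.k8s.io
```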
Install Cilium:
```bash
bash scripts/cilium.sh
```
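The script itself is not reproduced here. As a rough sketch, a Talos-compatible Cilium install via Helm (values taken from the Talos documentation; the exact version and settings in `scripts/cilium.sh` may differ) could look like:
```bash
# Sketch only; the authoritative install is scripts/cilium.sh.
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set gatewayAPI.enabled=true \
  --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
  --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup \
  --set k8sServiceHost=localhost \
  --set k8sServicePort=7445  # KubePrism endpoint on Talos
```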
## TODO
- Remove secrets from config
## Misc
### Applying patches
```bash
talosctl patch machineconfig -p @argocd.yaml -n 192.168.0.0
```
### Reset node
```bash
talosctl reset --system-labels-to-wipe EPHEMERAL,STATE --reboot -n 192.168.0.0
```
### ArgoCD default login
User: `admin`; the password can be retrieved with (ignore the `%` at the end of the output):
```bash
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
```
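Per the Argo CD documentation, this initial secret can be deleted once the admin password has been changed:
```bash
kubectl -n argocd delete secret argocd-initial-admin-secret
```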
### Ceph default login
User: `admin` on [http://ceph.noxxos.nl](http://ceph.noxxos.nl). The password can be retrieved with:
```bash
kubectl -n ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo
```
### Wiping disks for Ceph
Start a temporary pod on each node where the disks are:
```bash
kubectl run -it --rm \
  -n ceph \
  --image quay.io/ceph/ceph:v19.2.2 \
  --privileged \
  --overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname": "master3"}}}' fix
```
Search for the correct disk with `blkid`, set `DISK=/dev/sdX`, then run (some of) the following commands:
```bash
ceph-volume lvm zap $DISK --destroy
wipefs -a $DISK
# Zap the disk to a fresh, usable state (zap-all is important, b/c MBR has to be clean)
sgdisk --zap-all $DISK
# Wipe portions of the disk to remove more LVM metadata that may be present
dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=0 # Clear at offset 0
dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=$((1 * 1024**2)) # Clear at offset 1GB
dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=$((10 * 1024**2)) # Clear at offset 10GB
dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=$((100 * 1024**2)) # Clear at offset 100GB
dd if=/dev/zero of="$DISK" bs=1K count=200 oflag=direct,dsync seek=$((1000 * 1024**2)) # Clear at offset 1000GB
# SSDs may be better cleaned with blkdiscard instead of dd
blkdiscard $DISK
# Inform the OS of partition table changes
partprobe $DISK
```
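To verify the disk is clean, these should report no remaining signatures or partitions:
```bash
# No output from wipefs, and no FSTYPE or children from lsblk, means the disk is clean
wipefs "$DISK"
lsblk -f "$DISK"
```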
### Certificate lifetimes
Talos Linux automatically manages and rotates all server-side certificates for etcd, Kubernetes, and the Talos API. Note, however, that the kubelet needs to be restarted at least once a year for its certificates to be rotated; any upgrade or reboot of the node suffices.
You can check the Kubernetes certificates with the command `talosctl get KubernetesDynamicCerts -o yaml` on the controlplane.
Client certificates (talosconfig and kubeconfig) are the user's responsibility. Each time you download the kubeconfig file from a Talos Linux cluster, the client certificate is regenerated, giving you a kubeconfig that is valid for a year.
The talosconfig file should be renewed at least once a year, using the `talosctl config new` command.
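For example (flags per `talosctl config new`; the TTL shown is an assumption matching the one-year default):
```bash
# Request a fresh client certificate from the cluster and merge it
talosctl -n 192.168.0.10 config new talosconfig-new --roles os:admin --crt-ttl 8760h
talosctl config merge ./talosconfig-new
```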
### Ceph host networking
For some reason, the Ceph object gateway is not configured properly in the dashboard.
[See this issue for similar symptoms](https://github.com/rook/rook/issues/12099).