# Deploying the RAG pattern

## Overview
This project uses the validated patterns operator, an opinionated GitOps system, to deploy onto an OpenShift cluster.
## Assumptions

### GPUs
The current demonstration relies on flash-attention to decrease memory consumption for the LLM models. Support for this is currently limited to specific Nvidia GPUs. GPUs known to work include:
- Nvidia L40S
- Nvidia A100
- Nvidia H100/H200
Note: V100 GPUs are not supported.
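To see which GPUs a cluster actually has, the NVIDIA GPU Operator labels GPU nodes with `nvidia.com/gpu.product` (an operator convention, not something this pattern defines). The `is_supported_gpu` helper below is purely illustrative, matching product strings against the known-good list above:

```shell
# Hypothetical helper (not part of the pattern): check whether a GPU product
# string matches the known-good list (L40S, A100, H100/H200).
is_supported_gpu() {
  case "$1" in
    *L40S*|*A100*|*H100*|*H200*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a live cluster, inspect the product labels the GPU Operator applies:
#   oc get nodes -L nvidia.com/gpu.product
```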
### GPU pool management (WIP)
The pattern currently allows GPU pools to be managed for scale-out computing via MCAD and InstaScale. Note that this is designed primarily to manage scaling for batch workloads.
This works where:
- The cluster auto-scaler is enabled (e.g. using the assisted installer into your own tenancy on AWS / GCP)
- The cluster is managed via OpenShift Cluster Manager (e.g. ROSA, ARO or OSD)
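For the first case, the cluster auto-scaler is enabled by creating a `ClusterAutoscaler` resource. A minimal sketch, assuming the standard OpenShift `autoscaling.openshift.io/v1` API (the GPU limits shown are illustrative values, not taken from this document):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: "default"            # the cluster autoscaler is a singleton named "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu # the GPU resource exposed by the device plugin
        min: 0
        max: 4               # illustrative upper bound on GPUs in the pool
```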
### Bootstrap machine
`git` and `podman` are required on the bootstrap machine. A valid `KUBECONFIG` is required, and `oc` (recommended) or `kubectl` is strongly recommended.
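The prerequisites above can be checked up front before starting. A minimal sketch; the `check_tools` helper is illustrative, not part of the pattern:

```shell
# Hypothetical helper: report any missing command-line tools.
check_tools() {
  missing=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; missing=1; }
  done
  return $missing
}

check_tools git podman oc || echo "install the missing tools before continuing"

# A valid kubeconfig is also needed; oc falls back to ~/.kube/config
# when KUBECONFIG is unset.
[ -n "${KUBECONFIG:-}" ] || echo "KUBECONFIG not set; oc will use ~/.kube/config"
```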
## Setup workflow
1. Start `podman`.
2. Fork the git repository.
    - Forking is required in order to customise the configuration.
    - It is significantly easier to start with a public repository.
3. Clone the forked git repository.
4. Customise the `values-global.yaml` (see below).
5. `cp ./values-secret.yaml.template ~/values-secret-gen-llm-rag-pattern.yaml` and fill in the required secrets.
6. `cd` into the repository and run `./pattern.sh make install`.