# Deploying the RAG pattern

## Overview
This project uses the validated patterns operator, an opinionated GitOps system, to deploy onto an OpenShift cluster.
## Assumptions

### GPUs
The current demonstration relies on flash-attention to decrease memory consumption for the LLM models. Support for this is currently limited to specific Nvidia GPUs. GPUs known to work include:
- Nvidia L40S
- Nvidia A100
- Nvidia H100/H200
Note: V100 GPUs are not supported.
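To see which GPUs a cluster actually has, the NVIDIA GPU Operator labels GPU nodes with `nvidia.com/gpu.product` (an operator convention, not something this pattern defines). The `is_supported_gpu` helper below is purely illustrative, matching product strings against the known-good list above:

```shell
# Hypothetical helper (not part of the pattern): check whether a GPU product
# string matches the known-good list (L40S, A100, H100/H200).
is_supported_gpu() {
  case "$1" in
    *L40S*|*A100*|*H100*|*H200*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a live cluster, inspect the product labels the GPU Operator applies:
#   oc get nodes -L nvidia.com/gpu.product
```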
### GPU pool management (WIP)
The pattern currently allows GPU pools to be managed for scale-out computing via MCAD and InstaScale. Note that this is designed primarily to manage scaling for batch workloads.
This works where:
- The cluster auto-scaler is enabled (e.g. using the assisted installer into your own tenancy on AWS / GCP)
- The cluster is managed via OpenShift Cluster Manager (e.g. ROSA, ARO or OSD)
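For the first case, the cluster auto-scaler is enabled by creating a `ClusterAutoscaler` resource. A minimal sketch, assuming the standard OpenShift `autoscaling.openshift.io/v1` API (the GPU limits shown are illustrative values, not taken from this document):

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: "default"            # the cluster autoscaler is a singleton named "default"
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu # the GPU resource exposed by the device plugin
        min: 0
        max: 4               # illustrative upper bound on GPUs in the pool
```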
### Bootstrap machine
`git` and `podman` are required on the bootstrap machine. A valid `KUBECONFIG` is required, and `oc` (recommended) or `kubectl` is strongly recommended.
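The prerequisites above can be checked up front before starting. A minimal sketch; the `check_tools` helper is illustrative, not part of the pattern:

```shell
# Hypothetical helper: report any missing command-line tools.
check_tools() {
  missing=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; missing=1; }
  done
  return $missing
}

check_tools git podman oc || echo "install the missing tools before continuing"

# A valid kubeconfig is also needed; oc falls back to ~/.kube/config
# when KUBECONFIG is unset.
[ -n "${KUBECONFIG:-}" ] || echo "KUBECONFIG not set; oc will use ~/.kube/config"
```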
## Setup workflow
1. Start `podman`.
2. Fork the git repository.
    - Forking is required in order to customise the configuration.
    - It is significantly easier to start with a public repository.
3. Clone the forked git repository.
4. Customise the `values-global.yaml` (see below).
5. `cp ./values-secret.yaml.template ~/values-secret-gen-llm-rag-pattern.yaml` and fill in the required secrets.
6. `cd` into the repository and run `./pattern.sh make install`.