Apache Spark™ unifies the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R. It executes distributed ANSI SQL queries for dashboards and ad-hoc reporting faster than most data warehouses. Users can perform Exploratory Data Analysis (EDA) on petabyte-scale data without resorting to downsampling, and can train machine learning algorithms on a laptop using the same code that scales to fault-tolerant clusters of thousands of machines.
The Kubernetes Operator for Apache Spark, developed by Google Cloud, uses the Kubernetes operator pattern and custom resources to handle Apache Spark applications the same way as other Kubernetes workloads.
You can deploy the Kubernetes Operator for Apache Spark in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.
Warning
If you plan to use this product in production, we recommend configuring it according to the Apache Spark recommendations.
- If you haven’t done so yet, create a Kubernetes cluster and a node group in it.
- Install kubectl and configure it to work with the created cluster.
- Click the button in this card to go to the cluster selection form.
- Select the folder and the cluster, then click Continue.
- Configure the application:
  - Namespace: select a namespace for Apache Spark applications or create a new one.
  - Application name: enter the application name.
- Click Install.
- Wait for the application status to change to Deployed.
- To check that the Kubernetes Operator for Apache Spark is working:
  - Deploy a sample Apache Spark application:

    ```
    kubectl apply -n <namespace> -f - <<EOF
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: "spark:3.5.0-scala2.12-java17-python3-ubuntu"
      imagePullPolicy: Always
      mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
      sparkVersion: "3.5.0"
      restartPolicy:
        type: Never
      volumes:
        - name: "test-volume"
          hostPath:
            path: "/tmp"
            type: Directory
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"
        labels:
          version: 3.5.0
        serviceAccount: spark
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 3.5.0
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
    EOF
    ```

  - Check that a SparkApplication named spark-pi is created:

    ```
    kubectl get sparkapplications spark-pi -n <namespace> -o=yaml
    ```

  - View the spark-pi application's status and events:

    ```
    kubectl describe sparkapplication spark-pi -n <namespace>
    ```
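You can also check the operator's own deployment and the sample job's output directly. The commands below are a sketch under two assumptions not stated in this page: the operator pods carry the conventional `app.kubernetes.io/name=spark-operator` label, and the sample job's driver pod follows the operator's usual `<application>-driver` naming (here `spark-pi-driver`).

```shell
# Check that the operator pod itself is running (the label selector is an
# assumption based on the operator's usual Helm chart conventions):
kubectl get pods -n <namespace> -l app.kubernetes.io/name=spark-operator

# Read the sample job's output from its driver pod; the pod name
# <application>-driver is the operator's usual convention, so for the
# spark-pi sample it is assumed to be spark-pi-driver:
kubectl logs spark-pi-driver -n <namespace>
```

The `kubectl describe sparkapplication` output shows status and events, while the driver pod's logs contain the application's actual stdout (for the pi.py sample, an approximation of Pi).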
The Kubernetes Operator for Apache Spark supports:

- Running Apache Spark applications in a Kubernetes cluster.
- Using cron for scheduled application runs.
- Customizing Apache Spark pods.
- Configuring automatic restart policies for applications.
- Collecting and exporting Apache Spark metrics.
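For scheduled runs, the operator provides a ScheduledSparkApplication custom resource whose `schedule` field takes a cron-style expression. A minimal sketch, reusing the spark-pi sample from the verification steps above (the resource name and schedule are illustrative, not from this page):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled   # illustrative name
spec:
  schedule: "@every 10m"     # standard five-field cron syntax is also accepted
  concurrencyPolicy: Allow
  template:                  # same shape as a SparkApplication spec
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: "spark:3.5.0-scala2.12-java17-python3-ubuntu"
    mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
    sparkVersion: "3.5.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: spark
    executor:
      cores: 1
      instances: 1
      memory: "512m"
```

Each scheduled run creates a regular SparkApplication, so the same `kubectl get sparkapplications` checks apply to the generated runs.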
Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.
| Helm chart | Version | Pull-command | Documentation |
|---|---|---|---|
| cr.nemax.nebius.cloud/yc-marketplace/nebius/spark-operator/chart/spark-operator | 1.1.27 | | Open |
| Docker image | Version | Pull-command |
|---|---|---|
| cr.nemax.nebius.cloud/yc-marketplace/nebius/spark-operator/spark-operator1705507674214086237342862133349223436070191493478 | v1beta2-1.3.8-3.1.1 | |