Apache Spark™ unifies the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R. It executes distributed ANSI SQL queries for dashboards and ad-hoc reporting faster than most data warehouses. Users can perform Exploratory Data Analysis (EDA) on petabyte-scale data without resorting to downsampling, and can train machine learning algorithms on a laptop using the same code that scales to fault-tolerant clusters of thousands of machines.
The Kubernetes Operator for Apache Spark, developed by Google Cloud, uses the Kubernetes operator pattern and custom resources to handle Apache Spark applications the same way as other Kubernetes workloads.
You can deploy the Kubernetes Operator for Apache Spark in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.
Warning
If you plan to use this product in production, we recommend configuring it according to the Apache Spark recommendations.
- If you haven’t done so yet, create a Kubernetes cluster and a node group in it.
- Install kubectl and configure it to work with the created cluster.
- Click the button in this card to go to the cluster selection form.
- Select the folder and the cluster, then click Continue.
- Configure the application:
  - Namespace: select a namespace for Apache Spark applications or create a new one.
  - Application name: enter the application name.
- Click Install.
- Wait for the application status to change to Deployed.
- To check that the Kubernetes Operator for Apache Spark is working:
  - Deploy a sample Apache Spark application:

    ```
    kubectl apply -n <namespace> -f - <<EOF
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: "spark:3.5.0-scala2.12-java17-python3-ubuntu"
      imagePullPolicy: Always
      mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
      sparkVersion: "3.5.0"
      restartPolicy:
        type: Never
      volumes:
        - name: "test-volume"
          hostPath:
            path: "/tmp"
            type: Directory
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"
        labels:
          version: 3.5.0
        serviceAccount: spark
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 3.5.0
        volumeMounts:
          - name: "test-volume"
            mountPath: "/tmp"
    EOF
    ```

  - Check that a SparkApplication named spark-pi is created:

    ```
    kubectl get sparkapplications spark-pi -n <namespace> -o=yaml
    ```

  - View the spark-pi application's status and events:

    ```
    kubectl describe sparkapplication spark-pi -n <namespace>
    ```
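You can also check the operator's own deployment and the sample job's output directly. The commands below are a sketch under two assumptions not stated in this page: the operator pods carry the conventional `app.kubernetes.io/name=spark-operator` label, and the sample job's driver pod follows the operator's usual `<application>-driver` naming (here `spark-pi-driver`).

```shell
# Check that the operator pod itself is running (the label selector is an
# assumption based on the operator's usual Helm chart conventions):
kubectl get pods -n <namespace> -l app.kubernetes.io/name=spark-operator

# Read the sample job's output from its driver pod; the pod name
# <application>-driver is the operator's usual convention, so for the
# spark-pi sample it is assumed to be spark-pi-driver:
kubectl logs spark-pi-driver -n <namespace>
```

The `kubectl describe sparkapplication` output shows status and events, while the driver pod's logs contain the application's actual stdout (for the pi.py sample, an approximation of Pi).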
The Kubernetes Operator for Apache Spark supports:

- Running Apache Spark applications in a Kubernetes cluster.
- Using cron for scheduled application runs.
- Customizing Apache Spark pods.
- Configuring automatic restart policies for applications.
- Collecting and exporting Apache Spark metrics.
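For scheduled runs, the operator provides a ScheduledSparkApplication custom resource whose `schedule` field takes a cron-style expression. A minimal sketch, reusing the spark-pi sample from the verification steps above (the resource name and schedule are illustrative, not from this page):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled   # illustrative name
spec:
  schedule: "@every 10m"     # standard five-field cron syntax is also accepted
  concurrencyPolicy: Allow
  template:                  # same shape as a SparkApplication spec
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: "spark:3.5.0-scala2.12-java17-python3-ubuntu"
    mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
    sparkVersion: "3.5.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: spark
    executor:
      cores: 1
      instances: 1
      memory: "512m"
```

Each scheduled run creates a regular SparkApplication, so the same `kubectl get sparkapplications` checks apply to the generated runs.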
Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.
| Helm chart | Version | Pull-command | Documentation |
|---|---|---|---|
| cr.nemax.nebius.cloud/yc-marketplace/nebius/spark-operator/chart/spark-operator | 1.1.27 | | Open |
| Docker image | Version | Pull-command |
|---|---|---|
| cr.nemax.nebius.cloud/yc-marketplace/nebius/spark-operator/spark-operator1705507674214086237342862133349223436070191493478 | v1beta2-1.3.8-3.1.1 | |