import org.apache.spark.sql.types._ val schema = StructType(Seq( StructField("name", StringType), StructField("age", IntegerType), StructField("address", StructType(Seq( StructField("city", StringType), StructField("zip", LongType) ))) ))
val rdd = sc.parallelize(1 to 4) rdd.map(x => x * 2) // 2,4,6,8 rdd.flatMap(x => 1 to x) // 1,1,2,1,2,3,1,2,3,4 rdd.mapPartitions(iter => iter.map(_ * 2)) // same as map but per partition Spark uses lineage (RDD dependency graph). Each RDD remembers how it was built from other datasets. If a partition is lost, Spark recomputes it using the lineage, not replication. However, you can also cache/persist with replication (e.g., StorageLevel.MEMORY_AND_DISK_2 ). Apache Spark Scala Interview Questions- Shyam Mallesh
breaks long lineages by saving RDD to reliable storage (HDFS/S3). ✅ 3. What is the difference between cache() , persist() , and checkpoint() ? | Method | Storage Level | Purpose | |--------------|------------------------------|---------| | cache() | MEMORY_ONLY (default) | Speed up repeated actions | | persist() | Choose level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.) | Fine-grained control over eviction | | checkpoint() | Saves to HDFS/S3 (reliable storage) | Break lineage, reduce driver memory, avoid recomputation chain | 💡 Use persist when memory is limited. Use checkpoint for long iterative algorithms (ML, GraphX). ✅ 4. Explain how Spark evaluates transformations and actions. Spark uses lazy evaluation – transformations build DAG but no data is processed until an action ( count , collect , save , show , etc.) is called. import org
✅ ✅ 6. How do you handle skewed data in Spark? Skewed keys cause a few partitions to receive most of the data → slow tasks. However, you can also cache/persist with replication (e