
[Oct-2022] Associate-Developer-Apache-Spark Dumps PDF - Associate-Developer-Apache-Spark Real Exam Questions Answers
Is Databricks Associate Developer Apache Spark Exam worth it?
Read Our Databricks Associate Developer Apache Spark Exam Guide That Will Help You Pass in First Try
Do you want to know everything you need to know in order to pass the Databricks Associate Developer Apache Spark exam in less than an hour?
It's always a good idea to know what the best material for preparing yourself for an exam is. In fact, it's a requirement. When it comes to the Apache Spark exam, however, many people don't really know where to begin in order to prepare themselves. They get stuck feeling overwhelmed and confused.
If you're like most people out there who have been struggling with taking the exam, you've probably been feeling like you have no idea what to do. And the more time you spend trying to figure this out, the longer it will take you to pass the test.
In this article, I describe how I prepared for this test and how you can do the same. Databricks Associate Developer Apache Spark exam dumps can help you pass the exam on the first try.
So if you want to know everything you need to know in order to pass the Databricks Associate Developer Apache Spark exam in less than an hour, then read on…
What is the Databricks Associate Developer Apache Spark Exam?
The Databricks Associate Developer Apache Spark Exam is a certification that can be earned by anyone who has successfully completed the Databricks Associate Developer Apache Spark Certification Training. The exam covers all the material that was covered in the training. The exam is designed to test your knowledge of the concepts, skills, and abilities that you learned during the course.
Do you want to become a Data Engineer or a Spark Architect? If so, the Databricks Associate Developer Apache Spark Exam is a must-pass. It is designed to help you develop a complete understanding of the technology used by the Databricks platform. You will learn about the basics of Spark, including the Spark programming model, Spark SQL, Spark Streaming, and the Spark ecosystem. Databricks Associate Developer Apache Spark exam dumps are the choice of champions.
The Databricks Associate Developer Apache Spark Exam is a test that aims to assess whether you have the knowledge required to become a certified Apache Spark developer. The Databricks Associate Developer Apache Spark Exam consists of two parts: the first part tests your knowledge of the fundamentals of the Apache Spark framework and the second part tests your ability to apply this knowledge. This post will help you get a head start in preparing for the Databricks Associate Developer Apache Spark Exam.
NEW QUESTION 58
Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
- A. from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
- B. transactionsDf.cache()
- C. transactionsDf.clear_persist()
- D. transactionsDf.persist()
- E. transactionsDf.storage_level('MEMORY_ONLY')
- F. from pyspark import StorageLevel
transactionsDf.cache(StorageLevel.MEMORY_ONLY)
Answer: A
Explanation:
from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.
transactionsDf.cache()
This is wrong because the default storage level of DataFrame.cache() is MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.
transactionsDf.persist()
This is wrong because the default storage level of DataFrame.persist() is MEMORY_AND_DISK.
transactionsDf.clear_persist()
Incorrect, since clear_persist() is not a method of DataFrame.
transactionsDf.storage_level('MEMORY_ONLY')
Wrong. storage_level is not a method of DataFrame.
More info: RDD Programming Guide - Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist - PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)
NEW QUESTION 59
The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
Excerpt of DataFrame transactionsDf:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))
- A. The select operator should be replaced by the drop operator.
- B. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator so they should be expressed like drop(col(predError), col(productId), col(value)).
- C. The select operator should be replaced with the deselect operator.
- D. The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).
- E. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value as strings.
Answer: E
Explanation:
Correct code block:
transactionsDf.drop("predError", "productId", "value")
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 60
Which of the following code blocks reads JSON file imports.json into a DataFrame?
- A. spark.read().mode("json").path("/FileStore/imports.json")
- B. spark.read.format("json").path("/FileStore/imports.json")
- C. spark.read("json", "/FileStore/imports.json")
- D. spark.read().json("/FileStore/imports.json")
- E. spark.read.json("/FileStore/imports.json")
Answer: E
Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/25.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 61
The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
- A. 1. where
2. col("storeId")
3. geq(20).leq(30)
4. &
5. col("productId")==2
- B. 1. select
2. col("storeId")
3. between(20, 30)
4. &&
5. col("productId")=2
- C. 1. select
2. col("storeId")
3. between(20, 30)
4. and
5. col("productId")==2
- D. 1. select
2. col("storeId")
3. between(20, 30)
4. &
5. col("productId")==2
- E. 1. select
2. "storeId"
3. between(20, 30)
4. &&
5. col("productId")==2
Answer: D
Explanation:
Correct code block:
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId")==2))
Although this question may make you think that it asks for a filter or where statement, it does not. It asks explicitly to return a column with booleans - this should point you to the select statement.
Another trick here is the rarely used between() method. It exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist.
Another riddle here is how to chain the two conditions. The only valid answer here is &. Operators like && or and are not valid. Other boolean operators that would be valid in Spark are | and ~.
Static notebook | Dynamic notebook: See test 1
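Why & is accepted where and is not comes down to Python operator overloading: a class can define what &, | and ~ do, but and always asks its left operand for a plain truth value. The following is a minimal sketch in plain Python, assuming a toy Expr class invented for illustration (it is not the PySpark Column class) that renders conditions as SQL-like strings.

```python
class Expr:
    """Toy column expression that renders itself as a SQL-like string."""

    def __init__(self, sql):
        self.sql = sql

    def between(self, low, high):
        # Inclusive range check, rendered the way the explanation describes it.
        return Expr(f"(({self.sql} >= {low}) AND ({self.sql} <= {high}))")

    def __eq__(self, other):
        # `==` builds a new expression instead of returning True or False.
        return Expr(f"({self.sql} = {other})")

    def __and__(self, other):
        # `&` can be overloaded, so it can combine two expressions.
        return Expr(f"({self.sql} AND {other.sql})")

    def __bool__(self):
        # `and` calls bool() on its left operand and cannot be overloaded.
        raise ValueError("cannot convert an expression to bool; use & instead")


combined = Expr("storeId").between(20, 30) & (Expr("productId") == 2)
print(combined.sql)

try:
    Expr("storeId").between(20, 30) and (Expr("productId") == 2)
    and_error = None
except ValueError as err:
    and_error = str(err)
print(and_error)
```

The parentheses around the equality check matter because & binds more tightly than == in Python; without them the expression would be grouped differently.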
NEW QUESTION 62
The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
- A. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
- B. The column names should be listed directly as arguments to the operator and not as a list.
- C. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
- D. The select operator should be replaced by a drop operator.
- E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
Answer: E
Explanation:
Correct code block:
transactionsDf.drop("productId", "f")
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
Correct! Here, you need to figure out the many, many things that are wrong with the initial code block. While the question can be solved by using a select statement, a drop statement, given the answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.
The column names should be listed directly as arguments to the operator and not as a list.
Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.
The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named productId instead of telling Spark to use the column productId - for that, you need to express it as a string.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
No. This still leaves you with Python trying to interpret the column names as Python variables (see above).
The select operator should be replaced by a drop operator.
Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
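The point about bare names versus strings can be checked without Spark. This is a small sketch in plain Python, assuming a hypothetical col() stand-in defined only for illustration; the failure happens before the function is even called, which is why the same thing occurs with the real function.

```python
def col(name):
    # Hypothetical stand-in for pyspark.sql.functions.col, for illustration only.
    return f"Column<{name}>"


try:
    col(productId)  # bare name: Python looks for a variable called productId
    outcome = "no error"
except NameError as err:
    outcome = f"NameError: {err}"

print(outcome)
print(col("productId"))  # string literal: the name reaches col() unchanged
```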
NEW QUESTION 63
Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?
- A. counter = 0

for index, row in itemsDf.iterrows():
    if 'Inc.' in row['supplier']:
        counter = counter + 1

print(counter)
- B. print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())
- C. counter = 0

def count(x):
    if 'Inc.' in x['supplier']:
        counter = counter + 1

itemsDf.foreach(count)
print(counter)
- D. accum=sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)
- E. print(itemsDf.foreach(lambda x: 'Inc.' in x))
Answer: D
Explanation:
Correct code block:
accum=sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)
To answer this question correctly, you need to know both about the DataFrame.foreach() method and accumulators.
When Spark runs the code, it executes it on the executors. The executors do not have any information about variables outside of their scope. This is why simply using a Python variable counter, like in the two examples that start with counter = 0, will not work. You need to tell the executors explicitly that counter is a special shared variable, an Accumulator, which is managed by the driver and can be accessed by all executors for the purpose of adding to it.
If you have used Pandas in the past, you might be familiar with the iterrows() command. Notice that there is no such command in PySpark.
The two examples that start with print do not work, since DataFrame.foreach() does not have a return value.
More info: pyspark.sql.DataFrame.foreach - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
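The counter-based option has a second problem that shows up even without a cluster: assigning to counter inside the function makes it a local name. The sketch below is plain Python and only models the situation; ToyAccumulator is a hypothetical add-only class written for this example, not Spark's Accumulator, and everything here runs in one process rather than on executors.

```python
rows = [
    {"supplier": "Sports Company Inc."},
    {"supplier": "YetiX"},
    {"supplier": "Sports Company Inc."},
]

counter = 0


def count(row):
    if "Inc." in row["supplier"]:
        # Assigning to counter makes it local to count(), so reading it fails.
        counter = counter + 1


try:
    for row in rows:
        count(row)
    plain_result = counter
except UnboundLocalError as err:
    plain_result = type(err).__name__


class ToyAccumulator:
    """Add-only shared counter; a hypothetical stand-in for a Spark Accumulator."""

    def __init__(self, start):
        self.value = start

    def add(self, amount):
        self.value += amount


accum = ToyAccumulator(0)


def check_if_inc_in_supplier(row):
    if "Inc." in row["supplier"]:
        accum.add(1)  # mutates the shared object instead of rebinding a name


for row in rows:
    check_if_inc_in_supplier(row)

print(plain_result)
print(accum.value)
```

In real Spark code the shared object comes from sc.accumulator(0), as in the correct answer, because each executor otherwise works on its own copy of driver-side variables.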
NEW QUESTION 64
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
- A. transactionsDf.dropDuplicates().agg(count("storeId"))
- B. transactionsDf.select(distinct("storeId")).count()
- C. transactionsDf.select("storeId").dropDuplicates().count()
- D. transactionsDf.select(count("storeId")).dropDuplicates()
- E. transactionsDf.distinct().select("storeId").count()
Answer: C
Explanation:
transactionsDf.select("storeId").dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count("storeId")).dropDuplicates()
No. transactionsDf.select(count("storeId")) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count("storeId"))
Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so taking only column storeId into consideration, but eliminates full row duplicates instead.
transactionsDf.distinct().select("storeId").count()
Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count not represent the number of unique values in that column.
transactionsDf.select(distinct("storeId")).count()
False. There is no distinct method in pyspark.sql.functions.
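The difference between deduplicating one column and deduplicating whole rows can be illustrated with ordinary Python sets. This is a sketch only: the rows reuse the transactionsDf sample shown for question 62, and the set operations are stand-ins for select(), dropDuplicates() and distinct(), not Spark calls.

```python
COLUMNS = ("transactionId", "predError", "value", "storeId", "productId", "f")
rows = [
    (1, 3, 4, 25, 1, None),
    (2, 6, 7, 2, 2, None),
    (3, 3, None, 25, 3, None),
]
store_id = COLUMNS.index("storeId")

# Models select("storeId").dropDuplicates().count(): dedupe the column values.
unique_store_ids = len({row[store_id] for row in rows})

# Models distinct().select("storeId").count(): dedupe full rows first,
# then count whatever storeId values are left, repeats included.
rows_after_full_distinct = len([row[store_id] for row in set(rows)])

print(unique_store_ids, rows_after_full_distinct)
```

All three sample rows differ in at least one column, so full-row deduplication removes nothing and the second count overstates the number of unique store IDs.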
NEW QUESTION 65
The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.
Code block:
transactionsDf.orderBy('value', asc_nulls_first(col('predError')))
- A. Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
- B. Column value should be wrapped by the col() operator.
- C. Column predError should be sorted in a descending way, putting nulls last.
- D. Instead of orderBy, sort should be used.
- E. Column predError should be sorted by desc_nulls_first() instead.
Answer: C
Explanation:
Correct code block:
transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse sort of the default sort is indeed desc_nulls_last.
Instead of orderBy, sort should be used.
No. DataFrame.sort() is an alias of DataFrame.orderBy(): both produce a global sort, so swapping one for the other changes nothing. (Sorting only within each partition is what DataFrame.sortWithinPartitions() does.)
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() accepts both string and Column objects.
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls would have to come last when inverted.
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No, this would just sort the DataFrame by the very last column, but would not take information from both columns into account, as noted in the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
Static notebook | Dynamic notebook: See test 3
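To see what "ascending, nulls first" followed by "descending, nulls last" means for row order, here is a plain-Python model using sorted(). The sample rows and the two key helpers are made up for illustration; they mimic the ordering rules described above and are not Spark functions.

```python
rows = [
    {"value": 4, "predError": 3},
    {"value": 7, "predError": 6},
    {"value": None, "predError": 3},
    {"value": 4, "predError": None},
    {"value": 4, "predError": 9},
]


def asc_nulls_first_key(x):
    # False sorts before True, so missing values come first.
    return (x is not None, x if x is not None else 0)


def desc_nulls_last_key(x):
    # Negating reverses numeric order; missing values sort after everything.
    return (x is None, -x if x is not None else 0)


ordered = sorted(
    rows,
    key=lambda r: (asc_nulls_first_key(r["value"]), desc_nulls_last_key(r["predError"])),
)
result = [(r["value"], r["predError"]) for r in ordered]
print(result)
```

Within the group of rows that share value 4, predError runs 9, 3, then the missing value, which is the inverse of the ordering applied to column value.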
NEW QUESTION 66
Which of the following describes a shuffle?
- A. A shuffle is a process that compares data across partitions.
- B. A shuffle is a process that compares data across executors.
- C. A shuffle is a Spark operation that results from DataFrame.coalesce().
- D. A shuffle is a process that is executed during a broadcast hash join.
- E. A shuffle is a process that allocates partitions to executors.
Answer: A
Explanation:
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small in size (<= 10 MB by default). Broadcast hash joins can avoid shuffles because instead of exchanging partitions between executors, they broadcast a small table to all executors that then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No, in a shuffle, data is compared across partitions, and not executors.
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)
NEW QUESTION 67
Which of the following DataFrame operators is never classified as a wide transformation?
- A. DataFrame.sort()
- B. DataFrame.repartition()
- C. DataFrame.aggregate()
- D. DataFrame.select()
- E. DataFrame.join()
Answer: D
Explanation:
As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide transformation.
DataFrame.select()
Correct! A wide transformation includes a shuffle, meaning that a single input partition can contribute to multiple output partitions. This is expensive and causes traffic across the cluster. With the select() operation, however, you tell Spark to perform an operation on a specific slice of each partition. For this, Spark does not need to exchange data across partitions; each partition can be worked on independently. Thus, you do not cause a wide transformation.
DataFrame.repartition()
Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation.
DataFrame.aggregate()
No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.
DataFrame.join()
Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation.
DataFrame.sort()
False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation.
More info: Understanding Apache Spark Shuffle | Philipp Brunenberg
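The narrow versus wide distinction can be modelled with lists of lists standing in for partitions. This is a sketch in plain Python under simplifying assumptions: the data, the redistribute_by_key helper and its ord-based routing rule are invented for illustration and are not how Spark's partitioner is implemented.

```python
partitions = [
    [("a", 1), ("b", 2)],  # input partition 0
    [("a", 3), ("c", 4)],  # input partition 1
]

# Narrow: like select(), each output partition is built from one input partition.
narrow = [[value for _key, value in part] for part in partitions]


def redistribute_by_key(parts, num_out):
    """Route every row to an output partition chosen from its key."""
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = ord(key) % num_out  # simple deterministic stand-in for hashing
            out[target].append((key, value))
    return out


# Wide: rows with the same key must meet, so rows cross partition boundaries.
wide = redistribute_by_key(partitions, 2)

print(narrow)
print(wide)
```

Output partition 1 of the wide result holds rows that came from both input partitions, which is the data movement the explanations above call a shuffle.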
NEW QUESTION 68
Which of the following describes how Spark achieves fault tolerance?
- A. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.
- B. Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
- C. Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
- D. Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
- E. Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
Answer: A
Explanation:
Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
Wrong - Between transformations, DataFrames are immutable. Given that Spark also records the lineage, Spark can reproduce any DataFrame in case of failure. These two aspects are the key to understanding fault tolerance in Spark.
Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
Wrong. RDD stands for Resilient Distributed Dataset and it is at the core of Spark and not a "legacy system". It is fault-tolerant by design.
Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
This is not true. For supporting recovery in case of worker failures, Spark provides "_2", "_3", and so on, storage level options, for example MEMORY_AND_DISK_2. These storage levels are specifically designed to keep duplicates of the data on multiple nodes. This saves time in case of a worker fault, since a copy of the data can be used immediately, vs. having to recompute it first.
Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
No, Spark is fault-tolerant by design.
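Recomputation from lineage, as described in the correct answer, can be sketched in a few lines of plain Python. The Dataset class below is a toy written for this example (not an RDD): each node remembers its parent and the function applied to it, so a lost partition can be rebuilt from the source.

```python
class Dataset:
    """Toy lineage node: remembers its parent and the function applied to it."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # list of partitions, only set on the root node
        self.parent = parent
        self.fn = fn
        self.cache = {}  # partition index -> computed rows (can be lost)

    def map(self, fn):
        # Records the step instead of changing this dataset.
        return Dataset(parent=self, fn=fn)

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            rows = list(self.source[i])
        else:
            rows = [self.fn(x) for x in self.parent.partition(i)]
        self.cache[i] = rows
        return rows


base = Dataset(source=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

first = doubled.partition(1)

# Simulate a failure that loses every computed result.
doubled.cache.clear()
base.cache.clear()

recovered = doubled.partition(1)  # rebuilt by replaying the recorded steps
print(first, recovered)
```

The recipe survives the simulated failure because map() never modifies an existing dataset, which mirrors the immutability point made in the explanation.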
NEW QUESTION 69
Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
- A. spark.read.parquet(filePath, mergeSchema='y')
- B. spark.read.parquet(filePath)
- C. spark.read.option("mergeSchema", "true").parquet(filePath)
- D. nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df
- E. nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how="outer")
    nx = nx+1
df
Answer: C
Explanation:
This is a very tricky question and involves both knowledge about merging as well as schemas when reading parquet files.
spark.read.option("mergeSchema", "true").parquet(filePath)
Correct. Spark's DataFrameReader's mergeSchema option will work well here, since columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name that appear in both partitions would have different data types.
spark.read.parquet(filePath)
Incorrect. While this would read in data from both partitions, only the schema in the parquet file that is read in first would be considered, so some columns that appear only in the second partition (e.g. tax_id) would be lost.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df
Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have the exact same number of columns with identical data types.
spark.read.parquet(filePath, mergeSchema="y")
False. While using the mergeSchema option is the correct way to solve this problem and it can even be called with DataFrameReader.parquet() as in the code block, it accepts the value True as a boolean or string variable. But 'y' is not a valid option.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how="outer")
    nx = nx+1
df
No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated - the question says all columns that are included in the partitions should appear exactly once.
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
Static notebook | Dynamic notebook: See test 3
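What "all columns included exactly once" means for the two schemas in this question can be worked out with dictionaries. The merge_schemas helper below is a hypothetical plain-Python model of the behaviour the explanation describes (shared columns kept once, a conflict in data types treated as an error); it is not Spark's implementation, and the column order it produces is just a property of this sketch.

```python
first_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "productId": "integer", "f": "integer",
}
second_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "rollId": "integer", "f": "integer",
    "tax_id": "integer",
}


def merge_schemas(*schemas):
    """Union of column names; the same name with two different types is an error."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise TypeError(f"conflicting types for column {name!r}")
            merged.setdefault(name, dtype)
    return merged


merged = merge_schemas(first_partition, second_partition)
print(list(merged))

try:
    merge_schemas({"f": "integer"}, {"f": "string"})
    conflict = None
except TypeError as err:
    conflict = str(err)
print(conflict)
```

The merged result has eight columns: the four shared ones plus f appear once, and productId, rollId and tax_id are each carried over from the partition that has them.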
NEW QUESTION 70
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
- A. itemsDf.withColumn("supplier").alias("manufacturer")
- B. itemsDf.withColumn(["supplier", "manufacturer"])
- C. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
- D. itemsDf.withColumnsRenamed("supplier", "manufacturer")
- E. itemsDf.withColumnRenamed("supplier", "manufacturer")
Answer: E
Explanation:
itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for "a copy of DataFrame itemsDf". This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.
itemsDf.withColumnsRenamed("supplier", "manufacturer")
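Incorrect. In the Spark versions this exam covers, the DataFrame API does not have a withColumnsRenamed() method. (A method of that name was added in Spark 3.4, but it takes a dictionary of old-to-new names, so this call would not work there either.)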
itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
No. Watch out - although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below, withColumnRenamed expects strings.
itemsDf.withColumn(["supplier", "manufacturer"])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns. withColumn is typically used to add columns to DataFrames, taking the name of the new column as a first, and a Column as a second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn("supplier").alias("manufacturer")
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help the cause of renaming a column much. DataFrame.alias() can be useful in addressing the input of join statements. However, this is far outside of the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn - PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias - PySpark 3.1.2 documentation (https://bit.ly/3aSB5tm , https://bit.ly/2Tv4rbE , https://bit.ly/2RbhBd2)
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/31.html , https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 71
The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.
Sample of DataFrame itemsDf:
+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+
Code block:
itemsDf.withColumnRenamed("itemNameElements", split("itemName"))
- A. All column names need to be wrapped in the col() operator.
- B. The expressions "itemNameElements" and split("itemName") need to be swapped.
- C. Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.
- D. Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.
- E. Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.
Answer: C
Explanation:
Correct code block:
itemsDf.withColumn("itemNameElements", split("itemName"," "))
Output of code block:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
The key to solving this question is that the split method definitely needs a second argument here (also look at the link to the documentation below). Given the values in column itemName in DataFrame itemsDf, this should be a space character " ". This is the character we need to split the words in the column.
More info: pyspark.sql.functions.split - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
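To see why the separator matters, here is a minimal plain-Python sketch (not PySpark itself). It assumes that re.split is a fair stand-in for pyspark.sql.functions.split, which also treats its second argument as a regular-expression pattern. The helper name split_words is hypothetical, and the strings are copied from column itemName above.

```python
import re

# Sample values copied from column itemName in the question
item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]


def split_words(text, pattern=" "):
    """Hypothetical plain-Python stand-in for split(column, pattern)."""
    return re.split(pattern, text)


# Splitting on a space yields one array element per word (answer C)
item_name_elements = [split_words(name) for name in item_names]

# Splitting on a comma (answer D) leaves every name as a single element
comma_split = [split_words(name, ",") for name in item_names]

for words in item_name_elements:
    print(words)
```

Because none of the item names contains a comma, the comma separator never splits anything, which is why only the space separator produces the array of words the question asks for.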
NEW QUESTION 72
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?
- A. spark.read().path(filePath)
- B. spark.read.path(filePath)
- C. spark.read().json(filePath)
- D. spark.read.path(filePath, source="json")
- E. spark.read.json(filePath)
Answer: E
Explanation:
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Passing filePath into the DataFrameReader.json() method then tells Spark to read the file as JSON.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 73
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?
- A. transactionsDf.clearCache()
- B. transactionsDf.unpersist()
- C. del transactionsDf
- D. transactionsDf.persist()
- E. array_remove(transactionsDf, "*")
Answer: B
Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for - it removes all cached parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command just notifies the Python garbage collector that the transactionsDf now may be deleted from memory. However, the garbage collector does not do so immediately and, if you wanted it to run immediately, would need to be specifically triggered to do so. Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements from arrays in columns that match a specific condition. Also, the first argument would be a column, and not a DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: It caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation, python - How to delete an RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 74
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows derived from rows of DataFrame itemsDf in which the column attributes contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
- A. 1. where
2. "array_contains(attributes, 'cozy')"
3. select
4. itemId
5. explode
6. attributes
- B. 1. filter
2. "array_contains(attributes, cozy)"
3. select
4. "itemId"
5. explode
6. "attributes"
- C. 1. filter
2. array_contains("cozy")
3. select
4. "itemId"
5. explode
6. "attributes"
- D. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. map
6. "attributes"
- E. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. explode
6. "attributes"
Answer: E
Explanation:
The correct code block is:
itemsDf.filter("array_contains(attributes, 'cozy')").select("itemId", explode("attributes"))
The key here is understanding how to use array_contains(). You can either use it as an expression in a string, or you can import it from pyspark.sql.functions. In that case, the following would also work:
itemsDf.filter(array_contains("attributes", "cozy")).select("itemId", explode("attributes"))
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/29.html , https://bit.ly/sparkpracticeexams_import_instructions)
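The filter-then-explode logic of the correct code block can be traced in plain Python. This is only a sketch of the row-level semantics, not Spark code: the rows below are hypothetical, since the sample of itemsDf is not reproduced here, and the list comprehensions merely imitate what filter and explode do.

```python
# Hypothetical rows standing in for itemsDf (the real sample is not shown)
items = [
    {"itemId": 1, "attributes": ["blue", "winter", "cozy"]},
    {"itemId": 2, "attributes": ["red", "summer", "fresh"]},
    {"itemId": 3, "attributes": ["green", "cozy"]},
]

# Imitates filter("array_contains(attributes, 'cozy')"):
# keep only rows whose array holds the element 'cozy'
cozy_items = [row for row in items if "cozy" in row["attributes"]]

# Imitates select("itemId", explode("attributes")):
# one output row per array element, with the exploded column named "col"
exploded = [
    {"itemId": row["itemId"], "col": attribute}
    for row in cozy_items
    for attribute in row["attributes"]
]

for row in exploded:
    print(row)
```

Item 2 drops out at the filter step, while items 1 and 3 contribute one row per attribute, including the attributes that are not cozy.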
NEW QUESTION 75
Which of the following is not a feature of Adaptive Query Execution?
- A. Reroute a query in case of an executor failure.
- B. Collect runtime statistics during query execution.
- C. Split skewed partitions into smaller partitions to avoid differences in partition processing time.
- D. Coalesce partitions to accelerate data processing.
- E. Replace a sort merge join with a broadcast join, where appropriate.
Answer: A
Explanation:
Reroute a query in case of an executor failure.
Correct. Although this feature exists in Spark, it is not a feature of Adaptive Query Execution. The cluster manager keeps track of executors and will work together with the driver to launch an executor and assign the workload of the failed executor to it (see also link below).
Replace a sort merge join with a broadcast join, where appropriate.
No, this is a feature of Adaptive Query Execution.
Coalesce partitions to accelerate data processing.
Wrong, Adaptive Query Execution does this.
Collect runtime statistics during query execution.
Incorrect, Adaptive Query Execution (AQE) collects these statistics to adjust query plans. This feedback loop is an essential part of accelerating queries via AQE.
Split skewed partitions into smaller partitions to avoid differences in partition processing time.
No, this is indeed a feature of Adaptive Query Execution. Find more information in the Databricks blog post linked below.
More info: Learning Spark, 2nd Edition, Chapter 12; On which way does RDD of spark finish fault-tolerance? - Stack Overflow; How to Speed up SQL Queries with Adaptive Query Execution
NEW QUESTION 76
The code block shown below should read all files with the file ending .png in directory path into Spark.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)
- A. 1. read()
2. format
3. "binaryFile"
4. "recursiveFileLookup"
5. load
- B. 1. read
2. format
3. binaryFile
4. pathGlobFilter
5. load
- C. 1. read
2. format
3. "binaryFile"
4. "pathGlobFilter"
5. load
- D. 1. open
2. format
3. "image"
4. "fileType"
5. open
- E. 1. open
2. as
3. "binaryFile"
4. "pathGlobFilter"
5. load
Answer: C
Explanation:
Correct code block:
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator - the open operator shown in one of the answers does not exist.
NEW QUESTION 77
The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.
from pyspark import StorageLevel
transactionsDf.__1__(StorageLevel.__2__).__3__
- A. 1. persist
2. DISK_ONLY_2
3. count()
- B. 1. persist
2. MEMORY_ONLY_2
3. count()
- C. 1. persist
2. MEMORY_ONLY_2
3. select()
- D. 1. cache
2. MEMORY_ONLY_2
3. count()
- E. 1. cache
2. DISK_ONLY_2
3. count()
Answer: B
Explanation:
Correct code block:
from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2).count()
Only persist takes different storage levels, so any option using cache() cannot be correct. persist() is evaluated lazily, so an action needs to follow this command. select() is not an action, but count() is - so all options using select() are incorrect.
Finally, the question states that "the executors' memory should be utilized as much as possible, but not writing anything to disk". This points to a MEMORY_ONLY storage level. In this storage level, partitions that do not fit into memory will be recomputed when they are needed, instead of being written to disk, as with the storage option MEMORY_AND_DISK. Since the data need to be duplicated across two executors, _2 needs to be appended to the storage level.
Static notebook | Dynamic notebook: See test 2
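The choice of storage level can be reasoned through with a small stand-in. The sketch below does not import PySpark. It assumes a namedtuple that mirrors the constructor flags of pyspark.StorageLevel (useDisk, useMemory, useOffHeap, deserialized, replication), with flag values written down as I understand PySpark defines them. The helper matches_question is hypothetical.

```python
from collections import namedtuple

# Stand-in for pyspark.StorageLevel, mirroring its constructor flags
StorageLevel = namedtuple(
    "StorageLevel",
    ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"],
)

# Assumed flag values for a few of the predefined levels
LEVELS = {
    "MEMORY_ONLY": StorageLevel(False, True, False, False, 1),
    "MEMORY_ONLY_2": StorageLevel(False, True, False, False, 2),
    "MEMORY_AND_DISK": StorageLevel(True, True, False, False, 1),
    "DISK_ONLY_2": StorageLevel(True, False, False, False, 2),
}


def matches_question(level):
    """Memory only, nothing written to disk, replicated on two executors."""
    return level.useMemory and not level.useDisk and level.replication == 2


candidates = [name for name, level in LEVELS.items() if matches_question(level)]
print(candidates)
```

Of the levels listed, only MEMORY_ONLY_2 satisfies all three requirements from the question at once.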
NEW QUESTION 78
The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).
Code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')
- A. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and Spark driver does not call the UDF function appropriately.
- B. The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is not result.
- C. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf DataFrame.
- D. The returned DataFrame includes multiple columns instead of just one column.
- E. The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.
Answer: B
Explanation:
Correct code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    if x:
        return x**5
    return x

spark.udf.register('power_5_udf', pow_5, T.LongType())
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')
Here it is important to understand how the pow_5 method handles empty values. In the wrong code block above, the pow_5 method is unable to handle empty values and will throw an error, since Python's ** operator cannot deal with any null value Spark passes into method pow_5.
The order of arguments for registering the UDF function with Spark via spark.udf.register matters. In the code snippet in the question, the arguments for the SQL method name and the actual Python function are switched. You can read more about the arguments of spark.udf.register and see some examples of its usage in the documentation (link below).
Finally, you should recognize that in the original code block, an expression to rename the column created through the UDF function is missing. The renaming is done by SQL's AS result argument. Omitting that argument, you end up with the column name power_5_udf(value) and not result.
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
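The null-handling point can be checked without Spark, because a Python UDF receives a SQL null as None. The sketch below is plain Python: pow_5_unsafe is a hypothetical name for the function from the question, pow_5 is the guarded function from the corrected code block, and the input list is made up for illustration.

```python
def pow_5_unsafe(x):
    # Version from the question: fails as soon as x is None
    return x**5


def pow_5(x):
    # Guarded version from the corrected code block
    if x:
        return x**5
    return x


# Hypothetical column values; None plays the role of a null
values = [2, None, 3]

try:
    [pow_5_unsafe(v) for v in values]
    unsafe_error = None
except TypeError as exc:
    # Python's ** operator does not accept None as an operand
    unsafe_error = exc

safe_results = [pow_5(v) for v in values]
print(safe_results)
```

The guarded version passes None through unchanged, which is what produces the null value the question requires for rows without a value.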
NEW QUESTION 79
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))
- A. 1. select
2. col("storeId")
3. cast
4. StringType()
- B. 1. select
2. storeId
3. cast
4. StringType()
- C. 1. cast
2. "storeId"
3. as
4. StringType()
- D. 1. select
2. col("storeId")
3. cast
4. StringType
- E. 1. select
2. col("storeId")
3. as
4. StringType
Answer: A
Explanation:
Correct code block:
transactionsDf.select(col("storeId").cast(StringType()))
Solving this question involves understanding that types from pyspark.sql.types, such as StringType, need to be instantiated when used in Spark. In simple words, they need to be followed by parentheses like so: StringType(). You could also use .cast("string") instead, but that option is not given here.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 80
The code block shown below should return the number of columns in the CSV file stored at location filePath.
From the CSV file, only lines should be read that do not start with a # character. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
- A. 1. DataFrame
2. spark
3. read()
4. escape='#'
5. shape[0]
- B. 1. size
2. spark
3. read()
4. escape='#'
5. columns
- C. 1. len
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns
- D. 1. size
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns
- E. 1. len
2. spark
3. read
4. comment='#'
5. columns
Answer: E
Explanation:
Correct code block:
len(spark.read.csv(filePath, comment='#').columns)
This is a challenging question with difficulties in an unusual context: The boundary between DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.
Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.
Other answer options include size in gap 1. size() is not a built-in Python command, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() method, but this method only returns the length of an array or map stored within a column (documentation linked below). So, using a size() method is not an option here. This leaves us with two potentially valid answers.
We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), DataFrameReader is a class in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession) and spark.read therefore returns a DataFrameReader (also see documentation below). This leaves only one correct answer option.
More info:
- pyspark.sql.functions.size - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession.read - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
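The elimination of size in gap 1 can be checked in plain Python. The sketch below assumes a hypothetical list of column names in place of a real DataFrame.columns value (which is an ordinary Python list), so it runs without Spark.

```python
import builtins

# Hypothetical stand-in for the list returned by DataFrame.columns
columns = ["transactionId", "value", "storeId"]

# len is a Python built-in, size is not
has_len = hasattr(builtins, "len")
has_size = hasattr(builtins, "size")

# Counting the columns therefore comes down to len(...)
number_of_columns = len(columns)
print(has_len, has_size, number_of_columns)
```

Since the columns attribute is a plain list, len is all that is needed in gap 1 and no Spark function has to be imported.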
NEW QUESTION 81
Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?
- A. Use a narrow transformation to reduce the number of partitions.
- B. Use a wide transformation to reduce the number of partitions.
- C. Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
- D. Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
- E. Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
Answer: A
Explanation:
Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation, and in fact the most efficient way to resize the DataFrame of all options listed. One would run DataFrame.coalesce(8) to resize the DataFrame.
Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
Wrong. The coalesce operation avoids a full shuffle, but will shuffle data if needed. This answer is incorrect because it says "fully shuffle" - this is something the coalesce operation will not do. As a general rule, it will reduce the number of partitions with the very least movement of data possible. More info: distributed computing - Spark - repartition() vs coalesce() - Stack Overflow
Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
Incorrect, since the num_partitions parameter needs to be an integer number defining the exact number of partitions desired after the operation. More info: pyspark.sql.DataFrame.coalesce - PySpark 3.1.2 documentation
Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
No. The repartition operation will fully shuffle the DataFrame. This is not the most efficient way of reducing the number of partitions of all listed options.
Use a wide transformation to reduce the number of partitions.
No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.
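The difference between the narrow and the wide approach can be illustrated with a toy model in plain Python. This is a simplified sketch, not how Spark is implemented: partitions are modelled as lists of integers, the function names merely echo the DataFrame methods, and the hash-based redistribution is an assumption chosen to show row-level movement (Spark's own repartition logic differs).

```python
def coalesce(partitions, n):
    """Toy narrow transformation: merge whole neighbouring partitions."""
    merged = [[] for _ in range(n)]
    for index, partition in enumerate(partitions):
        # Each input partition moves as one block into exactly one output
        merged[index * n // len(partitions)].extend(partition)
    return merged


def repartition(partitions, n):
    """Toy wide transformation: every row is redistributed individually."""
    shuffled = [[] for _ in range(n)]
    for partition in partitions:
        for row in partition:
            shuffled[hash(row) % n].append(row)
    return shuffled


# 16 hypothetical partitions holding 4 integers each
partitions = [list(range(i * 4, i * 4 + 4)) for i in range(16)]

coalesced = coalesce(partitions, 8)
repartitioned = repartition(partitions, 8)

print(coalesced[0])      # rows of the first two input partitions, kept together
print(repartitioned[0])  # rows drawn from many different input partitions
```

In the toy coalesce, each output partition depends on a fixed pair of input partitions, whereas in the toy repartition every output partition draws rows from many input partitions, which is the pattern that forces a shuffle.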
NEW QUESTION 82
......
Prerequisites for the Databricks Associate Developer Apache Spark Exam?
- You need to be able to complete individual data manipulation tasks with the help of the Spark DataFrame API.
- You should have a basic understanding of the Spark architecture, including Adaptive Query Execution.
Dumps Real Databricks Associate-Developer-Apache-Spark Exam Questions [Updated 2022]: https://www.free4torrent.com/Associate-Developer-Apache-Spark-braindumps-torrent.html
Prepare Associate-Developer-Apache-Spark Question Answers Free Update With 100% Exam Passing Guarantee [2022]: https://drive.google.com/open?id=1dTVADLkSWGwq3dBMORm52QKnHxcQ5iTv