Certified Machine Learning Associate Databricks Exam Dumps & Practice Test Questions

Question 1:

A data scientist has created a new Feature Table named new_table using the Feature Store client (fs) and added a metadata description to document the table's purpose. Later, the scientist wants to access this metadata description through code.

Which of the following code snippets correctly retrieves the description that was added during the Feature Table’s creation?

A. There is no way to return the metadata description programmatically.
B. fs.create_training_set("new_table")
C. fs.get_table("new_table").description
D. fs.get_table("new_table").load_df()
E. fs.get_table("new_table")

Correct Answer: C

Explanation:

In the Databricks Feature Store, each Feature Table can have associated metadata, such as a description, ownership information, tags, and more. This metadata provides essential context for downstream consumers of the data, including data scientists, ML engineers, and platform maintainers. Being able to programmatically retrieve this metadata is important for data cataloging, governance, and reproducibility.

The Feature Store client (fs) includes a method called get_table(table_name) that fetches the Feature Table object by its name. This object contains several attributes, including the description field that stores the textual metadata provided during table creation.

This approach first accesses the Feature Table object and then retrieves the description attribute.
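For example, a minimal sketch of this lookup (assuming the standard databricks.feature_store client and the table name used in the question) might look like:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Fetch the Feature Table object by name, then read its description attribute
table = fs.get_table("new_table")
print(table.description)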

Now, let’s evaluate the incorrect options:

  • A is wrong because metadata is accessible programmatically through the Feature Store client, contrary to the claim.

  • B is incorrect because create_training_set() is used to prepare a dataset for model training and doesn’t retrieve metadata.

  • D uses load_df() to return the DataFrame of feature values but does not expose metadata like descriptions.

  • E simply returns the Feature Table object. That object holds the metadata, but the description is not retrieved until the .description attribute is explicitly accessed.

In summary, the only option that correctly accesses the table’s metadata description is C. It directly fetches the textual description embedded during table creation, making it the proper and efficient choice for retrieving metadata.

Question 2:

A data scientist is using PySpark and has a DataFrame named spark_df, which includes a column named "price". They want to create a new DataFrame that only contains the rows where "price" is greater than zero.

Which of the following code snippets correctly filters the DataFrame?

A. spark_df[spark_df["price"] > 0]
B. spark_df.filter(col("price") > 0)
C. SELECT * FROM spark_df WHERE price > 0
D. spark_df.loc[spark_df["price"] > 0, :]
E. spark_df.loc[:, spark_df["price"] > 0]

Correct Answer: B

Explanation:

When working with PySpark, filtering a DataFrame using column conditions must be done through its built-in API, which includes methods like .filter() or .where(). These methods are designed to operate in a distributed fashion and take expressions that evaluate to boolean conditions.

To filter rows where the "price" column is greater than zero, the appropriate syntax involves using the col() function from pyspark.sql.functions.

This is shown in Option B, which correctly uses filter() in combination with col("price") > 0. This is the idiomatic and recommended approach in PySpark to programmatically reference column names and apply filtering logic.
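A minimal, runnable sketch of this pattern (with illustrative example data; only the "price" column comes from the question) could be:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Illustrative data with a "price" column
spark_df = spark.createDataFrame([(1, 10.0), (2, 0.0), (3, -5.0)], ["id", "price"])

# Keep only rows where price is strictly greater than zero
positive_prices_df = spark_df.filter(col("price") > 0)
positive_prices_df.show()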

Let’s examine why the other options are incorrect:

  • A mimics boolean indexing from the pandas library. PySpark DataFrames do accept a Column condition inside square brackets (it is delegated to filter() internally), but this pandas-style form is not the idiomatic PySpark API and is not the answer the exam expects here.

  • C looks like a valid SQL statement, but SQL syntax cannot be directly applied to PySpark DataFrames unless the DataFrame is first registered as a temporary SQL view using createOrReplaceTempView(). Without that step, this query will not work in PySpark.

  • D and E both rely on .loc[], a pandas-only DataFrame accessor. PySpark does not include this method, so these will also raise errors.

Therefore, among all the choices, Option B demonstrates the idiomatic and recommended way to filter a PySpark DataFrame based on a column condition. It is syntactically valid, leverages the functional API of PySpark, and aligns with distributed computing best practices.

Question 3:

A healthcare provider is creating a machine learning model to classify whether patients are currently infected with a specific illness. The primary aim is to ensure that as many true cases of infection as possible are identified, even if that results in occasionally misclassifying non-infected individuals. In this medical scenario, missing a genuine infection case could lead to delayed care and increased transmission. 

Based on this priority, which evaluation metric should the organization focus on the most?

A. Root Mean Squared Error (RMSE)
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall

Correct Answer: E

Explanation:

When developing classification models—particularly in healthcare—it’s critical to select performance metrics that align with the model’s intended use. In this case, the organization wants to identify every actual infected patient, even if the model sometimes wrongly classifies healthy patients as infected. This indicates a higher concern for false negatives than for false positives.

Recall is the metric that best aligns with this goal. Also known as sensitivity or the true positive rate, recall measures the proportion of actual positive cases that the model correctly identifies. The formula for recall is:

Recall = True Positives / (True Positives + False Negatives)

A high recall means the model captures most of the truly infected cases. This is crucial in medical contexts, where missing a positive case can have serious repercussions such as disease progression or community spread. In contrast, a false positive (predicting infection when there is none) may result in inconvenience, but it’s far less dangerous than a missed diagnosis.
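As a small illustration (using scikit-learn's recall_score on made-up labels, which are not part of the question):

from sklearn.metrics import recall_score

# Hypothetical ground truth and predictions: 1 = infected, 0 = healthy
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# recall = TP / (TP + FN) = 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))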

Now, let’s review why the other options are less appropriate:

  • A. RMSE is a regression metric and doesn’t apply to classification problems.

  • B. Precision evaluates how many predicted positive cases are correct, but doesn’t penalize missed actual cases as recall does.

  • C. Area under the residual operating curve seems to be a mistaken term; it’s likely intended to refer to the ROC curve. While AUC-ROC is helpful, it provides a more general view and doesn’t focus specifically on minimizing false negatives.

  • D. Accuracy considers all correct predictions but may be misleading when positive cases are rare, which is often true in disease detection.

Therefore, Recall is the ideal metric when the goal is to catch as many true positive cases as possible.

Question 4:

When addressing missing values in a dataset, under which condition is it typically more appropriate to fill in missing values of a numerical feature using the median rather than the mean?

A. When the feature represents categorical values
B. When the feature contains boolean values
C. When the numerical feature has many extreme outliers
D. When the numerical data is normally distributed and free of outliers
E. When there are no missing values in the feature

Correct Answer: C

Explanation:

Handling missing data is a vital step in preparing datasets for machine learning. For numerical features, two popular techniques to fill missing values are mean imputation and median imputation. The choice between these depends on the distribution of the data and the presence of outliers.

When a dataset includes extreme outliers, the mean can be heavily skewed. Since the mean is calculated as the average of all values, it’s sensitive to large or small extreme values. For instance, if most house prices in a dataset are around $200,000 but a few are priced at $5 million, the mean will shift higher, not representing the typical case. Filling missing data with such a skewed mean can introduce inaccuracies.

The median, however, is the middle value in a sorted dataset and is robust against outliers. It doesn’t change significantly in the presence of a few extreme values, making it a better measure of central tendency for skewed or non-normal distributions. As a result, when a numerical column contains outliers, the median offers a more stable and representative value for imputation.
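As a brief sketch, median imputation can be done with PySpark's Imputer (the data below is invented to show the effect of an outlier):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "price" has an extreme outlier and a missing value
df = spark.createDataFrame([(100.0,), (120.0,), (None,), (5000000.0,)], ["price"])

# Fill the missing value with the column median (120.0) rather than the outlier-skewed mean
imputer = Imputer(inputCols=["price"], outputCols=["price_imputed"], strategy="median")
imputer.fit(df).transform(df).show()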

Looking at the other options:

  • A. Categorical features shouldn’t be imputed using the median; techniques like mode or category-based imputation are more appropriate.

  • B. Boolean data should use logical rules or mode, not numerical median imputation.

  • D. If the data is normally distributed and outlier-free, the mean might actually be more statistically appropriate.

  • E. If there are no missing values, imputation isn’t needed at all.

Therefore, using the median is especially effective when numerical features include a significant number of extreme values, preserving the integrity of the dataset for modeling.

Question 5:

A data scientist is working with a dataset that contains numerous missing values scattered across different features. To handle this, the scientist decides to fill the gaps using the median of each respective feature. However, a team member expresses concern that this method might overlook important information embedded in the missing values.

To preserve as much meaningful information as possible about the missing data while still performing imputation, which of the following would be the most effective strategy?

A. Use the mean instead of the median for imputation
B. Skip imputation entirely and let the model deal with missing values
C. Discard all features that contain missing data
D. Introduce a binary flag column for each feature with missing values to indicate if a value was imputed
E. Add a constant column per feature showing the percentage of missing data in that column

Correct Answer: D

Explanation:

When dealing with missing data, it's not just about filling in the blanks; it's also about acknowledging the pattern of missingness, which might carry predictive power. The most efficient and informative way to preserve this missingness-related insight is to add a binary indicator column for each feature that has missing values. This approach, known as missing indicator imputation, flags whether a value was originally missing (1) or not (0), which can reveal underlying trends in the data.

For example, in healthcare datasets, a missing test result might indicate that the test wasn’t needed or ordered—possibly pointing to a specific diagnosis or treatment pattern. Without explicitly capturing this missingness, valuable signal could be lost.
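One way this might look in PySpark (an illustrative sketch; the "price" column and the median fill are assumptions, not part of the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a missing "price" value
df = spark.createDataFrame([(1, 100.0), (2, None), (3, 300.0), (4, 200.0)], ["id", "price"])

# Binary flag: 1 if the original value was missing, 0 otherwise
df_flagged = df.withColumn("price_was_missing", col("price").isNull().cast("int"))

# Then impute the original column, here with the (approximate) median
median_price = df_flagged.approxQuantile("price", [0.5], 0.0)[0]
df_imputed = df_flagged.fillna({"price": median_price})
df_imputed.show()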

Let’s break down the other choices:

  • A: Replacing missing values with the mean instead of the median doesn’t introduce any additional information about the missingness—it just uses a different statistic, which doesn't solve the colleague’s concern.

  • B: Relying on the algorithm to manage missing data may not work effectively. Many popular machine learning models (like linear regression, SVM, or even certain decision trees) do not natively handle null values.

  • C: Dropping features with missing data can result in substantial information loss, especially if those features are important predictors.

  • E: Adding a column with the percentage of missing values provides dataset-level statistics, not row-specific indicators. It won’t help the model understand which specific observations had missing values.

Adding binary indicators (option D) is the most targeted and model-friendly way to preserve and utilize missingness as a potentially valuable feature.

Question 6:

A data scientist is analyzing a PySpark DataFrame called spark_df and wants to view summary statistics for all the numerical columns. These statistics should include basic metrics like count, mean, standard deviation, minimum, and maximum, along with percentile-based values such as the 25th, 50th (median), and 75th percentiles—needed to assess the interquartile range (IQR).

Which of the following methods should the data scientist use to best achieve this goal?

A. spark_df.summary()
B. spark_df.stats()
C. spark_df.describe().head()
D. spark_df.printSchema()
E. spark_df.toPandas()

Correct Answer: A

Explanation:

To explore comprehensive summary statistics in PySpark, the method spark_df.summary() is the most suitable and robust option. Unlike describe(), which provides only a limited subset of statistics, summary() includes a broader range, such as percentile values (25%, 50%, 75%) that are crucial for calculating the interquartile range (IQR). This makes it an ideal tool for data exploration and identifying outliers or spread in the data.
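For instance (a sketch with a tiny invented DataFrame; any Spark DataFrame with numeric columns works the same way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(10.0,), (20.0,), (30.0,), (40.0,)], ["price"])

# Full summary: count, mean, stddev, min, 25%, 50%, 75%, max for numeric columns
spark_df.summary().show()

# The output can also be restricted to specific statistics, e.g. just the quartiles
spark_df.summary("25%", "50%", "75%").show()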

Here’s a breakdown of the available options:

  • A: spark_df.summary() is correct because it returns a DataFrame that includes count, mean, standard deviation, min, max, and percentiles (25%, 50%, and 75%) for all numeric columns. These values are essential when you're interested in distribution insights such as IQR.

  • B: spark_df.stats() is incorrect. There is no stats() method for PySpark DataFrames, and using it will result in an error.

  • C: spark_df.describe().head() is partially correct in that describe() provides some descriptive statistics (count, mean, stddev, min, max), but it doesn’t include percentiles, so it falls short when you need the IQR. Moreover, calling .head() with no argument returns only the first row of that summary, which hides most of the output.

  • D: spark_df.printSchema() simply prints the column names and data types. It provides no statistical insight whatsoever.

  • E: spark_df.toPandas() converts the Spark DataFrame into a Pandas DataFrame, which might not be feasible with large datasets due to memory constraints. Also, this conversion doesn’t automatically compute statistics.

Therefore, for a full statistical summary—including IQR—spark_df.summary() is the correct and most informative choice.

Question 7:

A company is creating a shared feature store for use in multiple machine learning projects. As part of the initial setup, the team decides to apply one-hot encoding to all categorical features before training any models. However, a data scientist objects and advises against applying one-hot encoding within the centralized repository. Instead, they recommend handling encoding later during model development.

What is the most likely reason for the data scientist's suggestion?

A. One-hot encoding is generally unsupported by machine learning libraries.
B. One-hot encoding relies on the output variable, which changes across different projects.
C. One-hot encoding is resource-heavy and suitable only for small datasets.
D. One-hot encoding is not a commonly used technique for handling categorical data.
E. One-hot encoding can reduce flexibility and create compatibility issues with certain algorithms.

Correct Answer: E

Explanation:

The data scientist recommends postponing one-hot encoding to maintain adaptability and maximize the reusability of the centralized feature repository. While one-hot encoding is a popular method for transforming categorical variables into a numerical format, applying it in a shared repository can introduce challenges across different modeling contexts.

One key concern is loss of flexibility. Some machine learning algorithms—especially tree-based models such as decision trees, random forests, and gradient boosting—can handle categorical variables directly or perform better with alternative encoding methods such as ordinal encoding, target encoding, or embeddings. These techniques retain more information about category relationships or avoid creating sparse data, which one-hot encoding may inadvertently cause.

Moreover, pre-encoding data in the repository introduces rigidity. If the categorical domain changes over time (e.g., new category levels appear in future datasets), the one-hot schema becomes outdated. Updating the encoding in the repository to reflect these changes may lead to inconsistencies, versioning issues, or data leakage.

Additionally, high-cardinality categorical variables can produce a large number of dummy variables when one-hot encoded, significantly increasing the dimensionality of the data. This not only adds computational cost but can also degrade model performance for algorithms sensitive to feature size.
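To make the dimensionality point concrete, here is a toy sketch with pandas and an invented "city" column (not part of the question):

import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "SF", "Boston", "Austin"]})

# One-hot encoding creates one new column per distinct category level
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.shape)  # (5, 5): five dummy columns for five distinct cities

A column with thousands of distinct values would produce thousands of mostly-zero columns in the same way.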

By applying one-hot encoding later in the pipeline, data scientists retain control over choosing the most suitable encoding technique for their specific algorithm and use case. This ensures that models are optimized individually without being constrained by a generic, inflexible schema.

Hence, the data scientist's concern stems from one-hot encoding's tendency to reduce model compatibility and pipeline flexibility—making E the correct choice.

Question 8:

A data scientist is using a housing dataset to predict property prices via linear regression. Two models are created:

  • Model A uses the actual home price as the prediction target.

  • Model B predicts the natural logarithm of the home price.

To compare performance, the data scientist uses RMSE by evaluating predicted vs. actual price values. Model B shows a much higher RMSE than Model A.

Which of the following is not a valid explanation for why Model B appears to perform worse?

A. Model B may actually produce better predictions than Model A.
B. The data scientist did not convert Model B’s log predictions back to the original price scale.
C. The data scientist applied a log transformation to Model A's outputs before calculating RMSE.
D. Model A could genuinely be more accurate than Model B.
E. RMSE is not an appropriate metric for regression performance evaluation.

Correct Answer: E

Explanation:

Root Mean Squared Error (RMSE) is a standard and widely used metric for evaluating regression models. It measures the square root of the average of squared differences between predicted and actual values. RMSE is particularly useful because it penalizes larger errors more heavily, offering a sensitive indicator of model performance. Therefore, saying RMSE is invalid is incorrect—making option E an invalid explanation.

In this scenario, Model A directly predicts price in dollars, so calculating RMSE by comparing predictions to actual price values is appropriate. However, Model B predicts log(price), and to correctly evaluate its RMSE on the same scale, those predictions must first be exponentiated (i.e., transformed back using the exponential function) to get actual dollar values. Failing to do this leads to a mismatch in scales—logarithmic predictions vs. raw prices—which will result in an artificially high RMSE. This supports option B as a valid reason.
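For instance, the back-transformation step might be sketched with NumPy as follows (the numbers are invented for illustration):

import numpy as np

# Hypothetical actual prices and Model B's predictions on the log scale
actual_prices = np.array([200000.0, 350000.0, 500000.0])
log_price_preds = np.array([12.2, 12.7, 13.1])  # predictions of log(price)

# Back-transform to dollars before computing RMSE against actual prices
price_preds = np.exp(log_price_preds)
rmse = np.sqrt(np.mean((price_preds - actual_prices) ** 2))
print(rmse)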

Similarly, if the data scientist took the logarithm of Model A’s predictions before calculating RMSE, it would again introduce a mismatch with the actual price values, leading to incorrect evaluation. That means option C is another valid possibility.

It’s also plausible that Model A or Model B is truly more accurate, depending on the underlying patterns in the data. The RMSE alone, when used properly and on the same scale, can reveal this—so options A and D are valid.

Ultimately, only option E is invalid because it incorrectly claims that RMSE is not suitable for evaluating regression models. In truth, RMSE is valid and widely used, provided it's applied consistently on comparable scales.

Question 9:

A data scientist is evaluating a regression model using 3-fold cross-validation. This method involves dividing the dataset into three equal parts and then training the model on two parts while validating it on the third. This process is repeated three times, each time rotating the validation fold. After completing the cross-validation, the RMSE (Root Mean Squared Error) values obtained are as follows:

  • Fold 1: RMSE = 10.0

  • Fold 2: RMSE = 12.0

  • Fold 3: RMSE = 17.0

What is the average RMSE across all three folds, which should be used as the final estimate of the model’s performance?

A. 13.0
B. 17.0
C. 12.0
D. 39.0
E. 10.0

Correct Answer: A

Explanation:

Cross-validation is a common model validation technique that helps evaluate a model’s ability to generalize to unseen data. In k-fold cross-validation, the dataset is split into k equal subsets, and the model is trained and tested k times. Each time, the model is trained on k–1 folds and validated on the remaining fold. This rotation ensures that every data point is used both for training and validation exactly once.

In this scenario, the data scientist used 3-fold cross-validation. The RMSE values from each fold are:

  • Fold 1: 10.0

  • Fold 2: 12.0

  • Fold 3: 17.0

Since RMSE is a metric that reflects the square root of the average squared differences between predicted and actual values, it is important to average the RMSE values from each fold to get an overall performance estimate. The calculation is:

Average RMSE = (10.0 + 12.0 + 17.0) / 3 = 39.0 / 3 = 13.0

This average provides a more balanced view of the model’s performance, smoothing out potential fluctuations caused by the data distribution in any single fold.
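A minimal sketch of this averaging step in code (the fold values come from the question):

# RMSE values reported for each of the three folds
fold_rmses = [10.0, 12.0, 17.0]

# The final cross-validated estimate is the mean across folds
average_rmse = sum(fold_rmses) / len(fold_rmses)
print(average_rmse)  # 13.0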

Why the other options are incorrect:

  • B (17.0) reflects only the worst-case performance and ignores the other folds.

  • C (12.0) underestimates the true average, excluding the highest error.

  • D (39.0) is the total RMSE sum, not an average, which misrepresents the model's overall performance.

  • E (10.0) considers only the best-case scenario.

Therefore, the correct average RMSE for evaluating this model is 13.0, making A the right choice.

Question 10:

A machine learning engineer is working in Databricks to build a binary classification model using the mlflow.sklearn API for experiment tracking. After training the model and logging parameters, metrics, and artifacts, the engineer wants to register the trained model into the Model Registry for future deployment. 

Which of the following is the correct step to register the model in the Databricks Model Registry?

A. Use mlflow.register_model() with the run ID and model path.
B. Use mlflow.log_model() with the model name argument.
C. Use mlflow.register_model() with the experiment name and version number.
D. Use mlflow.set_tag() to register the model automatically.

Correct Answer: A

Explanation:

The Certified Machine Learning Associate – Databricks exam evaluates a candidate’s knowledge in applying core ML concepts within the Databricks environment, including experimentation, tracking, feature engineering, model training, and model management using MLflow.

One critical part of the Databricks ML workflow involves the Model Registry, which allows users to manage model versions, promote models to staging/production, and collaborate across teams.

After training and logging a model using MLflow (e.g., using mlflow.sklearn.log_model()), the next step is to register that model so it can be versioned and governed centrally. This is done using the mlflow.register_model() function. This function requires two inputs:

  1. The model URI – usually in the form runs:/<run-id>/model, which references the logged model artifact.

  2. The registered model name – a user-defined name under which the model will appear in the registry.

So, the correct code snippet would be something like:
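import mlflow

# "runs:/<run-id>/model" is the placeholder form described above; substitute the actual run ID
model_uri = "runs:/<run-id>/model"

# Register the logged model under a registry name ("my_classification_model" is illustrative)
mlflow.register_model(model_uri, "my_classification_model")

Here, "my_classification_model" is only an illustrative registry name; once registered, the model appears in the Model Registry with a version number that can later be promoted through stages.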

Let's consider why the other options are incorrect:

  • B (mlflow.log_model() with model name) is used for logging models but does not register them in the Model Registry.

  • C mentions using an experiment name and version number, which are not required by mlflow.register_model().

  • D (mlflow.set_tag()) is used to add metadata to a run but doesn’t register a model.

Understanding this process is essential for passing the exam, as model lifecycle management and deployment are key exam topics. Being familiar with MLflow APIs and how they are integrated within Databricks is fundamental for ML practitioners working in production environments.

