How can you validate and test data for machine learning?
Data engineering is the process of preparing and managing data for machine learning and other analytical tasks. It involves collecting, cleaning, transforming, integrating, and storing data from various sources and formats. Data engineering also requires validating and testing data to ensure its quality, reliability, and suitability for machine learning models. In this article, you will learn some of the common methods and tools for data validation and testing in data engineering for machine learning.
Data validation is the process of checking if the data meets certain criteria or expectations, such as data types, ranges, formats, completeness, accuracy, consistency, and uniqueness. Data validation can help you identify and correct errors, outliers, missing values, duplicates, and anomalies in your data before feeding it to machine learning algorithms. Data validation can be performed at different stages of the data pipeline, such as during data ingestion, transformation, integration, or loading. Some of the tools and frameworks that can help you with data validation are:
- Pandas: A popular Python library for data analysis and manipulation that provides various methods and functions for validating data, such as info(), describe(), isnull(), dropna(), fillna(), unique(), duplicated(), drop_duplicates(), and assert_frame_equal() (a minimal sketch follows this list).
- Great Expectations: An open-source Python library that allows you to define and test data quality expectations using a declarative syntax. You can use Great Expectations to validate data against schemas, rules, distributions, patterns, and thresholds, and generate data documentation and profiling reports.
- Deequ: An open-source Scala library that enables you to define and verify data quality metrics using Apache Spark. You can use Deequ to compute data quality statistics, such as completeness, uniqueness, distinctness, compliance, and correlation, and apply data quality constraints and checks.
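As a minimal illustration of the pandas checks listed above (the file name and column names are placeholders, not from the article):

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical input file

df.info()                     # data types and non-null counts per column
print(df.describe())          # ranges and summary statistics
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

df = df.drop_duplicates()
df["quantity"] = df["quantity"].fillna(0)  # fill missing values with a safe default
assert df["order_id"].is_unique, "order_id must be unique"  # fail fast on broken keys
```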
-
Alongside the technical aspects of the data itself, there is room to implement business criteria in the validation process. For example: fields that must not be null, values that must fall within certain ranges, or fields that only have business meaning when linked to other data fields. This approach can be implemented as a class or function driven by YAML configuration files, as in the sketch below.
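A minimal sketch of this idea, assuming PyYAML is installed; the rule-file layout and column names are hypothetical:

```python
import pandas as pd
import yaml  # PyYAML

# rules.yaml (hypothetical layout):
# not_null: [customer_id, amount]
# ranges:
#   amount: {min: 0, max: 10000}

def validate(df: pd.DataFrame, rules_path: str) -> list[str]:
    """Return a list of business-rule violations found in df."""
    with open(rules_path) as f:
        rules = yaml.safe_load(f)
    errors = []
    for col in rules.get("not_null", []):
        if df[col].isnull().any():
            errors.append(f"{col} contains nulls")
    for col, bounds in rules.get("ranges", {}).items():
        if not df[col].between(bounds["min"], bounds["max"]).all():
            errors.append(f"{col} has values outside [{bounds['min']}, {bounds['max']}]")
    return errors
```

Keeping the rules in YAML lets domain experts adjust thresholds without touching the validation code.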
-
When diving into data validation, visualization packages like Matplotlib and Seaborn are super handy. They help you visually spot any oddities or trends right from the get-go. If you're focusing on a specific sector or field, you'll want to make sure your data lines up with what's typically expected there. For those working with time-bound data, ensuring there aren't unexpected gaps or jumps is key. And a pro tip? Always double-check where your data's coming from – a reliable source can save you a ton of validation headaches down the road!
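For instance, a quick distribution check with Matplotlib and Seaborn might look like this (the file and column names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("sensor_readings.csv")  # hypothetical input file

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["temperature"], ax=axes[0])   # spot skew or odd spikes
sns.boxplot(x=df["temperature"], ax=axes[1])  # spot outliers at a glance
plt.tight_layout()
plt.show()
```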
-
On top of the regular data validation techniques common in data engineering pipelines, I would also add:
- At the validation stage, check for feature drift. While it is good practice to do this after the ML model is deployed to serve the use case, that is expensive to implement, so checking for drift on incoming training data is a good alternative for signaling data distribution shifts that require model retraining (see the sketch below).
- Be sure to version the data used to train ML models, and track the train/validation splits and random seeds used in your training runs for full reproducibility.
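One lightweight way to check incoming training data for drift is a two-sample Kolmogorov-Smirnov test per numeric feature; here is a sketch assuming SciPy is available (the significance threshold is a judgment call, not from the article):

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, incoming: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the incoming feature's distribution differs
    significantly from the reference (training-time) distribution."""
    statistic, p_value = ks_2samp(reference, incoming)
    return p_value < alpha
```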
-
TensorFlow Extended, aka TFX, offers a full-fledged data validation framework. It comprises a sequence of components used for data ingestion, validation, transformation, and preparation. ExampleGen ingests and optionally splits the input dataset. StatisticsGen calculates statistics for the dataset. SchemaGen examines the statistics and creates a data schema. ExampleValidator looks for anomalies and missing values in the dataset. Transform performs feature engineering on the dataset. TFX is especially helpful in developing production-grade ML models, with many built-in data validation capabilities that let you inspect training and test data distributions visually.
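The validation pieces of TFX are also available standalone as TensorFlow Data Validation (TFDV); a sketch of the typical flow (the file paths are placeholders):

```python
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)  # what SchemaGen automates
eval_stats = tfdv.generate_statistics_from_csv(data_location="eval.csv")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)  # what ExampleValidator automates
```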
-
One thing I have found helpful is performing visualization before and after each operation. This little step surfaces flaws in your data and helps you plan how to arrive at a balanced dataset.
-
Effective data validation for ML involves:
1. Requirements: Define data expectations.
2. Profiling: Identify anomalies and patterns.
3. Cleansing: Fix missing and inconsistent data.
4. Rule-based checks: Use tools like Great Expectations (a sketch follows below).
5. Statistical analysis: Check distributions and correlations.
6. Metrics: Compute completeness and uniqueness.
7. Testing: Cross-validation and integration tests.
8. Feedback loop: Update validation rules.
9. Documentation: Keep detailed process records.
10. Continuous monitoring: Data quality vigilance.
This approach acts as an initial step in ensuring dependable, top-notch data for achieving successful machine-learning results.
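For the rule-based step (item 4), a minimal sketch using Great Expectations' classic pandas-backed API; note that newer releases restructure this around validators and checkpoints, and the file and column names here are placeholders:

```python
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("transactions.csv"))
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

results = df.validate()
print(results["success"])  # False if any expectation failed
```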
-
- It is equally important to remove outlier data points that would skew calculations in an undesirable direction.
- It is important to check which data points are consistently clean if you are setting up a model for automated future calculations.
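A common way to implement the first point is the interquartile-range rule; a minimal pandas sketch (the 1.5 factor is the usual convention, not from the article):

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, factor: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside the IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[df[column].between(lower, upper)]
```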
-
Consider a real-world scenario in which a financial institution is working with credit card transaction data. Before using this data for fraud detection modeling, data validation becomes vital. 1. Using a library like PyCaret, the institution can identify incomplete transactions, outliers, or inconsistent values. For instance, if a transaction amount is unusually high, it can be flagged for manual review. 2. Additionally, Apache NiFi can be used to validate incoming transaction data in real time, ensuring that only legitimate transactions are processed while anomalies and potentially fraudulent activities are flagged. In this way, these libraries and tools play a crucial role in maintaining data quality and accuracy, mitigating risk, and supporting informed decisions.
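The "unusually high amount" check from this scenario can be approximated in plain pandas, independent of PyCaret or NiFi; the 3-sigma threshold and column names are assumptions for illustration:

```python
import pandas as pd

tx = pd.read_csv("transactions.csv")  # hypothetical transaction extract
mean, std = tx["amount"].mean(), tx["amount"].std()
tx["needs_review"] = (tx["amount"] - mean).abs() > 3 * std  # flag 3-sigma outliers
print(tx.loc[tx["needs_review"], ["transaction_id", "amount"]])
```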
Data testing is the process of verifying if the data meets the requirements and specifications of the machine learning models, such as data size, shape, distribution, balance, and features. Data testing can help you evaluate and improve the performance, accuracy, and robustness of your machine learning models. Data testing can be performed at different stages of the machine learning lifecycle, such as during data preprocessing, feature engineering, model training, validation, or deployment. Some of the tools and frameworks that can help you with data testing are:
- Scikit-learn: A widely used Python library for machine learning that provides various methods and functions for data testing, such as train_test_split(), cross_validate(), GridSearchCV(), RandomizedSearchCV(), accuracy_score(), confusion_matrix(), classification_report(), and roc_curve() (a minimal split-and-validate sketch follows this list).
- PyTest: A popular Python testing framework that allows you to write and run automated tests for your data and code. You can use PyTest to create test cases, fixtures, mocks, and assertions for your data engineering and machine learning projects.
- MLflow: An open-source platform for managing the end-to-end machine learning lifecycle that enables you to track, compare, and reproduce your data and model experiments. You can use MLflow to log and monitor your data and model metrics, parameters, artifacts, and versions, and deploy your models to various environments.
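As a minimal sketch of the Scikit-learn functions above, splitting off a held-out test set and cross-validating on the remainder (the dataset and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
cv = cross_validate(RandomForestClassifier(random_state=42),
                    X_train, y_train, cv=5, scoring="accuracy")
print(cv["test_score"].mean())  # mean accuracy across the 5 folds
```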
Data engineering for machine learning is a complex and iterative process that requires constant validation and testing of your data. By using the methods and tools discussed in this article, you can ensure that your data is high-quality, reliable, and suitable for your machine learning models, and that your models perform as expected and meet your objectives.
-
To validate and test data for machine learning, I usually split the data into three sets: a training set, a validation set, and a test set. Use the training set to train the model, the validation set to fine-tune the model's parameters, and the test set to evaluate the model's performance on unseen data. Following these steps ensures that the model is not overfitting to the training data and that it can generalize to unseen data.
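A sketch of that three-way split with Scikit-learn; the 60/20/20 proportions are one common choice, not a rule from the article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set, then split the remainder 75/25
# to get a 60/20/20 train/validation/test split overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```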
-
Begin by assessing data quality, size, distribution, and balance. Use Scikit-learn's functions for unbiased validation. Employ PyTest for automated, robust testing, creating test cases and fixtures. Validate data transformations, ensuring they align with model needs. For monitoring and reproducibility, embrace MLflow: log metrics, parameters, and artifacts to facilitate model comparison. Analyze, iterate, and refine based on results. Test during preprocessing, feature engineering, training, and deployment, and validate the data's alignment with your objectives. This approach ensures informed decisions and enhances model robustness, empowering impactful machine learning.
-
Strategies like stratified splits or k-fold cross-validation should be employed for comprehensive testing. Tools from Scikit-learn, like SelectKBest, can help in prioritizing relevant features. For a robust assessment, synthetic data generation methods, including SMOTE, are recommended along with adversarial testing to gauge model resilience.
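A sketch combining those pieces, assuming the imbalanced-learn package provides SMOTE (the synthetic dataset and parameters are illustrative):

```python
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)           # balance the minority class
X_top = SelectKBest(f_classif, k=10).fit_transform(X_res, y_res)  # keep the 10 best features
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_top, y_res):
    ...  # train and evaluate on each stratified fold
```

In a strict evaluation, SMOTE and feature selection should be refit inside each fold rather than before the split, to avoid leaking information into the validation folds.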
-
Test the efficiency of your code at handling larger quantities of data than a lab-condition load. Check whether the code still performs well when the data grows to 10-100 times the test volume. Initial results are often good, but once large pipelines are added, real-world scenarios can turn out slow and ineffective, which needs to be avoided.
-
Here is the correct process:
1. Split the dataset first and set your test set aside.
2. Fit your transformations on the train set.
3. Transform the rest of the data using the same parameters.
After fitting on the train set, you should reuse those parameters to transform the remaining data; for example, the min and max values calculated on the train set are used to scale the test samples, as in the sketch below.
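A sketch of that process with Scikit-learn's MinMaxScaler, where the min and max are learned from the train set only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max computed on the train set
X_test_scaled = scaler.transform(X_test)        # same train-derived parameters reused
```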
-
Although accuracy is important, it is also wise to use other outcome-testing measurements such as robustness, reliability, resilience, fairness, and explainability. A model can have high accuracy, yet not perform as well in production. PiML is a great Python package that helps compare these outcome-testing measurements across models.
-
When testing ML models, be sure to include production monitoring. A phenomenon that often stays undetected is training/serving skew, which can stem from the nature of the data itself or from differences between the technologies used at training and inference time. Data-related example: imagine you are training a defect detection model that will be deployed in a factory. You train on well-lit photos while the factory is poorly lit; the model's accuracy will take an instant hit once it is deployed, so be sure to monitor it. Technology-related example: you perform preprocessing for training in Python but perform preprocessing in production using Go. Differences in technologies can lead to unexpected results.
-
Test with dataset samples that reflect real-world distribution and scenarios - not just random samples. Run tests in pre-prod environments that closely match production infrastructure. Go beyond accuracy metrics - also test for biases, drift, graceful failure modes, etc. Implement canary deployments to test models on subsets of live data before full deployment. Monitor models post-deployment to confirm they work as expected in the real world.
-
To test data, we typically divide our dataset into training and test sets. This way, after training our model, we can evaluate its performance on data it hasn't seen before. Tools like Scikit-learn provide convenient methods for creating these splits. It's also essential to consider techniques like cross-validation, which offers a more reliable assessment of the model's consistency. Ultimately, the goal is to ensure our model doesn't just memorise the training data but can make accurate predictions on new, unseen data.
-
What is important while building a framework is to understand which data points make a real difference to the output and hypothesis. Sometimes outlier data points with little real impact tend to skew the results. Working with the business to understand the data points and their impact is critical.
-
One thing that's not commonly discussed is model drift, which is how models deteriorate over time. At regular intervals, it's good to test how the through-the-door population or dataset compares with the original development, test, and validation samples. A significant variance can presage model deterioration.
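One common way to quantify that comparison is the population stability index (PSI); this sketch and the usual ">0.25 means significant shift" rule of thumb are conventions added here, not from the article:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (development) sample and a new
    through-the-door sample for a single feature."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values beyond the baseline range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid division by zero and log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```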
-
ML models are only useful as long as they are solving our business problems. Be sure to monitor business metrics and their improvement or deterioration as you iterate on new versions of your model. There are multiple widely used approaches to online model testing, including A/B testing, interleaving experiments, and multi-armed bandits.
-
Validating and testing data for machine learning extends beyond technical processes. Cultural aspects like collaboration and mindset play a crucial role. Encourage open communication between data engineers and domain experts to refine validation rules effectively, and embrace an adaptable mindset. Consider an e-commerce scenario: a model for customer preferences failed because it didn't validate seasonal trends. Collaborative effort could have surfaced the importance of temporal validation earlier. Stories like this highlight the essence of understanding data's context. Don't rely solely on tools; contextual insights matter.
-
Ensure that the data is free from biases, as biased data can lead to unfair or discriminatory model outcomes. It's a blend of technical insight and a deep understanding of the data origin and implications.
-
Involve business stakeholders in defining requirements and success metrics. View validation and testing as iterative, not one-off tasks. Expect to refine and add tests over time. Automate as much as possible for efficiency and consistency. Treat data quality as a feature of the overall system - not just a preprocessing step.