How can you validate and test data for machine learning?
Data engineering is the process of preparing and managing data for machine learning and other analytical tasks. It involves collecting, cleaning, transforming, integrating, and storing data from various sources and formats. Data engineering also requires validating and testing data to ensure its quality, reliability, and suitability for machine learning models. In this article, you will learn some of the common methods and tools for data validation and testing in data engineering for machine learning.
Data validation is the process of checking if the data meets certain criteria or expectations, such as data types, ranges, formats, completeness, accuracy, consistency, and uniqueness. Data validation can help you identify and correct errors, outliers, missing values, duplicates, and anomalies in your data before feeding it to machine learning algorithms. Data validation can be performed at different stages of the data pipeline, such as during data ingestion, transformation, integration, or loading. Some of the tools and frameworks that can help you with data validation are:
- Pandas: A popular Python library for data analysis and manipulation that provides various methods and functions for validating data, such as info(), describe(), isnull(), dropna(), fillna(), unique(), duplicated(), drop_duplicates(), and assert_frame_equal() (a minimal sketch follows this list).
- Great Expectations: An open-source Python library that allows you to define and test data quality expectations using a declarative syntax. You can use Great Expectations to validate data against schemas, rules, distributions, patterns, and thresholds, and generate data documentation and profiling reports.
- Deequ: An open-source Scala library that enables you to define and verify data quality metrics using Apache Spark. You can use Deequ to compute data quality statistics, such as completeness, uniqueness, distinctness, compliance, and correlation, and apply data quality constraints and checks.
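As a minimal illustration of the pandas checks listed above (the file name and column names are placeholders, not from the article):

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical input file

df.info()                     # data types and non-null counts per column
print(df.describe())          # ranges and summary statistics
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

df = df.drop_duplicates()
df["quantity"] = df["quantity"].fillna(0)  # fill missing values with a safe default
assert df["order_id"].is_unique, "order_id must be unique"  # fail fast on broken keys
```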
-
Alongside the technical aspects of the data itself, there is room to implement business criteria in the validation process. For example: fields that must not be null, values that must fall within certain ranges, or fields that only have business meaning when linked to other data fields. This approach can be implemented as a class or function driven by YAML configuration files, as in the sketch below.
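A minimal sketch of this idea, assuming PyYAML is installed; the rule-file layout and column names are hypothetical:

```python
import pandas as pd
import yaml  # PyYAML

# rules.yaml (hypothetical layout):
# not_null: [customer_id, amount]
# ranges:
#   amount: {min: 0, max: 10000}

def validate(df: pd.DataFrame, rules_path: str) -> list[str]:
    """Return a list of business-rule violations found in df."""
    with open(rules_path) as f:
        rules = yaml.safe_load(f)
    errors = []
    for col in rules.get("not_null", []):
        if df[col].isnull().any():
            errors.append(f"{col} contains nulls")
    for col, bounds in rules.get("ranges", {}).items():
        if not df[col].between(bounds["min"], bounds["max"]).all():
            errors.append(f"{col} has values outside [{bounds['min']}, {bounds['max']}]")
    return errors
```

Keeping the rules in YAML lets domain experts adjust thresholds without touching the validation code.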
-
When diving into data validation, visualization packages like Matplotlib and Seaborn are super handy. They help you visually spot any oddities or trends right from the get-go. If you're focusing on a specific sector or field, you'll want to make sure your data lines up with what's typically expected there. For those working with time-bound data, ensuring there aren't unexpected gaps or jumps is key. And a pro tip? Always double-check where your data's coming from – a reliable source can save you a ton of validation headaches down the road!
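For instance, a quick distribution check with Matplotlib and Seaborn might look like this (the file and column names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("sensor_readings.csv")  # hypothetical input file

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["temperature"], ax=axes[0])   # spot skew or odd spikes
sns.boxplot(x=df["temperature"], ax=axes[1])  # spot outliers at a glance
plt.tight_layout()
plt.show()
```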
-
On top of the regular data validation techniques common in data engineering pipelines, I would also add:
- At the validation stage, check for feature drift. While it is good practice to do this after the ML model is deployed to serve the use case, that is expensive to implement, so checking for drift on incoming training data is a good alternative for signaling data distribution shifts that require model retraining (see the sketch below).
- Be sure to version the data used to train ML models, and track the train/validation splits and random seeds used in your training runs for full reproducibility.
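One lightweight way to check incoming training data for drift is a two-sample Kolmogorov-Smirnov test per numeric feature; here is a sketch assuming SciPy is available (the significance threshold is a judgment call, not from the article):

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, incoming: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the incoming feature's distribution differs
    significantly from the reference (training-time) distribution."""
    statistic, p_value = ks_2samp(reference, incoming)
    return p_value < alpha
```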
-
TensorFlow Extended, aka TFX, offers a full-fledged data validation framework. It comprises a sequence of components used for data ingestion, validation, transformation, and preparation. ExampleGen ingests and optionally splits the input dataset. StatisticsGen calculates statistics for the dataset. SchemaGen examines the statistics and creates a data schema. ExampleValidator looks for anomalies and missing values in the dataset. Transform performs feature engineering on the dataset. TFX is especially helpful in developing production-grade ML models, with many built-in data validation capabilities that let you inspect training and test data distributions visually.
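The validation pieces of TFX are also available standalone as TensorFlow Data Validation (TFDV); a sketch of the typical flow (the file paths are placeholders):

```python
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)  # what SchemaGen automates
eval_stats = tfdv.generate_statistics_from_csv(data_location="eval.csv")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)  # what ExampleValidator automates
```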
-
One thing I have found helpful is performing visualization before and after each operation. This little step surfaces flaws in your data and helps you plan how to arrive at a balanced dataset.
-
Effective data validation for ML involves:
1. Requirements: Define data expectations.
2. Profiling: Identify anomalies and patterns.
3. Cleansing: Fix missing and inconsistent data.
4. Rule-based checks: Use tools like Great Expectations (a sketch follows below).
5. Statistical analysis: Check distributions and correlations.
6. Metrics: Compute completeness and uniqueness.
7. Testing: Cross-validation and integration tests.
8. Feedback loop: Update validation rules.
9. Documentation: Keep detailed process records.
10. Continuous monitoring: Data quality vigilance.
This approach acts as an initial step in ensuring dependable, top-notch data for achieving successful machine-learning results.
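For the rule-based step (item 4), a minimal sketch using Great Expectations' classic pandas-backed API; note that newer releases restructure this around validators and checkpoints, and the file and column names here are placeholders:

```python
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("transactions.csv"))
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

results = df.validate()
print(results["success"])  # False if any expectation failed
```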
-
- It is equally important to remove outlier data points that would skew calculations in an undesirable direction.
- It is important to check which data points are consistently clean if you are setting up a model for automated future calculations.
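A common way to implement the first point is the interquartile-range rule; a minimal pandas sketch (the 1.5 factor is the usual convention, not from the article):

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, factor: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside the IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[df[column].between(lower, upper)]
```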
-
Consider a real-world scenario in which a financial institution is working with credit card transaction data. Before using this data for fraud detection modeling, data validation becomes vital. 1. Using a library like PyCaret, the institution can identify incomplete transactions, outliers, or inconsistent values. For instance, if a transaction amount is unusually high, it can be flagged for manual review. 2. Additionally, Apache NiFi can be used to validate incoming transaction data in real time, ensuring that only legitimate transactions are processed while anomalies and potentially fraudulent activities are flagged. In this way, these libraries and tools play a crucial role in maintaining data quality and accuracy, mitigating risk, and supporting informed decisions.
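The "unusually high amount" check from this scenario can be approximated in plain pandas, independent of PyCaret or NiFi; the 3-sigma threshold and column names are assumptions for illustration:

```python
import pandas as pd

tx = pd.read_csv("transactions.csv")  # hypothetical transaction extract
mean, std = tx["amount"].mean(), tx["amount"].std()
tx["needs_review"] = (tx["amount"] - mean).abs() > 3 * std  # flag 3-sigma outliers
print(tx.loc[tx["needs_review"], ["transaction_id", "amount"]])
```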
Data testing is the process of verifying if the data meets the requirements and specifications of the machine learning models, such as data size, shape, distribution, balance, and features. Data testing can help you evaluate and improve the performance, accuracy, and robustness of your machine learning models. Data testing can be performed at different stages of the machine learning lifecycle, such as during data preprocessing, feature engineering, model training, validation, or deployment. Some of the tools and frameworks that can help you with data testing are:
- Scikit-learn: A widely used Python library for machine learning that provides various methods and functions for data testing, such as train_test_split(), cross_validate(), GridSearchCV(), RandomizedSearchCV(), accuracy_score(), confusion_matrix(), classification_report(), and roc_curve() (a minimal split-and-validate sketch follows this list).
- PyTest: A popular Python testing framework that allows you to write and run automated tests for your data and code. You can use PyTest to create test cases, fixtures, mocks, and assertions for your data engineering and machine learning projects.
- MLflow: An open-source platform for managing the end-to-end machine learning lifecycle that enables you to track, compare, and reproduce your data and model experiments. You can use MLflow to log and monitor your data and model metrics, parameters, artifacts, and versions, and deploy your models to various environments.
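As a minimal sketch of the Scikit-learn functions above, splitting off a held-out test set and cross-validating on the remainder (the dataset and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
cv = cross_validate(RandomForestClassifier(random_state=42),
                    X_train, y_train, cv=5, scoring="accuracy")
print(cv["test_score"].mean())  # mean accuracy across the 5 folds
```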
Data engineering for machine learning is a complex and iterative process that requires constant validation and testing of your data. By using the methods and tools discussed in this article, you can ensure that your data is high-quality, reliable, and suitable for your machine learning models, and that your models perform as expected and meet your objectives.
-
To validate and test data for machine learning, I usually split the data into three sets: a training set, a validation set, and a test set. Use the training set to train the model, the validation set to fine-tune the model's parameters, and the test set to evaluate the model's performance on unseen data. Following these steps ensures that the model is not overfitting to the training data and that it can generalize to unseen data.
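A sketch of that three-way split with Scikit-learn; the 60/20/20 proportions are one common choice, not a rule from the article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set, then split the remainder 75/25
# to get a 60/20/20 train/validation/test split overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```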
-
Begin by assessing data quality, size, distribution, and balance. Use Scikit-learn's functions for unbiased validation. Employ PyTest for automated, robust testing, creating test cases and fixtures. Validate data transformations, ensuring they align with model needs. For monitoring and reproducibility, embrace MLflow: log metrics, parameters, and artifacts to facilitate model comparison. Analyze, iterate, and refine based on results. Test during preprocessing, feature engineering, training, and deployment, and validate the data's alignment with your objectives. This approach ensures informed decisions and enhances model robustness, empowering impactful machine learning.
-
Strategies like stratified splits or k-fold cross-validation should be employed for comprehensive testing. Tools from Scikit-learn, like SelectKBest, can help in prioritizing relevant features. For a robust assessment, synthetic data generation methods, including SMOTE, are recommended along with adversarial testing to gauge model resilience.
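A sketch combining those pieces, assuming the imbalanced-learn package provides SMOTE (the synthetic dataset and parameters are illustrative):

```python
from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)           # balance the minority class
X_top = SelectKBest(f_classif, k=10).fit_transform(X_res, y_res)  # keep the 10 best features
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_top, y_res):
    ...  # train and evaluate on each stratified fold
```

In a strict evaluation, SMOTE and feature selection should be refit inside each fold rather than before the split, to avoid leaking information into the validation folds.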
-
Test the efficiency of your code at handling larger quantities of data than a lab-condition load. Check whether the code still performs well when the data grows to 10-100 times the test volume. Initial results are often good, but once large pipelines are added, real-world scenarios can turn out slow and ineffective, which needs to be avoided.
-
Here is the correct process:
1. Split the dataset first and set your test set aside.
2. Fit your transformations on the train set.
3. Transform the rest of the data using the same parameters.
After fitting on the train set, you should reuse those parameters to transform the remaining data; for example, the min and max values calculated on the train set are used to scale the test samples, as in the sketch below.
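A sketch of that process with Scikit-learn's MinMaxScaler, where the min and max are learned from the train set only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max computed on the train set
X_test_scaled = scaler.transform(X_test)        # same train-derived parameters reused
```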
-
Although accuracy is important, it is also wise to use other outcome-testing measurements such as robustness, reliability, resilience, fairness, and explainability. A model can have high accuracy, yet not perform as well in production. PiML is a great Python package that helps compare these outcome-testing measurements across models.
-
When testing ML models, be sure to include production monitoring. A phenomenon that often stays undetected is training/serving skew, which can stem from the nature of the data itself or from differences between the technologies used at training and inference time. Data-related example: imagine you are training a defect detection model that will be deployed in a factory. You train on well-lit photos while the factory is poorly lit; the model's accuracy will take an instant hit once it is deployed, so be sure to monitor it. Technology-related example: you perform preprocessing for training in Python but perform preprocessing in production using Go. Differences in technologies can lead to unexpected results.
-
Test with dataset samples that reflect real-world distribution and scenarios - not just random samples. Run tests in pre-prod environments that closely match production infrastructure. Go beyond accuracy metrics - also test for biases, drift, graceful failure modes, etc. Implement canary deployments to test models on subsets of live data before full deployment. Monitor models post-deployment to confirm they work as expected in the real world.
-
To test data, we typically divide our dataset into training and test sets. This way, after training our model, we can evaluate its performance on data it hasn't seen before. Tools like Scikit-learn provide convenient methods for creating these splits. It's also essential to consider techniques like cross-validation, which offers a more reliable assessment of the model's consistency. Ultimately, the goal is to ensure our model doesn't just memorise the training data but can make accurate predictions on new, unseen data.
-
What is important while building a framework is to understand which data points make a real difference to the output and hypothesis. Sometimes outlier data points with little real impact tend to skew the results. Working with the business to understand the data points and their impact is critical.
-
One thing that's not commonly discussed is model drift, which is how models deteriorate over time. At regular intervals, it's good to test how the through-the-door population or dataset compares with the original development, test, and validation samples. A significant variance can presage model deterioration.
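One common way to quantify that comparison is the population stability index (PSI); this sketch and the usual ">0.25 means significant shift" rule of thumb are conventions added here, not from the article:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (development) sample and a new
    through-the-door sample for a single feature."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values beyond the baseline range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid division by zero and log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```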
-
ML models are only useful as long as they are solving our business problems. Be sure to monitor business metrics and their improvement or deterioration as you iterate on new versions of your model. There are multiple widely used approaches to online model testing, including A/B testing, interleaving experiments, and multi-armed bandits.
-
Validating and testing data for machine learning extends beyond technical processes. Cultural aspects like collaboration and mindset play a crucial role. Encourage open communication between data engineers and domain experts to refine validation rules effectively, and embrace an adaptable mindset. Consider an e-commerce scenario: a model for customer preferences failed because it didn't validate seasonal trends. Collaborative effort could have surfaced the importance of temporal validation earlier. Stories like this highlight the essence of understanding data's context. Don't rely solely on tools; contextual insights matter.
-
Ensure that the data is free from biases, as biased data can lead to unfair or discriminatory model outcomes. It's a blend of technical insight and a deep understanding of the data origin and implications.
-
Involve business stakeholders in defining requirements and success metrics. View validation and testing as iterative, not one-off tasks. Expect to refine and add tests over time. Automate as much as possible for efficiency and consistency. Treat data quality as a feature of the overall system - not just a preprocessing step.