In the comments under articles like “The Triple A principle”, I often get questions like “we have trouble with overlapping user data in our tests”, or how I would handle X tests that each need their own seeded data.
Over time I found out that the way you set up data for your automated checks can have a big impact on their performance, reliability, maintainability and resilience, and it deserves a topic of its own.
I even have a small task that I give to people in automation interviews. It goes like this:
Imagine you have an e-commerce app that you must write automated checks for. The requirement is to write checks for the checkout functionality. In order to do so, you need a user to be logged in, but to actually perform checkout, you will also need that user to have 3 sets of data – personal data (first name, family name, etc.), address data (where you will deliver the goods to) and payment data (credit card, PayPal token, bank account data or other).
The question is – how would you set up data for your checks, so you can successfully automate checkout? And why did you choose that specific approach?
If it’s easier for you, you can imagine the three sets of data simply as 3 database tables you need to fill with valid data.
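If I had to make the shape of those tables concrete, a minimal sketch might look like this (all field names are my own illustration, not part of the task):

```python
from dataclasses import dataclass

# Hypothetical shapes of the three data sets from the task;
# the field names are illustrative only.

@dataclass
class PersonalData:
    first_name: str
    family_name: str
    email: str

@dataclass
class AddressData:
    street: str
    city: str
    postal_code: str
    country: str

@dataclass
class PaymentData:
    method: str  # e.g. "credit_card", "paypal", "bank_account"
    token: str   # card token, PayPal token, IBAN, etc.
```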
I will also use this task as an example when I comment on different approaches and their pros and cons.
A bit of context on the task
I consider this a classic automation scenario. With this question I want to see, first of all, whether the person has practical experience with automation. If they do, it won’t be a problem for them to provide at least two working solutions; if they don’t, they are either too scared and stressed, or they’ve done it the wrong way.
The worst possible way of data seeding…
For contrast, and for fun, I am going to start with the worst possible way a person can take. Say we have the app up and running, and we only need to create the user and log in to do our job.
The worst possible way is to register the user through the UI before each test or test suite you run.
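To make the anti-pattern concrete, here is a minimal sketch of what it could look like with Playwright for Python – the URL, selectors and credentials are all invented for illustration:

```python
import uuid

import pytest
from playwright.sync_api import sync_playwright

REGISTER_URL = "https://shop.example.com/register"  # hypothetical URL


@pytest.fixture
def fresh_user():
    """Registers a brand-new user through the UI before EVERY test.
    This is the slow part: a full browser flow runs once per test."""
    email = f"user-{uuid.uuid4()}@example.com"
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(REGISTER_URL)
        page.fill("#email", email)        # invented selectors
        page.fill("#password", "s3cret!")
        page.click("#submit")
        browser.close()
    yield {"email": email, "password": "s3cret!"}
```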
Pros:
- This will work very well, almost all the time.
- You have a fresh user, no dependencies introduced.
Cons:
- If you create a user per test, this will be very, very slow.
- If you create it before a suite, you will have to manage inter-test dependencies (where one test finished and where the next one carries on).
- In addition, you will run into further complications if you need to parallelize tests, as they will all depend on the same user. Managing state in this case will be a mess.
A slightly modified worst possible way…
I guess the other worst possible manner is to create the data outside of the tests. For example, in my testing environment I have a user called [email protected] that I know has all the necessary data, so I simply run all my tests logging in with [email protected].
Pros:
- I don’t care about data creation and I am free to write checks.
- They will work, because I have a working test user.
Cons:
- If I decide to run the tests on a fresh instance/DB or in CI, I won’t have [email protected] and all tests will fail.
- I will have manual tasks to deal with every time the DB is wiped. We don’t want this.
- We create a case that is too sterile, lacking real-life context. Nothing about that user is ever going to change, and that’s a bit scary.
- Using the credentials we normally test and explore with might be a very bad idea. I don’t know about you, but because of all the bugs and little tweaks I make with my users, they are a fucking mess, like Frankenstein’s monster. Not very reliable in tests.
- Plus, the data under this user – orders, for example – will grow with every test run. What are you going to do about that?
The tricky part in data seeding
I hope you realize there’s no single right answer here; there are rather less wrong answers. In fact, it all depends on various factors.
Factors you need to account for before you start seeding data for automated checks
Here are some that I encountered in my own experience:
- What’s your infrastructure? – It definitely matters whether you are running a lightweight testing environment like Docker Compose or virtual machines, compared to a multi-component setup that has to be cared for every time you deploy. In addition – how big and how complex is the application you are testing? Do you really need to set it up from the ground up, or is there a way to reuse some part of it? Can you use some sort of caching to provide data that doesn’t change frequently – external libraries, for example?
- What type of platform are you testing? – Not all platforms can be tested the way we test the web. If you are an experienced tester, that should go without saying. For example, the pattern of resetting and seeding data before every test or suite might be unthinkable for mobile or desktop testing, although it’s pretty straightforward in API testing. It would be very naïve to use an inappropriate approach for running your checks.
- How often are you running checks? – This might vary depending on your overall process and your company’s release schedule. Anyway, the data seeding strategy will be totally different if you are doing a weekly build, a daily or nightly one, or running a CI/CD build on every commit to master 20 times a day. As I have said many times, and as a general rule in testing: never simply grab something that works for someone else and copy it blindly – it doesn’t work like that, nothing in life does.
Let’s start from the end – state is king!
Before I list some approaches, let’s start from the end. One thing you need to figure out once and for all, in your context and for your particular case, is whether or not you are going to reset the test data you already have – either by restoring it to a particular state or by deleting it completely.
Why would you reset the data, anyway? One simple reason – state. The state of your data is vital to having reliable checks running all the time. If your checks don’t always start from a fixed, definitive state, you will start introducing what many people call “flakiness”, which is in fact just very, very crappy check design and bad data management.
Why would you like to reset test data?
Simply to have a known good state from which your tests will start. It could be a DB image, a script or anything else.
Why would you like to completely delete it?
Maybe because each of your tests creates its own data and you don’t really care whether other data persists or not. Anyway, there might be some specific cases here; let’s discuss them in depth a bit later.
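As a minimal sketch of the “reset to a known state” option discussed above, a session-level pytest fixture could restore a database snapshot before the whole run – the dump file, database name and tooling below are assumptions:

```python
import subprocess

import pytest


@pytest.fixture(scope="session", autouse=True)
def restore_known_state():
    """Restore the DB to a known good snapshot before the run starts.
    Assumes a dump created earlier with: pg_dump -Fc shop > baseline.dump"""
    subprocess.run(
        ["pg_restore", "--clean", "--dbname=shop_test", "baseline.dump"],
        check=True,
    )
```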
Important notice
Deleting or resetting data after each test run might be a time-consuming operation, depending on the size of your database (and perhaps the write speed of the underlying infrastructure); that’s why I said earlier it’s something you should account for. Some companies and testers claim they gained a 20-30% speed improvement in their build simply by removing that step, so be careful – this might cost a lot.
A couple of different approaches to data seeding from my experience
I will list a few approaches I have used over the years and the potential issues that might occur – or actually occurred – while using them.
The most trivial one – SQL scripts
Like it or not, that’s probably the first thing a tester will come up with. As I said, it’s 3 tables after all, so it’s simple to just write a couple of files with insert statements. This will be fast and easy; you will create the data every time you need it.
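A minimal sketch of the idea, using Python’s built-in sqlite3 just to show the mechanics – the table and column names are invented to match the interview task:

```python
import sqlite3

# A couple of insert statements, exactly the kind of seed script a
# tester would write first (the schema is assumed to already exist).
SEED_SQL = """
INSERT INTO personal_data (id, first_name, family_name) VALUES (1, 'John', 'Doe');
INSERT INTO address_data  (user_id, street, city)       VALUES (1, '1 Main St', 'Sofia');
INSERT INTO payment_data  (user_id, method, token)      VALUES (1, 'credit_card', 'tok_123');
"""

def seed(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.executescript(SEED_SQL)  # runs every statement in the script
```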
Problems:
- You must definitely keep the SQL scripts in a source control system.
- These scripts will become very brittle; every time a change is introduced to the DB or to a table you use, they can break. Imagine – every time a new required field is added to that table, you will have to update your scripts.
- It works well with a small DB, but over time more data will be added, and it might have dependencies such as foreign keys to other tables; these are not obvious until you run the script. It will quickly turn into a nightmare.
- Last, but not least – it is very tempting to base your tests on a user you insert in the scripts. This might look like a good idea, but look at it from an OO design perspective: your test X will run and perform its job only if a hidden dependency on a record in script Y is fulfilled. I don’t pretend to be a programming guru, but it’s a smelly approach to me.
Using data fixtures
Using data fixtures is a slight improvement over direct SQL scripts, but I can see a lot of similarities in the approach.
What this approach resolves is that it sort of decouples you from the concreteness of the database, as data fixtures can be handled at a higher level – by an entity manager, for example.
In short, data fixtures represent entities or objects that you seed at some level (per test, per suite, before all, etc.), combined with some clean-up mechanism.
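With pytest, for example, the same idea might look like this – `user_repository` is a hypothetical higher-level helper, not a real library:

```python
import pytest


@pytest.fixture
def checkout_user(user_repository):
    """Seed a user with the three data sets, then clean up after the test."""
    user = user_repository.create(
        personal={"first_name": "John", "family_name": "Doe"},
        address={"street": "1 Main St", "city": "Sofia"},
        payment={"method": "credit_card", "token": "tok_123"},
    )
    yield user
    user_repository.delete(user.id)  # the clean-up mechanism
```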
Problems:
- This doesn’t resolve the coupling between data and tests; you can still write tests that require one specific entity.
- Although creating objects is easier than typing SQL insert statements, sometimes creating new fixtures might require assistance from a dev, or a dev’s level of competence.
- You might be tempted to write checks like “check that 5 items are displayed”, which will break the first time somebody adds something to the fixture. Checks failing for any reason other than finding a problem are something we normally don’t want to see.
Every test for himself approach
This is perhaps the approach I sympathize with the most (maybe it’s a bias of mine; most likely it is). It is inspired by the test isolation (or hermetic) principle, which states that each test or suite should be responsible for the data it requires – e.g. test A will create the data it needs to run, test B will create what it needs, and so on. The creation part often happens via some external tool, a specific fixture invocation or similar.
This way, data creation remains the responsibility of the test class, which satisfies our (or at least my) single-responsibility OCD.
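A minimal sketch of the principle, where the test itself creates everything it needs through a hypothetical HTTP API before exercising checkout (the endpoints are invented):

```python
import uuid

import requests

API = "https://shop.example.com/api"  # hypothetical endpoint


def test_checkout_with_own_data():
    # The test owns its data: it creates a unique user plus everything
    # checkout requires, so it depends on no other test or seed script.
    email = f"user-{uuid.uuid4()}@example.com"
    user = requests.post(f"{API}/users", json={"email": email}).json()
    requests.post(f"{API}/users/{user['id']}/address",
                  json={"street": "1 Main St", "city": "Sofia"})
    requests.post(f"{API}/users/{user['id']}/payment",
                  json={"method": "credit_card", "token": "tok_123"})

    response = requests.post(f"{API}/users/{user['id']}/checkout")
    assert response.status_code == 200
```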
Of course, this is not a silver bullet either – it has its own issues.
Problems:
- This approach will quickly become a huge problem if your database is large and you need a lot of data. It will be complex to add all the required entities or data needed to execute a specific case.
- Having a good tool to create the data will either require you to implement a data creation process that mimics production – which might mean it has its own bugs – or force you to spend a significant amount of time rebuilding something that already exists. A workaround might be to ask a dev for help, so they can provide you with a tool that reuses the production logic.
- You will need to be very clear about what you are going to do with your data – deleting it after each test might not be a great idea, while leaving it might mean you have to make each record of the same type unique so they don’t conflict. Leaving it until the test execution is over might also be a problem if you have a lot of tests – but then, everything is a problem if you have lots of tests.
- Also, this approach is not a good idea if you have lots of, let’s call it, operational data. By operational data I mean data that is not related to your specific check but is required for your app to operate. In our e-commerce example this might be data related to locations or storefronts, currencies and similar. It would mean you might need to import it for each test, when what you in fact need is to have it in the DB before the test is even initiated, and only deal with the data related to your checks.
Spaghetti approach
All the above approaches assume you are willing to run tests in random order. You might want to do this to make sure there are no tiny invisible “strings” between your checks – implicit dependencies that you forgot about. Anyway, there are cases, as I mentioned, where you would deliberately chain tests in a specific order. This might seem unreasonable, but as I said earlier – not every platform is the web, and not every configuration allows a quick restore of state. Examples might be mobile, where resetting the whole emulator or device state might be slow, or desktop apps.
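Just to illustrate what such deliberate chaining looks like (a toy example with plain pytest and shared module state; real projects usually enforce the order through the framework or a plugin):

```python
# These tests intentionally run top to bottom and share state -
# test_checkout only passes if the two tests before it already ran.
state = {}

def test_register():
    state["user_id"] = 42  # pretend the app returned this id

def test_add_address():
    assert "user_id" in state        # an invisible "string" to the previous test
    state["address_set"] = True

def test_checkout():
    assert state.get("address_set")  # and another one
```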
Problems:
- It seems to me that this approach is even more complicated than the other ones, mainly because you have to manage state even more closely.
- While in the approaches above you only care that all tests start and end in the same state, here you have to make sure every test starts in the unique state the previous one left behind, which then leads to another check, and another. This is way more complicated and might cause more problems than it solves.
Combined approach
As you can probably see, there is no “one approach to rule them all”, and even though some of us might like one approach over another, it’s all about adapting to the specific case and context. Therefore, combining some of these approaches might be useful.
For instance:
- It might be a good idea to have a trimmed database with which to start your DB instance, so you only build the data essential for your checks.
- If you cannot do this, it might be a good idea to make a low-level fixture to populate such data and manage the rest per test, as in the sketch below.
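Sketching that combination (the dump file and the `api_client` helper are assumptions): operational data is loaded once per session from a trimmed dump, while each check still creates its own user.

```python
import subprocess

import pytest


@pytest.fixture(scope="session", autouse=True)
def operational_data():
    """Load locations, currencies, storefronts once per run - this data
    never changes during execution, so no point recreating it per test."""
    subprocess.run(
        ["pg_restore", "--clean", "--dbname=shop_test", "trimmed_baseline.dump"],
        check=True,
    )


@pytest.fixture
def own_user(api_client):  # api_client is a hypothetical helper fixture
    # Per-test data still follows the "every test for itself" idea.
    return api_client.create_user_with_checkout_data()
```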
Practically not a data seeding approach
I have only heard about this approach – never used it, never saw it – so treat it with all the scepticism you’ve got.
You can totally decouple checks from data creation if you split them into two separate applications. One will sync data from your production DB to make sure you always have 20-50 fresh, accurate client records. It will also have to take care of trimming/replacing sensitive client data, such as data protected by GDPR or other regulations.
In this case, your tests will only take care of picking one of these users at random and executing the check with it.
I imagine this will produce the most realistic tests, in the sense of being closest to production user data.
I also think this might be useful for companies with a large infrastructure that is not easily restorable to a specific state.
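If I had to sketch the sync-and-anonymize part (purely speculative, since I have never built this), it might look something like:

```python
import uuid

def anonymize(record: dict) -> dict:
    """Replace GDPR-sensitive fields before copying a prod record to test."""
    clean = dict(record)
    clean["email"] = f"user-{uuid.uuid4()}@example.com"
    clean["first_name"] = "Test"
    clean["family_name"] = f"User-{record['id']}"
    clean["payment_token"] = "tok_test_0000"
    return clean

def sync_users(prod_db, test_db, limit: int = 50) -> None:
    # prod_db and test_db are hypothetical DB clients
    for record in prod_db.fetch_recent_users(limit=limit):
        test_db.upsert_user(anonymize(record))
```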
Problems:
I never used this, so I will only list the things I believe might be a problem:
- There will be a significant engineering effort to build the ecosystem around such an approach.
- You will need to do a deep analysis of the personas your users represent and how the data you are pulling satisfies them. If this is a big application with a large user base, where each user has various configurations, the 20 to 50 records listed above might not be enough.
- It will be a tough decision whether to execute one test with one random user, or all tests with all user profiles – which would mean the total number of your tests becomes the Cartesian product of user personas × tests × different configurations, etc.
Some win and fail points from my experience
- Every time I used hard-coded data (either fixtures or SQL scripts), it bit my ass later. It might not have been right away, but it definitely happened.
- We built a very good internal tool on one of the projects I was testing, which we used for data creation. It reused the business logic, exposing it through a PHP package that we consumed. However, it required a full-time senior developer working on it for a couple of days, plus additional time to add new features every time we changed or released something new.
- Deleting test data when you run your database in a container is a problem that solves itself – you just destroy the container after all the tests have run (see the sketch below). Of course, you will need to make sure all checks create unique users and data, so you don’t get failed checks due to duplicates.
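For example, with the testcontainers library for Python the throwaway database is just a context manager – when the block exits, the container and every row written during the run are gone (`run_all_checks` is a hypothetical entry point):

```python
from testcontainers.postgres import PostgresContainer

# The container - and all test data inside it - disappears as soon
# as the `with` block exits; no clean-up scripts are needed.
with PostgresContainer("postgres:16") as postgres:
    url = postgres.get_connection_url()
    run_all_checks(database_url=url)  # hypothetical test-run entry point
```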
Conclusion
I hope this lengthy exposé helps you see that data seeding is not a simplistic problem, in the same way that testing itself is not a simplistic problem. It depends on many variables, and the approach you pick should account for all of them.
Good luck! 😉