Conversations with Young Lives' Data Managers: Part Two
Ahead of the publication of our new report on data management, we sat down to speak with some of Young Lives’ Data Managers, past and present, about their experiences of the role.
This is the second conversation in the series, with Anne Yates Solon (International Data Manager 2007-2018), Monica Lizama (Peru’s Data Manager), Tien Nguyen (Vietnam’s Data Manager), Shyam Sunder (India’s Data Manager) and Hazel Ashurst (Data Coordinator, Oxford 2011-2013).
Catch up with the first part here - Conversations with Data Managers.
How did you ensure that the data captured was of the highest quality?
Tien Nguyen: We did double data entry and were very diligent.
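As an illustration of double data entry, the comparison step can be sketched in a few lines of Python. This is a minimal sketch and not the Young Lives tooling: the child IDs, field names, and the `compare_entries` function are hypothetical, and it assumes each keying pass is available as a dictionary keyed by child ID.

```python
# Hypothetical example: two independent keyings of the same paper forms,
# each a dict mapping child ID -> {field name: entered value}.
def compare_entries(first_pass, second_pass):
    """Return a sorted list of (child_id, field, first_value, second_value) mismatches."""
    discrepancies = []
    for child_id in sorted(set(first_pass) | set(second_pass)):
        rec_a = first_pass.get(child_id, {})
        rec_b = second_pass.get(child_id, {})
        for field in sorted(set(rec_a) | set(rec_b)):
            value_a = rec_a.get(field)
            value_b = rec_b.get(field)
            if value_a != value_b:
                discrepancies.append((child_id, field, value_a, value_b))
    return discrepancies


first_pass = {
    "CH001": {"age_months": 12, "hh_size": 5},
    "CH002": {"age_months": 14, "hh_size": 4},
}
second_pass = {
    "CH001": {"age_months": 12, "hh_size": 5},
    "CH002": {"age_months": 41, "hh_size": 4},  # digits transposed on second keying
}

# Each mismatch would be resolved by going back to the paper form.
for row in compare_entries(first_pass, second_pass):
    print(row)
```

A record or field that appears in only one pass is also reported, with `None` standing in for the missing value.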
Hazel Ashurst: Fieldworkers checked the survey data at the time of entry. Another level of data checking happened when fieldworkers uploaded data at the end of the day, when it went to local supervisors who then checked it. The supervisors were able to catch problems quickly, and the fieldworkers could then go back to ask the family to clarify.
The local field supervisor’s key role was to do local data checks, then send the data to Country Data Managers, who then sent this on to Oxford.
Due to the nature and volume of the data, it wasn’t possible for it to be perfect. Sometimes there were still problems. There were also multiple surveys: child, household, community questionnaires etc. But on the whole the data quality was very good.
Monica Lizama: We always have a couple of months of training for the Enumerators before they go into the field. They practice using Computer Assisted Personal Interviewing (CAPI) and understand how to deal with errors. We also sometimes return to families to confirm information. It doesn’t happen often, but sometimes we do return to the field to clarify some answers.
Hazel: The questionnaires were verified in Oxford when setting up SurveyBe, and by Tien for Vietnam.
We used checklists for data management. In Oxford we had a grid of who would set up which questionnaire and who would double-check it. The grid included each country and each survey. We had to make sure countries had the most up-to-date version. Once data collection had started, we tried very hard not to change a questionnaire. One time we had to make a change midway through, which was unfortunate. Usually we were able to avoid it.
We also constructed databases with built-in validation rules. We used closed questions as much as possible and dropdowns for a single response. This minimized the possibility of an incorrect or out-of-range answer. But there were complicated skip patterns. We also had to program links to the household roster, cognitive and non-cognitive tests, the community questionnaire and the mini community questionnaire. Throughout, the Young Lives’ child ID was the most important variable.
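To make the idea of built-in validation rules and skip patterns concrete, here is a minimal Python sketch. It does not reproduce the actual SurveyBe configuration: the question names, answer codes, age range, and the `validate` function are all invented for illustration.

```python
# Hypothetical question definitions; names, codes, and ranges are invented.
ALLOWED_CODES = {"attends_school": {0, 1}, "school_type": {1, 2, 3}}
RANGES = {"age_years": (0, 25)}


def validate(record):
    """Return a list of human-readable problems found in one interview record."""
    problems = []
    # The child ID links every questionnaire, so it must always be present.
    if not record.get("child_id"):
        problems.append("child_id is missing")
    # Range checks for numeric answers.
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is None or not low <= value <= high:
            problems.append(f"{field}={value!r} outside {low}-{high}")
    # Closed questions: only listed codes are accepted.
    for field, codes in ALLOWED_CODES.items():
        value = record.get(field)
        if value is not None and value not in codes:
            problems.append(f"{field}={value!r} is not an allowed code")
    # Skip pattern: school_type is asked only when attends_school == 1.
    attends = record.get("attends_school")
    if attends == 1 and record.get("school_type") is None:
        problems.append("school_type required when attends_school is 1")
    if attends == 0 and record.get("school_type") is not None:
        problems.append("school_type should be skipped when attends_school is 0")
    return problems


clean = {"child_id": "CH001", "age_years": 12, "attends_school": 1, "school_type": 2}
faulty = {"child_id": "CH002", "age_years": 40, "attends_school": 0, "school_type": 3}

print(validate(clean))
print(validate(faulty))
```

In a CAPI program, checks like these would run at the moment of entry, so the fieldworker can correct the answer while still with the respondent.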
How did you balance Young Lives’ commitment to open access with anonymization of data (guaranteeing confidentiality in the long term)?
Tien: The Oxford team is in charge of hiding all of the personal information related to the respondent before archiving the data. We submit all the data to Oxford. Oxford required us to destroy all of the paper surveys, which we hired a company to do.
Anne Yates Solon: We removed any personal identifiers. Anything to do with names, locations, dates of birth, school names, any site level names were all removed. We had to strip addresses, GPS, all of that out of the data before we made it public.
We also didn’t want anyone emailing confidential data, so we used a secure, password-protected, encrypted server for sharing files. The CAPI program ensures the interview is built into an encrypted file that you can’t access without the appropriate software. So, if a tablet was stolen, someone wouldn’t be able to just open up the interview data. You had to have a specific software package to read the file. Only a few Country Data Managers and I had that software.
Monica: We were able to manage the aspects of confidentiality and anonymization well. We were always able to maintain confidentiality because the technology and our processes of sharing information are secure. Information such as names and personal identifiers are always anonymized. Anne Yates Solon managed this aspect very well for all four countries. As Country Data Managers we respected and complied with the processes she developed for us to follow.
Hazel: We worked very hard on that. When data was in Oxford, we had two versions, non-anonymous and anonymous, where we took out any names etc. We were extremely diligent about this.
Shyam Sunder: Anonymization of data is mandatory in the project. All the data is anonymized below district level and no one but senior researchers in the project and myself have access to the codes; they are in safe custody. For external users, only anonymized data is supplied.
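The stripping of personal identifiers described in the answers above can be sketched as follows. This is an illustration only: the field names are hypothetical, the list of identifiers is an assumption based on the categories the Data Managers mention (names, dates of birth, addresses, GPS, school and site names), and the district is kept as the lowest location level, following Shyam's description.

```python
# Hypothetical field names; the real datasets' variables are not given here.
DIRECT_IDENTIFIERS = {
    "name", "date_of_birth", "address", "school_name",
    "site_name", "gps_lat", "gps_lon",
}


def anonymize(record):
    """Return a copy of the record without direct identifiers."""
    return {key: value for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS}


raw = {
    "child_id": "CH001",          # study ID is retained so datasets still link
    "name": "Example Name",
    "date_of_birth": "2001-05-17",
    "address": "1 Example Street",
    "gps_lat": 0.0,
    "gps_lon": 0.0,
    "district": "District A",     # lowest location level kept in this sketch
    "hh_size": 5,
}

public = anonymize(raw)
print(sorted(public))
```

The original record is left untouched, which mirrors keeping a non-anonymous version under restricted access alongside the anonymized version supplied to external users.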
Monica: We follow protocols to manage and keep the data safe. We receive many enquiries, particularly from students who want data for their thesis research; however, we cannot share data with any personal information. A risk is that we share data with a person, and they share it with others, breaking the confidentiality of Young Lives' participants.
We also have a bank of photos of the Young Lives' children in and around their homes, and with their families, starting from Round 1. We share them with the families, but it’s also very helpful for us during tracking. We can show the photo to people in a town or village and ask where the child lives, if we cannot find the young person. To ensure anonymization, I’m the only one who manages the photo bank and we mainly use it for printing the photos as gifts for the families. The families are very happy to receive the photos.
Did you have a different management process for managing the qualitative data?
Hazel: The qualitative data management process was slightly different. We built a set of guidelines for how to manage that data, and name files, but we never publicly archived any of those because the data are considered sensitive. We wanted to ensure that any data set that we collected was easily linked and identifiable with all the other data sets we collected. We did the school survey, a child questionnaire, a household questionnaire etc. It all needed to link at the community, household and individual level.
Anne: Yes, because we didn’t use databases for it. We built a set of guidelines for how to name files, based on the type of interview and the method used. I built guidelines for how to make the data confidential when it was transcribed. We had discussed archiving qualitative data. But I wasn’t comfortable archiving it, unless I read every transcript to ensure there was no personal information in it. To do that would require years of someone’s time across four countries and multiple rounds. You would also have to have local knowledge, so you knew local names of places.
As the rounds went on, how did you manage large and increasing volumes of data (survey and qualitative)?
Hazel: In the beginning with the paper survey, I wasn’t a part of it, but they would take the data, enter it into Access screens, then export it into quantitative analysis software like Stata or SPSS. CAPI streamlined this process. In terms of qualitative data, I didn’t do as much. I mostly helped keep track of audio files.
Anne: We came up with a system for naming the variables, so if you asked the same question at every round it had the same variable name, just with a round identifier attached to it. So that made it easier. That also helped with translating questions: we could reuse the existing translation instead of re-translating across rounds, which might have produced slightly different wording.
With this volume of data, inconsistencies show up across rounds. For instance, in Round 1 a child might have a mother, then in Round 2 she might be reported as deceased, then in Round 3 she might be reported as alive again. So, that’s where collecting more data makes a difference. On the one hand it’s a bit more annoying, because you have to go back and make sure all the data are consistent across rounds, but on the other hand it’s good, because if you only had two rounds you wouldn’t know which one was right. Whereas by the time you got to Round 4 you could say, “Oh yeah, she’s alive, because she’s still answering questions, so obviously the data was wrong in Round 2.” So, the data by Round 5 was a lot cleaner. CAPI also reduced the number of errors, thanks to the data checks in the field.
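Anne's two points, a shared variable name with a round identifier and cross-round consistency checks, can be sketched together in Python. The `_r<round>` suffix, the variable name `mother_status`, and both helper functions are assumptions made for this example, not the actual Young Lives naming convention.

```python
def round_variable(base_name, round_number):
    """Build a variable name with a round identifier, e.g. mother_status_r2 (assumed scheme)."""
    return f"{base_name}_r{round_number}"


def flag_status_reversals(record, base_name, rounds):
    """Return rounds where someone is 'deceased' but reported 'alive' in a later round."""
    # Collect the answer to the same question from every round that has it.
    status = {r: record[round_variable(base_name, r)]
              for r in sorted(rounds)
              if round_variable(base_name, r) in record}
    ordered = sorted(status)
    suspect = []
    for i, rnd in enumerate(ordered):
        if status[rnd] == "deceased":
            if any(status[later] == "alive" for later in ordered[i + 1:]):
                suspect.append(rnd)
    return suspect


household = {
    "child_id": "CH001",
    "mother_status_r1": "alive",
    "mother_status_r2": "deceased",  # contradicted by later rounds
    "mother_status_r3": "alive",
    "mother_status_r4": "alive",
}

print(flag_status_reversals(household, "mother_status", [1, 2, 3, 4]))
```

With only Rounds 1 and 2 available, nothing would be flagged, which matches the point that two rounds alone cannot show which answer was wrong.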
Were there ever inconsistencies that distorted the research results?
Tien: Yes, sometimes. For example, we asked about income. But Vietnamese currency is in the millions and we wanted to cut it to thousands to save time for the fieldworker. But occasionally they forgot and put in the full number. So, at certain points the household income would be in the millions, and it skewed the data. For example, if the income was 10 million, we asked them to only put 10. It was a lot of work to go and clean the data. Sometimes I could not fix it by myself, because I didn’t know the range of the income there. We would have to ask the fieldworker.
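A check for this kind of unit error might look like the sketch below. It only flags values for review and does not correct them, since, as Tien notes, the plausible income range had to come from the fieldworker. The ceiling of 1,000 shortened units, the child IDs, and the `flag_unit_errors` function are assumptions for illustration.

```python
def flag_unit_errors(incomes, ceiling):
    """Return child IDs whose income, meant to be in shortened units, exceeds the ceiling."""
    return sorted(child_id for child_id, value in incomes.items() if value > ceiling)


# Hypothetical household incomes, intended to be entered in shortened units
# (for example 10 rather than 10,000,000).
incomes = {
    "CH001": 10,
    "CH002": 8,
    "CH003": 12_000_000,  # full number entered by mistake
}

# Assumed plausibility ceiling; in practice this would come from local knowledge.
to_review = flag_unit_errors(incomes, ceiling=1_000)
print(to_review)
```

Flagged households would then be referred back to the fieldworker rather than rescaled automatically.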
Anne: Sometimes we would build a questionnaire in Oxford in English, then send it to the country teams to translate. Early on we just trusted that their translations were accurate. But in one country, in Round 3 or 4, there was a food frequency section on how often you eat certain things, and they hadn’t translated an important item, so it was left out of the program. So, it looked like respondents weren’t eating that much. Someone later used Google Translate and realized the item hadn’t been included, which skewed the result massively. There was no way to go back and fix that.
I understand that Young Lives collected data in 11 different languages. How did this affect data collection and management?
Anne: It was different in each country. In Peru we translated CAPI into Spanish. In Ethiopia there are seven different languages, so we left CAPI in English, and our fieldworkers would translate on the spot in front of the respondent. In India we translated it into Telugu. In Vietnam it was in Vietnamese. For the minority languages in Vietnam, again the fieldworker translated it on the spot into the local language.
Tien: Surveys were in English and Vietnamese. A team helped with translation and I programmed SurveyBe based on their translations. When the respondent could not speak Vietnamese, we used local interpreters to help the fieldworkers with translation.
Shyam: In India the questionnaire/program is bilingual, in English and Telugu. However, in the border states and the capital city of Telangana state we have to administer the survey in local languages, i.e. Oriya, Urdu and Kannada. We have arranged separate paper questionnaires in these local languages for administration.
Hazel: There are standard variable names, so translation didn’t need to be done once the data was input into CAPI and we minimized the number of free text fields.
Tien is technically strong, which helped when creating the survey in CAPI/SurveyBe. He set up his own screens in SurveyBe. This was a lifesaver! He helped with language issues.
In India the Data Manager would send us translations and CAPI had to be set up in Oxford. It was time consuming to copy and paste all of the translated questions into Oxford’s SurveyBe.
Monica: In the beginning, the children were very young, so the survey was generally conducted with the mother. Often, mothers spoke their mother tongue. For example, many parents speak Quechua, an indigenous language in Peru. In the rainforest there are other dialects. So, it was challenging in the early rounds when mothers spoke in their many different languages. But now the interviews centre around the young person’s answers, and the young people all speak Spanish, so it’s easier.
Often the children learn their mother tongue at home, but need Spanish for school. But in the beginning we needed to hire Enumerators who could translate languages such as Quechua, which is more widely spoken in the Cusco region, for example.
So, the Enumerator would ask questions in Quechua, translate on the spot and record the answers into Spanish. Sometimes, we would need to hire interpreters to work with the Enumerators, especially in the rainforest where there are different dialects. But since the third round, almost all interviews are in Spanish.
Leading up to the release of our data management report, we will be publishing further conversations with our data managers. You can read the first in the series here: Conversations with Data Managers.
Due to COVID-19, our in-person research for Round 6, Young Lives at Work, has been delayed until 2021, and we have adapted our processes to deliver a telephone survey that will provide rapid new research data and insights into COVID-19-related impacts. With DFID's support, Young Lives' longitudinal research will continue into 2025.
 Double data entry is a method of data checking in which data are transcribed from the paper survey into an electronic format twice. The results are then compared, and any discrepancies found between the two data sets are addressed to ensure that data were entered correctly.
 Surveybe is a data collection and management software that utilizes computer-assisted personal interviewing (CAPI).
 The Young Lives Child ID is a unique identifier that was assigned to each child in the study. This ID was retained throughout all survey rounds to track the child.