Iteration through rows of pandas dataframe: Breaking out educational details

3 March 2024 at 18:08 174

I am having trouble iterating through the rows of a pandas DataFrame in Python. I read data from an Excel sheet with pandas, iterate over it with df.iterrows(), and process each row further to automate a workflow with Python/Selenium. Each row of the spreadsheet describes one job applicant, with the applicant's attributes captured in separate columns. Because a person can hold multiple degrees, the educational details are spread across numbered columns: degree1, specialisation1, college1, degree2, specialisation2, college2, and so on, up to a maximum of 5 qualifications. During iteration, I want to break out the education columns and loop through them as (degree, specialisation, college) records for one person. How can I accomplish this? I am attaching a GitHub link to the sample data for reference; it contains the same data as pasted below:

# Sample data
sr_no,old_emp_id,name,address,mobile,degree1,specialisation1,college1,degree2,specialisation2,college2,degree3,specialisation3,college3,emp_status
1,24,Amit,ABC Road,356363474,Computer Science,Robotics,IIT Delhi,MSC,ML,MIT,PHD,AI,Harvard,full-time
2,34,Samit,Xyz Road,367474748,Bachelor of Arts,Economics,Delih Univ,Masters of Eco,Internatioal Relation,Delhi Univ,PHD,Foreign Trade,Delhi Univ,part-time
3,56,Richard,PTC Street,363637677,Bsc,Biology,Mumbai Univ,Masters of Science,Microbiology,Mumbai Univ,PHD,Communicable disease,Mumbai Univ,part-time

I tried grouping the columns with the custom function below, but it does not give the desired results.

import pandas as pd

# Custom function to group attributes differently
def group_attributes_diff(df):
    new_data =[]
    for i in range(0, len(df),35):
        candidate_info={}
        for j in range(i,i+7+1):
            row = df.iloc[j]
            degree_name = row['degree_name' + str(int(j-i)//7+1)]
            specialisation = row["specialisation" + str(int(j - i)//7 + 1)] if "specialisation" + str(int(j - i)//7 + 1) in row else None
            course_start_date = row["course_start_date" + str(int(j - i)//7 + 1)] if "course_start_date" + str(int(j - i)//7 + 1) in row else None
            course_end_date = row["course_end_date" + str(int(j - i)//7 + 1)] if "course_end_date" + str(int(j - i)//7 + 1) in row else None
            marks_grades = row["marks_grades" + str(int(j - i)//7 + 1)] if "marks_grades" + str(int(j - i)//7 + 1) in row else None
            university = row["university" + str(int(j - i)//7 + 1)] if "university" + str(int(j - i)//7 + 1) in row else None
            course_type = row["course_type" + str(int(j - i)//7 + 1)] if "course_type" + str(int(j - i)//7 + 1) in row else None
            education={
                "degree_name":degree_name,
                "specialisation":specialisation,
                "course_start_date":course_start_date,
                "course_end_date":course_end_date,
                "marks_grades":marks_grades,
                "university":university,
                "course_type":course_type
            }
            candidate_info["Education "+str(int(j-i)//7+1)] = education
        new_data.append(candidate_info)
    return pd.DataFrame(new_data)
test_df = group_attributes_diff(df.copy())
# to_excel() returns None, so write the file without wrapping the call in print()
test_df.to_excel('education.xlsx')
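
For comparison, here is a minimal sketch of the loop described in the question: for each applicant row, walk the numbered column groups (degree1/specialisation1/college1, degree2/..., up to 5) and yield one record per filled-in qualification. The helper name iter_education, the constants, and the small stand-in DataFrame are assumptions made for illustration. The DataFrame is trimmed from the sample data, with one row altered to show an empty slot. The real one would come from pd.read_excel.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: trimmed from the sample data,
# with the second applicant's degree2 group left empty for illustration.
df = pd.DataFrame(
    [
        [1, 24, "Amit", "Computer Science", "Robotics", "IIT Delhi",
         "MSC", "ML", "MIT", "full-time"],
        [2, 34, "Samit", "Bachelor of Arts", "Economics", "Delih Univ",
         None, None, None, "part-time"],
    ],
    columns=["sr_no", "old_emp_id", "name",
             "degree1", "specialisation1", "college1",
             "degree2", "specialisation2", "college2",
             "emp_status"],
)

FIELDS = ("degree", "specialisation", "college")
MAX_QUALIFICATIONS = 5  # the question states that at most 5 can be filled in


def iter_education(row, fields=FIELDS, max_slots=MAX_QUALIFICATIONS):
    """Yield one dict per filled-in qualification of a single applicant row."""
    for n in range(1, max_slots + 1):
        # Series.get returns None when the column (e.g. degree4) does not exist
        record = {field: row.get(f"{field}{n}") for field in fields}
        # Skip slots that are entirely empty (missing column or NaN/None cells)
        if all(pd.isna(value) for value in record.values()):
            continue
        yield record


for _, row in df.iterrows():
    print(row["name"])
    for education in iter_education(row):
        # Placeholder for the per-qualification Selenium step
        print("   ", education["degree"], "|",
              education["specialisation"], "|", education["college"])
```

The inner loop keeps the outer df.iterrows() structure intact, so the rest of the per-applicant automation would not need to change. If a long table with one row per qualification is preferable to nested loops, pandas also provides pd.wide_to_long for reshaping numbered column groups like these.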