I am facing a challenge while iterating through rows of a pandas dataframe in Python. Specifically, I am reading data from an excel sheet using pandas and then iterating through the data using df.iterrows(), processing it further to automate a work flow using python/selenium. Each row of my spreadsheet belongs to job applicants and their attributes are getting captured in different columns. However, since a person can have multiple degrees their educational details are captured in columns as degree1, specialation1, college1, degree2, specialiation2, college2, etc. Maximum 5 qualifications can be filled in. During iteration, I want to breakout and then loop through rows of education (degree, specialization, college) for one person. How to accomplish this task? I am attaching a github link of the sample data for reference text which basically has the same data as is pasted below:
python
1 2 3 4 5 |
# Sample data sr_no old_emp_id name address mobile degree1 specialisation1 college1 degree2 specialisation2 college2 degree3 specialisation3 college3 emp_status 1 24 Amit ABC Road 356363474 Computer Science Robotics IIT Delhi MSC ML MIT PHD AI Harvard full-time 2 34 Samit Xyz Road 367474748 Bachelor of Arts Economics Delih Univ Masters of Eco Internatioal Relation Delhi Univ PHD Foreign Trade Delhi Univ part-time 3 56 Richard PTC Street 363637677 Bsc Biology Mumbai Univ Masters of Science Microbiology Mumbai Univ PHD Communicable disease Mumbai Univ part-time |
I tried grouping it using a custom function presented below. But it does not give the desired results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
# Custom function to group attributes differently def group_attributes_diff(df): new_data =[] for i in range(0, len(df),35): candidate_info={} for j in range(i,i+7+1): row = df.iloc[j] degree_name = row['degree_name' + str(int(j-i)//7+1)] specialisation = row["specialisation" + str(int(j - i)//7 + 1)] if "specialisation" + str(int(j - i)//7 + 1) in row else None course_start_date = row["course_start_date" + str(int(j - i)//7 + 1)] if "course_start_date" + str(int(j - i)//7 + 1) in row else None course_end_date = row["course_end_date" + str(int(j - i)//7 + 1)] if "course_end_date" + str(int(j - i)//7 + 1) in row else None marks_grades = row["marks_grades" + str(int(j - i)//7 + 1)] if "marks_grades" + str(int(j - i)//7 + 1) in row else None university = row["university" + str(int(j - i)//7 + 1)] if "university" + str(int(j - i)//7 + 1) in row else None course_type = row["course_type" + str(int(j - i)//7 + 1)] if "course_type" + str(int(j - i)//7 + 1) in row else None education={ "degree_name":degree_name, "specialisation":specialisation, "course_start_date":course_start_date, "course_end_date":course_end_date, "marks_grades":marks_grades, "university":university, "course_type":course_type } candidate_info["Education "+str(int(j-i)//7+1)] = education new_data.append(candidate_info) return pd.DataFrame(new_data) test_df = group_attributes_diff(df.copy()) print(test_df.to_excel('education.xlsx')) |
Thanks in advance for your inputs!