Introduction
Finding the right candidate for a job is a time-consuming and effort-intensive process for recruiters. Even when evaluating candidates from an initial talent pool, there's always the possibility of missing out on potential candidates. By leveraging AI technology, these issues can be addressed, and more efficient recruitment processes can be established.
AI-powered recommendation systems learn from recruiter feedback (e.g., Good Fit or Not Fit) and can re-recommend candidates who were initially overlooked but are potentially well-suited. In this post, we'll look at how recommendation systems work with text data and explore the core technologies behind them using a simple example.
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to assess the importance of words in a document. It gives higher weight to words that appear frequently within a document but rarely across all documents, indicating their significance to the content.
TF (Term Frequency)
This refers to the frequency of a word. It calculates how often a particular word appears in a document.
IDF (Inverse Document Frequency)
This refers to the rarity of a word. It calculates how infrequent a word is across the entire document set. Words that appear often in the dataset receive a low IDF score.
Multiplying TF and IDF helps identify words that are not just frequent but also meaningful for characterizing the document.
For example, if the word "Python" appears frequently in one candidate's resume and rarely in others, it could act as an important clue about the candidate's skills.
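To make this concrete, here is a minimal sketch of how TF-IDF weights can be computed with scikit-learn. The three toy documents below are hypothetical and exist only to show that a word concentrated in one document (here, "python") receives a higher weight there than common words do (get_feature_names_out() requires scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "python" is concentrated in the first document.
documents = [
    "python python machine learning",
    "statistical analysis and reporting",
    "database management and reporting",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents).toarray()
words = vectorizer.get_feature_names_out()

# Print each word's TF-IDF weight per document; a word that is frequent in one
# document but rare across the corpus (like "python") receives the highest weight.
for doc_index, doc in enumerate(documents):
    print(f"Document {doc_index}: {doc}")
    for word, weight in zip(words, tfidf[doc_index]):
        if weight > 0:
            print(f"  {word}: {weight:.3f}")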
Logistic Regression
Logistic Regression is an algorithm used to solve binary classification problems by predicting the probability of a candidate belonging to one of two classes. It uses the sigmoid function to convert the prediction into a probability value.
Linear Combination
Logistic regression calculates a linear combination of each input feature $x$ and its corresponding weight $w$. This can be expressed as:

$$z = w \cdot x + b$$

Where $z$ is the result of the linear combination, $w$ is the weight vector, $x$ is the feature vector, and $b$ is the bias term.
Sigmoid Function
The calculated value $z$ is passed through the sigmoid function to convert it into a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
This function outputs values closer to 1 for large inputs and closer to 0 for smaller ones.
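The following short sketch shows these two steps in NumPy. The weights, feature values, and bias below are made-up numbers chosen purely for illustration:

import numpy as np

def sigmoid(z):
    # Maps any real-valued input into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4, 0.3])   # hypothetical weight vector
x = np.array([1.0, 0.5, 2.0])    # hypothetical feature vector (e.g., TF-IDF values)
b = -0.2                         # hypothetical bias term

z = np.dot(w, x) + b             # linear combination
probability = sigmoid(z)         # probability of the positive class
print(f"z = {z:.3f}, probability = {probability:.3f}")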
Cost Function and Gradient Descent
Logistic regression minimizes the log loss function to find the optimal weights and bias. The cost function measures the difference between the predicted probabilities and the actual class labels, and is defined as:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

Where $m$ is the number of samples, $y^{(i)}$ is the actual label of the $i$-th sample, and $\hat{y}^{(i)} = \sigma(z^{(i)})$ is its predicted probability. Logistic regression uses Gradient Descent to minimize the cost function by updating the weights and bias iteratively. The updates are performed as follows:

$$w := w - \alpha \frac{\partial J}{\partial w}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$

Where $\alpha$ is the learning rate.
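As a rough illustration of how these updates work in practice, here is a minimal from-scratch gradient descent loop. The tiny dataset, learning rate, and iteration count are arbitrary choices for the sketch; scikit-learn's LogisticRegression uses more sophisticated solvers internally:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny dataset: 4 samples with 2 features each.
X = np.array([[1.0, 0.2], [0.1, 0.9], [0.8, 0.1], [0.2, 0.8]])
y = np.array([1, 0, 1, 0])

w = np.zeros(X.shape[1])  # weight vector, initialized to zero
b = 0.0                   # bias term
alpha = 0.5               # learning rate
m = len(y)

for _ in range(1000):
    y_hat = sigmoid(X @ w + b)        # predicted probabilities
    grad_w = X.T @ (y_hat - y) / m    # gradient of the log loss w.r.t. the weights
    grad_b = np.sum(y_hat - y) / m    # gradient w.r.t. the bias
    w -= alpha * grad_w               # gradient-descent updates
    b -= alpha * grad_b

print("learned weights:", w, "learned bias:", b)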
Logistic regression is a fast and interpretable binary classification model that learns important patterns from data through probabilistic predictions. It is widely used in various fields such as text classification, medical diagnosis, and spam filtering.
Code Example
Now, let's walk through a simple Python code example to build a model that recommends new candidates based on recruiter feedback. This code uses the sklearn library to analyze candidate resume text data and predict whether a candidate is a good fit for a job using logistic regression.
We aim to predict labels "good fit" and "not fit" based on the resumes. The steps in the code include data preparation, text vectorization, model training, and prediction.
Step 1: Data Preparation
resumes = [
    "Experienced software engineer specializing in Python and machine learning.",
    "Data scientist utilizing statistical analysis and AI models.",
    "Software engineer skilled in database management and full-stack development.",
    "Junior developer with basic knowledge of Python and web technologies.",
    "Experienced software engineer proficient in Python, Java, and SQL."
]
labels = ['good fit', 'not fit', 'good fit', 'not fit', 'good fit']
• resumes contains the resume text of the candidates.
• labels contains the recruiter's evaluation for each candidate (good fit or not fit).
Step 2: TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(resumes)
• TfidfVectorizer() evaluates the importance of words based on term frequency and inverse document frequency.
• X stores the vectorized result, where the text data is converted into numerical form (a quick way to inspect it is shown below).
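As an optional sanity check (not part of the original steps), you can inspect the vocabulary the vectorizer learned and the shape of the resulting matrix. Note that get_feature_names_out() is available in scikit-learn 1.0 and later:

# Optional inspection of the fitted vectorizer.
print(vectorizer.get_feature_names_out())  # vocabulary extracted from the resumes
print(X.shape)                             # (number of resumes, number of unique terms)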
Step 3: Label Encoding
y = [1 if label == 'good fit' else 0 for label in labels]
• y stores the labels converted into numeric form.
• good fit becomes 1, not fit becomes 0 (an equivalent approach using scikit-learn's LabelEncoder is sketched below).
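For completeness, here is the same encoding done with scikit-learn's LabelEncoder. Keep in mind that LabelEncoder assigns integers in alphabetical order of the class names, so with these two labels it maps good fit to 0 and not fit to 1, the reverse of the manual mapping above:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(labels)   # classes are ordered alphabetically
print(list(encoder.classes_), y_encoded)    # ['good fit', 'not fit'] [0 1 0 1 0]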
Step 4: Logistic Regression Model Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
• LogisticRegression() is used to train the model with X (features) and y (labels).
• The model learns the optimal weights based on the training data (the snippet below shows how to inspect them).
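Optionally, you can pair each vocabulary term with its learned weight to see which words push a resume toward good fit. This inspection step is not part of the original walkthrough; it only uses the coef_ attribute that LogisticRegression exposes after fitting:

import numpy as np

# Optional: list the terms with the most positive learned weights.
feature_names = vectorizer.get_feature_names_out()
weights = model.coef_[0]                 # one weight per vocabulary term
top = np.argsort(weights)[::-1][:5]      # indices of the 5 most positive weights
for idx in top:
    print(f"{feature_names[idx]}: {weights[idx]:.3f}")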
Step 5: Vectorize New Data
new_resumes = [
    "Experienced Python developer with knowledge of machine learning.",
    "Junior web developer with experience in frontend technologies.",
    "Experienced software engineer specializing in database management."
]
X_new = vectorizer.transform(new_resumes)
• vectorizer.transform() is used to vectorize the new data (new_resumes) and store it in X_new. Note that transform() is used instead of fit_transform() so the new resumes are mapped onto the vocabulary learned from the training resumes.
Step 6: Output Predictions
predictions = model.predict(X_new)
for resume, prediction in zip(new_resumes, predictions):
    print(f"Candidate: {resume} - Prediction: {'good fit' if prediction == 1 else 'not fit'}")
• model.predict(X_new) performs predictions for X_new.
• Each prediction is output as either good fit or not fit (a probability-based ranking is sketched after the output example).
Final Output Example
Candidate: Experienced Python developer with knowledge of machine learning. - Prediction: good fit
Candidate: Junior web developer with experience in frontend technologies. - Prediction: not fit
Candidate: Experienced software engineer specializing in database management. - Prediction: good fit
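Because logistic regression outputs probabilities, you can also rank the new candidates rather than only labeling them. This optional extension uses predict_proba, which LogisticRegression provides; it is not part of the original example:

# Optional: rank the new candidates by their predicted probability of being a good fit.
probabilities = model.predict_proba(X_new)[:, 1]   # probability of class 1 (good fit)
ranked = sorted(zip(new_resumes, probabilities), key=lambda pair: pair[1], reverse=True)
for resume, prob in ranked:
    print(f"{prob:.2f} - {resume}")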
Conclusion
In this post, we introduced how to use an AI-powered recommendation system to suggest suitable candidates based on recruiter feedback.
TalentSeeker provides a robust AI-based recommendation system that simplifies the candidate search process, which used to be time-consuming and labor-intensive. By leveraging AI's analytical capabilities, you can quickly and accurately recommend the best candidates, creating a better recruitment experience.
Start using TalentSeeker today and experience AI-driven talent sourcing firsthand.