Advanced Data Science - Advanced Techniques, Solving ML Challenges, and More

Table of contents:

Did you miss something with your cross-validation?
How can you improve your dimensionality reduction?
Do you measure your recommendation engine properly?
How to convert my categorical features to numerical?
What is the best way to remove outliers from my data?

After three years of running our leading Data Science training track, with hundreds of satisfied graduates from 26 cohorts, the time has come to build an advanced track and offer it to all of our graduates, as well as to junior Data Scientists who have recently stepped into this role in various organizations. The new training track, Advanced Data Science Techniques, was built after a long process of thinking and of gaining experience in projects in which Naya Technologies is involved, and after examining the requirements and needs of the Israeli market and industry.

Another program we offer in this field is Data Science Bootcamp, an intensive morning program for anyone who is currently free for challenging daily studies.

Our goal in the advanced track is to bring students to an even higher level in Machine Learning, to go deeper into critical topics in the data exploration process, and to teach advanced topics from the world of Deep Learning.

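Ahead of the opening of the first cohort, we collected a number of topics that pose a serious challenge today in organizations that have already started Data Science processes but struggle with professional problems and fail to leverage their existing capabilities to bring real value to the company. In this article we have gathered some of the problems we encountered in various organizations.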

The list of tips below represents the topics that will be taught in our advanced training track. You are invited to review the course syllabus, leave us your details, and schedule a professional consultation to get more information.

Good luck!

Did you miss something with your cross-validation?

It is not always possible to randomly split your data. The golden rule is that both train and test datasets should represent the real problem you are trying to solve.

Options:

Stratification – The distribution of the target variable in your train and test data should be the same as in the original dataset
Time-series aspects – Make sure you never train on data from the future and predict on data from the past
Grouping – When several rows belong to a group (e.g. sessions of the same user), they shouldn’t be split between the train and test datasets
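The three options above can be sketched in plain Python to make the logic visible. This is a minimal illustration, not a replacement for a library: scikit-learn ships StratifiedKFold, TimeSeriesSplit and GroupKFold for the same purposes. The function names, the toy rows and the 25% test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.25, seed=0):
    """Sample the test set separately inside each class to preserve class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def time_ordered_split(rows, time_of, test_fraction=0.25):
    """Train on the earliest rows and test on the latest ones."""
    ordered = sorted(rows, key=time_of)
    cut = len(ordered) - max(1, round(len(ordered) * test_fraction))
    return ordered[:cut], ordered[cut:]

def grouped_split(rows, group_of, test_fraction=0.25, seed=0):
    """Send every group entirely to train or entirely to test."""
    groups = sorted({group_of(row) for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if group_of(row) not in test_groups]
    test = [row for row in rows if group_of(row) in test_groups]
    return train, test

# Hypothetical toy data: (user_id, day, label)
rows = [(user, day, (user + day) % 2) for user in range(8) for day in range(4)]

train, test = grouped_split(rows, group_of=lambda row: row[0])
shared_users = {row[0] for row in train} & {row[0] for row in test}
print(len(train), len(test), shared_users)  # no user appears on both sides
```

Which split to use depends on how the model will be used in production: if it will score users it has never seen, the grouped split is the honest estimate; if it will score tomorrow's data, the time-ordered one is.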

How can you improve your dimensionality reduction?

Having too many features makes overfitting more likely, especially when the number of rows is small. It is usually worth the effort to reduce their number.

Options:

Statistics – Various statistical tests can be applied to score the relevance of each feature to the target variable
Correlation – When two features are highly related, it is often useful to remove one of them
Models – Many ML models are designed for this, e.g. PCA, t-SNE & autoencoders. Specific applications may use LDA and matrix factorization
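As an illustration of the correlation option, here is a minimal sketch of a greedy filter that keeps a feature only if it is not highly correlated with a feature already kept. The function names, the 0.95 threshold and the toy columns are assumptions for the example, and the sketch assumes no column is constant (a constant column would make the correlation undefined).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Return the names of features to keep, scanning columns in insertion order."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features: the two height columns carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [31, 25, 47, 38, 22],
}
print(drop_correlated(columns))  # ['height_cm', 'age']
```

The result depends on column order, since the first feature of a correlated pair is the one that survives. When one of the two is easier to collect or to explain, put it first.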

Do you measure your recommendation engine properly?

Evaluating a recommendation system usually involves numerous A/B tests that measure conversion rate, but other business-related considerations often have to be taken into account as well.

Options:

Accuracy – How well did your recommendation suit your customer’s wishes?
Coverage – What percentage of your inventory gets displayed?
Novelty – How much would you like to show your customer a product he or she has never tried before?
Diversity – How important is it for you to show your customer a varied list of options?
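Coverage, novelty and diversity can each be turned into a number once you fix a definition. The sketch below uses one simple definition per metric. These definitions, the function names and the toy catalog are assumptions for illustration, and other formulations exist.

```python
from itertools import combinations

def catalog_coverage(recommendations, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    shown = set().union(*recommendations.values())
    return len(shown & set(catalog)) / len(catalog)

def novelty(recommendations, history):
    """Share of recommended items the user has not interacted with before."""
    total = sum(len(items) for items in recommendations.values())
    new = sum(
        1
        for user, items in recommendations.items()
        for item in items
        if item not in history.get(user, set())
    )
    return new / total

def intra_list_diversity(items, category_of):
    """Share of item pairs in one list that come from different categories."""
    pairs = list(combinations(items, 2))
    return sum(category_of[a] != category_of[b] for a, b in pairs) / len(pairs)

# Hypothetical toy data
catalog = ["a", "b", "c", "d", "e"]
category_of = {"a": "books", "b": "books", "c": "games", "d": "music", "e": "music"}
recommendations = {"u1": ["a", "b", "c"], "u2": ["a", "c", "d"]}
history = {"u1": {"a"}, "u2": set()}

print(catalog_coverage(recommendations, catalog))  # 0.8
print(novelty(recommendations, history))
print(intra_list_diversity(recommendations["u2"], category_of))  # 1.0
```

These offline numbers complement A/B tests rather than replace them: they are cheap to compute on every model version, while conversion rate still has to be measured on live traffic.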

How to convert my categorical features to numerical?

Most models cannot work directly with non-numerical data. Textual data has its own processing techniques, but what should you do with categorical features?

Options:

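Enumeration – Replacing categories with arbitrary integers imposes an order and distances that do not exist in the data. For unordered categories this is almost NEVER the right solution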
One-hot encoding – Also known as dummy variables, this replaces the N categories with N binary columns
Grouping – When there are too many categories, you can try to combine them into a shorter list. Be careful with the “Other” class, which is usually a bad idea
Target aggregation – This is a “smart” enumeration, where each category is replaced by a meaningful aggregation of its target values
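One-hot encoding and target aggregation can be sketched in a few lines of plain Python. The function names and the toy data are assumptions for the example. The sketch uses the mean as the aggregation and falls back to the global mean for categories that were not seen during fitting. To avoid leaking the target, the per-category means should be computed on the training data only.

```python
from collections import defaultdict

def one_hot(values):
    """Replace each value with one binary column per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

def fit_target_means(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def target_encode(categories, means, global_mean):
    """Replace each category with its learned mean; unseen ones get the global mean."""
    return [means.get(c, global_mean) for c in categories]

# Hypothetical training data
train_color = ["red", "red", "blue", "blue", "blue"]
train_y = [1, 0, 1, 1, 1]

print(one_hot(["red", "blue", "red"]))  # (['blue', 'red'], [[0, 1], [1, 0], [0, 1]])

means, global_mean = fit_target_means(train_color, train_y)
print(target_encode(["red", "green"], means, global_mean))  # [0.5, 0.8]
```

With rare categories the per-category mean is noisy, so in practice it is often blended with the global mean, with a weight that grows with the category count.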

What is the best way to remove outliers from my data?

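Models learn from every data point, so a few extreme values can distort what they learn. It is therefore a good idea to identify outliers and, once you have confirmed they are errors or irrelevant to the problem, remove them.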

Options:

Data exploration – Inspect your data and visualize it. Significant outliers will show themselves
Infeasible values – Don’t let corrupted data get into your model. Consult with your data source (either human or API) if necessary
Clustering – Clustering algorithms usually yield some very small clusters (hair clusters), which often relate to outliers. Don’t be afraid to apply it several times
Models – There are many models designed for anomaly detection, e.g. Isolation Forest, LOF & one-class SVM
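The first two options can be illustrated with a small standard-library sketch: a range check for infeasible values, followed by a robust rule that flags what remains. The robust rule is the modified z-score based on the median absolute deviation, which the list above does not name. It is swapped in here as a simple stand-in for visual inspection. The function names, the feasible range and the toy ages are assumptions. The models named above are available in scikit-learn as IsolationForest, LocalOutlierFactor and OneClassSVM.

```python
from statistics import median

def split_infeasible(values, low, high):
    """Separate values that violate a known physical or business range."""
    feasible = [v for v in values if low <= v <= high]
    infeasible = [v for v in values if not (low <= v <= high)]
    return feasible, infeasible

def mad_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

# Hypothetical customer ages, including corrupted entries
ages = [34, 29, 41, 38, -1, 36, 31, 250, 97, 33]

feasible, corrupted = split_infeasible(ages, low=0, high=120)
print(corrupted)               # [-1, 250]
print(mad_outliers(feasible))  # [97]
```

The two groups call for different handling: -1 and 250 are impossible and point to a problem at the data source, while 97 is possible and deserves a look before it is dropped.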
