“To be at the top, one has to be aggressive, hardworking and creative.”
Bac Nguyen Xuan
For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Bac Nguyen Xuan, a Kaggle master who is currently ranked 56th in the world. In this interview, Bac talks about the tricks behind his Kaggle success.
On His Initial Days
Bac has a Master’s in Computer Science from Chonnam National University in South Korea, where he picked up machine learning and fell in love with computer vision, especially medical image processing. Before starting his journey in machine learning, Bac had a brief stint as an embedded software engineer in Japan. While visiting one of the tech exhibitions in Japan, he was fascinated by the presentations of self-driving cars, a robot playing table tennis and other technologies. This was when he decided to quit his job as a software engineer and pursue ML.
Bac currently works as an AI Research Engineer at VinAI, where he and his team conduct high-impact research that pushes the knowledge frontier in AI and accelerates applications of AI in Vietnam, the Asia Pacific region, and beyond.
Bac strengthened his theoretical foundations by taking Andrew Ng’s online courses on Coursera and by reading Deep Learning by Goodfellow, along with Machine Learning Coban (translates to The basics of machine learning).
“I asked my brother, who is also a Kaggler, on how to become an ML Engineer. He told me to join the Kaggle and practice. It is a great place to not only practice but also to learn new things that have not been mentioned in any books. Then, I set my target to beat him first. Thus, my journey started,” says Bac.
On Kaggle Success
Upon joining Kaggle, Bac skimmed through every kernel, discussion thread and top solution from previous competitions to recognise the patterns of winning approaches.
He supplemented this by reading advanced research papers from top-tier conferences like CVPR, ICCV and ECCV, and tried to implement them in Kaggle competitions. “Even though I know they might not be helpful for the competition too much, they helped me to widen my knowledge, sharpen my skills. Who knows they will be useful for the next competitions. Learn and apply everything you can,” says Bac, with great optimism.
Bac won his first gold medal in the Google Quick Draw competition. “In this competition, you are given large-scale data. Your mission is to predict 340 classes of sketches. There are 112200 test samples; let’s do a simple calculation: 112200 / 340 = 330 with remainder 0. It was likely that there were 330 samples/class. We can verify it by ‘Leaderboard probing.’ If there are 330 samples/class, the leaderboard score will be: 330 * (1 + 1/2 + 1/3) / 112200 = 0.005.
“I received exactly the score from the leaderboard. Thus, the assumption that there are 330 samples/class is correct. To leverage this assumption, we should do ‘Probability Calibration’. The lesson learnt is to understand evaluation metrics and hack the way organisers split the leaderboard!” says Bac.
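The probing arithmetic can be reproduced in a few lines. The sketch below assumes the competition metric is MAP@3 (a correct first, second or third guess is worth 1, 1/2 or 1/3) and that the probe submits the same three guesses for every test sample. The function name `constant_submission_map3` is a hypothetical helper for illustration, not something from Bac's solution.

```python
from fractions import Fraction


def constant_submission_map3(n_per_class, n_classes):
    """MAP@3 of a probe that submits the same three guesses for every sample.

    Assumption: every class has exactly n_per_class test samples.
    """
    n_test = n_per_class * n_classes
    # Samples of the first guessed class score 1, of the second 1/2,
    # of the third 1/3; samples of every other class score 0.
    weights = [Fraction(1, 1), Fraction(1, 2), Fraction(1, 3)]
    return sum(n_per_class * w for w in weights) / n_test


n_test, n_classes = 112200, 340
assert n_test % n_classes == 0      # the test set divides evenly into classes
n_per_class = n_test // n_classes   # 330 samples per class if balanced
expected = constant_submission_map3(n_per_class, n_classes)
print(n_per_class, round(float(expected), 3))
```

If the score returned by the leaderboard matches this value, the balanced-classes hypothesis is consistent with the hidden test set, which is the check Bac describes.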
Bac used the same trick in the IEEE’s camera model identification competition, which landed him in the top-10 on the leaderboard.
In the Recursion Cellular Image Classification competition, contestants were given images from several drug experiments, each covering 1108 classes, with each class appearing only once. Bac applied the same probability calibration and achieved a significant improvement of 10%. Bac therefore insists that success on Kaggle requires creativity.
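The interview does not spell out how the calibration was done. One common way to exploit a known, equal number of samples per class is iterative proportional fitting: alternately rescale each class column so every class receives the same total probability, then renormalise each row. The sketch below illustrates that idea on made-up numbers; `balance_predictions` and `argmax` are hypothetical helpers, and this is not necessarily the method Bac used.

```python
def balance_predictions(probs, n_iter=100):
    """Rescale predicted probabilities so each class gets equal total mass.

    Assumptions: classes are equally frequent in the test set, and every
    class column has a strictly positive sum.
    """
    n_samples, n_classes = len(probs), len(probs[0])
    target = n_samples / n_classes  # probability mass each class should get
    p = [row[:] for row in probs]
    for _ in range(n_iter):
        # Column step: scale each class so its total mass equals the target.
        col_sums = [sum(row[j] for row in p) for j in range(n_classes)]
        p = [[row[j] * target / col_sums[j] for j in range(n_classes)]
             for row in p]
        # Row step: renormalise so each sample's probabilities sum to 1.
        p = [[v / sum(row) for v in row] for row in p]
    return p


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Toy model output that is biased towards class 0 on every sample.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
balanced = balance_predictions(probs)
print([argmax(r) for r in probs])     # predicted labels before balancing
print([argmax(r) for r in balanced])  # predicted labels after balancing
```

On this toy input the raw predictions assign every sample to class 0, while the balanced predictions split the samples evenly between the two classes, which is what the equal-count prior demands.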
“You can easily get lost when you are a beginner. If you want to go further, you should go together.”
Underlining the importance of having the right team, Bac remembered how unproductive his initial attempts were due to lack of direction and how having a teammate helped him progress. “I did not hesitate to contact the potential persons when I saw them at the top-50 on the leaderboard to get a hope that they would give me a chance to work together,” says Bac.
He also says that the accomplishments in Kaggle extend far beyond the leaderboard. Before Kaggle, he was told that he wouldn’t be the right fit for an ML job. Today, Bac is flooded with job invitations thanks to Kaggle.
The experience of Kaggle has also helped in the current research position that he holds. The preparation and participation have acquainted him with all the state-of-the-art techniques, which in turn, have proven to be quite handy at his workplace.
“I like the clean code. It is the key to get other members to catch up with your work.”
To construct the baseline, Bac runs his code on a 1080Ti and 2x 2080Ti. When it comes to frameworks, he prefers PyTorch, especially Catalyst, which he says helps in collaborating with the team. When dealing with tabular data, Bac prefers LightGBM and RAPIDS for speed and efficiency.
Tips for finding a winning solution for beginners from Bac:
- Exploratory Data Analysis. Simple visualisations can give you a good feel for the data and the problem.
- Understand the evaluation metric.
- Be careful with the evaluation metric. For example, Log Loss is very sensitive to the class distribution, which can cause a big shake-up of the leaderboard.
- Understand the leaderboard proportions, which can help you build a good cross-validation (CV) setup.
- Experiment fast to increase your chances of winning.
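The sensitivity of Log Loss to the class distribution can be seen with a small calculation. The sketch below is an illustration with made-up rates, not data from any competition: it compares the expected binary log loss of a constant prediction that matches the test positive rate with one that keeps a hypothetical training rate of 0.5.

```python
import math


def log_loss_constant(pred_pos, true_pos_rate):
    """Expected binary log loss when pred_pos is predicted for every sample."""
    return -(true_pos_rate * math.log(pred_pos)
             + (1 - true_pos_rate) * math.log(1 - pred_pos))


train_rate = 0.5  # hypothetical positive rate seen in the training data
for test_rate in (0.5, 0.2, 0.05):
    matched = log_loss_constant(test_rate, test_rate)      # prior is right
    mismatched = log_loss_constant(train_rate, test_rate)  # prior is stale
    print(f"test rate {test_rate}: matched={matched:.4f}, "
          f"mismatched={mismatched:.4f}")
```

The further the test distribution drifts from the one the predictions assume, the larger the gap between the two losses, which is one reason rankings can shift sharply between the public and private leaderboards.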
To be at the top, Bac advises one to be aggressive and hardworking. “I believe that to be at the top of Kaggle, you should work as a full-time job. Read and try all the discussions/ideas from not only the current competition but also previous similar competitions. The last two weeks are very important. There are a lot of changes in the leaderboard. Other teams can merge, and that will take a toll on your ranking, which can be demotivating,” warns Bac.
Talking about the unprecedented attention towards ML across the globe, Bac admits that there is a widespread misconception that deep learning should be used to solve every problem.
That said, Bac believes that the ongoing pandemic will fuel more investments towards machine learning-based applications in healthcare. However, one of the biggest challenges in medicine is the lack of data. “It is not easy to publicize the dataset due to privacy. Thus, federated learning can be a possible solution along with AutoML,” speculates Bac.
Added to this are the ever-evolving machine learning communities such as Kaggle, which are creating more awareness across the globe. The democratisation of ML is fuelled by the incentivisation of problem solving through competitions and will only continue in the coming years.
For beginners, who are keen to be part of this journey, here are some additional tips from Bac:
- Focus on only one competition at a time. If you do many at the same time, you will probably learn nothing and tire easily.
- Don’t skip any ideas coming from the discussions. An idea that is not useful to others might still help you a lot.
- Don’t let your GPU sleep.
- If you come across a great method in a top solution, try it in the next competitions.